Automate repeated model runs and use an LLM to grade each output against a golden dataset, producing quantitative quality-assurance metrics.
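One way this loop could look is sketched below in Python. Everything named here is an illustrative assumption, not something taken from a specific tool: `GoldenExample`, `evaluate`, the `runs` and `threshold` parameters, and the two stub functions are all hypothetical. The stubs stand in for the model under test and the LLM grader so the sketch runs without network access; in practice `grade` would send the prompt, the expected answer, and the model output to a grading LLM with a rubric and parse a numeric score from its reply.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenExample:
    """One row of the golden dataset: an input and its reference answer."""
    prompt: str
    expected: str


def evaluate(
    golden: Sequence[GoldenExample],
    generate: Callable[[str], str],
    grade: Callable[[str, str, str], float],
    runs: int = 3,
    threshold: float = 0.5,
) -> dict:
    """Call generate() `runs` times per example, grade every output,
    and aggregate the scores into simple metrics."""
    if not golden or runs < 1:
        raise ValueError("need at least one golden example and one run")
    per_example = []
    for ex in golden:
        # Repeated runs smooth over sampling noise in the model's output.
        scores = [grade(ex.prompt, ex.expected, generate(ex.prompt))
                  for _ in range(runs)]
        per_example.append(mean(scores))
    return {
        "mean_score": mean(per_example),
        # Share of examples whose average score reaches the threshold.
        "pass_rate": mean(1.0 if s >= threshold else 0.0 for s in per_example),
        "per_example": per_example,
    }


# Hypothetical stand-ins so the sketch is runnable offline.
def stub_model(prompt: str) -> str:
    return prompt.upper()


def stub_grader(prompt: str, expected: str, output: str) -> float:
    # A real grader would ask an LLM to score `output` against `expected`.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


golden_set = [
    GoldenExample(prompt="paris", expected="PARIS"),
    GoldenExample(prompt="rome", expected="Milan"),
]
report = evaluate(golden_set, stub_model, stub_grader, runs=3)
print(report["mean_score"], report["pass_rate"])  # prints 0.5 0.5
```

Passing the model and the grader in as plain callables keeps the harness independent of any particular provider, and keeping the per-example averages alongside the aggregates makes it easier to see which golden examples drive a regression.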