Use the SD bench dataset as the evaluation set for a controlled study that benchmarks agent performance against physician performance.
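As a rough illustration of such a comparison, the sketch below scores both arms on the same set of cases and reports per-arm accuracy plus the discordant cases. Everything in it is an assumption made for illustration: the `CaseResult` record, its field names, and the use of a single correct/incorrect flag per case are hypothetical and do not reflect the actual SD bench schema or its scoring method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CaseResult:
    """One evaluation case scored for both arms (hypothetical record layout)."""
    case_id: str             # hypothetical case identifier
    agent_correct: bool      # assumed: agent's final answer judged correct
    physician_correct: bool  # assumed: physician's final answer judged correct


def summarize(results: List[CaseResult]) -> Dict[str, float]:
    """Compare the two arms on the same cases (a paired design)."""
    n = len(results)
    if n == 0:
        raise ValueError("no cases to evaluate")

    agent_acc = sum(r.agent_correct for r in results) / n
    physician_acc = sum(r.physician_correct for r in results) / n

    # Discordant cases: exactly one arm was correct.
    agent_only = sum(r.agent_correct and not r.physician_correct for r in results)
    physician_only = sum(r.physician_correct and not r.agent_correct for r in results)

    return {
        "n_cases": n,
        "agent_accuracy": agent_acc,
        "physician_accuracy": physician_acc,
        "agent_only_correct": agent_only,
        "physician_only_correct": physician_only,
    }


# Made-up toy data, only to show the call shape.
toy_results = [
    CaseResult("c1", True, True),
    CaseResult("c2", True, False),
    CaseResult("c3", False, True),
    CaseResult("c4", True, False),
]
summary = summarize(toy_results)
```

Scoring both arms on identical cases keeps case difficulty constant across arms, and the discordant counts are the inputs a paired test (for example McNemar's test) would need if a significance check is wanted.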