Implemented the entire diagnostic architecture using one- to two-shot prompts rather than heavy custom code, leveraging prompt engineering as the core development method.
Configured the system to generate three to five follow-up questions and loop back to patient interaction whenever diagnostic confidence falls below a set threshold.
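A minimal sketch of that confidence-gated loop. The function names and the confidence heuristic here are stand-ins (assumptions, not the paper's implementation); in the real system both would be LLM calls.

```python
def estimate_confidence(case_notes):
    # Stub: confidence grows with the amount of gathered information;
    # a real system would ask the LLM for a calibrated estimate.
    return min(1.0, 0.2 * len(case_notes))

def ask_followups(case_notes, n=3):
    # Stub: in the real system an LLM generates 3-5 targeted questions.
    return [f"follow-up question {i + 1}" for i in range(n)]

def gather_until_confident(case_notes, threshold=0.75, max_rounds=5):
    """Loop back to the patient while diagnostic confidence is below threshold."""
    for _ in range(max_rounds):
        if estimate_confidence(case_notes) >= threshold:
            break
        case_notes = case_notes + ask_followups(case_notes)
    return case_notes, estimate_confidence(case_notes)

notes, confidence = gather_until_confident(["chief complaint: fatigue"])
```

The `max_rounds` cap keeps the loop from questioning the patient indefinitely when confidence never crosses the threshold.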
Built a diagnostic flow with five specialized physician agents (Dr. Hypothesis, Dr. Test-Chooser, Dr. Challenger, Dr. Stewardship for cost control, and Dr. Checklist for quality control) that debate a medical case to improve reasoning accuracy.
In the Claude CLI, Shift+Tab toggles between 'ask' mode (research only) and 'agent' mode (action execution), allowing controlled orchestration of AI workflows.
A two-agent framing surrounds the swarm: a gatekeeper agent controls what case information is disclosed, and a judge agent adjudicates final diagnoses, coordinating via a defined sequence outlined in the paper.
Their MAI-DxO ensemble achieved ~80% diagnostic accuracy at roughly $2.5K in test costs per case, versus about $8K for a standalone o3 model, and only ~50% accuracy when diagnosis was restricted to questioning alone.
They overlaid standardized US medical test pricing plus a $300 physician-visit fee per patient interaction to jointly evaluate diagnostic accuracy and incurred test costs.
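The cost model can be sketched as a simple accumulator: a flat visit fee per interaction plus the listed price of each ordered test. The prices below are illustrative placeholders, not the paper's actual price table.

```python
# Illustrative test prices (assumed values, not the paper's price table).
TEST_PRICES = {"CBC": 11, "chest X-ray": 50, "MRI brain": 1000}
VISIT_FEE = 300  # flat fee per physician visit / patient interaction

def episode_cost(num_visits, ordered_tests):
    """Total cost of a diagnostic episode: visit fees plus test prices."""
    return num_visits * VISIT_FEE + sum(TEST_PRICES[t] for t in ordered_tests)

cost = episode_cost(2, ["CBC", "chest X-ray"])  # 2*300 + 11 + 50 = 661
```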
A gatekeeper agent returns real test results when the case record contains them and synthesizes plausible findings when it does not, preventing reward hacking in which the LLM swarm would otherwise interpret missing data as an implicit negative result.
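A hedged sketch of that gatekeeper behavior, assuming a simple case record keyed by test name; the synthetic "within normal limits" fallback stands in for the LLM-generated plausible finding described above.

```python
# Illustrative case record (assumed structure, not the benchmark's format).
CASE_FINDINGS = {"CBC": "WBC 14.2 x10^9/L"}

def gatekeeper(test_name, findings=CASE_FINDINGS):
    """Return the recorded result if present; otherwise synthesize an
    unrevealing finding so 'no data' never leaks as a negative signal."""
    if test_name in findings:
        return findings[test_name]
    # Placeholder for an LLM-synthesized, clinically plausible result.
    return f"{test_name}: within normal limits"
```

The key design point is that the requesting agents cannot distinguish a real result from a synthesized one, so absence of data carries no diagnostic signal.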
They designed a swarm of o3-based LLM personas (hypothesis generator, test chooser, challenger, cost stewardship, checklist/quality control) orchestrated via a chain-of-debate mechanism to iteratively converge on a diagnosis.
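One round of that chain of debate can be sketched as a shared case state passed through the personas in sequence. Each persona here is a deterministic stub standing in for a separately prompted o3 agent; the function names and state keys are assumptions for illustration.

```python
def dr_hypothesis(state):
    state["differential"] = ["dx A", "dx B"]  # maintain ranked differential
    return state

def dr_test_chooser(state):
    state["next_tests"] = ["test 1", "test 2"]  # propose discriminating tests
    return state

def dr_challenger(state):
    state["objections"] = ["consider anchoring bias"]  # argue the contrary case
    return state

def dr_stewardship(state):
    state["next_tests"] = state["next_tests"][:1]  # veto low-value spending
    return state

def dr_checklist(state):
    state["checked"] = True  # final quality-control pass
    return state

PANEL = [dr_hypothesis, dr_test_chooser, dr_challenger, dr_stewardship, dr_checklist]

def debate_round(state):
    """Run one chain-of-debate round: each persona reads and amends the state."""
    for persona in PANEL:
        state = persona(state)
    return state

state = debate_round({"vignette": "34F with fever and rash"})
```

The orchestrator would repeat `debate_round` until the panel either commits to a diagnosis or orders more tests via the gatekeeper.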
Researchers built SDBench, a 304-case sequential diagnosis benchmark derived from New England Journal of Medicine clinicopathological conference cases, to evaluate iterative AI diagnostic agents.
Use publicly available medical case datasets—such as those from the New England Journal of Medicine or Hugging Face benchmarks—and evaluate AI agent performance against clinician diagnoses.
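A minimal sketch of such an evaluation, assuming a list of agent predictions paired with reference diagnoses. Exact string matching is a simplification; the paper grades matches with a physician-written rubric, which this stub does not reproduce.

```python
def accuracy(predictions, ground_truth):
    """Fraction of agent diagnoses that match the reference labels
    (case-insensitive exact match; a stand-in for rubric-based grading)."""
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)

acc = accuracy(["Sarcoidosis", "lupus"], ["sarcoidosis", "Sjogren syndrome"])  # 0.5
```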
Adopt a modular multi-agent pipeline in which each agent specializes in a step such as data extraction, reasoning, or diagnosis; Microsoft’s applied medical AI paper demonstrates this design outperforming both single frontier models and physicians.