Implement a chain-of-debate node in LangGraph that holds shared state and orchestrates interaction among multiple LLMs for tasks like medical diagnosis.
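A minimal sketch of such a chain-of-debate node, assuming a TypedDict shared state, a placeholder call_llm function, and debater roles invented purely for illustration (not the original system's agents):

```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END


class DebateState(TypedDict):
    case: str                                        # the case under discussion
    transcript: Annotated[list[str], operator.add]   # debate turns, appended by each agent
    diagnosis: str                                   # moderator's current best diagnosis


def call_llm(prompt: str) -> str:
    """Placeholder for whichever chat-model client gets wired in."""
    raise NotImplementedError


def make_debater(role: str):
    def debater(state: DebateState) -> dict:
        turn = call_llm(f"You are {role}. Case: {state['case']}\n"
                        f"Debate so far: {state['transcript']}\nGive your argument.")
        return {"transcript": [f"{role}: {turn}"]}
    return debater


def moderator(state: DebateState) -> dict:
    summary = call_llm("Summarize the debate into one working diagnosis:\n"
                       + "\n".join(state["transcript"]))
    return {"diagnosis": summary}


graph = StateGraph(DebateState)
graph.add_node("debater_a", make_debater("a cautious internist"))
graph.add_node("debater_b", make_debater("a devil's-advocate specialist"))
graph.add_node("moderator", moderator)
graph.add_edge(START, "debater_a")
graph.add_edge("debater_a", "debater_b")
graph.add_edge("debater_b", "moderator")
graph.add_edge("moderator", END)
app = graph.compile()
```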
Implemented the entire diagnostic architecture using one- to two-shot prompts rather than heavy custom code, leveraging prompt engineering as the core development method.
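An illustrative one-shot prompt of the kind this describes; the example case, wording, and template name are made up, not the actual system prompts:

```python
# Hypothetical one-shot diagnostic prompt template; {case} is filled in at runtime.
ONE_SHOT_DIAGNOSIS_PROMPT = """\
You are a diagnostic physician. Given a case summary, list your top differential
diagnoses with a confidence score (0-1) for each.

Example case: 54-year-old with crushing substernal chest pain radiating to the
left arm, diaphoresis, and ST elevation on ECG.
Example answer:
1. Acute ST-elevation myocardial infarction (confidence 0.9)
2. Aortic dissection (confidence 0.1)

Case: {case}
Answer:
"""
```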
Configured the system to generate three to five follow-up questions and loop back to patient interaction whenever diagnostic confidence falls below a set threshold, finalizing the diagnosis once confidence exceeds it.
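A minimal sketch of that confidence-gated loop as a LangGraph routing function; the threshold value and node names ("ask_followups", "finalize_diagnosis") are assumptions for illustration:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed value, not from the source


def route_on_confidence(state: dict) -> str:
    # Loop back to the patient-interaction node while confidence is low;
    # hand off to finalization once it clears the threshold.
    if state["confidence"] < CONFIDENCE_THRESHOLD:
        return "ask_followups"        # node that generates 3-5 follow-up questions
    return "finalize_diagnosis"


# Wired into the graph roughly like:
# graph.add_conditional_edges("diagnose", route_on_confidence,
#                             ["ask_followups", "finalize_diagnosis"])
```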
Built a diagnostic flow with five specialized physician agents (Dr. Hypothesis, Dr. Test-Chooser, a challenger, a cost controller, and a quality-control agent) that debate a medical case to improve reasoning accuracy.
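A sketch of how such a five-role panel could be expressed as data plus one debate round; the per-role instructions are paraphrases of the role names above, not the actual prompts:

```python
PANEL_ROLES = {
    "Dr. Hypothesis":   "Maintain a ranked differential diagnosis and update it each round.",
    "Dr. Test-Chooser": "Propose the most informative next tests or questions.",
    "Challenger":       "Attack the leading hypothesis and look for contradictory evidence.",
    "Cost Controller":  "Flag tests whose expected information gain does not justify their cost.",
    "Quality Control":  "Check the panel's reasoning for errors before a diagnosis is finalized.",
}


def run_debate_round(case: str, transcript: list[str], call_llm) -> list[str]:
    """One debate round: each panelist speaks in turn and sees all prior turns."""
    for role, instruction in PANEL_ROLES.items():
        turn = call_llm(f"{instruction}\nCase: {case}\nDebate so far:\n" + "\n".join(transcript))
        transcript.append(f"{role}: {turn}")
    return transcript
```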
In the Claude CLI, Shift+Tab toggles between 'ask' mode (research only) and 'agent' mode (action execution), allowing controlled orchestration of AI workflows.
A two-agent swarm architecture uses a gatekeeper agent to filter case information and a diagnostic agent to adjudicate the final diagnosis, with the two coordinating through the sequence outlined in the paper.
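A compact sketch of that two-agent sequence, assuming the gatekeeper answers information requests from a hidden case record and the diagnostician either asks another question or commits; the function names and turn protocol here are ours, not the paper's:

```python
def gatekeeper(request: str, case_record: dict, call_llm) -> str:
    """Answer the diagnostician's information request from the hidden case record."""
    return call_llm(f"Case record: {case_record}\nAnswer only this request: {request}")


def diagnostician(history: list[str], call_llm) -> str:
    """Ask for more information or commit to a final diagnosis."""
    return call_llm("Given the exchange so far, either ask ONE more question "
                    "or output 'FINAL:' followed by your diagnosis.\n" + "\n".join(history))


def run_case(case_record: dict, call_llm, max_turns: int = 10) -> str:
    history: list[str] = []
    for _ in range(max_turns):
        move = diagnostician(history, call_llm)
        if move.startswith("FINAL:"):
            return move
        history.append(f"Diagnostician: {move}")
        history.append(f"Gatekeeper: {gatekeeper(move, case_record, call_llm)}")
    return "FINAL: undetermined"
```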
Their MAI-DxO ensemble achieved 80% diagnostic accuracy at ~$2.5K in test costs, versus $8K for a single o3 model, and 50% accuracy when diagnosing from questions alone.
They overlaid standardized US medical test pricing plus a $300 consult fee per patient query to jointly evaluate diagnostic accuracy and incurred test costs.
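A back-of-the-envelope version of that cost accounting; only the $300 per-visit fee comes from the description above, and the test prices below are placeholders rather than the standardized fee table:

```python
TEST_PRICES = {"CBC": 30, "CT chest": 500, "MRI brain": 1200}  # illustrative prices only
VISIT_FEE = 300  # consult fee charged per patient query


def episode_cost(num_visits: int, ordered_tests: list[str]) -> int:
    """Total incurred cost for one diagnostic episode."""
    return num_visits * VISIT_FEE + sum(TEST_PRICES[t] for t in ordered_tests)


# e.g. 3 question rounds plus a CT and an MRI -> 3*300 + 500 + 1200 = $2,600
print(episode_cost(3, ["CT chest", "MRI brain"]))
```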
The gatekeeper agent returns real test results when available and synthesizes plausible ones otherwise, preventing reward hacking in which the LLM swarm would treat missing data as a negative finding.
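A sketch of that anti-reward-hacking behavior, assuming the case record is a dict of documented findings and call_llm is the same placeholder as above:

```python
def gatekeeper_answer(test: str, case_record: dict, call_llm) -> str:
    """Return the real result if the case documents it, otherwise synthesize one."""
    if test in case_record:
        return case_record[test]    # real finding from the case
    # Fabricate a realistic, clinically unremarkable result so "not documented"
    # never reads as "negative finding" to the requesting agents.
    return call_llm(f"Invent a plausible, normal-range result for: {test}")
```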