Build a benchmarking platform that measures LLMs’ ability to select and call the correct tools when faced with large toolsets, providing standardized performance metrics.
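A minimal sketch of what such a harness might score, assuming each benchmark item records the tool call the model was expected to make and the observed calls are logged separately; the task format, metric names, and field choices are illustrative, not part of any existing platform.

```python
from dataclasses import dataclass

@dataclass
class ToolCallTask:
    """One benchmark item: a prompt plus the tool call the model should make."""
    prompt: str
    expected_tool: str
    expected_args: dict

def score_run(tasks, observed_calls):
    """Compute standardized metrics from logged (tool_name, args) pairs.

    `observed_calls` is assumed to be parallel to `tasks`, with None where the
    model declined to call any tool.
    """
    selected = correct_args = invoked = 0
    for task, call in zip(tasks, observed_calls):
        if call is None:
            continue
        invoked += 1
        name, args = call
        if name == task.expected_tool:
            selected += 1
            if args == task.expected_args:
                correct_args += 1
    n = len(tasks)
    return {
        "tool_selection_accuracy": selected / n,     # right tool chosen
        "argument_accuracy": correct_args / n,       # right tool AND right arguments
        "invocation_rate": invoked / n,              # model attempted a call at all
    }
```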
Developers can swap Claude models for alternatives such as Kimi by repurposing existing CLI tools, speeding up experimentation without building new interfaces.
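One common way this swap is done is by pointing an Anthropic-style client (or the CLI’s base-URL setting) at a different provider’s compatible endpoint. The sketch below uses the `anthropic` Python SDK; the endpoint URL and model identifier are placeholders, not verified values for any particular provider.

```python
import os
import anthropic

# Point the existing Anthropic-style client at a different backend.
# The URL and model name are illustrative placeholders; substitute the
# provider's documented Anthropic-compatible endpoint and model id.
client = anthropic.Anthropic(
    base_url=os.environ.get("ANTHROPIC_BASE_URL", "https://example-provider.com/anthropic"),
    api_key=os.environ["ANTHROPIC_API_KEY"],
)

response = client.messages.create(
    model="kimi-k2",  # placeholder model id
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
)
print(response.content[0].text)
```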
Evaluate agents by overloading an LLM with 30–40 distinct tools and observing its decision-making and tool-selection accuracy under that load.
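A sketch of how such an overloaded tool roster might be assembled: a couple of genuinely relevant tools padded with dozens of plausible distractors, all offered to the model in one request. The JSON-schema tool shape follows the common convention; the tool names and the `call_model_with_tools` hook are assumptions for illustration.

```python
import random

def make_tool(name: str, description: str) -> dict:
    """Build a tool definition in the common JSON-schema style."""
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }

# A few tools that are actually relevant to the task under test...
relevant = [
    make_tool("search_codebase", "Search the repository for a symbol or string."),
    make_tool("run_tests", "Run the project's test suite and report failures."),
]

# ...padded with plausible-sounding distractors to reach 30-40 tools total.
distractors = [
    make_tool(f"tool_{i:02d}", f"Perform auxiliary operation number {i}.")
    for i in range(35)
]

toolset = relevant + distractors
random.shuffle(toolset)  # avoid positional bias toward the relevant tools

# `call_model_with_tools` is a hypothetical hook around your model API;
# the check is whether the model picks a relevant tool despite the noise.
# chosen = call_model_with_tools(prompt="Why is test_parser failing?", tools=toolset)
# assert chosen in {"search_codebase", "run_tests"}
```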
If you have a high-quality evaluation set, you can iterate on prompts and inference strategies instead of fine-tuning the base model to achieve great user outcomes.
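A sketch of eval-driven iteration under this philosophy: hold the base model fixed, vary only the prompt (or decoding settings), and keep whichever variant scores best on the eval set. `run_eval` stands in for whatever grader you already trust; it is an assumed hook, not a real API.

```python
def pick_best_prompt(prompt_variants, eval_set, run_eval):
    """Score each prompt variant on a fixed eval set and return the winner.

    `run_eval(prompt, eval_set)` is assumed to return a scalar score in [0, 1],
    e.g. the fraction of eval cases handled correctly with that prompt.
    """
    scores = {prompt: run_eval(prompt, eval_set) for prompt in prompt_variants}
    best = max(scores, key=scores.get)
    return best, scores

# Typical loop: edit the system prompt or sampling settings, re-run the eval,
# keep the change only if the score improves -- no fine-tuning required.
```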
Create a developer-centric AI agent, analogous to a medical agent, trained on codebases, APIs, and engineering best practices to provide in-depth programming assistance.
Train or fine-tune large models on specialized domain corpora—like medical literature—to create agents with deep, expert-level knowledge in that field.
The developer community doesn’t just adopt AI features—they actively shape the direction of AI model development by providing feedback and building higher-level tooling.
Instead of relying solely on external dev tooling, embed tool-calling capabilities directly within the base AI model so it can act as an "intellectual grunt" able to invoke developer-built tools in context.
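A minimal sketch of that pattern: the developer registers tools, the model (assumed to natively emit structured tool calls) picks one, and a thin loop executes it and feeds the result back. `model_step` is a hypothetical stand-in for whatever native tool-calling interface the model exposes.

```python
# Developer-built tools the model can invoke in context.
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def grep(pattern: str, text: str) -> str:
    return "\n".join(line for line in text.splitlines() if pattern in line)

TOOLS = {"read_file": read_file, "grep": grep}

def agent_loop(task: str, model_step, max_turns: int = 8) -> str:
    """Drive a natively tool-calling model until it returns a final answer.

    `model_step(task, history)` is assumed to return either
    {"tool": name, "args": {...}} or {"answer": "..."}.
    """
    history = []
    for _ in range(max_turns):
        step = model_step(task, history)
        if "answer" in step:
            return step["answer"]
        # The model decides which developer-built tool to call;
        # the harness just executes it and records the result.
        result = TOOLS[step["tool"]](**step["args"])
        history.append({"call": step, "result": result})
    return "max turns exceeded"
```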
We’re running out of tokens. We need to figure out a way to generate synthetic data that’s effective at pushing out the frontier of intelligence in these models.