Multi-Model Collaboration with Claude Code: How to Measure What Actually Works
We built Claudish, a free open-source proxy that lets Claude Code work with any AI model. It connects directly to 15+ providers - Google, OpenAI, xAI, Kimi, MiniMax, and more. OpenRouter for even wider access. Or fully offline with Ollama. That was just the starting point. What came next was far more interesting.
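To make the mechanics concrete, here is a minimal sketch of the proxy idea - not Claudish's actual implementation - that accepts Anthropic-style /v1/messages requests from Claude Code and forwards them to an OpenAI-compatible backend. FastAPI and httpx are illustrative choices, BACKEND_URL and BACKEND_MODEL are hypothetical names, and the sketch ignores streaming, tool use, and content-block lists:

```python
# Minimal sketch of an Anthropic-to-OpenAI translating proxy.
# Not Claudish's code - just the shape of the idea.
import os

import httpx
from fastapi import FastAPI, Request

# Hypothetical config: default to a local Ollama OpenAI-compatible endpoint.
BACKEND_URL = os.environ.get("BACKEND_URL", "http://localhost:11434/v1")
MODEL = os.environ.get("BACKEND_MODEL", "llama3.1")

app = FastAPI()

@app.post("/v1/messages")
async def messages(request: Request):
    body = await request.json()
    # Translate the Anthropic-style request into an OpenAI-style chat
    # completion. (Real Anthropic messages may carry content-block lists;
    # this sketch assumes plain string content.)
    oai_request = {
        "model": MODEL,
        "max_tokens": body.get("max_tokens", 1024),
        "messages": [
            {"role": m["role"], "content": m["content"]}
            for m in body.get("messages", [])
        ],
    }
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(f"{BACKEND_URL}/chat/completions", json=oai_request)
    choice = resp.json()["choices"][0]
    # Translate the OpenAI-style response back into an Anthropic-style one,
    # which is what Claude Code expects to receive.
    return {
        "type": "message",
        "role": "assistant",
        "content": [{"type": "text", "text": choice["message"]["content"]}],
        "stop_reason": "end_turn",
    }
```

Claude Code can then be pointed at a proxy like this by overriding its API base URL (it respects the ANTHROPIC_BASE_URL environment variable), so the same editor workflow runs against whichever backend the proxy fronts.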
When you can run any model through the same interface, you start asking real questions. Which model works best for which task? Does mixing models actually help, or is it just expensive complexity? How do you find the right combination for your team? And the hardest one - how do you measure any of this when LLM output is non-deterministic? You can't run the same prompt twice and get the same result. I'll share what we learned running multi-model setups across 100+ projects with a 70-engineer team: how we approach measurement, what surprised us, and a practical framework for engineers who need to evaluate AI tooling with something more than "it feels faster."
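One way to handle that last question - sketched below with hypothetical helpers, not our actual harness - is to stop comparing single runs and treat each task as a repeated trial: run it k times per model, score pass/fail, and compare pass rates with confidence intervals. run_task() is a stand-in for whatever executes a task and checks the result:

```python
# Hedged sketch: evaluate a non-deterministic model by repeated trials
# per task and report pass rates with 95% confidence intervals.
import math
import random

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate; behaves better than
    the normal approximation at the small n typical of eval runs."""
    if n == 0:
        return (0.0, 0.0)
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

def run_task(model: str, task: str) -> bool:
    """Hypothetical stand-in: send the task through the model and
    check the output (tests pass, lint clean, etc.)."""
    return random.random() < 0.7  # placeholder for a real harness call

def evaluate(model: str, tasks: list[str], k: int = 10) -> None:
    # Each task is run k times; a single run tells you almost nothing.
    for task in tasks:
        passes = sum(run_task(model, task) for _ in range(k))
        lo, hi = wilson_interval(passes, k)
        print(f"{model} / {task}: {passes}/{k} passed, 95% CI [{lo:.2f}, {hi:.2f}]")

evaluate("model-a", ["fix-failing-test", "add-endpoint"], k=10)
```

With ten runs per task the intervals are wide, which is exactly the point: if two models' intervals overlap heavily, a single demo run showing either one "winning" is noise.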
Jack Rudenko
CTO at MadAppGang and founder of 10x Labs in Sydney, Jack works with real Australian businesses running AI on real workflows. Not pilots. Not demos. Actual systems they depend on.
We’re deep in the experimental phase - learning more from what breaks than what works.