AI Engineer Melbourne

Observability and Evaluation for LLM Apps and Agentic AI with Langfuse

Shipping an LLM app is easy. Knowing whether it’s actually working is hard. Unlike traditional software, LLM systems and agentic pipelines can run without errors while producing wrong or degraded answers — and with agents making multi-step decisions across tools and APIs, a single bad output can cascade silently through your entire system.

This hands-on workshop shows you how Langfuse gives your team the visibility and control to ship AI with confidence. You’ll go from a blind prototype to a fully observable system — seeing exactly what your app costs, where quality drops, and which prompt changes actually improve things.

For agentic systems, Langfuse traces every step of an agent’s decision-making, so when something goes wrong you know exactly why, rather than losing hours to guesswork. Features like the Prompt Playground and LLM-as-a-judge evals mean faster iteration with less manual effort.

Prerequisites: a laptop with Python 3.10+ and, optionally, an OpenAI API key. You will create a free Langfuse account on the day.

AI Hamsters: Circling Your Way to Success

Every AI engineer knows the loop — prompt tweaks, evals, regressions, repeat. It feels like going in circles. But that’s not the problem. The problem is not knowing if each circle is tighter than the last. This talk is about how Langfuse turns iteration from an act of faith into something measurable — traces, scores, and evals that tell you whether you’re actually moving forward or just staying busy.

But the loop isn’t just for the developer anymore. Tools like Cursor and Claude Code can tap into Langfuse mid-build, check whether recent changes moved things in the right direction, and keep iterating without waiting on you. The feedback loop becomes part of the development process itself — and that’s when the wheel really starts to spin.

Muhammad Ali

Muhammad Ali is an AI Engineer and Solutions Architect at ClickHouse, specializing in the intersection of real-time analytics and Agentic AI. As the Langfuse Lead for the APJ region, Ali bridges the gap between data engineering and LLM orchestration. Over the past three years, he has designed AI applications for the likes of Apple, Atlassian, and Amazon, focusing heavily on the AI development lifecycle.

His expertise lies in transforming “blind” prototypes into observable, reliable systems. By combining ClickHouse’s high-speed telemetry with Langfuse’s evaluation frameworks, he helps developers solve the silent failures of multi-step agents. He previously served as the Principal Analytics Tech Lead (APJ) at AWS.