The Digital Executive

Scaling Agentic AI with Chirag Agrawal

I spoke with The Digital Executive about scaling agentic AI systems — discussing the challenges of building autonomous AI agents that operate reliably at enterprise scale.

Agentic AI · Scalability · Enterprise

Transcript

Host: Welcome to the Digital Executive. Today's guest is Chirag Agrawal. Chirag Agrawal is a seasoned technology professional with over a decade of experience building large-scale AI platforms, distributed systems, and developer tooling. As a senior engineer and tech lead, he specializes in LLM infrastructure, advanced AI-powered conversational systems, and multi-agent orchestration, including agent execution, system prompt engineering, AI tool use, and AI memory, along with software development kits and compilers. Chirag leads cross-functional architectural initiatives driving lower latency, reliability, and scale for Alexa and its developer ecosystem.

Host: Well, good afternoon, Chirag. Welcome to the show.

Chirag: Thank you for having me, Brian. It's great to be here.

Host: You focus on the infrastructure between models and applications — runtimes, orchestration, memory, execution graphs. How should product teams think about the gap between "we have a model" and "we have a system running it reliably in production," and what are the most common pitfalls you see?

Chirag: That's a great question. I think the model should be treated as a dependency, not the main product. Product teams should focus not on the model itself, but on the system around it.

One common pitfall I see is teams trying to build their AI agents from scratch. They end up spending most of their time on undifferentiated work, because they overlook the complexity of the system around the model that's required to ship an agent in production. Concerns like retrieval, tool calling, orchestration, context management, compression, caching, and evaluation are common to nearly every agent, so everyone building from scratch duplicates the same work. That work is undifferentiated and often already handled by frameworks or developer tooling provided by platforms.

So what teams should do is build the agent once as a prototype because it's a useful learning exercise, and then use a framework going forward to ship the agent in production systems.

Another pitfall is teams not treating evaluation and guardrails as first-class citizens in their development lifecycle. They're often treated as a last step, like, "We'll evaluate it on some dataset after we build the agent." That's usually a mistake.

Host: One of your focus areas is developer tooling: typed function-calling SDKs, binding model outputs to real APIs. How do you strike the balance between giving developers freedom to experiment and enforcing disciplined architecture so the system remains manageable at scale?

Chirag: Developer freedom and architectural discipline can look like opposing forces, but they're actually not. Good developer tooling provides abstractions that lower the barrier to experimentation and speed up velocity.

The goal of architectural discipline is to provide a safe playground where developers can move fast without breaking the larger system. Take binding model outputs to real APIs as an example. A well-architected developer platform provides abstractions so you, as a developer, can design your API and bind it to model outputs the way you see fit. But concerns like schema validation, error handling, safety guardrails, progressive context compression, and everything else required to keep the user experience smooth while managing latency and cost are handled by the platform.

So a developer platform provides general solutions to essential problems. It doesn't curb developer freedom. This kind of platform can create incredible leverage for an organization. A single change in the platform can improve latency or cost for all teams or products running on the platform.
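As a hypothetical sketch of the kind of platform abstraction described above, the snippet below binds a typed Python function to a model-facing tool schema and validates model-produced arguments before the real call runs. The names here (`ToolRegistry`, `get_weather`) are illustrative and not from any specific SDK mentioned in the conversation.

```python
# Hypothetical sketch: a platform-style registry that derives a schema from
# a typed function and validates model output before invoking the real API.
import inspect
import json

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def tool(self, fn):
        """Register a function; derive a simple schema from its annotations."""
        sig = inspect.signature(fn)
        schema = {name: p.annotation for name, p in sig.parameters.items()}
        self._tools[fn.__name__] = (fn, schema)
        return fn

    def invoke(self, name, raw_args: str):
        """Validate model-produced JSON arguments against the schema, then call."""
        fn, schema = self._tools[name]
        args = json.loads(raw_args)
        for key, expected in schema.items():
            if key not in args:
                raise ValueError(f"missing argument: {key}")
            if not isinstance(args[key], expected):
                raise TypeError(f"{key} should be {expected.__name__}")
        return fn(**args)

registry = ToolRegistry()

@registry.tool
def get_weather(city: str, units: str) -> str:
    # Placeholder for a real API call the developer owns.
    return f"22 degrees ({units}) in {city}"

print(registry.invoke("get_weather", '{"city": "Seattle", "units": "metric"}'))
# → 22 degrees (metric) in Seattle
```

The developer only writes `get_weather`; validation and error handling live in the platform layer, which is the division of responsibility the answer describes.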

Host: Many organizations adopt AI with enthusiasm, but when you push into latency constraints, cost optimization, and real-world performance, you hit friction. What are the operational metrics you monitor closely in agentic platforms? And how do you trade off quality versus cost versus speed?

Chirag: There are three big buckets of metrics we monitor. First is latency, second is number of tokens, and third is quality.

For latency, you can start with how much time it took to construct the prompt, including the time spent gathering context. Another is the time from sending the prompt to the model to receiving the first token, often called time to first token; that's driven by the size of the prompt and the size of the model. Another is how much time it took to render the first word or the first element of the response to the user, because with generative AI you can start streaming as soon as the first word is ready.
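The three latency phases listed above can be instrumented with something like the sketch below. This is an illustration, not code from the interview; the model call is stubbed out with a fake token stream.

```python
# Illustrative sketch: measure prompt construction time, time to first
# token, and time to first rendered output for one streamed request.
import time

def timed_request(build_prompt, stream_model, render):
    """Run one request and return per-phase latency metrics (seconds)."""
    metrics = {}
    t0 = time.perf_counter()
    prompt = build_prompt()                      # includes context gathering
    metrics["prompt_construction_s"] = time.perf_counter() - t0

    t1 = time.perf_counter()
    for i, token in enumerate(stream_model(prompt)):
        if i == 0:
            metrics["time_to_first_token_s"] = time.perf_counter() - t1
        render(token)                            # stream tokens as they arrive
        if i == 0:
            metrics["time_to_first_render_s"] = time.perf_counter() - t1
    return metrics

# Stub usage: a fake model that yields two tokens.
out = []
m = timed_request(
    build_prompt=lambda: "hello",
    stream_model=lambda p: iter(["Hi", " there"]),
    render=out.append,
)
print(sorted(m))
```

In a real system, `build_prompt` and `stream_model` would wrap the retrieval pipeline and the model client, and the metrics would be emitted to a monitoring backend per request.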

For number of tokens, the obvious metrics are how many input tokens you're sending and how many output tokens the model produces per user request. Those drive cost and also latency. Within that, you can monitor cached and uncached tokens and measure tokens for different parts of the prompt that are dynamic, like conversation history. Cached input tokens are valuable to monitor because they can guide prompting strategies. If the top parts of the prompt are unchanged over the conversation and the most dynamic parts are toward the bottom, you'll utilize more cached input tokens, reducing cost and latency.
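The cache-friendly prompt layout described above, with static sections at the top and dynamic ones at the bottom, can be sketched minimally as follows. Section names are illustrative assumptions, not a real prompt format.

```python
# Minimal sketch of a cache-friendly prompt layout: order sections from
# most static to most dynamic so provider prefix caching can reuse the
# unchanged top of the prompt across turns.
import os.path

def assemble_prompt(system_rules: str, tool_schemas: str,
                    conversation_history: str, latest_user_turn: str) -> str:
    sections = [
        system_rules,          # rarely changes: cache hit every turn
        tool_schemas,          # changes only on deploys
        conversation_history,  # grows each turn but keeps a stable prefix
        latest_user_turn,      # always new: keep at the bottom
    ]
    return "\n\n".join(sections)

turn1 = assemble_prompt("You are a helpful agent.", "[tool schemas]",
                        "", "Book a flight")
turn2 = assemble_prompt("You are a helpful agent.", "[tool schemas]",
                        "User: Book a flight\nAgent: Where to?", "To Paris")

# Consecutive turns share a long common prefix, which is what the
# provider's cached-input-token pricing and latency savings depend on.
prefix = os.path.commonprefix([turn1, turn2])
print(len(prefix))
```

If the dynamic pieces were placed at the top instead, the shared prefix would collapse to almost nothing and every turn would be billed and processed as uncached input.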

Quality is harder to define and depends on the domain and task, but some high-level metrics that are common across many agents include accuracy of tool selection, accuracy of argument filling, and truthfulness relative to the provided context. These are often hard to monitor online because you need ground truth, so the quality bucket is often monitored offline through a test harness.
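An offline test harness of the kind mentioned above could look roughly like this sketch, which scores two of the named metrics, tool-selection accuracy and argument-filling accuracy, against ground-truth labels. The toy agent and case format are hypothetical stand-ins.

```python
# Hypothetical offline eval harness: `agent` is any callable mapping an
# utterance to a (tool_name, arguments) pair; cases carry ground truth.
def evaluate(agent, test_cases):
    tool_hits = arg_hits = 0
    for case in test_cases:
        tool, args = agent(case["utterance"])
        if tool == case["expected_tool"]:
            tool_hits += 1
            if args == case["expected_args"]:
                arg_hits += 1          # arguments only count if the tool matched
    n = len(test_cases)
    return {"tool_selection_acc": tool_hits / n,
            "argument_filling_acc": arg_hits / n}

# Toy rule-based agent standing in for a real model-backed one.
def toy_agent(utterance):
    if "weather" in utterance:
        return "get_weather", {"city": "Seattle"}
    return "fallback", {}

cases = [
    {"utterance": "weather in Seattle?", "expected_tool": "get_weather",
     "expected_args": {"city": "Seattle"}},
    {"utterance": "play jazz", "expected_tool": "play_music",
     "expected_args": {"genre": "jazz"}},
]
print(evaluate(toy_agent, cases))
# → {'tool_selection_acc': 0.5, 'argument_filling_acc': 0.5}
```

Because the harness needs labeled expected tools and arguments, it runs offline in CI or a nightly job, which is exactly why this bucket is monitored through a test harness rather than live traffic.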

On the tradeoff between these three, they form a triangle. If you lean too hard on one, you give up the others. If you improve latency too aggressively, you may compromise intelligence or product quality. If you chase quality too aggressively, token cost can explode and that increases latency. It's an art. You have to continuously tune the system based on user feedback.

Host: As you build the foundational layers of agentic AI systems, how are you thinking about ethics, bias, transparency, and interoperability so agents from different teams and systems can coordinate? Looking ahead, what do you see as the next frontier in production AI infrastructure?

Chirag: Bias and transparency should be built at the foundational layer of the system itself. They can't be bolted on later. All requests flowing through agentic platforms should be auditable, observable, and traceable. Evaluation hooks should be built into the runtime platform so we can monitor agent behavior in real time through reflection, and design ways to prevent unsafe behavior, or at least mitigate it quickly.

Interoperability is the other side. Agents built by different teams need to communicate through open, typed protocols, similar to how systems integrate through HTTP. That's why I'm excited about emerging standards like MCP (Model Context Protocol) and A2A (Agent2Agent), which define how agents discover each other, authenticate unknown agents, and collaborate safely across system boundaries.

The current scenario reminds me of the early days of Android and iOS. We had mobile apps that were functional, but not that good. Over the next decade they became really good. I think the same will happen with agents. They're in a nascent stage now, but they'll improve dramatically over the next few years.

Looking ahead, I think the next frontier in production AI infrastructure is an internet of agents, or multi-agent systems, where agents built by different teams can share capabilities but still operate under their own governance.

Host: Chirag, it was such a pleasure having you on today, and I look forward to speaking with you real soon.

Chirag: Yeah, thank you for having me. It was a pleasure talking to you as well. Bye for now.
