MLOps Community

From Spikes to Stories: AI-Augmented Troubleshooting in the Network Wild // Shraddha Yeole // Agents in Production 2025

Posted Jul 28, 2025 | Views 60
# Agents in Production
# AI-augmented Troubleshooting
# ThousandEyes

SPEAKER

Shraddha Yeole
Senior Software Engineer, Machine Learning @ ThousandEyes (part of Cisco)

Shraddha Yeole is a Senior Software Engineer at Cisco ThousandEyes, specializing in backend systems and machine learning. With over five years of experience, she has contributed to building and deploying scalable data solutions, focusing on data cataloging and developing real-time and batch machine learning systems. Currently, Shraddha is engaged in AI/ML initiatives, leveraging large-scale network and application telemetry to enhance observability and provide explainability through intelligent agents. Passionate about the evolving landscape of agentic systems, she is keen to explore the integration of Large Language Models (LLMs) in real-world applications.


SUMMARY

It’s 2 a.m., and a critical service slows down. Dashboards scream red—packet loss, timeouts, delays. The clock is ticking. Eyes race across a maze of graphs, flipping through visualizations and route tables. One graph leads to another. A dozen tabs open. Fatigue sets in. You’re left guessing: Is it the network, the application, or something else? Welcome to the new normal in network operations—where telemetry is endless, but clarity is rare. This session explores how AI and large language models (LLMs) transform observability by evolving views from data presentation to intelligent data interpretation. Instead of manually piecing together clues, imagine asking, “What’s wrong here?” and receiving clear, contextual insights. AI-powered storytelling augments human reasoning, reduces noise, and accelerates fault isolation—lowering misdiagnosis risk and improving mean time to identify (MTTI) and resolve (MTTR). Join us to see how storytelling is reshaping digital operations.


TRANSCRIPT

Demetrios [00:00:00]: [Music]

Shraddha Yeole [00:00:10]: I'm excited to take you on a journey where we'll talk about how confusing spikes are converted into AI-augmented stories that helped our customers troubleshoot faster in their day-to-day work. So let's say you want to play some music — you're trying to stream your favorite playlist on Spotify, but it keeps buffering, or worse, it's not loading at all. Now where is the issue? Is it your Wi-Fi, or the Internet service provider, or the cloud provider where Spotify is hosted? Or is the issue somewhere in the network transit in between? That's where ThousandEyes steps in: ThousandEyes provides the path trace throughout your digital experience — from your own application server, to the Internet service provider, towards the cloud platform and the backbone of the whole Internet. ThousandEyes sees the whole Internet backbone. That being said, consider: if you want to understand what's going wrong with Spotify, what if we had an agent that explains the issue even if it's not in your network? That's where we say ThousandEyes provides visibility and AI-powered insights across your owned and unowned environments. Now, with ThousandEyes we have very rich telemetry data across application and network metrics.

Shraddha Yeole [00:01:50]: So we have these great visual dashboards with a lot of data insights related to the network and application layers. Now, looking at this dashboard, imagine if we could move from these standard dashboards to AI-augmented stories that help users troubleshoot issues faster. Let me take you through an example. There was a Google Cloud outage last month, on June 12th, and Spotify was hosted on Google Cloud. So you're listening to music, and users are complaining about low availability on Spotify. Now, to troubleshoot this issue as a ThousandEyes user, I have to click through multiple layers — the HTTP layer, the network layer, the BGP routing layer where you see how packets are routed — and through any number of metrics, to piece together where the issue is actually coming from. Is it at the application layer, that is, the web layer, or is it at the network layer?

Shraddha Yeole [00:03:01]: So we have this N number like good visuals and the dashboard and user has to click through this n number of clicks to understand and come up with the fault domain assessment. Now that being said these are the key pain points we observed like there is a Cognitive overload when we are trying to troubleshoot like it's very hard to isolate to fault to mind where the exact issue is. And oftentimes there is a contextual amnesia like the context is lost during the inveting investigation. And sometimes for folks who are not very expert in the network ops, they find it very difficult to interpret this views data. And that's where we deployed our AI agent where with a single click you can get the insights happening across all these layers which correlates the data and provide a great insights. So as I click on this explain button the AI assistant is coming up and now it's connecting to the model and now it's going through each layer by layer data and trying to give me a fault domain analysis. Here we can say the Spotify was not working fine because there was issue at the HTTP layer. That means the application layer was impacted with where we have less availability compared to the baseline data and multiple agents that means multiple locations were impacted.

Shraddha Yeole [00:04:28]: Here we can see in the map multiple agents were impacted with different fees. Like the RACO HTTP phase we can see there are different this error code. So when we click on this button the AI agent comes up, it goes through this investigation, it tries to piece together all these layered metrics and correlate the data and provide an impactful fault domain assessment. And this is helping our customer to troubleshoot faster and go towards the final leg where they are trying to find the issue. Now user can also click on the follow up prompts to get more understanding about specific layer. Okay, what's happening in this specific layer and so on. So that being said now what we have seen. So this is where with the AI agent in production we are trying to shift from data presentation to the intelligent interpretation mode where we can faster have troubleshooting and fault isolation performed and how it is impacting the customer where we are trying to improve the meantime to resolution and meantime to market.

Shraddha Yeole [00:05:35]: And in this through AI agent we are trying to capture the key observations across these layers like HTTP network and partners data. Now how are we deploying this AI agent in the productions? So when I clicked on the explain this button the AI assistant pop up came up. Now it routes the request through a semantic kernel. So it's the open source framework used by Microsoft. And then that semantic kernel is routing the request to our views agent. Now views agent underneath makes a call to the LLM model and provide the data and get the response back and display the response on the AI assistant. So in this quick architecture diagram we can see like it goes to the views agent, makes a call to our API endpoints and before providing the response like hitting the call to the LLM agent, we do lot of pre processing and we get the response from the LLM agent. And for the faster explanations we are also storing the chat history.
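The request flow she describes — pre-process the telemetry, call the model, cache chat history for follow-ups — can be sketched in a few lines of Python. All names here (`preprocess`, `call_llm`, `explain`, the field names) are hypothetical stand-ins, not ThousandEyes' or Semantic Kernel's actual API:

```python
import json

chat_history = []  # kept per session so follow-up prompts reuse context

def preprocess(raw_telemetry):
    # Keep only the fields the prompt needs, cutting token count
    # (the talk's "pre-processing" step before hitting the LLM).
    keep = ("layer", "metric", "value")
    return {k: raw_telemetry[k] for k in keep if k in raw_telemetry}

def call_llm(prompt):
    # Stand-in for the real model call (e.g. Claude via an SDK).
    return f"Fault domain assessment for: {prompt}"

def explain(raw_telemetry):
    # The "Explain" click: enrich, prompt the model, store history.
    enriched = preprocess(raw_telemetry)
    prompt = json.dumps(enriched)
    response = call_llm(prompt)
    chat_history.append({"prompt": prompt, "response": response})
    return response
```

In a real deployment the orchestrator (Semantic Kernel, in her architecture) sits between the UI and `explain`, but the shape of the loop is the same.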

Shraddha Yeole [00:06:42]: Now the key part how we are implementing this agent tick behavioral in our agents and we really use different techniques to come up with the prompts. So the model which we are using for this is anthropic cloud 3.7 sonnet model which is a good reasoning model. And we use different techniques for the prompting like role prompting. We are assigning a role like consider you are a network engineer going through solving these scenarios. We really invested some time understanding how we can employ the chain of thought reasoning which gives the domain expertise while resolving the step by step analysis. We work very closely with solution analysts or subject matter experts who helped us form this chain of thought prompt reasoning. We also included some few short prompting where we demonstrated how should the input and output look like. For example to give the full domain analysis like users should be able to click on the links very quickly and what should be the response structure you should adhere to.

Shraddha Yeole [00:07:48]: So these were the few prompt related techniques which helped us and how. Now the main part like how did we evaluate the response of our prompts and overall the evaluate the AI agents. So to start with we started having the manual evaluations done. So we worked with our subject matter experts. We got some ground truth responses from them where we can see like the Spotify there was a well known outage so we had the whole block analysis done. So we used that data, we ran our POC with the agents and that served as a ground truth data. We have a human in feedback loop to evaluate the responses. Now we had also employed the LLM Azures to track different metrics like what is the hallucination rate or how we can capture the answer relevancy.
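The LLM-as-judge setup she mentions reduces, at aggregation time, to turning per-response verdicts into rates. A hedged sketch, where `judge` is assumed to be a callable (in practice itself an LLM call with the ground-truth outage report in context) returning boolean verdicts:

```python
def judge_scores(responses, judge):
    """Aggregate judge verdicts into simple rates.

    `judge(response)` is assumed to return a dict like
    {"hallucinated": bool, "relevant": bool}.
    """
    n = len(responses)
    verdicts = [judge(r) for r in responses]
    return {
        # Fraction of responses the judge flagged as hallucinated.
        "hallucination_rate": sum(v["hallucinated"] for v in verdicts) / n,
        # Fraction of responses the judge deemed relevant to the question.
        "answer_relevancy": sum(v["relevant"] for v in verdicts) / n,
    }
```

Tracking these rates over prompt revisions is what makes the manual, SME-driven ground truth reusable as a regression suite.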

Shraddha Yeole [00:08:46]: The prompt alignment or the data being fed to the LLM is right and it's adhering to that overall matrix. Now another part is to come up with the right LLM output structure. We followed some standard guidelines from the anthropic model where we figured out the good response when we use the XML tags like instructions or examples and output format. We also made sure to guide the structure with the JSON schema to provide it into our system prompt which really helped LLM to understand what kind of data is coming in. In the previous panel discussion with Discuss like FAST API or pydantic model that's where we also embedded the data model on annotations which really helped LLM to understand the data model and learn more about what are the fields being injected into the LLM model. That being said, this was the overall strategy used to perform the prompt engineering. Yes, there were a few challenges we observed with specific to our scenario. As you saw we have wide variety of telemetry data across application and network and that's where we have a sheer volume of data and complexity.

Shraddha Yeole [00:10:05]: We had to do lot more of pre processing enrichment on top of it in order to LLM guide better results. We have this different matrix like HTTP related timing, KPIs or latency. So we had to look into the broader concept of baselining those data or use some predefined ML strategies before we provide data to the LLM. Yes if you provide the raw API data as is to LLM there were few like issues leading to high token counts and weak correlations. So we had to pre process and enrich data in a certain format very quickly. Talking about the next steps we are employing how we can have a continuous learning or human in feedback loop incorporated with ground truth data with the outages data and the subject matter review. Currently we are using the anthropic model but going forward for these AI agents how we can fine tune the model based on our network data and provide more training data and fine tune the model and host that model within our environment. Thank you.

Demetrios [00:11:11]: Excellent, thank you so much. There are some super practical tips to take away from this — I really appreciate this talk. We're closing in on time, though, Shraddha, so I'm gonna keep it moving. If anybody has questions for you, they can put them in the chat or hit you up on LinkedIn so we can keep the conversation going.

Shraddha Yeole [00:11:41]: Sounds good. Thank you.
