GenAI Traffic: Why API Infrastructure Must Evolve... Again
SPEAKERS

Erica Hughberg is a technical leader and community advocate passionate about helping engineering teams build scalable, secure, and human-centric application platforms. With a background in software engineering and a deep understanding of cloud-native technologies, she specializes in driving the adoption of open-source projects like Envoy Gateway, Istio, and Kubernetes Gateway API, which enable organizations to simplify traffic management, security, and API distribution.
As a maintainer of Envoy AI Gateway, she plays a key role in shaping the future of API infrastructure. She focuses on features to ensure organizations can securely and efficiently integrate AI-powered services while simplifying traffic management, security, and API distribution. In the Envoy community, she drives collaboration, mentorship, and contributions that advance the project and its adoption.
Lastly, as a believer in the power of storytelling, Erica enjoys translating complex technical concepts into engaging, accessible narratives in the form of social media posts, conference talks, podcasts, and educational content.

At the moment, Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
SUMMARY
The way we handle API traffic is broken for GenAI.
We've spent years optimizing for microservices—fast, stateless, and lightweight API calls. But GenAI changes everything. Requests are slower, heavier, and more complex, requiring long-lived connections, massive payloads, and streaming responses. Suddenly, traditional API gateways are struggling—timeout limits are too short, rate limiting models don’t fit, and payload constraints are blocking innovation.
In this episode, we unpack the new challenges of GenAI traffic and why infrastructure must evolve—again. We look back at previous API shifts, from the C10K problem to the monolith-to-microservices revolution, and how they reshaped networking. Now, AI-driven workloads demand a new kind of API gateway—one that handles token-based rate limiting, cost-aware request shaping, and scalable AI inference traffic.
TRANSCRIPT
Erica Hughberg [00:00:00]: So my name is Erica Hughberg, and I work as a community advocate at a company called Tetrate, which means that I listen to people out in the industry and ensure that the solutions that are being built actually address people's real needs. And how do I take my coffee? Unfortunately, I normally take it cold. Not by choice, but I do take it black. So I do often end up with a cold cup of coffee on my desk because I didn't drink it fast enough.
Demetrios [00:00:30]: We are back in action for another ML Ops community podcast. I am your host, Demetrios. And talking with Erica today, I learned a few things about the evolution of the Internet and how we do things on computers in general, but also her theory on how we've been plowing towards one reality of the Internet and because of large language models, that's all been flipped on its head. I'm not going to spoil anything. I just want to get right into the conversation. I thoroughly enjoyed talking with Erica. She has the best sense of humor and I appreciate her coming on the pod. Let's do it.
Demetrios [00:01:18]: I want to jump into the tofu and potatoes on this. So it sounds horrible. We'll just go with it.
Erica Hughberg [00:01:27]: Yeah. I'm not a big fan of tofu. I'm very particular about how I eat tofu. Only in a pad Thai when it's crispy.
Demetrios [00:01:34]: Yeah. But there's so many ways you can make tofu.
Erica Hughberg [00:01:38]: I've made some really bad tofu in my life. I tried to make some scrambled tofu. That was not a good idea.
Demetrios [00:01:44]: Yeah, that's hard.
Erica Hughberg [00:01:47]: Yeah. I failed, so I'm not doing that again.
Demetrios [00:01:50]: So, anyways, let's get into the meat and potatoes of this conversation, which is the Internet is changing a little bit. You've been spending a lot of time on gateways and thinking about different gateways. I think.
Erica Hughberg [00:02:06]: Yeah.
Demetrios [00:02:07]: What? Can you break it down for us and then we'll, like, dive in deeper on it.
Erica Hughberg [00:02:11]: Yeah. So I do think it's helpful to sometimes take many steps back to how the networking of the Internet and how our applications are communicating over, like, the last. Since 2010. Really? Or even before. No, before then. Since the early 2000s. I got the years wrong. Look, I'm getting old.
Erica Hughberg [00:02:33]: I was like, oh, maybe it's not that long. So the last 25 years, roughly, have been really interesting in the evolution of the Internet. So if we go back to the early 2000s, so 2001, 2002, 2003, what was happening during that time, and how this is relevant to GenAI, will hopefully become obvious in a little bit. But if you were around then. I do appreciate that not everyone is older than 25 years old.
Erica Hughberg [00:03:03]: There are people who are younger than 25. But for those of us who were around, it was a very exciting time. We were going away from dial up to broadband and more and more people were getting access to the Internet and it was getting more common that companies had websites. We saw the rise of the social media and forums. We were on forums back then. We were on web forums talking to people we didn't know. Very exciting. And IRC channels.
Erica Hughberg [00:03:38]: Very cool. If you know what that is, I hope you take your supplements. But I, I am taking mine.
Demetrios [00:03:46]: So it rings too true. Oh my God.
Erica Hughberg [00:03:50]: Well, so if you do remember these things, what was happening was that with more people getting devices and connecting to the Internet, that also meant we had a greater need to handle many, many concurrent connections to web servers. So imagine Facebook, right? You couldn't really have a social media app with only like a few thousand people on it. That wouldn't be very fun. Like when I say a few thousand, we're talking single digit thousands, not hundreds of thousands. So as we were scaling up and having this more interactive, multi-connection Internet, we were hitting a problem that the traditional thread-based proxies were struggling. So what is a thread-based proxy? It would mean that you had a single connection per thread. So imagine you went to a restaurant and you were ordering food. So you come in and you get a waiter for your food.
Erica Hughberg [00:04:55]: You order your tofu or whatever you like, maybe you find your fancy tofu dish, and at that point, instead of the waiter just going around serving other tables while you're waiting for your food to cook, the waiter would stand there by your table and wait for your food to be ready, then go back to the kitchen, get your food, and deliver it to you. So that waiter would be busy the entire time your food was cooking. When the Internet exploded, we ran into a problem that is known as the C10K problem: handling 10,000 concurrent connections. And this just continues to explode as the Internet grows, right? The problem isn't just isolated to 10,000 concurrent connections; it continues to grow as the Internet grows and more users are on there. But that was the beginning of that problem. So then we got waiters that actually didn't have to stand by the table anymore and wait for your food to be ready. Instead of the waiter having to stand there with you while you waited for the food, we got a better model where waiters could just put the order in a system and go and wait other tables. And even if your waiter was busy, there was a runner who could go and deliver your food when it was ready, because they know what you ordered and what table you're at.
Erica Hughberg [00:06:19]: Because there was a system. This was the move to event-driven proxies. So now we were able to get orders in. So that's the event. The order came in, the request came in, and we sent it to the kitchen; that's the backend target server you're getting to. They're processing it, cooking, cooking, cooking. And you are sharing threads, so many connections could share threads and different workers could pick up and deliver the response.
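To make the restaurant analogy concrete, here is a minimal, purely illustrative Python sketch (not any specific proxy's code) contrasting the thread-per-connection model with an event-driven one; the cook functions are hypothetical stand-ins for the backend doing the work.

```python
import asyncio
import socket
import threading

def cook(request: bytes) -> bytes:
    # Stand-in for upstream work (the "kitchen").
    return b"HTTP/1.1 200 OK\r\n\r\n" + request

async def cook_async(request: bytes) -> bytes:
    await asyncio.sleep(0.1)  # simulated upstream latency; the loop serves others meanwhile
    return b"HTTP/1.1 200 OK\r\n\r\n" + request

def thread_per_connection(server_sock: socket.socket) -> None:
    """One thread per client: the 'waiter' is parked while the kitchen cooks."""
    while True:
        client, _ = server_sock.accept()

        def serve(conn: socket.socket) -> None:
            request = conn.recv(1024)    # blocks this thread
            conn.sendall(cook(request))  # the thread is tied up for the whole lifecycle
            conn.close()

        threading.Thread(target=serve, args=(client,)).start()

async def event_driven(host: str, port: int) -> None:
    """One shared event loop: a waiter takes other orders while each dish cooks."""
    async def serve(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        request = await reader.read(1024)        # yields control instead of blocking
        writer.write(await cook_async(request))  # other connections progress meanwhile
        await writer.drain()
        writer.close()

    server = await asyncio.start_server(serve, host, port)
    async with server:
        await server.serve_forever()
```

The point of the contrast is only that the event-driven version keeps a handful of threads busy across many in-flight connections, which is what broke the C10K barrier.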
Erica Hughberg [00:06:49]: You didn't have to have your specific waiter. So that was the cool thing that happened during that time. And then what happens after that? So cool, we solved the connection problem. We can handle lots and lots of connections. That was exciting, right? No one has to sit at the table and wait. Well, we still have to wait, but the waiter doesn't have to wait. And then what happens is this move to break down all of our apps.
Erica Hughberg [00:07:15]: So this happens around the early 2010s, and this movement really starts kicking in around 2015. And that is, we used to have big monoliths. And what is a monolith, really? In software engineering, I like to think of it as: imagine you have a big box, and all the little components that make up your software are in this box. So how you log in is in the box. Maybe imagine it like a little teddy bear you shoved into that box, and then you have a box full of hair ties in the box, and you have lots of little functionalities in the box. You throw it all in one big box.
Erica Hughberg [00:07:56]: The problem with these big boxes was, as the number of users increased, we needed to start to scale horizontally. And this is when we come into the world of cloud engineering and scaling horizontally. But imagine this big box with many different features, from viewing images to logging in to, if you think about the social media part, loading your feed, like, get all the posts from my contacts or connections or friends and show them. That is one feature of this big box. But this box did so many things, and we're like, oh, if we're going to scale horizontally with more and more users, this is getting more and more expensive. So cloning this big box over and over again started to get computationally expensive, because we needed to reserve so much memory for the...
Demetrios [00:08:53]: There's a bunch of waste.
Erica Hughberg [00:08:54]: Yeah, yeah. So we're like, okay, well, what if we took the teddy bear out of the box, and instead of cloning the whole box, we just clone the teddy bear that we put in the box. Right? That's where we start breaking down the boxes into smaller boxes. So the teddy bear now has its own box. It doesn't have to be in the big box with all the other junk you threw in there. Now it's got its own box and it's a much smaller box. So now we can scale up that much more resource efficiently.
Erica Hughberg [00:09:27]: So that drove the move to microservices. We went from the monolith, a big box with lots of stuff shoved into it, to lots of smaller boxes with just one or a few items in each box that could scale together. So it became much more resource efficient. Ah. But as we started breaking apart these boxes, we ran into a networking problem, of course, because when we had the monoliths, it was all very straightforward from a networking and proxy perspective. So, yeah, we'd solved the multiple connections to the proxy. Well done, everybody. We'd done that.
Erica Hughberg [00:10:05]: But now, well, where are we going to send the traffic to? So now we had a new problem, because as we tried to make all of our systems more resource efficient and we were scaling up all of these services, we had this fascinating problem that, wait a minute, stuff's moving around. They're on different addresses now. And as you scale up, you have more teddy bear boxes, and then the teddy bear boxes are moving around as well, because someone came up with this bright idea called Kubernetes. And do you know what Kubernetes really does? What it does is move your teddy bear box and your other boxes around on servers, like actual computer hardware; they are called nodes in Kubernetes. But what it really does is play box Tetris. That's what Kubernetes is really good at. It's just stacking these boxes for maximum and most efficient resource utilization on your nodes.
Erica Hughberg [00:11:04]: So now Kubernetes may be moving your teddy bear around as it sees fit, because maybe it fits better over there. Because, yeah, Kubernetes is just about playing box Tetris with computing resources. So, okay, now stuff's moving around. So the network engineers, they're pulling their hair out. They're like, we can't keep up with all of this moving around and changing addresses. This is also what becomes interesting, because we didn't just need to handle multiple connections anymore. Now we have to be able to dynamically and quickly update the proxy on where it's pointing traffic to from a logical perspective. Oh, you want to reach the Teddy Bear API. Okay.
Erica Hughberg [00:11:44]: These are all the places that serve the Teddy Bear API, and now we have to keep up with that. So this is again a big shift in how we handled proxies, because up until this point, proxies like nginx were statically configured. They were not dynamically updated as your situation changed. That is actually what drives the introduction of proxies like Envoy Proxy that can dynamically reload configuration, so it doesn't need to restart at all. As your targets are moving around, your teddy bears are moving around in different boxes, Envoy Proxy doesn't need to restart; it just gets dynamically updated. And now, you could say Envoy Proxy was really created for that. It was created in the era of breaking things apart.
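As a rough illustration of the dynamic-configuration idea described here (Envoy itself receives this kind of update from a control plane over its xDS APIs, which this sketch does not attempt to reproduce), assume a hypothetical teddy-bear-api whose endpoints keep moving:

```python
# Purely illustrative: a routing table swapped at runtime as endpoints move, no restart.
import asyncio
import random
from typing import Dict, List

# Hypothetical service name and addresses.
routes: Dict[str, List[str]] = {"teddy-bear-api": ["10.0.0.5:8080"]}

async def watch_endpoints() -> None:
    """Stand-in for control-plane pushes as the scheduler moves pods around."""
    while True:
        await asyncio.sleep(2)
        routes["teddy-bear-api"] = [f"10.0.0.{random.randint(2, 20)}:8080" for _ in range(3)]

async def handle_request(service: str) -> str:
    """Each request reads the current table, so updates take effect immediately."""
    return f"forwarding to {random.choice(routes[service])}"

async def main() -> None:
    asyncio.create_task(watch_endpoints())
    for _ in range(3):
        print(await handle_request("teddy-bear-api"))
        await asyncio.sleep(1)

asyncio.run(main())
```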
Erica Hughberg [00:12:38]: So now we've taken a long step back. During this microservices era, something else cool happens: we start to optimize. We want to do loads of little requests, and they're light and they're fast. And these little services, like our Teddy Bear service, and I imagine this like a little cute pink teddy bear in my head by the way, this Teddy Bear service had to be really fast and it needed to be really lightweight. Because if we want to do really good horizontal scaling, we need things to start fast and respond fast. I mean, responding fast kind of comes with it if you can build a really self-contained and thought-through service. But so we now optimized how we think about networking in this microservices era to be fast; the process itself should respond within single-digit milliseconds. Right? It shouldn't take longer.
Erica Hughberg [00:13:38]: Keep this in mind now, because with GenAI, a model service is fast if it returns a response in 100 milliseconds. Think of it like this: the Teddy Bear API service responded inside of itself in the space of 1 or 2 milliseconds, or 5 maybe, but when we're looking at LLM services, we're talking about anywhere between 10 and 100 times slower at its fastest. And no, we're not talking network latency, we're just talking about the process itself responding. So it's a lot slower at responding from the first byte. Then we have all the fun stuff. We talk about time to first byte because LLM stuff also tends to be streaming, and we really then focused on time to first byte.
Demetrios [00:14:35]: That was the best explain it like I'm 5 description of the evolution of software engineering I've ever heard. That is amazing. From the restaurant analogy to the teddy bear to kubernetes just being Tetris. That is amazing. And I love each little piece of that. And so now in this world of everything being an API and our microservices paradigm that we're living in, one thing that is fascinating is not only that, okay, these models are a bit slower, but they're gigantic. Right? Yeah, that's a little different too. I think you were mentioning that we've been building for super small.
Erica Hughberg [00:15:19]: Yeah.
Demetrios [00:15:19]: Super fast type of workloads and APIs, and now we're looking at really big and really slow type of things.
Erica Hughberg [00:15:31]: Yes. Also, what's really interesting is the workloads; that's an entire topic in itself, how to manage the LLM workloads in your compute. I think there's also this very interesting aspect where people are looking at how small they can be, because again, we are very much gravitating to how can we make things smaller and more scalable. There are some really interesting ideas on how you can maybe even run really lightweight models towards the edge of your networking stack. For those who may not know, in the world of the Internet we didn't just have to deal with lots of connections. We have something called content delivery networks out on the Internet. So when you go on a website and there are images on there, imagine you just go to Facebook or Instagram and there are lots of photos. Every time you are on your phone looking at photos on Instagram, when I go and look at Instagram in the United States, those photos were most likely cached on an edge through a content delivery network system. Whereas when my mother back home in Sweden, because I'm from Sweden, when she opens Instagram and looks at photos, they don't come all the way from some data center in the US or in Germany.
Erica Hughberg [00:17:02]: Most likely a lot of them would be coming from a CDN really close to her in Sweden, so that it can have that perception of fast delivery. So there are a lot of interesting thoughts on how we can bring at least lightweight GenAI services closer to those edges, closer to people, so things are faster. But I am not an expert on that. I just know that a lot of people are thinking about how small you can make these actual workloads and how efficiently you can get them closer to people. People are even talking about what you can realistically run on a client device.
Demetrios [00:17:42]: Yeah, exactly. And you see that a little bit with what Apple Intelligence wanted to do. Right. All of the simple queries we're going to do on your own device, and then if you need a bit more, we're going to bring it to the cloud and it will get done, but it's going to be all private and in our cloud type of thing.
Erica Hughberg [00:18:05]: Yeah. And yeah, I have the. I think my favorite problem is how big the requests and responses have become. The actual network traffic is getting heavier.
Demetrios [00:18:17]: I guess the interesting question there is, like, what happens now with the network traffic?
Erica Hughberg [00:18:25]: Yeah. So first of all, we were really optimizing for small requests. A lot of gateways out there, by the way, have request and response size limits, especially if you want to be able to interrogate the request or response body and the actual content of the request.
Demetrios [00:18:45]: And why would you want to do that? For security reasons.
Erica Hughberg [00:18:48]: Yes, specifically for security. And in the world of large language models, people are very interested in interrogating both request and response content, to either protect information from leaking or protect against information coming in, like malicious information coming into your system. But that's fairly standard web application firewall territory. Another interesting thing is what people want to do as traffic comes in. So imagine you are making an LLM request, a query. You have an application developer who's like, I'm going to build a cool app that people can write cool stuff into. But the person building the application, they are no LLM expert.
Erica Hughberg [00:19:40]: They're like, I want to have this cool thing.
Demetrios [00:19:42]: Yeah.
Erica Hughberg [00:19:43]: So then imagine that they're like, I just want to have a simple LLM API to hook into. I don't want to care about picking a model or service provider; that's not my expertise. Imagine that the developer just wants to send an LLM request to an API. Then why would someone want to interrogate the actual content of this request? Well, I've seen some really interesting things that people are doing out there, like literally using a lightweight LLM to pick a model based on the request. To be able to run an analysis on what is the most appropriate model for this type of request, you need to have access to the body to make that decision. Otherwise you can't, right? You need to be able to access the body. So it obviously differs on how big this body is where you start running into problems.
Erica Hughberg [00:20:39]: But you can't assume that every body is going to be small. Right. So you need architecture and infrastructure that allows you to do this for more unpredictable sizes of the request body, and then you have the response body as well. That can get really complicated, especially with streaming, and how you're going to try to protect that. But yeah, I find it very fascinating. So that's like one problem in terms of how the network traffic shapes up differently with it being bigger.
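A hedged sketch of the body-based routing idea: the gateway has to read the request body before it can choose a model. The model names and keyword heuristic below are made up for illustration; a real setup might call a lightweight classifier or small LLM instead.

```python
import json

MODELS = {  # assumed model names, purely illustrative
    "small": "small-chat-model",
    "code": "code-tuned-model",
    "large": "large-reasoning-model",
}

def pick_model(raw_body: bytes) -> str:
    """Inspect an OpenAI-style chat payload and choose a backend model."""
    body = json.loads(raw_body)  # requires access to (and buffering of) the body
    prompt = " ".join(m.get("content", "") for m in body.get("messages", []))
    if "```" in prompt or "def " in prompt:
        return MODELS["code"]
    if len(prompt) > 4000:       # long context: send to the bigger model
        return MODELS["large"]
    return MODELS["small"]

request = json.dumps({"messages": [{"role": "user", "content": "Write a haiku"}]}).encode()
print(pick_model(request))       # -> small-chat-model
```

The catch she describes is exactly the `json.loads` line: the gateway must buffer an unpredictably large body before any of this logic can run.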
Erica Hughberg [00:21:14]: Both requests and responses, and the fact that the APIs we are dealing with are incredibly open ended. They are not our old, traditional microservices APIs. When I say they're old, that sounds sad because they're fairly new as well. But I think we can call them old now, because they were very much about being deterministic, having very clear expectations: this type of request is always going to take X amount of milliseconds, and you want to be very clear on that. The slowest one is going to be 50 milliseconds, for example, and then you may be able to do ones that are faster, but you have a very clear idea. Whereas in the world of LLMs, this is incredibly unpredictable.
Erica Hughberg [00:22:00]: And then you go beyond LLMs, you go into inference models for images. It gets even more mental. You start working with even bigger amounts of data when you go into images, and then you can go into video. Now we're dealing with even larger data. So even though LLMs are pushing us, imagine where we're going with media that isn't just text.
Demetrios [00:22:26]: Yeah. And what kind of infrastructure we will need to build out for the future that we all want. And you hear AI influencers on the Internet talking about how, oh well, we're just going to not have movies be made anymore. They'll be made for us and our individual wants and what we're looking for off of a prompt or whatever. If we do want a world where there's going to be a lot of these heavier files being shared around or heavier creations from the foundational models and that's more common. What kind of infrastructure do we need for that? Because it feels like we've been going in one direction for the past 15 years and we've really been trying to optimize for this fast and light.
Erica Hughberg [00:23:20]: Yeah.
Demetrios [00:23:20]: But now we're needing to take a bit of a right turn.
Erica Hughberg [00:23:28]: Yeah. And think a bit differently around how we deal with long-running connections, long-lived connections. Because there's a difference. When you look at what we did with streaming, like when you're using Netflix and you are streaming video: when Netflix is streaming media out to your TV or your device, they don't have any need at all to interrogate the output. So you can bypass a lot of the things you'd otherwise have to worry about. Whereas when you look at output from LLMs or large visual models and things like that, you actually want to interrogate the output before giving it to a user, which is different from what you see with streaming Gilmore Girls on the Internet, right? They don't need to check that the Gilmore Girls episode is actually what it is, right? Because at one point I thought, hmm, Erica, I said to myself, this streaming problem, and doing that at large scale fast, has been solved, hasn't it? And it has been solved, in the scenario where you have known, controlled data: the data you are sending is known and controlled, you know exactly what it is.
Erica Hughberg [00:25:04]: You are not worried that it's going to send out something that you don't know what it is, right? So that kind of streaming is very different from what we're looking at with streaming responses from GenAI services, because the security layers and the control of what we're sending become different.
Demetrios [00:25:24]: Now I was thinking about how we move towards an Internet that is supporting more Gen AI use cases. What does that look like from the networking perspective, from the gateway perspective. I know you all are doing a ton of work with the Envoy AI gateway. I imagine it's not just all, hey, let's throw an awesome gateway into the mix and that solves all our problems, right?
Erica Hughberg [00:25:53]: Yeah, yeah, it doesn't. Because even with the existing foundation, if you look at how Envoy Proxy itself operated, we had to make enhancements to Envoy Proxy itself. And then we started the Envoy AI Gateway project to further expand on the Envoy Proxy capabilities and control planes. I'm going to explain a little bit what that is in a moment. But if we take one step back and look at what happens in the world of gateways, what I think is super fascinating is that with the rise of GenAI, there were a lot of Python gateways, Python-written gateways, that came out, because it seems natural, right? Like, oh, we wrote all of this other cool stuff in Python, and a lot of people who came from the machine learning and AI world are incredibly comfortable with Python. And if you want to do cool stuff like automatic model selection on an incoming request, you kind of have to do that in Python, because now you're back in the machine learning and GenAI space. So it becomes this weird thing: how do we do this smart stuff we want to do, but also not run into the single-waiter-per-table problem? Because Python gateways fundamentally run into a problem, and it's not about how smart the people writing these Python gateways are; it's not their fault.
Erica Hughberg [00:27:20]: The problem comes back to the foundations of the Python language, where effectively there's an interpreter, because Python is an interpreted language, not a compiled language. So you have this process that is interpreting Python, and that is threaded. And when we come into threads, we come back to the single-waiter-per-table problem. Even so, you can start trying to be clever and simulate what is referred to as event-driven architecture. Don't confuse that with Kafka. The event-driven architecture of a proxy is when you can have one waiter serving multiple tables and just putting things into the system. So it's like this little event notification within the restaurant, within the proxy. You can simulate a lot of that with Python, but you are going to eventually run up into the issues and the constraints, well, we can call them issues.
Erica Hughberg [00:28:28]: Python itself isn't the issue. It's just that you will run into issues because of the constraints of the Python language if you want to handle loads and loads of connections at large scale. So that's where we run into problems. And how do we solve this? How can we combine, you know, the restaurants where waiters can serve multiple tables with the smart stuff that Python can do? That's how we end up leaning into Envoy Proxy, which is very much the restaurant where waiters can serve many tables. That is what Envoy Proxy is really great at, and it has this event-driven architecture.
Erica Hughberg [00:29:16]: Envoy Proxy, with Envoy Gateway and Envoy AI Gateway, allows us to create an extension mechanism where we can bring in the cool logic, like automatic model selection written in Python, and still have the one-waiter-serving-multiple-tables proxy, only going to get that special order from the Python extension to do smart stuff when we need to and when we want to. So that is where we get the network benefits as well as being able to bring some of the smart stuff in. And that is what makes it really interesting, how we can bring these worlds together. But I still believe we're going to start seeing challenges with some of the response times, like time to first byte and completing the stream. I just imagine that's going to continue to grow as a need, and we are going to have to start looking further into the internals of our proxies. I'd be surprised if we don't continue growing in that space.
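A sketch of the callout pattern being described, assuming a hypothetical ask_extension service: the proxy's event loop consults the external Python "smart stuff" only for requests that need it, and other requests keep flowing while it waits. This mirrors the idea, not Envoy's actual external-processing API.

```python
import asyncio

async def ask_extension(body: bytes) -> str:
    """Stand-in for a network call to an external processor service."""
    await asyncio.sleep(0.05)  # network hop plus Python "smart stuff" time
    return "large-reasoning-model" if len(body) > 1000 else "small-chat-model"

async def proxy_request(name: str, body: bytes, needs_smart_routing: bool) -> None:
    # Only consult the extension when the route actually calls for it.
    model = await ask_extension(body) if needs_smart_routing else "default-model"
    print(f"{name}: routed to {model}")

async def main() -> None:
    # Three requests in flight at once; none blocks the others while the
    # extension is being consulted for the first two.
    await asyncio.gather(
        proxy_request("req-1", b"x" * 2000, needs_smart_routing=True),
        proxy_request("req-2", b"short prompt", needs_smart_routing=True),
        proxy_request("req-3", b"health check", needs_smart_routing=False),
    )

asyncio.run(main())
```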
Demetrios [00:30:36]: You know what it reminds me a lot of is the folks who build Streamlit apps. It's like, yeah, I built this Streamlit app because I just code in Python and I like Python. You're like, Streamlit is probably not the best for your front end, but it gets you somewhere. And then when you want to take it to really productionizing it and making it fancy and doing all that cool front-end stuff, then you can go to React or whatever, yeah, bust out your Vercel jobs or your Next.js, and you actually make something that is nice for the end user, nice on the eye, and has a bit of design to it. But sometimes you just are like, yeah, all right, cool. Well, Streamlit gets me this MVP and that's what I want.
Demetrios [00:31:29]: And in a way it's almost like the parallel here is that there's been proxies made with Python. They get you a certain way down the road and you can validate ideas and you can see if it works. And then at a certain point you're probably going to say to yourself, okay, now we want to productionize this and Python is not the best way to do that.
Erica Hughberg [00:31:59]: Yeah, like, I think it's interesting, because if you have a small user base, just building something internally for your company, you can probably be fine with your Python gateway. You've got a few hundred users just inside your own company, you're probably fine. But if, even internally, you have a lot of concurrent connections and you start needing to deal with this, you can't have one waiter per table that is standing there waiting. So you have to figure out how you can make more use of all of the connections. So it's sort of interesting: as we talked about what happened at the beginning of the 2000s with the 10,000 concurrent connection problem, the Python gateways, as we bring more GenAI features to the mass market, ultimately run into that problem. And the good news is the concurrent connection problem has already been solved. So then the question is how we can bake these things together to get the cool features.
Erica Hughberg [00:33:03]: But at least the good news is the concurrent connection problem has been solved.
Demetrios [00:33:10]: So there's almost like two levels that you're talking about here: the traffic and the networking, but also then understanding the request and being able to leverage some of this cool stuff, like, oh, well, this request probably doesn't need the biggest model, so let's route it to a smaller model. Or maybe it's open source and we're running it on our own. Maybe it's just the smallest OpenAI model, whatever it may be. And so the levels that you have to be playing at are different. Right. Or at least in my mind, I separate the whole networking level from the actual LLM call and what is happening in that API.
Erica Hughberg [00:33:54]: Yeah, but then you have the smart stuff that has to happen before you send it to the LLM service. And that's where the exciting part is: combining things like Envoy Proxy with a Python service, a filter, a Python filter, to be able to do smart stuff, to make those routing decisions.
Demetrios [00:34:13]: And you don't think that that's just kicking the can down the road, like you're still going to run into problems if you're passing it off to Python at any step along the way.
Erica Hughberg [00:34:25]: Good question. We don't have the connection problem in the same way at that point, because now we can have dedicated Python services that run and scale independently of the proxy. So now we are not dealing with client-to-proxy or proxy-to-upstream connections. Now we have the proxy making a call out to another service, like a Python service. You know how we talked about boxes in Kubernetes earlier? That little Python service box, we can scale it up and scale it down based on demand, so that thing now becomes horizontally and elastically scalable on its own. And there won't be any noisy neighbor problems, because they're not going to block each other.
Erica Hughberg [00:35:14]: And we only run into the waiter problem when we have a limited set of waiters in our restaurant that are associated with one request response lifecycle. Because now we just go to this little Python service, when it's done, it's out of the picture. And also we weren't blocking other requests because of how the envoy architecture works. We weren't blocking the other requests while we were consulting this Python process about what to do.
Demetrios [00:35:43]: I understand what's going on then. And.
Erica Hughberg [00:35:48]: We call them and ask, hey, little Python service, what do you think about this request? I'm going to hang up now and when you have an answer, I'll pick up the phone and we'll continue this journey.
Demetrios [00:35:58]: Yeah, and that bottleneck only is happening potentially at the proxy area. It's not.
Erica Hughberg [00:36:07]: Yeah, that's a connection point.
Demetrios [00:36:09]: Once it goes through the proxy, then it's like free range and there's all kinds of stuff that you can be dipping your fingers into.
Erica Hughberg [00:36:17]: It's that the request lifecycle inside of Envoy is the important part in how you can include external filters, so that you can have this basically event-driven approach: you put stuff into the central system of Envoy, and it then knows what to do. Okay, so this is an oversimplified explanation, by the way. People who are really in the depths of Envoy Proxy could probably be like, oh, Erica, that is not entirely exact in every detail. True. But if you want the 10,000-foot view, I hope that is enough. And if you really want to dive into it, you can spend a long time learning about the internals of this.
Erica Hughberg [00:37:00]: But the good news is it's very clever in how it handles resources and manages connections. So that is really cool. But then, Envoy Proxy is really hard to configure. So I'm going to go on a slight tangent, but an important tangent, about why Envoy AI Gateway is really interesting, because Envoy AI Gateway does two key things when it comes to helping you leverage Envoy Proxy to handle traffic. Every time you have a proxy, you have a two-sided problem. You don't just have the proxy; you need to configure it, and you want to be able to configure it even as it scales up and down, so you have many proxies, not just one. You need a control plane that can effectively and resource-efficiently configure this fleet of proxies. So Envoy AI Gateway brings you a control plane that extends the Envoy Gateway control plane.
Erica Hughberg [00:37:59]: And it's really interesting, I know we don't have time to spend on it, how that control plane is very efficient in how it configures Envoy Proxy and helps you propagate configurations across all the proxies that are running. And then the other part that we've added into Envoy AI Gateway is an external process that helps with specific GenAI challenges, like transforming requests. So one of the things we've done is to have a unified API, so that if I'm an application developer, I don't have to learn all these different interfaces to connect to different providers.
Demetrios [00:38:40]: Smart.
Erica Hughberg [00:38:41]: Because we don't have to put that cognitive load on people who want to build cool apps. Let them build cool apps and then we can worry about the pipes.
Demetrios [00:38:51]: Yeah. And do you need one to be able to leverage the other? I guess. Do you need Envoy to be able to leverage Envoy AI Gateway?
Erica Hughberg [00:39:00]: So when you install Envoy AI Gateway, it actually installs Envoy Gateway, which gives you Envoy Proxy and the Envoy Gateway control plane. When you go through the installation steps, you actually first install Envoy Gateway, and it would run on a Kubernetes cluster. I call this a gateway cluster, by the way, for reference, if you ever look at any of my diagrams and blog posts. And then once you've deployed that, there's a Helm chart and you just install Envoy AI Gateway, and that deploys an external process and an extension of the control plane. So it expands on the functionality of Envoy Gateway and Envoy Proxy. You don't really have to know that you are deploying all of those things.
Erica Hughberg [00:39:45]: But that is what happens when you deploy it. It's all part of the Envoy CNCF project. It's not a separate entity per se; it's part of the Envoy CNCF ecosystem.
Demetrios [00:40:06]: Have you noticed a lot of AI or ML engineers starting to come into the Envoy AI Gateway project and, like, submit PRs? Do you look at that type of stuff? Because I wonder how much of this scope falls onto ML engineers and AI engineers, or if those folks just throw it over to the SREs and the DevOps folks.
Erica Hughberg [00:40:39]: It's a good question. Where we really need the people who have a good understanding of machine learning and GenAI is, I mean, look at these intelligent, maybe Python-based, extensions. If there are people out there who have ideas on how to do, for example, semantic routing to decide the model without doing it in Python, please let me know. Very interesting. This is outside of my area of expertise, so if there's someone who can write it in Rust, please let me know. That would be amazing.
Erica Hughberg [00:41:19]: But yes, we are seeing people who come in who are like, hey, let me show you this Python extension and how it fits into Envoy AI Gateway, and the smart stuff that is definitely outside of my expertise, I find really fascinating and interesting. So if people have that expertise, I would love them to come and help build those extensions into Envoy AI Gateway and bring those features to the community. Because fundamentally, when we look at the Envoy AI Gateway initiative within the Envoy project, it is really about what we're facing right now and how this traffic is changing and shaping up. It is not a single company's problem. It is not a single user's problem. We are all running into this. So coming together and solving these things in open source and maintaining it together, I think is really exciting. And seeing both vendors and users coming into the space and collaborating and, yeah, really sharing knowledge, as you said, like sharing the networking knowledge...
Erica Hughberg [00:42:34]: Sharing the knowledge of, well, what LLM functionality do we have out there? How can we actually bring some functionality into the gateway to make it even smarter? And people who really understand the challenges of how they are running LLM workloads, I think that is a really, really interesting combination. And I've learned so much over the last six, seven, eight months from collaborating with people. So it's been really cool, exciting.
Demetrios [00:43:04]: So I will say that what you are talking about here, it has been 100% validated by the community. And even the way that I know it's been validated is when we do our AI and production surveys and we try and do them every time we do like a big conference or almost once a quarter and we ask people what's going on in your world and what are the biggest challenges? What are things that you're grappling with right now. A lot of people have written back and said some of the hardest things are that there's this new way that we are working with software and working with models. And almost like what you were talking about, where these models are so big, so it makes things much different to handle and it makes everything a little bit more complex. And then on top of that, you don't really have anywhere that you can turn to that has definitive design patterns and folks who have figured it out and are sharing that information with the greater developer community. And so I find it absolutely incredible that a, you're working on it in the open. It's great that like the Open Source project is trying to do that. But then B, you've thought about this idea of, okay, the models like traffic to models.
Demetrios [00:44:43]: The traffic is so different, they're so different. They bring so many other ways of having to deal with software engineering, not just the fast and light way that we've been looking at and trying to get to. Now there's a heavier one, and now we get these timeout problems, or the model doesn't fit, and there are payload constraints, all of these constraints or problems that people are running into, because you try to bring this new paradigm onto the old rails and you see where there are a few cracks.
Erica Hughberg [00:45:29]: Yeah. So in my actual day-to-day work, I am a community advocate, which actually means a lot of what I do is understanding what's happening out in the technology community. I work at a company called Tetrate, and we are very invested in the Envoy project, for example, with engineers and people like myself being out there building in open source. But as a community advocate, a lot of what it is for me is talking to people, listening to what's really going on and the challenges they're running into, and advocating for the community when we are looking at how we are building things going forward. So what could maybe be misinterpreted about being a community advocate is that I'm advocating for the solutions to the community. No, no, no. I am advocating for the community, so that the solutions being built address needs in the community, and maybe sometimes even needs the community hasn't realized it has yet.
Erica Hughberg [00:46:38]: Maybe they're running into interesting problems. Have I heard people say they are running into challenges with their Python gateways? Like, why does it seem like the gateway itself is adding latency? Well, let's talk about event-driven proxies. But this is where it's interesting: how can you advocate for and understand the challenges of the community so that the solutions being built meet those needs? So yeah, I find it really nice to hear what problems people are running into. It's really exciting to hear maybe what you're seeing as well in your community, and advocating for people's real needs out in the real world so we don't build science projects. And I think that is also so exciting. You know, you said you had Alexa on here earlier, right? Like collaborating with Alexa and the team in open source and really bouncing ideas together and really getting to real needs. I think that's really fun and nice, and you know, we don't work at the same companies, but we get to collaborate and drive solutions together. Oh, it's really fun.
Demetrios [00:47:47]: Yeah. The thing that I'm wondering about is, do you feel like you have to be technical to be that type of a community advocate?
Erica Hughberg [00:48:01]: That's a good question. I am very technical. I started coding when I was 12, and I've been in the gateway space for many years. I got into the platform engineering and gateway space back in 2015 or 2016, when we started breaking down those monoliths into microservices and the need for gateways became apparent. So I've been in that space since 2015, and I was in the fintech space. And I need to be honest: with this GenAI and this type of traffic pattern we're seeing now, I feel really validated, because definitely between 2015 and at least about 2023, I was dealing with financial analysis APIs that were incredibly open ended. You can be like, oh, I have this portfolio and I want to analyze it. But there's a big difference between analyzing a portfolio with 10 US equity holdings in it versus a multi-asset multinational portfolio with 10,000 holdings in it.
Erica Hughberg [00:49:18]: Analyzing those, even if it's the same API endpoint you're hitting, I hope you can understand that one of them is going to be very fast and easy to respond to, and one is going to take a lot of time processing. And both the input potentially and the output, especially the output, we're talking very big outputs that were starting to hit the limits of what API gateways could handle. You know how we talked about the response body limitations? We started to run into those problems because the responses were so big they were hitting the 10 megabyte limits of many, many gateways out there. So for years I felt that people were telling me, Erica, you are doing your APIs wrong. Clearly those financial analysis APIs, there's something wrong with your design. That's the problem, not the gateway, no? And to be honest, I truly believed that we must be doing something wrong.
Erica Hughberg [00:50:15]: Right? That we have these APIs that were so unpredictable in the time to respond and the size of the response. But then GenAI and LLM APIs came around, and I'm like, wait a minute, I've seen this before: an API where you could put stuff in and it'd be very different what comes out. And it could be slow, it could be big. And now everyone seems to be on board with, okay, this is a real challenge. But actually, a lot of the problems we are solving now with gateways in the world of LLMs benefit these financial analysis APIs as well. So these changes in how we're looking at limiting usage, like in the LLM world, we're using token quotas.
Erica Hughberg [00:51:07]: So maybe you're allowed to use 10,000 tokens in the space of whatever time frame you like, say a day or a week or a month. But those are not numbers of requests. You can make just a couple of requests and hit your quota, or you can make loads of small ones and then you'll hit your quota. So that is different. We actually had to change how we did rate limiting, like normal rate limiting and usage limiting, in Envoy Proxy to allow for this more dynamic way of measuring usage. So you don't just measure usage based on number of requests; you can measure it on another data point. In this case, in the LLM world, it would be tokens, like word tokens. So that is fascinating to me.
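The token-quota idea can be sketched as metering usage on a per-request cost (tokens reported for the call) rather than on the request count. This is an illustrative outline only, not Envoy's rate-limit implementation; in practice the token count typically comes from the provider's response.

```python
import time
from collections import defaultdict
from typing import Dict, Tuple

WINDOW_SECONDS = 24 * 3600   # e.g. a daily quota window
TOKEN_QUOTA = 10_000         # 10,000 tokens per user per window

# user -> (window start time, tokens used so far in the window)
usage: Dict[str, Tuple[float, int]] = defaultdict(lambda: (time.time(), 0))

def allow(user: str, tokens_used: int) -> bool:
    window_start, used = usage[user]
    if time.time() - window_start > WINDOW_SECONDS:
        window_start, used = time.time(), 0       # start a fresh window
    if used + tokens_used > TOKEN_QUOTA:
        usage[user] = (window_start, used)
        return False                              # over quota
    usage[user] = (window_start, used + tokens_used)
    return True

print(allow("alice", 9_500))   # True
print(allow("alice", 1_000))   # False: quota hit after just two requests
```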
Erica Hughberg [00:52:02]: And even just our observability, in how we measure how fast we are responding. So if you think about an LLM and you're streaming a response, streaming tokens back, what you're actually interested in from a performance point of view is response tokens per second, because then you know it's working; stuff's happening, responses are streaming through. If you then correlate that to some of the challenges we saw in financial analysis APIs: we didn't have word tokens, but we were very concerned about time to first byte, the first byte we started streaming in the response. So instead of looking at the time it takes for the entire request to process in the financial analysis API, imagine we can monitor bytes per second, because that will tell us stuff is moving, it's not standing still. So, similar to how we are interested in response tokens per second when we're streaming LLM responses, if we take the parallel and look at the financial analysis APIs, just shifting the way we're thinking about performance and observability is also important when we look at this infrastructure to understand the health of our system.
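The streaming metrics she describes, time to first token/byte and tokens per second, can be measured with something as simple as the sketch below; the fake_llm_stream generator is a stand-in for a real streaming response.

```python
import time
from typing import Iterable, Iterator

def fake_llm_stream() -> Iterator[str]:
    """Hypothetical token source simulating a streaming LLM response."""
    for word in "the quick brown fox jumps over the lazy dog".split():
        time.sleep(0.05)        # simulated per-token generation delay
        yield word

def measure_stream(tokens: Iterable[str]) -> None:
    start = time.monotonic()
    first_token_at = None
    count = 0
    for _ in tokens:
        count += 1
        if first_token_at is None:
            first_token_at = time.monotonic()
    elapsed = time.monotonic() - start
    print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
    print(f"tokens per second:   {count / elapsed:.1f}")

measure_stream(fake_llm_stream())
```

Swap tokens for bytes and the same measurement applies to the financial analysis APIs she draws the parallel to.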
Demetrios [00:53:17]: Yeah. I like that idea of, hey, with the performance metrics, you want to try and get more creative with them. And it reminds me of a conversation we had with Krishna, who works at Qualcomm, and he was talking about how sometimes when you're putting AI onto edge devices, you want to optimize for battery life. And ways that you can do that are by streaming fewer tokens or making sure that the tokens aren't streaming at 300 tokens per second, because people can't really read 300 tokens per second. So why stream them that fast if people aren't going to read it that fast? And if it can mean less battery consumption, then you stream at 20 tokens per second or whatever it may be, whatever that happy medium is.
Erica Hughberg [00:54:15]: Yeah, absolutely. That's a really good perspective. So, yeah, like they all add up. They were like. I think it's fascinating in this space that how so many things are related. Like I normally don't think about battery times because I work in places where, you know, power is plugged in.
Demetrios [00:54:34]: Yeah.
Erica Hughberg [00:54:35]: It's off the mains. Right. There's no charged-up battery. We don't have battery-driven servers. Right. But yeah, coming back to your question, do you have to be technical to be able to advocate for a community in this space? Yes, I do think it would have been very hard for me not to have my technical software engineering and engineering manager leadership background.
Demetrios [00:54:59]: Interesting.
Erica Hughberg [00:55:00]: I think that would be very hard, because I'm in such a technical space. Right. It's so deeply technical. And I believe it would be very hard not to have experience in this space, because sometimes I can hear the problems people are describing, like, oh, this Python gateway is somehow very unpredictable in the latency the gateway itself adds to my requests. Cool, I hear that. They can't necessarily express that the challenge they're running into is that they have run out of waiters in the restaurant. So their requests are waiting outside for an available table and waiter.
Erica Hughberg [00:55:48]: And that is why sometimes the request takes 100 milliseconds and sometimes the request takes, oh, interesting, 700 milliseconds. They can explain the problem they're observing, but you have to understand, okay, are you using a Python gateway? The limitations of the Python language may then be the actual foundational cause of the problem that you are observing. To be able to advocate for the user in that case, I need to understand the cause of the pain to then advocate for solutions to be built in the industry to address their need. Hopefully that makes sense. I think it'd be very hard if you don't. Often you are dealing with users who are observing challenges but aren't in a position to see the opportunities as well, and the causes of the challenges they're observing.
Demetrios [00:56:44]: It's almost like you're part product manager, part like customer success or support in a way and in part community or externally facing. I don't know how you, how you would qualify that. And so it's interesting to think about how these deep questions, deep technical questions or discussions are coming across your desk and you're seeing them and then you recognize, wow, there's a pattern here. And maybe that means that we should try to build it into the product so that we can help these users with this pattern. Because I keep seeing it come up. So let's get ahead of it. Let's help create a feature or whatever to help the users so that they don't have to suffer from this problem anymore.
Erica Hughberg [00:57:42]: Yeah. And I think on that, like being ahead of a problem: I used to be in platform engineering, internally building API platforms, service platforms and such in my old roles. And the challenge always is that you need to try and build things before people are really in pain.
Demetrios [00:58:05]: Yeah.
Erica Hughberg [00:58:06]: Because when people are in true pain, you are too late. So you need to be able to see the symptoms of something that's going to become really painful later, early enough that you can address them, so you have the cure when people come around. But what has been really great, fortunately, unfortunately, however you want to see it, is that some people had already started running into these problems. So you can work to solve those problems for a small set of people early, then showcase that, and then help the sort of second wave of adopters not have to experience the growing pains that the first pioneers had to experience. But yeah, generally it's sometimes hard to even explain to the wider community the purpose of doing something. So they might be like, this problem you're talking about, I don't have it, and you're like, that's great for you. That's amazing. That's awesome.
Erica Hughberg [00:59:18]: I'm happy you don't have it.
Demetrios [00:59:20]: Yeah, talk to me.
Erica Hughberg [00:59:22]: So that's the other tricky part of advocating for a community, because sometimes you have to advocate for people in the community who don't feel like you need to advocate for them yet. They're like, well. And also, they don't need to know what I do, you know, when I...
Demetrios [00:59:43]: Talk to people behind closed doors.
Erica Hughberg [00:59:44]: Exactly. But I do think it's an important read on where the community and the industry are at. Where are they now? How long do we think it is until the pains will start being felt?
Demetrios [01:00:02]: One thing that's gotta be hard about your job is the shifting sands that you're building on. And I say that because a friend of mine, Floris, was saying how he spent in the beginning of the LLM boom. He did so many hacky things to allow for greater context windows on their requests. And he spent so much time on that problem. And then next thing you know, context windows just got really big. And so he, he was sitting at his desk like, damn. All that time I put into this, I could have just waited three months or six months and it would have happened on its own. And so I think about, in the world that you're living in, as you're thinking through some of these problems, how do you ensure, if at all, that you're working on the right problem and you're not going to get into a situation where something that you've worked on for the past six months now doesn't matter because of another piece of, of the puzzle changing and totally making it obsolete.
Erica Hughberg [01:01:19]: In a way, I think for me, when I started looking at features to aid and enable GenAI traffic, I've got to be honest, the first time someone said AI gateway to me, I was like, really? It's just network traffic. What are you talking about? It's just network traffic. We've been dealing with network traffic for a long time. Like, what's up with you? This is just someone, you know, sticking an AI label on a...
Demetrios [01:01:55]: Network component and because they need to raise some funds.
Erica Hughberg [01:01:58]: Yeah. And I was just like, this is ridiculous. And I even had that moment where I looked at the problems, like the big payloads in requests and responses and the unpredictable response times and the high compute utilization, and I was like, yeah, I had those problems for years. I looked at it and I was like, yeah, that's not a GenAI problem. I've had that problem in fintech for a long time.
Erica Hughberg [01:02:26]: Congratulations, you joined the party. That was my initial reaction. I was like, okay, fun for you. Welcome to the club with these problems. But what made me really excited was seeing that the problem being run into in the GenAI space was one I had experienced for over five years. And I was like, well, if I've experienced it for five years, and everyone told me the problem I had was a self-inflicted problem, and now the number of people experiencing the problem has just grown, that made me feel like, okay, even though we can drive the innovation at this point because of the explosion of GenAI, these features don't just benefit GenAI, they benefit beyond GenAI. These feel like fundamental problems in how we handle network traffic, connections, and the processing of request and response payloads.
Erica Hughberg [01:03:39]: And I said therefore, in this particular space, I think we're just actually late. We should have solved this a few years ago.
Demetrios [01:03:49]: Yeah, I guess a lot of the energy too, and the advancements and the money are going into the model layers and application layers, but not necessarily the nuts and bolts and the piping and tubing layers. It's almost like this is a bit of an afterthought when you do hit scale and when you do realize that, oh yeah, we want to make this production grade, how can we do that? And you start to think about that. Hopefully you don't think about it once you already have the product out and it's getting requests. Hopefully you're thinking about it before then.
Erica Hughberg [01:04:35]: Yeah. Like when people come out of their POC workshops.
Demetrios [01:04:39]: Yeah.
Erica Hughberg [01:04:40]: How do they get it industrialized?
Demetrios [01:04:43]: Exactly.
Erica Hughberg [01:04:44]: It's a. And I do think that people should start in their POC workshops. Right. They should be playing around, seeing what's possible. There's no. I don't think people should scale their systems before they need to scale them. That is like, what if you have a bad idea? Like, leave it in the. In the POC workshop.
Erica Hughberg [01:05:04]: Don't take it out. Can stay there on a shelf. You can be like, I built that once. That was cute. No one needed it. And that's okay. So it's. Don't scale too early.
Erica Hughberg [01:05:17]: But now at least I feel that there are enough people out there that need the scale. And I'm really excited about how we don't just bring these features into Envoy AI Gateway, but really this notion of having usage limits that aren't just numbers of requests. That feature is available in Envoy Gateway now. You don't need to use Envoy AI Gateway to leverage it; it is part of Envoy Gateway as of the 1.3 release. So if you have a financial analysis API, you don't have to go and install Envoy AI Gateway to have more intelligent. Intelligent is the wrong word; dynamic is the right one.
Erica Hughberg [01:06:05]: Call it a more dynamic way of enforcing usage limits beyond number of requests.