MLOps Community

Real-Time Event Processing for AI/ML with Numaflow // Sri Harsha Yayi // DE4AI

Posted Sep 18, 2024 | Views 754
Sri Harsha Yayi
Product Manager @ Intuit

Sri Harsha Yayi is a Product Manager at Intuit, focusing on the company's Modern SaaS Kubernetes platform. He leads the development of Numaflow, an open-source, Kubernetes-native platform designed to simplify the development of event processing applications.

SUMMARY

At Intuit, our machine learning teams encountered significant hurdles in event processing and running inference on streaming data. The process of integrating with various messaging systems such as Kafka was both time-consuming and complex. Additionally, our teams needed capabilities for intermediate processing before executing inference as part of their workflows. The need to scale event processing and inference in response to fluctuating event volumes added another layer of complexity. To address these challenges, we developed Numaflow, an open-source, Kubernetes-native platform designed for scalable event processing. Numaflow streamlines integration with event sources and enables teams to perform event processing and inference on streaming data without a steep learning curve. This talk is geared towards ML engineers, data engineers, application developers, and anyone interested in event processing or inference on streaming data. We will demonstrate how Numaflow overcomes these challenges and simplifies the process.

TRANSCRIPT

Adam Becker [00:00:05]: Next we have Sri Harsha. Let's see, Sri, are you around? Okay, we can hear you. Hello, Sri. And I think you have your slides up as well. Where are they? I can't see them.

Sri Harsha Yayi [00:00:17]: Now let me share the screen. Can you see them?

Adam Becker [00:00:22]: They are right here. Nice. You're going to tell us a little bit about what you guys are up to at Intuit. One of my favorite companies. So take it away. I'll be back soon.

Sri Harsha Yayi [00:00:32]: Sounds good. Hi everyone, I'm Sri. I currently work as a product manager at Intuit on the platform team. Today we are going to discuss real-time event processing for AI/ML, and then I'm going to share Numaflow, the open source project that we developed and open sourced for the community. Before we dive deep, a very high-level agenda: I'll share a little bit about our platform team at Intuit, what I mean by event processing and event-driven architecture, the challenges that ML teams face on a day-to-day basis, then introduce Numaflow, and then a quick demo. Intuit is a global fintech company with multiple product offerings, from TurboTax to Credit Karma, QuickBooks, and Mailchimp. These are products people might be aware of in the US and across the world, especially Mailchimp and TurboTax. And you can see some of those numbers, the kind of scale that we operate at, either in terms of money movement or in terms of the predictions that we do on a day-to-day basis.

Sri Harsha Yayi [00:01:32]: All of the platform is developed and run on Kubernetes, and we are strong believers in open source. Some of you folks, especially in the ML ecosystem, might be using Argo Workflows and Kubeflow. Kubeflow is built on Argo Workflows, if I'm not wrong, and Argo was incubated and open sourced by Intuit. The learnings we had from all of these projects shaped how we built Numaflow, and you'll learn about the platform over the next couple of slides. So moving on, let's quickly discuss what I mean by event processing or event-driven architecture. In simple terms, you receive a lot of events and you just want to act on those events. That, in short, is event processing or event-driven architecture, where the events can come from sensors, mouse clicks, and so on.

Sri Harsha Yayi [00:02:18]: And as a developer, an ML engineer, or a data engineer, you want to analyze these events and then send them downstream. So that's a very high-level intro to event-driven architecture. But why do people tend to build event-driven architectures in general? Either to do real-time processing or to have asynchronous workflows. Say, for example, you want to do inference, but inference can sometimes be time-consuming; you generally take an event-driven or asynchronous architecture there. It's more scalable and flexible, because you can scale your components the way you want and add features as you wish. And it's reliable, because you can replay some of your events for batch or near-real-time use cases; for real-time, you always want to process them in real time. So these are a few reasons why you see event-driven architectures across the board on a day-to-day basis. Let's look at some examples of where we see event-driven architectures. Say, for example, on the e-commerce side, you want to do real-time recommendations for your users based on their mouse clicks, search history, and all of that.

Sri Harsha Yayi [00:03:25]: That's more of a real-time event-driven approach. In industrial use cases, you probably receive a lot of data from IoT devices and then want to take actions based on that data. Another could be dynamic pricing, for ride-share apps or any of those real-time analytics use cases. And lastly, fraud detection is another good example that fintech products generally focus on. As a whole, from e-commerce to fintech to industrial and across the board, event-driven systems have been the backbone of the modern tech you see on a day-to-day basis. But let's see what the architecture generally looks like for all of these event-driven or event processing systems. First, you have multiple event sources, as I would call them. The event sources could be a simple sensor or your mouse clicks, with the data received through Kafka, Apache Pulsar, SQS, or Kinesis, and there could be multiple data streams where you receive all of these events. The next thing is that you want to process these events: do some kind of map or reduce, or even simple inference calls, probably calling an LLM, or whatever it may be.

Sri Harsha Yayi [00:04:38]: You want to do some kind of processing or predictions on top of the data, and then finally you send the events downstream to a DB, or maybe another messaging system like Kafka, or whatever it may be. So in short, you receive the events from sources, you do the processing, and then you send the events downstream wherever you want. But let's take a look at some of the challenges that ML teams specifically face when thinking about these use cases of real-time event processing, connecting to multiple sources, and all of that. The first challenge ML teams face on a day-to-day basis is: how do I go about doing real-time event processing? I'm more of a Python person, and if I want to do MapReduce and all of that, I'd probably have to learn Java data engineering frameworks, which are not my bread and butter. Rather, they want to focus on: can I simply use Python to do real-time event processing and all of those use cases? That is one of the pain points you see for ML teams: I don't really want to learn the huge Java frameworks; can I just focus on Python, which is easy for me? The second is complex integrations.

Sri Harsha Yayi [00:05:47]: You saw that you have multiple sources you can receive the data from, and that you probably want to write the data to. But as an ML engineer or data scientist, you ideally want to focus on your event processing and your inference activities, and not worry about how to connect to all of these sources. That's another key pain point you see if you talk to the teams. And lastly, scaling complexities: serving your models at scale as containers is one, and the second is maintaining and debugging them. Real-time event processing infrastructure is complex. If you talk to the platform teams running it at scale, they'll complain: our costs are really high, we don't know how to scale it properly, because on Kubernetes, autoscaling is mostly based on resource utilization like CPU and memory. It doesn't really consider metrics like Kafka consumer lag or the number of pending events upstream.

Sri Harsha Yayi [00:06:49]: So there are complexities in terms of scaling itself. Taken as a package, it's the event processing challenges, connecting to multiple sources, and the scaling complexities; all of these add up across the different personas within your ML teams. So how did we go about solving that? With our platform called Numaflow. Numaflow is a completely open source project developed by Intuit, and you can actually check it out. Let's dive deeper and see what Numaflow is and how you can use it for real-time event processing, or even simply for doing inference and predictions on streaming data. In the previous slide I mentioned some of the challenges teams generally face. The first thing is that you have to deal with a lot of infrastructure.

Sri Harsha Yayi [00:07:37]: How do we abstract that infrastructure away for ML teams, so they can just focus on their data processing needs, or probably just their inference needs? The second is decoupling the sinks and the sources from the event processing itself. And lastly, making it scalable, reliable, and secure. These are the three things we aimed to solve with the platform. So let's dive deeper into what Numaflow looks like. I'll try to explain the platform using a simple example. In the world of LLMs, I'm just going back to the good old days and picking sentiment analysis as an example. So far we have seen that you need an event stream to start with. For this sentiment analysis use case, let's say I just have a stream of events, from Twitter or wherever it may be, some piece of text that you are receiving as an event stream through HTTP or Kafka or whatever it may be. And the second thing is that you want to do some kind of prediction on that, sentiment analysis in this case.

Sri Harsha Yayi [00:08:37]: And then lastly you want to write these results to some kind of a sink. So what does Numaflow offer here? Within Numaflow, your entire processing is defined as a pipeline. Each pipeline basically consists of vertices connected by edges, and each vertex is a simple container. What does Numaflow as a platform offer you? The first thing we said is: we don't want the teams to focus on connecting to multiple sources, so how do we provide those out of the box? We have a lot of out-of-the-box sources and sinks; you can use them and quickly get on with your event processing. For this example, I'm just going to use a Hugging Face model that is already available for sentiment analysis.
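
For readers who want to try this themselves, here is a minimal sketch of loading such an off-the-shelf sentiment model with Hugging Face Transformers. The choice of the default pipeline checkpoint is an assumption; any text-classification model would work the same way:

```python
# Minimal sketch: an off-the-shelf Hugging Face sentiment model.
# Using the default "sentiment-analysis" pipeline is an assumption here;
# any text-classification checkpoint would behave the same way.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("This is a great conference")[0])
# e.g. {'label': 'POSITIVE', 'score': 0.99...} (exact score will vary)
```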

Sri Harsha Yayi [00:09:25]: So I just receive events through an HTTP source, I do sentiment analysis, and then I pass the result on. In the next couple of slides I'll show you how, as an ML team, or as an ML engineer or anyone else, you focus only on your prediction step, where you serve your model, and not on your sources and sinks. That is the operational complexity we alleviate for your ML teams. So let's take a look at the demo directly and see what a pipeline looks like. We have a completely baked-in UI that is part of the product itself. Here is a sample pipeline that I have. I've already sent a few events, so if you look at this one, you have an input source, then I'm doing sentiment inference on that, and then finally writing the events to a log.
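
For context, a hedged sketch of how events could be sent to such a pipeline's HTTP source. The URL pattern follows Numaflow's documented HTTP-source service naming, but the pipeline, vertex, and namespace names here are illustrative:

```python
# Hedged sketch: POSTing one demo event to a pipeline's HTTP source.
# The URL pattern is based on Numaflow's documented HTTP-source service
# naming; "sentiment-pipeline", "in", and "default" are illustrative.
import requests

url = "https://sentiment-pipeline-in.default.svc.cluster.local:8443/vertices/in"
resp = requests.post(
    url,
    data=b"This is a great conference",
    verify=False,  # the HTTP source typically serves a self-signed cert
)
print(resp.status_code)  # 2xx means the event was accepted
```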

Sri Harsha Yayi [00:10:18]: For demo purposes I've used a log, but you can choose any sources and any sinks you like and just focus on writing the inference logic; I'll show you what that piece of code looks like in the next couple of slides. If you take a look, I've already sent a bunch of events: you can see "this is a great conference", and the sentiment is positive. And "the weather was a little bit gloomy on Monday here", which, as you can see, is a negative sentiment. In short, you are receiving the events from your input source, doing the inference, and writing your events to a log. Let's see what that piece of code actually looks like for the inference step. Let me go to the slide here.

Sri Harsha Yayi [00:11:06]: Take a look at this particular piece of code on the left side. This is a Python example, but we offer multiple SDKs, from Java to Go; even Rust is available. All you need to do is implement this method where you receive a bunch of messages. You don't know where the source is; all you can see is: I'm receiving data, I'm going to call the sentiment analysis model here, and then forward the message to the next step. It's as simple as that. You don't need to worry about whether you're connecting to Kafka or an HTTP source or Pulsar or SQS; all of that is already taken care of by the platform.
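
Since the slides aren't visible in the transcript, here is a minimal sketch of what such a map UDF can look like with the Python SDK. It is loosely modeled on the pynumaflow mapper interface; module and class names vary across SDK versions, so treat the exact imports and signatures as assumptions:

```python
# Hedged sketch of a Numaflow map UDF, loosely modeled on the
# pynumaflow mapper SDK; exact module/class names vary by SDK version.
from pynumaflow.mapper import Datum, Message, Messages, MapServer
from transformers import pipeline

# Load the model once at container startup, not per event.
classifier = pipeline("sentiment-analysis")

def handler(keys: list[str], datum: Datum) -> Messages:
    # datum.value is the raw event payload; the platform hides whether it
    # came from Kafka, HTTP, Pulsar, SQS, or any other source.
    text = datum.value.decode("utf-8")
    label = classifier(text)[0]["label"]  # e.g. "POSITIVE" / "NEGATIVE"
    messages = Messages()
    # Forward the enriched event to the next vertex in the pipeline.
    messages.append(Message(f"{text} => {label}".encode("utf-8"), keys=keys))
    return messages

if __name__ == "__main__":
    # The SDK runs a gRPC server that the vertex talks to.
    MapServer(handler).start()
```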

Sri Harsha Yayi [00:11:47]: As an engineer, you just focus on writing your processing logic. It could be a map function, it could be a reduce, it could be anything you want to do within this small piece of code. Now, what does the pipeline spec look like? We saw that you have an input, an inference step, and then a sink. Here is how you define a pipeline within Numaflow. It's a simple YAML file where you have your vertices: a source (I'm just using an HTTP source for this example); then the sentiment inference, which is just a container with that Hugging Face model, nothing but the piece of code we just saw packaged as a container; and then the third one, a log sink, where I'm sending the results to a log so I can quickly visualize them. So in short, we have alleviated all of the challenges of connecting to multiple sources and sinks, and as an ML team you can just focus on writing your processing logic.
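
As a hedged illustration of what that YAML might look like (the overall shape follows Numaflow's Pipeline custom resource; the container image name is hypothetical):

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: Pipeline
metadata:
  name: sentiment-pipeline
spec:
  vertices:
    - name: in
      source:
        http: {}               # out-of-the-box HTTP source
    - name: inference
      udf:
        container:
          image: example.registry/sentiment-udf:v1  # hypothetical image
    - name: out
      sink:
        log: {}                # log sink, handy for demos
  edges:
    - from: in
      to: inference
    - from: inference
      to: out
```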

Sri Harsha Yayi [00:12:46]: If you're a data engineer, just focus on writing your MapReduce or whatever real-time aggregations you want to do. Or if you are simply trying to serve your models as an ML engineer, you can just package your container and serve it. We take care of the data movement from left to right, and you focus on what is most important to you. Now let's take a few more examples of how we are using it at Intuit. I'll pick another example, a text summarization one, where I'm sending a bunch of events to multiple vertices, if you want to do some A/B testing of models at the same time and compare and contrast the results. These are three different models: one is a BART model, one a Pegasus model, and I think the other is one I just picked up from open source; all of them are from Hugging Face. And then I'm writing the results to my log sink. In short, you can create DAGs and graphs as complex as you want using the pipeline spec itself. Let's take an even more complex pipeline.

Sri Harsha Yayi [00:13:44]: This is a real-time production pipeline that we are currently running at Intuit, which does anomaly detection. You can see here that we are receiving close to 700-odd events per second, and we're doing a bit of preprocessing, then inference, then a bit of post-processing, and then sending the data to different sinks. You can see how complex a workflow you can build with the platform. We are also using the same pipeline to trigger some of our training workflows: we have a trainer vertex which triggers training workflows in the backend, so these are event-driven training workflows. We are using one pipeline to do both real-time anomaly detection and trigger the training workflows. But one thing I want to touch upon: so far I have not mentioned autoscaling. We said one of the biggest challenges is how to autoscale the serving part, or even the processing workloads. If you take a look here, each of these is a container, and you can see that four pods are provisioned for the input source, which in this context is a Kafka one.

Sri Harsha Yayi [00:15:00]: And then you can see two pods for preprocessing, and three pods for another step. For each of those steps we understand what your current processing rate is and how many events are arriving from upstream, and then we autoscale each of them individually so that there is no backlog of events and we are able to serve your requests in real time. We do a lot of what we call queued-up analysis, where we look from left to right and right to left to understand how much your downstream is able to process, and then accordingly scale your upstream steps or your downstream components. So in short, you can create complex DAGs the way you wish and do any kind of processing in each of these steps. Another thing I want to touch on: this is a very polyglot pipeline. Inference is actually written in Python, some of the preprocessing steps are written in Java, and some are written in Go, because we have a team with both software developers and ML folks, and each one prefers to use their own language.

Sri Harsha Yayi [00:16:08]: So it gives you the flexibility to come together as a team while each of you does your own processing in whatever language you like. So this is a real-time anomaly detection pipeline. We also have high-throughput pipelines processing close to 15,000 TPS and more; we can scale further, and the platform takes care of autoscaling and all of those pieces. As I said, you can create complex pipelines: if you want to read from multiple sources in one pipeline, you can do that, and you can have conditional forwarding and joins if you want to do that too.
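
As a hedged sketch of what fan-out (for A/B testing) and conditional forwarding can look like in the spec: edge conditions based on message tags are a documented Numaflow feature, but the vertex names and tag values below are illustrative.

```yaml
# Illustrative edges only; vertex names and tag values are made up.
edges:
  # Fan-out: one source feeding two models for A/B comparison.
  - from: in
    to: summarize-a
  - from: in
    to: summarize-b
  # Conditional forwarding: the UDF tags each message, and only
  # messages tagged "retry" are routed to the reprocess vertex.
  - from: summarize-a
    to: reprocess
    conditions:
      tags:
        values:
          - retry
```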

Sri Harsha Yayi [00:16:49]: So you can see that you have multiple UDFs; a UDF is nothing but a user-defined function, your processing logic. You can do joins, and multiple joins if you wish. You can do some kind of reprocessing for some of the events if you want. And finally, if you want to inject dynamic variables or input data that your UDF requires, you can do that too. So, to summarize what I've shared so far: first, it's a completely Kubernetes-native event processing platform with full stream processing semantics, in any language you like. It's so lightweight that it can run even on edge devices.

Sri Harsha Yayi [00:17:29]: I'll share some of those examples too, of how the community is actually using it. It's a completely language-agnostic framework, with SDKs available for Java, Python, Golang, and Rust. We have provided a lot of out-of-the-box sources and sinks, but if you want to implement your own source (we have seen people in the community who wanted custom sources for ZMQ, RabbitMQ, and whatnot), they were able to do it, because we provide the SDKs and the framework to create your own custom sources and sinks and write the functions the way you want. And the third is that it's scalable and cost-efficient. You have seen that scalability, where we can scale from zero on up based on the throughput we are receiving and the processing rate your function can sustain. And it's cost-efficient; as I said, it has even been deployed entirely on edge devices. That is something we have observed in the community.

Sri Harsha Yayi [00:18:21]: So let me show you some of those examples. These are some of the companies using it for different kinds of use cases, not just ML, but also data processing and real-time event processing. At Intuit we use it for anomaly detection and a bunch of other event processing use cases. We know that Lockheed uses it for some kind of IoT data processing. B Cube uses it for radio frequency signal processing, and Seeker and Velega use it for ML and fraud detection use cases. In short, as a platform we support different use cases, primarily around real-time event processing, whether you're a data engineer, a software developer, or an ML engineer.

Sri Harsha Yayi [00:19:05]: If you want to serve your models or do data processing, you can do it, because as a platform we provide SDKs in your favorite languages, so you don't have to deploy complex data engineering or real-time event processing platforms and climb the huge learning curve of all those concepts. I think that is where we really shine, and that is what really resonates with the community. If you want to be part of the community, you can quickly scan this QR code to find the GitHub repository, and do star us if you love the project so you get updates. If you want to contribute or use it, feel free to drop me a note; more than happy to help you out. In short, that's Numaflow, the real-time event processing platform that we developed and open sourced at Intuit. With that, let me stop sharing. Any questions?

Adam Becker [00:20:05]: Sri Harsha, thank you very much. This is fascinating. It looks beautiful too. It looks quite mature already. So very nice. How long has this been in production, and how long have people outside of Intuit been using it?

Sri Harsha Yayi [00:20:21]: I'd say we did a 1.0 at the last KubeCon. It's been more than a year that we have been using it in production at scale. And there are many more companies; I've only shared some of the examples. But if you talk to the users in the community, the variety of use cases they have is mind-boggling, to be honest. They use it for audio processing, radio frequency signals, use cases we never thought of, but it's really resonating with the community and that's what they use it for.

Adam Becker [00:20:52]: Very cool. We got a couple of questions here. Ankar is asking: how does Numaflow ensure scalability and performance in real-time sentiment analysis pipelines?

Sri Harsha Yayi [00:21:02]: The way we do autoscaling is based on a couple of things. One, we understand how many events are pending from your upstream, basically Kafka or whatever the source may be; we understand those metrics. And two, we try to understand at what rate you can process: are you able to process three events per second, or four? Based on looking at your upstream, your downstream, and your current processing rate, we autoscale the individual components so that there is no lag. Sometimes there are cases where a sink is completely down, say a database is down and you cannot write the events, so we scale the entire pipeline back to zero if need be, or you can write to a fallback sink if you provision for that. We provide all of that flexibility, where you can do any kind of processing and also package your models for inference on streaming data.
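
The heuristic described here can be captured in a few lines. To be clear, this is not Numaflow's actual autoscaling code, just a sketch of the idea: size each vertex from its pending backlog and per-pod processing rate.

```python
import math

def desired_replicas(pending_events: int, rate_per_pod: float,
                     drain_within_s: float, max_pods: int) -> int:
    """Sketch of backlog-based scaling, NOT Numaflow's actual algorithm:
    pick enough pods to drain the pending backlog within a target window."""
    if pending_events == 0:
        return 0  # nothing upstream: scale the vertex down to zero
    needed = pending_events / (rate_per_pod * drain_within_s)
    return max(1, min(max_pods, math.ceil(needed)))

# e.g. 700 pending events, 100 events/s per pod, drain within 2s -> 4 pods
print(desired_replicas(700, 100.0, 2.0, max_pods=10))
```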

Adam Becker [00:22:00]: We have another question here from Rohan. Does it support zero-downtime DAG updates?

Sri Harsha Yayi [00:22:03]: Yes, that is something we are currently working on, where we want to give you the flexibility to do no-downtime upgrades. Say you change your container: how do we make sure we are not stopping your processing altogether, and instead handle it dynamically? That should be available very soon; there's a PR out for it, and I think it should land shortly.

Adam Becker [00:22:28]: Sri Harsha, thank you very much. If folks have more questions, maybe you can join us in the chat.

