
Chronon: Airbnb's Open-Source Data Platform for AI & ML applications // Nikhil Simha // DE4AI

Posted Sep 18, 2024 | Views 504
Nikhil Simha
CTO @ Zipline AI

Nikhil is CTO and co-founder of zipline.ai. Before that, he was a Senior Staff Engineer on the Machine Learning infrastructure team at Airbnb, where he built and open-sourced chronon.ai. Before that, he built a stream processing scheduler called Turbine and a stream processing engine called Stylus that powers real-time data use cases at Meta. Nikhil got his Bachelor's degree in Computer Science from the Indian Institute of Technology, Bombay.

SUMMARY

This talk introduces the open source Chronon project, authored and maintained by Airbnb and Stripe. It covers the technical problems that Chronon solves, and how it can be used by organizations to accelerate their AI/ML efforts.

TRANSCRIPT

Skylar [00:00:04]: All right, all right, all right. We're about to get started. Sorry for the technical difficulties, so you can't see my face, but it's not as pretty as Nikhil's, so we'll have to deal with just his.

Nikhil Simha [00:00:20]: Here we go.

Skylar [00:00:21]: Awesome. Welcome, Nikhil. Let me kill the music here and let you get started.

Nikhil Simha [00:00:27]: Hey, thank you. All right. It's an absolute pleasure to be here to talk to you about Chronon. Chronon is a data platform we built for machine learning and AI workloads at Airbnb and Stripe. A little bit of background about me: until two months ago, I was supporting a few teams in ML infra at Airbnb, namely the feature platform, embedding platform, RAG, and ML observability teams. Before that, I was the second engineer on the stream processing team at Meta. And before that, I was essentially an ML engineer.

Nikhil Simha [00:01:09]: Before that title existed, I was embedded with research scientists who were improving the content quality of item pages at Amazon and Walmart. And most recently, I co-founded a company called Zipline, which makes it easy for people to generate production-grade ML systems. So Chronon is used very widely at Airbnb for many kinds of use cases: content ranking and search ranking, account fraud and payments fraud, all kinds of customer support use cases, and marketing technology. And more interestingly, also some non-ML use cases like rule engines or metrics. So the impact of Chronon has been pretty far and wide. At Airbnb, we went from using 3,000 features to close to 30,000 features in about three years. And as a result, not only did the number of models that were built increase, but the number of features used in each of these models also increased.

Nikhil Simha [00:02:27]: And as a result, the ML systems are faster to build and, out of the box, more scalable and performant. And more importantly, practitioners are independent, more or less. A data scientist typically had to work with a team of systems engineers to bring an ML system online. And even though the prototyping phase took only a couple of weeks or a month, the system-building phase took many months or even a year with a large team of systems engineers. That's no longer the case. We also open-sourced Chronon a couple of months ago. It's battle-tested at Airbnb and Stripe. You can use the QR code here or go to the link to check out the product for yourself.

Nikhil Simha [00:03:15]: So I'm going to walk through how Chronon operates with a simple example. It's a chatbot, but you can imagine it being a regular, traditional ML system, too. Let's say we are building a chatbot for an e-commerce website, and a particular user's item didn't arrive, and the user wants to talk to the chatbot and understand what's going on. If you just pass the user's question into the LLM, the LLM usually cannot help with the issue. So what we need to do here is enrich the prompt with more context. One example is to get a sense of the percentage of issues with this merchant recently. So let's say we want to build that metric: the percentage of this merchant's orders with issues in the last week. The default way of doing this, the quick and dirty way, is just hitting the production databases.

Nikhil Simha [00:04:30]: So if there is an orders table and an issues table in the database, we just run some SQL on them, get a count of each, and divide them to get this percentage out. But this stops scaling really fast, because there are usually merchants with a large number of orders, and this range scan can stop scaling very fast. Everything becomes problematic if the range scan is problematic. The scalable way of doing this is very involved, but the idea behind it is pre-aggregation. Instead of reading raw order information on every request, we are going to read the counts directly. To build such a system, we need two kinds of pipelines: a batch pipeline that looks at all the historical orders and creates a count out of them, using something like Spark maybe, and a streaming system built with something like Flink.

Nikhil Simha [00:05:37]: The streaming system is looking at the most recent data and creating a new count for today, and all of that is stored together in a key-value store such as DynamoDB. Then there is a service that supplies all of this information to the ML service, or to the prompt. To orchestrate all of this, you need to use something like Airflow, and to monitor it, you need to use something like Grafana. So it's a simple metric, but as you can see, the amount of infrastructure you need to support it is pretty massive. Let's take another example, slightly different: finding similar issues to the current issue. This is a retrieval problem. To find similar issues, we need to look at all the issues of all users, find the most relevant ones, and then pass them on to the prompt. And this also follows a similar story.
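To make the contrast concrete, here is a minimal sketch of the two read paths, assuming hypothetical orders and issues tables and a hypothetical key-value client; the pre-aggregated counts are what the batch and streaming pipelines would keep up to date:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Quick and dirty: range-scan the raw tables on every request.
    # Cost grows with the merchant's order volume, so it stops scaling.
    def issue_rate_naive(merchant_id: str) -> float:
        last_week = F.date_sub(F.current_date(), 7)
        orders = spark.table("orders").where(
            (F.col("merchant_id") == merchant_id) & (F.col("ts") >= last_week)
        )
        issues = spark.table("issues").where(
            (F.col("merchant_id") == merchant_id) & (F.col("ts") >= last_week)
        )
        return issues.count() / max(orders.count(), 1)

    # Pre-aggregated: batch + streaming pipelines maintain per-merchant
    # counts in a key-value store (e.g. DynamoDB); serving is two point reads.
    def issue_rate_preaggregated(kv, merchant_id: str) -> float:
        # `kv.get` is a hypothetical key-value client method.
        order_count = int(kv.get(f"orders_7d/{merchant_id}") or 0)
        issue_count = int(kv.get(f"issues_7d/{merchant_id}") or 0)
        return issue_count / max(order_count, 1)

The serving path drops from a scan over raw rows to two constant-time lookups, which is why the pre-aggregation machinery is worth its complexity.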

Nikhil Simha [00:06:34]: But the biggest difference is that now we are generating embeddings. We need to call out to something like vLLM if we want to deploy our own model, or we can call out to a vendor such as OpenAI to generate embeddings. And finally, we need to store them in a vector store, and again, a service that pulls this information out and supplies it to the ML service, or the prompt, and again, Airflow and Grafana for orchestration and monitoring. And this is just two elements of the prompt; there can be more elements. You can think of similarly measuring item issue percentage, or issues with the delivery person, or you can think of enriching the context with relevant policy documents for that geographical area, for that market, or for that item category. Once you have all of these elements in the prompt, the infrastructure needed to support all of them becomes super difficult to build. It's a multiplication of infrastructure components, and finally, that results in frustration: a long time for us to build these systems.
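As a rough illustration of the retrieval element, here is a minimal sketch, assuming an OpenAI-style embeddings client and a hypothetical vector store client with upsert and query methods:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def embed(text: str) -> list[float]:
        # Any embedding endpoint works; the model name is illustrative.
        resp = client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return resp.data[0].embedding

    # Offline: embed historical issues and index them in the vector store.
    def index_issues(index, issues: list[dict]) -> None:
        for issue in issues:
            index.upsert(id=issue["id"], vector=embed(issue["text"]))

    # Online: fetch the most similar past issues to enrich the prompt.
    def similar_issues(index, issue_text: str, k: int = 5) -> list:
        return index.query(vector=embed(issue_text), top_k=k)

The point is the shape of the system rather than the specific vendors: there is an offline indexing pipeline and an online lookup path, and both need to be kept consistent, orchestrated, and monitored.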

Nikhil Simha [00:08:06]: So even though the Python notebook is easy to read and quick to build, when it comes to deploying this in production, it's just super painful for a single data scientist to manage. And even with a team of four or five systems engineers, it's just a huge system to hand-build. So how do these challenges materialize? One big symptom we have noticed is that people simply don't use enough data in their models or in their prompts, so the context is not as powerful and the model is not as good with its responses. The other symptom is, as I said, it's very easy to build a prototype, and you throw it over the wall to MLEs and software engineers, and it takes a long, long time for anyone to build the system. This is the main reason behind founding this company and building the Chronon project: we want the ML system to be generated, not hand-built. That's the main goal of both the open source project and the company. So what that means is, from a user point of view, they're defining their computation over their raw data.

Nikhil Simha [00:09:33]: We generate the infra, and users use that to train models, evaluate them, and iterate on them quickly. And finally, we expose production endpoints that serve applications. So from a user experience point of view, it looks like they have written some Python code and are able to get to production endpoints without any intermediate team building systems. All right, so that's my entire presentation. Any questions, feedback, or requests are all welcome. You can also reach out to me at this email address.
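For a flavor of what that Python definition looks like, here is a sketch based on Chronon's documented GroupBy API (exact module paths and field names may differ by version); it declares a seven-day order count per merchant, from which Chronon derives the batch backfill, the streaming update, and the serving lookup:

    from ai.chronon.api.ttypes import Source, EventSource
    from ai.chronon.query import Query, select
    from ai.chronon.group_by import (
        GroupBy, Aggregation, Operation, Window, TimeUnit
    )

    # Raw events: one row per order, with an event-time column.
    orders_source = Source(
        events=EventSource(
            table="data.orders",    # hypothetical Hive table of historical orders
            topic="events.orders",  # hypothetical stream for real-time updates
            query=Query(
                selects=select("merchant_id", "order_id"),
                time_column="ts",
            ),
        )
    )

    # One declaration covers batch, streaming, and serving.
    merchant_order_counts = GroupBy(
        sources=[orders_source],
        keys=["merchant_id"],
        online=True,
        aggregations=[
            Aggregation(
                input_column="order_id",
                operation=Operation.COUNT,
                windows=[Window(length=7, timeUnit=TimeUnit.DAYS)],
            ),
        ],
    )

A single definition like this replaces the hand-built Spark job, Flink job, key-value store writes, and serving endpoint from the earlier example.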

Skylar [00:10:19]: Awesome. Thank you so much. I didn't see any questions come through in the chat, but maybe just as a last question, since we have a few minutes: how would you compare or contrast Chronon with a comparator? What is the closest comparator, and what are the differences?

Nikhil Simha [00:10:43]: Yeah, so this is a broad area. Roughly, there are five categories, or five boxes, and there are competitors in each of these boxes. One is offline training data generation and offline model training. The second and third boxes are feature serving and model serving. And the final boxes are observability and orchestration. In each of these boxes there are many vendors. For example, for offline training data generation, people just use Databricks or Snowflake, or write Hive pipelines. For online feature serving, they use systems like Tecton or Fennel, et cetera. And for model training and serving, they use something like Vertex AI or SageMaker.

Nikhil Simha [00:11:38]: And for monitoring, there are vendors like Arize and Fiddler, I think. And for orchestration, there are many orchestration vendors. I'm just naming a few, but they all address different pieces of the puzzle. Our approach is slightly different. We think there are really good open source projects that solve each of these things really well. We want to instead make it easy for people to deploy these open source projects and use them to generate infra seamlessly. I don't know if that answers the question, but.

Skylar [00:12:17]: Yeah, yeah, no, that's great. Cool. We're at time, but really appreciate you coming by and sharing everything. Awesome.

Nikhil Simha [00:12:27]: Thank you for having me. Yes.
