
Building a Python-Centric Feature Platform to Power Production AI Applications

Posted Feb 27, 2024 | Views 247
# AI Applications
# Python
# Tecton
SPEAKERS
Matt Bleifer
Group Product Manager @ Tecton

Matt Bleifer is a Group Product Manager and early employee at Tecton. He focuses on core product experiences such as building, testing, and productionizing feature pipelines at scale. Prior to joining Tecton, he was a Product Manager for Machine Learning at both Twitter and Workday, totaling nearly a decade of working on machine learning platforms. Matt has a Bachelor’s Degree in Computer Science from California Polytechnic State University, San Luis Obispo.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, was due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

SUMMARY

In this talk, Matt walks through Tecton's journey to build a platform that can reliably power large-scale real-time AI applications while requiring nothing more than Python.

TRANSCRIPT

Building a Python-Centric Feature Platform to Power Production AI Applications

AI in Production

Slides: https://docs.google.com/presentation/d/1arz9f94vuGoP8UapBRBSMWiPmectZaLIiXziKwOvhrg/edit?usp=drive_link

Adam Becker [00:00:00]: Next we have Matt from Tecton. Matt, are you around?

Matt Bleifer [00:00:05]: I'm here.

Adam Becker [00:00:07]: Okay. Matt, good to have you here. Thank you.

Matt Bleifer [00:00:11]: Good to be here.

Adam Becker [00:00:13]: Very few people picked you for... what was the question? Do you remember that question?

Matt Bleifer [00:00:17]: I think they were wondering if perhaps I was from South Africa, which is kind of close to California where I grew up, but not quite. So close, but no cigar for the people that voted for it. But I thought that it wasn't too many.

Adam Becker [00:00:30]: So people are surfing in California at all? No? Where in California?

Matt Bleifer [00:00:35]: A little bit. I grew up just north of LA, so Santa Monica was the kind of closest hometown beach for me. Not a bad place.

Adam Becker [00:00:43]: Matt, you're from Tecton? I've been a big fan of Tecton since, I think, day one. You have 30 minutes. We would love to learn everything about how to build a full-on feature platform using Python alone, which is the title of your talk. Do you have your screen that you'd like to share? Excellent. Okay, Matt, I will come back in about 30 minutes. Take it away.

Matt Bleifer [00:01:10]: All right. Can you see the slides? All right.

Adam Becker [00:01:15]: Good.

Matt Bleifer [00:01:16]: Cool.

Adam Becker [00:01:18]: Awesome.

Matt Bleifer [00:01:19]: Cool. Hey, everybody. As you now know, my name is Matt. I'm a Group Product Manager at Tecton, and today I'm going to run through with you how we went about building a Python-centric platform for powering production AI applications. First, a little bit about me. I've spent just about a decade working across companies like Workday, Twitter, and Tecton to put AI systems into production. So I've seen it at large enterprises like Workday, at super sophisticated companies like Twitter, now formerly known as Twitter, and now at Tecton, where I've had the opportunity to help dozens of customers like Roblox or Plaid or Atlassian put AI into production. And after so much time working in this space, there's really one key takeaway that's not surprising to anyone here, and that is that production AI is hard.

Matt Bleifer [00:02:19]: There are just a suite of requirements that have never before been tackled that make this a very difficult problem. And in fact, I'd venture to say it's not just hard, it's really hard. And I spent some time trying to reflect and figure out: what is it, at the core of this, that makes it such a difficult problem? Why is it so hard to put these systems into production? If I had to reduce it down to one thing, I would say it comes down to the fact that data science and software engineering are fundamentally just two completely different worlds. And data science and software engineering is really the intersection where AI meets production, right? We're not talking about models that are running offline for analytics. We're talking about doing ML as part of a production application. And that means merging these two worlds of data science and software engineering. On one hand, we have data science, which is all about rapid iteration and experimentation. Users are used to working inside of notebook-centric environments where they're shift-entering all throughout any cell they can get their hands on, and their state's getting all crazy.

Matt Bleifer [00:03:30]: But that doesn't matter, because they're free and they're iterating and they're doing their data science. They pip install just about anything that they need in order to be able to get the job done. Everything's centered around this Python ecosystem, and ultimately what we're trying to do is optimize for model quality, right? We want to find the best possible features and build the best possible model so we can get the most predictive results. Now, on the other hand, we have software engineering, which is a totally different animal. Instead of iterating rapidly, we needed to figure out how to iterate reliably. So we developed systems like Git with version control, or CI/CD, and DevOps best practices like code reviews, to make sure that every single step we take is reversible and monitorable, and we can blame specifically who committed a particular line of code, to make sure that ultimately we're not taking down a production application. Second, we have to meet production SLAs, right? So we have to think about things like scale, we have to think about things like latency, reliability, et cetera.

Matt Bleifer [00:04:41]: Because ultimately, in software engineering, what we're trying to do is optimize for user experience, right? We don't want to leave our users waiting. We don't want them to try to sign into their favorite application and not be able to use it. And that's why we have these different requirements in the world of software engineering. And so we have this really big divide between these two completely different worlds that need to figure out how to come together in order to build a production AI application. And in recent years, feature platforms have emerged as one of the main tools that helps bridge this divide and marry these two worlds together in a way that lets them be really productive and get AI applications into production as fast as possible. What's a feature platform, you might ask? If I were to break it down as simply as possible, I'd say it's a system that connects to raw data and helps you transform that raw data into features that can then be fed into models, both for model training and online model inference. Which gets to the key requirement of a feature platform: it helps bridge the online and offline environments. The online environment is where your production software application runs. Imagine users swiping credit cards to make transactions or logging into their favorite game to see recommendations. All of that is in the online, real-time environment.

Matt Bleifer [00:06:10]: And this contrasts with the offline environment, which is where we're doing our data science and our ML engineering. This is where we're iterating, we're training models, et cetera, that are ultimately going to power that online production application. And so this is where feature platforms fit in, because they help bridge those two environments all the way from your raw data to your models. But what does a feature platform actually do? Well, there are three key things that every single feature platform does. First, it provides a way for users to define features. They need some way of taking an idea that they have in their head, such as, hey, I think that a good predictor of fraud would be the user's average transaction amount in the last seven days, and then some way of expressing that feature. They need to be able to point to the raw data sources that they have and connect to them. They need to be able to express the transformation logic, and then they need to be able to define where they're going to need those features.

Matt Bleifer [00:07:12]: The second thing that every feature platform has to do is provide a way to actually compute features, right? We might have a feature definition, but what's the thing that's actually going to go ahead and orchestrate these transformation pipelines, run these compute jobs in different contexts, whether it's batch, stream, or real-time transformations, manage backfilling, et cetera? We need something that's actually going to take our feature definitions and turn our raw data into the features for our models. And then the last thing that every feature platform has to do is provide a way to retrieve features, again across those two environments. So we need to be able to retrieve features online at low latency, so we can feed them into our model inference, which then powers behaviors in our production application. And second, we need to be able to retrieve features offline with historical accuracy, so that when we train our model, we can understand, hey, what did the state of the world look like at the time of any given event, so that we can predict it the next time we see it in the future. So all three of these things have to be accomplished by a feature platform in order for it to work. But the question is, how do we go about actually doing this? What does our feature platform look like in a way that bridges, again, these two completely different worlds with two very different requirements? What's the approach going to be? So I'm going to talk a little bit about how we've approached this at Tecton over the last four or five years, and some of the lessons we learned along the way and the solutions we arrived at that help bridge these two worlds while delivering those three key jobs to be done that every feature platform has. So first we started out and we said, hey, we need a way for users to be able to define features, but we don't want just that kind of wild west.

Matt Bleifer [00:09:04]: Everything's inside of a notebook, and then we're delivering all of that to production. Because that's not reliable, right? It doesn't give us those software engineering best practices that we wanted. And so we said, hey, what if we can actually define features as code? So what we did is we gave users a declarative API where they can express their feature logic, so they can say, hey, I have a batch feature, I'm computing a user's average transaction amount, here are some transformations I need to run, or aggregations, et cetera. And they actually store that all as code inside of a repo that's backed by Git. And then when they want to make changes or test out these features or apply these features to production, they go through a Terraform-like apply step where they go into a CLI and run tecton apply. Tecton compares what they have defined in code to what's registered in production and spits out a plan that says, hey, here are the differences, these are the changes you're going to ship, do you want to make them? They say yes. And typically what our users do is integrate this whole flow into their CI/CD pipeline.
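
To make that concrete, here is a minimal sketch of what such a declarative feature definition can look like. It is loosely modeled on Tecton's documented decorator style, but the exact imports and parameters vary by SDK version, and the `transactions` source and `user` entity are illustrative assumptions, not part of the talk:

```python
from datetime import datetime, timedelta
from tecton import batch_feature_view, Aggregation  # names as in older Tecton SDKs; check your version

# `transactions` (a batch data source) and `user` (an entity) are assumed to be
# defined elsewhere in the feature repo -- both are illustrative placeholders.
@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="pandas",  # pure-Python/pandas transformation, no Spark required
    online=True,
    offline=True,
    feature_start_time=datetime(2024, 1, 1),
    aggregation_interval=timedelta(days=1),
    aggregations=[
        # "The user's average transaction amount in the last seven days."
        Aggregation(column="amount", function="mean", time_window=timedelta(days=7)),
    ],
)
def user_transaction_averages(transactions):
    # Return the raw columns; the platform maintains the rolling aggregate.
    return transactions[["user_id", "timestamp", "amount"]]
```

Running tecton apply against a repo containing a definition like this is what produces the plan-and-apply flow described above.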

Matt Bleifer [00:10:13]: And so what that looks like is: I make some changes to a feature, I commit them to Git and push my changes, somebody does a code review, that merges into the main branch, and that kicks off the CI/CD pipeline to ultimately ship the changes to production. And so this was great, because it allowed our users to finally take all of the data science work that they were doing and iterate reliably. We introduced Git and version control for features, we introduced CI/CD and DevOps best practices like code reviews. And for the software engineering folks on an ML engineering team responsible for putting something into production, this is really great, right? It makes them happy. They finally have software engineering best practices in ML. But it didn't make the data scientists happy. The data scientist is not used to using Git, not used to iterating through declarative workflows, and no one wants to have to push changes to a Git repo just to be able to test out some feature transformations that they wrote.

Matt Bleifer [00:11:18]: And so we had to go back to the drawing board and say, okay, how can we iterate on this UX in order to provide something that would make these data scientists happy and successful and able to iterate rapidly? So we said, hey, what if you can actually experiment with new features directly inside of a notebook? What if we made our declarative API work so that if you fire up a Jupyter notebook, you can take any of these definitions that we know how to declaratively interpret with our CLI, put them directly inside of your notebook, and start executing queries against them? So, hey, here's my feature definition. I want to see what the features are for a particular time range and understand what that output feature data would look like. I want to take that, I want to train a model, et cetera. And so you can iterate freely inside of a notebook, just like you're used to as a data scientist. And then when it comes time to actually productionize it, you can take those definitions, put them into your main feature code repository that's backed by Git, and then go through that same workflow to push changes to production. So instead of your iteration loop being part of that declarative GitOps workflow, we say, hey, iterate inside of your Jupyter notebook where you're happy, and only as a final step should you store and manage this all as code and push it to production.
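
As a rough illustration of that notebook loop, reusing the hypothetical feature view from the earlier sketch (the method names mirror Tecton's public SDK but should be treated as version-dependent assumptions):

```python
from datetime import datetime

# In a notebook, query the same declarative definition directly.
# `get_features_in_range` follows Tecton's SDK naming, but treat the exact
# method and arguments as version-dependent.
features = user_transaction_averages.get_features_in_range(
    start_time=datetime(2024, 1, 1),
    end_time=datetime(2024, 2, 1),
).to_pandas()

features.head()  # eyeball the output feature values before productionizing
```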

Matt Bleifer [00:12:40]: And that's the gateway that guarantees you still always have those DevOps best practices for anything that's going live. So we solved that requirement. Data scientists were able to iterate rapidly inside of their notebooks, turning these folks back into happy campers. Next up, we needed to figure out how we let people compute features. And again, we need something that's going to be reliable and meet our production SLAs. We need to make sure that we use something that's tried and true, good at transforming features, works across batch and streaming, and that we can count on to actually produce the feature data that we need. And so we said, all right, we'll go with Spark. And so across our system, whether it was a batch pipeline, a streaming pipeline, or a training data generation job, we said: you get Spark, and you get Spark, and you get Spark, across the board. And then we said, hey, we'll worry about actually orchestrating all of these Spark jobs for you.

Matt Bleifer [00:13:44]: So you don't have to worry about setting up an orchestrator, you don't have to worry about handling backfills, you don't have to worry about job retry logic. The system's just going to handle that for you. As soon as you say, hey, this is my feature definition, I need it recomputed every day, or I need it computed in streaming, and then you hit yes at that tecton apply step, that's when Tecton says, great, we'll go spin up all of these Spark clusters, we'll manage these jobs for you, we'll do the backfills, et cetera. And so this was good, because we had a way to reliably transform our features and get those into production, again making our software engineers quite happy. But the data scientists were less than happy, right? Because all of a sudden, for a bunch of people who weren't familiar with Spark, we had introduced Spark directly into their workflow. So this beautiful world of Python that they were used to now meant, hey, you actually need to think about Spark jobs and debug Spark pipelines.

Matt Bleifer [00:14:41]: If anyone here has worked with Spark, you know that there's a million different knobs and dials that you might need to tune in order to get something right. You're stuck iterating inside of maybe an EMR notebook or a Databricks notebook, so suddenly you're a lot more constrained in terms of your development environment. Again, we went back to the drawing board.

Matt Bleifer [00:15:22]: What if you could just compute features with pure Python? That'd be great. Yes, it would. But Python doesn't scale, it doesn't do streams, and pandas is not built for data warehouses. And so we're like, how in the world are we going to be able to make this happen and fundamentally meet the requirement if we just say, Python is how you build these features? And so we built a system that we call Rift, which is Tecton's compute engine for running batch, stream, and real-time pipelines. It allows for Python transformations at any point. So whether it's a batch feature or a real-time feature, you can build it entirely in

Matt Bleifer [00:16:21]: Python or pandas or whatever you like, just to be able to transform features. You can also iterate inside of any Python environment; no need for a notebook with access to a Spark context. Fire up Jupyter, fire up whatever you want for your notebook environment. All you need is Python and you can build features. The second cool thing about Rift is that we directly integrated it with data warehouses. We have a really strong data warehouse community, maybe centered around things like Snowflake or BigQuery, that says, hey, I want to be able to build features using something like Snowflake SQL, I want to use my warehouse and take advantage of that to compute those features, and I also want those features to be made available offline inside of my warehouse so that I can do analytics on them after the fact. And so with Rift, we made it so that all of our batch features can actually be expressed using Snowflake SQL or BigQuery SQL.
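
A rough sketch of what such a warehouse-native definition might look like; the mode string and the query-templating style follow the pattern in Tecton's docs, but treat them, along with the same placeholder `transactions` source and `user` entity, as assumptions:

```python
from datetime import datetime, timedelta
from tecton import batch_feature_view

# Same illustrative `transactions` source and `user` entity as before.
@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",  # compute is pushed down to the warehouse
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2024, 1, 1),
)
def user_transaction_amounts(transactions):
    # The source name is templated into SQL that runs inside Snowflake.
    return f"""
        SELECT user_id, timestamp, amount AS transaction_amount
        FROM {transactions}
        """
```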

Matt Bleifer [00:17:19]: And then when we run those transformation pipelines, we will actually push all of the relevant compute up to the warehouse and do all the work that we can there before moving on. And then lastly, Rift scales to quickly process millions of events. So whether you're executing a backfill, running a training data job, or running a streaming pipeline that needs to aggregate across millions of events while keeping features fresh up to less than a second ago, Rift will totally have you covered. Under the hood, in order to pull this off, we use a lot of different cool technologies: we use DuckDB, we use Ray, we use Arrow, et cetera. And then also central to all of this is Tecton's aggregation framework, where we take a new approach to computing and running aggregations that can be super fresh across long time horizons. You can read more about that in our docs and on our blog. Cool.
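
The usual trick behind aggregation frameworks like this is to pre-compute partial aggregates over small time tiles and combine them at read time, so a long window stays fresh without rescanning raw events. Here is a toy sketch of that idea in plain Python; it is purely conceptual and not Tecton's actual implementation:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Partial aggregates ("tiles") per user per hour, updated as events arrive.
tiles = defaultdict(lambda: {"sum": 0.0, "count": 0})

def ingest(user_id: str, ts: datetime, amount: float) -> None:
    hour = ts.replace(minute=0, second=0, microsecond=0)
    tile = tiles[(user_id, hour)]
    tile["sum"] += amount
    tile["count"] += 1

def mean_last_7_days(user_id: str, now: datetime) -> float:
    # Combine only the tiles inside the window: reads stay cheap and fresh
    # without ever rescanning the raw event stream.
    start = now - timedelta(days=7)
    total, n = 0.0, 0
    for (uid, hour), tile in tiles.items():
        if uid == user_id and start <= hour <= now:
            total += tile["sum"]
            n += tile["count"]
    return total / n if n else 0.0

ingest("user_123", datetime(2024, 2, 1, 10, 30), 25.0)
ingest("user_123", datetime(2024, 2, 3, 9, 0), 75.0)
print(mean_last_7_days("user_123", datetime(2024, 2, 5)))  # -> 50.0
```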

Matt Bleifer [00:18:27]: So now we've got this part covered too, right? Data scientists are back in the world of Python, where they're quite happy and they can pip install whatever it is that they need. Now, the very last thing I talked about that every feature platform has to do is provide some way for users to retrieve features, again in those two different contexts. And so what Tecton does is offer a concept called a feature service, which defines the set of features that you need for a given model. When you need features online at low latency, you can hit our HTTP API in real time, get back a feature vector, feed that feature vector into your model inference, and then take actions inside of your production application. And when you need features offline with historical accuracy, you can use the Python SDK to generate a training data set with time travel, so that you can say, hey, for any event in the past, what was the value of this feature? And again, this Python SDK just runs on top of Rift, which means you can run all of this inside of any Python environment to generate training data. And so this means that we can (a) optimize for model quality and (b) optimize for the end user experience: we can deliver predictions in real time at low latency without making users wait. And this leads to a very happy and productive ML engineering team across the board.
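
A rough sketch of the two retrieval paths. The endpoint shape, header, and payload here approximate Tecton's documented HTTP API, but the cluster, workspace, service, and method names are illustrative assumptions:

```python
import requests

# Online path: low-latency feature vector for real-time model inference.
# URL shape, header, and payload follow Tecton's documented HTTP API, but the
# cluster, workspace, and service names below are made-up placeholders.
resp = requests.post(
    "https://yourcluster.tecton.ai/api/v1/feature-service/get-features",
    headers={"Authorization": "Tecton-key YOUR_API_KEY"},
    json={
        "params": {
            "workspace_name": "prod",
            "feature_service_name": "fraud_detection_service",
            "join_key_map": {"user_id": "user_123"},
        }
    },
)
feature_vector = resp.json()

# Offline path: point-in-time-correct training data via the Python SDK, running
# on Rift in any Python environment. `events` would be a DataFrame of historical
# events (entity keys + timestamps); the method name is a sketch, not gospel.
# training_df = feature_service.get_features_for_events(events).to_pandas()
```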

Matt Bleifer [00:20:10]: Cool.

Matt Bleifer [00:20:11]: So I'll leave you with some parting thoughts that I think are top of mind for me now and top of mind for a lot of people. As I said at the beginning, production AI is hard. We talked about some of the challenges here and how we went about solving them. But what's cool, and what's become very apparent in the last year, is that production AI is also evolving. If we look at a typical application, like our predictive AI applications, fraud detection, recommendations, et cetera, and we think about what these look like today: we take data, we turn that data into structured features, and we put those features in a key-value store to be able to look them up online. We run some sort of XGBoost model, some decision tree, to make predictions for a given user, and then we use those predictions to adjust our application behavior. And now we have this whole new suite of applications that are coming online, these generative AI applications.

Matt Bleifer [00:21:14]: And the more I thought about this, something really interesting stood out to me, especially if you look at a typical RAG-style application. You have data. You turn that data into embeddings. You take those embeddings and store them inside of some sort of vector database. You then retrieve relevant documents and feed them as context into LLMs. And then lastly, you use the output of that to change your application behavior. And so, just like in our predictive AI applications, all of the data pipelines and all of the workflows across the board here are quite similar, which leads me to believe that we will also see feature platforms emerge as the common bridge across all of these types of applications. Food for thought and more to discuss. Thanks for your time, and happy to take questions now.

Matt Bleifer [00:22:10]: Thank you.

Adam Becker [00:22:19]: All right, Matt, that was awesome. Thank you very much. Let's see if we have any questions from the audience, and if not yet, then I'll direct you to join the chat. Let's see, we're getting nothing yet. I have been very curious for a long time about how Tecton and other companies and tools that are doing feature stores are going to react to this generative AI paradigm. And I have a feeling that you guys have probably been thinking about this long and hard as well. But it's very good to see this sort of coming together. And Rift, too, I think it would be excellent to try to get my hands on.

Adam Becker [00:23:03]: So thank you very much for building this. I think you guys are detecting the exact sort of problem many data scientists have, and very often what they see is Spark first. When they do do Spark, it's with a lot of antagonism. And oftentimes I've had to spend what feels like years debugging Spark jobs and pipelines.

Matt Bleifer [00:23:26]: Anything to avoid that. I'm with you, I've spent years of my life on that one, and I'm happy to have a good alternative now. And yeah, stay tuned. We're doing a lot of stuff to make it easier and easier for people to get their hands on Tecton and try this out directly, especially because it no longer requires anything more than Python to get started. And I think that's really exciting. And then, yeah, to your point, it was interesting when all the gen AI stuff started coming up, everyone was like, what is Tecton going to do? I think everyone has to answer this question, and instinctually you're like, oh, maybe we'll use natural language to define features and adjust our system, maybe we'll have a docs chatbot that helps you out. But what became more and more interesting is what I was talking about there, especially as I started to understand more about these retrieval-augmented generation pipelines. I looked at it and squinted and was like, wait, that's just a data application.

Matt Bleifer [00:24:16]: That is a standard ML application. It's all the same steps, it's all the same workflows. And I think that a lot of the same challenges and a lot of the same kind of bridging of the worlds of offline experimentation and online production are really going to be the same across it. And so I think that it will be prudent to kind of have a unifying system for all of these rather than reinventing the wheel.

Adam Becker [00:24:38]: Yeah. I also see, and help me clarify this in case I'm misperceiving, but I think we used to speak more about feature stores and less about feature platforms, and it feels like feature platform has become a bit more of a recent addition to the discourse. Can you say a little bit more about that?

Matt Bleifer [00:24:56]: Yeah. So early on, we branded this whole thing as a feature store. But then as we started working more and more with customers and trying to solve any of the related problems to get them into production, there were all sorts of things that didn't have to do with storing features necessarily, right? It was orchestration, or complex aggregation, streaming pipelines, version control, et cetera. And we were like, really, this feature store name is becoming increasingly less relevant for describing what it is that we're actually solving for. And so I think feature platform emerged as a term to describe the system that takes you all the way from raw data to models, which is really a data-flow-centric way of looking at it. It's going to be the same way of looking at these gen AI applications: you have raw data, you have a bunch of complex pipelines that need to run in different contexts, and you need all this stuff to set them up and manage them. And so I think that the visual of a platform makes more sense, and we have now grown to refer to the feature store as a very central component inside of a feature platform, the thing that handles online and offline serving and time travel of historical features, et cetera.

Matt Bleifer [00:26:11]: But more and more, I think we take that data-pipeline-centric view of what we're actually solving for.

Adam Becker [00:26:19]: Matt, thank you very much. You're still based in California, or did you just grow up there?

Matt Bleifer [00:26:24]: I grew up in California. Most recently, actually, I headed over to the east coast, spent some time in New York, and I'm moving around a little bit, figuring out where my settling place is going to be. So hit me up in chat if you want to promote good ideas.

Adam Becker [00:26:38]: I will just pitch to you: New York City has been very nice, and Tecton has a good presence there. And I'm one of the organizers for the MLOps Community in New York, and we do a lot of events, and we're doing a workshop with Tecton at some point in the next couple of months.

Matt Bleifer [00:26:56]: So I think maybe I'll see you there.

Adam Becker [00:26:59]: Maybe I'll see you there. Matt, thank you very much.

