MLOps Community

Relational Foundation Models: Unlocking the Next Frontier of Enterprise AI

Posted Nov 25, 2025
# Structured Data
# Relational Deep Learning
# Enterprise AI

SPEAKERS

Jure Leskovec
Professor and Chief Scientist @ Stanford University and Kumo.AI

Jure Leskovec is the co-founder of Kumo.AI, an enterprise AI company pioneering AI foundation models that can reason over structured business data. He is also a Professor of Computer Science at Stanford University and a leading researcher in artificial intelligence, best known for pioneering Graph Neural Networks and creating PyG, the most widely used graph learning toolkit. Previously, Jure served as Chief Scientist at Pinterest and as an investigator at the Chan Zuckerberg BioHub. His research has been widely adopted in industry and government, powering applications at companies such as Meta, Uber, YouTube, Amazon, and more. He has received top awards in AI and data science, including the ACM KDD Innovation Award.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Today's foundation models excel at text and images, but they miss the relationships that define how the world works. In every enterprise, value emerges from connections: customers to products, suppliers to shipments, molecules to targets. This talk introduces Relational Foundation Models (RFMs), a new class of models that reason over interactions, not just data points. Drawing on advances in graph neural networks and large-scale ML systems, I'll show how RFMs capture structure, enable richer reasoning, and deliver measurable business impact. Audiences will learn where relational modeling drives the biggest wins, how to build the data backbone for it, and how to operationalize these models responsibly and at scale.


TRANSCRIPT

Jure Leskovec: [00:00:00] People don't even realize, or we seem to have forgotten, how valuable structured data is. It's kind of the ground truth of the business; it captures the blueprint of the business, and we seem to have forgotten all that. And, you know, we are now excited about answering questions about documents.

Demetrios: Now, this is a bold claim, being able to create machine learning models faster and better. What is this? Break that down for me a little bit more.

Jure Leskovec: Um, yes, I can give you an explanation, right? If you think about how we are building, let's say, these predictive machine learning models today, right?

Jure Leskovec: These models would be, let's say, churn models, recommender systems, any kind of risk scoring models, fraud models, [00:01:00] across all kinds of industries, you know, from hospitals to social media to manufacturing to supply chain optimization and so on. The way we are doing this is that, fundamentally, these models use the internal

Jure Leskovec: data that is usually structured in tabular form, right? It sits in some database, in some data warehouse. It usually has this relational structure: you have multiple tables interlinked with primary-foreign key relations, right? So it would be a user table, a product catalog, and a record of all the orders of these users.

Jure Leskovec: Maybe the table that includes all the website behavior, so all the clicks that say this user clicked on this webpage at this time, and so on, right? And, yeah, all the features we want, exactly. So this would be the data. And the way we build these models today is that we would basically be joining these tables, generating features, and then training, you know, our favorite neural network or whatever it is on this data. [00:02:00]

Jure Leskovec: And I can go deeper into what I mean by all this, but basically it takes us, I don't know, several months to build these models. We need to engineer the features. Features need to be up to date; they must not be stale. There is this time-travel issue. We then put these models, let's say, in production to actually run more than just in our private notebook, and that's another can of worms, and so on.

Jure Leskovec: But if you look at what's kind of been the trend in AI, let's call it this way, right? The trend in AI has been: let's learn on the raw data. Okay, so the big revolution in computer vision was: let's not feature-engineer from the pixels in the image, let's train a neural network directly on the pixels.

Jure Leskovec: Okay, and we get these super great neural networks for computer vision. Total revolution. Now we are in the age of large language models, right? In the old days, when we did, let's say, natural language processing, natural language understanding, even if you think about when, you know, IBM won the Jeopardy competition.

Jure Leskovec: It took [00:03:00] 300 people, two years, and a bunch of very careful feature engineering so the computer was able to answer Jeopardy questions and be better than humans. So in some sense we had AI at that time, but it was just kind of hand engineered, right? Today, transformers just learn over a sequence of tokens; they learn over all the data.

Jure Leskovec: So the interesting thing is to ask: okay, it seems like the world of machine learning, let's say predictive modeling, has been stuck in the past. We don't learn on the raw data; we learn on this featurized data. So what I was very excited about was to say, okay, how could we change this?

Jure Leskovec: How can we develop methods that can learn directly on the raw data? And the hard part here is that your data in machine learning does not come from a single table; it comes from multiple tables, right? And it's usually very messy too, right? It's usually not exactly pristine data that gives [00:04:00] you a clear story.

Jure Leskovec: It's not pristine; it has missing information and all that, right? So what we have developed is a set of neural networks, and I can talk about them. They're basically based on a graph representation of the data that allows you to learn directly on the raw database data. So they allow you, for the first time, to learn over multiple tables at once.

Jure Leskovec: And now, because of that, you completely sidestep the feature engineering process, which means you save a humongous amount of time. And compute? Compute, maybe not so much. Actually, you need GPUs because these models are now a bit bigger, so you need to train on GPUs. But, you know, the computer works while you sleep.

Jure Leskovec: So it's fine. But there is another benefit, which is that because the neurons are now learning how to combine the data, the fidelity, the nuance of those learned signals is much, much [00:05:00] higher. So your models get more accurate.
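For readers who want a concrete picture: this is not Kumo's internal pipeline, but the general recipe behind relational deep learning can be sketched with PyG, the graph learning library Jure created. Rows become nodes, primary-foreign key links become edges, and the encoded table values become node features; the shapes and names below are made up for illustration.

```python
import torch
from torch_geometric.data import HeteroData

# Toy example: three tables (users, products, orders) become a heterogeneous graph.
data = HeteroData()

data["user"].x = torch.randn(100, 16)     # 100 user rows, 16 encoded columns each
data["product"].x = torch.randn(500, 32)  # 500 product rows, 32 encoded columns each

# Each order row links a user_id to a product_id; the foreign keys become an edge list.
order_user = torch.randint(0, 100, (1000,))
order_product = torch.randint(0, 500, (1000,))
data["user", "ordered", "product"].edge_index = torch.stack([order_user, order_product])

print(data)
# A GNN or graph transformer then trains directly on this graph, instead of on
# hand-engineered per-user aggregate features.
```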

Demetrios: Uh huh. So now I've heard folks talk about recommender systems with LLMs and just basically the inherent understanding that LLMs have.

Demetrios: You can suggest certain things to an LLM, and it's going to be able to have a better view of what a user will want. Is this kind of what you're talking about, or is this a whole separate thing?

Jure Leskovec: I would say it's a separate thing, right? Like, if you use LLMs for recommendations, it kind of

Jure Leskovec: works okay. It's not super terrible, but it's not good either, because it's all based on some kind of common sense. It's not really learned from the data for your specific users, for your specific understanding, and so on. And I can give you a simple story. For example, [00:06:00] I was Chief Scientist at Pinterest for six years, right?

Jure Leskovec: I took Pinterest from a hundred people to post-IPO, and I built a recommender system platform there, three generations of it. And what was, for example, a super hard problem at Pinterest is that as you see these images, maybe you have an image of a rug on the floor and you have an image of a tapestry on the wall.

Jure Leskovec: They practically look the same. The neural network could not tell the two apart, but humans could, and the behavior on those things was very, very different. So at Pinterest we very quickly realized it's not enough to treat a user as, let's say, a sequence of images they visit.

Jure Leskovec: Because the neural network would get confused. What you basically need to do is think of Pinterest as a set of interactions of users with content, and think of that as a graph. So if I interact with this image, and then other people interact with it as well,

Jure Leskovec: then, you know, [00:07:00] I kind of know what this image is about, right? This tapestry can say, oh, I'm interacted with together with other tapestries, so I'm a tapestry; I'm not a rug on the floor. And a rug on the floor says, oh, when people click on me, they also explore other things that very much look like rugs.

Jure Leskovec: So I know myself, right? I'm not a tapestry, I am a rug, right? It's guilt by association, in a sense. Exactly, exactly. And this is what these kinds of graph-based approaches allow you to do, because they allow you to learn from this structured relational information.

Jure Leskovec: And this is what sequence models will miss.

Demetrios: And if I'm understanding this correctly, with traditional machine learning models you are really optimizing for one or two things. With what you're proposing now and what you've built, does it have a more [00:08:00] holistic view?

Demetrios: Like if we have the two extremes where you have traditional ML models and then the other extreme is a large language model, where do you fall on that spectrum?

Jure Leskovec: That is a good question. So, the approach we developed is called Relational Deep Learning, and it can take any database, represented as a graph of relations between the entities in your database.

Jure Leskovec: And now we just learn over that graph of relationships. And you can now train a transformer-like architecture, called a Relational Graph Transformer, that does not attend over tokens like a typical transformer, but attends over the tables of the schema of the database.

Jure Leskovec: You can do both. You can train small task-specific models; you can think of them as classical machine learning models, but now really learned directly from the raw [00:09:00] database data. But you can also learn a large pre-trained model that is good at any task on any database.

Jure Leskovec: So, right, if you go to ChatGPT and ask it, hey, do time series forecasting for me, or you go and say, hey, I have this transaction, how likely is it to be fraud, you will get terrible results, right? How would ChatGPT know the likelihood of a particular transaction being fraud? It won't, right? But the large pre-trained Relational Graph Transformer, which we call a Relational Foundation Model, allows you to ask these types of questions without any model training.

Jure Leskovec: So maybe the point is: the same way as, let's say, in biology we accept that we need a DNA foundation model, we accept that we need a protein foundation model because protein is not natural language, I think the key thing we need to accept is that for structured data we need foundation models trained and built for structured data. [00:10:00]

Jure Leskovec: We cannot just take a database, throw it into an LLM, and hope it'll work, because it does not.

Demetrios: Okay, so I'm starting to understand a bit more of the picture of what you've built. Now can we get into the nitty-gritty of how this is actually built and what you're doing? Do I just throw in my database? And do I need a whole ton of data?

Demetrios: Like, what does this look like in practice, from how I interact with it to how I train a model that is now going to be one of these types of models?

Jure Leskovec: Yeah. So the way you do it is you first select the tables, the schema you wanna learn over. Usually, you know, it's five to 50 tables.

Jure Leskovec: Then you need to specify the primary-foreign key relations between these tables, right? That [00:11:00] the user ID in this table is the user ID in that other table. And you select the semantic types of the columns. So there is a bit of, let's say, data registration, data modeling even.

Jure Leskovec: You can call it a data preparation step, but it's very small, because you just need to say: this is my schema, here are the relationships. And after that is done, you have two options. One option is that you use the large pre-trained model, which you just point to this collection of tables. You prompt it with a specific predictive question, and half a second later you will get an answer that is, on average, as accurate as a manually built model that takes, I don't know, a month to build.

Jure Leskovec: What you can also do is say, okay, I am going to fine-tune my model for a specific task and a specific database, and that will get you to superhuman [00:12:00] performance, right? Because now you have a specialized model for a specific database, for a specific task.

Jure Leskovec: And of course, if you are, for example, doing fraud detection, which is a high-value task where you truly care about the last bit of performance because it's saving you so much money, then you wanna train a task-specific model. If you are more in a regime where you wanna do basically ad hoc predictive querying, you would use the large pre-trained model to do that.
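For intuition, the small "data registration" step described here (pick the tables, declare the primary-foreign key links, tag column types) could be captured in something like the following. The structure is what matters; the format and field names are hypothetical, not an actual API.

```python
# Hypothetical schema registration; field names are illustrative only.
schema = {
    "tables": {
        "users":    {"primary_key": "user_id",
                     "columns": {"signup_date": "timestamp", "country": "categorical"}},
        "products": {"primary_key": "product_id",
                     "columns": {"price": "numerical", "title": "text"}},
        "orders":   {"primary_key": "order_id", "time_column": "order_time",
                     "columns": {"amount": "numerical"}},
    },
    # Primary-foreign key relations: these links are what turn the tables into a graph.
    "links": [
        {"from": "orders.user_id", "to": "users.user_id"},
        {"from": "orders.product_id", "to": "products.product_id"},
    ],
}
```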

Jure Leskovec: And on

Demetrios: these tax specific models, do you see folks creating an ensemble of models just like you would maybe in the traditional fraud detection use cases?

Jure Leskovec: Not really. What we basically see is people training a single model that then learns how to attend [00:13:00] over all these, let's say, different data, from transactions to locations to time and all that, and combine that into the best possible predictive signal, right?

Jure Leskovec: That's the difference. If you think about, maybe, a fraud model: in a fraud model you say, okay, I have a customer, how many transactions did they do last week? You create a feature, and then, you know, some other data scientist wakes up and says, it's not the number of transactions last week, it's the number of transactions in the last, you know, 10 days.

Jure Leskovec: Great, let's add one more feature. Then somebody else wakes up and says, no, no, I know the answer: it's how much you spend in the morning, and I define morning as, I don't know, six to 9:00 AM. Let's add that counter there. And then somebody else wakes up and says, hey, but there is daylight saving time, we just changed the clocks in the US.

Jure Leskovec: Okay? So, you know, in the summer this is the morning, in the winter that's the morning. But you see my point: we are computing these arbitrary statistics over the data [00:14:00] to be able to say, okay, how much did this person purchase in the last time period?

Jure Leskovec: But if you have the attention mechanism, the attention mechanism can attend over each individual transaction. So it can learn to combine them in ways that humans never will. And that's why it's able to extract more signal from the raw data than, you know, a SQL query or a manually defined feature will.

Jure Leskovec: Yeah, because the neurons can do so much more, and they are basically trained to combine the data together, if that makes sense.
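To make the contrast concrete, this is the style of hand-built aggregate feature being described, sketched in pandas with made-up column names. Every window and cutoff below is a human choice; an attention-based model instead sees the raw transaction rows and learns its own aggregations.

```python
import pandas as pd

# Hypothetical transactions table.
txns = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "amount": [20.0, 35.0, 12.5, 80.0],
    "ts": pd.to_datetime(["2024-06-01 08:30", "2024-06-05 14:00",
                          "2024-06-09 07:10", "2024-06-08 21:00"]),
})
now = pd.Timestamp("2024-06-10")

# Hand-picked aggregate features of the kind described above.
last_7d = txns[txns["ts"] >= now - pd.Timedelta(days=7)]
last_10d = txns[txns["ts"] >= now - pd.Timedelta(days=10)]
morning = txns[txns["ts"].dt.hour.between(6, 8)]  # "morning" defined as roughly 6-9 AM

features = pd.DataFrame({
    "txn_count_7d": last_7d.groupby("user_id").size(),
    "txn_count_10d": last_10d.groupby("user_id").size(),
    "morning_spend": morning.groupby("user_id")["amount"].sum(),
}).fillna(0)

print(features)
```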

Demetrios: Yeah, it's basically like, uh, move 37 in Go. You know, in Go, what was it, it was move 37, right?

Jure Leskovec: I think it was something like that, exactly. Exactly. It's maybe something like this.

Jure Leskovec: It can be counterintuitive, it can be something that is very fine-grained. It could be some, you know, second-order correlation that [00:15:00] you as a data scientist, as a machine learning engineer, never put into a feature, and things like that. And because of that, you know, it's, I would say, the same intuition as in computer vision, right?

Jure Leskovec: In computer vision, we used to say, oh, here's the edge, here's the gradient of color. Nobody's doing that anymore. The neural network just figures out how to combine all the pixels to say what's in the image. And basically we do the same, right? We take the database, represent it as a graph, and then the graph transformer architecture learns how to combine all these pieces of information into an accurate prediction, rather than, you know, what is done today with machine learning, which is you manually figuring out how to combine transactions into something that allows you to predict something about the user.

Demetrios: Yeah. So you are wiping away all of that feature engineering. What are we data scientists and machine learning engineers gonna do now?

Jure Leskovec: [00:16:00] That's a great question. I think it's exciting. I think it's good, because you are not a feature engineering machine. It's almost like saying, oh, you brought me a robot that helps me sweep, but, you know, I wanna be sweeping the floor all the time.

Jure Leskovec: No, right? You are here to build models, to have impact on the business. You are not here to clean data and feature engineer; those are the two most boring parts of your job. So what we see, for example, is that this really frees up data scientists to build models faster, to refocus on modeling, to focus on what we should even be predicting, right?

Jure Leskovec: In business, in the enterprise, there is never a predictive problem that is given to you; it's a business problem that is given to you. So how do I go from a business problem to a predictive problem? Asking yourself: what is even predictable? What is modelable from my data?

Jure Leskovec: Because, you know, garbage in, garbage out. [00:17:00] So what we see is that, because you don't need to worry about the feature engineering step anymore, you are worrying much more about what the raw data is, how the raw data is structured, what the information flow over the raw data is, and so on. And maybe let me just say this, because I get asked many times: people say, oh, is this AutoML?

Jure Leskovec: Right? Remember the old AutoML promise? What was AutoML? AutoML is: you run a gazillion SQL queries against your database, you join every table with every table, you aggregate everything seven different ways, you create these humongous, silly feature vectors, and then you train a bunch of models and hope it'll work.

Jure Leskovec: So it's kind of like throwing spaghetti against the wall and hoping that something sticks. What I'm talking about is fundamentally different. [00:18:00] It's a single neural network that has the ability to learn to attend over the collection of tables. You're now training or tuning this neural network on that collection of tables.

Jure Leskovec: So it's faster, it scales to, you know, the largest use cases in the world, I can say more about that later, and it gives good performance.

Demetrios: Well, one thing that is very clear after hearing this is the value of that raw data, as if we didn't know that already, right? But how messy, ugly, shitty can your data be and still get value from this method, basically?

Jure Leskovec: That's a great question. So here is what I would say. In some sense, the quality of the data [00:19:00] matters, but because these models are learning from the entire relational structure, they can implicitly impute missing labels. They can correct for mistyped or mis-entered information.

Jure Leskovec: Even if the linking information is a bit noisy, sometimes the models can recover from it, because, you know, they're learning holistically across the entire relational structure. Another place where this is important, for example, is cold start problems. Cold start in the sense that, let's say in recommender systems, you have a new user or a new item and you haven't seen anything about it, but then you have an old user or an old item that a lot of others have already interacted with.

Jure Leskovec: So you are very data rich, information rich, about that item. The benefit of these models is that they can trade off: if the item is new, they will focus on the attributes of the item, the properties of the item, [00:20:00] you know, description, image, and things like that, and they'll latch onto that. But as soon as that item starts getting connected into the graph, by users interacting with it, buying it, and so on, now the model starts to use this relational structure, you know, what you said earlier, the guilt by association, to say, okay, who is interacting with this?

Jure Leskovec: What else are these people doing? Let me give an accurate recommendation. So the models learn how to trade off between learning from attributes and learning from structure. It leads to very robust models, that's all I wanted to say, both in terms of missing data and data ugliness, as well as overall robustness.

Demetrios: Do we need to have super clear ontologies and a knowledge graph set up, or is that something that it infers also?

Demetrios: Uh, good

Jure Leskovec: question. I would say, um, somewhere in between. All we need to know is [00:21:00] what table links to what table. Many times we can infer this automatically from column names. Um, and that's about all we need. So we don't need, we don't, the model does not rely on, um, semantic information, right? The model, the model relies on patterns in the data.

Jure Leskovec: So maybe here is the difference: the model learns how to recognize the patterns in the data to make those predictions. So we don't need a super rich semantic model. We don't need to explain the true meaning of every column down to the last detail, because for learning the patterns from the past that predict the future, that doesn't matter so much.

Demetrios: So talk to me about scale.

Jure Leskovec: Scale. Good question. So, this scales to tens of billions, a hundred billion nodes. So, for example, [00:22:00] just to illustrate, I can tell you some examples of how this technology runs in practice and who's using it. Okay. So, one interesting story was:

Jure Leskovec: You know, a few years back, DoorDash came to us, right? DoorDash during COVID was growing great, right? Everyone was ordering food. But post-COVID, not so much, right? So they came to us and said, hey, can you help? Can this technology help us? So we looked at restaurant recommendations.

Jure Leskovec: So: recommend a restaurant you are going to order from next. In particular, the recommendation problem is: recommend a restaurant you've never ordered from before. So it's "try something new." And the way this gets surfaced to you is through notifications, right? You get a notification saying, hey, do you think you wanna order from this restaurant today?

Jure Leskovec: Right? And this is one of the core [00:23:00] problems at DoorDash, right? So they've been using traditional technology to build these types of systems, with the best possible people and so on. We trained our transformer model over a collection of tables. It was a 30% increase in accuracy.

Jure Leskovec: Three zero. Wow. So me

Demetrios: as a user, when you suggest a restaurant for me to try, it's. Much more pointed and it's much more to my liking.

Jure Leskovec: Exactly. It was much more pointed, it was much more to my liking. And that resulted in several hundred million more purchases, or orders, on DoorDash, right?

Jure Leskovec: So it was a humongous business impact. And that's just one example, right? And now, say, what is the scale of DoorDash? I don't know, several hundred million users, hundreds of millions, billions of orders, all the [00:24:00] website behavior, all the searches, all the geographic locations, all the cuisine information.

Jure Leskovec: So it gets quite interesting as well. That's one example where we saw this humongous lift in performance over a flagship model, right? It's not that one data scientist woke up and said, oh, can this do better? No, it's a team, several years of effort, and this approach can do better. I can give you another example that's even larger scale, and this one is in advertising.

Demetrios: Oh, wait, sorry, before that. On the DoorDash one, how long did it take to go from inception to production with that?

Jure Leskovec: One benefit of these models is that, because they run on the raw data, putting them in production just means: refresh the raw data.

Jure Leskovec: Right. So for people who have put models in production, just having that feature store, having up-to-date [00:25:00] features, making sure features are not stale, productionizing them, babysitting those workflows is, you know, like babysitting a 2-year-old, tantrums all the time, right?

Jure Leskovec: Like being on call all the time. So it's very hard, right? But here it's much easier, because you just say: here's my fresh data, make predictions on it. You refresh the data, make predictions on it. So we put that in production, I would say, quite quickly, you know, maybe a few weeks.

Demetrios: And it's irrespective of whether the data schema changes, as long as you refresh the data?

Demetrios: Even if the data schema changes, you are still able to infer.

Jure Leskovec: The right things. So, the models themselves allow for data schema changes, so it wouldn't be a problem. But if you, let's say, drastically change the schema, change the composition of tables, you would retrain the model. That [00:26:00] takes maybe a couple of hours, and now you are back in the game.

Demetrios: Yeah.

Jure Leskovec: Right. So that's one benefit. Another benefit: if you think about it, user behavior changes all the time. Or if you think about fraud, right, fraudsters are always inventing new ways to commit fraud. And we as data scientists are always a step behind, because we're like, oh, my model performance is deteriorating.

Jure Leskovec: I need new features. But with this approach, it's just: let me retrain the model, and the model will figure out the new signal, and so on. So it's also much easier to keep up to date; you can automatically retrain and really get the most value out of the data you have.

Demetrios: Okay, it makes a lot of sense. So, yeah, what's the other one, the massive-scale one that's even bigger than DoorDash?

Jure Leskovec: What's bigger than DoorDash? I can talk about this one. It's advertising models, right? This is the bread and butter of the internet industry, right? It's [00:27:00] predicting how likely a user is to click on an ad, because, you know, every time you visit a website, there is a prediction for this user:

Jure Leskovec: what is the most likely ad they're going to click on? That ad gets put in front of you. If you click it, the advertiser gets paid. So no click, no money; click, you get money. Right, so a 1% lift in accuracy of this predictive model means a 1% lift in your revenue. So the use case here is Reddit. And Reddit has advertising models running on their website.

Jure Leskovec: And, you know, it's the bread and butter of what they do, and every single percentage point, or tenth of a percentage point, matters a lot in terms of absolute revenue. So there, I can say they usually increase the accuracy of the model by one to two percent year over year. And with this approach it was like five years' worth of improvement

Jure Leskovec: in accuracy [00:28:00] within a month or something, right? Again, over this flagship model that's been tuned with, you know, the latest from research incorporated, and so on. And I think this just shows the power of just letting the neural network learn over the data, right? The trick is that the data is split across, or sits in, multiple tables.

Jure Leskovec: You now need a generalized transformer architecture that can attend across the tables and learn how to extract the signal automatically, rather than us humans doing it manually.

Demetrios: Yeah, I was talking to my friend Sid at Andro and he was saying that they have a saying internally, which is let the model cook

Jure Leskovec: basically.

Jure Leskovec: Basically, exactly. And I think my point is: for natural language, human-like tasks, LLMs are great, but LLMs totally fail on structured data, right? You cannot take a database, [00:29:00] JSON-ize it, put it in a big blob of JSON, put that as a prompt, and say, you know, now based on this, give me a prediction, what will the user like?

Demetrios: Do you see this working for time series also?

Jure Leskovec: I see this working for time series a lot as well, exactly. There are two ways to think of time series, right? One is just to say, oh, I have an individual time series, I have a sequence, and I predict the next token. So people have been, I would say, quite successful training sequence-based transformers on time series data.

Jure Leskovec: Our approach is a bit different, because the way we think of it is: it's not one time series, it's a graph of time series, because these time series are usually connected. You know, maybe you have one sales record for one product, but products are related to each other, or maybe they're sold in different stores, and things like that.

Jure Leskovec: So by representing these time series as a graph, you actually get a further increase [00:30:00] in performance, because the model does not only learn how to make a prediction from a single time series, but also learns how to attend across other related time series to better forecast, right? Because some time series might have different lags, some products might be correlated, some stocks might be correlated, and things like that, right?

Jure Leskovec: So, predicting from a single time series is very hard because you don't have that information. But through this, let's say, graph-based approach, you can learn to borrow information from other related time series, which again leads to increased performance.

Demetrios: Yeah, it feels like you're just giving a much richer picture for the model to understand.

Jure Leskovec: I think that is a great way to say it, right? You give a much richer set of information to the model, and then you give the neural network the freedom to combine this information in the best possible way for that forecast, that prediction, that risk score.

Demetrios: Mm-hmm. [00:31:00] Another thing that I'm thinking about is: does it take a team to make this happen?

Demetrios: Like, how many people, how many resources? If I wanted to try and do this at my company tomorrow, what am I actually looking at needing to dedicate towards this?

Jure Leskovec: That's a great question. What we see is that with this, you need about 20x fewer resources than the traditional approach. So it means you can do much more with a much leaner team.

Jure Leskovec: You still, let's say, need a data scientist, or someone who understands predictive modeling, but that person can build, you know, 10 models in a single day if they want to. And then, what you also need: with predictions, it's always important that you're able to plug them into [00:32:00] whatever product, whatever decision making you are doing, right? So predictions are only useful when you are making decisions, when you are taking some action based on that prediction. So you need some engineering resource that says, okay, now that we have predictions, here is what we are going to do with them.

Demetrios: Yeah, it's still the model building and then putting the model into production. It's just that the model building piece has been drastically cut.

Jure Leskovec: Exactly. The time to build a single model has been cut, which means as a data scientist you have much more time to explore, to really think about: how am I modeling this business problem?

Jure Leskovec: What is the best way to model it, what is the most accurate way to model it? It allows you to explore that space much more efficiently and faster than with, let's say, traditional feature engineering that takes so much time.

Demetrios: Yeah, and I can't stop thinking about [00:33:00] how two of the main feature stores have recently gotten bought or acquired. You know, like Tecton got bought by Databricks recently, and then I just heard Featureform got bought by Redis.

Demetrios: And so it feels like that whole paradigm is, or was, maybe a dead end. If we can do things faster this way, why wouldn't we?

Jure Leskovec: That's a good question. I think this feature engineering approach to things, you know, it got revolutionized first in computer vision, then in natural language, but it felt like, you know, machine learning, predictive modeling, was kind of stuck in the past.

Jure Leskovec: We were doing the same thing for 30 years. We were putting data in a single table, and then, you know, we were training decision trees, and then we were training support vector machines, and then, I don't know, logistic regression was [00:34:00] fancy, and then neural networks were fancy, but it was all trained on the single table.

Jure Leskovec: And all these feature stores were built because of that need to compress the data, or join the data, into a single table. I think with this new wave of bringing AI to machine learning, the need for that is drastically reduced. And we see it both in terms of the productivity of building these models,

Jure Leskovec: and we see it in building more accurate models. And it just simplifies the stack, right? Actually, the statistic from industry is staggering: you need about two full-time people per model. Okay? So companies, you know, if they have 10 models in production, they have 20 people taking care of them.

Jure Leskovec: If you want 30 models, you need 60 people. If you want 400 [00:35:00] models, you need 800 people. That's the ratio: you need two full-time employees to babysit a model, and 30% of the model cost just goes to the maintenance of running that model. Of course, what also happens is, you know, people change jobs, so in reality you have all these models running in production that somebody built who's no longer with the company.

Jure Leskovec: Nobody wants to touch those models because as soon as they, you know, even look at them, those models are going to break. Nobody knows what to do. Right.

Demetrios: Dude, it's even worse than that. I remember back in the day, we had somebody come on here and they were working at Yandex and they said, yeah, my last six months there I had to go and get rid of a bunch of zombie models.

Demetrios: That were just out there. Nobody wanted to touch 'em because nobody knew if they were making the company any money at all. [00:36:00] And people were afraid that, well, if we take 'em offline and it did turn out to be one of those models that was making the company money, that could be a really big problem.

Demetrios: Uh, e

Jure Leskovec: Exactly, exactly. That's, that's a, I think there's maintenance and, and tracking of these models is a huge problem. Um, but with this, let's say neural network technology, with this relational deep learning craft transformers, it's, it's much, it's much easier to maintain. It's much easier to audit. It's much easier to, to have them, um, to have them there, to retrain them automatically, uh, and take care of them.

Jure Leskovec: So the maintenance cost becomes much, much less.

Demetrios: And speaking of maintenance, if you're talking about the feature store, feature engineering way of doing things, a lot of the maintenance would just happen with the feature pipelines. I know so many folks who had headaches because of that, and keeping [00:37:00] the data fresh is not as simple as it sounds.

Jure Leskovec: Oh, it's super hard, right? It's super, super hard, actually. If you think about it, basically it means that if you have these counters, these features, then for every event you have to update your features. So whenever you make a transaction, your feature needs to be updated.

Jure Leskovec: If it's not updated, it's stale, and so is the information. And if you made a bit of a mistake and updated it a bit too far into the future, then, you know, you are having information leakage, time travel, and your models are no good. So in reality it's a super big problem that I think we in research or academia don't really appreciate, because, you know, the data is always just given to us.

Jure Leskovec: Here's a training set, here's the test set, it's a fixed split, who cares? You can pre-compute it all. But in reality, yeah, updating those features is a huge pain.
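A tiny illustration of the time-travel problem: when assembling training data, each label may only see feature values computed before its own timestamp. One common safeguard is a point-in-time join; the example below uses pandas with made-up tables.

```python
import pandas as pd

# Hypothetical label events and a per-user feature that changes over time.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "label_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
})
features = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "feature_time": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-25", "2024-01-02"]),
    "txn_count_7d": [3, 5, 9, 1],
})

# Point-in-time join: for each label, take the latest feature value strictly
# before the label time, so no future information leaks into training.
train = pd.merge_asof(
    labels.sort_values("label_time"),
    features.sort_values("feature_time"),
    left_on="label_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
    allow_exact_matches=False,
)
print(train)
```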

Demetrios: Yeah. Well, this is fascinating. I [00:38:00] mean, you've gone from computing features to just training these models. And how many GPUs do I need to train a smaller model?

Demetrios: Maybe take me through it, from fine-tuning to if I wanted to just build my own foundation model. What am I looking at as far as compute needs, costs, and also time? Because if you're training a large language model, you need like a month, a month and a half type thing.

Jure Leskovec: It's actually interesting; the big surprise is how few resources you need.

Jure Leskovec: So it's actually very cheap, especially if you compare to large language models; then it's really, really cheap. Large language models are really humongous. You know, they kind of read the entire internet and memorized it. To do predictions, you don't need to do that. You need to learn how to recognize [00:39:00] historical patterns and how those predict the future.

Jure Leskovec: And that seems to require far fewer parameters. So with these models, you know, you can do well with 50 million parameters, which is peanuts. You can do well with a hundred million. And if you want, you can train a billion-parameter model, but those are super tiny when we think of LLMs. So what does this mean?

Jure Leskovec: It means you need fewer resources. Training times are in hours, not weeks or months. And very importantly, when you put things in production, they are actually sustainable, right? Putting LLMs in production is many times unsustainable because it's too expensive; each LLM call is just too expensive.

Jure Leskovec: But these types of models, because they are smaller, are actually quite efficient to run. So your compute cost is low, and, you know, the cost-benefit analysis actually [00:40:00] works out well. So you don't need tens of billions of parameters; these models can actually be quite small from that point of view.

Jure Leskovec: And you can train them on a single GPU. Now, if you want the foundation model, maybe you need a couple of GPUs, but, you know, not 10,000. Yeah, you don't need to build a data center to train this model, exactly. One thing I wanted to say is that these structured data foundation models, as we said, are smaller than large language models.

Jure Leskovec: They're easier to train, have fewer parameters, and this also means they're cheaper to operate. One thing that is maybe fascinating to me here is that you can actually build a pre-trained model that allows you to answer tasks ad hoc. You don't even have to fine-tune, you don't have to train for a specific prediction task; you can almost give a set of training [00:41:00] examples in context to the model.

Jure Leskovec: And the model is going to, in a single forward pass, make you an accurate prediction, if that makes sense.

Demetrios: And do you interface with these models through prompts, like you do with large language models?

Jure Leskovec: That's a great question. So if you have a pre-trained model, the way you interface with it is through some kind of prompting.

Jure Leskovec: It's a domain-specific structured language we call Predictive Query. And it basically has two parts: a PREDICT part and a FOR part. PREDICT says: I wanna predict this quantity, FOR this specific entity. So we say, I wanna predict your purchases next week, I wanna predict the sum of your transaction values next month.

Jure Leskovec: I wanna predict the probability of fraud for this transaction. So you specify that, and this is basically almost like a prompt, and based on this the model is then making a prediction. [00:42:00] That's if you're talking about a general foundation model. If you fine-tune, then the model just takes the data and gives you the prediction you want.

Demetrios: Mm-hmm. So it's not like I'm using SQL or Python to do it?

Jure Leskovec: To prompt the model, you are using something almost like SQL, where you specify the quantity you want to be predicted. Yeah, it's like pseudo-SQL; you'd be prompting with this pseudo-SQL. It's SQL that selects over the future that hasn't yet happened, so you need to predict that future.
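For flavor, predictive queries in this style might look roughly like the strings below, following the PREDICT ... FOR ... shape just described. The syntax is paraphrased from the examples in this conversation, not quoted from any official reference.

```python
# Illustrative predictive queries; syntax is approximate.
queries = [
    # "predict your purchases next week"
    "PREDICT COUNT(orders.*, 0, 7) FOR EACH users.user_id",
    # "predict the sum of your transaction values next month"
    "PREDICT SUM(transactions.amount, 0, 30) FOR EACH users.user_id",
    # "predict the probability of fraud for this transaction"
    "PREDICT transactions.is_fraud FOR EACH transactions.transaction_id",
]

for q in queries:
    # In practice each query would be sent to the pre-trained model, which
    # returns predictions in a single forward pass.
    print(q)
```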

Demetrios: That's nice. And you said something in a talk you gave that I wanted to dive into more, which I found fascinating, and it was: agents need predictive AI tools. Can you break that down for me?

Jure Leskovec: Maybe to start at the beginning, right? Why are we making predictions? We are making predictions because we are making decisions based on those predictions.

Jure Leskovec: Right. Why am I predicting fraud [00:43:00] probability? Because based on that fraud probability, I decide whether to stop a transaction or not. You know, why am I predicting churn probability? Because based on that estimate of churn, I'm then taking some action to bring that customer back. Why, in a hospital, am I predicting readmission probability after surgery? Because based on that, I decide whether to discharge the patient or not.

Jure Leskovec: Right? Or, you know, if you think about finance, you could say, okay, why am I predicting your probability of defaulting on a loan? Because then I decide: do I give you the loan or not? Right? So prediction is really the basis for decision making. So now, if we believe in autonomous agents:

Jure Leskovec: autonomous agents need to make decisions, right? And you don't want to make decisions based on common sense, right? It's okay, it's not terribly wrong, but it's not optimal either. You wanna make decisions that are rooted in the data, and to make decisions that are rooted in the data, you need to predict the outcome of those decisions.

Jure Leskovec: And right now, the bottleneck in, let's say, deploying these types of autonomous agents is their ability to make decisions. And this means that you have your beautiful LLM agent, super intelligent, and then, you know, you have to manually build some machine learning model that's going to say, oh, what's the probability of churn?

Jure Leskovec: Right? Because imagine a simple agent that says: let's go identify the people that are most likely to churn, let's identify the best offer to send to each of those people, let's write a nice personalized email and send it out, right? In this workflow, there are two big decision problems: which people are likely to churn,

Jure Leskovec: and what is the best offer to give? And those are two predictive tools that you need to solve this use case end to end.
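A hypothetical sketch of that workflow: an agent whose two decisions are delegated to predictive tools backed by a relational foundation model, with the LLM handling only the language part. Function names, thresholds, and query strings are illustrative, not a real API.

```python
# Stand-ins for predictive-query calls against a pre-trained relational model.
def predict_churn_risk() -> dict[str, float]:
    # e.g. something like "PREDICT COUNT(orders.*, 0, 30) = 0 FOR EACH users.user_id"
    return {"user_17": 0.82, "user_42": 0.11, "user_99": 0.91}  # stub scores

def predict_best_offer(user_id: str) -> str:
    # Stand-in for an offer-response prediction per user.
    return {"user_17": "free_delivery", "user_99": "10_percent_off"}.get(user_id, "none")

def churn_campaign(threshold: float = 0.7) -> None:
    # Decision 1: who is likely to churn?
    at_risk = [u for u, p in predict_churn_risk().items() if p > threshold]
    for user in at_risk:
        # Decision 2: which offer is this user most likely to respond to?
        offer = predict_best_offer(user)
        prompt = f"Write a short, friendly win-back email for {user} featuring '{offer}'."
        # llm.generate(prompt)  # the LLM only writes the personalized email
        print(prompt)

churn_campaign()
```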

Demetrios: Damn. I see where you're going with this.

Jure Leskovec: So agents will need [00:45:00] predictions. Maybe we have not realized that yet, because what agents do today is query some document database and retrieve a couple of passages.

Jure Leskovec: But those are not real agents, right? A real agent is actually something that does something for you, right? Imagine I'm a customer support agent: I wanna estimate, what's your lifetime value? What's your probability of churn? Because based on this, I will talk to you very differently.

Jure Leskovec: I will give you different offers, I may suggest different remedies. Those are predictive problems, right?

Demetrios: So how can we leverage the predictive models that we've been building, but, like you said, make sure that they're faster?

Jure Leskovec: Exactly. I think the benefit is in basically using relational foundation models that these LLM agents can use as tools to query for these predictions.

Jure Leskovec: So whenever you show up, they're like: how likely is this person to churn? What is their lifetime value? [00:46:00] Oh, now that I have these estimates, this is what I'm going to do. What's the next best offer to give to this person? You know, I'm making this decision on the fly based on a predictive model that is rooted in my data, that gives me an accurate prediction, rather than using some kind of common-sense hallucination to say, oh, what shall we do here?

Demetrios: Yeah. Or rather than how it's being done right now, which is that we're creating these predictive models, and maybe there are some advanced teams that are offering up these predictions to the agents, but as we've been talking about for the last 40 minutes, that is a cumbersome process the way we're doing it right now.

Jure Leskovec: Exactly. That's the alternative, and it doesn't scale. It means you have to hire two humans for every predictive problem you have, and that's a lot of humans to hire.

Demetrios: It's so obvious. Like, why don't we have that right now? [00:47:00] Or maybe you've seen some teams that are actually doing this, because like you said, you know, the step A, step B, step C is a super common use case, and it's a pattern that happens within companies every day.

Demetrios: Have you seen agents being used like that? And if not, like, what is the blocker?

Jure Leskovec: Yeah, I would say we have seen agents being used like that, and we have a couple of client deployments where this is happening. I think the blocker right now, as I see it, is the knowledge, right?

Jure Leskovec: It feels like there is this AI delirium out there, and then there are all these quiet people who are doing real work, and, you know, they're not being heard. And that seems to be a big [00:48:00] disconnect to me.

Demetrios: Is there anything that I didn't ask that you wanna talk about?

Jure Leskovec: Yeah, there is. There is actually hope for machine learning as well, which right now is a bit on a side track. People don't even realize, or we seem to have forgotten, how valuable structured data is, because it's kind of the ground truth of the business; it captures the blueprint of the business, and we seem to have forgotten all that.

Jure Leskovec: And, you know, we are now excited about answering questions about documents. But I think, you know, the world is going to come back, because enterprises are here to make accurate decisions, to make business impact, and to drive the business forward. And for that you need [00:49:00] prediction.
