
Foundation Models in the Modern Data Stack

Posted Jun 28, 2023 | Views 253
# LLM in Production
# Foundation Models
# Numbersstation.ai
# Redis.io
# Gantry.io
# Predibase.com
# Humanloop.com
# Anyscale.com
# Zilliz.com
# Arize.com
# Nvidia.com
# TrueFoundry.com
# Premai.io
# Continual.ai
# Argilla.io
# Genesiscloud.com
# Rungalileo.io
SPEAKERS
Ines Chami
Co-Founder @ NumbersStationAI

Ines Chami is the Chief Scientist and Co-Founder of Numbers Station. She received her Ph.D. from the Institute for Computational and Mathematical Engineering at Stanford University, where she was advised by Prof. Christopher Ré. Prior to attending Stanford, she studied Mathematics and Computer Science at Ecole Centrale Paris. Ines is particularly excited about building intelligent models to automate data-intensive work. Her work spans applications such as knowledge graph construction and data cleaning. For her work on graph representation learning, she won the 2021 Stanford Gene Golub Doctoral Dissertation Award. During her Ph.D. she interned at Microsoft AI and Research and at Google Research, where she co-authored the graph representation learning chapter of Kevin Murphy’s “Probabilistic Machine Learning: An Introduction”.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment, Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.

SUMMARY

As Foundation Models (FMs) continue to grow in size, innovations continue to push the boundaries of what these models can do on language and image tasks. This talk describes our work on applying foundation models to structured data tasks like data linkage, cleaning, and querying. We discuss the challenges and solutions that these models present for production deployment in the modern data stack.

TRANSCRIPT

Link to slides

We've got another speaker coming on though. Where is, let's see... Ines. Calling Ines to the screen. Where you at? There you are. What's happening, Ines? Hello. Hello? All good. Great to have you here. I'm so excited. So we went from these 30-minute talks and now we're gonna go lightning style, with 10-minute talks.

And so if anyone has questions in the next 30 minutes, over the next three talks, please hit them in Slack or in the chat, because the speakers will be answering them there. Ines, I think you need to share your screen. I don't see it. You might need to share it again, and then I'll keep you to the 10 minutes.

I'll be back in 10 minutes and let you go. Of course, I am a huge fan of what you are doing at Numbers Station and I'm so excited for this talk. So I'll let you go right now. Take over. Awesome. Thank you so much for having me, Demetrios. Appreciate it. All right, so today I'm gonna talk about some of our work on leveraging foundation models in the modern data stack.

Most people here know about generative AI, but before I dive in, I just wanted to do a very quick recap of what foundation models are and why they're so exciting. At a high level, foundation models are very large neural networks that are trained on massive amounts of unlabeled data, like text or images from the internet.

The idea is to use a technique called self-supervised learning to train the model. For instance, an important category of models are autoregressive language models, which are trained to predict the next word in a sentence given the previous words. This idea has been around for a long time in NLP, but what makes foundation models really unique is their scale.

With this increase in scale, in model size and data volumes, we started seeing some very exciting new model capabilities. So what are these capabilities, which we call emergent capabilities? Well, because these foundation models are trained on so much data, they can generalize to many downstream tasks via a capability called in-context learning.

The idea is to take any task and cast it as a generation task by crafting a prompt, and then use the foundation model to perform the task by completing the prompt, which is what the model is originally trained for. I can then reuse the prompt on many more input examples, or I can craft many other prompts for different tasks.

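To make the in-context learning idea concrete, here is a minimal sketch in Python. The `complete` function is a hypothetical placeholder for whatever foundation-model API you call (it is not a real library function); everything else is just prompt construction, and only the prompt changes between tasks.

```python
# Minimal sketch of in-context learning: the task is cast as text
# completion, and the model "performs" it by finishing the prompt.

def complete(prompt: str) -> str:
    """Hypothetical placeholder for a call to any foundation model API."""
    raise NotImplementedError("wire this to your model provider")

SENTIMENT_PROMPT = """Classify the sentiment of each review as positive or negative.

Review: The food was wonderful.
Sentiment: positive

Review: {review}
Sentiment:"""

def classify_sentiment(review: str) -> str:
    # The same prompt template is reused across input examples;
    # a different template would turn the model into a different "tool".
    return complete(SENTIMENT_PROMPT.format(review=review)).strip()
```
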
The key here is that the underlying model stays the same across all of these tasks. All right. So we clearly see, and everyone here in this workshop knows, that there's a revolution happening in this space, and there's a huge paradigm shift with foundation models compared to traditional AI. It means that pretty much anyone, not just AI experts, can rapidly prototype ideas and build AI applications with this technology.

What we're going to focus on today is applications specifically to the modern data stack. You may have heard this term; essentially it's used to describe the set of tools to process, store, and analyze data. Starting from where the data originates, which is typically apps like Salesforce or HubSpot, the data gets extracted and loaded into data warehouses like Snowflake.

It can then be transformed and modeled using tools like dbt, and ultimately, once the data is clean and prepared, it can be visualized with tools like Tableau or Power BI. These tools are amazing, and they've drastically improved things like scalability and knowledge sharing in organizations.

But there's still a huge problem, which is that there's a lot of manual work to do throughout this process. The good news is that it's possible to automate a lot of this work with AI, and even more so with foundation models. That's exactly what we do at Numbers Station: essentially, we're bringing this foundation model technology into the modern data stack to accelerate time to insights.

So now let's talk about how that works in practice. I'll start with a simple example that you probably have seen before, which is generating SQL code. In modern organizations, anytime a business user needs to answer a business question, they have to send a request to their data engineering team.

These ad hoc requests usually require multiple iterations and can take a long time to fulfill. With foundation models, a lot of this back and forth can be avoided by using the model to generate the SQL queries from natural language requests. This works amazingly well for simple queries, but there are some caveats for more complex queries.

In particular, there's sometimes domain-specific knowledge that the large pre-trained models may not capture. For instance, if there are multiple date columns in my table, which one am I supposed to use? That's not something the models know out of the box. I'll touch on this point later in this talk.

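As a rough illustration of this text-to-SQL pattern (reusing the hypothetical `complete` helper from the earlier sketch), the schema and the business question go into a prompt and the model fills in the query. The table and column names here are invented for the example.

```python
# Sketch of natural-language-to-SQL: schema plus question in, query out.
# The orders(...) schema below is made up for illustration.

SQL_PROMPT = """Given the table schema below, write a SQL query that answers the question.

Schema: orders(order_id, customer_id, order_date, amount)

Question: {question}
SQL:"""

def generate_sql(question: str) -> str:
    return complete(SQL_PROMPT.format(question=question)).strip()

# Usage: generate_sql("What was the total order amount per customer in 2023?")
```
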
Another exciting application is data cleaning, to fix things like typos or missing values. Typically, the way to address this problem is to create a bunch of SQL rules. This works well overall, but it's a very long process to derive all the cleaning logic, and many times the rules break because of some edge case that was not captured during the rules development process.

With foundation models, there's an exciting alternative to this rule development process, which is to use the model itself to do the correction. The idea is to create a prompt with a few examples and then reuse this prompt over all the data records to clean the data, which is what is shown in this figure.

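A minimal sketch of that few-shot cleaning loop, again assuming the hypothetical `complete` model call from earlier; the (dirty, clean) example pairs are invented:

```python
# Sketch of few-shot data cleaning: a handful of (dirty, clean) pairs
# are placed in the prompt, and the same prompt is reused per record.

CLEANING_EXAMPLES = [
    ("jon.doe@gmal.com", "jon.doe@gmail.com"),  # typo fix
    ("N//A", "NULL"),                           # missing-value normalization
]

def build_cleaning_prompt(value: str) -> str:
    demos = "\n".join(f"Dirty: {d}\nClean: {c}" for d, c in CLEANING_EXAMPLES)
    return f"Fix typos and normalize missing values.\n{demos}\nDirty: {value}\nClean:"

def clean_column(values: list[str]) -> list[str]:
    # The model infers the cleaning pattern from the in-context examples,
    # so no explicit rules are written.
    return [complete(build_cleaning_prompt(v)).strip() for v in values]
```
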
This is obviously very exciting because the model can derive the patterns automatically from the in-context examples that we provide. But there are some caveats to this approach as well, especially issues around scalability, which I'll discuss later in this talk. Another exciting application is data linkage.

The goal here is to find links between different sources of data that may not have a common identifier, say my Salesforce data and my HubSpot data: I want to link my customers, but there's no ID to create a join on. Similar to data cleaning, engineers need to spend long iteration cycles crafting rules, and sometimes these rules can be brittle and break in production.

With foundation models, we can create a prompt for the specific data linkage task and alleviate some of these issues. The idea is to feed both records to the model and then ask it, in natural language, whether these two things are the same. In general, we find that both for this problem and the cleaning problem, the best solution is to compose rules with the foundation model, as in the sketch below.

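Here is a sketch of that composition for linkage, with a cheap rule handling the easy matches and the hypothetical `complete` model call reserved for ambiguous pairs; the record fields are invented:

```python
# Sketch of data linkage with rule/model composition: an exact-match
# rule resolves the easy pairs; the model handles the "last mile".

def is_same_entity(record_a: dict, record_b: dict) -> bool:
    # Basic rule: identical normalized names count as a match.
    if record_a["name"].strip().lower() == record_b["name"].strip().lower():
        return True
    # Fall back to the foundation model for the harder cases.
    prompt = (
        "Do these two records refer to the same customer? Answer yes or no.\n"
        f"Record A: {record_a}\nRecord B: {record_b}\nAnswer:"
    )
    return complete(prompt).strip().lower().startswith("yes")
```
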
If there's some very basic rule that can solve 80% of the problem, we should use it. But then for these last-mile examples that are more complex, we can call the foundation model. All right. So now that we've shown some of the possibilities for using foundation models in the data stack, and there are many more, I just had to scope it for this lightning talk.

I want to share a few caveats, as well as solutions that we've developed throughout our research at Numbers Station in collaboration with the Stanford AI Lab. The first caveat, which we touched on briefly, is scale. Foundation models are extremely large models, and depending on the application and how we use them, they can be very expensive and slow.

If I'm using a foundation model with a human in the loop for things like a SQL copilot, then the scale is not so much an issue; we care more about latency in these cases. But for other data applications, where I need to use the model itself to make the predictions on, let's say, an entire database that has millions or billions of rows, this is just impossible: it's gonna be extremely expensive and extremely slow compared to using a rule-based solution.

So how can we address this? There are many possible solutions, and one of them is model distillation. Essentially, the idea here is to take the big model for prototyping and then use that big model to teach a smaller model to do the task. This actually works really well: with good prototypes and good fine-tuning, we can easily bridge the gap between these large and smaller distilled models.

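As a loose illustration of the distillation recipe (not Numbers Station's actual pipeline), the big model pseudo-labels a sample of data and a small, cheap student is trained on those labels. Here the student is a scikit-learn classifier standing in for whatever small model you'd actually fine-tune, and `complete` is the hypothetical teacher call from earlier:

```python
# Sketch of distillation: the expensive teacher labels unlabeled rows,
# and a small student model is trained on the pseudo-labels so that
# inference over millions of rows no longer needs the big model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def distill(unlabeled_texts: list[str]):
    # Teacher pass: one model call per sampled record.
    pseudo_labels = [
        complete(f"Is this record a duplicate? Answer yes or no.\n{t}\nAnswer:").strip().lower()
        for t in unlabeled_texts
    ]
    # Student: cheap to train and to run over the full table.
    student = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    student.fit(unlabeled_texts, pseudo_labels)
    return student
```
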
There are also other solutions to address the scale issue, which I linked in this slide. One of them, as I mentioned, is to use the foundation models only when we really need to. If the task is simple enough that it can be solved with a rule, we don't need to use the model itself to solve the task.

We can instead use the model to derive the rules based on the data, which is always better than handcrafting rules. All right. Another important challenge here is, as everyone knows, prompt brittleness. For instance, if we format a prompt differently, the model will predict two different responses in this cleaning task.

The demonstrations that are used in the prompt are also really important. In this example on the right, we ran an experiment and saw that picking manual demonstrations versus random demonstrations led to a huge performance gap. This can be particularly problematic for data applications where users are used to deterministic outputs, which is the case in the modern data stack.

They're not comfortable with potential errors and brittleness, even if that can save them a lot of hours of manual work. So how can we solve this? There are a few methods, again, to approach this problem. One of them is a method that we proposed in the AMA paper, which is linked here.

The high-level idea is to apply multiple prompts to the input and then aggregate the predictions to get the final result. This worked quite well: we noticed some good improvements compared to the traditional prompting method. There are, of course, other solutions to address this prompt brittleness issue, such as decomposing the prompts with chains, as well as being smart about how we sample demonstrations.

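A minimal sketch of the aggregation idea (in the spirit of the AMA method, not a faithful reimplementation): several phrasings of the same question are sent to the model and the answers are combined by majority vote, reducing sensitivity to any single prompt format. `complete` is the hypothetical model call from earlier.

```python
# Sketch of multi-prompt aggregation: vote across prompt variants.
from collections import Counter

PROMPT_VARIANTS = [
    "Is '{v}' a valid email address? Answer yes or no:",
    "Answer yes or no: does '{v}' look like a well-formed email?",
    "Value: '{v}'. Valid email? (yes/no):",
]

def robust_predict(value: str) -> str:
    # Each variant may be individually brittle; the vote is more stable.
    answers = [complete(p.format(v=value)).strip().lower() for p in PROMPT_VARIANTS]
    return Counter(answers).most_common(1)[0][0]
```
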
The last caveat I wanted to touch on here is the lack of domain-specific knowledge. Foundation models are trained on public data, and they lack some knowledge that is sometimes crucial to solving enterprise data tasks. For instance, if I'm asking a foundation model to generate a query to compute the number of active customers in my database, there might not be a perfect is_active column.

Then what I need to do is rely on some organizational knowledge that defines what an active customer is to generate that query properly. To approach this domain knowledge problem, there are two types of solutions: inference-time and training-time. For training time, the idea is to leverage the untapped knowledge that is stored in an organization's documents, logs, and metadata.

What we can do is continually pre-train open-source models on this data to make them aware of this domain knowledge. We have some work on this, which I linked as well on this slide. Another solution is to bring the knowledge in at inference time by augmenting the foundation model with some external memory, which can be accessed through a knowledge graph, a semantic layer, or a search index over the internal documents.

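To sketch the inference-time option: index the internal documents, retrieve the most relevant one per request, and prepend it to the prompt so the model sees the organization's own definitions. The documents and the TF-IDF retriever here are placeholders; a real deployment might use a knowledge graph, a semantic layer, or a proper search index, and `complete` is the hypothetical model call from earlier.

```python
# Sketch of inference-time knowledge injection via retrieval.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

DOCS = [  # invented examples of internal organizational knowledge
    "An active customer is one with at least one order in the last 90 days.",
    "Revenue is recognized on the ship date, not the order date.",
]

vectorizer = TfidfVectorizer().fit(DOCS)
doc_matrix = vectorizer.transform(DOCS)

def answer_with_context(question: str) -> str:
    # Retrieve the single most relevant internal definition (top-1).
    scores = cosine_similarity(vectorizer.transform([question]), doc_matrix)
    context = DOCS[scores.argmax()]
    return complete(f"Context: {context}\nQuestion: {question}\nSQL:")
```
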
So that's pretty much it for this talk. I'm sorry, I know it was rushed; I tried to condense everything into 10 minutes. I wanted to thank a few of my collaborators from Stanford and Numbers Station, and if you're interested in these applications, feel free to reach out to me. Check out the blog, send an email.

I'd be happy to discuss more. So good, I love it. Ines, thank you so much. That was awesome. So, two things. If anyone has a question, throw it into the chat or into Slack; Ines, I think, is in both of them. I'm super excited about what you're doing at Numbers Station, and I believe that you're working out of the Factory offices, or you stop by there every once in a while. Yeah, a good part of our team is there.

All right, now that I've got you on screen right here, now I can catch you. I'm gonna be visiting Diego, who you're working with, at the end of the month in San Francisco, and I want to stop by the Factory office and hang out, and maybe we can record a podcast or something and you can go deeper into this.

Does that sound good? Yeah, anytime. Sounds great. And thank you for inviting me again. Awesome. Well, I'm glad we got that settled. I'll reach out to you on the good old email or Slack and we'll make it happen. And now we'll keep it cruising. So thanks again.
