AI-Powered Data Unification for Data Platforms // Shelby Heinecke // DE4AI

Posted Sep 18, 2024 | Views 753

speaker

Shelby Heinecke

Senior AI Research Manager @ Salesforce

Dr. Shelby Heinecke leads an AI research team at Salesforce. Shelby’s team develops cutting-edge AI for research and product, including Salesforce’s data platform. Prior to leading her team, Shelby was a Senior AI Research Scientist focusing on robust recommendation systems and productionizing AI models. Shelby earned her Ph.D. in Mathematics from University of Illinois at Chicago, specializing in machine learning theory. She also holds an M.S. in Mathematics from Northwestern and a B.S. in Mathematics from MIT. Website: www.shelbyh.ai

SUMMARY

A robust data platform is the first step to ensuring that downstream AI is grounded on accurate and relevant data. And while data platforms power AI, did you know that AI can also power data platforms? In this talk, we will discuss one of the most critical operations in data platforms, data unification, and discuss how we use small, efficient LLMs to power this step in Salesforce’s data platform.

TRANSCRIPT

Skylar [00:00:07]: Welcome.

Shelby Heinecke [00:00:08]: Awesome. Super excited to be here, everyone. So we're all here today because data is the key for AI. We all know that today, and after all, we're all building our data platforms. What I want to tell you today is that even for your data platform, we can be using AI to automate the processes there. Actually, at Salesforce today, our customers on a daily basis are using small LMs in our Data Cloud. So I'm going to dive into that today, give you a sneak peek. If you're interested.

Shelby Heinecke [00:00:40]: I'll give you links to learn more about this. Let's get started. Let me give you a little bit of background about myself and my team. I lead an AI research team at Salesforce. We focus on agentic AI, we focus on on-device AI, we focus on efficient AI. Today's talk is really in the efficient AI category, but it's horizontal to on-device AI. It's horizontal to agentic AI. Again, you need efficient AI to power the data platform.

Shelby Heinecke [00:01:08]: To power all that great AI that's going to go on top. So we build AI for research at Salesforce, but most importantly for Salesforce products. So I'm going to tell you a little bit about the Salesforce data platform that our models are actively running in. So if this sounds interesting to you and you'd like to connect, I'm happy to meet you and talk more. Feel free to scan this QR code to learn more about my background and my team. So what is Salesforce? Let's all get on the same page here today. What is Salesforce? So Salesforce, as you may know, is referred to as a customer relationship manager. And so we're really this software that is helping to manage the relationship between you and your customers. So that could be with regards to sales: helping you manage your sales, your sales leads, closing sales. Helping you to talk to your customers through marketing: helping you build marketing campaigns and deploy those campaigns to your customers. Helping you serve your customers: helping them through problems, helping them with customer service.

Shelby Heinecke [00:02:14]: Salesforce helps with all of that in one place. Now today, though, as you can imagine, every single one of these elements, service, marketing, sales and so much more, it's all AI-powered, and in fact there's even agentic AI that's coming on top of all of these things. But none of that is possible without the data. So no AI is possible without data. And at Salesforce, our data platform is called Data Cloud. So I want to walk you through it a little bit today, because this is really where some of our AI sits. So as you can imagine.

Shelby Heinecke [00:02:50]: The first step to our customers using our Data Cloud is to actually insert and ingest their data. So we've got lots of connectors. No matter where your data is coming from, you can import it, we can ingest your data. And our customers usually have data from lots of different sources on lots of different platforms. And Data Cloud can just simply ingest all that. Now here's the key. It's not just about ingesting that data, it's the harmonization of the data.

Shelby Heinecke [00:03:18]: That's where the value comes in. At the first step, the data is all just one big pile, right? This is about organizing the data so that you eventually get this single source of truth. So for each of your customers, you can take a look at that customer and see comprehensively all data that is linked to them. That's where the power comes in, because on top of that is where the AI will sit. The AI can read from that organized, well-managed data. So you can imagine it's going to be way easier to hydrate prompts if the data is harmonized, if it's organized. It's going to be way more effective to get analytics if the data is all mapped and organized.

Shelby Heinecke [00:04:02]: So the key here is that this data harmonization step is absolutely critical, and that's what we're going to talk about today. So the key to making that data harmonization happen is a step that we call identity resolution. So this isn't a new problem, but we have new solutions using AI. So let me tell you about this problem. Imagine you have multiple data sources, and the same person, let's say a target person, a target customer, is appearing in several of those data sources. So maybe you have social media data, maybe you've got email data, maybe you've got purchase data, customer service data. You've got lots of types of data, and that person is appearing. But here's the challenge.

Shelby Heinecke [00:04:48]: They're appearing differently across those data sources. In one data source, their name was a little bit different because it was a different point in their life. In one data source, their address is different because they moved around. In another data source, there could be a lot of typos, there could be human error in the record. So in light of all that complexity in the text, we still want to be able to merge those data pieces to map to the same person, as you can see here. So we've solved this problem at Salesforce using AI, and in fact we've used small language models, which I'll get to. So the first step to solving this problem, after that data ingestion step, is to do a candidate selection process.

Shelby Heinecke [00:05:35]: So, with Data Cloud and with data platforms in general, we're talking massive, massive scale here. Millions, billions, trillions of rows of data. Right? So if I've got a target person in mind, and I want to find all parts of the data that map to them, my first step has got to be to do some candidate selection to narrow down to the relevant pieces of data. The reason is, otherwise I've got to navigate through millions, billions, trillions of rows of data. That's computationally not efficient. So this first step of taking the data and finding candidates for each target person is what we call candidate selection. And we solve this in Data Cloud by using an embedding model. And not just any embedding model, there's a lot to choose from.

Shelby Heinecke [00:06:22]: We're using a pretty small one, only 300 million parameters. So I'll put this into perspective later, but this is a pretty small model, and I'll get to why we're using the small model. But right now, customers are actively using this small embedding model for that step. Now, for the second step. Let's say we have this target, let's say, a target customer. Then we have some candidates that match that customer. We really want a more fine-grained search on which of those candidates is really a match.
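
Illustrative sketch (not from the talk): the candidate selection step described above can be pictured as a nearest-neighbor search over embeddings. The Python below is a minimal sketch under stated assumptions. `toy_embed` is a hypothetical stand-in built from hashed character trigrams, not the 300-million-parameter embedding model the speaker mentions, and the sample records, `DIM`, and `k` are invented for illustration. A production system at the scale described would also likely use an approximate nearest-neighbor index rather than scoring every row.

```python
import math
import zlib
from collections import Counter

DIM = 256  # hypothetical embedding size for this toy example


def toy_embed(text):
    """Stand-in for a learned embedding model: hashed character trigrams."""
    normalized = f"  {text.lower().strip()} "
    counts = Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))
    vec = [0.0] * DIM
    for gram, n in counts.items():
        # crc32 is used instead of hash() so results are stable across runs.
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += n
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, so dot product = cosine


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_candidates(target, records, k=3):
    """Return the k (score, record) pairs most similar to the target."""
    target_vec = toy_embed(target)
    scored = [(cosine(target_vec, toy_embed(r)), r) for r in records]
    return sorted(scored, reverse=True)[:k]


# Invented records standing in for rows from different data sources.
records = [
    "Jon Smith, 12 Oak Street, Chicago",
    "Jonathan Smith, 12 Oak St, Chicago",
    "Maria Garcia, 4 Elm Road, Austin",
    "J. Smyth, 12 Oak Street, Chicago",
    "Wei Chen, 900 Pine Ave, Seattle",
]

candidates = select_candidates("John Smith, 12 Oak Street, Chicago", records, k=3)
for score, record in candidates:
    print(f"{score:.2f}  {record}")
```

The point of this stage is only to shrink the search space cheaply. Deciding which of the surviving candidates is a true match is left to the second, more careful step.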

Shelby Heinecke [00:06:59]: We want to be more careful here. So that's where our second small language model comes in. We've got a DistilBERT model deployed, only 66 million parameters. This is small. And so, in that step, it takes in pairs, a target and a candidate, and gives a score. And from that score, we can determine if it's a close enough match to be considered a match. So, to put this all into perspective, the past two years, we've been focusing a lot on large language models.
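
Illustrative sketch (not from the talk): the second step scores each (target, candidate) pair and keeps the pairs whose score clears a cutoff. In the Python below, `match_score` is a hypothetical stand-in that uses the standard library's `difflib.SequenceMatcher` string similarity, not the fine-tuned 66-million-parameter model the speaker describes. `MATCH_THRESHOLD` and the sample records are likewise assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff; a real system would tune this on labeled pairs.
MATCH_THRESHOLD = 0.8


def match_score(target, candidate):
    """Stand-in for the pairwise model: returns a similarity score in [0, 1]."""
    return SequenceMatcher(None, target.lower(), candidate.lower()).ratio()


def resolve(target, candidates, threshold=MATCH_THRESHOLD):
    """Score every (target, candidate) pair and keep those above the cutoff."""
    scored = [(match_score(target, c), c) for c in candidates]
    return [(s, c) for s, c in sorted(scored, reverse=True) if s >= threshold]


target = "John Smith, 12 Oak Street, Chicago"
candidates = [
    "Jon Smith, 12 Oak Street, Chicago",
    "J. Smyth, 12 Oak Street, Chicago",
    "Maria Garcia, 4 Elm Road, Austin",
]

matches = resolve(target, candidates)
for score, record in matches:
    print(f"{score:.2f}  {record}")
```

Because this stage only sees the short candidate list from the first step, it can afford a more expensive comparison per pair than the embedding lookup.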

Shelby Heinecke [00:07:33]: If you look here at this chart, you see an array of large language models and their parameters. We see PaLM has 540 billion parameters. We're seeing hundreds of billions of parameters emerge here. But what I'm speaking of, and what we're deploying at massive scale, are these small language models. And I really think models less than 13 billion parameters are really in the range of being small language models. So we're calling them SLMs. Now with SLMs you can drive so much efficiency, there are so many benefits to them.

Shelby Heinecke [00:08:12]: So because they're small, they're extremely cheap to serve. With fewer parameters, they simply consume less hardware. And because you can serve it yourself, you don't need to pay for an expensive API or a third party to serve the model for you. So super cheap to use. Another reason we love small language models is because they're fast. They're super fast. So in a data platform you don't have seconds to wait to get an output. You've got to process millions, billions, trillions of data points in your data platform.

Shelby Heinecke [00:08:44]: You need models that are going to be extremely fast. And small language models are just that. They have fewer parameters, right? So there's just less computation to do. It's just a faster process. And the final thing to keep in mind is these small language models are focused, and that's a good thing. With fine-tuning, these models can perform exceptionally well on clearly defined tasks. So in your data platform, and in Salesforce's data platform, we know exactly what needs to get done. We know exactly the data processing that needs to get done. So we can train a small model to be exceptionally good at it, and it's going to be fast and cheap. And another bonus on top of that, besides picking the right small language model as we've done: we don't want to forget to quantize.

Shelby Heinecke [00:09:32]: This is a fun bonus. This is a great bonus. So quantization is so powerful. With quantization, the idea is to reduce the precision of the weights in your large language model. So usually, for large language models and small language models, the weights are typically 32-bit or 16-bit floating point. By quantizing, we're reducing most of those weights down to eight bits or down to four bits, or we can even pick three bits or two bits. And the result is, for the most part, most of the performance is going to be retained, but the footprint is going to be even smaller and the latency is going to be even better. So a lot of benefits to quantization.
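
Illustrative sketch (not from the talk): the precision reduction described above can be shown with symmetric per-tensor 8-bit quantization, one common scheme among several. The talk does not say which scheme Salesforce uses, so the function names, the sample weights, and the choice of scheme below are assumptions. The footprint helper counts weights only: at 32 bits, 300 million parameters take about 1.2 GB, and at 4 bits about 0.15 GB. Real quantized models also store scales and other metadata, which this arithmetic ignores.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 codes plus one scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]


def footprint_gb(num_params, bits_per_weight):
    """Weights-only memory footprint in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Invented weights for illustration.
weights = [0.42, -1.27, 0.03, 0.88, -0.51]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("max round-trip error:", max_error)

# Parameter counts are the two model sizes mentioned in the talk.
for name, params in [("300M embedding model", 300e6), ("66M pairwise model", 66e6)]:
    for bits in (32, 16, 8, 4):
        print(f"{name} at {bits}-bit: {footprint_gb(params, bits):.3f} GB")
```

Each restored weight differs from the original by at most half a quantization step, which is the intuition behind most of the performance being retained while the footprint shrinks.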

Shelby Heinecke [00:10:11]: Yeah, and that wraps it up, really. The takeaway here is that again, we're all here for the same reason. There is no AI without good data, and data platforms are king. So don't forget that data platforms themselves can leverage AI to automate and optimize those processes needed for data processing, for data harmonization. We're doing that at Salesforce and it's openly available to our customers today. And the key to making that happen, definitely keep in mind the power of small models and the power of quantization, they're perfect for scale. Thanks so much.

Skylar [00:10:47]: Awesome. Thank you. I think we have a couple of minutes left since we started early, and apologies, my camera doesn't seem to be working. But, yeah, I was definitely curious to get your take, as you've been talking about applying small language models. Do you think there are cases where small language models are less appropriate? Like, how does a practitioner decide, can I use a small language model here, or should I use one?

Shelby Heinecke [00:11:15]: That's a really good question. That's a super good question. I think that if you have a very specific, focused use case with a small scope, you might want to consider a small model, and particularly might want to consider fine-tuning that model for your use case. So I think that's one thing to keep in mind. Now, when to use a bigger model: I think if the use case is more flexible and broad, bigger models have better generalization abilities. Bigger models have their benefits. Better generalization.

Shelby Heinecke [00:11:49]: They're more robust, actually, to noise. They have a lot more flexibility. So I think if you have a broader task, a big model will be great. I think there are other considerations, depending on where you're deploying. Regardless of your background, not everyone can deploy the biggest model on their own hardware. So if you want to deploy it yourself, you also have to think about which type of hardware you have access to. Am I trying to deploy on a phone? If so, you definitely need a small model. Do I have a good amount of GPUs? If so, maybe a medium-sized model is my upper bound.

Shelby Heinecke [00:12:29]: So I think the resources you have access to also need to be taken into consideration. I think the small models really open up doors for more people, because, again, these big models, which are fantastic and really have changed the world the past few years, only a few of us can really own and serve. But what we're showing with small models, with this project and other projects that we've released with our team, is that small models can be powerful too. You just have to train them the right way.

Skylar [00:13:03]: Totally awesome. Thank you so much for your time.

Shelby Heinecke [00:13:06]: Thank you so much. Take care.

Skylar [00:13:08]: Bye.
