MLOps Community

Multimodal is Here. Or is it?

Posted Jul 08, 2024 | Views 114
# Multimodal
# LMMs
# LlamaIndex
Yi Ding
Head of TypeScript and Partnerships @ LlamaIndex

Yi Ding is Head of TypeScript and Partnerships at LlamaIndex, the open-source framework for using LLMs with your data. In his talk, he explores recent advances in multimodal LLMs, or Large Multimodal Models, what they mean for RAG applications, and what gaps remain in our ability to process multimodal data.


Yi Ding [00:00:01]: Thanks for coming. I'm really happy to be here. Thanks, Rahul, for inviting me. Or, you know, I asked to be invited, so, you know, that's always nice. So yeah, just a little bit about me, you know, why am I here, right? I spent eight years at Apple working on messaging apps. Messaging apps are chatbots. Okay, fancy term. Then I spent nine months at LlamaIndex.

Yi Ding [00:00:30]: I started using GPT-3 Davinci in January of last year, so I'm kind of an expert when it comes to LLMs, and I used GPT-4V last September, so I'm really an expert when it comes to multimodal. All right, so a little bit about LlamaIndex, in case you've never heard of LlamaIndex. LlamaIndex combines LLMs and data. Okay, if you remember one thing from this talk: LLMs and data. All right, that's LlamaIndex. We work with practically every single LLM out there, including ChatGPT, Gemini, Claude, you name it. And then we also have a bunch of advanced tools for folks to use to get stuff into production. 2024 is LlamaIndex's year of production.

Yi Ding [00:01:21]: Okay? So I always like to start my talks by telling you what you're not going to hear, right? This is not a sanctioned LlamaIndex talk. If you don't like it, don't go tell my boss, okay? It's also not an ML-heavy talk. So if you don't understand something, that's my fault. Like, I screwed up, right? So just tell me, come up later and be like, hey, that was just a really bad talk, man. The only thing that's not going to happen is me being on time. So you guys are just, you know, going to sit here.

Yi Ding [00:01:50]: Okay, what is this talk about? Multimodal models: what are they, and why should we be excited about them? Also, can you use multimodal with RAG? I think a couple of speakers have said yes, you can. Let's talk about that. And then my favorite part: I'm going to make some predictions. And as Yogi Berra says, it's tough to make predictions, especially about the future. Unfortunately, I'm going to make some predictions about the future. So, okay. First off, what are LLMs? This is a quiz. Somebody answer. Who here? Large language models.

Yi Ding [00:02:26]: Perfect. Okay, so what are multimodal LLMs? This is a more advanced part B question. Come on, somebody, we're at the multimodal talk. No, no, they're not multimodal. Multimodal LLMs are multimodal large language models, but they just mean models that can, you know, handle more than text, right? More than text. Text is one modality, and then you've got more modalities. Okay, so just to confuse people, though, we also have LMMs. And LMMs are actually the same thing as multimodal LLMs: large multimodal models.

Yi Ding [00:03:05]: So if you ever hear somebody say LMMs, you're like, you know, hey, just because you're a statistician doesn't mean you can't speak English, right? Just call them multimodal LLMs. Okay, so here are some examples of LMMs, or multimodal LLMs. In fact, you've probably used all of these, right? GPT-4. It now actually supports vision by default, so you no longer have to choose to only use vision. You've got Gemini, you've got Anthropic's Claude, you've even got Llama 3. Wait, wait, Llama 3? Llama 3 just came out. I didn't see any multimodal stuff in there. Oh, yeah, they're going to come out with some multimodal.

Yi Ding [00:03:45]: Facebook's already promised us, right? So all the models that you're using today are, in general, multimodal models. So who here has not used a multimodal model? You've all used multimodal models. Great. Great. Or are you guys just afraid of me now? Okay, all right, all right. So I used a multimodal model, right? Because I'm an expert. So here's the multimodal model.

Yi Ding [00:04:13]: I have a bus here, and hey, I get some text that says: a yellow bus driving down a road with green trees and green grass in the background. Right? Pretty good description. Pretty good description. Okay. I asked a different model to describe it. It said: this is a photo of a yellow school bus parked on the side of the road, with a utility pole seemingly splitting it in half due to the angle of the photo. These are real multimodal models that you can use today. Okay, so who liked the first description better? Okay, I see some hands.

Yi Ding [00:04:49]: I see some hands. Okay, who liked the second description better? Okay, I see a few hands. Okay, so, people who picked number two, you're the winners, because number two was GPT-4V, via ChatGPT. And literally, I literally did this, like, you know, I don't know, ten minutes before coming here, right? This is not cherry-picked. This was the description it gave. The only thing I'm a little worried about is it says it's parked on the side of the road. It doesn't look parked to me, but okay, pretty good.

Yi Ding [00:05:20]: Guess what model one was. Oh, it was from Baidu Research, by Andrew Ng, talking about supervised learning. In fact, he did this in 2015. So those of you who liked model number one, you guys are using nine-year-old technology, folks. Okay? All right. But it's actually a better description. I like model number one, too. Right? So why bother, then, right, is the question.

Yi Ding [00:05:51]: Right? Like, it seems like we had all this figured out in 2015, and now we're like, oh, these things can describe images. Andrew could do it in 2015, right? So this is where the predictions come in. I think the most important thing when it comes to multimodal models is domain fusion, right? Fusing the ability to understand images and text and audio together, I think, is what's going to cause some more interesting things to come out of our multimodal LLMs. And I have a reason for thinking that. So let's take a detour. I worked on chatbots for the last eight years. Chatbots: we have some fancy terminology for that called natural language processing, right? It's like being able to talk, but it's natural language processing.

Yi Ding [00:06:41]: Okay, so this is how NLP, as we called it, worked eight years ago. This is literally a library from Stanford. It's called CoreNLP. You can still download it, right? They worked on this with their entire research lab; it's a fabulous piece of work. And basically, the way NLP worked eight years ago is you had all these different functions: tokenization, sentence splitting, named entity recognition. And the folks at Stanford, and the folks at most places, said the way we're going to get to understanding language is that people are going to publish papers in all these areas. You had the expert in coreference resolution.

Yi Ding [00:07:20]: I don't even know what that is. But you had somebody who was publishing great papers on that, and then somebody one day was going to combine all of them together, and then they were actually going to be able to build a chatbot that actually understood what you were saying. Right? Well, we know how it turned out. It didn't turn out that way, right? But they were so convinced. They were so convinced. There was this guy, Hugh Loebner. He had made a little bit of money in Silicon Valley, and he's like, look, I'm going to give a little prize to the person who makes the chatbot that is most likely to pass the Turing test.

Yi Ding [00:07:58]: Right? Like, the closest to passing the Turing test. It was only $3,000 a year, okay? The guy was not giving out massive sums of money. Well, what did he get for his troubles? Marvin Minsky said he was obnoxious and stupid. Why? Because at the time, people thought that not only was passing the Turing test impossible, but it was actually stupid to even try, because you're taking these graduate students away from doing real NLP research and putting them on this thing that would never happen, right? Never, ever, ever happen. Well, Loebner died in 2016, so he never saw it, and he never had to pay out. Like, the grand prize was only 100 grand, so he never had to pay that out, right? He did survive Marvin Minsky by eleven months, so good for him, right? Today it's like, well, hey, you know, if I'm going to give you 100 grand to pass the Turing test, I think I'm going to be giving out a lot of money right now, right? So what was the key? It was domain fusion, right? So you take these two things, entity extraction and intent recognition, and people were thinking, okay, I'm going to do this separately, I'm going to publish some papers, I'm going to write some software, do this separately. But in fact, you have to do them together.

Yi Ding [00:09:20]: So, for example, you look at these words, right? Like Bill Ford. He's the CEO of Ford Motor Company. You've got Ford Motor Company and you've got the Ford Mustang, right? Each of these words, even though they say Ford, means something very different, and unless you're able to do this kind of domain fusion, you can't do it. Another example, this is something we tried out at a hackathon, actually. My address is 899 El Camino Real, right? That means one thing. That's personally identifiable information. That's something that's private to me. The restaurant is on El Camino Real.

Yi Ding [00:09:57]: That is no longer personally identifiable. Nobody cares, right? And then you say, I hiked the Camino de Santiago last summer. That's not even a street, right? That's this long-ass hike in Spain that some people do because, you know, they're kind of nuts, right? So in these situations, if you said, okay, I have the best recognizer for El Camino, you still do not get the actual meaning of the words, and you can get the wrong entities out. But if you put this into ChatGPT today, you could do this right after this, and it'll figure out exactly which of these is an address and which of these is your personal address. It's trivial, right? So I think domain fusion is going to be the killer application for multimodal LLMs. I don't think we're there yet. Okay, missed the slide there. Okay, so back up a little bit. What is LlamaIndex again? LlamaIndex is about connecting LLMs and data, right? What data are we talking about? So there we go.

Yi Ding [00:11:02]: What data? We're talking about multimodal data. Right. So who here has heard of RAG? Okay, we've got a lot of people. Okay, so what does RAG mean? Retrieval-augmented generation. Okay, so those of you who have never heard of RAG now understand RAG, right? No, no. Okay, so it is retrieval-augmented generation. I have a better acronym. Right? Better.

Yi Ding [00:11:25]: Output With Search. Okay, you're getting better output with search. So, BOWS. Okay, in this case, we're going to take our data, which is the multimodal data, and we're going to make it into vectors, which are also called embeddings. Why? Because, once again, statisticians. Then you can search over them, and then give them to the LLM. Okay, this is what RAG is about. Multimodal RAG, right? Multimodal data searching.
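
The embed-then-search step just described can be sketched in a few lines. This is a toy illustration only: the `embed` function below is a bag-of-words counter standing in for a real learned embedding model, and the corpus, file names, and descriptions are all invented.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. Real systems use a
    # learned embedding model; this just illustrates the mechanics.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Pretend multimodal corpus: image files paired with text descriptions.
corpus = {
    "bus.jpg": "a yellow school bus parked on the side of the road",
    "lion.jpg": "a male lion with a large mane resting in the grass",
    "warbler.jpg": "a small yellow warbler perched on a branch",
}
index = {name: embed(desc) for name, desc in corpus.items()}

def retrieve(query: str, k: int = 1) -> list:
    # Search: rank every stored vector against the query vector.
    q = embed(query)
    ranked = sorted(index, key=lambda n: cosine(q, index[n]), reverse=True)
    return ranked[:k]

print(retrieve("yellow bus on a road"))  # → ['bus.jpg']
```

In a real pipeline, the top-ranked items would then be handed to the LLM as context; that final generation step is omitted here.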

Yi Ding [00:11:51]: Okay, over multimodal data. But how do we do multimodal RAG? So there are a couple of methods. The first one is: convert the image to text. You can do this with LLaVA, you can do this with Andrew Ng's model, and then you retrieve the descriptions using vector similarity search, just with text, the same way we've been doing with LlamaIndex ever since I joined, not that long ago. Okay, we have an example here. The second way is to use image embeddings. So there are embedding models that are tuned for images.

Yi Ding [00:12:21]: And actually, you can search them with both images and text. So, like, OpenAI has the CLIP embedding model, and then you have Google Vertex, and there are a bunch of other ones, new ones coming out all the time. We have an example of that, too. Okay, the question is: is that it? And this is a real problem, right? Because, remember, we are pretty good at this already. We're pretty good at converting images to text. So if you're saying, well, I'm just going to use my multimodal LLM, this thing that costs a lot of money and was developed for hundreds of millions of dollars, just to do this, it's kind of wasting it. I think a few of the speakers today have talked about this. It's like, we can already do this.
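
The second method relies on a shared embedding space, CLIP-style: images and text are mapped into the same vector space, so one index can be searched with either. The sketch below fakes that space with hand-assigned vectors; the file names, the `IMAGE_VECTORS` table, and the keyword-based `embed_text` are all stand-ins for what real image and text encoders would produce.

```python
# Hypothetical shared embedding space: image vectors as an image encoder
# might produce them (values are invented for illustration).
IMAGE_VECTORS = {
    "bus.jpg":  (0.9, 0.1, 0.0),
    "lion.jpg": (0.1, 0.9, 0.1),
}

def embed_text(query: str) -> tuple:
    # Stand-in text encoder: keyword-triggered directions in the space.
    vocab = {"bus": (1.0, 0.0, 0.0), "lion": (0.0, 1.0, 0.0)}
    v = [0.0, 0.0, 0.0]
    for word, direction in vocab.items():
        if word in query.lower():
            v = [a + b for a, b in zip(v, direction)]
    return tuple(v)

def search(query: str) -> str:
    # Because both modalities live in one space, a text query can rank
    # image vectors directly by dot product — no captions needed.
    q = embed_text(query)
    score = lambda img: sum(a * b for a, b in zip(q, IMAGE_VECTORS[img]))
    return max(IMAGE_VECTORS, key=score)

print(search("a photo of a lion"))  # → lion.jpg
```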

Yi Ding [00:13:13]: What more can we do? So I'd like to talk to you about a few terms that maybe you haven't heard of yet. It's called generation-augmented retrieval-augmented generation. Okay, see, this is why you come to this talk: GARAG, right? Okay, so here's the problem. A question comes in from the user: what is the unique facial feature of the animal known as the king of the jungle? All right, you can't really do this with either way of search. I mean, this is what it is, right? But you can't do it with text embedding or image embedding. You won't get the right result.

Yi Ding [00:13:55]: So what do you need to do? You first need to generate the animal from the LLM, and then you can retrieve the animal. Right? Ta-da. So, GARAG: go try it out at home. But wait, there's more. Let me tell you about another cutting-edge technique: retrieval-augmented generation-augmented retrieval-augmented generation.
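
The generate-then-retrieve idea can be sketched as below. The LLM call is stubbed out with an alias lookup; in a real pipeline you would prompt a model with something like "Return only the name of the animal in this question." The alias table and image store are invented for illustration.

```python
# Stand-in for LLM knowledge resolving indirect references.
ALIASES = {"king of the jungle": "lion", "man's best friend": "dog"}

def generate_entity(question: str) -> str:
    # Generation step (stubbed LLM): resolve the indirect reference
    # to a concrete entity name that the index actually contains.
    for alias, name in ALIASES.items():
        if alias in question.lower():
            return name
    return question

IMAGE_STORE = {"lion": "lion.jpg", "dog": "dog.jpg"}  # invented index

def garag(question: str) -> str:
    animal = generate_entity(question)   # generate first...
    return IMAGE_STORE[animal]           # ...then retrieve

print(garag("What is the unique facial feature of the animal "
            "known as the king of the jungle?"))  # → lion.jpg
```

The point of the stub: neither a text nor an image embedding of "king of the jungle" reliably lands near "lion", but one generation step before retrieval makes the lookup trivial.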

Yi Ding [00:14:21]: RAGARAG. You know, the key here is, like, let's say I call this RAG-a-RAG; you can come up with your own way of saying it. Okay, so this is the question I asked: what is the color of the warbler that was declared extinct in 2023? The problem is, at the time, the LLM did not have access to that latest information. So what do you do? You first say, given the following query, return only the name of the animal as a JSON string. And then I say, okay, here's the answer to the query. And it says: warbler. Right.

Yi Ding [00:15:02]: Animal: warbler. Then I give it some context that I retrieved from my vector database, which says the Bachman's warbler is likely extinct. Right? And then, by giving it the context, it can find the image of the Bachman's warbler. And then it can say, okay, well, now I have the image, I have the context, I know what color it is. Right. But all kidding aside, okay, so the future of multimodal is very bright, but there are so many things yet to be figured out. When it comes to multimodal, we are just on the cusp, and if you look at the results, you know, the benchmarks and that sort of thing, multimodal is way behind where text is, maybe two or three years behind where text LLMs are.
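
The RAGARAG flow from the warbler example alternates generation and retrieval steps. In this sketch, every store and every LLM call is a stub invented for illustration; in practice each `llm_*` function would be a real model call and each store a vector database.

```python
TEXT_STORE = {"warbler": "The Bachman's warbler is likely extinct."}
IMAGE_STORE = {"bachman's warbler": "bachmans_warbler.jpg"}

def llm_extract_animal(question: str) -> str:
    # Generation 1 (stubbed): "return only the animal name as JSON".
    return "warbler"

def llm_name_species(context: str) -> str:
    # Generation 2 (stubbed): read the retrieved context and name the
    # exact species so the image index can be queried.
    return "bachman's warbler"

def ragarag(question: str) -> dict:
    animal = llm_extract_animal(question)   # generation
    context = TEXT_STORE[animal]            # retrieval (text context)
    species = llm_name_species(context)     # generation
    image = IMAGE_STORE[species]            # retrieval (image)
    # Final generation (stubbed): answer from image plus context.
    return {"image": image, "context": context}

result = ragarag("What is the color of the warbler "
                 "that was declared extinct in 2023?")
print(result["image"])  # → bachmans_warbler.jpg
```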

Yi Ding [00:15:53]: So I think there are some very interesting things to be done with multimodal. I hope you join us, I hope you contribute to LlamaIndex. And that's my talk. All right, we have a question here. So, yeah, basically, I think what we've seen at LlamaIndex is that most people want to combine their data with the LLM in some way, right? If you think about multimodal data, we talked about video data today, we talked about image data. How do you combine that data, that multimodal data, with your LLM efficiently? I don't think anybody's really figured that out yet. The search that we need to do over video or over images is very different

Yi Ding [00:16:40]: than the search that we can do with text. I think that's something that needs to be figured out, and people are probably working on it right now at, like, Stanford.

Q1 [00:16:48]: Do you have a trademark on RAGARAG

Yi Ding [00:16:52]: And all those words that you came up with? Do I have a trademark? Yes. Is it registered? No, not yet. Have you seen any practical applications of people who want to do GARAG or something? You know, actually, GARAG, all kidding aside, GARAG and the like is already a technique that people are using with text. Doing some kind of generation before doing the search is something people are already doing. For example, one of the things people do is query modification before doing the search, and they can do it with an LLM. So that's actually something that's already being used. I haven't seen anybody else talk about stuff like RAGARAG.

Yi Ding [00:17:39]: You all are here, so hopefully at your next talk you can bring it up.

Q1 [00:17:45]: Also, with me playing around with AI agents and stuff like that, it seems like that's the future, with them doing multiple queries where they're retrieving the image and then going through that search process. So it seems like, in essence, it's just putting in a lot more time and energy and processing power and having a longer process. In the future, when we do a query, it's going to be processing for a much longer time, even as we get more efficient. So that's what RAGARAG does.

Yi Ding [00:18:12]: Yeah, absolutely. Absolutely. But there's another problem here, which is the accuracy. Every time you go back to the LLM, you lose accuracy. So it's an unsolved problem still. But, so, we have actually built a video retriever with LLaVA in it. So it's kind of like... How should we get to talk with more people in the community? Yeah, absolutely. Well, you know, submit a PR or just message me.

Yi Ding [00:18:40]: Right. We're very, very happy to grow the community. That's actually our number one purpose as a company: to grow the community. And we're really happy to feature anything that people are doing that's interesting in this space. Yeah. So I would say, you know, our core things, right: one, growing the community; two, we've got to make this stuff production-ready.

Yi Ding [00:19:04]: Right. So there are some interesting things that we're working on that probably won't be ready for production for, like, three years. But then there are some very core pieces that can be used in production today, and we really need to make those production-ready. So LlamaParse is a great example of that. In some ways, you can think of it as multimodal, but LlamaParse was born out of the fact that we tried to build a bunch of our own RAG applications and we found that the existing solutions weren't good enough. So we built LlamaParse to try to bridge that gap. And we're doing it sort of at cost, so that everybody in the community can benefit. Go ahead.

Yi Ding [00:19:41]: Yeah. So the order of the context is interesting. Yeah. I mean, and also, you think about, like, that's where maybe re-ranking will come into play. Now the question is, like, you know, who's done re-ranking with multimodal? Nobody. Right. Exactly. Exactly.

Yi Ding [00:20:01]: No, I mean, that's why it's sort of an unsolved problem, right? Like, I don't have a good answer for you, I don't have that solution for you, right? But these types of techniques, query decomposition and query rewriting, are techniques that we are actively working on, that we've already worked on, and also techniques that we know other people are using. So we know for a fact that query decomposition is used; in fact, with the new Assistants v2 API, they talk about it, they use query decomposition, or what we call the sub-question query engine. It's a little bit... "decomposition" probably rolls off the tongue a little bit better.
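
The sub-question idea can be sketched very simply: split a compound question into sub-questions, answer each one separately, then combine. In this toy version, the splitter and the per-question answerer are stand-ins for LLM and RAG calls; the splitting rule (on " and ") is a deliberate simplification.

```python
def decompose(question: str) -> list:
    # Stand-in for an LLM that splits a compound question into
    # independent sub-questions.
    parts = question.rstrip("?").split(" and ")
    return [p.strip() + "?" for p in parts]

def answer_one(sub_question: str) -> str:
    # Stub: in practice this would be a separate RAG query per
    # sub-question, each with its own retrieval.
    return f"[answer to: {sub_question}]"

def sub_question_query(question: str) -> str:
    # Answer each sub-question, then combine (a real engine would
    # synthesize the pieces with one final LLM call).
    return " ".join(answer_one(q) for q in decompose(question))

print(decompose("What color is the warbler and where does it live?"))
# → ['What color is the warbler?', 'where does it live?']
```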

Yi Ding [00:20:42]: But yeah, these are techniques we know people are using, but we probably haven't figured it all out yet. Thanks so much.

