MLOps Community

Beyond Text: Multimodal RAG for Video

Posted Jun 24, 2024 | Views 206
# LLMs
# Multimodal RAG
# VideoDB
Anup Gosavi
Cofounder @ VideoDB

Using behavioral psychology, design, and data, Anup builds products to solve big problems. He led multinational design and development teams to ship two products from concept to MVP to millions of users.


Large Language Models (LLMs) excel with text but fall short in helping you consume or create video clips. Constructing a RAG pipeline for text is relatively straightforward, thanks to the tools developed for parsing, indexing, and retrieving text data. In his talk, "Multimodal RAG pipelines for video", Anup addresses the challenges of adapting RAG to video content, which combines visual, auditory, and textual elements and requires more processing power and more sophisticated video pipelines.



Anup Gosavi [00:00:00]: Hi everyone, I'm Anup. I'm the co-founder of a company called VideoDB, a video database, and we are building LLM-ready video infra. A little about me: I'm a product designer at heart. Though I'm an engineer by education, I no longer have coding privileges in the company, so I hope all of you will go easy on me. Before working on VideoDB, our team was working on Spext, which was a very easy to use video editor. It was browser based and we scaled it to thousands of customers and started managing tens of thousands of hours of video every single day. We ran into a lot of infra challenges, and all the learnings we have had while solving them, we have put into VideoDB. Today's talk is about beyond text, right? Multimodal RAG for video. And let's start with the obvious, right? For multimodal RAG it is still day one.

Anup Gosavi [00:00:58]: And while the input is multimodal, the output is still text. That is why, with video, LLMs don't really support simple requests. You can't say, hey, show me when the deliveries were made last week. Maybe your package was damaged. You can't upload your drone footage, which is silent footage, and say, hey, create a voiceover for this. Or ask things like, where is Elon talking about rockets? And the reason why this doesn't work is because video is more informationally dense than text, and it has to be indexed in a number of ways depending on the type of video. You can use spoken words, audio, vision, faces, objects, and there are domain-specific indexes like sports. So it's very complicated.

Anup Gosavi [00:01:50]: Let's take a simple example, right? Show me where the deliveries were made last week. There is no real audio here, so we can focus only on the vision. Now there are a number of steps that the LLM has to sort of do. It has to first transform the videos into a series of images so that they can be given to the vision model for object identification and person identification. You then have to find the relevant timestamps and rank them. Then you have to clip on the basis of those timestamps. You have to merge those clips together and then finally make the result consumable, either as a download or as a streamable format. This number of steps tells you that video RAG is not just search and retrieval.
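As an editorial illustration of the "find the timestamps, then clip and merge" steps, here is a minimal Python sketch. It is not VideoDB's implementation: the function name `merge_intervals`, the one-second `gap` tolerance, and the sample detections are all assumptions made for this example. It only shows how overlapping or nearly adjacent hits from a vision model could be collapsed into a short list of clips.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def merge_intervals(hits: List[Interval], gap: float = 1.0) -> List[Interval]:
    """Merge overlapping or nearly adjacent intervals.

    Two intervals are joined when the next one starts no later than
    `gap` seconds after the previous one ends (hypothetical tolerance).
    """
    merged: List[Interval] = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical detections of "person with a package", in seconds.
hits = [(12.0, 15.0), (14.5, 18.0), (95.0, 99.0), (18.5, 20.0)]
clips = merge_intervals(hits)
print(clips)  # [(12.0, 20.0), (95.0, 99.0)]
```

Merging before cutting keeps the number of clips small, which matters because each clip later has to be cut and encoded.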

Anup Gosavi [00:02:44]: It has to behave more like an autonomous agent that has perception of the type of video and does a lot of decision making. So it's an autonomous agent. The obvious question is, what about multimodal LLMs, right? They might get the intelligence piece right, the analysis piece of it, and they might even be able to give you timestamps, but you still have to figure out the delivery portion of it. So you have to figure out how to, you know, clip it, compile the video and then finally stream it. And it is likely not very efficient. We ran a 37-minute surveillance camera footage through Gemini Pro, and if you see, it used like 680,000 tokens out of the gate. So it is very easy to breach the token limit. Secondly, for every query you have to send that entire video again, which means that it is going to get expensive very quickly.
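To make the scaling concrete, the figures quoted above can be turned into a back-of-the-envelope calculation. This Python sketch uses only the two numbers from the talk (37 minutes, 680,000 tokens); the helper `tokens_for_queries` is a hypothetical name, and no pricing is assumed because the talk gives none.

```python
VIDEO_MINUTES = 37      # length of the surveillance footage in the talk
TOKENS_USED = 680_000   # tokens quoted in the talk for that footage

# Average token consumption per second of video.
tokens_per_second = TOKENS_USED / (VIDEO_MINUTES * 60)


def tokens_for_queries(num_queries: int, tokens_per_video: int = TOKENS_USED) -> int:
    """Total input tokens if the whole video is resent with every query."""
    return num_queries * tokens_per_video


print(round(tokens_per_second))   # 306
print(tokens_for_queries(100))    # 68000000
```

At roughly 306 tokens per second of footage, a hundred questions about one 37-minute video already means 68 million input tokens, which is the cost growth the talk is pointing at.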

Anup Gosavi [00:03:45]: Now imagine you have a video library of 10,000 videos. Multimodal LLMs likely are not going to help. So that's why RAG is likely to be a much, much better approach, because you can pre-process the video once based on your use case. You can then tailor the retrieval based on the user query: lower latency, lower cost. And I thought it might be fun to share what I would call a multimodal RAG starter kit, if you want to sort of think about building something of your own based on the. So you input the video to this system. Depending on your type of video, you might want to process the audio, right? There is a library called Librosa that can help you look at the spectral features, and you can identify claps, shouts or enthusiasm and stuff like that. You can figure out the images, and you can use Whisper to transcribe it and use the text.

Anup Gosavi [00:04:48]: So there needs to be a processor element. Then you have to figure out the best way to encode this information. Vector databases are still one of the best ways to sort of do it, because of the inherent scale and semantic search abilities that they have. So video pre-processing and then storing it into the vector database is one piece of the pipeline. The second piece is the retrieval pipeline, right? Based on the user query, you need to build a retriever that can use the vector database to rank all these search results and then figure out which are the most relevant clips. That is the function of the ranker. With each of these you have to carry the metadata of the timestamps. What those timestamps do is basically give you the limits where you can make the cuts based on the results.
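The retrieval piece described above can be sketched in a few lines of Python. This is a toy stand-in, not a real vector database: the `Segment` class, the `retrieve` function, and the three-dimensional embeddings are all hypothetical and chosen only to show how each indexed segment carries its timestamps so that the ranked results say where to cut.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    video_id: str
    start: float                 # seconds, kept as metadata for cutting
    end: float
    embedding: Sequence[float]   # toy vector; a real system would use a model


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Sequence[float], index: List[Segment], top_k: int = 2) -> List[Segment]:
    """Rank all segments by similarity to the query and keep the best top_k."""
    ranked = sorted(index, key=lambda s: cosine(query, s.embedding), reverse=True)
    return ranked[:top_k]


# Hypothetical pre-processed index for one camera feed.
index = [
    Segment("cam1", 0.0, 10.0, [0.9, 0.1, 0.0]),
    Segment("cam1", 10.0, 20.0, [0.1, 0.9, 0.0]),
    Segment("cam1", 20.0, 30.0, [0.8, 0.2, 0.1]),
]
top = retrieve([1.0, 0.0, 0.0], index)
print([(s.start, s.end) for s in top])  # [(0.0, 10.0), (20.0, 30.0)]
```

A production setup would replace the linear scan with a vector database query, but the returned records would still need the `start` and `end` fields for the cutting step.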

Anup Gosavi [00:05:47]: Then you get into the video generation. Based on the timestamps that you get after this ranking, you can use tools like FFmpeg to compile the clips together, and then you have to have a streaming service like Mux, Agora, and there are a number of others, right? So retrieval and video pre-processing are the two parallel pipelines that you need to build. And if you see, the pre-processing and the video generation are two very, very compute-heavy processes, and the complexities of each are very different. At least with the video pre-processing, you only have to do it once, so you have to manage the complexity only once. Where it gets really tricky is in video generation, because for every query you have to generate a new one, right? You have to generate a new MP4 file or a new clip, and that is an entire rabbit hole of its own. We don't even have the time to get there. So for every unique query you need to generate a new MP4.
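For the FFmpeg step, the following Python sketch builds the command lines for cutting one clip and for joining several clips with FFmpeg's concat demuxer. It only constructs the arguments and does not run anything; the helper names and file names are hypothetical, and the codec choice (`libx264` video, `aac` audio) is one common assumption rather than what the speaker's system uses.

```python
from typing import List


def cut_command(src: str, start: float, end: float, dst: str) -> List[str]:
    """ffmpeg arguments that re-encode the [start, end] range of src into dst."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
        "-c:v", "libx264", "-c:a", "aac",
        dst,
    ]


def concat_list(parts: List[str]) -> str:
    """Contents of the text file expected by ffmpeg's concat demuxer."""
    return "".join(f"file '{p}'\n" for p in parts)


def concat_command(list_file: str, dst: str) -> List[str]:
    """ffmpeg arguments that join the listed clips without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", dst]


# Hypothetical usage: cut two ranked ranges, then stitch them together.
print(cut_command("camera.mp4", 12.0, 20.0, "clip0.mp4"))
print(concat_list(["clip0.mp4", "clip1.mp4"]))
print(concat_command("clips.txt", "result.mp4"))
```

The lists can be passed to `subprocess.run` on a machine with FFmpeg installed. The cut step re-encodes, which is the per-query compute cost that comes with generating a new file for every query.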

Anup Gosavi [00:06:57]: To generate a new MP4, you have to do the entire transcoding process. The transcoding process is not a GPU process, it's a CPU process, right? And you have to do it every single time, for every query. With agents and LLMs, there are going to be millions of queries, millions of personalized retrievals. So it is going to get costly very, very quickly. For four years, we have sort of dealt with these challenges. Today, I don't want to get into the details of it. The joke that I always like to say is that we have converted these four years of learning into four lines of code, so that others don't have to spend that much time learning all of that. So that's the product that we have built.

Anup Gosavi [00:07:46]: So VideoDB is basically a video database. It is an abstraction on top of MP4. It abstracts storage, it abstracts retrieval, and it abstracts streaming. So the idea is that whenever you're building an AI application now, no matter what kind of input you have, you can actually get a video back. So that's sort of it. Thank you.
