MLOps Community

Building Multimodal RAG

Posted Jun 17, 2024 | Views 175
# Multimodal Models
Hamza Farooq
Founder @

Reading text might be easy with LLMs; however, when you are reading PDFs with charts and tables, you only process the text and not the charts. In this talk, Hamza explores various techniques using multimodal models to ingest important charts and tables, and shows how you can make them part of the RAG architecture.

Darshil Modi
AI Researcher Co-op @

Starting from being a certified ethical hacker, publishing a cybersecurity awareness book, and delivering seminars and webinars on ethical hacking, to becoming an ML engineer at Teksun Inc., the journey has been enthralling. During his college days, Darshil was keenly interested in the cybersecurity domain, but when he later got the opportunity to explore AI/ML, he was equally fascinated. Darshil started his career as an AI/ML intern at Hypeteq Solutions and later served as a full-time employee. For career prospects, he transitioned to an ML Engineer L2 role at Teksun Microsys. Within two years of experience in the cognitive domain, Darshil got the opportunity to work on amazing projects based on computer vision, NLP, and time-series prediction.



Hamza Farooq [00:00:00]: Hi everyone, I'm Hamza. I am founder and CEO of Traversal AI. We are a small company, and what we do is build enterprise-level RAG, multimodal, and whatever AI we can add to it. So for solutions, we offer full-stack products and API services to our customers. I'm joined by Darshil. Do you want to give your intro?

Darshil Modi [00:00:23]: I work at Traversal AI as a research intern. I have about four years of experience with AI and machine learning, and recently I've been working...

Hamza Farooq [00:00:31]: So who has been doing chunking? Put your hands up if you've been chunking. Put your hand up if you hate chunking, right? Is LlamaIndex here? So LangChain is not good, right? So chunking is this weird stuff that we have to do so that we can ingest a lot of data. The biggest problem that we have with chunking is that there's a local context problem, there's a global context problem, there are all sorts of problems. Every time we try to retrieve data, we just don't get what we want. So we will cover one specific thing over here: the multimodal nature of PDFs and how we are able to understand them. So how many of you have tried or worked on ingesting one PDF? One, two? How about 200? Yeah, that guy right there.

Darshil Modi [00:01:32]: Right?

Hamza Farooq [00:01:33]: How about 1 million? So what happens is, when you are reading PDFs, there is a lot of weird kind of data in PDFs. Sometimes there's handwritten stuff that you need to ingest through OCR. Sometimes there is a table, sometimes there is a chart, sometimes there is a figure or a diagram. There are a bunch of things. So I'll give you a very simple example. We've been working with some customers who want to read 10-K data. Now, there are examples of how you can read 10-K data through PDFs, but believe me, anybody with a private equity or finance background will tell you they don't work.

Hamza Farooq [00:02:11]: So with the way we do chunking and the way we break our data, when we try to read PDFs, we are not able to get the exact data. So what we did is build our own architecture. This is a paper that we recently read, a deep dive into semi-structured and multimodal RAG architectures. And what we did, and this is the most basic thing, because of course we're not going to give you all the proprietary stuff for free, but the basic thing is that we first differentiate text from images. That is the first step that we do, and it's very important to do that, to be honest. When you're trying to read a PDF, the data is not read the same way you read basic text. So first we separate the PDF's text data from the figures, charts, and all the other things. And then we apply a lot of metadata to the text part, because we want to make sure...

Hamza Farooq [00:03:13]: So we use an LLM to write a lot of metadata about it so that the retrieval becomes better. And a lot of times we end up embedding the metadata of the text chunk rather than the actual text chunk, because you are able to get better results when you add metadata which explains what's in the chunk. The second part that we have done is that when we take a certain set of images, the technique that we use is Unstructured. Do you guys know what Unstructured is? Anyone? The Unstructured.io company. So Unstructured is a great company. They have built a product in which you can separate out images from text. And then we used two models.
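The first separation step, routing extracted PDF elements into text and image buckets, can be sketched roughly like this. The element dicts and type names are illustrative stand-ins for what a library like Unstructured returns, not its actual API:

```python
# Sketch of the first step: route extracted PDF elements into text vs. visual
# buckets so each modality gets its own processing path. The element dicts
# below stand in for whatever the extraction library returns.

def split_elements(elements):
    """Separate text blocks from figures, charts, and tables."""
    texts, visuals = [], []
    for el in elements:
        if el["type"] in ("Image", "Figure", "Table", "Chart"):
            visuals.append(el)
        else:
            texts.append(el)
    return texts, visuals

elements = [
    {"type": "NarrativeText", "content": "iPhone 14 overview..."},
    {"type": "Image", "content": "<bytes of energy-efficiency chart>"},
    {"type": "Table", "content": "<battery-life table>"},
]
texts, visuals = split_elements(elements)
print(len(texts), len(visuals))  # 1 text block, 2 visual blocks
```

From here, the text bucket goes to the metadata-and-embedding path and the visual bucket goes to the vision-model path described next.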

Hamza Farooq [00:04:00]: Do you want to go to the next slide? Do we have anything? So we use basically two models. We use fine-tuned versions of LLaVA. LLaVA is this really good model which is able to understand how to read and explain images. And the second thing is we use the GPT-4 Vision model, which is really good. And to be honest, that's the only thing from OpenAI that we use. We have tried to use open source to build our non-open-source proprietary architecture, and we hope to eventually make all of what we do open source.

Hamza Farooq [00:04:38]: But at this point, what we're doing is that we use a lot of embedding models, which are of course all open source. Mixedbread, if some of you might have worked with it, we actually use that as the embedding model, and the GPT-4 Vision model is used to create the structure of the data. And then we combine those two things. So we take the text part. If you imagine, with Unstructured we take each PDF, we divide the text from the diagrams and charts and everything. Then we embed the text, or actually we embed the metadata of that text. And then we take the charts and the diagrams, and we use the GPT-4 Vision model to ingest that data and convert it into a text format too.
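The metadata-embedding trick on the text side can be sketched like this. `llm_summarize` and the toy bag-of-words "embedding" are hypothetical stand-ins for a real LLM call and a real embedding model such as mixedbread's:

```python
# Sketch of embedding an LLM-written metadata summary instead of the raw
# chunk: the summary is what gets embedded and retrieved, while the original
# text is kept to feed the final answer.

def toy_embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def llm_summarize(chunk):
    # Hypothetical stand-in: a real system asks an LLM to describe the chunk.
    return "summary: " + chunk[:40]

def index_chunk(chunk):
    meta = llm_summarize(chunk)
    return {"text": chunk,        # original text, returned to the LLM later
            "metadata": meta,     # what actually gets embedded
            "embedding": toy_embed(meta)}

record = index_chunk("The iPhone 14 display uses an energy-efficient OLED panel.")
```

The design choice here is that a short description of what is in the chunk often matches user queries better than the raw chunk text itself.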

Hamza Farooq [00:05:23]: And then we create separate embeddings of them, and we have metadata on that too. And what we have done is automate this process so that we can do millions of documents. We have not tried millions of documents, I think the most we have tried is 500,000, but we've been able to automate that stuff so that we're at least able to get 100% of the information from the PDF, as opposed to the current way of reading PDFs, which is to just skip a lot of data. Now Darshil will give you a demo. He has been working on it, so I wanted him to give an overview. Good.

Darshil Modi [00:05:56]: So here, just for demonstration purposes, we have taken the iPhone 14 brochure, which we are going to work on. If you look at the PDF, it contains a lot of text, images, document-like photographs, and tables; it has a lot of information in different formats, and we are going to work on it. So what we have done is, as Hamza suggested, we split it up into two parts: we extracted text and images separately. What Unstructured does is use a vision model to extract the text blocks and image blocks, use an OCR model to identify whether a block is an image or a table, and then split them into different documents. We then iterate over all the text documents to create their embeddings, and similarly we iterate over the images to create theirs. And here you see the summary of those images. So the images that we extracted, we passed to the vision models, GPT-4 and LLaVA. We tried two different models, but we ended up using GPT-4, and we summarized each image.

Darshil Modi [00:07:02]: So each image has a summary attached to it, and then we converted the summary text into an embedding. So now we have text converted to embeddings, and images converted to summaries and then to embeddings. And then we tried this search query: "energy efficiency of iPhone 14." What you see is that we were able to extract three blocks. The first one is text, and if you see, it has relevant context in it. The second one was again a text block, which also has some information about it.

Darshil Modi [00:07:32]: And third one, interestingly, is an image which is completely relevant to energy efficiency. So all these three was extracted from the, from the vector database on which we stored our embeddings. And so this is how you can convert text and images into vector embeddings and then perform drag over it. This is not the, of course this is not the final version of the rack. So if you pass this information to the LLM, it can further convert it into an wonderful response for you. But this is like end to end pipeline which we are working on. Thank you. Thank you.

Darshil Modi [00:08:07]: We are open for questions. So the question he asked is: how do you correlate between different elements, like images and text, since they are interrelated? What we are doing is not just storing the embeddings. If you have used LlamaIndex, you know they kind of create a tree of nodes. So while we create the embeddings, we correlate them: we create a tree-node kind of structure, so that while extracting, it does not extract just the one particular node, but all the nodes attached to it.
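The node-linking idea can be sketched like this; the names are illustrative, not LlamaIndex's actual API:

```python
# Sketch of linked retrieval: related text and image chunks are stored as
# nodes that point at each other, so retrieving one node also pulls in its
# linked neighbours (e.g. a paragraph and the chart it references).

class Node:
    def __init__(self, node_id, content):
        self.node_id = node_id
        self.content = content
        self.links = []   # ids of related nodes

def retrieve_with_links(hit_id, nodes):
    """Return the matched node plus every node linked to it."""
    hit = nodes[hit_id]
    return [hit] + [nodes[i] for i in hit.links]

nodes = {0: Node(0, "paragraph about energy efficiency"),
         1: Node(1, "image: efficiency chart")}
nodes[0].links.append(1)   # the paragraph references the chart

results = retrieve_with_links(0, nodes)
print([n.content for n in results])
```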

Hamza Farooq [00:08:39]: So position encoding, sort of. Any other questions? You know, honestly, sometimes we just go with FAISS; we just pick the most basic thing. Our job is to figure out the cheapest way to host, and if you have less than a million embeddings, we just put it on FAISS and build a retrieval system on top of that. Because you're hosting it yourself, it's very fast, very efficient, it does a great job, and you can add metadata attached to that.
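The "just use a flat index" approach can be sketched as below. NumPy stands in for FAISS here; at under a million embeddings, brute-force search like this (what FAISS's flat indexes do) is typically fast enough:

```python
# Minimal sketch of flat-index retrieval: normalize vectors so inner product
# equals cosine similarity, then brute-force score every stored embedding.
import numpy as np

def build_index(embeddings):
    x = np.asarray(embeddings, dtype=np.float32)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def search(index, query, k=2):
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q                    # cosine similarity to every chunk
    top = np.argsort(-scores)[:k]         # best k, highest first
    return top.tolist(), scores[top].tolist()

index = build_index([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
ids, scores = search(index, [1.0, 0.1])
print(ids)  # -> [0, 2]
```

Metadata can then live in an ordinary Python list or database keyed by the same ids, which is what makes self-hosting this so cheap.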

Q1 [00:09:10]: So in the retrieval process, are you using only the text to retrieve the relevant chunks, which have text and images, or do you also use the image to retrieve the content?

Darshil Modi [00:09:22]: We can use both. So basically we are just comparing the embeddings, the search embedding with the embedding stored in the vector database. So if you pass on an image as a query, we will summarize it, convert it into embeddings and then search over the vector database.

Q1 [00:09:40]: Given that you've done a lot of work in text embeddings, and given that you are now into image embeddings inside the PDF stage, what would you say are the biggest challenges that you had to work on?

Hamza Farooq [00:09:50]: Right. Because the problem is that we basically did recursive chunking, or all those chunking methods, and we literally had to get rid of all of them. We had to figure out a custom way of approaching that. And for each document, we now have something we have created called semantic chunking. I mean, there is already something called semantic chunking, so it's a similar name, but we've made our own version of it, because we analyze each document and decide how we want to process that one.

Darshil Modi [00:10:25]: Another challenge that we faced is converting charts. Most financial documents have complex charts, and summarizing them is difficult right now; image models are not very accurate on that.

Hamza Farooq [00:10:38]: Yeah. So the great thing we did there is that for 10-K data, the SEC also has Excel-based data for all the charts. So we actually started ingesting that data, and we built a text-to-SQL model on top of it. And it works without hallucinations, so that's been pretty good. And the best thing for us is that we have Internet search built into our RAG architecture, so we can confirm results through Internet search. We'll take one more question.
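The text-to-SQL path can be sketched as below. `llm_to_sql`, the table, and the figures are all hypothetical; the point is that the answer is computed by the database rather than generated, which is why it cannot be hallucinated:

```python
# Sketch of text-to-SQL over structured filing data: load the tabular data
# behind chart exhibits into SQLite, have a (stubbed) LLM turn the question
# into SQL, and let the database compute the answer.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE revenue (year INTEGER, segment TEXT, amount REAL)")
con.executemany("INSERT INTO revenue VALUES (?, ?, ?)",
                [(2022, "iPhone", 205.5), (2023, "iPhone", 200.6)])

def llm_to_sql(question):
    # Hypothetical stand-in: a real system prompts an LLM with the schema
    # and the question, and validates the SQL before running it.
    return "SELECT amount FROM revenue WHERE year = 2023 AND segment = 'iPhone'"

rows = con.execute(llm_to_sql("What was iPhone revenue in 2023?")).fetchall()
print(rows)  # -> [(200.6,)]
```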

Q2 [00:11:09]: You may have covered this, but I'm curious: for the image chunks, it makes sense that each image is a chunk. How do you determine how you chunk the text? Is it sentence by sentence with the embeddings?

Darshil Modi [00:11:23]: It depends on the format of the PDF. As I said, since it's using a vision model, an OCR vision model, if, let's say, there is a heading in bold and then some text attached to it, it automatically extracts that as one chunk. If there is some content on the side, it extracts it as another chunk.
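The layout-aware chunking Darshil describes can be sketched like this; the element kinds are illustrative placeholders for what an OCR/layout model emits:

```python
# Sketch of layout-aware chunking: a heading and the text that follows it
# become one chunk, instead of splitting on fixed character counts.

def chunk_by_heading(elements):
    """elements: list of (kind, text) pairs in reading order."""
    chunks, current = [], []
    for kind, text in elements:
        if kind == "heading" and current:
            chunks.append(" ".join(current))  # close the previous chunk
            current = []
        current.append(text)
    if current:
        chunks.append(" ".join(current))
    return chunks

elements = [("heading", "Battery"), ("text", "Up to 20 hours of video."),
            ("heading", "Display"), ("text", "OLED, 2000 nits peak.")]
chunks = chunk_by_heading(elements)
print(chunks)  # two chunks, each a heading plus its body text
```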

Hamza Farooq [00:11:41]: And we have to accept something, right? We can't just look into each one of them. So we do allow it, but we run a benchmark from the dataset itself to test how we perform. Awesome. Thank you, everyone. This was 15 minutes.

Darshil Modi [00:11:54]: Thank you.

