MLOps Community

Fresh Data, Smart Retrieval: Milvus & Jina CLIP Explained

Posted Jun 19, 2024 | Views 286
# Milvus
# Jina AI
# Zilliz
SPEAKERS
Stephen Batifol
Developer Advocate @ Zilliz

From Android developer to Data Scientist to Machine Learning Engineer, Stephen has a wealth of software engineering experience. He believes that machine learning has a lot to learn from software engineering best practices and spends his time making ML deployments simple for other engineers. Stephen is also a founding member and organizer of the MLOps.community Meetups in Berlin.

Andreas Koukounas
ML Engineer Intern @ Jina AI

Andreas is in the final stages of his studies at the Technical University of Athens, where he first delved into machine learning by working on natural language processing (NLP) projects for the Greek language at the Athena Research Center. Currently, he is interning at Jina AI, contributing to the development of multilingual and multimodal embedding systems aimed at enhancing cross-language and unimodal information retrieval.

Saba Sturua
Machine Learning Engineer @ Jina AI

Saba is an ML Research Engineer specializing in language technologies. Over the past two years, he has been a key member of the Jina AI team, focusing on the development of state-of-the-art embedding models. Passionate about open-source, Saba advocates for it as the most effective approach for advancing Artificial Intelligence, ensuring accessibility and collaborative innovation in the field.

Ben Epstein
SUMMARY

Keeping Data Fresh: Mastering Updates in Vector Databases

Have you built a RAG (Retrieval Augmented Generation) system, but now face challenges with updating data? In this talk, we will explore how updates are managed in vector databases, focusing particularly on Milvus. We'll cover the challenges and practical solutions for maintaining data freshness in scenarios requiring high throughput and low latency. By the end of this session, you'll understand the mechanics behind data updates in Milvus and how to ensure your database remains both accurate and efficient.

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Does your CLIP model under-perform on text-only tasks? In this talk, we will introduce our novel multi-task contrastive training method designed to bridge this performance gap. We developed the Jina CLIP model with the goal of connecting long documents, queries, and images in one embedding space. During this session, you will gain insights into the methodology behind our training process and the implications of our findings for future multimodal and text-based retrieval systems.

TRANSCRIPT

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

Ben Epstein [00:00:06]: All right, we are live and we have so many people. Welcome. Thanks everybody for joining. We have a really good session up today, all about fresh data, keeping things fresh, pulling data out of a database, RAG, all the good stuff that you want to know about. And with us today we have Stephen, who's a developer advocate at Zilliz and also a friend of the community running all of the Berlin meetups. So if you're in Berlin, look out for him. He's setting up sessions. And we have Saba and Andreas from the ML engineering team at Jina who are going to show us some really cool things about CLIP embeddings.

Ben Epstein [00:00:39]: We have a lot to cover today, so we're actually just going to jump right into our first talk, which is Stephen's, and we'll take some questions from the audience and then we'll jump into the second talk and look at a couple demos. So it should be really great. Thanks everybody for taking the time and joining us. Stephen, whenever you're ready, your session should be up and you can just jump right into things.

Stephen Batifol [00:01:02]: Perfect. Thank you very much. Yes, I'm going to start, and today I'm here to talk to you about how to keep your data fresh, mastering updates in vector databases in particular. RAG is very popular and a lot of people are building demos showcasing how you can do RAG, but no one really talks about how to update your data. Everyone is always like, you insert it and then that's it. I'm going to try to show you how you can understand better how things work. We're going to go quite deep into the technical parts of vector databases. Let's get into it.

Stephen Batifol [00:01:44]: We're going to talk about how we insert the data, how we update it, and the different tips I can give you so you can update your data. Let's get started. I'm going to start with myself. I'm Stephen Batifol. I'm a developer advocate at Zilliz. If you have any questions related to this talk, to Jina AI, to Zilliz, Milvus, anything RAG, or the Berlin branch of the MLOps community, reach out to me via email, LinkedIn, Twitter. I'm everywhere. I tend to reply very quickly.

Stephen Batifol [00:02:15]: I'm going to talk about Milvus, what we are and what we're doing. Milvus is an open-source vector database. It's actually part of the Linux Foundation AI & Data branch, and you might be familiar with it. We've been here since 2017 and a lot of people use it; we have a lot of stars on GitHub. What's cool as well is that we're built for scale, so usually our biggest customers are at billions and billions of vectors. We have different features: you can use dense and sparse embeddings, filtering, reranking, different things. And something we introduced just two weeks ago: we now have Milvus Lite, which allows you to use Milvus directly on your laptop.

Stephen Batifol [00:03:05]: So let's say, oh, in a notebook, you know, also like Google Collab. But all you have to do is build this look, and then you don't have to worry about using Docker or deploying it on Kubernetes. What we were recommending in the past, everything will work locally, everything can work everywhere. We have integration with different AI toolkits, the usual LangChain haystack as well. I'm not going to list them all. There's Gai as well, which we work with, especially for the embeddings and rankers. But yes, we work with a lot of AI toolkits. That's all about the introduction for Milvis, what we are.

Stephen Batifol [00:03:51]: I'm going to go more into the details. I'm just going to talk quickly about RAG. And I guess if you're here, you probably used it already in the past. I'm going to introduce it quickly and then we'll go more into the technical parts. RAG means retrieval augmented generation. The basic idea is to use RAG to force the LLM to work with your data. So you inject it into a vector database and then you can work with your private data. The basic architecture is that one.

Stephen Batifol [00:04:23]: So you have your data, you extract the content from your data, and it can be anything: it can be PDFs, it can be images, it can be different kinds of content. Then you chunk your data, embed it, and store it in a vector database like Milvus. Then you have your query from your user, you embed this query, and you do a semantic search with Milvus as well. Then, at least for the basic RAG architecture, we're going to return the top-k results, the most similar data. We add this back to the context, we give it to your LLM, and then we give you a response. So that's the basic RAG architecture. Everyone has been talking about it for the past year.
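To make that flow concrete, here is a minimal sketch of the basic RAG loop just described. The embed() and generate() functions are placeholders standing in for whatever embedding model and LLM you use, and the collection and data are toy values.

```python
# Structural sketch of the basic RAG flow: chunk -> embed -> store -> search ->
# add context -> generate. embed() and generate() are placeholders only.
from pymilvus import MilvusClient

def embed(text: str) -> list[float]:
    return [0.0] * 768            # placeholder: call your embedding model here

def generate(prompt: str) -> str:
    return "answer based on: " + prompt   # placeholder: call your LLM here

client = MilvusClient("rag_demo.db")
client.create_collection(collection_name="docs", dimension=768)

# 1) Extract, chunk, embed, and store your data
chunks = ["chunk one of a PDF", "chunk two of a PDF"]
client.insert(
    collection_name="docs",
    data=[{"id": i, "vector": embed(c), "text": c} for i, c in enumerate(chunks)],
)

# 2) Embed the user query and run a semantic search
question = "What does chunk one say?"
hits = client.search(
    collection_name="docs",
    data=[embed(question)],
    limit=3,
    output_fields=["text"],
)

# 3) Add the top-k results back to the context and ask the LLM
context = "\n".join(hit["entity"]["text"] for hit in hits[0])
print(generate(f"Context:\n{context}\n\nQuestion: {question}"))
```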

Stephen Batifol [00:05:07]: Everyone is working on it. It's quite cool. Now we have more advanced RAG architectures, but that's not the main goal of this talk. So, yes, keep that in mind. It's going to be quite important. But then also, when I usually talk about vector databases in general, people are always like, why even use one? I'm not going to list everything, but there's search, and then there's high-performance search, and then there are other things that are needed, like CRUD operations.

Stephen Batifol [00:05:39]: So I create read updates and delete can be a bit tricky if you don't use a vector database. Like you might have to rebuild the index yourself, you might have to delete the data yourself. It can be quite tricky and you might spend all your time actually doing that instead of working on something else. Data freshness, which is what we mostly got to talk about today to ensure that your data remains up to date, how to update your data and different things, persistence, availability, scalability as well. You might struggle to scale if you're using know like a plugin or if you're using something that is like face or something, for example. So usually we can help with that. Of course, you know, like you introduce something new in your stack, but it could be really useful. Also there's the data management part.

Stephen Batifol [00:06:28]: So you know, like how do you do data ingestion, indexing, querying, you know, those things can be a bit complex to do. And then there's backup and migration as well. So you know, how do you back up your data, how to make sure you don't lose it? Those things can be important. Then there's the way to deploy. So like if you deploy on cloud or on premise, it can be a bit tricky as well. Observability, multi tenancy as well. Those are like different points. So then just list them.

Stephen Batifol [00:06:57]: But yes, and then vector plugin, some can be good and it can honestly work for like, you know, if you have a bit of data, but then if you actually go at one point to like scale, scale can be quite hard. You know, like maybe they don't offer filtering, you know, like you might not be able to filter through non vector attributes. Hybrid search might be also like difficult. You know, like if you want to combine your vector search with your keyword search to have like a better search experience, maybe they'll offer that range search as well, which is a different thing. Scalability concerns. They usually can struggle to scale at one point or you might actually recreate a vector database in the end to make it scale. And also everything is changing so quickly that it may be hard to catch up and to follow. With that in mind, I am going to introduce the design principles of Milvis.

Stephen Batifol [00:07:51]: Basically, what we did and what we have is that storage and compute are fully separated. We also use different storage systems so that you never lose your data, and it also doesn't stay in memory all the time. So we use MinIO or S3 to store your data, and I'll go more into the details later. But you will also have some data that is live, which is then pushed to the object storage. We also have microservices, so you can scale by functionality, meaning that on this graph you see the query nodes, the data nodes, and the index nodes. Those are independent of each other and you can scale them up and down depending on your use case, depending on your needs. Let's say you have a lot of data to process, you have a lot of data to work with. I don't know, you have a Spark job that is coming.

Stephen Batifol [00:08:46]: Maybe you want to scale up your data node so that you can process everything, but you don't really need to scale up your query node because you're not really running any queries at the moment. Those are the idea which we have and it's really everything is dependent. So if you have a big lot of data coming you won't struggle with it and it won't have an impact on your index or on your query. So yes, that's the architecture that's the same but with a bit more detail. And what I was saying before is that you have the query node which is cycled in red here and this one is here to basically help you saving search requests. So it subscribes to, let's say if you use Kafka then it subscribes to it. So you have real time updates for all the data that is coming and then it's going to convert this data into what we call segments and I will talk about those later. But they basically in memory structures that are temporary.

Stephen Batifol [00:09:46]: And then it also has access to historical data that you store an s three or mini IO or whatever you want. This is also what allows you to perform hybrid search to combine vector and scalar data. Then you have the data node which you can see in orange. This one is processing the data updates. This is what we're going to use to actually perform the updates. That's what we do. This one is performing the mutational request. It stores the different snapshots that you may want to have.

Stephen Batifol [00:10:19]: This one is really here for the data updates. Then in purple you have the index nodes which is building the search indexes really. This one is here to make faster search operations. What's very good with this one as well is that it can be implemented using a serverless framework. You can really scale it up and down. If you don't need to build a new index, you can fully scale it down and then you won't have to pay for it. That's what I was saying before, which is like everything is independent. So that's the good part.

Stephen Batifol [00:10:53]: So yes, that's basically the architecture for the entity. This is basically what's very important for Mealworld and vector database in general. We all have our own entities. And you'll see you have a primary key, a partition key which allows you to split your data into different partitions. Then you have a fixed field which is, which supports different data types, numeric JSON arrays or different things. You have then dynamic fields which allows you to store everything in JSON. This one you can store everything that you want. It's dynamic as the name is mentioned.

Stephen Batifol [00:11:33]: It's really if you want to store some metadata or something and you don't really know the schema, then you can put it in dynamic field. You have then the vector fields and then you have different things that are reserved for the system which helps you with concurrency in general. I said we would go a bit into the details. This is one part of the Mailvis data layout. You have shards, you can think of those filling cabinets for data. Basically you divide your data into sections and those are called shards. They act like a filling cabinet and they are always managed by a supervisor, by a leader. And this supervisor is responsible for adding new information to the shards, storing the data in a safe place.

Stephen Batifol [00:12:15]: So I was mentioning s three before. That's what is doing. It also is the one that is serving the latest information for the SAS request. And then it's forwarding historical data requests as well to the other cabinets if needed. So the other query nodes we have then segments, which is the magic of metaverse usually, which is like you have two kinds of segments so you have the growing segments and you can think of those as like inboxes for the new information. Those are in memory and they're here for like super fast access and updates. So that makes sure that you always have access to the freshest data that is available that you put in mailbus. Yeah, if you have something to update it's got to go into growing statements and then we're going to update it and then you have access directly to it.

Stephen Batifol [00:13:01]: Once we have enough data or once you do some different actions then we move from a growing segment to a sealed segment. And this one is immutable so this one will be stored on s three or on disk, but it's not solid in memory. And then we build an index on top of it and then you can search through it very quickly. But yes, so as a summary you have growing segments for the latest updates and still segments for everything that is historical where people like where we built an index on top of it and yes, so then how the data gets added and accessed. So you have sharding which is like the large datasets are divided into smaller sections and then we write everything into the log broker. So like Kafka and you can think of it as like as a list of to do list for the data nodes. And then the data nodes, they add the new data to the shots, they remove the outdated data if needed. They flush the accumulated data in a powerless storage and then we create segments, so growing segments and then we shield them and then we build index on top of it.

Stephen Batifol [00:14:08]: And that's how it works basically. That's how what happens when everything happens when you add data to your geo vector search and vector database in general. So that's also the thing of like when you build a POC you may be able to use face for example, or different index, different, sorry, libraries. But then it's actually a lot of work to actually make sure that, you know, everything work and everything can scale. And that's why you might want to use a vector database. Then you have the data query, which is, as I said before, you know, it helps you like search through your data and give you like results. So this one is regular search for like different collections for like a k number of vectors that are near a target vector. All the vectors as well within a specific range as well depending on your query, depending on your data query.

Stephen Batifol [00:15:00]: Basically that's how it works. But then how do we handle data updates? And I'm going to talk in millbus in particular here, but then the idea is the same. A couple of things to avoid, I would say. First if you want to be able to update your data and your documents is to avoid having auto increments for primary keys or different things like that because then you won't be able to find them again. Or if you go through the same document, then if you're incrementing you might be lost at one point. Also avoid generating ids based on something you don't know. What I mean by that is that sometimes if you use different tools like lambda index, time chain or something, they will actually generate an id for you. But just be careful on how they generate this id.

Stephen Batifol [00:15:51]: Sometimes they can be very useful. I'm not saying it's bad, but they can generate id based on the document. But then you have to make sure that you know what the document is. What is the definition of a document for Lamindex online chain. So those things are very important because then you need them to be able to update your data. So then you have different strategies. You have unique identifiers. As I was saying before, you can use documents or an image or vector, but you have to make sure that each piece has a unique identifier.

Stephen Batifol [00:16:26]: Maybe you want to assign a uuid to your documents or to your vectors so that you can then identify them, retrieve them when you want to make an update. Maybe you want to do metadata tagging as well. You want to do like you want to attach some metadata to each data and then you can do some quick search. So example, you know, that created author or content type and then you can filter on those. You know, you can be like, okay, I want to filter on all the, everything that has been written by John Doe or you know, every content, like every research paper or everything that had been created on a specific date. Then you can search for those and then it can be really helpful if you want to update everything, every document that has been written on the 6 June, for example, you can also have composite identifiers. So how to have multiple fields to create then a unique one? I don't know when you would need that, but you can be this user for this specific date with a specific timestamp. And then you can be like ok, I want to update everything that has happened for this specific timestamp for this user because I don't know, there was something wrong or you have new data but this user.

Stephen Batifol [00:17:39]: So you can also do this. Then you have hierarchical ids as well. So if you have some complex data sets in general you can use id like project a chapter one, section two, and then let's say you're going to use a new embedding model and then you want to update your data for this specific section. Then you could update only this specific section. So those are usually useful ways, I would say, to update your data. And that allows you to basically still be the one in charge for your data. If you don't have that, then for example for this section, depending on your chunking strategies or depending on how you divide everything, you might not be able to refine this one actually because blank chain might give you an id for this section, for this chunk maybe, but then if you have a different chunk strategy, then you will never be able to find it again. Basically, let's say you have those different strategies.

Stephen Batifol [00:18:48]: If you want to do an update with my mailvis, we have an upset. So you can have this id, this data with id zero with a vector and then with different information, id one with a different vector, id two with a different vector. And then you're going to say please upset for this specific collection and with that data. So then mailves as we saw before. They will check for your data. They will check if you have this idea already in Melvis. If you don't, then we're just going to insert it. And if you do, then we can update it.

Stephen Batifol [00:19:24]: Thanks to this id that you provide. You can also, as an example, you can also do it for a partition. If you decide to divide your data per partition, then you can do the same as we did before, except that you're going to add the partition name that really helps you to filter the data that you don't want and to only take the one that you want. You can also, as an example, this is how to upset entities. With luncheon you have your documents and those ones. You have specific documents and then you add the metadata and then you're going to add all those documents to Melvis. And then you can say please for all the id in one, two, we're going to do like here, we're deleting and then we do an upset. So that's like.

Stephen Batifol [00:20:14]: It's also a way to do it with lunch. That's just an example that I just to showcase basically what we can do. And yeah, basically what happens then behind the scene is that you need to identify the data that needs an update. So make sure you have the primary key within. Transfer the data to the proxy. We send the absence request to move us and then we go to the proxy service. Then we talk with the data coordinator which checks if you have a primary key. Then we do an upset.

Stephen Batifol [00:20:47]: And so as I said before, either you update or you add the new document to the collection. And then it's not over yet. You have the data allocation indexing. So you add your data in the growing segment. Then we assign the segment to a data that is assigned to the segment. Then we create sequence number so that we can store it. Then we flush the segment at one point when it becomes too big. And then you have the segments and then we write it to us.

Stephen Batifol [00:21:19]: Three puts an index on top of it. And that's then how you can have access to your updated data, and that's how you can update your data in general, when you have a rack system and make sure that you have the latest data available, that's what goes behind the scene. I think I am out of time, so I'm just going to say thank you. Check out realvice website, check us out on GitHub as well. And if you have any questions, you can connect with me. Thank you very much.

Ben Epstein [00:21:54]: Thanks, Devin. Awesome talk. Thanks so much for doing that. That was really interesting. I didn't know that Melvis had a totally local version that you could run. Would that be. Would you ever think about using that for production? Or would it very much just be, like, for development testing things?

Stephen Batifol [00:22:08]: Yeah, it's really for development testing. It's like. It's like, it's also limited offering, so it's really cool. You know, like, you work on your own laptop and, you know, you try to do your PoC for your own rack system or something, and that's usually how you use it.

Ben Epstein [00:22:23]: Yeah, it would also be nice potentially for some testing, like for CI testing.

Stephen Batifol [00:22:27]: That can also be. Yeah, that's actually a good idea.

Ben Epstein [00:22:30]: Yeah, super. Love that. Cool. Awesome. Thanks so much. Yeah, I think, yeah. Good timing on that. We'll jump in to let Saba and Andres give their talk next.

Ben Epstein [00:22:42]: Steven, I'll take you off, and we'll see you in a bit.

Stephen Batifol [00:22:45]: Thanks. Thank you.

Andreas Koukounas [00:22:46]: Okay.

Saba Sturua [00:22:46]: Hi, everyone. I'm Saba, and I'm here with Andreas. We are machine learning engineers at Gina AI, and we work on developing state of the art embedding models. For example, last year we produced the first open source embedding model that supported eight k sequence lengths. And after that, we produced a couple of more bilingual models as well as rerankers. You can see all of them on our hugging phase, and we also support API if you want to use it. And today we are very excited to present our latest model, which is a multi model embedding model, Gina clip. So basically, I will start talking about what clip is.

Saba Sturua [00:23:37]: I'll give you a general introduction into what we did with our Gina clip, and then Andreas will go into more details and explain training procedure more thoroughly. And lastly, I'll show you a very simple demonstration on how to use Gina clip with vector search. So what is clip? Today, what we call clip is a family of models that are contrastively trained on language and image. So basically, clip models learn a joint representation for texts and images. This enables us to solve tasks like multimodal retrieval, also zero shot image classification and many other tasks. This is especially relevant today for, I would say, multimodal retrieval that is widely used in rag applications. So you basically have text or image as a query. You encode it into a shared representation.

Saba Sturua [00:24:43]: You have also set off images or texts that are also encoded. And then you search through your text, images or both together to find the most relevant one. I mentioned contrastive training, and maybe I'll give you a very short introduction on what this means. Clip is trained contrastively. As you can see, we have clip contains two sub models, like two encoders, one for text and another for image. And what we do is that during training, we gather a lot of text image pairs. This can be text images with their captions, for example. And then we encode both modalities with their corresponding encoder.

Saba Sturua [00:25:31]: And after this, our aim is to maximize similarity between embeddings of image and its corresponding text, which we call positive examples, and minimize similarity between image and any other random text that we have in our batch, which we call negative examples. So that's where the word contrasted comes from. Basically, click learns relationship between texts and images by contrasting this positive and negative examples. And yeah, the bigger the batch, the more positive examples we have as well. The more negative examples we have. So it's usually better to have a big batch. Okay, so that's how clip is trained. And now maybe a question like why did we decide to train our own clip model? So basically, we noticed that there are some deficiencies in the text tower, the text text encoder, part of the, of the clip model.

Saba Sturua [00:26:41]: Mostly these deficiencies are that the previous clip models, in most of the cases, the text tower weights are randomly initialized. And during this contrastive learning that I described in the last slide, the this text tower doesn't, do not have, does not have enough text during training to learn and understand the language properly. So eventually we got a text encoder that does not have good language understanding capabilities. Also, the text hour is optimized contrastively to images only and not to other texts. Another issue is that most of the image text datasets that exist contain captions of the images. But these captions are usually very short. And you can imagine when you train the text encoder and only pass the very short captions, then this encoder cannot generalize to longer text. And also it's difficult to represent the language and understand the language with only short text captions.

Saba Sturua [00:27:52]: And also previous works to save computational costs. They clipped the contest context length and they used a very small context length. So for example, the smaller models use 77 and the bigger ones use 32, which is not enough for a good text performance. And as a result, most of the clip models do not have strong text encoders. And in production this is a bit inefficient, because when we have models, we not only have cross model retrieval, for example, but we also learn to use our models with respect to text only tasks. And since the clip model and the text encoder of the clip model does not work well with the text only tasks, in production, developers have to store two separate vectors for their text. One will be one which is used with respect to images and is encoded by the click text encoder, and another that is used with respect to other text and text only tasks. And this one.

Saba Sturua [00:29:09]: For this one, they usually use the text only embedding models. So it basically tried to improve this deficiencies. And the main idea and this will. Andreas will talk about this more thoroughly. But what we changed is that now, instead of only contrastively training on text and images, we add another objective and we jointly optimize our model on text image pairs as well as text only tasks. Right? So, as you can see on the illustration, during training, we train our model on text pairs only and use the inference c laws for that. We also do what I already described in the last slides. This text image contrasted learning.

Saba Sturua [00:29:58]: And then we jointly optimize on these two objectives. So our training is multistage. We have three stages, so I'll shortly cover right now, and Andreas will talk about each of them in more details. So basically, we start with, as I said, we have two objectives. In the first stage, we use mostly image text pairs with short captions, because that's most of the data we have and that is publicly available. On the second stage, we improve text image cross model understanding by adding long caption, adding text image pairs with long text. And on the third stage we use. Instead of pair tuning, we use triplet tuning to further improve the text only performance of our model.

Saba Sturua [00:31:00]: So this is the basic overview. And now I'll let Andreas to continue and talk about the training in more details.

Andreas Koukounas [00:31:08]: Okay, great. I am Andreas. Hi. So, as Saba started telling you, we have like a multi stage training. The first stage we did the usual clip training that Saba explained before. And basically we initialize our two word encoders through zine abert model. That is also the backbone for zine embeddings. V two that supports eight k context length.

Andreas Koukounas [00:31:39]: And the image encoder was initialized with Eva two vision transformer these two entities have different losses. So we train jointly on multimodal tasks that is captured and image and also we train on a text pair. From the text we get an infra c loss, whereas from the image text pairs we get a clip loss. I will tell you the difference in a moment. And combining these two losses, we get the final loss for our first step. For our first stage, basically the pairwise training for the text to text data, we choose the data mainly through scraping the web and some datasets. We use the math semantically related pairs like titles and paragraphs or questionnaires and datasets and then we do a pre processing to mainly filter out the pairs that couldn't help us. This has rule based filtering that we mainly removed from the data URL's and things like that.

Andreas Koukounas [00:32:56]: The duplication to remove duplicates from the dataset and also code systems filtering that removes the bad pairs from our data sets that we selected them but with not really good quality. So this text pair is input to the text encoder. As you can see, we basically get a burn, have the embeddings produce the embeddings through the model and then compute the cosine similarity. This creates a table similar to what samba showed you at the first slide where we choose the diagonal for the positive pairs and the non diagonal for the negative samples. So for the clip loss, for the clip part we used the lion 400 million dataset. That's a data set of 400 million pairs, image text pairs from a common crawl between from 2014 to 2021 and this data set, the original common crawl is like twelve bit 13 billion pairs. And to keep the 4 million pairs we just use the original OpenAI clip to filter like the most similar pairs to keep to have the better quality from this huge data set. So how the clip looks in a similar way.

Andreas Koukounas [00:34:34]: Here we have on the left side we give us input to the text encoder caption and to the image encoder and image through our models and we compute the cosine similarity as the first way. So the difference between inference C and clip loss that we say is that clip loss has just learned parameter that changed during the training. But other than that the losses are the same for the first time. For the first stage of training, that is the biggest one as well. Basically the main contrastive learning training as no clip models is done. We trained with a context length of 77 for computation costs in order to fit a big batch size of 32k for a total steps of 60k. This this means for around with. So during training, 2 billion pairs, around 2 billion pairs of famous text and 2 billion pairs of text to text pairs.

Andreas Koukounas [00:35:42]: After this stage we just took the checkpoints and what we changed here is like we wanted to give our model the capability to understand longer texts. To do that we found long caption dataset which was like 1.2 million Nimas text pairs. And the captions are from 100 to 750 tokens. As you can see, this task created with shared captioner, basically they took image text pairs that they were already on the common pool. They added their models are captioning and this is the generated captions. So here we also have the link if you want to check the model. And so we use this model to get the long captions. So for this second stage tuning we trained much.

Andreas Koukounas [00:36:43]: We didn't trade so long as the first one, as we did have also so much data. The text pair here are the same and the image text pairs are the long captions. Totally context length, 512 for 7000 steps and for the step stage that we basically wanted to have, all the previous methods were like unsupervised methods. But now for the, for our text encoder to get better text to text capabilities we need to do a supervised training. And that's why you use the pairs plus hard negatives for the, for the image and text. For the multimodal pairs, they remain the same thing, the data set. So here, what are the triplets? Basically we might, we mine triplets in order to. Right now we don't have, we don't create our negatives.

Andreas Koukounas [00:37:48]: It's like the previous state that we have positive pairs and negative pairs, but we don't create our negative pairs randomly from the data set. We have already decided what is the negative example. And so this on this stage we could do a contrast of training by comparing answer with security without positive example against the answer and all the negative examples. So in that way we want to get our query positive example together in the space, whereas throw move away from the, from there the negative examples. This was also not a long training stage. We trained for, again with a context length of 512 for 12,000 steps. Our models basically has trained at the context of 512. But because of alibi that is used on Zinaberte implementation, this can extrapolate till eight k, till the eight k codex length.

Andreas Koukounas [00:39:00]: And it can be supported in a pretty nice way. This was basically our model. What we wanted to achieve with this model was mainly to have one common model for multimodal retrieval and text to text retrieval in order to save all the, save all the embeddings in one space. And here you can see the performance differences. As you, as we told you before in the text, to text the original clip models doesn't perform well, whereas here Zenaclip V one has a performance closed to Zenebedx v two on the MTB retrieval data sets. And we also achieved to have slightly better performance on all the multimodal tasks, on all multimodal retrieval tasks than the original clip version. So at the end, what we wanted to achieve, and we must achieve with this paper, was like to encode both teammates and texts in one paper, in one model, one model, basically to do all the jobs we needed before in a rag pipeline. We managed also to achieve a strong performance text driven, and we are pretty pleased with that.

Andreas Koukounas [00:40:21]: If you want, you can find our model here on hiding face or through API. And now Saba will give you also a presentation for a small demo sheet we created.

Saba Sturua [00:40:34]: So basically this is a very short and simple demonstration on how you can use Gina clip. Initially I wanted to have a multimodal rag example with Lama index, but it looks like Lama index does not yet support open source multimodal embeddings from hugging phase. So I had to choose a bit more like a simplistic demonstration. In this case, what we're going to do is we'll use Gina clipt with our open source library dockery. Dockery supports in memory very simple vector store and we're going to use this vector store to do some vector search on our multimodal documents. Basically I start with installing required libraries. Then we need to load the Gina clip model which you can do like that. This one important thing is that you have to pass the trust remote code equals to true, because this Gina clip V one relies on our custom implementation, which you also can see on the hugging face.

Saba Sturua [00:41:50]: So it's open, but we have to pass this. So the model is loaded. Now, regarding data, this is, as I said, is very simple example. So I have a list of some texts regarding different animals and I have image URL's also regarding some animals. What I'll do is that I'll create in memory exact neighborhood neighbor index using dockery. Basically what this is is that this is an in memory index which loads all the vectors you have in memory when you use it. This is good for the cases where you don't have a lot of documents because it doesn't need some extra dependencies. And also it doesn't use any approximate search algorithms.

Saba Sturua [00:42:44]: It just iterates over all possible vectors you have in the index and finds the most relevant one. In case of if you have short, if you have small amount of documents, this is a pretty good choice for Rag, and not just for Rag, but also for vector search. What we do is that we define a document schema. In our case, our document will contain a modality, so it might be text or image, it will contain an embedding. So these two fields will be required. And also depending on which kind of document it is, it might contain the text, so the original sequence or the URI which will be the URL of the image. So I define the schema and I pass it to the in memory document index. To create a document index using doc array, yeah.

Saba Sturua [00:43:43]: So the next part is to actually create those documents and index them. What I do first is that I create the text documents. So for each sequence in the list of the sequences I have, I first of all embed the dot sequence by calling the model doc text. So after you load the model from hugging phase, you can use encode text to encode your text documents. And I feel the other keys as well. After my text documents are ready, I'm just indexing them in my document index and I'll do the same for image documents. The only change is that instead of encode text, what I'm calling encode image, and then we can index the documents we have. It's been indexed.

Saba Sturua [00:44:35]: Now what can we do? We can search using our queries. Let's let this be our query, this Buddha frog statue. We first need to encode this using our clip model and then we can call the find function, passing the query, passing the field that we're doing the search on. So it should be the embedding field. And also for this case I'm just going to retrieve one document and let's see what it returned. So it returned this text sequence that was regarding the frog as well that we had here. Now if we want to search through only images, for example, what we can do is build a bit more sophisticated query which also consists of filtering. So basically first we filter on the modality, we only keep the images and then we do the same find only on the images that we have left.

Saba Sturua [00:45:36]: Limit is also one. So this will return one document only. And hopefully this will return the image of the frog, which it does. And yeah, basically this is a very short example of how to, well, first of all load our model from hugging phase and utilize it with a very simplistic case of vector store. But you can do this for other vector stores as well. Or I mean, once Lamindex supports this kind of models on Lamindex as well. But if you still want to use it with Lama index, what you can do is so it already supports Gina API. You can go to Gina AI embeddings generate API key.

Saba Sturua [00:46:22]: You will have 1 million free tokens initially and you can use it very easily. So I think that's it from the demo side. And now maybe if there are any questions or we can also discuss that.

Ben Epstein [00:46:42]: Thanks so much for showing that. That was awesome. I really liked seeing it. I have to imagine you could have also done that using Milbus and the new like on system vector database to store those Gina embeddings, right?

Saba Sturua [00:46:56]: Yeah, yeah, of course.

Ben Epstein [00:46:57]: Super cool. That's awesome to see. How did you guys start working together? Gina and Zillus?

Stephen Batifol [00:47:04]: I invited Bo, which is head of engineering, I think, at Gina AI, to an event because I, so I'm based in Germany and I work like german english documents and Gnai has like an embedding model, which is multilingual, so you can use it with like German and English. And I started using it and see, and I've seen like how much better it was than the other embedding models. So then I reached out, I was like, hey, do you want to talk about it at a meet up? And then they did. And then I was like, oh, actually it could be cool to have direct integration of JNA into Millwoods. And so now we have it as well for like, yeah, the different embedding models and renkers as well that Jinnah is offering.

Ben Epstein [00:47:45]: That's super full. Is the model that you guys were just showing showcasing in Gina a multilingual model as well, or is it focused on English?

Saba Sturua [00:47:53]: So right now, this clip model is only English, but we want to also develop a multilingual clip.

Ben Epstein [00:48:02]: That's super cool. Yeah, I don't think I've seen a multi, I'm sure there is one, but I haven't seen a multilingual, multimodal model.

Stephen Batifol [00:48:08]: Yeah, it would be cool.

Andreas Koukounas [00:48:09]: Yeah, there are some, but the problem is, like, there is not really easily supported in the hiding phase. There are some different problems and the problem will just hit.

Ben Epstein [00:48:23]: Yeah. Are you guys working sufficiently down at, for example, the tokenizer level for non english languages, or is it something you take from the open source community when you're working on these models?

Saba Sturua [00:48:35]: We had our own BERT implementation as well, including this tokenizer and everything. We also experimented with the existing backbone models such as Excel and Roberta and, yeah, it's in progress right now for the multilingual ones. So the bilingual ones that Stephen mentioned, it was our own, so like, starting from zero in Tokenizer and everything was based on our own implementation.

Ben Epstein [00:49:00]: That's really cool. That's awesome. What's coming up on the agenda? I guess for both for Milvis and for Gina then.

Stephen Batifol [00:49:08]: Yeah.

Saba Sturua [00:49:09]: So I guess we're trying to become more multilingual in both fronts, like just text only models as well as multimodal embedding models. And that is probably the main direction we have right now.

Stephen Batifol [00:49:25]: Yeah. And for us, it's like even more things with Manvis Lite in general and different integration with partners, I think.

Ben Epstein [00:49:36]: Very cool. And you know, when I was, the last time I was using clip, it was a little while ago, but I was using an Onyx compiled version of club embedding model. Are you guys working in that direction as well for like, the lower compute available teams?

Andreas Koukounas [00:49:50]: I think there are onyx integrations fighting face models for sure.

Ben Epstein [00:49:57]: Nice.

Andreas Koukounas [00:49:58]: We have Onix indications.

Ben Epstein [00:50:01]: Great. I mean, we're now we're pretty much at time. I really appreciate all of you guys joining and everybody who's listening, and thanks for Zillow for sponsoring this session. It wouldn't be possible and all the future ones that we do. So thank you very much and have a good one. Thanks, guys.

Stephen Batifol [00:50:16]: Thank you for hosting. Thank you, everyone. Bye.
