Create a Contextual Chatbot with an LLM and a Vector Database in 10 Minutes
Raahul Dutta is an MLOps Lead at Elsevier. He has more than eight years of experience transforming Jupyter notebooks into low-latency, highly scalable, production-grade endpoints, and he has implemented and exposed more than 30 ML/AI models and pipelines. He has previously worked at Oracle, UHG, and Philips. He is a proud author of 13 patents in the ML, BMI, and chatbot domains. He enjoys riding motorbikes and lives in Amsterdam with his partner.
Building a chatbot is not easy. We need: an embedding model that translates questions into vectors, a vector database to search, and an LLM to generate the answers.
We can orchestrate the job using LangChain with minimal development.
Awesome. So hi guys, my name is Raahul, and I'm talking from the community in Amsterdam. In the next 10 minutes I will try to show you how you can create a contextual chatbot with an LLM and a vector database.
First, a little bit of information about the use case we are trying to solve, and about us. My name is Raahul and I work at Elsevier. Elsevier is one of the largest research publishing organizations, and we have lots of research articles.
Sometimes we have clients who ask us to provide information about a topic over the last year. For example, a policymaker from the EU might come to us and ask: what has happened in lithium in the last one or two years?
Traditionally we would need to build a team, and that team would read all the lithium-related documents in Elsevier, prepare a policy paper, and forward that policy paper to the EU. With an LLM we are trying to reduce this expensive manual effort: using LangChain, a vector database, and an LLM, the tool can read all the lithium-related documents and generate the policy paper.
At the end of the slides we will show you one of our generated policy papers. So that was the use case. We are a team of three people; our intern and I worked on this part. Now, about the implementation: this is the regular, basic LangChain architecture, but we have changed it a little bit in a couple of places to increase the quality of the prompt and to reduce the latency problem.
Normally in this architecture, if you are a user, you talk with the UI, and the UI calls the LangChain orchestrator. Whenever you prompt something, that prompt first goes to the embedding model, which gives you the embedding, and with that embedding it goes to the vector database.
We are using Qdrant. The vector database returns the three or four most similar documents, according to your parameters. (You are asking me to go into presentation mode? Okay, is this one better? I think so. Good, thank you. Sorry.) Those retrieved documents are then sent to the LLM, together with the prompt, to generate the answer. We have made a small modification to this flow; where exactly, we will come back to later.
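A minimal sketch of that flow, assuming a sentence-transformers embedding model, a local Qdrant instance, and an illustrative collection name (these are stand-ins, not necessarily the exact components used in the talk):

```python
# Embed the user prompt, retrieve the most similar documents from Qdrant,
# and hand them to the LLM as context.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative model
client = QdrantClient(host="localhost", port=6333)                        # local Qdrant instance

question = "What has happened in lithium research over the last two years?"
query_vector = embedder.encode(question).tolist()

# Qdrant returns the three or four most similar documents for the prompt embedding
hits = client.search(collection_name="lithium_articles", query_vector=query_vector, limit=4)
context = "\n\n".join(hit.payload["content"] for hit in hits)

# the retrieved context plus the question is then sent to the LLM to generate the answer
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```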
The first part is embedding, and embedding normally has two parts. The first is that when you are building the vector database, you need to provide the vectors for all the documents. We had around 500K, half a million, documents related to lithium, and we found that if we used the built-in solution from Qdrant and LangChain, it would be really slow for us.
So we used our own Kubeflow, Kubernetes-based pipeline, which is mainly based on quantization and which gives us the result for thousands of documents in a couple of minutes. We used that pipeline to grab the embedding vectors, and we stored the embedding vectors in a NumPy array, just like a list, and dumped it somewhere.
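A rough sketch of that offline embedding step, assuming a sentence-transformers model (the model name, batch size, and file path are illustrative, not the pipeline from the talk):

```python
# Encode documents in batches and dump the vectors to disk as a NumPy array.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# in the talk's use case this would be ~500K lithium-related articles
documents = [
    "Advances in lithium-ion battery recycling ...",
    "Solid-state electrolytes for lithium batteries ...",
]
vectors = model.encode(documents, batch_size=256, show_progress_bar=True)
np.save("lithium_vectors.npy", vectors)  # dumped once, uploaded to Qdrant afterwards
```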
After that, we upload these NumPy vectors to the Qdrant database. Why did we use Qdrant? Because we found the documentation really good, and it supports multiple languages: you can write your code in Rust, or you can write your code in Python. There are a couple of ways to upload your NumPy vectors to Qdrant; we used the last one, because with it we uploaded our 500K database in a few minutes.
We fine-tuned the index optimizer and memmap threshold values a little to get better latency when searching for a prompt, and we deployed the Qdrant Docker image in our cluster, so that under heavy load it auto-scales, it can scale down to zero, and everything is accommodated in the K8s cluster.
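A sketch of that upload-and-tune step with the Qdrant Python client; the collection name, payload, and optimizer values are assumptions, not the tuned values mentioned in the talk:

```python
# Load the pre-computed NumPy vectors into Qdrant, tuning the optimizer / memmap threshold.
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(host="localhost", port=6333)
vectors = np.load("lithium_vectors.npy")

client.recreate_collection(
    collection_name="lithium_articles",
    vectors_config=models.VectorParams(size=vectors.shape[1], distance=models.Distance.COSINE),
    optimizers_config=models.OptimizersConfigDiff(memmap_threshold=20000),  # illustrative value
)
client.upload_collection(
    collection_name="lithium_articles",
    vectors=vectors,
    payload=[{"content": f"document {i}"} for i in range(len(vectors))],  # real payload: article text
    batch_size=256,
)
```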
The third part is the LLM fine-tuning. We tried a couple of models and got good results with several of them: OPT-6.7B gave good results; Bloom, not such good results; RedPajama we are trying out now. We have also fine-tuned Falcon-7B-Instruct on our dataset, and we got really good results.
For the fine-tuning we did a bit of coding: we built our own framework where we can pass any type of model through PEFT. This is the model architecture; it supports Bloom, it supports Falcon, it supports LLaMA-type models. For PEFT you just need to target some of the layers of the model.
This is the configuration. When you are generating the PEFT model, you can debug anything from this pipeline: you can see your trainable parameters, and you can load your models from here. We successfully fine-tuned one of our models with this half-million dataset.
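A minimal sketch of such a PEFT/LoRA configuration, assuming Falcon-7B-Instruct as the base model; the rank, alpha, and dropout values are illustrative, and the target module names differ per model family:

```python
# Wrap a base model with a LoRA adapter that targets a few attention layers.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct", trust_remote_code=True
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon layer names; LLaMA/Bloom use different ones
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # shows the trainable vs. total parameter counts
```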
It took around 30 minutes to fine-tune the model; we did it for Falcon-7B, and we got really good results. For some of the models, like Bloom, we are trying to encapsulate them with the ONNX optimizer and try an inference server, but for this demo we are just using FastAPI.
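A minimal sketch of serving the model behind FastAPI for a demo; the route, request shape, and model id are assumptions, not the talk's actual service:

```python
# Expose a text-generation model as a simple HTTP endpoint.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="tiiuae/falcon-7b-instruct", trust_remote_code=True)

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 512

@app.post("/generate")
def generate(prompt: Prompt):
    output = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"completion": output[0]["generated_text"]}
```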
Cool. And for the UI, we just used Gradio to show it. So I think now it's demo time. Before this meeting I built a Jupyter notebook to show everything. This is the part where we embed, and this is the Qdrant local client where we initialize the vector database.
It's really simple: which collection name you want to target and what the content payload is. It takes a couple of seconds to load. Then the LangChain init method is called when you start the conversation. This endpoint URL is where your LLM model is hosted, and you can pass parameters of this type to it.
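A sketch of what that notebook-style initialization might look like with LangChain's Qdrant wrapper; the collection name, payload key, endpoint URL, and wrapper classes are assumptions and depend on the LangChain version:

```python
# Point LangChain at the local Qdrant collection and at the hosted LLM endpoint.
from qdrant_client import QdrantClient
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant
from langchain.llms import HuggingFaceTextGenInference
from langchain.chains import RetrievalQA

client = QdrantClient(host="localhost", port=6333)
vectorstore = Qdrant(
    client=client,
    collection_name="lithium_articles",   # which collection to target
    embeddings=HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    content_payload_key="content",        # which payload field holds the document text
)

llm = HuggingFaceTextGenInference(
    inference_server_url="http://llm-endpoint:8080/",  # where the fine-tuned model is hosted
    max_new_tokens=512,
    temperature=0.1,
)

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
```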
This is a little bit of the code for the Gradio web interface, and when you run it you will see this page, where you can find all of your parameters. We have a prompt dictionary; we use that prompt dictionary to generate the whole document, and one of our team members has written these prompts to get the output.
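A small sketch of a Gradio front-end with such a prompt dictionary; the dictionary entries are illustrative, and it assumes the `chain` object from the sketch above:

```python
# Let the user pick a canned prompt, combine it with their question, and run the chain.
import gradio as gr

prompt_dict = {
    "Policy paper": "Write a policy paper on lithium based on the retrieved articles.",
    "Executive summary": "Summarise the retrieved lithium articles for EU policymakers.",
}

def answer(prompt_name, question):
    # build the full prompt from the dictionary entry and run it through the retrieval chain
    return chain.run(f"{prompt_dict[prompt_name]}\n\n{question}")

demo = gr.Interface(
    fn=answer,
    inputs=[gr.Dropdown(choices=list(prompt_dict.keys()), label="Prompt"),
            gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Generated document"),
)
demo.launch()
```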
So I am pressing the first one. It'll take couple of seconds to Janet, the first output.
Why is this taking some time? I don't know. Yeah, it is done now, at least. So this is the output. With this type of prompt we have tried different prompting styles, and we found that some prompts work really well for some of the models. So for different models we have different prompts.
For the Falcon one we are using these prompts, and we got this output. There is a disclaimer, a title, an executive summary, recommendations, an introduction; everything has been generated by the LLM, and the quality is really good. In the completion we have also appended the source documents, so you can see where these articles came from.
And now we are good: we can provide this document to our product owners to review. So this is the LangChain chatbot that we have built in-house, using our own infrastructure.
I think I am done.
Awesome, thank you so much. Here, let me take your screen away. That was great, thank you. How did it feel? Good, I think. Okay, cool. All right, well, make sure to jump over to the chat; people might have some questions for you. Thanks so much, we'll see you later. Thank you. Yes, see you. Bye.