MLOps Community

Retrieval Augmented Generation

Posted May 17, 2024 | Views 843
# RAG
# KiwiTech
Syed Asad
Lead AI/ML Engineer @ KiwiTech

Currently Exploring New Horizons: Syed is diving deep into the exciting world of Semantic Vector Searches and Vector Databases. These innovative technologies are reshaping how we interact with and interpret vast data landscapes, opening new avenues for discovery and innovation.

Specializing in Retrieval Augmented Generation (RAG): Syed's current focus also includes mastering Retrieval Augmented Generation Techniques (RAGs). This cutting-edge approach combines the power of information retrieval with generative models, setting new benchmarks in AI's capability and application.

SUMMARY

Everything and anything around RAG.

TRANSCRIPT

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

Syed Asad [00:00:00]: My name is Syed Asad. You can call me Asad. I am a senior, or lead, AI engineer here, heading the AI vertical at KiwiTech. KiwiTech is a US-based company headquartered in DC, with offices in New York and one of its offices here in Delhi, New Delhi, where I am right now. Okay, so I'm not a very heavy coffee drinker. I would say I love a flat white with some cream on it. That's all.

Demetrios [00:00:30]: Welcome back to another MLOps Community podcast. I am your host, Demetrios. And today, talking with Asad, this was a breath of fresh air. He gave us all the tea, as the youngsters say, on what is working and what is not working when it comes to tools. I appreciated this conversation because he did not sugarcoat anything. He said what he has been testing and what has worked and what has not worked in production, as far as open source projects, as far as tools. We mainly played at the inference layer, but he was able to go a little bit in and out. We referenced a conversation that we already had, a live conversation that we'll link in the description, where he talked a lot about RAG and the problems and challenges he was having when building RAGs.

Demetrios [00:01:20]: For those who do not know, he is a consultant and he is, as he mentions in the conversation, time bound. He's basically put on the spot and needs to get something into production as fast as possible. I really like that, because he has to go and figure out the best ways to do something right now. It's not like he can explore and then maybe in a few weeks or a few days this could get into production with a lot of work. He has to see what's out there, what's working, and how it can work for him. Then he tests it and iterates, and he talks about how it falls over, what's working and what's not working. We get right into it. He talks about some production-level problems that he was having with one of the RAGs. That was the first thing that he brings up. And I also really like this next part.

Demetrios [00:02:10]: If you stay to the end, you'll hear it, because it's almost like a bonus part that I got out of him after I asked him what his favorite type of coffee was. He talked about how his company has a whole ML research team, and this is not ML research in the sense that they're trying to find new architectures for models. It is an ML research team that goes out there and tests; they pit two different frameworks or two different tools against each other, they figure out which frameworks and tools are better for which scenarios, and then they push all of that to a GitHub repo so that they have it in mind in case a client comes to them with certain requirements. They already know they've tested this tool and that it works for that scenario. I think that is brilliant, and I appreciate him sharing that information with us. It's a really no-holds-barred conversation. We're not making any friends in this conversation, especially if you are a company that plays at the inference level. We may lose a few sponsors from this, but it's worth it, because the truth needs to get out there and he is speaking his truth 100%.

Demetrios [00:03:15]: Let's get into this conversation. As always, in case you like it, please share it with a friend, let others know, and give us some stars or some likes or feedback, anything of the sort. We'd love to hear from you. Let's start with this, man: you just mentioned you're having a production issue at this very moment, just before we started recording. What's going on?

Syed Asad [00:03:44]: Yeah, so it might seem to be a very easy issue. There is a huge chunk of CSV data, and many people claim on the Internet or on LinkedIn that they deal with vast amounts of data and that it can be handled in one blow. So here the problem goes. We have to develop a sort of RAG on that CSV data. I'll be very precise with the terms: the size of that CSV file is 133 MB. That is a huge file. And the file is perfectly formatted.

Syed Asad [00:04:24]: There is no data cleaning required. Everything is fine. I basically there needs to be a chatbot which could answer that. So the first approach would be to go ahead and straight away develop vector embeddings, do some sort of push it into a vector database and do a retriever engine type of thing. So the primary challenge was that when you directly convert that CSV into a vector embedding, it will not convert. No, it will not convert. So, so there is. So there is a huge fuss.

Syed Asad [00:04:55]: It is gone.

Demetrios [00:04:55]: And that doesn't matter what model you're using, it does it regardless? You tried...

Syed Asad [00:04:59]: It is not making any effect. I'll name the models also: mixedbread, Snowflake Arctic, normal sentence-transformers mini embeddings, any damn model. It is not making any difference. Any context size, any vector size. It is not even starting to generate the embeddings. I have tried the best of systems, I have tried the best of Macs, I have tried desktops.

Syed Asad [00:05:21]: It is not even starting. So I was stuck at 0.0.

Demetrios [00:05:26]: I mean, do you know why that is?

Syed Asad [00:05:29]: Not sure why it is happening. I tried to gather the logs, anything like that, but nothing ever came of it. The data is public, it is not private data, so I can tell what was inside: it's some sort of agricultural produce data from the US government. It is a huge file, and somebody wants to make a query on that file, some sort of analysis. So for the first approach, since the vector embedding was not working, I used PandasAI to do the search.

Syed Asad [00:06:03]: It is failing miserably. Two or three queries in: error, error, error, no module found. There is a "no module found" error and it is gone. And when I checked that error on the Internet, it is still an open GitHub issue. So that is off the list. The second thing is that I needed to do some sort of ETL on that data, then transform that data into vector embeddings, then push it into a vector DB and query from there. So now the problem is in creating the vector embeddings.

Syed Asad [00:06:35]: So I, I tried to use mixpeak, the one which you were pointing out that day. And that person who is the founder of mixpeak, his name is Ethan. Ethan. He was a very helpful person. He went on a call, he was on a call with me also few days back. He was doing back and forth. He created the entire pipeline. So he used the mixed bread model and that model, and that pipeline was able to convert the vector embeddings to a MongoDB vector search.

Syed Asad [00:07:07]: But the point is that the file I had given him to play around with was very small, while I was given the 133 MB file. And today is Tuesday; I need to go into production by Friday. So Friday is my deadline, including the testing. I would not say that this approach is failing, but it needs some time for iteration and testing. I need to be sure it is working.
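
For reference, a minimal sketch of the kind of embed-and-load pipeline described here, not the actual Mixpeek pipeline: it assumes the mixedbread model via sentence-transformers and a MongoDB Atlas collection that already has a vector search index; file, database, and connection names are placeholders.

```python
# Hypothetical sketch only: flatten CSV rows to text, embed them with a
# mixedbread model, and insert them into a MongoDB collection that already
# has a vector search index defined on the "embedding" field.
import pandas as pd
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")  # assumed model id
client = MongoClient("mongodb+srv://user:password@cluster.example.net")  # placeholder URI
collection = client["agri"]["produce"]

df = pd.read_csv("produce.csv")  # placeholder file name
rows = df.astype(str).agg(" | ".join, axis=1).tolist()  # one text string per row

# Embed in batches and insert documents together with their vectors.
for start in range(0, len(rows), 256):
    batch = rows[start:start + 256]
    vectors = model.encode(batch).tolist()
    collection.insert_many(
        [{"text": text, "embedding": vec} for text, vec in zip(batch, vectors)]
    )
```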

Demetrios [00:07:36]: And so when you look at this again, if you encounter something like this, do you feel like the danger in this scenario is the size of the CSV file, or where is the danger? Where is there something that can now set off a bit of a red flag in your head if you see it again?

Syed Asad [00:08:01]: The danger is actually in the size of the file, the amount of data. I won't say it is big, but it comes under the category of big data analytics. I know there are tools, Airflow can handle this very easily, but given the timeline for me going into production, everything is not possible in that particular time span. So the size is one factor which is the bottleneck. The other factor, I would say, is the complexity of the data, because there is similar data in that file. What if somebody is querying data regarding the production of sugar cane in Mexico? What is the guarantee that it will return the correct result? So that is one of the things.

Syed Asad [00:08:44]: So vector embedding. I figured out that vector embedding will not be able to handle this type of data because Mexico was coming many at many places, sugar cane was coming at many places. So it will go with the context and it might result in some sort of confusion inside the document.

Demetrios [00:09:00]: Oh wow.

Syed Asad [00:09:04]: So vector embeddings will not work in this case, even if I try hard with the ETL process. In the end, there is a solution for this, which fortunately worked. I converted that entire CSV into Parquet format. Parquet reduced the size from 133 MB to nine MB without too much loss. Then I made a normal LlamaIndex agent. It queries the file, it is giving just blazingly fast responses, one or two seconds maximum, and everything is correct.
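
A minimal sketch of that Parquet approach, assuming pandas for the conversion and LlamaIndex's PandasQueryEngine standing in for the "normal LlamaIndex agent" (the exact agent setup isn't specified in the conversation, and the import path varies by LlamaIndex version):

```python
# Sketch only, not the production code. File names are placeholders and the
# PandasQueryEngine import path depends on the installed LlamaIndex version.
import pandas as pd
from llama_index.experimental.query_engine import PandasQueryEngine

# One-time conversion: Parquet is columnar and compressed, which is what
# shrank the 133 MB CSV to roughly 9 MB.
df = pd.read_csv("us_agricultural_produce.csv")
df.to_parquet("us_agricultural_produce.parquet", index=False)

# Query the Parquet file with a pandas-backed query engine.
df = pd.read_parquet("us_agricultural_produce.parquet")
query_engine = PandasQueryEngine(df=df, verbose=True)
response = query_engine.query("What was the sugar cane production in Mexico?")
print(response)
```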

Demetrios [00:09:41]: Okay, so I really like you pointing out the fact that this doesn't work with normal embeddings because of the number of times that the same topics come up in very different scenarios. So understanding your data in that regard really helps you recognize if a vector, or, sorry, if an embedding model will be right for the use case. And so you didn't use any embedding model?

Syed Asad [00:10:10]: No.

Demetrios [00:10:11]: I mean, you got rid of that whole idea entirely and have gone with the...

Syed Asad [00:10:18]: So I didn't do that because, to be very frank, when you check today's AI, development in the field of AI agents came after RAG. So they are actually replacing some of those complexities. I won't say that vector search is losing its significance. It is still relevant, and it is still relevant for many things like PDF data. And I would rather say hybrid search is also something which is required: sparse embeddings plus dense vector embeddings, combining the results, and all those things.

Demetrios [00:10:52]: Yeah.

Syed Asad [00:10:52]: So, and one of the tools which I personally favor is I have been continuously using in production is fast embed. Fast embed is very, very fast. I mean very, very fast.

Demetrios [00:11:05]: Yeah. Shout out to Nirant, who created that. That is great. And you're able to use it in production with no problems.

Syed Asad [00:11:12]: Yes, it is being used in production with no problems, and it is even reducing the infra size of the Docker containers.

Demetrios [00:11:19]: Okay.

Syed Asad [00:11:19]: That is also a good thing.

Demetrios [00:11:22]: So I wanted to play around a little bit with the topics of this conversation on the inference layer and really dive into what you've been doing there, what you've been discovering. What have you experimented with thus far when it comes to inference? Maybe just start with what some of your use cases are, and then go down the rabbit hole of what you've been trying to use for this inference layer.

Syed Asad [00:11:54]: Okay, I'll go with one of the use cases, which was having some issues two days back. You remember, from the last talk, I talked about the multimodal RAG in which there was a memory you could pick things out of by searching. That thing was working perfectly fine. But suddenly, what happened? The Docker container on which it was running, on which the infra was running, was consuming around three to four GB of space, because the data was limited, not very much. The embedding model was using three to four GB of space on the AWS infra I'm talking about. Suddenly, three days back, it dropped from three GB to 400 MB, and that embedding model stopped using the GPU, although GPU space was allocated to it.

Syed Asad [00:12:46]: So. And this resulted in the queries becoming more stale, the retrieval times automatically increased. There is no problem with the embedding model. Everything is working fine. When you check that code on a different system, it works fine. Absolutely. So what happens? It is actually linked with the asynchronous processing of the data, how the backend is integrating into the, into the final app. So one of the backlogs, which I think is that the software people don't understand what is happening in the AI set.

Syed Asad [00:13:25]: They do it the software way. And so that that particular thing, there were multiple videos, and those videos required sequential processing rather than parallel processing. They put all the videos to be searched parallelly. So it was taking too much memory and it was not giving any results. So this is one of the things. And at this point of time, I would say the auto scalability option of AWS, which is a widely popular option, was not also working and handling the issue, which usually handles all the issue. So I don't think you were using.

Demetrios [00:14:01]: Which service from AWS? Because I know there are almost infinite services there.

Syed Asad [00:14:07]: Yeah. This AWS service which we were using, is that the EC two instance which we were using on that. And so I don't think so at this point of time. Such level of complexities can be handled by these startups. They might improvise in future. But the startups which I used because I personally go ahead with that Kubernetes layer. And what is. Because monitoring of logs is a very important concern how and why things are failing.

Syed Asad [00:14:44]: We use celery containers also. So you usually, what happens whenever you deploy a model? The workers in celery often get overloaded. That is one of the bottlenecks.

Demetrios [00:14:55]: Celery is where you were getting problems because it couldn't handle the load.

Syed Asad [00:15:01]: Celery was handling the load, but you never know when celery could. Pipeline bus. That pipeline keeps on bursting. The workers get overloaded, they bust. So I think personally mlops is something which is a hugely, hugely, say, having scope, huge scope is there in mlops at this point of time?

Demetrios [00:15:26]: Yeah, because just in this conversation, you're talking about going from the Kubernetes layer to the data engineering layer, to the inference layer, to the model layer.

Syed Asad [00:15:38]: Yeah.

Demetrios [00:15:39]: There's a lot you need to know in each one of those subsets. And so your production was struggling because of the auto scaling on AWS, because celery was just kind of flaking out every once in a while and you didn't have the correct notice time or you didn't have the confidence that it could stay working in production. What did you change?

Syed Asad [00:16:07]: So what we did is that we tried to, we tried to test the load on every worker. The first step we started, so there were few workers which were getting overloaded. So initially we tried to increase the GPU, GPU and the VCPU also. But to be very frank, increasing the VCPU and the GPU does not really makes any difference. It will, it made difference. It, it, it made the production running. Those salary workers were not overloaded. But you need to have some sort of mechanism to test because the AWS logs are not that accurate, which can give you an idea from the standpoint of a machine learning engineer then.

Syed Asad [00:16:51]: So that is the one of the bottlenecks also that AWS needs to upgrade.

Demetrios [00:16:56]: Yeah. So they're giving you logs that are helpful for the SRE or for the DevOps team, but it's not.

Syed Asad [00:17:02]: Yes.

Demetrios [00:17:03]: Very helpful for the machine learning engineer.

Syed Asad [00:17:05]: Yes. And what happens the team, I mean, those teams also, they get a log. I mean, they, what we do, we do we deploy time loggers to check where exactly the time is going on. So there is one more issue. The embedding model keeps loading again and again, although I have, we have loaded that model globally once, but it does not work. It keeps loading again and again. It consumes memory, it takes time. So that is one of the more most problem.

Syed Asad [00:17:33]: I mean, that is still not sorted out.

Demetrios [00:17:35]: So let's dive into the inference, because I know you've been doing a lot of testing with different startups and AWS itself. What have you found?

Syed Asad [00:17:45]: One thing I found is that you can start like salad, because I have explored salad extensively. I've deployed my own POC also, and I would rather say it was working fine. It was working fine.

Demetrios [00:17:58]: Salad for those, because I think they have an interesting and unique value prop.

Syed Asad [00:18:03]: Yeah, salad is actually, you can say salad is an inferior inference as a service platform. And they claim that they are the most affordable cloud for AI, ML inference at a scale, and their starting price is 0.02. I think two for the lowest bandwidth. I won't say, I mean, it will, it will suit you. But what I did was that my bill was $58 for approximately one month for a small POC.

Demetrios [00:18:38]: Nice. Okay. And the reason that they can do that, correct me if I'm wrong, is because they are trying to do distributed computing or distributed GPU's.

Syed Asad [00:18:48]: Yes. Even if you are having your own GPU, you can register on salad and they can use the GPU also provided you have an Nvidia GPU. So our clients are in, most of the clients are the US. So they are having, they're having good infra, good parallel computing infrastructure in us.

Demetrios [00:19:08]: Yeah. And is it fast enough?

Syed Asad [00:19:12]: It is not fast.

Demetrios [00:19:14]: Okay. There's the. So if you want fast, maybe it's not the best option, but it is cheap.

Syed Asad [00:19:20]: It is cheap.

Demetrios [00:19:22]: And why did it fall over for you? Like, why are you not using it?

Syed Asad [00:19:26]: Because the primary problem is that I cannot monitor the logs. They gave a Jupyter notebook type of terminal to monitor the logs. And I need to train my DevOps team to how to observe that Jupyter notebook. So that is another task. So I dropped that.

Demetrios [00:19:40]: Wait, so it almost the exact opposite problem that you were having with the AWS one. You had the DevOps team getting DevOps type logs and the ML team couldn't understand it. Now you have Jupyter notebook type logs that the DevOps team can't understand.

Syed Asad [00:19:55]: Can understand. Yeah.

Demetrios [00:19:56]: And that's where it comes to the idea of you talking about how this is still a very pointed solution, but there's a lot of scope and a lot of design decisions as a startup that you have to make to say, here's where we're going to plant our flag in the sand. We believe that you should get logs, but you should get them in a Jupyter notebook format. And for the people that don't like that, well, right now, at this very point in time, they're not going to be our customers. That's not our ICP or our ideal customer profile. Right.

Syed Asad [00:20:27]: So that is where I dropped with salad. Because I, because I cannot handle the hassle of handling the logs myself and just checking the logs every time something is failing. Because, because logs are at that stage when you plant the product into production and, and at that point of time, if anything is bursting, that is hampering the client experience. Also, because we have to, we have to give answers to the clients, also keep them in fast.

Demetrios [00:20:55]: Yeah, yeah. You're on the hook, right. You can't just let something be a problem for days, it has to be hours. Because you are basically, and correct me if I'm wrong, you're going into a client and you're saying, we are going to help you with your production environment, we're going to make sure that it is bulletproof and we almost give our seal of approval and our guarantee that if something goes wrong, we can handle it.

Syed Asad [00:21:23]: Yes, absolutely. Because that is where I'm standing. I'm heading the AI part here in my company and I'm in and out responsible for everything happening as far as the AI product development is concerned and as the deployment is concerned, because I wouldn't be blaming on a DevOps team that they are handling this. So they should be doing that because I am primarily responsible for everything. Everything. Like what type of model we are handling. So that is why I stopped using models from hugging face. They're not suitable.

Syed Asad [00:21:49]: I mean, you can go with production. I cannot say that is not suitable production, but they consume lot of space, so not production friendly. I would say fast embed is a good option to go into production. Lama, as usual, works good. But at the end you need to have your own framework. No lang chain, no llama index, no bullshit. So you need to have your own things.

Demetrios [00:22:11]: And so you've looked into quite a bit of services, it feels like. And you mention the idea of, okay, you've got certain options. You mentioned the different flavors that are out there. So you can go with the together AI and you can say, I'm just going to grab a model off the shelf from together AI, it's fast, it works, and I'm confident that it is going to be working in production. Or you can go a little bit deeper down the rabbit hole and say, I'm going to go with something like a VLM where I understand more. I'm going to be bringing my models. I get more control with the models that I want to put into production. And it obviously has a cost, there's an overhead cost to it.

Demetrios [00:23:03]: And then of course you can just use OpenAI right off the shelf. Right. And so how much of it do you think is overengineering when you can just be using OpenAI?

Syed Asad [00:23:16]: OpenAI, there are two concerns. One is the, obviously the cost. The other concern which is readily coming out is the data part. I mean, nobody knows what they're doing with the data. So at the end the client want something which is localized. But if you go to the Olama type of thing or Lama CPP, again, Olama is difficult to deploy in production because when you deploy that Olama, call any model with Olama, and then you go with, go with deploying that container on the AWS. It gives, it throws out errors regarding remote connection issues. I don't know, I don't know what exactly happens, but I have started testing it more rigorously.

Syed Asad [00:24:02]: I won't be able to comment much on this because that is one thing, because error debugging is too much. So that is one of the pain points. At the end, OpenAI works fine, but I would recommend going with the framework in which you have more control on your designs, more control on your things, especially the data and the type of customization which you can do, like for example, the VLLM.

Demetrios [00:24:30]: And so for those who don't understand the difference between Olama and VlLM, can you break that down real fast?

Syed Asad [00:24:38]: Yeah. Olama is something which, in which you can run the models locally by downloading them. It gives you some quant in models in quantized fashion also. And VLLM is something which increases the inference speed. It does not localize this, although there are options for localization. It can offer some sort of localization, but it increases the inference speed.

Demetrios [00:25:04]: So Olama is more for running models locally, it's for figuring out if something works and it distills or it compresses the models by default. I think I've seen that as like one of them.

Syed Asad [00:25:18]: This is the model by default?

Demetrios [00:25:20]: Yeah.

Syed Asad [00:25:21]: Like llama, three 8 billion parameters is a 4.7 gb in Olama.

Demetrios [00:25:26]: Yeah.

Syed Asad [00:25:27]: In reality, which 4.7 gb.

Demetrios [00:25:30]: It can be good and it can be difficult because if I've seen some people who wanted to use like a one b or the, I think what is the Microsoft?

Syed Asad [00:25:40]: Phi, Phi or Phi how they pronounce it, but it's Phi sedari.

Demetrios [00:25:45]: Yeah. And I think I saw some people talking about how it wasn't really working with Olama because of the compression that you get from Olama. And it's interesting to note that basically, if you're trying to go into production for now, at least stay away from Olama because of the debugging issues. And just trying to figure out the errors that it's throwing is not worth your time.

Syed Asad [00:26:06]: I mean, if, if you have time, you can debug the errors. I would say go with Olama, but not in a time bound environment in which I am working system agile, I work in an agile scrum based formats.

Demetrios [00:26:21]: Yeah. And it could be something that's good to prototype and then once you see that there's some value, then you can go and figure out what the next steps are.

Syed Asad [00:26:32]: Yes, absolutely. I mean, it requires some sort of research. Usually I am not able to get that research done, so that's the reason why I tried to deploy it, but at the end I need to just quickly switch onto a different one.

Demetrios [00:26:47]: And so you're all about this flexibility, but also understanding the scope and the needs that you have. As someone who is trying to find a tool that's going to be doing the majority of the hard work for you. Did you look into something like a modal or a base ten when it comes to these services that are out there? Because these are the only other two and maybe there's a beam cloud I know is another one. I'm judging by the look on your face that you weren't going too deep along the lines of either of these.

Syed Asad [00:27:27]: Last this base ten I started studying out, but it is still on my hit list. I have not gone into the, I mean the bot, I mean the grassroots of base ten. It is also an inference service. But I would really like to try it. The reason why I like to try it is that because I have got feedback from my peers in the UK that based in and vulture about these two platforms, so I have not readily explored them, to be very frank on that, because there are too many platforms to explore and I need to at least get on a 15 minutes call with anybody on their platform to just to get a, get a good idea what they are trying to offer for my use case.

Demetrios [00:28:13]: Yeah, exactly. That's why I like talking to you, because it feels like you've gone out there, you've done the hard work of figuring out what is out there, what your requirements are, and you put it into practice, and then you give us a little bit of that wisdom of how you've learned from it, where it falls down, where it holds up, all of that. And it's very unfiltered, which I appreciate.

Syed Asad [00:28:36]: One more thing I have found out few days back is the fine tuning of small llms using techniques using ORPO and a third party tool, unsloth AI. They are working wonders and it, they are very very easy to do that. So all four technique from that odds ratio preference optimization, which actually eliminates the two, two layers, the spine tuning and preferential treatment layer. And it combines them into one or both layer. So instead of doing a massive level of rag and just checking on vector embedding, we directly go to unsloth. They have ready notebooks available. Click on those notebooks, select the model. That models are scalarly quantized in a four bit format.

Syed Asad [00:29:28]: That means if you test that model, that model might be having some sort of lossy compression, but that is not relevant to you. You train data, train it on your data from hugging face or any of the JSON format, and then develop a sort of a small rag. So it is a 2 billion or a 1 billion parameter model. Small model, and it works fine.

Demetrios [00:29:50]: And so break down the system on one of your use cases that are you creating pipelines that are hitting this and then they are going out and they're using these inference engines. And then you have some kind of evaluation framework on the backend so you can retrain. How does that look?

Syed Asad [00:30:10]: Yes, an evaluation framework. I have checked DeepEval very rigorously, but DeepEval has an issue. You have to give a context, and that is the point. The question, the actual output, the expected output, and the context: those are the fields they give you. So you will be stuck on the context part. What will we do for the context part? They will give you a score, but I don't think that score is relevant to your case unless you give a context. They have already explained in their docs what you need to put in the context, but I don't think that is enough.
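
For context, the fields DeepEval asks for look roughly like this; the metric choice and the sample values are illustrative, and DeepEval's LLM-judged metrics also need a judge model (for example an OpenAI key) configured:

```python
# Illustrative DeepEval test case; sample data is made up for the example.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRelevancyMetric

test_case = LLMTestCase(
    input="What was Mexico's sugar cane production in 2022?",
    actual_output="Roughly 55 million tonnes.",
    expected_output="About 55 million tonnes.",
    retrieval_context=["Mexico produced ~55M tonnes of sugar cane in 2022."],  # the hard-to-fill context part
)

metric = ContextualRelevancyMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```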

Syed Asad [00:30:50]: The second thing is that I tried to explore two other tools, like RAGAS, the R-A-G-A-S RAG evaluation. That is working well.
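
A rough RAGAS sketch; the column names follow older ragas releases (newer versions renamed several fields), and the sample row is illustrative only:

```python
# Illustrative RAGAS evaluation over a single made-up example.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["What was Mexico's sugar cane production in 2022?"],
    "answer": ["Roughly 55 million tonnes."],
    "contexts": [["Mexico produced ~55M tonnes of sugar cane in 2022."]],
    "ground_truth": ["About 55 million tonnes."],
})

# Each metric goes through an LLM judge, which is where the extra cost
# discussed next comes from.
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)
```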

Demetrios [00:31:01]: Did you notice? Because I've heard people talk about ragas and it's a great tool, but some of the challenges involved in that are that it can bring your cost up.

Syed Asad [00:31:11]: Yes, it brings the cost up. And if you actually do that, because there is one more thing which can be done to automate the testing of large language models from the GitHub. It is the C. It is the Ci creating a Ci CD pipeline. And on which Andrew Ng created a course also from LLM eval course. I'm forget Circle CI is the tool. Circle Ci. It is a very good tool for evaluation.

Syed Asad [00:31:39]: If we're really concerned for evaluation part, there is a full blown free tutorial available for on deep learning website for circles AI. But it comes with a cost. Their API is costing you, but it is a serverless instance and it, it will be less as compared to. And it creates a Ci CD pipeline. So the automation is done. It is automated, actually. You don't need to interfere in that.

Demetrios [00:32:03]: Yeah, that seems like the dream. And that is very interesting that it's coming from the Ci CD level where the software engineers are used to playing and you can get that eval, you can create those pipelines and hopefully it's retraining. And I guess I see it in a way as like you're going to the sloth as part of your CI CD and you're just creating the new fine, fine tuned model.

Syed Asad [00:32:33]: Yeah, I mean I have still not deployed that circle CI. The reason why I've not deployed is because at the end I need to get approval from the clients. And at this point of time I would also like to talk about a very important thing, the client or the, or the end user who is actually paying for the project. They do not understand the value of evaluation in LLM. They don't understand how valuable it is. They still, they still think that manual software testing is the way forward and we need to do that only. So that is the thing. So this needs to be, there is lack of awareness and I think it will pick up in the coming few months.

Syed Asad [00:33:13]: But still, people don't know what is the relevance of LLM evaluation. People are usually testing it the manual way.

Demetrios [00:33:20]: Now this is fascinating, man. So any other pain points that you've come across as you're trying to put these different models into production or as you have your different use cases that you want to just mention here? Because I feel like you've been very vocal at the different parts that have been painful in your journey and hopefully we can all recognize this and we don't have to fall over the same potholes that you fell on and we can skip that part of the learning.

Syed Asad [00:33:55]: If somebody is learning from this, I mean, I would say that they would be having a huge exposure on what is right and what is wrong. But at the end, I would say if there are any pain points also, then there are good points also. You keep learning many new things and just checking on how things are working in the background. So you actually come with a set of tools which you need to go forward and add, then make your own ones. Like recently the can paper came. It is challenging. You might have heard about the Can Kolmogorov Arnold networks. It is making waves.

Syed Asad [00:34:31]: It is challenging the transformer architecture. So even I am trying to, I am trying to develop up framework for that so that it fits the GPT model. I mean, it is one of my personal projects, not a client project. So let's see how it is, how it goes. It is still a research paper.

Demetrios [00:34:48]: Yeah. And I do know that the last time that I talked to you, you mentioned how hard it is to ingest data also. And the idea that you don't have the best tools out there for ingesting data, like an airbyte, I imagine all the data engineering tools out there that are used to ingesting data, they have something for unstructured data now. But again, it goes to this scope, and it feels like this is an underlying theme that you're talking about as we're trying to shoehorn in all of these traditional methods with the new AI methods. And you have to choose a Persona that you're going to be creating a tool for. If you are an airbyte or a tool that has traditionally catered to data engineers on structured data, it's not like you can just be like, okay, now we also do unstructured data, and it works just as well. So have you made any updates since the last time I talked to you?

Syed Asad [00:35:47]: Yes, I have tried unstructured also for production, and I think it is. Unstructured is the most advanced tool right now for data ingestion. And because it partitions the data first, it goes with the partitioning of data also, coincidentally, yesterday I had a call with byte wax.

Demetrios [00:36:09]: Oh, yeah, nice.

Syed Asad [00:36:10]: The byte wax. Yesterday I had a call with byte wax because I needed to discuss with them a special connector of what they can offer. So they said that the connector fall under the category of premium connectors costing $1,000 per month. I said, whoa, when that connector is $1,000. So.

Demetrios [00:36:30]: And byte wax, just for those who do not know, what do they do?

Syed Asad [00:36:34]: Byte wax is a streaming platform, and they deal with live data streaming. And they act as a connector for streaming data, basically. So they have both open source and closed source format. But anything specialized you need will fall under the category of premium connectors that will cost you depending upon the use case. For me, they gave $1,000 costing per month for a single connector for a database.

Demetrios [00:37:01]: Wow. And so this was just connecting the streaming data?

Syed Asad [00:37:04]: Just connecting to a database? Yeah. Through a database, through a public database API.

Demetrios [00:37:09]: Okay, interesting. Yeah. These, that's a whole other can of worms that we didn't even get into. The pricing and how you determine the pricing on all of these tools, because a lot of them, it is usage based, which makes sense because you aren't going to have problems like if your scale is very small, you don't want to be paying the same as a gigantic enterprise that has lots of queries per second. But at the same time it can be prohibitively expensive and cumbersome for the developer. If at the end of the day you need to use four or five or six different tools in your workflow just to get that desired result, it feels very painful. And so I don't know if you've seen that in your own workflow that you just aren't open to bringing on another tool because you have so many tools in the toolkit.

Syed Asad [00:38:12]: Yes, because I think at this point of time I am open to exploring new tools, but it actually consumes my time. And to go and do research in that tool and then test that tool for that production thing is something which is very tiring for me. So at this point of time, what I have done or what we have done here in my company is Q Tech, which I am working on. We have developed a machine learning research lab. This lab keeps on researching new tools and sidelining the ones which we, which we don't have to use. I mean, we don't, we don't necessarily research on those tools which are deployed, which, which are working with, on which we are doing the projects. Even there's us, like for example, there is a resource who is sitting idle. So we do some, we keep doing pocs every day, like um, like we pick LLM evaluation framework, Prometheus and DB.

Syed Asad [00:39:10]: Well, so just start comparing them and put it in a ranking method. Like what type of use case this will fit in, what type of these kids this will fit in. We quickly create a code for that, push it onto the GitHub and keep it for our reference. So the next time we just need to do a copy and paste. That is all. So we have started the method of machine learning research lab. So research to production is our goal.

Demetrios [00:39:33]: That's the thing that's smart. That feels like a very good idea. It also feels like it could be a little bit dangerous if in six months from now, one of these tools that you are testing and you put into GitHub has completely pivoted and it has breaking changes all over the place. You have to be very careful at how you're implementing that.

Syed Asad [00:39:57]: Yes, yes. That is, that is one of the bottlenecks that I can say that is a risk. But we usually, because this is something which Lancin usually does every day, they keep changing the libraries and your pipeline is gone with the wind.

Demetrios [00:40:13]: That's again, why it helps to have a little bit more control when you create those frameworks yourself. You know that at least you have this ability that it's. Yeah, it's not going to change daily or weekly, monthly. So. Well, man, this is always a pleasure. I appreciate you coming on here and talking to me about this stuff. If anyone wants to relate or get in touch with you, I encourage them to either follow you on LinkedIn or connect with you on LinkedIn, because I think you also are sharing brilliant stuff on there. I appreciate your transparency and you going out there and doing this research for us.

Demetrios [00:40:52]: I think it is very valuable for the community to understand what has been stress tested, what has been very useful for you, especially when it comes to this inference layer that you were talking about, how you tested these different pieces and it takes you so much time. That's why I really love that you come on here, you share what your findings are after. I imagine you've probably spent a combined 2040, 60 hours researching the tools, trying to put them into practice, figuring out where they work, where they don't work best. All of that is very valuable information. And so I thank you for that.

Syed Asad [00:41:33]: Thank you. I mean, it is always a pleasure to share these types of information from you also. So, I mean, same feeling from my end also.

Demetrios [00:41:44]: So I didn't realize. The other funny thing is that ML engineers are now all AI engineers.

Syed Asad [00:41:52]: Yeah.

Demetrios [00:41:53]: Have you noticed that pivot?

Syed Asad [00:41:54]: Yes, everybody is an AI engineer.

Demetrios [00:41:57]: Exactly. Except again, going to that scope idea that you were talking about. It's so funny because an AI engineer, like what does that mean these days? The scope for an AI engineer is so large and so many different people coming at it from so many different angles. You have an ML engineer who's just change their name to now, being an AI engineer, you have maybe a full stack dev who now says, okay, I know how to fine tune a model. Now I'm an AI engineer. And so it's interesting to me that it's like, okay, there's AI engineers now. That's the term of the month or the year, I would say, and I.

Syed Asad [00:42:36]: And I would also say, I was mentioning previously with you that there would be specialized DSPy engineers in the future. That is such a big thing. I am still testing it; I have not gone to production, because that is a very hard framework to understand and then implement. So that is the big thing. I am in a PoC stage right now; I've still not completed it. I'm pretty confident that next time we speak, I'll be at a production stage.

Demetrios [00:43:01]: What's been so hard about it?

Syed Asad [00:43:03]: To understand the modules, actually, to understand the modules, to understand how they are aligning with your prompt strategies. So that is the hardest part of them. Even. I have created a GPT on that. The chat GPT offers the GPT version to help me understand those things. But I have had a good hold now with DSPI, but py. But right now I need to work on getting it aligned with my use cases first. That is something very hard, but it is very, very promising.

Syed Asad [00:43:38]: I think not many people are having those skills.

Demetrios [00:43:42]: Yeah, I think a lot of people are very in love with the idea. But as you say, when you try and actually use it in production or you try and actually use it. Yeah, it's uh, it's like you're trying to take a Ferrari to the supermarket.

Syed Asad [00:43:59]: Yes, they say that is, that is the thing. Yeah, yeah, just on it.

Demetrios [00:44:03]: Fascinating, man.

