MLOps Community

Alignment is Real // Shiva Bhattacharjee // MLOps Podcast for YouTube

Posted Sep 13, 2024 | Views 77
# DSPy
# AI infrastructure
# TrueLaw Inc
Shiva Bhattacharjee
CTO @ TrueLaw Inc

20 years of experience in distributed and data-intensive systems spanning work at Apple, Arista Networks, Databricks, and Confluent. Currently CTO at TrueLaw, where we provide a framework to fold user feedback, such as lawyer critiques of a given task, into proprietary LLM models through fine-tuning, resulting in 7-10x improvements over the base model.

SUMMARY

If an off-the-shelf model can understand and solve a domain-specific task well enough, either your task isn't that nuanced or you have achieved AGI. We discuss when fine-tuning is necessary over prompting, and how we have created a loop of sampling, collecting feedback, and fine-tuning to create models that perform exceedingly well on domain-specific tasks.

TRANSCRIPT

Shiva Bhattacharjee [00:00:00]: I am the CTO of TrueLaw. We build, you know, bespoke AI solutions for law firms. And as for how I take my coffee: black, no sugar, no milk. I'm from Calcutta, and we grew up drinking tea, and we don't drink tea with milk and sugar added. I think the actual flavor of tea comes from, you know, something like Darjeeling tea. I think the same thing applies to coffee.

Shiva Bhattacharjee [00:00:41]: With the milk and all of these things added, I mean, current American coffee is, like, a recipe for diabetes, I feel like. Versus, you know, coffee with no sugar. That's what works for me.

Demetrios [00:00:58]: What is happening? Good people of earth, this is another MLOps Community podcast. We're talking with Shiva. And what a leader this guy is. He has done so much in the engineering world. I feel honored to have gotten to speak with him. Before we get into the conversation, I want to play a little song for you. Get that recommendation engine that you have spiced up just a bit. Maybe you can consider this destroying your algorithm.

Demetrios [00:01:31]: Maybe you can consider it upgrading it. That all depends on you. The only problem that I have with this song right here that we're about to play, which is called Glimpse by Virelli, is that it is too short. I put this song on a playlist that I called The Sound of Angels' Wings, because that's literally what it feels like to me when I listen to it. As always, if you enjoy this episode, just share it with one friend so we can keep the MLOps Community at the top of the podcast charts. Correct me if I'm wrong, but you guys are using DSPy, right?

Shiva Bhattacharjee [00:03:05]: Yes.

Demetrios [00:03:07]: So, I think I put something out, and I said, if anybody is using DSPy in production, please get a hold of me, because I would love to talk to you. And you reached out, and it was like, hey, what's going on? Let's talk. Now, break it down for me. What exactly is going on, and how have you found it?

Shiva Bhattacharjee [00:03:31]: So, we were experimenting with prompting, as usual, I think, as things were coming along, aside from zero-shot prompting, and we played around with LangChain at the time. And there's always this notion, I mean, I resonated a lot with Omar's notion that these prompts are brittle, and if you change them here and there, things sort of break. And at the time, when the GPT revisions were coming much faster, or when we wanted to experiment with other LLMs, there were definitely a lot of changes to the prompts we needed to make in order to get this working. Now things have improved over time, I'm sure. Now there are the structured outputs that OpenAI is doing, but at that time this was not so easy, and there were a lot of these changes to the prompts that we had to continuously make.

Shiva Bhattacharjee [00:04:36]: And that sort of, you know, resonated quite a bit. I mean, I worked at Databricks before, so I had Matei in my LinkedIn feed, and I think from there I kind of learned about DSP, or DSPy. I think that's what we call it now. Yeah, that's what I've always called it, I guess.

Demetrios [00:04:59]: I was calling it DIPD before, so no shade, however you want to call it.

Shiva Bhattacharjee [00:05:06]: And then, I mean, I think I went through a couple of these videos that Omar has put out, and they resonated very strongly, especially in the context of what we were doing: that it's modular. The idea there, that the optimizers make a lot of sense in terms of how you can actually iteratively improve this. You still have to go through, and we can talk about that later, the limitation that the optimizers still take a bunch of examples to optimize on. But in cases where you are doing that, and where you have those limited examples that you can show to mimic what you're looking for, it makes a lot of sense. I also really like the Py aspect of DSPy: it's very like PyTorch, very modular in nature. So from a programming point of view, I think it was very modular.
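For readers unfamiliar with the optimizer loop Shiva is describing, here is a minimal, illustrative sketch using DSPy's BootstrapFewShot teleprompter. The model choice, signature, metric, and training examples are hypothetical placeholders rather than TrueLaw's actual pipeline, and it assumes a recent DSPy release where `dspy.LM` is available.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Point DSPy at an LM; the model choice here is just an example.
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A hypothetical single-step program: question in, answer out.
qa = dspy.ChainOfThought("question -> answer")

# A handful of labeled examples -- the "limited examples" the optimizer needs.
trainset = [
    dspy.Example(
        question="What notice period does the master services agreement require?",
        answer="30 days",
    ).with_inputs("question"),
]

def answer_match(example, pred, trace=None):
    # Hypothetical metric; a real legal task would use something richer than substring match.
    return example.answer.lower() in pred.answer.lower()

# The optimizer runs the program over the examples and bakes good demonstrations
# into the prompts, instead of you hand-tuning them.
optimizer = BootstrapFewShot(metric=answer_match, max_bootstrapped_demos=4)
compiled_qa = optimizer.compile(qa, trainset=trainset)
```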

Shiva Bhattacharjee [00:06:16]: And we can go a little bit over the stack on the retrieval side that we have built. It was easy to use different rerankers; it was easy to use those things. So I think that's why we stuck with DSPy. We also use it for generating data, synthetic data, in some cases.
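To make the "plug in your own reranker" point concrete, here is a minimal sketch of a custom DSPy module. The `score_with_our_reranker` function is a hypothetical stand-in for whatever in-house cross-encoder you would actually call; it is not TrueLaw's implementation, and it assumes a retrieval model is already configured in `dspy.settings`.

```python
import dspy

def score_with_our_reranker(query: str, passage: str) -> float:
    """Hypothetical stand-in for an in-house cross-encoder reranker."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))

class RerankedRetrieve(dspy.Module):
    """Retrieve broadly, rerank with your own scorer, and keep the top_n passages."""

    def __init__(self, k: int = 20, top_n: int = 5):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=k)  # uses whatever RM is configured in dspy.settings
        self.top_n = top_n

    def forward(self, query: str) -> dspy.Prediction:
        passages = self.retrieve(query).passages
        reranked = sorted(passages, key=lambda p: score_with_our_reranker(query, p), reverse=True)
        return dspy.Prediction(passages=reranked[: self.top_n])
```

Because it is just another `dspy.Module`, the rest of the program can call it exactly where it would have called the stock retriever.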

Demetrios [00:06:40]: But yeah, yeah, we can talk about the synthetic data in just a second. But the burning question that I have is, aren't you afraid that DSPy is basically a research project and it is not necessarily something that you want to put that much faith into?

Shiva Bhattacharjee [00:07:04]: Yeah, I mean, you know, the thing is that we obviously have made changes. So the way this works is, you know, you clone the repo, build it yourself, and you're making changes along the way, because you don't know when your upstream changes are going to get pushed in. Actually, in that regard, the community has been pretty good. We have pushed one or two changes that got accepted very quickly. But at the end of the day, it's Python code. You're looking at it and kind of seeing what it's doing.

Shiva Bhattacharjee [00:07:36]: And that is also true for the LangChains of the world. But one of the good things I felt in DSPy, because of its modular nature and how it works, is that you can write your own module and see how it works. Versus, again, maybe I made mistakes when I was making these kinds of changes in LangChain, but I felt like you had to go through a lot of layers to make the same sort of changes. In DSPy we were using our own reranker, and it was easy to use that as a different module and plug it into the framework, whereby you're still calling this DSPy retriever. The object hierarchy and the calling conventions are just better and more modular, in my opinion.

Shiva Bhattacharjee [00:08:43]: In general, there's always that question about what you are doing, but first of all, the nice thing is that this is code that is just running on your servers, and our use case is really latency bound rather than anything else. And you can enable tracing and basically see the sort of calls that you're making. Actually, one of the things that we have done, which has been quite helpful, is that because of the chaining and the hops that are happening, and all the query rewriting that happens, we have exposed that intermediary state, step by step, to the end user, so they get a good sense of what is happening and how the query is becoming more contextualized with the content that they're searching, compared to where they started. So in some of those cases, that visibility and legibility has actually helped the product in terms of what we have done.

Demetrios [00:09:58]: Yeah, it's fascinating to me because DSPy is the one open source tool out there, I think, that does not have a big, well-funded VC behind it, or a company that just got all that money from the VCs. And so you see that there's an incredible community, and there's a lot of people and a lot of energy around it. But I know that some folks get a little scared because there isn't a company that is, in a way, giving it that stamp of approval and saying, yeah, we're going to be the shepherd of this open source tool.

Shiva Bhattacharjee [00:10:42]: Yeah, I mean, that's a very interesting point. I think you're right: why does LangChain have funding and a company that got started around it, and not DSPy? There's definitely the first-mover advantage of these things, and how much active development goes on, and stuff like that. With DSPy, my gut feeling is that because the subtleties around improvements and the iterated inference happen, like the prompting that happens in the back, it's probably not as exposed as in LangChain. In LangChain you are still explicitly calling out a particular chain of thought or something like this, very explicitly. And again, these things have become more modular and more abstract over time.

Shiva Bhattacharjee [00:11:46]: So it's been a while since I have used LangChain. But DSPy, when we first looked at it, the learning curve was a little bit steeper, I would say, than LangChain's, and I don't know if that causes an adoption issue. But ultimately it is a fascinating thing that these are equivalent systems. They still suffer from the same limitations; this is ultimately prompt engineering that is happening, so there are limitations around that. But it is a good point. The main thing, in my opinion: typically, the way I've seen things get productionized is you kind of have to provide a service.

Shiva Bhattacharjee [00:12:43]: You know, if you were taking this thing and you needed a server or some state management that you had to manage yourself, that's when a company comes in behind it and says, we are going to take on all the operational headache of you doing this, and that becomes a legitimate service. You see this with LangSmith and things like that, where you're doing this drag and drop and observability, and those become part of the offering. I'm sure if someone invested the time, they could make an argument for the same sort of suite of things that LangChain has done, on the DSPy side of the world. But if you just think of it as a framework and an SDK and a library that you're using, then most people will be like, well, I'm kind of running it; all this code and server stuff is running in my environment, so what is the point? So I think you always have to associate a service with this, which may or may not exist; you have to think through what that would be, and that would probably justify building out a company.

Demetrios [00:14:04]: Okay, so getting back to what you're working on, you have a RAG-based use case, right? And you mentioned that sometimes DSPy is the interesting route, and then other times you fine-tune, and I would imagine other times you don't do much and it's just naive RAG, as they call it. Can you break down these different scenarios and what you found to work best, and when?

Shiva Bhattacharjee [00:14:32]: So one thing, you know, we have focused on the legal domain, so we build bespoke solutions for, effectively, lawyers and law firms. And here, precision and quality trump latency a lot. They still want search results to be quick, but it doesn't have to be instantaneous in terms of your Google search experience. So we always had to focus on quality. Also, this demographic of users are not the best prompt engineers. I mean, they're not thinking in terms of what is the best way to write a prompt. And I think what we have found is that, given the way questions are posed by a typical lawyer, it's very important to contextualize that question in order to be able to do better retrieval.

Shiva Bhattacharjee [00:15:41]: And that's where this query rewriting helps a lot, and that's what we are using DSPy for. And then it's very parameterizable, which means you can determine the number of hops you want to do, what depth of retrieval you want to achieve, how many documents you want to see in your top k, and all of those things. And these are, again, important to the lawyer; they are making the call between recall and precision in terms of the amount of data that you want to get back. And because we could play around with the latency, we don't have to. Again, I think they understand that the more hops you are doing, the better quality you are going to get with this approach, but at the expense of the results taking a little bit longer. And this is especially true in very domain-specific question answering. For the fine-tuning work, we have done some work around embedding models, because that's one of the other metrics around time. When you're doing the search and the retrieval the typical way, if the embedding model is trained on the corpus of data that you're searching on and the queries that you're doing, you just get a better match.

Shiva Bhattacharjee [00:17:17]: And again, for domain-specific things, this tends to work a bit better. But then there is also the question around the generation part of it, whereby, you know, certain firms have a particular way of seeing the answer, or they expect a certain format. Because typically this work would have been done by junior associates, where you effectively would have told other folks to do this search and come back, and they have a certain way of presenting this data. So that alignment of the generation is something we have also used fine-tuning for. And for typical RAG, I mean, when we started this, we found a bunch of off-the-shelf RAG approaches, and again, on domain-specific things, for the lawyers, in our experiments they didn't perform that well.

Shiva Bhattacharjee [00:18:20]: We always had to do a few different things, either extracting metadata first or, you know, basically make the retrieval process much more contextualized in different ways. Before we aggregated the data, if you just were to use your typical, like, you know, just use an embedding and your regular embedding and just get the retrieval, the quality was not that great.

Demetrios [00:18:46]: Well, so there's a few things that I would love to know about specifically on the fine tuning of the embedding models. I think you can do that for relatively cheap these days, right?

Shiva Bhattacharjee [00:18:58]: Yeah, I think, you know, the main thing here is that the embedding model itself doesn't have to be very big. You can actually use just the encoder part of these models, and the dimensions typically don't have to be very large. The main thing with an embedding model is figuring out the training data: you're basically giving it some contrastive data, and generating that contrastive data from the corpus is the hard part. In that sense, it's cheap.

Demetrios [00:19:42]: Money wise, but it is expensive in resources.

Shiva Bhattacharjee [00:19:47]: It's expensive in figuring out, like, how to generate that contrastive data. I think that's the harder part. But, yeah, I think in general, even training and all of these things now, these are becoming extremely commoditized. So we actually don't focus on the, the training infrastructure per se. We have an orchestrator layer that is very agnostic to where things could be trained. But in general, all of these are, I think, from when we started to even now, the drop in price is significant and I think that will just continue to grow even when mini is a very powerful model and training that is very cheap. Yeah.

Demetrios [00:20:36]: So, you know what I would love to break down: you've worked at a whole slew of incredible companies, and you're now CTO at TrueLaw. And you knew coming into it that you were going to be setting up infrastructure that is itself very prone to change. Technology in general is prone to change, but with AI and LLMs, this is taking that to the max. It's very volatile, right? Because, as you said, things get cheap. A model that is super powerful all of a sudden becomes just a commodity, almost overnight.

Demetrios [00:21:17]: It feels like. So as you're going through and you're setting up the stack and you're thinking about which pieces to value, which pieces to try to make future compatible or making bets on what's going to come down in price, how are you thinking through all of these stages since you have that bird's eye view and you're the one who's ultimately the guy in charge on the technology side and you decide to with your teams, what gets implemented. Can you walk me through your decision making process and how you think about that?

Shiva Bhattacharjee [00:21:58]: Yeah, that's a very good question. In general, obviously, one of the starting points is a money constraint. As a startup, we have to be scrappy, so build versus buy is always a decision we have to make. And then you have to juggle the fact that we cannot take six months to build a feature, because we have to go through that quick iterative cycle of development. Very soon into this, we realized that building a foundational model is very, very difficult. Even if it is technically feasible for us to learn and do this, the sort of resources needed is just very different. So at that time, we started focusing more on fine-tuning approaches and building.

Shiva Bhattacharjee [00:22:59]: I think mate and Omar has the paper around this compound systems, which is building sort of like a system of small language models that can orchestrate and doing a bunch of these things. And the corollary to this is like, you know, you have a brain and you're doing your different parts of the brain are kind of like doing different things. And then there's a sort of like coordination around. I mean, there's a two train of thought is like one large model doing it, or you bunch of smaller things coordinating and doing things. And I think we kind of took this approach of the compound system which I think it was more feasible in terms of the infrastructure. I think that's very good in the sense we understood that in order to get that scale, the price point, you need economy of scale to do certain things. It's very hard for us to build a training infrastructure just for the few models that we are training on or the volume at which we are training on. From a very get go, we understood that we have to leverage these sort of training services or model training services that are available.

Shiva Bhattacharjee [00:24:14]: And my previous experiences at confluent, where we were doing this orchestration for provisioning, orchestration for confluence servers or Kafka servers and stuff, that got me some insights into building this as a way of an orchestration service which does this in terms of the training. And then there's data generation pipeline. So if you think of it, the way that we have focused and ever sort of like all our ip is around this data generation and how are we fine tuning things and incorporating into the models. And in that way it is not too dissimilar from my previous experiences of what I have worked on, which is sort of like this data management and orchestrating that data flow. But the way we have always built our stack is we have always leveraged other SaaS providers in terms of using their infrastructure to train this. This has both been a blessing and sort of like cost wise it has been quite efficient, partly because we give the options on which infrastructure things could be trained. Of course we can train it on our cloud, we can train on the customers as your environment also. So that flexibility is actually quite powerful.

Demetrios [00:25:58]: You said something there that I want to double-click on, which is around how you used to be dealing with data flows and you're still dealing with data flows; it's just that the data has changed a little bit. So I imagine you used to be dealing with event and click data flowing around, or this person purchased something, this user ID purchased something, et cetera. Now you're dealing with prompts, I am assuming, and the outputs of those prompts, and how you can best bring the output back into the fold to make sure that you're constantly leveling up the pipeline. Is that what I understood?

Shiva Bhattacharjee [00:26:42]: Yeah. Yeah. I mean, in essence, at the end of the day, whether you're doing prompting or fine tuning approaches, this like input and output to the way you're talking to the LLMs and in general, I think in confluent and when we're dealing with. I was in the data governance team also for a few years it was all about the flow of data. And then that particular case was flow of metadata that you have to be worried about from where it is originating into how it is distributed to everywhere else. Here, this is a similar thing where you're generating data and you're aggregating this data. Of course, this is all unstructured data. So the benefits of PLM is you're trying to get a decent sense of what this is, but at the end of the day, it is incorporating the feedback or moving this data, or constructing the training set on what this is.

Shiva Bhattacharjee [00:27:50]: So of course, everything is data. And what you're removing with the context of data changes quite a bit. But the infrastructure that you need keeping versions of it and things like that, and being redundant and being able to replay back, these are all the same building ethos and engineering ethos around how you do that. And that has a lot of similarity.

Demetrios [00:28:17]: So there's stuff that you decided to buy and not build. One obvious one is the LLMs, which, in hindsight, I think is a very good choice, knowing how much it costs to train them and seeing how quickly they are going down in price; that makes total sense. Are there things where you are a little bit more surprised that you went for the buying option, or happy, I guess, is how I could frame it, that you bought instead of built, or vice versa, happy that you built instead of bought?

Shiva Bhattacharjee [00:28:56]: Yeah, I think so. When we first started, we actually built, I think, with every microservice. Of course, our thing is a microservice architecture, and we needed a communication mechanism between how they will talk. So you need some sort of mechanics between this communication mechanics between your microservices. And we have leveraged given from confluent, reverse confluent cloud. But ultimately we build a messaging system between these services to handle a bunch of asynchronous stuff. Again, as I said, most of these work that are dealing with LLM is latency bound. So asynchronous communication had to be part of the building stack from whatever we built.

Shiva Bhattacharjee [00:29:50]: And that served us pretty well. And until we were getting production use cases of doing massive inferencing for very thousands of emails, or like very large scale inferencing, which will take hours to run, and things like that. And in those cases, we are actually using temporal. This is, I don't know if you know that this is like, we call it durable workflows started from Uber in terms of how they manage. I mean, this is ultimately, again, workflow management, sort of a thing. And we made a good decision because I think I was talking to the developers. The first pass of doing this, we used our existing infrastructure to do that. And again, the sort of loopholes that there is, the kind of guard you need to do against retry, how much to retry when things get interrupted, all of those other things that you need to take care of versus sort of like using a company that's effectively like, this is the first of all, we're a small group of engineers, and then kind of like using them to kind of handle this durable workflows has worked out very well for us in terms of when running this very long running inferences or doing training, getting notified if things get interrupted and getting handled.

Shiva Bhattacharjee [00:31:32]: Of course, you have to write code to sort of get it to use. But I think we haven't had infrastructure problems related to that thing, which I'm sure it would have taken us at least a couple of months to perfect that sort of infrastructure for ourselves. And then again, it will be very accustomed to just our use case and not very generic. So every time we probably needed to add and make any changes to the state machine, we would have to have gone back and made those changes and tested and whatnot. Versus I think using temporal has been very useful for us because I think we kind of like, again, it's a workflow engine. We have to build, of course, the state machine that does that. But writing that state machine logic is much more simpler, much more additive in nature, and then depending on them to execute, that is much easier for us, or has been very easier for us.

Demetrios [00:32:30]: So one thing that I noticed about you is that you've worked in many different areas of the stack, we could say, both vertically and horizontally, and I don't know if you agree with this, but knowing your background, you've gone very low level. You've also gone from the data side to front end, back end, and DevOps-y stuff, and now LLMs and MLOps. How do you see things now? Is there stuff that you can look at and say, because I know this, I can draw parallels with certain parts of the stack or the LLM side? I know you mentioned earlier that there are data flows; now it's just LLM and AI data flows, and all of this metadata is the important thing. And you came from Confluent, so you had that data flow kind of in your blood. I know you were also working in many different awesome places, one of which was Apple, and so you got to kind of see the gamut.

Demetrios [00:33:53]: Have you noticed different things that shouldn't necessarily be related, but that you were able to now relate because of this? Why this breadth of experience?

Shiva Bhattacharjee [00:34:06]: Yeah, I mean, you know, I've been very fortunate. I think in certain cases things have clicked in unexpected ways. For instance, I clearly remember I was at riverbed and we were doing file systems there. At least riverbed at that time was building a file system, a de duplicated file system to do so, like compression on your primary file system. And one approach there was that because these things ultimately are, your performance is ultimately how well your data reads and writes are happening on the disk and they get blocked. And how do you sort of paralyze as much as you can? So how, how's your pipeline into those data? Aren't that happens? And in riverbed, we remember, we use this very macro based c to kind of get ourselves this sort of version of async IO, if you think of it, but at a much deeper level. But they use this notion around, you know, used, I forget what was called, it was r threads, if I remember correctly. It's a riverbed threads.

Shiva Bhattacharjee [00:35:34]: That's the idea there. But it was all macro level way of doing it. And you have to get used to this framework. It wasn't CC because written in c, but to understand when things would unwind and stuff. Surprisingly, when I went to Apple and I was kind of explaining what I had worked on in this war in the riverbed side of the. And I think they were, I mean, I didn't even know at that time that they were working on just Grand Central dispatch, which is their sort of like, approach towards multithreading libraries. And suddenly they were like, oh, like this makes sense, like, you must have like a lot of context around this. And we got started working on it.

Shiva Bhattacharjee [00:36:24]: And so it was like one of the two maintainers of GCD at the time, but that sort of like relation, you know, that correlationship, those are unexpected things that happen. But it's also, I also feel like that gives you great insights into, in one stack, but you're working in different companies, but, you know, they could be related to sort of like this other stuff, which seems completely unrelated, but, you know, at the core, they're all sort of like connected. I remember at Arista, I remember Arista, I think Adam Sweeney or I think very senior principal, or I think BP at that time, he told me about, he was asking me questions around circular buffers and tines and other things, and then I think he mentioned, all of computer science could be boiled down into the seven or eight algorithmic questions. And if you can, you know, you're good at this, you kind of like have a pretty good grasp at most of these system level things. And I think that has been true. Like, if you know, buffering and how to deal with memory overflow, like out of core algorithms and things like that, they have a lot of. I mean, you find that application in several different places, and I think that has definitely helped. But, yeah, I think going back to your original question, like, has the breadth helped in understanding it does? I think for sure there are still areas of GPU learning and how the GPU works.

Shiva Bhattacharjee [00:38:06]: I don't have a very good depth of. I never really worked on those kind of things. But I think in principle, when I read some of these papers, it's sort of all, I mean, again, short circuiting certain things, caching certain results for quick access. I mean, they are all sort of like consistent approaches to solving performance related issues that has been sort of like, universal. Any one last thing I should just add, like, even at databricks when I was there, I think one of the very thing that got repeated all the time was that spark in general is an amalgamation of all these different concepts. Let's do caching properly, let's do distribution of data, separate the data separately, and let's do work on immutability and things like that. And I think that it was repeated quite a bit, like, you know, like, are we doing something very revolutionary? I mean, but not in any one particular angle, but if you like, aggregate all of those features, Spark as a system seems very complete in terms of being able to have implemented all of these sort of best practices. And I think that was one that was, that kind of stuck with me.

Shiva Bhattacharjee [00:39:34]: I think that was true. Like, if you look at any particular feature of it, it was. It had basically kind of gotten from the experience of, you know, what are the issues with other systems? And then kind of like addressed all of those principles. In essence, I think that was part of its popularity. And of course, the data frame API was very good to access with the declarative idea around it.

Demetrios [00:40:03]: Um, but, yeah, so many great teachings. Awesome, dude, we'll end it here.
