The Real E2E RAG Stack
Sam has been training, evaluating, and deploying production-grade inference solutions for language models for the past 2 years at You.com. Previous to that he built personalization algorithms at StockX.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
What does a fully operational LLM + Search stack look like when you're running your own retrieval and inference infrastructure? What does the flywheel really mean for RAG applications? How do you maintain the quality of your responses? How do you prune/dedupe documents to maintain your document quality?
Sam Bean [00:00:00]: I'm Sam Bean. I work at Rewind AI in applied AI, and I take my coffee black with a lot of caffeine in it.
Demetrios [00:00:11]: What is up, good people of the Internet? Welcome back to the MLOps Community podcast. I am your host, Demetrios. And today, what a conversation with Sam. Wow. Coming away from this, having goosebumps. I don't say this all the time, but when I do say it, hopefully you take me seriously. This is a conversation that I would recommend to everyone. Sam has had experience dealing with LLMs since before they were cool and building systems in search and with LLMs since before the whole ChatGPT craze.
Demetrios [00:00:51]: And he talked about some of his huge learnings, this collection of learnings that he's had. And oftentimes what he mentioned is there's simple things that you can do to help you progress your LLM systems. He was at You.com; he's now working at Rewind. We get into all that, but really, I think the key factor that I'm going to take away from this is when he talked about overcomplicating your LLM systems for no good reason and how we can all fall into that trap for some reason or another. And then you don't recognize what metrics are crucial, and you don't know how to read your metrics if your system is too complicated from the get-go. So he walked us through how to build from one piece of the puzzle to the next and make sure that you're able to get those metrics and understand and interpret those metrics very well before you decide to add any new pieces to your system. I think this is a little different than the talk that I've been hearing quite a bit of. It's very easy to say, okay, we need everything now.
Demetrios [00:02:18]: We need to make this the most robust system possible, because these LLMs could go off the rails. And we've got to make sure that we are following the most up-to-date system design that I've seen on LinkedIn or on Twitter. And Sam is not about that at all. He is very pragmatic in the way that he does things, and he is down to the metrics, making sure that before you do anything, you exhaust the simple stuff, exhaust your ability to get very far with what you have, and really crush the metrics that you have. And once you can't go any further with those metrics, then bring on more. Seems simple. For some reason, it's not so easy to do in practice. We like to over-engineer things, and that's just, I think, in our nature.
Demetrios [00:03:18]: So enjoy this conversation with Sam. And if you liked it, as always, it would mean the world to me. If you tell one friend or leave us a review, leave us some stars. We've got almost 120 stars now on Spotify. That's incredible to see. I am appreciative of every single one of you that are listening, and I thought I would say it before we get into this conversation. There is only one place that we can start, and that is with the story behind your competitive pinball playing.
Sam Bean [00:04:01]: Oh, yes, of course. You want to start there. So, people don't know, I'm a competitive pinball player. I actually have my weekly league tonight that I'm going to go compete in. I'm playing for a trophy. I had a friend of mine, Jack, who I studied computer science with at the University of Michigan. And out of school he went and programmed pinball machines. He wrote game code for Stern Pinball, which is one of the manufacturing companies out of Chicago.
Sam Bean [00:04:35]: And he wrote game code for, I think, the likes of Mustang, World Poker Tour, and some stuff on the game of, you know. He was, you know, playing, coding. He was like living and breathing this stuff. And it just kind of got into our friend group as like, hey, this is something we can kind of go do. And I'm a competitor, and so it went from fun thing to do with my friends to competitive thing that I can find people to go play against. And if I ever find something like that, I can't help myself.
Demetrios [00:05:07]: What does it mean to code pinball? Because I thought it was just the ball goes and bounces off of shit. Is it that you're deciding how many points you get when it bounces off of things?
Sam Bean [00:05:20]: Yeah, so I think it's easy to think of it as kind of like a stripped-down traditional game, with similar game mechanics. It's mostly like a state management system that's like, hey, this is the current state, this is the next input switch. The ball is mostly rolling over switches or hitting targets, and that creates more inputs for what is basically a state machine to figure out which state to go to next. The only interesting thing is that the inputs are physical. It's a ball rolling over a switch, or it's hitting a target or a pop bumper or something like that, versus virtual inputs or inputs from like a PS5 controller. The inputs are coming from a ball rolling around on a wooden field. It's interesting, but past that, the input source, it's mostly like your traditional game engine.
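As a rough illustration of the state machine Sam describes, here is a minimal Python sketch. The state names, switch names, and point values are all invented for the example; this is not Stern's game code, just one way "current state plus switch input gives next state" can look.

```python
from dataclasses import dataclass

# Hypothetical rule table: (current_state, switch) -> (next_state, points).
# Every state, switch name, and score here is made up for illustration.
TRANSITIONS = {
    ("idle", "plunger_lane"): ("in_play", 0),
    ("in_play", "pop_bumper"): ("in_play", 100),
    ("in_play", "left_ramp"): ("ramp_lit", 500),
    ("ramp_lit", "left_ramp"): ("in_play", 5000),  # second ramp shot pays a bonus
    ("ramp_lit", "pop_bumper"): ("ramp_lit", 100),
    ("in_play", "drain"): ("idle", 0),
    ("ramp_lit", "drain"): ("idle", 0),
}


@dataclass
class Game:
    state: str = "idle"
    score: int = 0

    def on_switch(self, switch: str) -> None:
        """Advance the state machine for one physical switch closure."""
        # Unknown (state, switch) pairs are ignored: state and score stay put.
        next_state, points = TRANSITIONS.get((self.state, switch), (self.state, 0))
        self.state = next_state
        self.score += points


def play(switches):
    """Feed a sequence of switch events into a fresh game and return it."""
    game = Game()
    for switch in switches:
        game.on_switch(switch)
    return game
```

The only pinball-specific part is where the events come from: each string stands in for a ball physically closing a switch on the playfield.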
Demetrios [00:06:16]: So where do you find these pinball arcades? Or did you buy your own?
Sam Bean [00:06:22]: I've got a collection. Yeah. So I've got a few that I practice on. If you want to have a bad time playing pinball, come play in my basement. Really, the dojo. My machines are set up to make you a better pinball player, and they are not so much set up to have fun playing. So they're mostly set up to not have fun playing. They're extremely difficult.
Sam Bean [00:06:44]: They're very fast. You can lose the ball very easily because of the way that I've removed certain physical objects that usually would impede the ball leaving the play field. Those posts are all gone. It's brutal. But if you want to get good at pinball, I got the collection for you.
Demetrios [00:07:04]: Yes, that is it. This is a really cool hobby, and this is the first that I'm hearing anybody on this podcast talk about it. So I like digging into it. I had to when I saw that that was a possibility there. You've had quite the experiences dealing with MLOps and particularly LLMs, and I want to go through your journey a bit because you've worked at You.com, you are currently working at Rewind, and I think there's something interesting in and about this space of dealing with search and dealing with large language models that it feels like interests you, and maybe you can break down what gets you excited. And how did you decide to start working on these LLMs at such a time where it was like before the hype almost?
Sam Bean [00:08:00]: Yeah. So my journey with MLOps, I think, really starts at StockX, doing things like simple drift detection, retraining routines based on performance degradation, like your standard initial MVP MLOps setup. I think that going from a forecasting problem domain, which is pretty simple, each point just has a feature set, and then you're mostly just creating cross features or better feature representations, and then you're putting that through a big decision tree. I think the complexity with observing these systems increases a lot when you move to search. I think that there's a lot of different failure modes for search. And also search is one of those really special problems where there's going to be questions that there are no right answers to in the world, right? It's impossible to get right because it's either someone's asking something that doesn't exist, it doesn't make sense, or a variety of other reasons.
Sam Bean [00:09:14]: And then there's questions where there could be millions of right answers. Any of a million web pages could be served to this user, and any of them will solve their query, and then there's a ton that is in the middle. Most of it's in the middle. And so search is one of those problems where you're never really quite 100% sure if you're doing it right without the behavioral data. And behavioral data is so expensive to go out and get via A/B testing or user surveys or however you're going to go get real people's feedback on your software. That is just so expensive to get. And so it feels like the machine learning operations paradigm and that tool set, the importance is amplified in search and in LLMs.
Sam Bean [00:10:04]: I think because of the ambiguity that is just inherent to those systems, I think that they can handle more ambiguous problems and inputs and can do more for the user. But because of that ambiguity and because of those capabilities, I think they're more challenging to monitor and to create really robust lifecycles around those systems. So it made sense to take those skills and level them up in the search space. And You.com was like the perfect place to do that.
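For readers unfamiliar with the "retraining routines based on performance degradation" Sam mentions from his StockX days, a minimal sketch of such a trigger might look like the following. The function name, the use of mean error, and the 10% tolerance are assumptions made for illustration, not details from the conversation.

```python
from statistics import mean


def should_retrain(baseline_errors, recent_errors, tolerance=0.10):
    """Flag a retrain when recent mean error exceeds the baseline by more than
    `tolerance`, measured relative to the baseline (0.10 means 10% worse)."""
    baseline = mean(baseline_errors)
    recent = mean(recent_errors)
    if baseline == 0:
        # A perfect baseline cannot be scaled, so any recent error counts.
        return recent > 0
    return (recent - baseline) / baseline > tolerance
```

A check like this works for forecasting because every prediction eventually gets a ground-truth value to compare against, which is exactly the luxury that search lacks.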
Demetrios [00:10:33]: And so it went from being at StockX, working on more, as you mentioned, more traditional ML, we could say, and that's not without its problems. That's not without its difficulties, right? And then you jump into search and there's so many more nuances as you're mentioning, just thinking about how many right answers there can be for certain questions or how there could be no answer for certain questions, and how you make sure that when you're serving up these results, they're hitting on the most relevant topics is such a fascinating problem. And it almost is like, I think one thing for me with search, which is really hard to think about, is like, most of the time when we think about search, nine times out of ten it's like, oh, but that's a solved problem, right? That was like the '90s with Google. Didn't they figure that out already? That's all done. How is it still a difficult problem? And I imagine you were in the weeds figuring out day in, day out what you're doing with search, but then also adding LLMs to the mix makes it even more fun.
Sam Bean [00:11:48]: It does. I think it actually makes the problem simpler and harder at the same time. It's one of those kind of. You have another. The thing about search is there isn't like a model. You're not like, I've got my search model. Search systems are usually comprised of multiple models, right? You have an intent detection model, a query rewriting model. You have your actual retrieval models.
Sam Bean [00:12:13]: If you're using semantic search, you have reranking models, you may have filtering models to remove irrelevant results. You're talking about five to ten models to run a search system, and then you add on top of that an LLM. And those have their nuances and subtleties around training and deploying. But what they kind of do is they kind of give you a lens to how the user is going to use these search results. And so while it might be really hard to figure out the right or wrong answer for a search query, it's actually significantly easier to figure out the right or wrong answer for just a question-answer pair out of your traditional SQuAD v2 or HotpotQA, because they come with a right answer. And you can guess this right, you can guess it wrong. There might be different versions of right or wrong, and we can totally get into all of that. But if you have this kind of end-to-end question answering setup at the very end of your pipeline and you have metrics for that, like how many of these questions can I actually answer correctly with my RAG setup? Then you can actually start to work your way backwards towards the more sub-models or subsystems within your search, and then you can figure out, like, hey, if I tweak this a little bit, does my LLM actually answer more questions right or wrong? That's a very easy thing to understand as a human being and to actually track your progress.
Sam Bean [00:13:47]: And so yes, you have this new model, and yes, it has its own kind of special latency requirements, and it has streaming tokens, which is a different kind of paradigm than maybe some request-response servers were equipped to handle. But at the same time, again, there's this lens that's exposed that kind of gives you this insight into the motivation of your search system, which I think is really powerful. And for people who aren't trying to tie those dots together, I think you're really missing out on an ability to give yourself a really good vantage point to kind of see, get some customer sympathy, because you're actually seeing the end result of what your search system is going to do. Will this actually help a person? If it can help an LLM answer questions better, you would expect it to be able to help a person answer questions better. And being able to make progress on that is really rewarding. And it kind of gives you a really good grounding, to borrow that term from LLMs, of how you're going to measure your progress. Very clean, very clear, easy to communicate. Yeah, that I found to work really well.
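The workflow Sam outlines here (fix an end-to-end question answering metric, then tweak one upstream component and see whether the number moves) can be sketched in a few lines of Python. This is an illustrative sketch, not You.com's evaluation code: `pipeline` is any hypothetical callable that maps a question to an answer string, and the normalization loosely follows the lowercase, strip-punctuation, drop-articles convention used in SQuAD-style exact-match scoring.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, and squeeze whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_rate(pipeline, qa_pairs) -> float:
    """Fraction of questions whose answer matches any gold answer after normalization.

    `qa_pairs` is a list of (question, [gold_answer, ...]) tuples.
    """
    hits = 0
    for question, gold_answers in qa_pairs:
        predicted = normalize(pipeline(question))
        if any(predicted == normalize(gold) for gold in gold_answers):
            hits += 1
    return hits / len(qa_pairs)


def compare(baseline, candidate, qa_pairs):
    """Score two pipeline variants on the same questions; positive delta favors the candidate."""
    base = exact_match_rate(baseline, qa_pairs)
    cand = exact_match_rate(candidate, qa_pairs)
    return {"baseline": base, "candidate": cand, "delta": cand - base}
```

In practice `baseline` and `candidate` would be the same RAG stack with one sub-model swapped (a different reranker, say), so a change in `delta` can be attributed to that single change.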
Demetrios [00:15:04]: It's like if this information that I'm gathering is helping me understand my question, nine times out of ten, it's going to help the LLM that you're feeding it to. Yeah, that's a great insight. So the other thing that you've mentioned, and I don't want to skip over it too much before moving on, is how expensive it is to get human feedback and get human almost like evaluation on this or just humans telling you anything. I know from a survey that we conduct in the MLOps Community, there's thousands of people in the MLOps Community ecosystem. I think we tallied it up and there's over like 60,000 or 70,000 people that have in one way, shape or form interacted with the MLOps Community. For the survey that we do, we get like 150 responses. And I'm stoked.
Demetrios [00:16:00]: And so you're talking about how expensive it is and how resource intensive it is to get that kind of information back from your users. So what are some workarounds that you found that you don't have to almost lean on the goodwill of the users so that you can still make sure that you're doing the right thing, but without having to ask for that goodwill?
Sam Bean [00:16:27]: Yeah, I think it's a really good question. I don't know if there is a shortcut to get your first batch. For the first batch of feedback, I don't think there's shortcuts around going and sourcing it from people: rolling your sleeves up, talking to your users, talking to your customers, and trying to put different answers in front of them for the questions that they're trying to solve in their day-to-day lives and just directly collecting it. To your point, it's extremely expensive. And so the methodology that I've found works really well is that you create this, like, I almost think of it like a planet. Your core is your human-labeled data. That's like what's powering everything else. And again, there's no shortcuts around that.
Sam Bean [00:17:26]: I will go into a spreadsheet and I will label my data. And so the sad truth that some people may or may not be trying to avoid is that you've just got to go start. You can kind of spin yourself out a little bit and talk yourself out of going and getting feedback, because there's all this stuff around synthetic data. There's all these bootstrapping kind of self-improving systems that are coming out. But I think going and getting an actual labeled data set for yourself is where you start. We can use like an intent detection system as an example. Just going and getting 100 queries, 200 queries, and then hand labeling them. And then you sort of use that as your bootstrap set to either align another model, maybe a small model, or you just use it to prompt, like, a larger oracle model to generate more things that are of this ilk. And so I think that you begin to insulate the core of your planet with some synthetic data, and then depending on if the synthetic data hurts or helps, that'll kind of tell you whether or not you have a sufficient amount of your actual real golden data to get started.
Sam Bean [00:18:48]: And if you don't, then you can go and you can increase that initial set a little bit until you start to see that your synthetic data is close enough in distribution such that using it to train a model is still improving your eval scores, whatever you're using, F1 or perplexity or whatever. And so I think that there's no way around getting hand-labeled data. Nothing beats it. Nothing's ever going to beat a human being coming and saying what the right answer is. But you can use that to create much more sophisticated and much more close-to-distribution synthetic data once you have that starting point, and then as you start to see that you can create smaller and smaller models using more and more data, then you can kind of go on this trajectory where you start off with this very big oracle model and this very small amount of hand-labeled data. And then as you kind of train smaller and smaller models, your data is increasing and increasing. And so you have this kind of trade-off where, as your data increases, you can get away with smaller models, which means you could create more synthetic data, which means you can use even smaller models.
Sam Bean [00:20:03]: But you have to make sure that as you're generating more and more of that data, that you're not generating a bunch of stuff that's out of distribution. This is the "less is more" alignment paper, right? That paper is not about data set size. That paper is about data set quality. And I think that that's something that's largely overlooked. Oh, you only need 100 examples? Well, no, you need 100 really good examples. And if you have 100 really good examples, you shouldn't show your LLM anything else, because that's enough to align it. But what I've seen is that once you've aligned that, it gets a lot better at synthesizing new data, which you can then use to make an even smaller model good at capturing that distribution. So that's, I think, the journey that I've seen be really successful.
Sam Bean [00:20:49]: I wish I had better news for the people who are like, I am never going to hand label data. But I think that's just kind of like it comes with the territory. If you want to work in machine learning, you want to be a professional machine learning engineer, you're going to label some data. Sometimes it just comes with the gig.
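Sam's test for "does the synthetic data hurt or help" comes down to scoring two models on the same held-out, human-labeled set: one trained on golden data only, one trained on golden plus synthetic. A minimal Python sketch of that comparison follows, using macro-averaged F1 as the eval score. The function names and the `min_gain` threshold are assumptions for illustration; training the two models is left out.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def synthetic_data_helps(gold_labels, preds_gold_only, preds_with_synthetic, min_gain=0.0):
    """Compare two models on the human-labeled eval set only.

    Returns (helps, f1_before, f1_after), where `helps` is True when adding
    synthetic training data raised macro-F1 by more than `min_gain`.
    """
    before = macro_f1(gold_labels, preds_gold_only)
    after = macro_f1(gold_labels, preds_with_synthetic)
    return after - before > min_gain, before, after
```

If `helps` comes back False, that is the signal Sam describes for going back and growing the hand-labeled core before generating more synthetic data.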
Demetrios [00:21:03]: Funny that you say that too, because at one of our first virtual meetups that we had back in 2021, one of the guests talked about how they would have labeling parties at work. So, like at 5:00 on a Thursday, instead of everybody going and having a beer, they would have the party at the office. They would have their beers or whatever it was, their social hour at the office. But they would all be labeling data. And so they would get the power of numbers and just be like, all right, here's how we're going to label data. And everyone from the receptionist to the ML engineer was labeling the data. And I thought that was an interesting way of going about it because you make it fun and you also get the power of numbers. So now one person doesn't have to label 100 things.
Demetrios [00:21:55]: Maybe they just have to label 20 while they're chatting with friends. And I don't know how the quality is on these. You might want to check as the night goes on what the labeled data turns into. But that was one. And then on the other thing that I wanted to just mention was that it feels like what you're talking about reminds me of the supply and demand curve, and the price is just right there in the middle where supply equals demand. It's like what you're showing is, yeah, in the beginning you're going to have this big model and a little bit of data, but then eventually it can switch and you can have a lot of data in a smaller model, and you can just kind of like watch as one goes down and the other goes up.
Sam Bean [00:22:39]: In a way, it's funny because our initial intent model at You.com was this DistilBERT, standard transformer model. 100 million parameters, nothing to write home about. But as soon as we started trying large language models, we were able to beat that model just with a few-shot prompt and ten lines of Python. It was like, okay, well, there goes that six months that we spent building that fully featured intent training pipeline and deployment and stuff. But then we were like, okay, but now we can actually make this model, this large language model, this few-shotted model, much smaller, because now we're going to use that to label data. But if you think about that towards its logical conclusion, you're going to converge back to the DistilBERT model that we started with. Like, the logical end of that is that you're going to end up back where you started almost, but you're probably going to have a large amount of synthetic data that you've created, and that's going to allow you to hopefully capture that same performance that you got with few-shotting the oracle model with the model that you were originally using. So hopefully you end up ahead still, but it's this very strange kind of acrobatics that we ended up doing.
Sam Bean [00:24:08]: Interesting times.
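To make the "few-shot prompt and ten lines of Python" idea concrete, here is a hedged sketch of using an oracle model to label queries for a smaller intent model. It is not You.com's code: the prompt wording is invented, and `oracle` is a hypothetical callable (prompt string in, completion string out) standing in for whatever LLM client you use.

```python
def build_few_shot_prompt(labeled_examples, query):
    """Render hand-labeled (query, intent) pairs as a few-shot prompt ending at the new query."""
    lines = ["Classify the intent of each search query."]
    for example_query, intent in labeled_examples:
        lines.append(f"Query: {example_query}\nIntent: {intent}")
    lines.append(f"Query: {query}\nIntent:")
    return "\n\n".join(lines)


def label_with_oracle(oracle, labeled_examples, unlabeled_queries, allowed_intents):
    """Ask the oracle for one intent per query and keep only answers in the allowed label set."""
    synthetic = []
    for query in unlabeled_queries:
        prompt = build_few_shot_prompt(labeled_examples, query)
        answer = oracle(prompt).strip().lower()
        # Drop off-label generations rather than train a small model on them.
        if answer in allowed_intents:
            synthetic.append((query, answer))
    return synthetic
```

The returned `(query, intent)` pairs are the synthetic training set for the smaller model; filtering against `allowed_intents` is one cheap guard against the out-of-distribution generations Sam warns about.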
Demetrios [00:24:09]: Did you ever hear that paradigm where, I remember it was like, okay, if you have the finish line and then the car, each time that you move the car, you just, like, halve its distance to the finish line, and then you never actually go across the finish line, right? Because you're just halving the distance to the finish line, so you can never fully go through it. And it reminds me of that in some way.
Sam Bean [00:24:39]: I think there's like a Greek fable about that. It's like Archimedes' Arrow or something like that. One of those philosophers was like, if I shoot this arrow, it's never going to hit someone. It always has to go half the distance. Surprise. It'll hit the guy. It'll hurt. But, yeah, it is one of those things where it feels like the finish line ends up being where we started, almost like a loop.
Sam Bean [00:25:12]: But it was an interesting journey, and I think it illustrates kind of just the type of step function that we were seeing at the end of 2022. Going into 2023, things had just fundamentally changed in a way that I'm still not confident that we appreciate how much things are going to be different in five to ten years. But you know what? We should all enjoy this time because we're probably going to look back on it and be like, that was a really special time to be in machine learning. That was pretty sweet. So anyone out there listening, just find ways to enjoy this time in history because you're never going to get it back. Right?
Demetrios [00:25:51]: So talk to me a bit about Rewind and what you're doing there, because I think there's a lot of stuff that we can dig into with how you are now leveraging AI and also how you've built RAG systems. I know that we wanted to fully dive into production RAG and how that's very different than what we see on all the social media networks and how we have to be very aware of some caveats before jumping into creating these production RAG systems. So let's just start with Rewind. I know what it is, but it's probably worth telling people out there what it is.
Sam Bean [00:26:35]: Yeah, sure. So Rewind is fundamentally a company that is focusing on giving people superpowers via augmenting memory. What this means in practice is it's an extremely powerful data collection platform for those people who are willing to fork over a lot of their data, a lot of everything they see and hear. We have full microphone audio capture. We have full OCR on top of your screen capture, and we convert everything to text, and we index all of that and try and put it to use for people. And I think for those out there who are in the camp that's just like, take all my data, I don't care. If you can give me like 30 minutes back, like an hour back, just take the data. I don't care.
Sam Bean [00:27:33]: I just want the time back. For those people, and they are out there, there's a good amount of them, we are offering the solution where we're going to try and make everything you see and hear accessible by an LLM. We're going to try and create these proactive flows that are going to one day, hopefully even interrupt your flow and be like, hey, I did some work over here and this is what I think about the thing you're working on, and maybe even give you just someone to kind of look over your shoulder and check your math. And very similar in spirit, I think, to what I expect out of like a pair programmer, someone to go look things up for me while I'm typing code and maybe talk things through with me, go do some research, maybe go write some tests, exploratory code. Those are the types of things that I traditionally think of as like a passenger, if you're thinking of driver and passenger in pair programming. And being able to create something like that, that never gets tired, never has to go take a bio break, is just there with you until the end of the night, it's going to be a really powerful experience, I think, for people who are used to working collaboratively with humans. And especially if you're getting used to working collaboratively with an AI in your day-to-day kind of workflows, this is going to be something that is going to, I think, really be a step function in productivity for you.
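The capture flow Sam describes (audio and screen captures converted to text, then indexed) can be pictured with a toy inverted index. This Python sketch is purely illustrative and says nothing about how Rewind actually stores or searches data: the `Capture` record, the `"ocr"`/`"asr"` source labels, and the whitespace tokenizer are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    timestamp: float  # seconds since epoch
    source: str       # "ocr" for screen text, "asr" for transcribed audio (illustrative labels)
    text: str


class CaptureIndex:
    """Toy inverted index mapping each token to the captures that contain it."""

    def __init__(self):
        self.captures = []
        self.postings = defaultdict(set)

    def add(self, capture: Capture) -> None:
        position = len(self.captures)
        self.captures.append(capture)
        for token in capture.text.lower().split():
            self.postings[token].add(position)

    def search(self, query: str):
        """Return captures containing every query token, newest first."""
        tokens = query.lower().split()
        if not tokens:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(
            (self.captures[i] for i in matches),
            key=lambda c: c.timestamp,
            reverse=True,
        )
```

Because both capture types end up as plain text with a timestamp, one index can answer queries across what was seen and what was heard, which is the "text as a proxy" design Sam contrasts with multimodal models later in the conversation.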
Demetrios [00:29:07]: And so right now, it's taking everything that you're doing and taking everything that comes in or you interact with out on the Internet or that you speak with, whatever it may be, whatever form you interact with your device, and it's able to index that. So the search is on point, and then it's also able to give you back answers. First off, I think the burning question on my mind is multimodal RAG. Are you all doing that already? Because that's like the buzzword of the year, right? And tell me about it, what is going on there?
Sam Bean [00:29:50]: Well, I can't say too much, but we were talking about logical conclusions before. Think about the logical conclusion of OCR and ASR on top of those mediums and then converting it all to text and then having some text model that operates on those as kind of a proxy. I think most people could see that a model that can operate directly on speech and directly on images, instead of having these two models that have to kind of translate for the text model, is going to be a more performant system because you don't have that extra translation layer in there. You're operating directly on the input signals and therefore you can fit features directly to them instead of these proxy features on top of the text. So if you're keeping track at home, I think it's pretty clear that a multimodal approach is going to be really beneficial for the likes of Rewind. Can't say where we are in development, but I think it makes a lot of sense for us to invest in that technology. Yeah.
Demetrios [00:30:59]: All right, real quick, some words from our sponsor, Zilliz, and then we'll be right back into the program. Are you struggling with building your retrieval augmented generation, aka RAG, applications? You're not alone. We all know tracking unstructured data conversion and retrieval is no small feat. But there's a solution available. Now introducing Zilliz Cloud Pipelines from the team behind the open source vector database, Milvus. This robust tool transforms unstructured data into searchable vectors in an easy manner. It's designed for performance. Zilliz Cloud Pipelines also offer scalability and reliability.
Demetrios [00:31:47]: They simplify your workflow, eliminating the need for complex customizations or infrastructure changes. Elevate your RAG with high quality vector pipelines. Try Zilliz Cloud Pipelines; you can find the link in the description. Let's get back into this podcast.
Demetrios [00:32:07]: Yeah, 100%. And so again today, you get to take in all of this information, you get to index it and be able to search it really easily. And I know, I think the thing that I loved whenever I look at Rewind is like, oh, remember that time that I was reading that thing that had that stuff in there? I remember this emoji that was in it, but I don't remember anything else. And I remember vaguely that it was, like, about this, and I can just give all that information to Rewind. Or ideally, one day, as you're mentioning, Rewind will be able to ask me, like, yeah, so tell me more about it. And what else was it? Something along these lines. And this happened to me just the other day because I've been quoting this paper from these folks from the University of Prague on how the data that you send to an LLM, it inadvertently leaks that data. And so these folks at Prague did this whole paper, and they put it out, and I was like, cool.
Demetrios [00:33:14]: I need to really synthesize this paper and understand what's going on here, because I'm quoting it a bunch, but I don't really know what is going on. So I tried to find that paper again, impossible. Impossible. And I remember it had leaks in the word and it had an emoji, but I was like, was it a leak emoji? So I'm going back and looking for this leak emoji. There is no leak emoji, first of all. It was like, it's either cabbage or broccoli. So I'm looking through my Google search history for cabbage or broccoli, trying to filter. I spent way too long doing this, dude.
Demetrios [00:33:51]: Literally way too long. And then I'm searching arXiv for University of Prague papers with leak in the title, and then I'm like, is it leek? Or is it L-E-A-K? Disaster. I can't find that damn paper for the life of me. If I had Rewind, I assume that would be much easier, and I could just ask Rewind the questions directly.
Sam Bean [00:34:15]: What was this paper about? Yeah, data leakage. Yeah. I think that creating those linkages automatically is somewhere where we're going to start to really shine. And it's not all there yet, but if you imagine even having this conversation, imagine being able to have this conversation about the paper that you forgot and describing it and having it be able to go into your memory, your virtual augmented memory, and be like, this is the paper. And then automatically create a linkage through time for you that will actually create this ability to trace your steps back through time and your thought process. Those are the types of things that today human beings just forget, right? There's no tool that's going to actually tell you unless you have really great note-taking, which, I mean, not many of us do anymore. We just are going to forget, and it's just going to be like, how many times out of the day are you like, oh, it must not have been important? Imagine if every single time Rewind was able to go back in time and find those things for you and be like, this is it.
Sam Bean [00:35:27]: And then that actually related to this thing even further in the past, which related to these two or three events, which creates this lineage of almost thinking and cognition for you, which is, I think, for people who have a lot of knowledge work to do, and they are trying to keep a lot of stuff in between their ears, that type of read-only memory, so to speak, is going to be really valuable to them because there's just too much for us to remember every single day. You know this, you run the MLOps podcast. How much stuff comes out every single day about machine learning nowadays?
Demetrios [00:36:05]: Dude, seriously. And just thinking about all the people that I interact with on the Internet itself too, and so many times it's like somebody was just telling me about that and I feel kind of bad sometimes these days because I'm like, who just told me about that? I swear. And it's gone, like you're saying. And so to be able to get that back is at least a win for attribution. So I don't go around plagiarizing other people's thoughts as much as I am now. But one thing that I was also wondering is, are you looking at signals behind the scenes? As far as I've had this tab open, like for example, this paper that I wanted to read, never got around to reading it for some reason. I updated Chrome or something and then it didn't come back up and then I lost it forever. Right.
Demetrios [00:37:00]: Are you also looking at signals as far as, like, this paper has been open as a tab for five days or ten days. And so that gives you a bigger window or it gives you a higher weight on things. Break down how that looks too. Are there those types of features that you're taking into account?
Sam Bean [00:37:20]: So I think that, in effect, I think we are because we do have ways to kind of handle time-specific queries. But I think that if you're asking like, hey, are we using behavioral signals for re-ranking results, are we actively using metadata about tab lifecycle or whatever as features, I don't think we're quite there yet. I think that there's a lot to master in text, and one of the things that I see as a mistake in industry today is like trying to convince yourself that you don't have to master text for X, Y or Z reasons. There's lots of companies with a vested financial interest in convincing us of that, but I think that at the end of the day, that is something that you really want to master via the text medium. And I think that trying to go straight towards we're going to use text and we're going to mix in these behavioral signals as well, and we're going to create this composite semantic and behavioral feature set. I think that if you're doing that and you've got a team of people, big team of people, go for it. I think for most people, if you are not comfortable that you've mastered text, that's what you should be doing. Master text, get really good at that.
Sam Bean [00:38:58]: And once you feel like you've mastered that and you can train and deploy your own text models, then I think you start to create your ensemble models. Right, but I wouldn't start with an ensemble until I've mastered the models that make up the ensemble first. Otherwise, it's going to be tough to explain. You're going to end up with something that isn't quite what you're looking for, doesn't do quite what you're looking for if you don't understand how the ensemble is actually compositing those predictions.
Demetrios [00:39:27]: Yeah. In other words, it's like the content inside of the document is more important than the metadata around that document. In a way. Like, I don't care if you've looked at it five times in the last days, is it relevant to the question that you're asking me right now? Is the actual content relevant? And then later, if I can, without a shadow of a doubt, figure that out, then I can start adding in these features where it's like, okay, cool. It's the question that you're asking. This document has relevance, and it's what you most recently have been viewing. So that's probably where you're going with this question.
Sam Bean [00:40:13]: And I think that it's going to be different for people who are operating in small domain search versus web domain search for the You.com and the Googles and things of the world. Once you're operating at web scale search, there are popularity signals, domains that are in the top 100, Cloudflare clicked lists, things like that are extremely important because you search for the word attention and you can get stuff from a variety of different subjects. There's different ways to use the word attention nowadays. And then I think if you're bringing in behavioral signals and even personalization signals, like, hey, Demetrios is a machine learning person. He probably means attention mechanisms. That's when I think that those types of signals, the behavioral signals, really come in and start to shine. But if you do not have tens of billions of documents you're searching over, you probably don't have the problems that warrant bringing those signals in, unless you're really good at them and you have really clean signals. But then you're probably at a point where you don't even need a machine learning model.
Sam Bean [00:41:21]: You could probably put business logic on top of that representation of the data and be on your way, right?
Demetrios [00:41:27]: Yeah, it's like trying to over-engineer it.
Sam Bean [00:41:30]: And that's what you see all around the industry. Everything is so complicated, everything requires five different systems, and it's become really hard to figure out what an MVP LLM system looks like nowadays, because if you naively look around, wow, I need like 10,000 lines of code to run one of these. And it so diverges from the true spirit of the LLM from late 2022 to early 2023, which was, oh my gosh, you have to do so vanishingly little work to get these things to work. Like, the whole point of the anecdote around the intent model was, oh my gosh, we spent six months on this fully featured NLP training system, and then we were able to just blow it away with ten lines of Python and a prompt.
Demetrios [00:42:27]: Oops.
Sam Bean [00:42:30]: If you are creating more work for yourself than you are saving using an LLM, then you're probably using it wrong. Like, LLMs are supposed to make things simpler, they're supposed to lower the barrier of entry, and they're supposed to short circuit you to being able to solve more complicated problems. If you are not getting those benefits, then you are not getting what you are legitimately paying for. And they are not very cheap things to pay for. So make sure you're getting what you're paying for. And what you're paying for is not a huge convoluted MLOps system that's required to operate these things. Start simple, start getting business value with it, and then you'll start to figure out where your lifecycle breaks down and where maintenance and operations are going to start to seep in and require some thinking.
Demetrios [00:43:22]: Well, let's go into this, because I know that you had spoken to me about this very thing before we hit record, which is like optionality around when you're doing evaluation, when you're doing search, you don't necessarily need to chase the hype. And just because one design pattern is using LLMs as a judge these days doesn't mean by any means that you should feel like you have to do that, and you're not going to be able to get business value out of your application if you're not doing that. Or the same goes for using neural networks for search. Right, and so give me your take on this, because I think there's some really interesting facts here that you're bringing up.
Sam Bean [00:44:15]: Yeah, I think that if you are ever in a situation where someone is trying to make the case that "add this neural network to your stack, it'll make things so much simpler," they're selling you something. You are being sold a bill of goods at that point, and there might be heat to that smoke, and there might not be. I would take all of that with a grain of salt. I think that your tried and true kind of counting algorithms are where I would start with any of these things. Word frequencies, they can get you pretty far in search. TF-IDF can get you pretty far. Standing up like a standard single-node OpenSearch or Elasticsearch, like one database, and then using that instead of a neural network and delegating everything to BM25 inside of Lucene, that'll get you really far. It's also simpler than operating your own neural network, and I think the same goes for evaluation. It's the same thing, right? Add this neural network, it'll make evals easier.
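To make the "counting algorithms" idea above concrete, here is a minimal in-memory sketch of BM25-style scoring. It is an illustration only, not Lucene's implementation and not anything Sam describes running: the `BM25` class, the whitespace `tokenize` helper, and the `k1=1.2`, `b=0.75` values (commonly used defaults) are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


class BM25:
    """Tiny in-memory BM25 scorer (illustrative, not Lucene's implementation)."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(toks) for toks in self.doc_tokens]
        self.n_docs = len(docs)
        self.avg_len = sum(len(t) for t in self.doc_tokens) / max(self.n_docs, 1)
        # Document frequency: number of documents containing each term.
        self.df = Counter()
        for freqs in self.doc_freqs:
            self.df.update(freqs.keys())

    def idf(self, term):
        n = self.df.get(term, 0)
        # The +1 inside the log keeps IDF non-negative for very common terms.
        return math.log(1 + (self.n_docs - n + 0.5) / (n + 0.5))

    def score(self, query, index):
        """Sum the BM25 contribution of every query term found in the document."""
        freqs = self.doc_freqs[index]
        doc_len = len(self.doc_tokens[index])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            # Length normalization: long documents need more hits to score well.
            norm = tf + self.k1 * (1 - self.b + self.b * doc_len / self.avg_len)
            total += self.idf(term) * tf * (self.k1 + 1) / norm
        return total

    def search(self, query, top_k=3):
        """Return (doc_index, score) pairs, best first, dropping zero scores."""
        scored = [(self.score(query, i), i) for i in range(self.n_docs)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(i, s) for s, i in scored[:top_k] if s > 0]


docs = [
    "attention mechanisms in transformer models",
    "how to pay attention in class",
    "data leakage in machine learning papers",
]
bm25 = BM25(docs)
print(bm25.search("data leakage"))
```

Everything here is word counting plus a logarithm, which is why a single search node running this kind of ranking is easy to reason about when results look wrong.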
Sam Bean [00:45:24]: Will it? Is it going to be simpler than me just converting both the predicted answer and the ground truth answer into a bag of words and then just counting the number of words similar? Probably not. I think that I've largely gotten pushback on those approaches because of it's not quite what you're looking for, or it's going to be a floor on the performance, because you're technically actually penalizing the models for certain failure modes that an LLM may do better at, but I think that's okay. As long as you have a floor and you don't have a ceiling. Like you're not overestimating your performance, as long as you're underestimating your performance, you're okay. And I think that if you're in a situation where it's like I should use this neural network, or I should use this very simple counting algorithm that will give me a lower bound on my performance. Just start with a simple thing. Start with a lower bound. You can make progress on a lower bound, you don't need an exact thing.
Sam Bean [00:46:25]: If your lower bound is increasing, your F1 scores, the number of words that you're getting correct is increasing, then you're probably making progress. It may not be progress on the perfect metric, but you don't need the perfect metric. You need a metric, something that you can understand and that you can actually synthesize a technical strategy around. And if you have a very complicated evaluation system, it becomes challenging to synthesize a technical strategy and be like, well, this says that this is unfactual. And there's these metrics that are kind of human understandable, but also they don't quite correlate across data points, if you know what I mean. Like the factuality of a ten here might be an eight here, when in reality they're both pretty much totally correct. And I think that getting away from these very simple baselines is where I see most people going wrong. Your first baseline should be super easy to beat because it should be very simple.
Sam Bean [00:47:27]: That's why it should be your first baseline. You can go get a little dopamine hit, you're going to beat your baseline, but at least you know you're making progress and at least you have some metrics and some guardrails to tell you that this is what progress looks like. And I think that if you aren't starting there, then you are running the risk of ending up with a system that kind of escapes your ability to understand because of the complexity that you're adding in piece by piece, versus just having a very simple system that when you can't make progress anymore, then you may be in a position where your metrics need a little bit more of a nuanced definition, that you may need to find more other metrics to measure besides a simple word counting intersection over union count. But you should do that then. Not at the beginning. That would be my main takeaway if I was starting today.
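The bag-of-words evaluation described above can be sketched in a few lines. This is one possible reading of "counting the number of words similar" and of the "intersection over union" count, not the exact metric used at any company mentioned here; the function names `token_f1` and `token_iou` and the whitespace tokenizer are assumptions for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; deliberately simple."""
    return text.lower().split()


def token_f1(predicted, reference):
    """F1 over overlapping words, counting repeated words at most as often
    as they appear in both answers."""
    pred = Counter(tokenize(predicted))
    ref = Counter(tokenize(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def token_iou(predicted, reference):
    """Intersection over union of the two answers' unique words."""
    pred = set(tokenize(predicted))
    ref = set(tokenize(reference))
    union = pred | ref
    if not union:
        return 0.0
    return len(pred & ref) / len(union)


print(token_f1("BM25 lives inside Lucene", "BM25 is inside Lucene"))
print(token_iou("BM25 lives inside Lucene", "BM25 is inside Lucene"))
```

A correct answer phrased with different words scores low here, so the metric behaves as a floor: it can underestimate quality, but it is hard for it to overestimate, which is the property the conversation argues for.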
Demetrios [00:48:34]: Well, if I'm understanding this correctly, and I think this is fascinating because basically you're saying we can get so caught up in trying to make the system robust that because of that goal of having a very robust system, we get lost and we can't debug and we don't know where things are going wrong or we can't really even interpret what is going on. So start with simple metrics, push those to the limits. Once we feel like we've gotten to the natural limit on that, it's like, great. This is our very clear, very understandable metric that we can go off of. And now let's add a new piece to the puzzle of this system that we're creating, and let's figure out more metrics that we can go off of. And each step along the way, you're adding that robustness to it, and you're adding the complexity, but you're adding in the complexity in a way that you can understand. So when you're building this big Lego base, you know which piece is which and what it does and what metrics it is tied to. And you can always, in effect, roll back if you need to, as opposed to, like, we build this gigantic system and now we got to debug it, but we're not really sure what is causing the problem.
Demetrios [00:50:05]: And how do you roll back from there? It's like if you go from zero to 100, then you can only roll back to zero, as opposed to going from zero to 10, 10 to 20, then 20 to 50, that type of thing.
Sam Bean [00:50:18]: I could never have put it better myself. I think that motivation is a really key word here. If you're actually making those changes in a stepwise manner, you have the motivation. You understand the context of adding them. You know what it's actually doing for your system and what the capabilities are added in. If you're just adding all that stuff from the beginning, I think what you're fundamentally lacking is the motivation for a bunch of that complexity. And so you may end up with things that you don't entirely understand their purpose or what their place is because you didn't have a real motivational thrust to add it. Your motivational thrust was Twitter, and that's not where you should be getting your motivation for adding complexity to your software.
Sam Bean [00:51:03]: I don't know if I'm bursting any bubbles here, but. Sorry.
Demetrios [00:51:08]: Yeah, it's like, oh, I heard I would be able to get my next job if I just add this gigantic system in there.
Sam Bean [00:51:17]: I got vector stores. Can I get a job now? It's like, yeah, no. Solve real problems with natural language processing and you will get a job, no problem. But it's really hard to solve those real problems if you don't understand what you're building.
Demetrios [00:51:32]: Yeah, it's that understandable. Oh, my God. I don't even think that's a word. Understandability is what I was going to say. But just the understanding of what you're building is so crucial because you're starting with that foundation and you're making sure that you are clear on that foundation. And from there, everything else can be built. Just like we say this a ton. You can't have MLOps without a very strong data culture.
Demetrios [00:52:02]: You can't just jump straight from having nothing to like, okay, now we're going to do AI and think that you're going to have success, and so you want to have strong data, then you can build your ML on top of it, and then you'll see that success after. But this is very much in that vein of, you want to make sure that your foundations are strong before branching out and complicating things. And dare we say, over engineering it, dare I say?
Sam Bean [00:52:37]: Yeah. And to be fair, I'm a very simplistic engineer. I'm a simple guy. I like simple systems. And so this is obviously my bias. You can take it all with a grain of salt, but I don't think that most people will tell you that they've gone wrong in their career from being too simple and simplifying problems too much. I think people probably will tell you the opposite quite a bit. But I don't know a lot of engineers who'd be like, I just made it too simple.
Sam Bean [00:53:05]: Dang it. It was just too easy to understand and debug. Gosh, I don't hear that a lot. I think it's really seductive to do it in machine learning because it's so hot. But if you keep it simple, I think that people will have more success in real life, and I think that that will translate to success in their careers, because that's what companies are looking for. They're not looking for someone who can be really smart and say all of the smart AI things. They're looking for people who can solve real user problems.
Demetrios [00:53:41]: And now talk to me about production RAG in this same vein of trying to make it as simple as possible. I imagine you've seen on the socials that there are new terms coming out, like naive RAG, and that's like, where it's just, oh, it's naive because it's only these three steps. And you get your embeddings, you have your vector database, you throw it at the LLM, and then you're good. And on one hand, I see the point of saying, that's cool and all, but that's not really production ready. You can go places with it, but if you're trying to go and set up a really productionized system, you have a few places where that can fall flat. And so I'm wondering, from your experience building RAG systems, how do you see this and how does it all fit in with, okay, the ethos is start simple, then go complex, step by step. So naive RAG is a great place to start. Even though I imagine with that name, nobody wants to start with a naive RAG now.
Demetrios [00:54:57]: Like, we just killed it. Nobody wants to say that they're doing naive RAG. I'm doing, like, the most advanced complex RAG you can think of. But then how do RAG systems look in your eyes? And what are some gotchas that you found over the years?
Sam Bean [00:55:13]: Yeah, I think that.
Sam Bean [00:55:18]: You took the words right out of my head, or right out of my mouth. I think that it goes back to that motivation thing, right? I think that a naive solution, you can call it naive. I think a naive solution is a perfectly good place to start. And I think, again, if you're going to add complexity, you really want to have that motivation. And the way that you get motivation is by having metrics and benchmarks that can actually create that motivation for you. And so again, we use this example of trying to make progress on very simple metrics. I think that you can kind of translate that even to architectures. I think that if you have a RAG system and then you have a more advanced RAG system, if you cannot be very demonstrable with what the improvements are outside of cherry picking.
Sam Bean [00:56:12]: And I know it's so tempting to cherry pick, but those are selected examples that you're seeing regularly you need to have set up. And your setup could, like your naive RAG, could be generate random text. And that's your generate function, right? Just output random text. It's going to be a really easy baseline to beat with a naive RAG system, because you're competing against random text, you're competing against noise, in effect. And so I think that it's perfectly reasonable to move up the complexity and add knowledge graphs, add these very complicated trees and graphs of thought and all of those agentic search systems. I mean, I've built systems where I had agents, and those agents had actually access to my individual indices inside my database. And it would be like, I'm going to search this index, I'm going to take the results, I'm going to feed those into this, and build multi-hop reasoning agents. You can get arbitrarily complex, but I think that if you aren't motivating those examples and you aren't actually able to show, these are my metrics before and after, and here are a really good representative bunch of queries that we were really bad at, but now we're actually really good at.
Sam Bean [00:57:37]: And look, we didn't actually degrade performance in these other spots. I think then you can motivate your more advanced techniques, and then I think you can actually go and figure out which of them. The really nice thing about them is it becomes very easy to vet what's smoke and what's fire, because the things that are fire will make your metrics move. And then you'll be like, that's actually legit. Let's keep that. And then you'll add this other thing and be like, that didn't do anything. We'll delete those hundred lines of code. Right.
Sam Bean [00:58:07]: All this needs to be motivated. And I think that if your motivation is the name of the system, then, I mean, that's one way to do it. But you don't even need RAG to begin with as your first system, is my only point. Have something you can make progress on and have something that's easy to beat to begin with. And then motivate your changes and your additional complexities through that. You may find that you end up at a system that is not represented at all on social media or anything because you actually did the work to create a system that solves your business's problems and only your business's problems, because you're using your business's data to actually judge your systems and improve your systems. And if you do that, then you're going to build models that are very special for your business, that solve your business's problems better than anyone else's models. And you're going to have the simplest system that does it, and you're not going to have any cruft.
Sam Bean [00:59:05]: And then you're going to be like, everyone's going to love you. People will be like, oh, my gosh, give me your autograph. Trust me, they'll love it.
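The "random text baseline, then show metrics before and after" workflow described above can be sketched as a small harness. This is a hedged illustration under stated assumptions: `overlap_score`, `random_text_baseline`, `evaluate`, and `compare` are hypothetical names, the toy dataset is invented, and the "candidate" is a lookup-table stand-in for whatever real system is being tested.

```python
import random
from statistics import mean


def overlap_score(predicted, reference):
    """Word-level intersection over union; a deliberately simple metric."""
    pred, ref = set(predicted.lower().split()), set(reference.lower().split())
    union = pred | ref
    return len(pred & ref) / len(union) if union else 0.0


def random_text_baseline(question, vocab, length=8, seed=0):
    """The easiest baseline to beat: ignore the question and emit random words."""
    rng = random.Random(f"{seed}:{question}")
    return " ".join(rng.choice(vocab) for _ in range(length))


def evaluate(generate, dataset):
    """Return per-question scores for a `generate(question) -> str` callable."""
    return {q: overlap_score(generate(q), answer) for q, answer in dataset}


def compare(before, after, tolerance=0.0):
    """Summarize a change: mean scores plus the questions that got worse."""
    regressions = sorted(q for q in before if after[q] < before[q] - tolerance)
    return {
        "mean_before": mean(before.values()),
        "mean_after": mean(after.values()),
        "regressions": regressions,
    }


# Toy data, invented for this example.
dataset = [
    ("where do bugs live", "bugs live in code"),
    ("what should a first baseline be", "very simple and easy to beat"),
]
vocab = ["alpha", "beta", "gamma", "delta"]

baseline = evaluate(lambda q: random_text_baseline(q, vocab), dataset)
answers = dict(dataset)
candidate = evaluate(lambda q: answers[q], dataset)  # stand-in for a real system
print(compare(baseline, candidate))
```

The `regressions` list is the part that guards against cherry picking: a change only counts as an improvement if the mean goes up and the queries that got worse are known and acceptable.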
Demetrios [00:59:14]: Then when you organize your labeling parties in the office, people will show up.
Sam Bean [00:59:20]: People show up. They sure will.
Demetrios [00:59:22]: They will understand the beauty of it and that will make total sense. Yeah. Going back to the motivation, it's just like, be very clear at why you're doing things. Don't just do it for the sake of doing them. Or as one friend, Flavio put it, he called it resume driven development.
Sam Bean [00:59:44]: I've heard that one, yeah.
Demetrios [00:59:46]: Just because I need to make sure that my next job is going to be a step up. I'm going to implement this crazy in-depth system here, and then I'm going to go and talk about it. Maybe I'll submit a paper or something so that I can show off how badass I am. So I could see that for sure. I wanted to switch gears a little bit and talk about your work with, I know you've contributed to DSPy, or as I like to call it, dipsy. What are your thoughts on that? It seems like you've had a few of your contributions actually merged. You've done some cool stuff with them. We've had Omar on here.
Demetrios [01:00:28]: I am a huge fan of what he's building, actually. It's been really cool to see how much attention it's been getting lately. But I also know that you've done stuff with LangChain, too, so I just kind of like, want to set the groundwork with that. You've done a little bit of, I guess you could say you've played around with both, and you're familiarized with both, and I want to get your feel and your take on these different tools that are out there.
Sam Bean [01:01:02]: Yeah. I think the thing that caught my eye about DSPy was that I felt that LangChain and LlamaIndex and some of those abstractions, while they were useful, I think, for organizing people's thoughts and organizing the discourse around LLMs really well. I always felt that DSPy brought a little bit more to the table in terms of what it was trying to accomplish. And that, I think, is by design in some respects. I think that LangChain is fundamentally about representing these processes in a way that is easy to understand and very translatable and makes it so that we can build a community around this and increase mind share, which is like a very important thing to do in this field. I think DSPy's motivation is slightly different. I think that the motivation around self-improving systems is what really caught my eye. There are even these kind of self-improving RAG search agents that DeepMind has put out papers on very recently.
Sam Bean [01:02:23]: DSPy really came and said, hey, I have an opinion on what a self-improving LLM is. My opinion is that we're going to create these student-teacher models and we're going to try and create a framework for this distillation process. That distillation process is, I think, something that is actually very challenging to understand and challenging to represent, I guess. And I think DSPy does a really good job of giving you code sort of as a blueprint for what a system would look like that could. Can you take a very large model and automatically search through prompt space to optimize that in the few-shot setting? Can I use that to actually synthesize data? Can I use that data to then actually train a smaller model that can do very similar things to what we were talking about before, things that I had actually seen work for a company, this distillation, and really taking a shot at creating a very rigorous way of representing that and orchestrating it. And that was, I think, what I saw as the future. These systems eventually will require very little input data. We have our core at the center of our planet that's heating everything.
Sam Bean [01:03:51]: We start with that, and then we basically create frameworks for these systems to self-improve. And there isn't much work that the human does after that. This work that we're doing to systematically create more and more data sets with smaller and smaller models, that kind of supply-demand curve that we were going over, that will all just be orchestrated by a system. Right. That whole process, which was done by a human being, will be done by a program. And that program, one of them could be DSPy.
Demetrios [01:04:22]: Fascinating. Yeah. And that's one thing that I liked about Omar's vision for this, was talking about from prompts to pipelines, and it's almost like you're echoing that in different words, like saying, hey, it's not just the prompts that we're fucking with here. We're trying to create the system so that we can put a little bit of data in on one side and get our distilled model out on the other side. And that whole pipeline is done in a way that we have confidence that it's optimized.
Sam Bean [01:04:56]: Yeah, I think that's a lot of what I've seen be really powerful. And when I was experimenting with it, my state-of-the-art numbers came from DSPy. And so I saw, again, this is one of those things where if you have really good metrics, you know what's smoke and you know what's fire. Because this system gave me state-of-the-art performance, and this other one didn't. And so that's a cop-out answer. But when you have metrics, they're all cop-out answers, aren't they? Because you're just like, dude, the numbers told me to. I'm really easy to be peer pressured by data. Made it better, dude.
Demetrios [01:05:39]: Well, that brings me to the next, probably the biggest question that I've heard people talk about when bringing up DSPy, which is, do you know anyone using it in production?
Sam Bean [01:05:53]: I don't know anyone myself. The work that I had been doing was still in R&D while I was at You.com, so I think that there are maybe a few more ergonomic things that are stopping that. I don't think it's anything to do with the paradigm. There are some things that are difficult to do right now, like DeepSpeed distributed training. There's a branch on DSPy which kind of does DeepSpeed, but you also would have to be running that DSPy on, like, a Ray cluster or some sort of Torch cluster that's actually running that and can run the DeepSpeed plugin. And so I think that there's a little bit of a disconnect there from where I was able to get DSPy working very well all on kind of third-party LLMs, but then it was time to move in-house and actually train our own. And I think for people who are on that journey, they're going to find that distributed training is going to be their larger upfront blocker to being able to do what they want to do, especially if they're trying to train very large context length models. If you're trying to actually train on 8k tokens of context, you're probably only going to fit one example per GPU, even if you're using not the biggest model in the world.
Sam Bean [01:07:27]: And so that's all to say that I think that for people who are on that journey, I would figure out your data and your infrastructure first. And once you have those set up, that's when I think you can bring something like a DSPy in to really supercharge those efforts and give you a framework to kind of couch all of your discussions in. And it just gives you this really good lens of looking through optimizing LLMs, but there's just that little ergonomic kind of hole there. And shout out to Omar, he built some really amazing things. I think that the vision he brings to bear is really differentiating in the LLM kind of framework space. He's got ideas, and they're good. So that's probably how I would do things. Shout out to Omar.
Demetrios [01:08:21]: Yes, that is very well said, dude. Sam, thanks so much, man. This was a ball, from the humble beginnings of this podcast, talking about your pinball hell that you have at home. And if anyone wants to go through the equivalent of the Navy SEALs hell week in pinball, I guess it is, hit up Sam, and you can talk more about how to become the pinball champion of the world. Dude, I had so many great insights from this. I think the biggest thing though, walking away is just like. And I'll be parroting this back, and I'll probably be saying this, I'm going to just tell you right now, I will steal it. Hopefully it's okay with you.
Demetrios [01:09:07]: I'll plagiarize it. I'll try and give you credit as much as possible. But the motivation factor. Know why you are creating each line of code? Know why you are creating each extra building block, and what motivates that is so crucial.
Sam Bean [01:09:25]: Code is baggage, right? Code is where bugs live. So you want less code. It's one of the things as you go from junior to senior. Like, I don't want more code. I want no code. How do I do this job with no code? And that's like the journey that you go on, I feel like. So if there's anyone out there who I've given a little push and they're going to maybe get to the simple way of thinking, keeping it simple, then I'm going to be a happy. I'm going to be happy.
Sam Bean [01:09:56]: So I appreciate you preaching the good, good word closer.