MLOps Community

Look At Your ****ing Data 👀

Posted Feb 18, 2025
# Data
# LLM
# Hyperparam
SPEAKERS
Kenny Daniel
Founder @ Hyperparam

Kenny has been working in AI for over 20 years. First in academia as an ML Ph.D. student at USC (before it was cool). Kenny then co-founded Algorithmia to solve the problem of hosting and distribution of ML models running on GPUs (also before it was cool). Algorithmia was an early pioneer of the MLOps space and was acquired by DataRobot in 2021. Kenny is currently founder and CEO of Hyperparam, building new tools to make AI dataset curation orders of magnitude more efficient.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

In this episode, we talk with Kenny Daniel, founder of Hyperparam, to explore why actually looking at your data is the most high-leverage move you can make for building state-of-the-art models. It used to be that the first step of data science was to get familiar with your data. However, as modern LLM datasets have gotten larger, dataset exploration tools have not kept up. Kenny makes the case that user interfaces have been under-appreciated in the Python-centric world of AI, and that new tools are needed to enable advances in machine learning. Our conversation also dives into new methods of using LLMs themselves to assist data engineers in actually looking at their data.

TRANSCRIPT

Kenny Daniel [00:00:00]: My name is Kenny Daniel. I'm the founder and CEO of Hyperparam and I take my coffee in frappuccino milkshake form.

Demetrios [00:00:11]: What is up, MLOps Community? We are back for another episode of the podcast that you know and love. I'm your host, Demetrios. Today we're talking with Kenny about data. All about the value that data can create, which is no novel idea. We've been hearing Andrew Ng say this since 2020, probably since before 2020. But I think Kenny brought a few new takes on it: that we've lost our way when it comes to high quality data and how important it is. Let's get into this conversation with him.

Demetrios [00:00:57]: So back in my day we used to call it ML. What did your girlfriend tell you?

Kenny Daniel [00:01:06]: Yeah, well, in sort of all my talking and, and writing, I just write machine learning. Machine learning instead of AI. And she's like, no, you got to put AI. Like, you know, it's, it's 2025. And yeah, she called me a boomer because I still call it ML.

Demetrios [00:01:19]: She was like, okay, boomer, get out of here.

Kenny Daniel [00:01:22]: Exactly.

Demetrios [00:01:24]: Yeah, man. I was mentioning to you that the favorite thing, the favorite way that I've heard it phrased is, yeah, machine learning, as it was called back in those days. And we just have shifted to AI. That's cool with me. I guess we'll go with AI.

Kenny Daniel [00:01:44]: Yeah, no, exactly.

Demetrios [00:01:46]: Yeah, dude. Well, let's talk data, and then let's talk UIs and interfaces, because I know those are two things that you're quite passionate about these days. But data, you were mentioning you were just at NeurIPS and you couldn't find any talks that were data related.

Kenny Daniel [00:02:07]: Yeah. So I don't think it's a controversial statement to say that data quality is super important when it comes to LLMs and models and all that. Models in general are very sensitive to bad data. Everybody agrees on this. This is not a controversial thing. But people don't actually talk that much about data quality from what I see. I went to NeurIPS in Vancouver just a few months ago. I went looking for papers about data quality and I could not find them.

Kenny Daniel [00:02:31]: I mean, I'll caveat that with saying I found three papers and they were all in the vision space, which is cool. I'm glad that's happening. But who's talking about text datasets, you know, for LLM quality? And the fact of the matter is you can kind of understand why the Googles, the OpenAIs, the Anthropics of the world don't talk about it. They very much consider that their secret sauce. Because really, the only thing that distinguishes, you know, Anthropic's models from OpenAI's models and things like that is the data that they're training on. Right. Everybody's using Nvidia hardware. There's no hardware, you know, arbitrage there.

Kenny Daniel [00:03:08]: Everybody's using transformers, the same sort of gradient descent training methods. You know, there's obviously optimizations and tricks and things that are used, but ultimately the data is what defines the model. And so they don't want to talk about that. But why isn't academia, why aren't researchers, why aren't, you know, hackers in their garage talking about this? This isn't something that needs, you know, multi-million dollar budgets to make progress on. So that's kind of what I'm interested in right now.

Demetrios [00:03:35]: Yeah, it's funny how you say that nobody's talking about it. Even the Llama papers, it's not like they do a full shout out on what exactly they did with the data and how they cleaned it. I mean, you'll get some stuff. And I haven't gone through the DeepSeek paper that just came out, but I feel like they're a bit more transparent. I wonder if the reason they don't even want to touch the data in these papers is because of the potential lawsuits that can come with it. When they're like, yeah, we used this dataset and we also used this one. If they start saying which datasets they used and how they cleaned them and whatnot, somebody from the New York Times is going to be like, you did what with my data? I think you're going to owe us for that one.

Kenny Daniel [00:04:21]: Yeah, well, no, I mean, and that's fair. And look, I don't even necessarily expect them to talk about the sources of their data. Right? You know, things to never ask a CTO: their age, or where they get their data. But that's not even what I'm talking about. Right. I don't care if you're scraping it from YouTube or Reddit or whatever. Right.

Kenny Daniel [00:04:42]: They don't even talk about their pipelines. How are they cleaning and filtering the data? But you're exactly right about the Llama paper, the DeepSeek papers. You know, these are some of the only glimpses that we get into what these people are doing. And it's still less than a page of the total paper that's actually talking about their data. But you do get these glimpses, and the DeepSeek paper is great. They talk a lot about, you know, the proportions of data that they selected. Right. You know, so it's not just about taking the entire data from the Internet and just throwing it into an LLM.

Kenny Daniel [00:05:12]: You also want to, you know, choose. You know, DeepSeek in particular did, and I forget the exact numbers, but something like 60% code, 20% math, 20% English, 10% Chinese, whatever it is. Right. Whereas Llama, they also talk about kind of their data distributions: they went heavier on English, they still had a whole bunch of code. But you know, these are some of the decisions on the data that you're making to go into these models. And I do wish that it was talked about more.

Demetrios [00:05:42]: Yeah, what are some other vectors that you want to see data being talked about? Because there's so much stuff that we could talk about. It does kind of baffle me that we're not talking about it as much. And maybe that was the 2020-2022 era, when Andrew Ng was talking about that and it blew up, and so now people are like, yeah, we know data's important, so let's just move past that. But it does feel like, even though we know it matters, we haven't operationalized it like you said, and we haven't figured out all the knowledge sharing and the learnings as we continue to advance.

Kenny Daniel [00:06:22]: Yeah, I mean people might say they understand it, and yeah, I appreciated the era of sort of data-driven AI and all that. People say that, but then they don't follow through on it. You know, who's talking about that? I mean, there's obviously people out there making great datasets. There's obviously people working on this. But I think that this is underappreciated still, even though it's widely accepted. And I do think that there's a couple reasons. Well, for one thing, these datasets are massive, right? You know, you go to Hugging Face and there's terabyte-sized datasets that are, you know, web dumps and GitHub and all these things.

Kenny Daniel [00:06:55]: Right. And so if you're a data scientist or an ML engineer or AI engineer, how do you even approach that? Right. Like how are you going to get familiar with the data that's in these, you know, multi-gigabyte datasets? I mean, it just seems like a complete non-starter. So I think a lot of people don't; they just throw it at the models and, you know, hope that it's going to work. I do think though that there's some new techniques and some new things that could be changing this. And Llama is actually a really good example of this, because they talked about in the Llama 3 paper, they used model-based filtering on their datasets. So this is cool. So what they did was, and I was talking with one of the folks on the Llama team about this, they had this whole pile of data, I don't know how big exactly.

Kenny Daniel [00:07:41]: They didn't train on that whole pile that they theoretically had available to train on. They actually trained on less than 10% of that total pile. So they had to take this huge pile and whittle it down to just the 10% that they wanted, which is a huge amount of data to cut out.

Demetrios [00:07:55]: Yeah.

Kenny Daniel [00:07:56]: And so there's a question of how you do that. And they used Llama 2 to basically do a forward pass across their dataset to label it, classify it, quality-filter it, and use the models to reflect back on their own training set essentially, in order to produce the next-generation model.

Demetrios [00:08:19]: So it was almost like llama2 knew in a way what it had seen before. And so if there was novel stuff, it would say, hey, this is good for the next iteration.

Kenny Daniel [00:08:30]: Yeah, I mean, yeah, there's a bunch of different approaches you can take about, you know, using the model to just classify things so you can choose proportions, using it to rate quality and things like that. You can also use the model itself to say, you know, what can it predict, what can it model correctly and what can't it? And that might also give you insight into the data. And this idea of sort of dataset-scale inference I think is really interesting, because for the first time we're kind of entering an era where it's feasible. Even just a year ago, if you looked at what it would take to run a state-of-the-art sort of model against a trillion tokens of an input dataset, and if I'm remembering correctly it would have been GPT-4, over a trillion input tokens would cost about $5 million. Yeah, it's pretty expensive. But then you fast forward a year.

Demetrios [00:09:27]: You gotta be pretty confident in that investment.

Kenny Daniel [00:09:30]: Yeah, yeah, no, exactly.

Demetrios [00:09:31]: I know.

Kenny Daniel [00:09:31]: You got the prompt wrong, you gotta do it again. Yes. But now if you look at 4o mini, it's much cheaper. You can do batch mode. So you can process a trillion input tokens for $50,000.

Demetrios [00:09:47]: Oh, wow. Yeah.

Kenny Daniel [00:09:48]: Which again is maybe not, like, you know, pocket change, but it's actually accessible at the scale of the projects that many enterprises, or even, you know, small-scale projects, could afford. And that trend is only continuing.
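
The model-based filtering Kenny describes from the Llama 3 paper boils down to running a cheap classifier pass over every document and keeping only the highest-quality slice. A minimal sketch of that idea is below; the model name, prompt, quality threshold, and per-token price are illustrative assumptions, not the recipe Meta actually used.

```python
# Sketch of model-based quality filtering: use a cheap model to score every
# document, keep only the top slice. Prompt, model, threshold, and price are
# placeholders, not Meta's actual pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def quality_score(doc: str) -> int:
    """Ask a small model to rate a document 0-10 for training usefulness."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for whatever cheap, batch-friendly model you use
        messages=[
            {"role": "system", "content": "Rate the following text 0-10 for usefulness as "
                                          "LLM training data. Reply with a single integer."},
            {"role": "user", "content": doc[:4000]},  # truncate long documents for the rating pass
        ],
    )
    return int(resp.choices[0].message.content.strip())

def filter_corpus(docs: list[str], keep_above: int = 7) -> list[str]:
    """Keep only the documents the model rates at or above the threshold."""
    return [d for d in docs if quality_score(d) >= keep_above]

# Back-of-envelope cost for the "trillion tokens" discussion above.
# cost = tokens / 1e6 * price_per_million; the price here is a rough placeholder.
total_tokens = 1e12
price_per_million = 0.05  # illustrative batch price in dollars, not a quoted rate
print(f"~${total_tokens / 1e6 * price_per_million:,.0f} for one filtering pass")
```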

Demetrios [00:09:59]: Yeah. One thing that I remember when we had the Databricks team on who trained DBRX, they were talking about how hard it was to know if the data was clean and proper. And one of the hard problems that still sticks with me to this day is in PDFs: you'll ingest PDFs and then the text will reference a table. It will say see Table 1 or see Table 3.2. And then you don't know how to attribute Table 3.2, because sometimes it would just get cut off and you wouldn't have the part saying this table is actually Table 3.2 and it deals with this reference. And so that was a really challenging problem that they faced continuously: being able to grab the tables, put them in a clean format, and then also have the text and that unstructured data around it that was referencing and gathering those insights from that table.

Kenny Daniel [00:11:03]: Yeah, yeah, totally. So I think you're making a point, which something I like to kind of harp on. It's a little aside, but this is the sort of insight that you only get from looking at your data. Right. And in the context of the model. Because, you know, people take web data, people take whatever data or take PDFs and they use some package to convert those into just basic text so they can be turned into tokens. But sometimes that very much messes up the data or it misses like entire key sections of it or it's out of order or things like that. And so I do think it's really critical to, and this is something I also harp on is looking at your data and getting familiar with it.

Kenny Daniel [00:11:45]: This is a thing that used to be table stakes in data science. It is data science. Yeah, just look at your data.

Demetrios [00:11:52]: Yeah. And the best data scientists were very familiar with the data because they were creating features that would mean the difference between a high performing model and a low performing model.

Kenny Daniel [00:12:03]: Yeah, totally. And so I assert a little bit that this is kind of going by the wayside. And I think it's partly because the datasets have gotten so huge. The models have gotten amazing at, you know, dealing with just throwing a bunch of data at them. But it can be so much better if you pay attention to these things. You know, looking at how it's serializing tables out of the papers, is it parsing HTML correctly or in a way that makes sense? You know, these are very, very common issues. And, you know, from my time as a data scientist, you know, as a boomer, that was a loop that I would do constantly. Right? Yeah.

Kenny Daniel [00:12:39]: I would take a pile of data, I would build a model, and I would look at, okay, great, which of this data, whether it's in my test or training or whatever set, is classifying correctly and which of it is classifying wrong. And if it's classifying wrong, look at it, look at the examples, start bucketing those into error classes and start to understand. I don't think I've ever looked at my data, in the context of the loss or the classifications, and not gotten some insight that ultimately led me to making a better model.

Demetrios [00:13:14]: Dude, yes. That is so good. Never looking at your data and feeling like that was a waste of time.

Kenny Daniel [00:13:23]: Yeah, no, like it just doesn't happen. So there's a tweet by Andrej Karpathy which I really like, and in a certain sense, my new company that I'm starting, Hyperparam, is kind of entirely based off of this one-sentence Andrej Karpathy tweet.

Demetrios [00:13:39]: Everything, or you already doing it and then he just rode your way?

Kenny Daniel [00:13:42]: No, I mean, it was the inspiration. I mean, it's what I said, though. It is kind of the same pattern. It just kind of highlighted something that I've kind of done intuitively for years. And so I'll try not to butcher the quote, but it basically goes, if you take a data set and you sort it by loss and you look at the extreme values, you're guaranteed to find something interesting and useful, which is basically the same thing. And there was actually a reply to that tweet, which I liked, which was, fun fact, if you take your data set and you sort it by almost any metric and look at the extreme values, you're guaranteed to find something interesting and useful.
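
The loop in that tweet is easy to make concrete: compute a per-example loss (or any other metric), sort by it, and read the extremes. A minimal sketch with pandas; the file name and the text and loss column names are hypothetical.

```python
# Sketch of the "sort by loss, look at the extremes" loop from the Karpathy tweet.
# Assumes per-example losses were already logged to a Parquet file with
# hypothetical "text" and "loss" columns.
import pandas as pd

df = pd.read_parquet("train_with_losses.parquet")

# Highest-loss rows: often mislabeled, garbled, or genuinely hard examples.
worst = df.sort_values("loss", ascending=False).head(20)

# Lowest-loss rows: often duplicates, leaked answers, or trivially easy examples.
best = df.sort_values("loss", ascending=True).head(20)

for _, row in worst.iterrows():
    print(f"loss={row['loss']:.3f}")
    print(row["text"][:300])
    print("-" * 40)
```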

Demetrios [00:14:19]: Fun fact, if you take your dataset and you just look at it, you're almost guaranteed to find something, if you spend enough time just toying around with it. That is so true.

Kenny Daniel [00:14:30]: Yeah. And so, you know, I kind of started asking the question of, you know, how can we do this at a larger scale? How can we make this best practice apply to the world of sort of large datasets and modern LLMs? And I think that's a really good way to, you know, make progress on these and make better datasets in order to make better models.

Demetrios [00:14:48]: Well, so the first thing that comes to mind obviously with me is I've been talking to a lot of people and it feels like fine tuning is something that is not as in vogue these days as it was because of the risk reward or the effort to pay off type of scenario that you get yourself into. And you can get a lot of lift just with one shot or zero shot prompting.

Kenny Daniel [00:15:19]: Right.

Demetrios [00:15:20]: And just prompt tuning. And that's just for fine-tuning, which is a very low bar compared to if I want to go and train a foundational model. So how do you look at it, like you're creating something that is for the cream of the crop type of person? I can count the number of companies that want to train a foundational model almost on my two hands.

Kenny Daniel [00:15:50]: Yeah. So I think that you're getting at a couple things there and I have thoughts on many of them. The first thing I'd say though is data is going to be important at every level of sort of the ML pipeline. Right. Whether you're doing pre-training, it's still important; you can look at the DeepSeek and Llama papers. If you're doing fine-tuning, it's clearly very important, and you know, over time you just need fewer and fewer examples as we get better at this. But I will argue that fine-tuning is still important.

Kenny Daniel [00:16:16]: Although I will say, if you can get away with not fine-tuning, you probably should. Right. If you can prompt your way or RAG your way into a solution and not have to fine-tune, like, I agree, if I went into a company I would not recommend starting with fine-tuning. But there are situations where there's just no better alternative today. And there's a couple of things I would say on that. I mean, there's certainly the instruction tuning and the supervised fine-tuning and all the post-training stuff. The R1 paper that came out yesterday talks a lot about really interesting methods. And what I think that partly shows is that the knowledge, and even the reasoning and the connections and all that, are already in the base weights after pre-training.

Kenny Daniel [00:17:00]: We just need to figure out how to get them out in the most effective way. And fine tuning can help with that for sure.

Demetrios [00:17:09]: Oh, that's very cool to look at. It's that it's latent, it's in latent space. We just have to get the right combination to the lock.

Kenny Daniel [00:17:19]: Yeah, yeah, exactly. I mean, yeah, going from, you know, the first person who told the model to think step by step, to now we're, you know, training that in, baking it in so that it just does that on its own. So there's things like that if you're one of the bigger labs doing fundamental research on this. But then there's other classes where I think fine-tuning is still useful and probably will be for a while. And you could loosely or broadly call this style control. You certainly see this with the image-gen models. Frequently, if you want to generate a certain style of cartoon or whatever, you fine-tune it on examples and now you can just do inference to generate things of that style. But this applies to text too. And so if you're building a chatbot, you know, for your company and you want it to have a particular tone, a particular style of response, you want it to handle different situations in certain ways.

Kenny Daniel [00:18:16]: Some of that you can put in the prompt. But honestly some things, like style especially, are really hard to express with prompting, but they are easier to express with data. So you can use data to express the will of the user.
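
One way to read "expressing style with data rather than the prompt" is to curate a small supervised fine-tuning set whose responses already carry the tone you want. A hypothetical sketch of assembling such a file; the records are invented and the chat-style JSONL layout is just one common convention, not a claim about any particular vendor's schema.

```python
# Hypothetical sketch: encoding a response style as fine-tuning data instead of
# trying to pin the style down in a prompt. The example records are made up.
import json

examples = [
    {
        "messages": [
            {"role": "user", "content": "Can I return these shoes after 30 days?"},
            {"role": "assistant", "content": "Short answer: yes, up to 60 days. "
                                             "Pop them in any box, slap on the prepaid label, done."},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "My order hasn't shipped yet."},
            {"role": "assistant", "content": "Ugh, the waiting is the worst. Send me your order "
                                             "number and I'll chase it down right now."},
        ]
    },
]

# Write a chat-style JSONL file that a fine-tuning job could consume.
with open("style_tuning.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```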

Demetrios [00:18:33]: Ooh, interesting to think about that. That it's easier to get to that end goal through expressing it in data, in a highly curated dataset that you're trying to go for, as opposed to trying to prompt your way into the solution. Yeah, okay. All right, so that's a novel idea I haven't heard much. But there is another piece, going back to the original question, right? I can count the number of companies that are training foundational models on my hand, but I think you had mentioned to me before that there's another aspect of that, which is if we make it easier for advanced experts in whatever field they're in to label the data and make it high quality data, then we will find more uses. And it doesn't necessarily need to be with training a foundational model, it can be with other things. Is that your reasoning, or am I remembering what you told me correctly?

Kenny Daniel [00:19:46]: Sort of, yeah. Sorry, maybe clarify kind of what you're asking.

Demetrios [00:19:50]: So I know that you had mentioned, I think we used the example of a radiologist, that if you are able to have a radiologist label a lot of image data, high quality image data, faster, then you're going to be able to do more and create better models with that data. And it doesn't necessarily need to be a foundational model; it can be that fine-tuned scenario like you were saying.

Kenny Daniel [00:20:21]: Yeah, yeah. So, okay, now I follow where you're going with this, right? So there's the question of, you know, expertise and sort of pulling that out of the user and the model builder in a certain sense. And historically, you know, there's various services that you could use. There's things like the Scale AIs of the world, a million other labeling companies that are out there, which was incredibly valuable, especially in the early days of doing these models, right? Because they would go, you know, label self-driving data, label text for harmful content, things like that, annotate data in general. But it's one thing if these are tasks that, you know, kind of any human could do, right? You know, identifying a stop sign, great. But yeah, to your point, what if this is specialized? What if this is medical data? What if it's finance, what if it's legal, what if it's sales in some particular domain? And then there's the question of, you probably can't outsource that. And I know that there's companies that are definitely trying to hire more experts in-house, or become experts at, you know, connecting companies that need data with the relevant experts. But I think personally that's probably the wrong way to go about it.

Kenny Daniel [00:21:33]: The benefit that you get from something like Scale AI is you get the scale of having all these humans. You can bring them in and swap them out as needed. But if you need a team of expert radiologists, I don't see how Scale AI can do that and then keep them on staff, unless it's just on demand and then get rid of them. At which point, why not do it in-house? And then also, you know, more and more the chatbot is becoming the user interface to a lot of products. Right? I mean, that's certainly true with ChatGPT, but there's a million of them, like it or not.

Demetrios [00:22:07]: Yes, that is very true.

Kenny Daniel [00:22:09]: Yeah. And so how do you control the interface of a chatbot? And it's kind of through this data. It's through fine-tuning, it's through style control, it's to some extent through prompting and RAG and the data that you make available to it. But generally speaking, you know, at a company, it would be kind of crazy to outsource your core user interface to, you know, unskilled labor, right? That's something you want to own as a company: the interface that your users are experiencing and how they do that. And in the world of machine learning and AI, you do that through controlling the data, which controls, you know, how the model behaves. So I'm a big advocate for trying to move that in-house as much as possible. So, just as a concrete example, even before I started my current company, I really like JavaScript. I'm sure we'll come back to this, revisit this later in the talk.

Kenny Daniel [00:23:02]: I really like JavaScript because I believe deeply in the importance of user interfaces and good user interfaces. And if you want to build good user interfaces, it kind of has to be in the browser, and so it kind of has to be written in JavaScript. Okay, we'll come back to that. So I wanted to build the world's best JavaScript-generating code model. This is still kind of an ongoing project of mine, just for fun. And so, all right, you know, I went and looked at Hugging Face, I looked for datasets. The StarCoder data was a good one that I found, which is basically just a dump of GitHub from some point, and I started, you know, filtering it down to sort of, you know, the JavaScript code. And I wanted to start training, fine-tuning.

Kenny Daniel [00:23:47]: At the time it would have been Llama 2. Now I'd probably be going Llama 3. I wanted to start, you know, fine-tuning on this, but I wanted to first look at the data and get a sense of what was in the data. Because, you know, I am a JavaScript expert, right? And so I should be the one sitting there and determining: is this code too short? Is this junk? Is this just comments? Is this just something that shouldn't even be there? And I actually found it surprisingly hard to even look at the data using sort of common contemporary data science tools.

Demetrios [00:24:19]: Oh, wow. Because it was too much data.

Kenny Daniel [00:24:22]: Yeah, I mean, there's a huge amount of it. And I'll go on a little bit of a rant here. I would actually argue that Python is part of the problem.

Demetrios [00:24:31]: Oh, what?

Kenny Daniel [00:24:33]: Shots fired. So machine learning and AI, like, lives in this Python world and Python is great. I mean, there's many reasons why that's the case. Right. I've written unimaginable amounts of Python code in my life, but Python is also arguably the worst language you could choose when it comes to building a good user interface.

Demetrios [00:24:53]: Wow.

Kenny Daniel [00:24:54]: And so I actually think that, like, people in the world of machine learning and AI don't even appreciate how bad they've had it when it comes to user interfaces.

Demetrios [00:25:03]: I mean, we got Streamlit, right?

Kenny Daniel [00:25:06]: You're making my point for me exactly. Um, yeah, well, no, you're exactly right, though. So the two probably most popular ways to build a UI or interface with Python are, for one, Streamlit, which again is great for building demos and, you know, just mapping directly from those sort of Python-native objects to something you can display on a screen. But it's so painful when you start trying to create interactive websites. If the state is updating, a lot of things jump around. If you ever try to make a chat interface with Streamlit, it's a nightmare.

Kenny Daniel [00:25:40]: Like trying to get it to actually just jump down to the last message and not, you know, blink the screen every time the state changes. I could go off on this for a while, but okay. So there's Streamlit on one side, and at least it's customizable. And then you've got notebooks. Jupyter notebooks are great; I'm not hating on Jupyter notebooks at all, but they serve a particular purpose and they're not as general as you might want.

Demetrios [00:26:06]: Yeah. And it's not for user interfaces.

Kenny Daniel [00:26:10]: Yeah, fundamentally not. But it does bring Python stuff into the browser. So that's good. And that's where everybody lives, right, as a, you know, data engineer. So I downloaded, you know, a Parquet file from Hugging Face. Just an aside, but two-thirds of all the popular big datasets on Hugging Face are natively in Parquet format. And the remaining one-third get automatically converted to Parquet by Hugging Face.

Kenny Daniel [00:26:37]: Right. Like they automatically do this. And I had worked with Parquet before, but Parquet is a very cool format. I think we'll revisit that a bit. But this was kind of my first real experience with Parquet: downloading the StarCoder data and just trying to look at it. And you pull it into a notebook, you do, you know, import pandas, pandas.read_parquet, and you give it the file and it prints out this table.

Kenny Daniel [00:27:00]: And the table is not interactive. You can't even paginate through it, you can't click to the next rows. And in the case of the dataset that I was working with, each cell is an entire source file, but it doesn't display it as a whole source file. It just displays it as a tiny little box that you can't actually see.

Demetrios [00:27:20]: So what a headache.

Kenny Daniel [00:27:21]: I very quickly got frustrated that, like, I can't even see my data, the most basic, you know, first step of good data science. And so that kind of is what started me down the road that I'm on, of trying to figure out, you know, how can we build better interfaces to benefit AI and machine learning people.
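
For reference, the notebook pattern Kenny is describing looks roughly like this. The file name is a placeholder, and the point is that the default display truncates each cell, so to actually read one source file you end up printing cells out by hand.

```python
# The notebook pattern Kenny describes: read a Parquet dump and try to look at it.
# The file name is a placeholder for something like the StarCoder JavaScript subset.
import pandas as pd

df = pd.read_parquet("starcoder_js.parquet")
df.head()  # static, truncated table: each cell is a whole source file squeezed into a tiny box

# To actually read an example you end up doing this manually, row by row.
pd.set_option("display.max_colwidth", None)
print(df.iloc[0]["content"])  # assumes the source code lives in a "content" column
```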

Demetrios [00:27:40]: Yeah. And I think the key takeaway that I'm hearing is there's a lot of Python that's out there. When you wanted to create your JavaScript model, you had a hard time just weeding out the Python and making it a JavaScript-native type of thing. Is that also another piece?

Kenny Daniel [00:28:04]: Yeah, yeah, absolutely.

Demetrios [00:28:07]: Let's also talk about, because we are on the topic of UIs, the way that you envision getting more out of your data from experts, but also from folks that are going to be dealing with these large datasets. Because it's almost like you have two users of a platform, or maybe it's two separate platforms, I don't know how you envision it. One is the labeling, the expert labeling platform, so that if I'm a radiologist or a lawyer, I can label quickly and faster and use LLMs to help supercharge that workflow. And then on the other side, it is someone that's in your position that is looking at this gigantic dataset and you want to know: where are there holes in the dataset? Where can I make this dataset more robust, or whatever it may be, to figure out the best way to get the highest quality from that data?

Kenny Daniel [00:29:08]: Yeah. So I think that the persona sort of question is an interesting one. Ultimately, I do think there will always be some specialization of, you know, the people focused on the model building and then the people with the expertise. But I think that's also a little bit of a false dichotomy, and that the more you can get the real expert sort of doing the work and not just throwing it over the wall, the better.

Demetrios [00:29:30]: Oh, yeah.

Kenny Daniel [00:29:31]: And I think this is what a good user interface can actually enable. So one example I might go to is Tableau. When Tableau came along, it didn't really enable people to do anything that couldn't theoretically have been done before. Right. You know, a good data scientist could look at the data, they could come up with queries, they could build a dashboard. In a certain sense, what Tableau changed is that, you know, a manager or an executive could query it, do it themselves. They could play with it, they could explore the data, get insights using things that are only kind of in their head, and do it themselves. And this enabled entirely new use cases.

Kenny Daniel [00:30:15]: Not that they couldn't necessarily have delegated that to somebody, but just for like little explorations or just curiosities, you're probably not going to, you know, delegate someone to go do that.

Demetrios [00:30:24]: Yeah.

Kenny Daniel [00:30:25]: And so that's kind of what I would love to do here: if you make the interface really easy and you make it so that, you know, one person doesn't feel so daunted by the massive scale of this data, they can actually start looking at the data, getting insights. Whether you're the radiologist who maybe doesn't understand the first thing about linear algebra, you do understand radiology. You know if an X-ray is misclassified, and maybe if you see that it's misclassified, you can look at the image and kind of understand why. Like, oh, it's picking up this, or it's picking up the text from the edge of it. And then you can actually start to make data fixes against that.

Demetrios [00:31:05]: Yeah. And recognize, I think I've seen a few companies generate almost like 3D maps of the data, like you see sometimes with embeddings, and it will show you where you have a lot of data points or a strong set of data, and then where you have the outliers and a bit of edge cases, or where you just don't have as many examples. And if we're talking about the radiology example, it might be that there's a certain kind of cancer that you have very well documented, but then other types of illnesses you don't have as well documented.

Kenny Daniel [00:31:46]: Yeah. So one thing I think you're getting at a little bit is, I really like this area of research, and I just blanked on the name, it's like semi-supervised learning, there's another word for it I'm blanking on at the moment. But basically this idea, especially in classification, that there's easy cases and then there's sort of the boundary cases that have lower confidence, and you shouldn't be wasting your time looking at the easy cases. And this also applies to these massive datasets. Right.

Kenny Daniel [00:32:16]: And this is where I think that the models can come in and help: there's a whole bunch of cases where it's easy. Right. Like, okay, this classified, you know, perfectly with high confidence. Okay. Probably don't send a human to look at that. Right. Look at the edge cases, look at the boundary cases where it's low confidence or the model got it wrong but maybe it shouldn't have. And that's where I think you can again cut down this massive scale of the data to something that is much more manageable.
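
A small sketch of that triage idea: score every row with the model, skip the confident ones, and queue only the boundary cases for a human. The column names and the confidence threshold are assumptions for illustration.

```python
# Sketch of confidence-based triage: only send low-confidence, boundary cases to a human.
# Assumes rows were already scored by a model; column names and threshold are made up.
import pandas as pd

df = pd.read_parquet("scored_dataset.parquet")  # hypothetical file of model-scored rows

# Confidence of the predicted class for a binary classifier; rows near 0.5 are boundary cases.
df["confidence"] = df["p_positive"].apply(lambda p: max(p, 1 - p))

review_queue = (
    df[df["confidence"] < 0.7]   # skip the easy, high-confidence rows
    .sort_values("confidence")   # most ambiguous examples first
    .head(500)                   # cap what a human is asked to look at
)
print(f"{len(review_queue)} of {len(df)} rows queued for expert review")
```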

Demetrios [00:32:43]: And then you would. I do also like the idea of bridging the gap between the modeler and the expert and allowing the expert to come in and look at what is going on with the data set and say, do we have any examples of XYZ or why? What's going on over here? Oh, I can see that we're misclassifying a certain topic or disease or whatever.

Kenny Daniel [00:33:11]: Yeah, yeah. By the way, active learning was the word.

Demetrios [00:33:14]: There it is. All right, cool.

Kenny Daniel [00:33:16]: But yeah, so I mean, I think what you're also getting at is sort of another interesting direction that I think things are going to go, which is using the models to assist the user, but not necessarily even just as a "classify this row as X"; in many ways actually acting as the user. You can think of this as an agent-style approach to things. A human data scientist can do this: I learned this in school, I've taught data scientists on my teams this repeatable process of get the data, build the model, look at where it's doing well, and iterate to get to the best models. In a certain sense, this is science, not art, or at least it's a teachable, repeatable process.

Demetrios [00:34:04]: Yeah.

Kenny Daniel [00:34:05]: And so if you can teach this to a person, why couldn't you teach this to a model?

Demetrios [00:34:13]: Interesting.

Kenny Daniel [00:34:14]: And so I'm really interested in that question of. Yeah, how can you automate more of the work of an ML engineer? Can you have the models look at the data much like a person might and try to form hypotheses about, you know, where is it misclassified? What are the buckets of errors that are happening and you know, which data is relevant to those so that we can actually start to try to fix those.

Demetrios [00:34:38]: Yeah. So it's not only having the model look at each individual data point, but looking at the data as a whole data set and saying, oh, we might want to figure out what's going on over here.

Kenny Daniel [00:34:50]: Yeah, yeah, exactly. I definitely think we're going to see tools like that coming out and being used more and more and ultimately leading to creating sort of, you know, the next great data sets that are out there.

Demetrios [00:35:01]: Wow. Yeah. And so then how does the work of creating a brilliant interface intertwine with this? And also, I guess in this case, if it's a model, you don't necessarily need data visualization. Or maybe you do; maybe there's different modalities that you're giving to the model so that it can look at these datasets differently and get different insights on them. I don't know. I want to know what your vision is on the actual interface, and then how you can have a model sync up with that interface.

Kenny Daniel [00:35:40]: Yeah. So in a certain sense you could imagine these being sort of two unrelated things: one is the UI rant, and then one is the, you know, use models to assist with the data curation. I think that they work together really well, though. And I think that if you take just one approach or the other, it's going to fail. And there are companies out there trying to do things like purely model-driven dataset improvement, where, you know, literally you send them a dataset in whatever CSV or Parquet format, they do some magic and they send you back an improved dataset. But what that's missing is having, you know, the expert in the loop, having the human in the loop who actually understands what they want out of this model. And the only way you can bring them into the loop is a user interface, because a user interface is for people, it's for humans.

Kenny Daniel [00:36:27]: Right. It doesn't really help the models. It is the way to keep the human in the loop guiding this model-building process. And that's key, because we've gotten very good at training models. They can learn almost anything that we can define an eval for. Right. And so I really like this question of evals. But in a certain sense, what is an eval but trying to pull out of people the description of what they want the model to do?

Demetrios [00:36:58]: Yeah.

Kenny Daniel [00:36:59]: And so whether it's creating evals or datasets, I think these are sort of opposite sides of the same coin, where it's expressing the user's intent via data.

Demetrios [00:37:09]: Uh huh. And then just iterating overall, iterating over the outcomes until you're getting closer and closer and passing these evals. Yeah, huh.

Kenny Daniel [00:37:21]: Yeah.

Demetrios [00:37:22]: Okay.

Kenny Daniel [00:37:22]: And so that's why you need the interface. But then on the flip side, you know, why do you still need the model driven approach? Well, that goes back to just the scale of the data. Right. You know, a person doesn't have enough time in their lifetime to look at a whole data set.

Demetrios [00:37:35]: A billion. Yeah, right.

Kenny Daniel [00:37:38]: And so that's where you need the models to act as a lever, and you need the UI so that the user can express their intent.

Demetrios [00:37:45]: Huh. Fascinating to think about. Yeah. Okay, I can see this. All right, well, what else you got in there? Because this is, yeah, this is blowing my mind.

Demetrios [00:37:59]: The key question that I have is: what unlocks do you see if we can get better datasets faster, bigger and better datasets faster, and that becomes a much more seamless process? What does that enable us to do?

Kenny Daniel [00:38:17]: I mean, I think that that is really, like, the key for all the progress that we're seeing in models in general. And I think it's the biggest bottleneck in a lot of ways, right? It's that data. And, you know, from an engineering point of view, if you've got a pipeline of things, right, and there's a constriction, there's a bottleneck. Right.

Kenny Daniel [00:38:38]: Like, applying engineering effort to any other parts of the pipe is not going to help. The only thing that's going to help is, you know, widening the pipe where the bottleneck is. And I would argue that's the data, and I would argue that's the evals, especially because, again, an eval, and I'm going to oversimplify a little here, but in many ways an eval is just a very clean dataset. There's some nuance there, right? But in a lot of ways that's true. And we've gotten extremely good at the models learning almost any function that we can define, which is why we're in an era right now where even just defining a good eval can make you famous.

Kenny Daniel [00:39:19]: Right. I mean, ARC-AGI as an example. I mean, François Chollet did plenty and all that, but why is it being talked about now? Because it was the hardest eval out there, and it just fell.

Demetrios [00:39:29]: Yeah, yeah.

Kenny Daniel [00:39:32]: And so the question of, yeah, how do you make better evals? How do you express what you want these models to do? Every time I hear somebody say, oh, I tried, you know, the new o1, I gave it a problem, and it failed, that actually makes me excited. And the question that I go to in my head is, how would you build an eval for that failure? Because if you can define an eval, I would bet you anything that within 12 months that eval will fall.

Demetrios [00:39:55]: It'll be passing it, yeah. That is really cool to think about: how, when you know the problem, it is pretty inevitable that we can supersede the problem within a certain amount of time. You just have to know where it's failing. Okay, so going back to this bottleneck piece, which is another fun thing, and I can tell that you are and have been a founder, because that's also true in organizations, right? You've got to optimize and you have to focus on where the bottleneck is. Because if you optimize any other piece, it just doesn't make sense.

Kenny Daniel [00:40:39]: Yeah, it just backs up more. Yeah, exactly.

Demetrios [00:40:44]: So then, looking at the data and looking at the evals, I mean, are you thinking about, in this UI, that you also are going to have ways to create evals? Because evals feel like the other side of the spectrum. It's after the model is trained and after everything has happened that you're doing evals. But the datasets are over here, very upstream, right?

Kenny Daniel [00:41:10]: Yeah. And you know, honestly, like it's, it's a very early stage startup at this point. I'm a solo founder, so I don't necessarily know what direction it's going to go. I think evals are going to be a big part of the story. I also think, you know, training sets are going to be a big part of the story. I mean, one of the nice things about data is that it does apply kind of across the board. Right. And if you make data improvements, it's also more robust to changes in the technology space.

Kenny Daniel [00:41:35]: Right. If I spend effort and time improving my dataset, that's going to pay dividends for years to come, even when we go from Llama 3 to Llama 4 to Llama 5 or whatever comes out. If you have better data, you can now go fine-tune that on the latest and greatest model. If you're spending that time instead on buying more Nvidia GPUs, or hiring researchers to improve the architecture, or manual feature selection sort of things, and then a new model comes out, all that other work gets thrown away, but the data persists.

Demetrios [00:42:08]: Yeah, yeah. Someone was telling me just yesterday that they spent so much time doing all this hacky stuff in the beginning when they were working with GPT-3.5, just because they were trying to get longer context windows, for example. And then new models came out and they had longer context windows built in. And it was like, all that work that I did to get that longer context window went out the window. And I've heard people talk about this as, you want to be thinking a little bit forward-compatible: how can I do things that are going to continue to pay dividends, like you said? And I don't see a world where making your data better in every way possible is ever going to be like, all right, well, the new model came out and now it doesn't matter, all this high quality, unique, proprietary data that we have doesn't matter anymore because the model can do everything.

Kenny Daniel [00:43:19]: Exactly. Yeah. No, but nobody's ever regretted that. And I think there's a more general principle, which in a certain sense sounds really obvious, but don't build a company betting against the models getting better. Right. Which again, sounds obvious, but people do that. People are like, oh, well, sure, maybe the models are going to solve whatever it is, SQL generation or whatever it is, but today they're not. So if we make just a marginal improvement on the current state of the art on a narrow thing, like, great, but like, you have a very narrow time window, where that's going to be true would be my bet.

Kenny Daniel [00:43:55]: But if you build things, if you build, you know, scaffolding, I mean, in my case, if you build a strong UI, I can swap out the model underneath it in a moment, and the UI is still valuable. That's my personal, you know, bias. But I think that there's a lot of examples out there of companies where you can build something that's still useful today but only gets more useful as the models get better.

Demetrios [00:44:15]: But wouldn't the end state be, for me as a company that is using a model, is there a moment in time where the model's so good? Like you said, don't bet against the models getting better. The model's so good that I don't need my data anymore. So that kind of assumption that we're making right now, that the investment in our data is always going to pay dividends, could be the wrong assumption.

Kenny Daniel [00:44:46]: But it's still not going to be an absolute everything model. Right. It might solve generic problems, but is it going to know about your business or your specific domain or things that just don't exist on the broader Internet for it to train on? I don't think that's going to go away. And then, I mean, you look at even particular enterprises, things like building enterprise BI tools or process automation tools or things like that. You could take the smartest person in the world, and unless they have the context of what the business does, what these tables mean, what this data means, what revenue is, it's not about intelligence. You need access to this data, and the cleaner and better formatted and structured it is, with the junk removed, the better, whether it's a person or a model looking at it, even the smartest model on earth. Right? You still need that.

Demetrios [00:45:41]: Yeah, that is a great point. And that is actually one of the reasons, when I was talking to a guy earlier this morning, he was saying that he built this agent that was working really well on Jira. It was like a Jira agent that they built for the company. And when they built it in their hackathon, it was working brilliantly. And then they plugged it into their real Jira instance and it was horrible. And the reason for that was because everybody on the Jira boards would write the least amount of context to get their ideas across to the other human. And the other human's working on the project; they've been steeped in the project for the last two weeks.

Demetrios [00:46:27]: So they don't need a lot of context. There's a line, you know; even an "LGTM, looks good to me" is something that would maybe make the agent go, what's going on? It doesn't understand the acronyms, it doesn't understand any of this context. And so there's that piece. I don't know, it's a little bit of a sidetrack in the conversation on high quality data, but the reason I brought this up was to tie the context back to: no matter how smart you are, if you're not able to understand the context and be brought up to speed on different projects, then it doesn't matter.

Kenny Daniel [00:47:19]: Yeah, yeah, exactly. I mean, the entire field of human-computer interaction, right, is not going away. Right. Even if we had, you know, the God model that is all-intelligent, you still need to figure out how it interacts with the human systems that still exist and will exist.

Demetrios [00:47:36]: Yeah, yeah. Oh man. So what else you got before we, we get out of here?

Kenny Daniel [00:47:44]: Yeah, I mean, you know, I think we hit on a lot of the core points. One thing I will just say is that, you know, I'm building a lot of this user interface stuff as open source, so this isn't, you know, selling anything. Right. As one example, we were talking about Parquet stuff earlier. So Parquet is a really interesting format. It's a column-oriented data format that has its own index in the footer. And if your listeners ever use something like DuckDB to query, you can query against Parquet files, even up in, say, S3 or in the cloud, without having to pull the entire file down. And it's the index that enables you to do that. And so this is really cool.
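
In practice that trick looks something like the snippet below, using DuckDB's Python API. The URL and column name are placeholders; the httpfs extension is what lets DuckDB issue HTTP range reads against the file's footer index instead of downloading the whole thing.

```python
# What the DuckDB trick looks like: query a remote Parquet file without downloading it all.
# The URL and the "language" column are placeholders; httpfs enables HTTP range reads.
import duckdb

duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")

url = "https://example.com/some-dataset/train-00000-of-00010.parquet"  # placeholder URL

# Only the footer metadata and the needed column chunks get fetched, not the whole file.
duckdb.sql(f"""
    SELECT language, COUNT(*) AS n
    FROM read_parquet('{url}')
    GROUP BY language
    ORDER BY n DESC
    LIMIT 10
""").show()
```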

Kenny Daniel [00:48:22]: Like, you know, DuckDB, I love DuckDB. It's awesome, right? It's changing how people do some of this dataset exploration stuff too. Right. Love that. They have a sticker that I got at their offices the other day. It's like, my laptop is faster than your data center.

Demetrios [00:48:35]: Yeah, right.

Kenny Daniel [00:48:38]: But anyway, so they do these cool tricks with Parquet where you don't even need to download these gigabyte-sized, you know, Hugging Face things. You can just point it straight at Hugging Face. And so I started thinking about this: could you do that in the browser? And I went looking for a JavaScript Parquet library, and there were some out there, but they had all gone abandoned. Like, the one that had the most stars when I started the project was parquetjs on GitHub, and at the top of the repo it had, in big letters: this project is abandoned, let me know if you want to take it over.

Demetrios [00:49:13]: Oh, wow.

Kenny Daniel [00:49:14]: But you know, being a good engineer, rather than, you know, taking the reins of an existing project, I started from scratch writing a new one, of course.

Demetrios [00:49:23]: Yeah, there's too many opinions that they had. Yeah, no dependencies.

Kenny Daniel [00:49:27]: Exactly, yeah, zero dependencies. Just from scratch, a pure JavaScript Parquet parser. And opening the first one, opening the StarCoder data, actually only took me a couple weeks. What I didn't fully appreciate going into this project was how sprawling of a format Parquet is in order to be as efficient as it is. It has like seven different compression formats supported, it's got like eight different encodings, delta encodings and all this various stuff, which makes it efficient but is really annoying if you want to support all the Parquet files. So what started as a couple-week project quickly turned into six months to release the library, hyparquet, which I am proud to say is the most compliant Parquet parser in existence. Which is, I know, a strong statement, but there's a parquet-testing repo on GitHub by the official Parquet project, and I can open more files from it than the official Arrow implementation, than the Rust implementation, than all of these.

Kenny Daniel [00:50:27]: And it took a lot of work. But what this enables me to do is really cool tricks. Like, you can have a Parquet file up on Hugging Face and you can pull into the browser just the parts that you need, using HTTP range GET requests, so you can browse the table and just fetch rows 100 through 200 on demand without having to pull in a gigabyte of data. And so I'm using this to kind of build these UIs, but it's open source and it's out there. There's a Hugging Face Space where you can use this to browse Hugging Face datasets faster and more efficiently than you could even on their own built-in viewer. I will give a shout out though: Hugging Face noticed this work and they gave me an open source grant to continue developing this. So shout out to Hugging Face, shout out to them.
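
hyparquet itself is JavaScript, but the mechanism it leans on is plain HTTP range requests against the Parquet layout: read the footer at the end of the file to get the index, then fetch only the byte ranges you need. A rough Python illustration of that first step, with the URL as a placeholder and no real parsing of the footer metadata:

```python
# Rough illustration of the range-request mechanism described above (hyparquet itself is
# JavaScript; this only shows the HTTP side). The URL is a placeholder.
import struct
import requests

url = "https://example.com/some-dataset/train.parquet"  # placeholder

# A Parquet file ends with: [footer metadata][4-byte footer length]["PAR1" magic].
size = int(requests.head(url).headers["Content-Length"])  # assumes the server reports a length
tail = requests.get(url, headers={"Range": f"bytes={size - 8}-{size - 1}"}).content
meta_len = struct.unpack("<I", tail[:4])[0]
assert tail[4:] == b"PAR1"

# Fetch just the footer metadata. A real reader parses row-group and column-chunk offsets
# out of it, then issues further range requests for only the rows and columns it needs.
footer = requests.get(
    url, headers={"Range": f"bytes={size - 8 - meta_len}-{size - 9}"}
).content
print(f"file is {size:,} bytes; footer metadata is {meta_len:,} bytes")
```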

Demetrios [00:51:14]: Yeah, that's very cool. But it's not doing anything specific with DuckDB; it's just that that was the inspiration behind it, right?

Kenny Daniel [00:51:21]: Yes, that was just the inspiration. Right. Because people have done this kind of trick where they will literally compile DuckDB to WASM and embed DuckDB in the browser to do these kinds of tricks, which is cool, but it's like 40 megabytes of compiled stuff to pull into the browser, which is not a great experience if you're trying to have a fast-loading website.

Demetrios [00:51:43]: Or if you're trying to do anything that day.

Kenny Daniel [00:51:45]: Yeah, yeah, no, exactly. And so that was where, you know, I built this library. Minified and compressed, it's under 10 kilobytes in order to implement the full Parquet spec, and you can pull in just the data you need. And so I would love to see people using this to build, you know, compelling user interfaces for AI data.

Demetrios [00:52:05]: Yeah. Oh, that is very cool, because basically you're giving the reins to someone to be able to just explore all of the Hugging Face datasets within the browser. No need to spend the time downloading, then loading it up, exploring there, and then finding out that actually this dataset kind of sucks. And yeah, okay.

Kenny Daniel [00:52:36]: Yeah, yeah, exactly. So that's on that. And yeah, well, I think we had also kind of wanted to talk about the Algorithmia days. Right.

Demetrios [00:52:46]: Well, so I want to get to Algorithmia for sure. I was going to mention, on the topic of Parquet, there is a really good Substack that I saw probably a year ago from Jay Chia, I think, who is at Eventual Computing. They have a Parquet part one and part two, and they go into so much detail on everything about Parquet. It is the most in-depth post I've ever seen on it. And then I talked to Jay about it and he said, yeah, that took us five weeks to write. It was like they know all this stuff because they've been dealing with it.

Demetrios [00:53:29]: They're very intimate with Parquet, and they had to go back and forth with the, what is it, the Parquet foundation, or what did you say it is?

Kenny Daniel [00:53:39]: There is, I mean, it's under the Apache Software Foundation, but I think there's a group; I go to their meetings. I forget what they call the group.

Demetrios [00:53:48]: Yeah. So the fascinating part there... yeah, I can't imagine the difficulties in that. I'm trying to find the actual Substack, but I think it's called the Daft Engineering Blog and it's a few years back. I'll find it later and throw it in the... oh, here it is: "Working with the Apache Parquet file format," from July 12, 2023, in the Daft Engineering Blog. I'll throw the link in the show notes too, for anybody who wants to check it out.

Demetrios [00:54:25]: And I think you will like this if you haven't seen it before. You may have gone over it when you were doing your research.

Kenny Daniel [00:54:32]: It's entirely possible, because I've read a lot of the stuff that's out there. Ultimately, though, you know, you go to the source of truth, which is the implementations. I read the Apache Arrow source code for Parquet, I read the Rust client for Parquet, I read the DuckDB source code for Parquet, and I will say this: just shout out to DuckDB. It is the most beautifully architected, engineered codebase that I have ever worked with.

Demetrios [00:54:57]: Wow.

Kenny Daniel [00:54:58]: Arrow is like, everything is split up everywhere; I don't even think I figured out how to compile it. DuckDB has no dependencies. You just run make. You can make changes in the source and run make again. I mean, it takes a few minutes, but it's so clean and elegant. That was my primary reference for sure.

Demetrios [00:55:14]: And you were mainly working with DuckDB, right? You didn't do anything with, like, MotherDuck on top of it?

Kenny Daniel [00:55:20]: No. I mean, no. The MotherDuck people are great, and they're here in Seattle, local, but no, I was mostly just looking at the source code of DuckDB.

Demetrios [00:55:27]: Yeah, it is amazing how much traction and how much love they get. And I think, much like you, many developers have gone through that same experience and been like, I want to use DuckDB for everything now; can we just make a DuckDB that does it all? So it's cool to see that. But dude, let's talk about Algorithmia, because there is a really cool thread to pull on, which is that you were so ahead of the curve. And for those who don't know what Algorithmia is, break it down for us. What was it? It was sold in what, 2022?

Kenny Daniel [00:56:05]: I think it was 2021. It was sold to DataRobot. I know, I know, it was early, very early. So just to give a little bit of the background and the story there: I was doing my PhD in artificial intelligence at USC, and I like to joke that, you know, I was studying AI before it was cool. And it was during my PhD that, you know, sort of similar to what we were talking about, the bottleneck thing, right.

Kenny Daniel [00:56:33]: Like, right now I see the biggest bottleneck to building the best models as the data. But back then, when we founded Algorithmia in like 2014, 2015, the biggest problem was often the hosting and the serving of these models. And the models, I mean, this was pre-BERT. This was pre kind of everything, pre anything. But it was still the early days of deep learning; we were starting to see the promise of it and some of the early, early results. And in the very early days, like within the first year of Algorithmia's existence, we had serverless hosting of ML models over GPUs.

Demetrios [00:57:16]: That sounds very familiar. I don't know why, but I feel like I know a few companies doing that these days and they're just cleaning up.

Kenny Daniel [00:57:26]: Yeah, no, I mean, it makes sense, right? It's a hard engineering problem to make this work. And, you know, we were doing this before even AWS Lambda came out. So we were doing serverless hosting of these things before the concept of serverless was even really a thing. And still to this day, you can't run Lambdas over GPUs. But the reason we started down that path was we wanted to build this marketplace where developers could come, or data scientists or ML engineers, they could take their trained models, put them up as an API endpoint, set a price for that endpoint, and then people could come and call those models.

Kenny Daniel [00:58:14]: And our idea was that, you know, you put up these models and the free market would create this competition for people to create more, smarter, faster AI.

Demetrios [00:58:25]: Yeah.

Kenny Daniel [00:58:26]: And it was cool. I was really excited about it. In many ways it looks kind of like what the Hugging Face Hub looks like now. And what I will say is, I think we were too early. I mean, timing.

Demetrios [00:58:38]: I was going to say that 100%, like 2015, trying to do serverless GPUs, when the demand for serverless GPUs is happening now in 2024, it's like you were a decade too early.

Kenny Daniel [00:58:53]: Yeah, arguably. But even ignoring, you know, the practicality of it and the demand for it, the models weren't ready yet either, right? So we had people putting up models, and they'd be, you know, text-to-speech or OCR or things like this, or even more narrow things. And it would solve one problem, and it would solve it.

Kenny Daniel [00:59:13]: Okay. But it wasn't generic. So how do you have a free market when there are no, you know, substitutable goods? Right.

Demetrios [00:59:20]: And so nobody's gonna pay for a model that just does one thing for one company on their data, that type of stuff.

Kenny Daniel [00:59:27]: Exactly. Everything was too narrow, everything was too vertical-specific. And so what happened over time was that rather than being a marketplace, we saw people uploading their model, keeping it private, and then calling it themselves.

Demetrios [00:59:41]: Oh, then you're like pivot. Pivot.

Kenny Daniel [00:59:44]: Literally, though, right? Yeah. Because that's not a marketplace dynamic if there's not a two-sided thing; that's much more of an MLOps kind of thing. It's how do you host these models? How do you manage these models? And look, anybody can deploy model V1, right? It's not that hard to spin up EC2 and throw it up there.

Kenny Daniel [01:00:02]: But what happens when you come out with model V2 and you want to, you know, roll it out and deprecate the old one? What happens when you have 100 or a thousand of these models and 99 of them are never being called? Well, you want to spin those down to cold storage, but be ready to spin them up if they do get called. But then as the models got more general, probably starting with BERT especially, but obviously even more so in the modern era, every model can kind of solve every problem, to varying degrees, right? Some are going to be better than others, some are going to be faster than others, some are going to be cheaper than others.
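The "spin down to cold storage, wake on request" lifecycle Kenny describes boils down to routing logic roughly like the following. This is an illustrative sketch of the pattern, not Algorithmia's actual code; every type, function, and model name here is hypothetical, and the infrastructure calls are stubs.

```typescript
// Illustrative sketch of scale-to-zero model serving. Not Algorithmia's implementation.
type ModelState = "warm" | "cold";

interface ModelEntry {
  version: string;      // e.g. "v2" -- lets old versions be deprecated independently
  state: ModelState;
  lastCalled: number;   // epoch millis, used to decide what to evict
}

const registry = new Map<string, ModelEntry>([
  ["sentiment", { version: "v2", state: "cold", lastCalled: 0 }],
]);

// Stubs standing in for real infrastructure calls (object storage, GPU workers).
async function loadWeightsOntoGpu(id: string, version: string) { console.log(`warming ${id}:${version}`); }
async function callGpuWorker(id: string, payload: unknown) { return { id, echo: payload }; }
function releaseGpu(id: string) { console.log(`releasing GPU for ${id}`); }

async function route(modelId: string, payload: unknown) {
  const entry = registry.get(modelId);
  if (!entry) throw new Error(`unknown model: ${modelId}`);
  if (entry.state === "cold") {
    // Cold start: pull weights from cold storage onto a GPU worker, then serve.
    await loadWeightsOntoGpu(modelId, entry.version);
    entry.state = "warm";
  }
  entry.lastCalled = Date.now();
  return callGpuWorker(modelId, payload);
}

// Background sweep: the 99-out-of-100 models nobody is calling go back to cold storage.
function sweep(idleMs: number) {
  const now = Date.now();
  for (const [id, entry] of registry) {
    if (entry.state === "warm" && now - entry.lastCalled > idleMs) {
      entry.state = "cold";
      releaseGpu(id);
    }
  }
}

route("sentiment", { text: "hello" }).then(() => sweep(60_000));
```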

Kenny Daniel [01:00:41]: But broadly speaking, you can take any given problem and throw it at GPT-3.5, you could throw it at GPT-4, you could throw it at Llama, you could throw it at DeepSeek, and one might be better than the other, but they all kind of work.

Demetrios [01:00:54]: Yeah, yeah.

Kenny Daniel [01:00:55]: And so I think that works a lot better for, say, the Hugging Face Hub than what we were doing back then.

Demetrios [01:01:02]: Well, you even see companies today, and I kind of look at it as these companies, like the Modals and the Base Tens and the Togethers even, it was like, that's what Algorithmia was doing back in the day, right? And there was this gap that happened where in 2021 you sold, and then it was a year until ChatGPT came out, and then it was even longer. I can't even remember when the open source models were coming out. Potentially it was Stability that came out with, like... or sorry, I mean, Mistral.

Kenny Daniel [01:01:45]: Was kind of the original, like, open...

Demetrios [01:01:47]: Yeah, that was earlier, but that wasn't even that early. I think that.

Kenny Daniel [01:01:51]: No, I mean, yeah, there was that. That was the first compelling one to me. But Mistral 7B, that was a great model.

Demetrios [01:01:57]: Yeah. And that was... end of 2022, was it? Or 2023? No, it was end of 2023. So that's a two-year gap before, all of a sudden, folks were convinced of the thesis of, okay, open source is going to be as good as what we have with closed source, it's just going to take a little longer. And then there was the other piece of, oh, well, if I don't have to deal with all this headache of these large models and the inference around them and trying to host them, why don't I just outsource that to somebody else? Because I don't need to spend that much engineering time on it, because there are other companies doing this, and also it can be price-effective.

Kenny Daniel [01:02:47]: Well, just from a practical point of view, hosting models is the perfect example of economies of scale, right? MLOps kind of things. If I'm one company and I just want to host my models, I have to spin up infrastructure for that, and if it's not being used, I'm just wasting idle resources. And if I want it to be accessible fast across the globe, right, if I have users in Australia and I want, you know, 50-millisecond time to first token, you physically cannot do that with the speed of light if you don't have servers in Australia. You can't go to us-east-1 and come back that fast. And so for one company to have to spin up a global CDN for hosting models, it would be crazy, right? It's the perfect situation for there to be a central company, whether it's a startup or, you know, a big infra provider. But it's not the sort of problem you should solve as a small player.
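The speed-of-light point checks out with rough numbers. The distance and fiber propagation speed below are approximations, but the conclusion holds either way:

```typescript
// Back-of-the-envelope check on the Australia-to-us-east-1 latency argument.
const SPEED_OF_LIGHT_IN_FIBER_KM_PER_MS = 200;   // roughly 2/3 of c in vacuum
const SYDNEY_TO_US_EAST_KM = 15_700;             // approximate great-circle distance

const oneWayMs = SYDNEY_TO_US_EAST_KM / SPEED_OF_LIGHT_IN_FIBER_KM_PER_MS;
const roundTripMs = 2 * oneWayMs;

console.log(`one-way propagation: ~${oneWayMs.toFixed(0)} ms`);  // ~79 ms
console.log(`round trip: ~${roundTripMs.toFixed(0)} ms`);        // ~157 ms
// Even before routing hops, queuing, TLS handshakes, or the model itself, the
// physical floor is ~157 ms round trip -- so 50 ms time-to-first-token from
// Australia against us-east-1 is impossible without servers near the user.
```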

Demetrios [01:03:38]: How were you guys doing it? Were you buying GPUs yourself or were you renting them from the cloud companies and then.

Kenny Daniel [01:03:45]: No, we were renting them from the cloud companies. Yeah. And we were fighting with shortages and all of that, but before it was...

Demetrios [01:03:53]: Actually from real shortages, huh?

Kenny Daniel [01:03:55]: Yeah, yeah.

Demetrios [01:03:56]: You were fighting with all the crypto miners, probably.

Kenny Daniel [01:03:59]: Yeah, yeah. Well, you're talking about that window of time, right, where before ChatGPT there was a crypto crash. There was that brief, beautiful moment when GPUs were abundant. But also, you made the point that Algorithmia got acquired in 2021, and in late 2022 ChatGPT came out, and it's impossible not to think about that alternate history where we had hung on for one more year or something, and ChatGPT comes out, and now we've been in the ML-hosting-over-GPU space for almost a decade at that point. We probably could at least...

Demetrios [01:04:39]: Raise at a pretty good valuation, I could imagine, considering some of these fucking companies. Oh my God. And well, the other funny piece on this is that there's another company that was in almost, like, your batch, we could say, that was doing similar things, and I would consider that OctoML, Octo AI. And they also got bought, though. They got bought much later, but still, they got bought.

Demetrios [01:05:17]: It's like the new batch of folks that came out and solely focused on the LLMs, on the AI narrative, they ran away with it, in a way. Yeah.

Kenny Daniel [01:05:34]: I mean, look, especially after, you know, going separate ways from the acquisition and all that, and as I was starting to think about my next company, my mindset went to ML infrastructure, MLOps, hosting. I still have a lot of thoughts about that. I spent a lot of time thinking about, well, say we had hung on, or had gotten funding, or maybe even not, maybe if it wasn't Algorithmia but I just wanted to build a similar platform from scratch but for LLMs, there would have been differences. Like, we had some assumptions with Algorithmia: it was one model, one GPU, or we could host multiple models on one GPU, but we couldn't do multiple GPUs for one model.

Demetrios [01:06:10]: Oh.

Kenny Daniel [01:06:11]: Because they weren't that big then. It wasn't that big then.

Demetrios [01:06:12]: Yeah, they weren't that big.

Kenny Daniel [01:06:15]: And so, you know, that's a bridge that you could cross, but it might require some serious engineering. So I definitely spent a while thinking about how I would have architected a hosting platform now for LLMs. But also, you know, there's a lot of smart people and companies that are now in that space and thinking about it, although I will say I don't know that any of them have really won yet. There's not a clear thing. I mean, I looked at this the other day: I have these, you know, serialized, like GGUF or whatever model files, how do I host these in a serverless sort of way? And there are obviously solutions; I was kind of honing in on Replicate and some of those kinds of ones, but there's not a clear winner.

Kenny Daniel [01:06:53]: And it's crazy to me also how bad AWS's approaches have been, but they...

Demetrios [01:06:59]: Just don't care, man. Like, let's call it spade, a spade. But it. One thing that is fascinating to me on this is we had a DD on here from Kleiner Perkins probably a month ago or two months ago, and he was talking about how if you look at the AI Space and the MLOps and the LLM ops, and the agent ops, whatever ops you want to call it, the ones who are actually making money are the GPU infrastructure folks, because you can do AI with whatever orchestration tool you want or whatever data cleaning tool you want or whatever, all that is optional. What's not optional is GPUs. If you're doing AI, that is where the rubber hits the pavement.

Kenny Daniel [01:07:50]: Yep. Yeah, I mean, that's certainly true that there's a lot of money to be made on GPU arbitrage at the moment, right? Absolutely. The Togethers of the world and things like that, you know, raking it in, because anybody, anywhere you can get GPUs. I don't know how that's going to play out. I mean, I don't think demand for GPUs is going away. They're obviously cranking them out as fast as they can, but I don't know, it'll be interesting to see.

Demetrios [01:08:19]: Yeah, and like you said, there's no clear winner. It looks like there's a lot of success all around the board in the GPUs. The real money that's being made in the GPUs is the inference side of things. And I'm sure you saw that too. Like, you were just doing inference, right? You weren't doing any training, you weren't doing any sort of...

Kenny Daniel [01:08:41]: Yeah, and that's actually something I was pretty proud about with Algorithmia, in the sense that a lot of people conflated the two and just said, oh, well, a GPU is a GPU; if you want to do training, if you want to do inference, it's all the same. But the workloads are so drastically different that you really wouldn't build the same tools to do training versus inference. And so the fact that we focused on inference was also, I think, something that was ahead of its time, because, like I said, you get the benefits of scale: you can better utilize slack capacity when you have a wide range of demands, and there are economies of scale if you want to go global and deploy to different data centers. Whereas for training, you want a big pile of GPUs in one room with very high interconnects for a short period of time, or maybe longer if you're doing big training runs.

Kenny Daniel [01:09:33]: So it is very different. And I think in the world now, we're actually seeing that a lot. There's obviously money to be made in training, especially if you're Nvidia, but inference demands are off the charts, right? Like, Claude, or Anthropic, can't keep up with demand because they don't have enough GPUs for inference. And with the O series of models, you know, the speculation is that that's what OpenAI is really trying to build these data centers for: not for more training, but because they see the demand for inference, for test-time compute and chain-of-thought reasoning at test time, just going crazy. And yeah, I anticipate inference to be an increasingly large percentage of the total GPU spend.

Demetrios [01:10:18]: Awesome, dude.
