MLOps Community

No GPU Before PMF

Posted Mar 11, 2024 | Views 167
# Dust
Stanislas Polu
Software Engineer & Co-Founder @ Dust

Stanislas Polu is a software engineer and co-founder of Dust, and an alumnus of Stripe and OpenAI.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used artificial intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!


Why building in Gen AI does not necessarily mean training your own model, where the space is going, and how startups can optimize in this fast-moving ecosystem.


No GPU Before PMF

AI in Production


Adam Becker [00:00:05]: Thank you. Next up. Are you here, Stanislas?

Stanislas Polu [00:00:08]: Yes, I am.

Adam Becker [00:00:09]: Okay. Good to see you. How are you, man?

Stanislas Polu [00:00:12]: Doing good.

Adam Becker [00:00:13]: Okay. So, Stanislas, I mean, you'll introduce yourself, and what I can tell folks is you've spent some time at Stripe and at OpenAI, and now you're going to tell us a little bit about how founders and early-stage folks should be spending their attention when they're just starting out. So I'm sure a lot of people in the audience can benefit from this. The floor is yours, your screen is right here, and I'll see you in ten minutes.

Stanislas Polu [00:00:42]: Thank you very much. Hello, everyone. I'm Stan. I'm the co-founder of Dust. We basically build a platform that brings together humans, company data, and models. The goal is to let every employee create assistants to supercharge their work. That covers incident-response assistants for engineers, code assistants for engineers, assistants that prepare your meeting notes for product managers, et cetera, et cetera. We started working on Dust 18 months ago.

Stanislas Polu [00:01:14]: Before that, I was a researcher at OpenAI for three years, where I worked on studying the reasoning capabilities of large language models, and in particular the mathematical reasoning capabilities of large language models, in particular in the context of formal mathematics. So I have the kind of unique position of having trained thousands of models, using probably millions of GPU hours over my three years of work as a researcher, and at the same time being part of the community of people building product in the space for the past, soon to be, two years. So I wanted to share some thinking about how the ecosystem is shaping up and how the technology is shaping up. Maybe not answers, but at least some directions for founders and people building in the space to think about where the technology is going and how it should impact the way they think about products and their infrastructure. One interesting thing that has been a very differentiating force between different startups operating in the AI space is B2C versus B2B. We are definitely in the B2B bucket ourselves, so we really try to sell to companies. But think about all the companies that are doing B2C stuff: when you're doing B2C, obviously you have large volumes.

Stanislas Polu [00:02:46]: And so generally the order of priority for B2C companies is currently cost, then speed, then performance. Basically, if you think about Perplexity, you have millions of people asking, most of the time, very trivial questions. So you want something that doesn't cost too much, because you're serving all those millions of people. You want something that is very snappy, because that's what makes the product sticky, especially in a B2C setup. And only last do you think about performance. For a B2B use case like ours, we have a completely inverted order of priority, because we're capable of charging more. We really think first about performance of models, speed second, and cost third. I'll come back to the cost question.

Stanislas Polu [00:03:33]: But basically when you are in a bit we set up, that's really a setup in which you can probably really just not think about cost. Today. It's fine to do crazy things that cost a lot for even a simple product interactions, because as I'll make the argument a little bit later, there's a good reason to believe that that cost will diminish very drastically. And so if you are going in the B two B motion, it's fine to absorb that cost during that transition period. The second point that I wanted to touch base is about fine tuning, because it's really interesting how everybody talks about fine tuning and everybody wants to, especially when you talk to enterprise, everybody wants to have their own fine tuned models on their own data. And I think it's a little bit of a fallacy that is important to unpack and to think about. Basically, first, fine tuning of large rogue models is something that, scientifically speaking, is not super well understood yet. One very simple example of that is that we still don't know how to do online training.

Stanislas Polu [00:04:46]: That's how bad we are at fine tuning. We've created an artifact through pretraining that is called a large language models. And we have a new feed of data coming, which is the feed of data coming from the world, and we don't know today how to incrementally update those models. So we're really bad at fine tuning. There are things that we do well, like alignment and stuff like that. But the science of fine tuning is really not well understood, and so it's important to understand that. And second, fine tuning on company data, as an example, is something that is meant to not work. It's something that many people are offering to many companies because that looks cool to have.

Stanislas Polu [00:05:26]: Your model has been fine tuning on your data, but for use cases that are varied and that are generally the ones that happens when people interact with models for productivity reasons, that fine tuning is very likely to not work. Let me explain why. If the tasks that you're interested in are very varied and quite informal, then only the largest models will really be good at that. If you take a small model and you try to ask very varied questions about complex stuff, you're generally going to have much worse response than with a large model. And so you're stuck with large models for those kind of questions. And now if you try to fine tune a large model, generally the data set that you have access to, let's say all the data that has been generated by a company, is really too small to effectively fine tune a large model. And so the reason why is because the model is so gigantic that you won't be able to move the internal knowledge graph of that model sufficiently. And so what you're only going to be training the model to is to given the knowledge graph to invent more data that looks like the company data.

Stanislas Polu [00:06:38]: And so you're bound to have a lot of hallucination.

Adam Becker [00:06:43]: Just quick question, just in case, are we supposed to see a screen other than what we're seeing? It shows no PMS.

Stanislas Polu [00:06:52]: I just had one slide, so that's fine. Good, you can even remove it. I'm happy to show my face full screen. And so I think one motto that we have at dust is no GPU before PMF. And I think it's important to realize that trying to train your models as a startup might be your core strategy. And there's a lot of startups that are really proficient like that. We can think about Mistral as an example, but for many startups out there, it's probably the wrong strategy. It's very fashionable to train your own model.

Stanislas Polu [00:07:30]: It's very fashionable to use gpus, it's very fashionable to create a mode around training. But the reality is that you're just creating a small rock that might be washed out by the next generation of models. And so it's very important to think hard about whether you want to open that box, really. Startups are not meant to do research, and fine tuning still is something that is within the research realm as of today, to kind of give some usable stuff for the audience. I think it's interesting also to think about how the ecosystem will shape in the future. It's very interesting to realize that GPT four is soon to be two years old, meaning that the first version, the first end of training of GPT four was almost two years old. Given the space in which we are, that seems like an eternity. So for those past two years, that model has been the best model, and most of the competition has been catching up with it.

Stanislas Polu [00:08:40]: But it's interesting to also realize that OpenAI hasn't released the new model since then. And it's interesting to ask yourself the question as to why. One thing is that scaling those models at the scale is necessary to go to the next scale after GP four is extremely difficult. And the second is that the scaling lows that govern the performance of models as you increase the compute are scaling lows. So that's basically power rows with a very steep exponent, meaning that as you add more compute and more data, the amount of compute that need to be add at the end to really make a difference become gigantic. And so basically we have two alternatives in front of us. Basically, in the past two years we've had GPD four as the king model. That is much better than any other model.

Stanislas Polu [00:09:29]: And all of the open source and other players has been playing catch up to GPT four. We are today entering a world where we have sales competition to GPT four. Gemini probe 1.5 and Gemini Ultra are Sayus competition to GPT four. And even the open source community is catching up with GPD four. I think it's legitimate to think that Mistral is capable to have a model that is at GPD four level within the year. And so we really have two interesting alternative in the future. It's whether GPD five or the next generation of model will be substantially better than GPD four or not. And if you think about the scaling lows and the fact that GP four has been there for two years, there's kind of a high probability, or an interesting probability that the next model might not be, from a human perspective, might not be substantially different than GPT four.

Stanislas Polu [00:10:21]: And so if that's that scenario, that's going to be very interesting, because all the competition will catch up with OpenAI and we'll live in a world of highly competitive token providers, and the cost of token will just go to the ground very rapidly. And then the second alternative is GPT five becoming being released, or the next model being released, whether it comes from OpenAI or Google or any other actor. But the next big model being released is substantially better than GPD four, in which case we're probably going to end up in a dynamic that is similar to the past two years, where we have one leader that is capable of making a somewhat high price of token and all the competition trying to play catch up. Given the fact that GBD four is two years old, I think my conviction goes more and more towards we might not have a massive improvement of models by scaling. And another signal towards that is the fact that most of the differentiation today comes from going to multimodal going to longer context and not going to a bigger model. And so that's going to be a very interesting ecosystem that we're going to live in as all that condition catch up. And I think it's something you have to take into account when you think about what's going to be the good model to use and what's your strategy around training your own model, et cetera. So I hope that's been useful to get some direction in that space and some ideas on where the space might go.

Stanislas Polu [00:11:42]: And thank you very much.

Adam Becker [00:11:44]: Thank you very much. Stanislas, please stay in the chat. I would love to ask you questions for about a full day. This is fascinating; I'm teeming with questions. I also think that as a slogan, so many things should not happen before product-market fit. Most people that I talk to in the startup world are like, "I'm trying to do that thing." Just don't do any of that. Just product-market fit first, 100%. But also the dynamics that we're likely to see, and how we should be reacting to the scaling laws, and how those scaling laws are already going to impact the strategy that startups should be taking into consideration.

Stanislas Polu [00:12:31]: It's fascinating.

Adam Becker [00:12:32]: Sam, thank you very much. Thank you very much for all of your insights.

Stanislas Polu [00:12:35]: I'll stick around. There's any questions.
