Building Products // Panel 2
George Mathew is a Managing Director at Insight Partners focused on venture-stage investments in AI, ML, analytics, and data companies as they establish product/market fit.
He brings 20+ years of experience developing high-growth technology startups, most recently as CEO of Kespry. Prior to Kespry, George was President & COO of Alteryx, where he scaled the company through its IPO (AYX). Previously he held senior leadership positions at SAP and salesforce.com. He has driven company strategy, led product management and development, and built sales and marketing teams.
George holds a Bachelor of Science in Neurobiology from Cornell University and a Master of Business Administration from Duke University, where he was a Fuqua Scholar.
Asmitha Rathis is a Machine Learning Engineer with experience in developing and deploying ML models in production. She is currently working at an early-stage startup, PromptOps, where she is building conversational AI systems to assist developers. Prior to her current role, she was an ML engineer at VMware. Asmitha holds a Master’s degree in Computer Science from the University of California, San Diego, with a specialization in Machine Learning and Artificial Intelligence.
Natalia is an AI Product Leader who was most recently at Meta, leading Responsible AI. During her time at Meta, she led teams working on algorithmic transparency and AI privacy. In 2017, Natalia was recognized by Business Insider as "The Most Powerful Female Engineer in 2017". Natalia was also an Entrepreneur in Residence at Foundation Capital, advising portfolio companies and working with partners on deal flow. Prior to this, she was the Director of Product for Machine Learning at Salesforce, where she led teams building a set of AI capabilities and platform services. Before Facebook and Salesforce, Natalia led product development at Samsung, eBay, and Microsoft. She was also the Founder and CEO of Parable, a creative photo network acquired by Samsung in 2015. Natalia started her career as a software engineer after earning a bachelor's degree in Applied and Computational Mathematics from the University of Washington.
Sahar is a Product Lead at Stripe with 15 years of experience in product and engineering roles. At Stripe, he leads the adoption of LLMs and the Enhanced Issuer Network, a set of data partnerships with top banks to reduce payment fraud.
Prior to Stripe, he founded a document intelligence API company, was a founding PM at a couple of AI startups, including an accounting automation startup (Zeitgold, acquired by Deel), and served in the elite intelligence unit 8200 in engineering roles.
Sahar authors a weekly AI newsletter (AI Tidbits) and maintains a few open-source AI-related libraries (https://github.com/saharmor).
There are key areas we must be aware of when working with LLMs. High costs and low latency requirements are just the tip of the iceberg. In this panel, we hear about common pitfalls and challenges we must keep in mind when building on top of LLMs.
Hello. Let's see. We have George, Natalia, Sahar, and thank you all so much for joining us. I think this is everybody. We're missing one person. Ah, we're missing Sam. Ah, there he is. Oh man. Sam, you were the most important person here. I added you last. Alright, well, I'll let you all take it away. Thank you so much for joining us.
All right. Hey, thanks so much, Lily. I am in hearty agreement. This is a very exciting topic and one that I get a lot of questions and comments around: hey, LLMs are super exciting, but how do we actually build products around them? And that is what we're going to dig into in this panel. I will quickly introduce our panelists so that we can dig right into the heart of the discussion.
We've got Asmitha Rathis. Asmitha is a machine learning engineer at PromptOps, where she's applying LLMs to enterprise observability for DevOps engineers. We've got George Mathew. George is a managing director at Insight Partners, where he has invested in companies like Weights & Biases and Fiddler.
We've got Natalia Burina, a former AI product leader at Meta, where she focused on issues like transparency, control, and explainability. And we've got Sahar Mor, who's currently at Stripe, where he's leading the work around making LLMs a thing at the company. I am, of course, Sam Charrington. I am host of the TWIML AI Podcast.
If you've not come across the podcast, I encourage you to visit twimlai.com or your podcast platform of choice and check us out. We've got 600-plus episodes on ML and AI, including lots on LLMs and generative AI for you to dig into. I asked our panelists for their spiciest takes on LLMs, and we've got some really interesting responses there.
And we're gonna start there, just to get the... we'll start at spicy. Natalia, why don't you kick us off with yours? Yeah, my spicy take is that LLM hallucinations are a feature and not a bug. Why? People lie. We can't verify even what people say. So why do we expect that we'll be able to verify what machines are saying, and how can we even differentiate truth from lie?
There's a lot of gray area in this. The other part of it is, I don't think we'll ever get rid of hallucinations. It's a fool's errand to try. Instead, the right thing to do is to use LLMs for the right use cases, where they can shine, and in particular that means around creativity and inspiration.
And yeah, that's my hot, spicy take. Awesome. Sahar, why don't you go next? Yeah, first, I also agree with Natalia that hallucination is going to stick around for some time; we need to have great ways to mitigate it. So on my end, I actually think that open source language models are the future.
We got this recent paper from Berkeley about the false promise of open source LLMs, but at the same time, since Meta's release of LLaMA last February, we can see how powerful open source LLMs can be. So first, they can run at the edge, which is a great fit for everything that is privacy-related use cases, and they're also independent of an internet connection.
So we have great work like MLC LLM and also llama.cpp that gets us really close to being able to run these amazing technologies at the edge. Developers can fine-tune them for specific purposes, so you can achieve better performance for less compute and less latency. And they also give you a lot of flexibility without predefined constraints or policies, which is sometimes an issue, for example, when you use the OpenAI API.
And licensing is no longer an issue. We got some really strong, powerful LLMs that are now commercially permissible, like Falcon. And Microsoft released this paper, Orca, a few weeks ago that basically shows how smaller LLMs and open source LLMs can also perform well when you imitate the right things and not necessarily just the style, like maybe other open source LLMs do.
So, exciting futures, and I'm looking forward to seeing what people will do with those open source models. Yeah. Yeah, my spicy take is that, you know, we're all here because of the generalized LLMs. Everyone's talking about them, but I think the future is going to be very domain-specific, specialized models, where each company's, each organization's data, their proprietary data, is going to be the key going forward.
Everyone's going to have their own data, have their own specialized models, and they're going to release these as, like, a model as a service, where the proprietary data is going to be the big thing. Awesome, awesome. And George? I love how all of our commentary is somewhat interrelated.
Mine is that we're going to run out of public corpus to be able to train these large language models at scale, and that's gonna happen in the next half a decade, when, let's just say, the GPT-5-style models start to emerge. Mainly because, you know, we're approximately training on 20% of the human corpus, and we've gone exponential on every last run of a generative pre-trained model at scale up to this moment.
So if there's another 10x happening in terms of the extent of the training data that's going to come into training the next-generation model, we're literally just gonna run out of publicly available human corpus. And so, coming back to Asmitha's point as well as Sahar's point a moment ago, we are going to move to a world of private data, particularly on smaller form factor large language models, which will likely emerge around open source. That's where some of these small form factor large language models are most relevant, and where this half a decade and beyond is gonna go for the future of building transformer-based architectures.
Now, are these spicy takes so spicy that there's some argument? Would anyone on the panel like to pick on any of the takes we've heard thus far, or anyone in the audience for that matter? Anyone jumping in to disagree? I guess I could go. I'm not disagreeing, but there is another point: right now we're seeing so many companies out there that are not training these models themselves, but are putting LLMs into their products without a huge ML team, right?
There's no managing your training, your whole ML pipeline, your whole ML infrastructure. And with these APIs that OpenAI and Cohere have, it's so easy for companies that don't have ML expertise to include LLMs in their product. So I do see it coming from that side too.
There's a role for services. Yeah. Got it. Got it. Yeah. I think if you were taking the other side of this argument, you could make the case that, hey, the quality of these smaller form factor open source models might never be as good as what the large research houses are effectively producing.
And over time, the research houses that are producing these large language models are going to have additional capabilities, you know, like the function libraries and APIs that emerged about a few months ago. There's just going to be a general arms race where the open source community and the use of, call it, the public data sets is not going to be enough.
And you're gonna see the research houses take on a lot more workload when it comes to leveraging private data, multimodal sets of data. And there's likely a chance that the things that are being predicted, at least as we indicated, could have an alternative future.
So you could argue very easily for the countervailing point here, that this is all gonna stay very much in the hands of three or four well-funded research shops, being OpenAI, Cohere, Anthropic, and AI21, as the continued leaders in this space. But it doesn't seem that way right now.
It does seem like there is a prevalence towards small form factor models and open source in particular. Actually, just maybe adding on that: if we look at the timeline, I actually think that at some point open source models will be good enough. So now we have GPT-3.5, and we already have models reaching that level of performance.
So maybe a year from now we'll have, like, GPT-4 or even more than that, potentially. I know it's kind of hard to think about at this point, but if we take the history so far, and we believe some of those papers about open source models achieving this performance, I think it's a matter of time until we get there.
And then the other question is, as we are building products in production, what are the requirements for such products? Do we actually need a GPT-4-like model to do question answering on documents or answer very specific questions? So I think that as we build more picks and shovels in this space, as we improve on evaluation, and as we better understand what use cases we would like to serve, we can get to better results even with smaller and open source models.
But this will be an iterative process, for sure. Well, we'll touch on all of those themes, and in particular model selection, a little bit later. I want to take us to one of the very first issues that a product builder needs to think about, and that is the degree to which their use case is a good fit for LLMs.
Right now, LLMs are this shiny, magical hammer, and everything is a nail just waiting to be smacked with it. But, Natalia, you've got some really interesting experience thinking about applicable use cases for LLMs and AI generally, and I'd love to hear you chime in on frameworks for thinking about this, and in particular: are there recurring use cases that people get excited about that are ultimately doomed to failure because they're not a good fit?
If you wanna throw that spicy take in there. Yeah, absolutely. So I would start off by thinking about what are the fundamental characteristics of LLMs that really make them shine, and it's really about fluency. So this is why LLMs are a big deal and why everyone's excited about them: they have this uncanny ability to give answers that are surprisingly human-like. This is why we have so much hype.
And that's really what fluency is: the ability to articulate very natural-sounding answers that are human-like, and that is their strength. However, on the other hand, as we already touched on, we also have hallucinations, or, as other people like to call them, confabulations, where LLMs sound perfectly plausible
but can be disastrously wrong. We can't trust them. And so a good way to think about a framework for use cases is around two axes: visualize accuracy as your horizontal axis and fluency as your vertical axis. And then you can map your use cases along those two lines.
And if you do that, you will see that the places where you don't need accuracy, but you do need fluency, and you really benefit from LLMs, are things like writing a greeting, writing a poem, writing fiction, writing children's stories. So these are really creative use cases where you need lots of inspiration.
Then, on the other end, you might have use cases where you need a lot of accuracy, but you don't need fluency. These are really about being precise on the information and getting the right answer. So really, to me, those are the two buckets: the creator and productivity use cases versus the second bucket, which is more around decision making.
So I'd consider where your use cases fit along those two lines. And once you map them, I would recommend thinking about your business decision framework, and there's really a few aspects of that. One, are LLMs going to make your business obsolete? Then go all in and invest.
Two, can they boost your revenue? This might be customer support, call centers, et cetera. Then you have to think about, do you build or buy? Third, can you get a competitive advantage? And this is like that vertical scenario, which is beautiful, and I can't wait to see those come to fruition.
And then finally, you might be in a position where LLMs just make no impact on your business and product. You could experiment, but you don't really need to apply them. So those are the two frameworks: one is around use cases, and the second one is around making that business decision and seeing what your situation is
before making an investment or before even beginning to experiment, which I highly encourage everyone to do; it's pretty cheap. And that's where I'm gonna conclude. Anyone want to jump in and elaborate? Yeah, maybe I can add my two cents. So from my experience at Stripe, but also elsewhere, I feel like the biggest gap that will make us work more with LLMs, or even feel more confident, is actually, again, evaluation.
So let's imagine a world where every time we change the prompt, we can immediately know how well it performs on, like, bias, toxicity, how often it hallucinates, and also how it correlates with our own performance benchmark. Then I would feel a lot more confident deploying something like this into production.
So in a way, it reminds me of how deep learning and machine learning were a decade ago, where we were all a bit scared to deploy something to production because it might yield results we don't expect. But then evaluation, monitoring, what we now call MLOps, really made it a lot easier and more streamlined, and therefore increased confidence in deploying these kinds of applications.
Yeah, that fear is a real element that I see a lot, particularly on the part of product owners and executives. It's easy to get enamored by a ChatGPT demo, but when you think about the brand risk that's possible when your customer-facing LLM starts hallucinating publicly, it really puts some pause on things.
And especially with the evaluation frameworks not fully baked, has anyone come across any useful ways to drive team alignment and alleviate that kind of fear?
Natalia? I would recommend... well, I'm gonna plug my own work here, but one of the things we developed at Meta was something called an AI system card. It's essentially a way to publicly report on the risks and to present the evaluation. And actually, this is what OpenAI did with the latest version of GPT: they published their AI system card.
They literally went through and said, okay, this is how we looked at all the safety risks, this is how we evaluated toxicity. It's a very exhaustive document, about 40 pages. There's a lot in this space. And the other thing I would say is, you have to think about the fact that evaluation is expensive.
Evaluating your AI products is gonna take time, it's gonna take resources, so you have to figure out how much you can do. But I think everyone should absolutely do it, and I encourage everyone to go through and look at the GPT-4 system card, which is a really good guide for a very comprehensive evaluation of an LLM and all sorts of issues that can come up. We mentioned privacy, security, robustness, bias, toxicity.
There's so much in here. Yeah, there's a lot to do. Yeah. Prompting has obviously been raised quite a bit at this event so far. It's core to LLMs, and it's in a lot of ways a new way to create software and systems, with a rule book that's being developed as we speak. Asmitha, can you talk a little bit about your experience with building prompt-based applications?
Sure, I would love to. I think prompting is a whole topic on its own; we could do a whole panel just on that. You know, over the past six months, the whole way we build applications has been changing so much with these prompts, and with tools like LangChain out there, it's very easy to get a POC out over a weekend.
But now pushing that POC, pushing your idea into production, that's the hard part. And there are a lot of things that you can actually do with prompts. Let's take, for example, one use case that everyone's been talking about: a document question answering system. You can provide a lot of context along with the prompt that you develop.
And this can be, you know, document one with its URL, document two with its URL. And for how the output comes out, you can always structure that output. You can prompt it and say, hey, I want this output in a particular format. I've seen that help a lot. Let's say you want it in a JSON format with X, Y, Z, or with, you know, markdown tags. All of these help.
But another thing that also helps is adding examples. Lots of examples and relevant context in your prompt has helped a lot. For me, at least, it's like a weak RLHF, kind of like a few-shot set of examples that you can add within the context of your prompt.
But it's not just adding examples of how you want this prompt to complete. A lot of people say, hey, I add examples and it overfits to those examples. One way to overcome this overfitting is by adding relevant examples through the contextual retrieval everyone's talking about, with the vector databases out there. You can do a vector database search, like a semantic similarity search on the user's query, fetch the relevant context, add it into your prompt, and provide a list of relevant examples there. I've seen that do a great job.
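To make that pattern concrete, here is a minimal sketch of assembling a document-QA prompt with a requested JSON output format and retrieved few-shot examples. This is an illustration rather than PromptOps' actual code, and embed() is a deterministic stand-in for whatever embedding model and vector store you actually use.

```python
import json
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: a deterministic pseudo-random vector. Swap in a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def top_k(query: str, items: list[dict], k: int = 3) -> list[dict]:
    """Naive semantic search: cosine similarity between the query and each item's text."""
    q = embed(query)
    scored = []
    for item in items:
        v = embed(item["text"])
        sim = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((sim, item))
    return [item for _, item in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]

def build_prompt(question: str, documents: list[dict], examples: list[dict]) -> str:
    # Retrieve only the context and few-shot examples relevant to this query,
    # instead of stuffing everything into the prompt.
    context = top_k(question, documents)
    shots = top_k(question, examples)

    doc_block = "\n".join(f"[{d['url']}] {d['text']}" for d in context)
    shot_block = "\n".join(f"Q: {e['text']}\nA: {json.dumps(e['answer'])}" for e in shots)
    return (
        "Answer the question using ONLY the documents below.\n"
        'Respond as JSON: {"answer": "...", "sources": ["url", ...]}.\n'
        'If the documents do not contain the answer, respond {"answer": "N/A", "sources": []}.\n\n'
        f"Documents:\n{doc_block}\n\nExamples:\n{shot_block}\n\nQuestion: {question}\nAnswer:"
    )

# The returned string is what you send to the LLM of your choice.
```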
Another thing I've faced when it comes to getting these LLMs into production is latency, right?
As you chain prompts, you know, with so many agents and chains of prompts, latency has been a huge issue, and you can actually address this with a lot of UX features, depending on your use case. Streaming is a great thing that we've seen ChatGPT do. You don't feel the latency because ChatGPT keeps streaming those messages to you.
Another thing is you could split your prompts into multiple steps. For example, for document search: first fetch the documents and provide those to the user, and let the user look into the documents while you fetch the answer. So splitting your prompts and getting intermediate messages out there does wonders in hiding the latency, from what I've seen.
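Here is a rough sketch of that "show something early" idea, assuming an async chat backend; send_to_user, retrieve_documents, and stream_answer are hypothetical placeholders for your own UI push, retrieval step, and streaming LLM call.

```python
import asyncio

async def send_to_user(message: str) -> None:
    # Placeholder: push a message to the client over your websocket / chat UI.
    print(message)

async def retrieve_documents(query: str) -> list[str]:
    await asyncio.sleep(0.1)  # stand-in for a fast vector-store lookup
    return ["doc-1: deploy guide", "doc-2: rollback runbook"]

async def stream_answer(query: str, docs: list[str]):
    # Stand-in for a streaming LLM call that yields chunks as they arrive.
    for chunk in ["Based on doc-1, ", "run the canary deploy first, ", "then watch error rates."]:
        await asyncio.sleep(0.3)
        yield chunk

async def answer(query: str) -> None:
    docs = await retrieve_documents(query)
    # 1. Immediately show the user what was found, so the wait feels shorter.
    await send_to_user("Looking at: " + ", ".join(docs))
    # 2. Stream the answer chunk by chunk instead of waiting for the full completion.
    partial = ""
    async for chunk in stream_answer(query, docs):
        partial += chunk
        await send_to_user(partial)

asyncio.run(answer("How do I roll out a canary deploy?"))
```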
And, yeah, I think that's my overview, a really short overview of the prompting tips that I've learned. Awesome. Sahar, you have a take on that? Yeah. I mean, one thing I found really useful for avoiding hallucinations... so first, we know the normal tips, like telling it to act as something.
The way I think about GPT and LLMs is as this huge tree of many activations, neurons, and you always want to help it understand what you expect it to do. So the more information you can provide within the prompt to help it know which path to activate, the easier it will be for it to help you achieve your goal.
So specifying the format and stating what role it should act as has always been useful. One concrete tip for avoiding hallucinations within the prompt: in many cases, I think about LLMs as these satisfiers, so they always want to provide you with an answer. Just by giving it an out, it will actually follow.
For example, if you want to classify an Amazon review as positive or negative, but maybe sometimes you're missing context or you're getting input that is not a review, you can always ask it to otherwise return an N/A or an error. And by doing so, it'll actually avoid hallucinating an answer that is wrong.
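As a hedged illustration of giving the model an out (the wording is mine, not a prompt Sahar shared), a classifier along these lines might look like:

```python
REVIEW_PROMPT = """You are classifying Amazon product reviews.
Classify the text below as exactly one of: POSITIVE, NEGATIVE, N/A.
Return N/A if the text is not a product review or there is not enough
information to decide. Do not guess.

Text: {review}
Label:"""

def classify_review(review: str, call_llm) -> str:
    # call_llm is whatever completion function you use; anything outside the
    # allowed labels is treated as N/A rather than trusted.
    label = call_llm(REVIEW_PROMPT.format(review=review)).strip().upper()
    return label if label in {"POSITIVE", "NEGATIVE", "N/A"} else "N/A"
```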
Mm-hmm. And return an N/A or some other predictable form. Talk more broadly about your experience combating hallucination. What kinds of things have you run up against at Stripe, and how have you approached them, beyond this prompt trick you mentioned? Yeah, so, first, hallucination is where LLMs make up data that is incorrect.
And I feel like the main difference from other ML and deep learning models before is that they do it really confidently, so it's really hard to tell if it's actually wrong or not. That's one of the biggest shortcomings. Building on Natalia's point, it's very use case dependent. So working with a brainstorming partner is one thing, versus building maybe a behind-the-scenes classification model or
pipeline that will then have substantial downstream tasks. So first, about hallucination: understand what your use case is and what the impact of a mistake is, then look at ways to mitigate hallucinations. What I've seen be very useful is, first, giving the LLM an out, as I mentioned, but also prompt chaining. There are a few approaches, like self-reflection, and also, like another video I've seen recently, SmartGPT, where you basically ask a question, then you ask the LLM to
criticize the answer as an expert in this space, and then provide the resolution. And by doing so, it increases performance substantially, at the cost of latency and cost. So that's another idea, prompt chaining: you get an answer and you check if the answer is actually correct.
And another option and mitigation path is actually citations. Forcing the LLM to quote the source for its answers is really useful for reducing hallucinations, and it will also allow your users, or yourself, to check the generated response. Does it make sense? You can actually go to that page in that PDF and see if it makes sense.
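A minimal sketch of that critique-then-resolve chain, combined with forced citations (an illustration of the general idea, not a Stripe implementation; call_llm is a placeholder for your completion function):

```python
def answer_with_critique(question: str, context: str, call_llm) -> str:
    # Pass 1: draft an answer that must quote its sources from the provided context.
    draft = call_llm(
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Answer the question, and after every claim cite the source in [brackets], "
        "quoting the exact passage you relied on."
    )
    # Pass 2: ask the model to critique its own draft as a domain expert.
    critique = call_llm(
        f"You are an expert reviewer.\nQuestion: {question}\nDraft answer:\n{draft}\n"
        "List any unsupported claims, missing citations, or likely errors."
    )
    # Pass 3: resolve the critique into a final answer.
    return call_llm(
        f"Question: {question}\nDraft:\n{draft}\nCritique:\n{critique}\n"
        "Rewrite the draft, fixing every issue raised. Keep the [bracketed] citations."
    )

# Trade-off: three completions per question, so latency and cost roughly triple.
```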
And lastly, and this is relatively nascent and recent, there is this research called LLM-Blender from the Allen Institute for AI, where they combine multiple LLMs. Each LLM kind of peaks in different areas, so if you combine them all and know which one to call at the relevant time, it might help you improve performance overall and avoid hallucination.
Yeah, we're seeing a bit of that in some of our companies that are doing some more advanced work here. Jasper, for instance, is a great example of a company that's actually using a blender-style model to orchestrate across multiple LLMs to understand and produce the best results.
One of the other areas where we're noticing a pretty significant amount of opportunity is where there are effectively blanks in the latent space of a model: you can take, call it, visualizations of the latent space of a model and be able to understand where there are effectively holes in it. The folks at GPT4All, the Nomic team, are actually doing some pretty impressive work here in terms of visualizing the latent space of a model and then helping understand where to introduce additional retraining runs to improve the fidelity of models over time.
So that's a, call it, pretty exciting area where we're seeing opportunity, particularly when you're dealing with a lot more small form factor models that you're trying to train for domain-specific purposes. Retraining based on where there are missing elements of the data set, to train the latent space properly, is something that's exciting.
George, in one of the earlier sessions, Josh Tobin had this great point about the companies that we really admire for doing a great job with AI. Tesla was an example he gave. They didn't just throw out a model and declare victory; they built this continuous feedback and improvement loop around their product, and that's really the differentiator and why we admire those companies.
What have you seen in terms of companies building these kinds of feedback and improvement loops around LLM-based models in the wild? Yeah, I'm glad you asked that question, Sam. I think when you look at the best LLMs, and just the transformer-based architectures that are going to continue to evolve in this space, they're inherently all gonna have some reinforcement learning associated with them, whether that be a human feedback loop, a machine feedback loop, or a human-and-machine feedback loop.
I mean, just think about the fundamental difference between what we saw with ChatGPT early on and more recently, right? It wasn't just that they started to evolve the model itself. Why are there so many fewer hallucinations around ChatGPT today? Well, it turns out that it is the feedback loop itself.
It's the RLHF that improved the model effectively over time. So I think we're going to see quite a bit more, call it, model fidelity and model improvement from the fact that there are good feedback loops naturally part of any, call it, model that's being built for the purpose of these domains, as they improve the application experience around it.
I don't think this is a journey where you just send a model off into production and then never improve it beyond introducing the next generation of the model. I think in most cases there's gonna be some kind of feedback loop for reinforcement learning that's gonna continuously improve models over time.
And when we think about that kind of continuous improvement, it's hopefully relevant or related to some kind of evaluation metric. We've touched on evaluation previously. Sahar, dig into that a little bit deeper: what have you seen working from an evaluation perspective for these kinds of models?
Yeah, first of all, I'll start by saying that I'm so surprised by how far behind we are on evaluation as a community, as an AI space. It doesn't prevent us from running fast and deploying things into production, but still, there's progress to be made there. I know that many folks are working on making evaluation better, like LangChain and other smaller startups, but there's still a gap.
I feel like when it comes to evaluation, there are a few ways we go about it. The first one is in an automated manner. Some papers have used LLMs, like GPT-4, to score the generations. Because generations are more open-ended text, more subjective, LLMs lend themselves really well to this.
One useful tip: if you're using GPT-3.5, you should use GPT-4 to judge the inferior model. Using superior models to judge inferior models actually works surprisingly well. You also need to make really clear what your metrics are and what you're measuring.
For example, you might want to have a benchmark or a set of examples for hallucinations, another set of examples for toxicity and whatever you define in your company as things that should not make their way to the user, and you might want to have a more product-performance-specific benchmark for your specific use case.
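A rough sketch of that superior-model-judges-inferior-model setup (illustrative only; call_judge_llm stands in for whichever stronger model you use as the grader, and the rubric and case schema are assumptions to adapt):

```python
import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference notes: {reference}
Candidate answer: {answer}

Score the candidate from 1-5 for factual accuracy against the reference and
1-5 for helpfulness. Reply as JSON: {{"accuracy": 0, "helpfulness": 0, "reason": "..."}}"""

def judge(case: dict, candidate_answer: str, call_judge_llm) -> dict:
    raw = call_judge_llm(JUDGE_PROMPT.format(
        question=case["question"],
        reference=case["reference"],
        answer=candidate_answer,
    ))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # The judge can mis-format too; count that as an evaluation failure, not a zero score.
        return {"accuracy": None, "helpfulness": None, "reason": "unparseable judge output"}

def run_benchmark(benchmark: list[dict], generate, call_judge_llm) -> float:
    """Average accuracy score of `generate` over a benchmark of {question, reference} cases."""
    scores = [judge(case, generate(case["question"]), call_judge_llm) for case in benchmark]
    graded = [s["accuracy"] for s in scores if s["accuracy"] is not None]
    return sum(graded) / max(len(graded), 1)
```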
So hopefully we will get more and more tools that help us streamline these processes, and I expect other players like LangChain and Weights & Biases to also release a few tools to help us get there. But still, it's a lot more manual than I would have expected it to be at this point. Asmitha, what have you seen?
Yeah, I'll chime in with what Sahar said. I think another thing to bring in is prompt versioning; that's been a big thing. Prompt versioning: you change a bit of your prompt, and it can change the output of something that worked before, and it deviates. So benchmark questions have been a big thing.
At PromptOps we have a bunch of benchmark questions that we want to evaluate against. And another thing is that, like Sahar mentioned, you could have, say, GPT answer: hey, do you think this answer is similar to the answer that I expect? And we could even do, what I've done before, a scoring method where you have: hey, this is the question,
these are the keywords I expect in the output. It doesn't have to be the exact output, but the answer should have certain keywords, these are the significant keywords, and you check against those keywords. And the other one is you could do a quick semantic similarity check:
do you think this answer is similar to that answer? Have a score, and you could have a whole evaluation score weighted against these. And another thing to make sure of is that the benchmark questions you have should span multiple areas; they shouldn't focus on just one. And as you find new benchmark questions, just keep adding them in. You could keep running this maybe daily, or at some frequency, and see when it deviates.
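A toy version of that weighted keyword-plus-similarity score (the weights, the case schema, and the embed function are assumptions to adapt to your own benchmark):

```python
import numpy as np

def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of the significant keywords that actually appear in the answer."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer.lower())
    return hits / max(len(expected_keywords), 1)

def similarity_score(answer: str, reference: str, embed) -> float:
    """Cosine similarity between answer and reference embeddings, mapped to [0, 1]."""
    a, b = embed(answer), embed(reference)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return (cos + 1) / 2

def score_case(answer: str, case: dict, embed,
               w_keywords: float = 0.5, w_similarity: float = 0.5) -> float:
    # case = {"question": ..., "reference": ..., "keywords": [...]}
    return (w_keywords * keyword_score(answer, case["keywords"])
            + w_similarity * similarity_score(answer, case["reference"], embed))

# Run this over the whole benchmark daily (or on every prompt change) and alert when it drops.
```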
Actually, I'm curious, Asmitha, how do you track all those feedback requests and responses from users using something in production? By feedback request, do you mean, like, you found a new case where your LLM, mm-hmm, didn't behave as expected, and what do you do? Yeah. So I think at that point you have the question that deviated, and I would think about what the answer is that I expect at that point.
What did the user expect? And then I could add that to the benchmark, or I could even add, as I mentioned earlier, examples to the vector store database, add some examples there to improve the output at that point. And that's something that, you know, you need to have, like George said, this entire
feedback loop, and there's no single right way to do it, there's nothing standard right now, so it's gonna keep evolving. And are you finding that you're able to put all that feedback back into the process through prompting or through RLHF? Or are you also needing to apply heuristics, regexes, or whatever those may be, to,
you know, enforce constraints on the LLM? It's definitely a lot of, like you mentioned, regex matching. I think that's been a big thing, because it's not always going to be the case that the output follows the same format, right? I'm gonna prompt it to, but it can go some other way.
So it's making sure of, maybe, the JSON format, or the tags, things like that. I think OpenAI came up with the function calling that they released yesterday; that's a huge thing that the model's been fine-tuned on. So that's where it's heading.
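A small sketch of that kind of guardrail, which validates that the completion really is the JSON you asked for, falls back to a regex scrape, and retries otherwise (illustrative, not PromptOps' code):

```python
import json
import re

def parse_llm_json(raw: str) -> dict | None:
    """Try strict JSON first, then scrape the first {...} block out of surrounding prose."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                return None
    return None

def get_structured_answer(prompt: str, call_llm, retries: int = 2) -> dict:
    for _ in range(retries + 1):
        parsed = parse_llm_json(call_llm(prompt))
        if parsed is not None and "answer" in parsed:          # enforce the expected schema
            return parsed
        prompt += "\n\nReminder: reply with valid JSON only."  # nudge and retry
    return {"answer": "N/A", "sources": []}                    # safe fallback
```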
You know, regexes and heuristics are the dirty secret of AI. Yep, always has been. Exactly, exactly. So we've talked about chaining, blenders, pipelines, all these things. One of the challenges that they introduce is that these APIs aren't free, whether we're talking about OpenAI or inference calls in general. So there's an economic component that's particularly important for folks that are trying to field products.
Natalia, can you talk a little bit about how product builders should think about the economics of these kinds of models? A key thing to understand, which I think everyone on this panel does, is that the more explicit detail and examples you put into a prompt, of course, the better the model performance.
As Sahar talked about, that's one of the ways to assuage some of the hallucination issues, but it means your inference gets more expensive with those longer, more detailed prompts. So I think that's the first key thing to understand. You know, if you think about OpenAI, they charge for both input and output tokens.
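As a back-of-the-envelope illustration of that billing model (the prices below are placeholders; check your provider's current price sheet):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Rough per-request cost when input and output tokens are billed separately."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

# A 3,000-token prompt (long context plus few-shot examples) and a 500-token answer,
# at illustrative rates of $0.03 / 1K input tokens and $0.06 / 1K output tokens:
cost = estimate_cost(3000, 500, price_in_per_1k=0.03, price_out_per_1k=0.06)
print(f"${cost:.2f} per request")  # $0.12 at these made-up rates; multiply by daily volume
```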
Depending on the task, it adds up. But overall, prompt engineering is still pretty cheap, an easy way to experiment as opposed to building a whole ML product. So I'd encourage people to really use prompt engineering. The next thing I'd say is: set up your experiments carefully, because experiments are not free.
So be very thoughtful about how you set up your experiments. Make sure you're solving the right problems. Set clear expectations. I've worked in large organizations, which means a lot of alignment, a lot of buy-in, so you have to explain to the leadership team and partner teams what kind of expectations they should have for your experiments.
Minimize the risks, time-box your experiments, think about build versus buy, and have a clear data story. So I would say do all of those things, and experiment in order to figure out your costs. And then I would say the second part is something we already alluded to in the earlier conversation:
consider smaller models. Smaller models will be cheaper, and there will be trade-offs around accuracy, but that might be a really good way to minimize your costs. And yeah, those are my recommendations. I wanna hear what other people think. I was curious, from George's perspective as an investor: you're seeing a lot of these economic conversations; how are founders thinking about it and having those conversations with their investors?
Yeah, I'm glad you asked that. I think we're seeing, particularly in some of the companies that are, call it, earlier in their growth journeys, that they're the ones experimenting faster and trying things in a way where they're bringing products to market in literally weeks, right?
So I'll give you a good example. A mid-stage growth company, Honeycomb, is in the observability market, right? They build next-generation observability solutions, and they introduced a better query builder: instead of asking questions from a distributed tracing standpoint using a very esoteric query builder, they introduced a natural language overlay.
Now, it's very, very straightforward to introduce something like that using a finely tuned LLM, for introducing a product feature like that within weeks. And interestingly, that became, I would say, the most actively used feature inside of the Honeycomb product within two days of it launching. Wow.
And so that ability to just try something, experiment, iterate, launch, and move quickly is something we think is a pretty profound opportunity, particularly for any startup that's going through this journey. We're also, by the way, quickly finding that a number of larger, scaled-out incumbents have a pretty tremendous advantage here, which is the private data itself over time.
We don't think this is a moment where incumbents will just fall by the wayside because the generative AI movement is happening. It actually turns out they have one of the most important weapons in this sort of scale and capability shift that's gonna occur in this space, and that happens to be the private data itself, and capital.
Sahar, you wanted to jump in? Yeah, just adding on Natalia's point about how we make this make sense from a cost perspective: if you have longer context windows, fine-tuning is also quite helpful. We have found that you can fine-tune the model; it's a one-time cost you have to pay, but then future generations will have way fewer prompt tokens.
And then another meta point: usually, what I've seen is that we have two different types of companies, right? We have the enterprise companies, where so far I haven't heard from anyone working in this space, not only at Stripe but in general, that cost is an issue, given the impact it usually has on the business's bottom line. For startups, it can still be a blocker.
I would actually encourage startups to put this aside for now. Of course, there will be a path to this becoming cheaper and cheaper. If you look at the trend from the last 12 months or so, LLMs are becoming cheaper and more powerful. So I wouldn't let anyone be blocked by these profit margins, unless it makes absolutely no sense,
and then that's a different thing. Yeah, I totally agree there. Working at a startup, the cost isn't a huge concern right now, because the value to the customer and the expectations of the customer are increasing. They always expect, hey, I just wanna ask it in normal English, right?
That's always been the thing: I don't wanna remember this query language. That's one thing. And coming to Natalia's point about the tokens, you are charged for the input tokens and the output tokens, so add prompt engineering to constrain your output. For example, if your use case is a summarization use case, keep the length to a short extent, because we've seen that the completions can go on and on.
Mm-hmm. So constraining it, having the output formatting expectation, can reduce it, as well as semantic caching, if that makes sense for your use case. If you don't need the completion to happen every time, and a similar question has been asked before, just fetch that; you don't need to make OpenAI API calls every time.
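A bare-bones semantic cache along those lines (the similarity threshold and the embed function are assumptions; a production setup would use a real vector store rather than a linear scan):

```python
import numpy as np

class SemanticCache:
    """Reuse a previous completion when a new query is close enough to an old one."""

    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (query embedding, completion)

    def lookup(self, query: str) -> str | None:
        q = self.embed(query)
        for vec, completion in self.entries:
            cos = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
            if cos >= self.threshold:
                return completion  # cache hit: no API call needed
        return None

    def store(self, query: str, completion: str) -> None:
        self.entries.append((self.embed(query), completion))

def cached_completion(query: str, cache: SemanticCache, call_llm) -> str:
    hit = cache.lookup(query)
    if hit is not None:
        return hit
    completion = call_llm(query)
    cache.store(query, completion)
    return completion
```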
Yeah. Just one quick observation I wanna make: we're talking about costs, but in reality, enterprises are at the very beginning of the journey, where they're scared of the risks and are just wading in. And so once they get past that, I think costs might be a consideration.
But by that time, I think the cost will be so reduced. Every day we're seeing smaller and smaller models on the edge; I'm super excited about open source. I think, increasingly, the story on cost is changing very quickly. Costs are coming down very fast. Yeah. I think we're coming up on time, but I don't see anyone here to kick us off, so we'll ask another question while... I think we're doing a service until someone comes and says we're ready for the next session.
We did say we wanted to come back to model selection. Feel free to do, like, one more question. Okay. Okay, perfect. You know, throughout this we've talked about how there are new models emerging all the time. We just talked about how there are smaller models, bigger models, and there are advantages to both of those.
So, Sahar, talk a little bit about your experiences from a model selection perspective. When you are approached with a new product opportunity, how do you think about the right model to start with, and where do you start that conversation? Yeah. So we usually think about it in two ways, right?
We have the open source models and the commercial models. On the commercial side, and coming from more of an enterprise perspective, the number one factor for starting to experiment is: what do you have access to? Getting something off the ground and working with, say, Claude from Anthropic requires an agreement, the same way that OpenAI and Azure and all the other cloud providers do.
So usually, if I want to experiment, I will go with what's most available at this point. But then, when it comes to performance, it still all comes back to evaluation; that's why I'm so excited about this space. There is, like, an LLM leaderboard, and there is also the other leaderboard for vector databases and embeddings.
We'll also post those in the chat. So what we usually do is start with something like GPT-4, which is usually the most performant. If it works relatively well, we can always downgrade, because GPT-3.5 is cheaper, but more importantly, faster. So it depends on the use case, but I would advise folks to start with the most powerful model, see if it works really well, iterate on the prompts, and then potentially downgrade if it makes sense.
Then, when it comes to open source models, I feel like the more we build our own evaluation (and it's literally in the last four to eight weeks that this amazing progress has been happening), the easier it'll be for us to also incorporate these models. And lastly, I will say, companies like AWS launching SageMaker with Hugging Face and other LLMs, but also Bedrock and these kinds of generative AI capabilities, help us all experiment a lot faster.
So I would expect that probably in the next six months or so we will have a smarter engine slash layer that allows us to plug in different language models and experiment a lot faster, and then potentially use even more than one LLM for the same use case, based on our latency requirements, cost requirements, and accuracy metrics.
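One way that smarter routing layer could look, sketched as a tiny picker that chooses a model per use case from declared quality, latency, and cost constraints (the model names and numbers below are illustrative, not benchmarks):

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    quality: int          # rough 1-5 score from your own evaluation benchmark
    cost_per_1k: float    # blended $ per 1K tokens (made-up numbers)
    p50_latency_s: float  # typical end-to-end latency in seconds

MODELS = [
    ModelProfile("big-commercial-model", quality=5, cost_per_1k=0.06, p50_latency_s=6.0),
    ModelProfile("small-commercial-model", quality=4, cost_per_1k=0.002, p50_latency_s=1.5),
    ModelProfile("fine-tuned-open-source", quality=3, cost_per_1k=0.0005, p50_latency_s=0.8),
]

def pick_model(min_quality: int, max_latency_s: float, max_cost_per_1k: float) -> ModelProfile:
    candidates = [m for m in MODELS
                  if m.quality >= min_quality
                  and m.p50_latency_s <= max_latency_s
                  and m.cost_per_1k <= max_cost_per_1k]
    if not candidates:
        raise ValueError("No model satisfies these constraints; relax one of them.")
    return min(candidates, key=lambda m: m.cost_per_1k)  # cheapest acceptable option

# An interactive chat feature vs. an offline batch summarization job:
print(pick_model(min_quality=4, max_latency_s=2.0, max_cost_per_1k=0.01).name)
print(pick_model(min_quality=3, max_latency_s=30.0, max_cost_per_1k=0.001).name)
```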
Awesome. Awesome. Demetrios. What's up, everybody? I'm trying to do both stages at once, which you can imagine is a lot of fun, and I've been listening in on this and absolutely loved it. Thank you all for joining. I wanted to just come on here for this moment because I have to say, George, I've gotten to meet you in person.
It was incredible, you blew my mind, and I hope to meet the rest of you, Asmitha, Sahar, and Natalia, and of course Sam, in person. But I wanted to just wrap it up with this: if anyone is not listening to Sam's podcast, go and do that. He has been absolutely inspirational for me and this whole community, and so much help, that I had to drop in here and tell everyone that.
And 'cause I know you're too humble to say that yourself. Preach. So for the people that do not listen to TWIML, This Week in Machine Learning and AI, get out there, listen to it, we'll drop a link in the chat. And I know, I'm pretty sure, most of the panelists are huge fans of it, and I'm pretty sure you've probably got some fans out there listening right now.
So thank you all for doing this. This was awesome. Thanks, everyone. Thanks.