MLOps Community

LLMs For the Rest of Us

Posted Jun 28, 2023 | Views 325
# LLM in Production
# Proprietary LLMs
# Runllm.com
# Redis.io
# Gantry.io
# Predibase.com
# Humanloop.com
# Anyscale.com
# Zillis.com
# Arize.com
# Nvidia.com
# TrueFoundry.com
# Premai.io
# Continual.ai
# Argilla.io
# Genesiscloud.com
# Rungalileo.io
SPEAKERS
Joseph Gonzalez
Professor, Co-Founder & VP of Product @ UC Berkeley, Aqueduct

Joseph is a Professor in the EECS department at UC Berkeley, a co-director and founding member of the UC Berkeley RISE Lab, and a member of the Berkeley AI Research (BAIR) group. His research interests span machine learning and data systems.

He is also the co-founder and VP of Product of Aqueduct, an open-source MLOps framework that enables teams to run ML tasks on commodity cloud infrastructure.

Vikram Sreekanti
CEO @ Aqueduct

Vikram Sreekanti is the co-founder and CEO of Aqueduct, which builds the first open-source control center for machine learning in the cloud. Vikram received his Ph.D. from the RISE Lab at UC Berkeley, where he worked on serverless computing and ML infrastructure.

SUMMARY

Proprietary LLMs are difficult for enterprises to adopt because of security and data privacy concerns. Open-source LLMs can circumvent many of these problems. While open LLMs are incredibly exciting, they're also a nightmare to deploy and operate in the cloud. Aqueduct enables you to run open LLMs in a few lines of vanilla Python on any cloud infrastructure that you use.

TRANSCRIPT

Link to slides

 Hello? Hi, everyone. Hi. How's it going? Good. Well thank you guys so much for joining us. Where are you joining us from? I'm from Berkeley, so joining from the East Bay, California. I'm in Oakland, so right by, right by Berkeley. Nice, nice. I'm in, uh, Washington, DC and we got flooded with the smoke this past week, but I feel like you guys are probably used to that.

Sadly. Yeah, sadly. We, we got lucky this year. This year wasn't, uh, wasn't too bad, but yeah, a couple years ago it was pretty, pretty painful. Yeah, man. Hopefully it is not the new norm, but who knows what the future holds. Um, well, on a brighter note, I will give you guys the floor. Um, and thanks so much for joining us.

Thank you. Thank you for having us. So thank you for, you know, joining in. Um, hi, I'm Joey Gonzalez. I'm joined here with Vikram Sreekanti. Um, and today I wanna talk about the stuff we're doing with LLMs and what it really means for the rest of us. So, gonna start with a bit of introduction. So, as I said, I'm Joey Gonzalez.

I'm a professor at UC Berkeley. I'm one of the co-directors of the UC Berkeley RISE and Sky Computing Labs. I'm a co-founder and VP of Product at Aqueduct. So I've been doing research for a long time at the intersection of machine learning and systems. I've worked on parts of Apache Spark, I built GraphX, did a lot of work in computer vision, natural language processing.

I did a lot of the early work in distributed training systems, then some of the seminal work in prediction serving systems. Um, and then I've also been doing a fair amount of work in autonomous driving and robotics. So I do a lot of work in the whole space of machine learning and machine learning systems.

And I'm joined by Vikram, who is one of my former students and now my boss and CEO at Aqueduct. Um, and Vikram's work has been studying how to, you know, simplify cloud infrastructure. Uh, and he's been building on cutting-edge serverless technology with applications in distributed computing, databases, and machine learning.

Uh, and at Aqueduct he's really focused on how to commercialize this cutting-edge research. So I come here wearing two hats. I don't have the hats on me right now, but my hat of the, you know, Berkeley perspective, where I get to see where LLMs are headed, kind of the cutting-edge research.

And then from my Aqueduct perspective, what it means for everyone thinking about kind of how these LLMs are being incorporated in products today. Um, and so in today's talk, I wanna kind of combine these two perspectives and say what they mean for the field. So starting from the Berkeley perspective: at Berkeley, I'm lucky to be part of a leading academic research center on generative AI.

We've built some of the core technology behind everything from ChatGPT to Stable Diffusion: the reinforcement learning, the algorithms, even the underlying systems are being developed by colleagues and my group at Berkeley. So there is a lot of innovation across the model architectures, evaluation techniques, inference optimization, distributed training techniques, and I'll try to highlight a few of those in my talk today.

One of the things that's important to keep in mind is that at Berkeley, our goal is to push boundaries on what's possible, to think about the next set of problems that people face, sort of the 1% problems at the frontier of technology. Well, Aqueduct is a little different. At Aqueduct, our goal is to really understand what is happening today: what's easy, what's hard, what's missing?

What do people need to do to make LLMs real now? Um, and this is really driven by real-world use cases: talking to ML practitioners, understanding their challenges, the limitations they face with basic machine learning, all the way to LLM technology. So our goal at Aqueduct is to enable every software team to be able to build LLM-powered applications.

So those are my two hats, you know, and I've been doing this for a while. Uh, and I have to say, for the first time these two trajectories are colliding. What we're doing in research, which is supposed to be the distant future, is affecting what happens kind of next week. Uh, and we're seeing this intersection, and LLMs have kind of supercharged this interaction.

So I wanna highlight that, and I'm gonna start with the perspective from Berkeley. So I said Berkeley is an innovation center for LLMs. So I'm gonna actually now just focus on some of the stuff in my group. Uh, so we're doing a lot in the design of systems, both for training and serving these models.

For a couple years now, we've been thinking about how to make it easier for anyone to train very large models using whole clusters of GPUs. We have a whole set of projects around compilers and optimization techniques as part of the bigger Alpa effort, and we've been transitioning that technology not just for training large models, but now fine-tuning and serving them.

We're also thinking a lot about how we use memory more intelligently to be able to take these models and put them on edge devices, train them on edge devices, and also, even in the cloud, make them, you know, 10x to 30x faster by being more intelligent about how we use memory. So building on these technologies, we've also been thinking a lot about what that means for the models themselves.

And my group launched an effort to build what is now, I guess today, one of the best open LLMs, the Vicuna model. Um, and as part of that effort, we started to rethink how we benchmark LLMs. And so I thought for today's talk, it'd actually be more fun to discuss what's happening on the right here, what's going on in the space of open LLMs and the kind of implications they have on, you know, where industry is headed.

So we have to go way back, to the beginning of 2023. Uh, the foundation model on which a lot of the recent open work in my group and others in academia is built is this LLaMA model. Uh, and this is a second iteration of one of the big models developed by Meta. Um, one of the innovations in this LLaMA model that really made it a turning point for research is that it's actually built on much better data than some of the earlier foundation models.

In fact, smaller models trained on better data result in significantly better performance than the much larger models that preceded it. Now, the problem with the LLaMA model is that it's not very good at instruction following. It just kind of finishes the sentence. And so I have some colleagues at another university in the South Bay.

Um, in fact, my former advisor among them, who have been playing around with how to take that LLaMA model and actually make it follow instructions, you know, chat like a human, like ChatGPT would, to be more human-aligned. And they developed these techniques using self-instruct to create an instruction dataset to fine-tune the LLaMA model, and they produced the Alpaca model.

Which was a pretty big step forward. It was an open model that we in the research community could use to start to evaluate these conversational AIs, but it was developed at Stanford, uh, and we're at Berkeley and we can do better. Uh, and we refer back to this underlying truth in the design of AI broadly, and, you know, critically of LLMs as well.

It's really all about the data, and there's a better source of data that we can build a better model on, and that source of data is ShareGPT. It's a website that aggregates people's entertaining, fun, hilarious, insightful conversations with AI technology like ChatGPT. Um, and this website is good because these are really high-quality conversations, multi-round conversations that people found engaging, engaging enough that they wanted to share with someone else.

So these are essentially 70,000 hand-annotated conversations we were able to pull from the website before the public APIs were turned off. Um, and so it's about 800 megabytes. It's not large by large language model standards, it's actually a tiny amount of data, but it's really high-quality data.

So with the Vicuna project, our first key innovation was to download this data and remove the HTML tags, which is kind of embarrassing, but the reality is that was enough to get a really high-quality dataset. We then basically followed what our colleagues in the South Bay had done at Stanford, um, to fine-tune a LLaMA model.
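As a rough illustration of that cleaning step, and not the actual Vicuna data pipeline, stripping HTML from scraped ShareGPT-style conversations can be as simple as the sketch below; the file names and JSON layout are hypothetical.

```python
# Toy sketch of the "remove the HTML tags" cleaning step described above.
# Not the actual Vicuna pipeline; file names and JSON layout are hypothetical.
import json
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def strip_html(raw_html: str) -> str:
    """Drop tags and collapse whitespace, keeping only the conversation text."""
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    return " ".join(text.split())

with open("sharegpt_raw.json") as f:          # hypothetical scraped dump
    conversations = json.load(f)

cleaned = [
    {"turns": [strip_html(turn) for turn in conv["turns"]]}
    for conv in conversations
]

with open("sharegpt_clean.json", "w") as f:
    json.dump(cleaned, f)
```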

Uh, and then we did something a little different. Because we had a model that we believed was better at instruction following, and most NLP benchmarks don't really capture that, we created a new benchmark using GPT-4 to evaluate these models. The idea here is that we would give this model and other models open-ended questions like: compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.

And something like Alpaca goes, "I've composed a travel blog post about my most recent trip to Hawaii," whereas Vicuna goes, "Aloha, fellow travelers! If you're looking for a tropical paradise with rich culture and breathtaking scenery, look no further than Hawaii." So we have these two different responses, and then we asked GPT-4 to be the judge and assess them on engagingness, insightfulness, and factfulness.

And we actually scored on multiple criteria and built a scoring system to rank each of these models. So we did this and it was pretty exciting. So here's our Vicuna model at 13 billion parameters. Uh, here's my colleagues' model from Stanford. So we were better than the Stanford model. Uh, we were comparable to a model being developed by Google, Bard.

Uh, and you know, the far-right model here is GPT-3.5, which is arguably one of the state-of-the-art models that's not GPT-4, and we calibrate that at a hundred percent. Uh, and so we are getting close to what GPT is able to produce. And this was an open effort done by a few students over a week, with a few hundred dollars to do a fine-tuning training run.
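For a concrete picture of the LLM-as-judge idea, here is a minimal sketch of pairwise GPT-4 scoring; it assumes the pre-1.0 OpenAI Python client, and the judging prompt is a simplified stand-in for the richer, multi-criteria prompts the actual Vicuna evaluation uses.

```python
# Minimal sketch of pairwise "GPT-4 as judge" scoring, in the spirit of the
# evaluation described above. The real benchmark uses richer prompts and
# multiple criteria; this is a simplified stand-in using openai<1.0.
import openai

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
rate each answer from 1-10 on helpfulness, relevance, and level of detail.
Respond with two numbers separated by a space, e.g. "8 6".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}"""

def judge(question: str, answer_a: str, answer_b: str) -> tuple[float, float]:
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(
                       question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    score_a, score_b = resp["choices"][0]["message"]["content"].split()[:2]
    return float(score_a), float(score_b)
```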

So, very exciting results for us. Somewhat of an existential crisis for some of our colleagues in Big Tech. You know, again: a few students, a couple hundred dollars, a 13-billion-parameter model, and it's pretty good. Colleagues at another big tech company, you know, spend 10 million on a 540-billion-parameter model over several years, and it's also about the same.

Um, this blog got a lot of press. Uh, I wanna first say that it's wrong. They're not the same. Those models can do a lot more, and the people and teams behind them will actually be able to make them do much more. We're already seeing that in newer incarnations and a much broader set of applications.

But in these basic open-ended tasks, uh, Vicuna was pretty comparable. So this is an exciting step forward. It was also a little concerning for us: like, how are we actually that good? Maybe our benchmark is wrong. Um, and so we decided to create a new benchmark, which would use open conversations with humans in the wild.

Uh, it's a living benchmark where you can go right now to arena.lmsys.org and talk to any pair of AIs chosen at random. And we're adding new AIs as quickly as we can, or as quickly as people will give us GPUs. Um, and then you ask them a question. You can ask them logic questions, coding questions, and then you evaluate their responses.

And this gives us a really rich signal about how people interpret these models in the wild. So we took these pairwise competitions, we call them battles, um, and we used the chess rating system, the Elo rating system, to then score each of the models based on all these pairwise battles.
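For reference, the Elo update from a single battle looks roughly like this; the K-factor and starting ratings below are generic illustrations, not necessarily the constants the leaderboard uses.

```python
# Generic Elo update for one pairwise "battle". The constants (K, starting
# ratings) are illustrative, not necessarily what the Chatbot Arena uses.
def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """outcome_a is 1.0 if model A won, 0.0 if it lost, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (outcome_a - expected_a)
    rating_b += k * ((1.0 - outcome_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: two models start at 1000; model A wins one battle.
ra, rb = elo_update(1000.0, 1000.0, outcome_a=1.0)
```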

Uh, and what's kind of neat is we see a similar result. GPT-4 ranks pretty high in that list. Uh, and then Claude, a new entrant, you know, ranks just below GPT-4 and just above GPT-3.5. And then here's our model, Vicuna. Our open model is actually doing pretty well. Here's Google's latest model, PaLM 2, uh, and it's performing comparably to Vicuna.

Um, and then here are kind of the open-source models, and down here is, you know, the Stanford model. Um, so it's a pretty cool leaderboard. Uh, there are some flaws in this approach, and we actually just released another paper that discusses some of those flaws. One thing to keep in mind is that PaLM, for example, will abstain on weird questions, whereas a lot of these models will give an answer, and humans like an answer, not an abstention.

Um, and so when we start to think about how we evaluate these, we need to have a richer benchmark. Um, and so we've just developed a new benchmark that goes much further than what we did in these earlier experiments, based on the data we collected with this arena. So what did we learn from this experience?

Uh, well, maybe the first signal, and I'll come back to this again, is that things are happening fast. Uh, I'd like to tell you that you should stop training LLMs, you should use our LLMs, uh, unless you're my grad students, in which case you should keep training LLMs. But, you know, the rest of the world should just use the LLMs that have already been created.

Unfortunately, we're starting to see some signal that you might still need to fine-tune LLMs, taking open-source models and extending them. Maybe what I think we're seeing more, and I'm gonna come back to one more project, is that the innovation is gonna move more and more from the LLMs themselves to how we use the LLMs, how they will drive the apps that we create.

And maybe the best example of this is a brand-new project from my group, which is thinking about how we use LLMs to solve a problem we have in my lab. And that is: grad students need access to GPUs, uh, and we are asking them to use multiple cloud providers and new startups, and they all have different APIs.

So we've been using LLMs to make it so you can say, and this is Gorilla, "go launch four VMs with eight A100s in US West 1." And then the idea is that we have a fine-tuned LLM that's designed to take these kinds of instructions. It goes and looks at documentation according to the question it was asked, and then uses that documentation to directly launch these services, to make it easier for humans to interact with the cloud.
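A toy illustration of that pattern, retrieve the relevant docs, ask the model for a call, review before executing, is sketched below; this is not Gorilla itself, and `search_docs` and `llm` are hypothetical stand-ins.

```python
# Toy illustration of the pattern behind Gorilla-style tools: retrieve the
# relevant API documentation for an instruction, ask an LLM to emit the call,
# and review it before running anything. Not Gorilla itself; `search_docs`
# and `llm` are hypothetical stand-ins.
def generate_cloud_command(instruction: str, search_docs, llm) -> str:
    docs = search_docs(instruction, top_k=3)   # e.g. cloud CLI reference pages
    prompt = (
        "Using only the API documentation below, write the single CLI command "
        f"that fulfills the request.\n\nDocs:\n{docs}\n\nRequest: {instruction}\n"
        "Command:"
    )
    command = llm(prompt).strip()
    return command  # show to a human (or a policy check) before executing

# generate_cloud_command("launch four VMs with eight A100s in US West 1", ...)
```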

This is a big step forward for us, 'cause now we can use more compute. Uh, and this generated a lot of interest in the public. In fact, before we even had a chance to promote the paper, it landed on arXiv and it was getting thousands of stars, and it was at the top of Hacker News just last night.

Uh, so it's a project that's pretty exciting, and it's really maybe pointing to where the world is heading, and it's, you know, heading away from just building these LLMs to really using them to solve interesting problems. So what does this really mean for the rest of us? So I can't stress this enough: things are happening fast.

What's happening in academia is transforming what's happening in industry, and it means that we need to start to react more quickly. Uh, and my colleague Vikram here has been thinking a lot about what it takes to enable people to react more quickly. So I'm gonna hand it over to him. Thanks, Joey. So what I wanna spend the rest of the talk focused on is how Joey and I have been thinking about all this innovation that's happening on the left side of the screen here, the new models, the new tools that are coming out, and how that innovation actually translates into LLM-powered applications.

For those of us who don't have the opportunity to work at, you know, one of the top research institutions. We've been talking with a lot of folks who've been building these applications, and trying to do some of this ourselves, and as we've been doing this, we found that this translation has something of a missing link.

This missing link is the tooling that lives around how you actually take a model, whether it's an open-source model or a hosted one, package it up in a way that will connect back to an application, and make sure that the application does what you would expect. This is really one of, if not the, next critical challenge that we are focused on here: how we can take these LLMs and build real applications around them.

And as Joey said, this space has been evolving really fast. And the good news is that there's a ton of tools, especially open-source tools, that have been popping up in the last five or six months that help us make this transition more easily. So as we've been thinking about this missing link, we've started to notice some patterns in the pitfalls, the traps that people fall into as they're building these applications.

The first one, the most obvious one, is probably building or running your own model. Um, Emmanuel, who's another speaker at the conference here, had this tweet the other day that I really appreciated: his old advice used to be to always start with the simplest model, but now his new advice is to start with the largest model

you can easily try. And those last two words are doing, I think, a lot of work here. The models that you can easily try are most often the hosted models. Of course, in healthcare, finance, or some other more sensitive industries, you're not necessarily going to be able to use these models in production.

But to get started, to prove that you can build an application that, again, takes the innovation happening in research and applies it to an industry application, the place to start is probably by picking a hosted model, maybe using Hugging Face as an API, for example.

Once you have a model, the next pitfall is really around data: not using any data, trying to construct a very clever prompt that encodes a bunch of information but turns out to be very brittle. Or, on the flip side, using the wrong data, oftentimes stuffing all of your data into a prompt, which can both be very expensive and can also lead to some kind of jumbled-up answers at the end of it.

The solution here is, perhaps obviously to many of you, to pick up one of the many very powerful vector databases that have popped up recently, and to use them to retrieve the right data at the right time, contextualize your prompts, and get better answers. Once you have the code and the data, the next challenge comes around how you actually build these applications:

chaining together prompts, experimenting with different models, different techniques. Um, and at this point you all are probably yelling the answer at your screens. Um, I think everyone here is probably familiar with tools like LangChain and LlamaIndex that allow you to chain together multiple prompts, try out different models, different techniques, plug things in, and experiment very quickly.
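As a concrete picture of the retrieve-then-prompt pattern those tools support, here is a minimal sketch; a toy in-memory index stands in for a real vector database, and the `embed` and `llm` callables are hypothetical placeholders.

```python
# Minimal retrieve-then-prompt sketch: embed documents, pull the most relevant
# ones for a question, and stuff only those into the prompt. The `embed` and
# `llm` calls are placeholders; a real app would use a vector database instead
# of this in-memory list.
import numpy as np

def retrieve(question: str, docs: list[str], embed, top_k: int = 3) -> list[str]:
    q = embed(question)
    scores = [float(np.dot(q, embed(d))) for d in docs]  # cosine if vectors are normalized
    best = np.argsort(scores)[::-1][:top_k]
    return [docs[i] for i in best]

def answer(question: str, docs: list[str], embed, llm) -> str:
    context = "\n\n".join(retrieve(question, docs, embed))
    prompt = (f"Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return llm(prompt)
```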

Now, all of this is probably very familiar to most of the folks here, but something that we found in our conversations is that the space of people who are interested in LLMs is growing really, really fast. And there are always new engineers, new people coming in, trying to build these applications and running into many of the same challenges and not necessarily knowing where to start with the tooling.

But even for kind of the initiated, the folks who know a lot about LLMs, we started to notice some trends in the kind of pitfalls that these folks are running into. And one that's perhaps a little counterintuitive is relying on the model to do everything that you need.

These models are obviously very powerful. They can process very complex pieces of information and generate really interesting answers. But at the end of the day, to use these models, just like with any machine learning model, you're going to want to take some code that retrieves some data from an API, cleans it, featurizes it, passes it into the model, takes the result, validates it: just little bits of code that live around the model

itself. And at least today, most of the time, this just looks like Python functions that live around your model invocations. Now, at this point, we can put a box around this, and we can call this roughly the ideal LLM stack. The tools here allow you to build a prototype, put together an application that proves the point on real data with LLM applications.
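That glue often ends up looking something like the sketch below; `fetch_ticket`, `llm`, and the validation rule are hypothetical placeholders for whatever a given application needs.

```python
# Sketch of the "little bits of code around the model": fetch data, build the
# prompt, call the model, validate the output. `fetch_ticket`, `llm`, and the
# validation rule are hypothetical placeholders.
import json

def summarize_ticket(ticket_id: str, fetch_ticket, llm) -> dict:
    ticket = fetch_ticket(ticket_id)              # pre-processing: get raw data
    prompt = (
        "Summarize this support ticket as JSON with keys "
        f"'summary' and 'priority' (low/medium/high):\n\n{ticket['text']}"
    )
    raw = llm(prompt)                             # model invocation
    result = json.loads(raw)                      # post-processing: parse
    if result.get("priority") not in {"low", "medium", "high"}:
        raise ValueError(f"model returned invalid priority: {result!r}")
    return result
```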

But the next pitfall, maybe the most painful one here, is actually forgetting about everything else that comes after this. What do I mean by everything else? Well, the list is probably pretty long, uh, but I have a few examples here. The first one is cloud infrastructure management: thinking about where all of these systems run. Where does your code run?

How does it link up with the rest of the application that you have? Essentially connecting the wires so that this LLM-powered application you've built can connect back to your team's application or your company's product. Once you have it running, you have to think about the data that's coming into it, not just how you process the data and put it into a vector database, but how you get access to the right systems at the right time, have the right access controls in place so that you're not leaking data unintentionally, and kind of just have general governance around what the model is doing.

After that, you know, your application's running, and you want to make sure that you're tracking what's happening, the inputs and the outputs, so that in case something goes wrong or you get a complaint, you actually have the telemetry and the data that you need to debug. And then, maybe most painfully, at least for me, there are things like budgeting and cost management, where, you know, you can run up very large bills with many of these hosted APIs, or even running your own open-source model, very quickly.

And so having a sense of how much money you're spending, what's actually happening under the hood, and whether you're staying within bounds. Like I said, this list is not meant to be comprehensive. There's probably a ton of other things, supporting pieces of functionality and infrastructure, that are required to make these applications real.

But these are some of the things that Joey and I have been thinking about and have heard from the folks that we've been talking to in this space. And so that's really influenced our thinking, and looking ahead, we're really focused on what's required to enable that next generation of applications.

We only have a couple minutes left, so we'll leave you with a couple of hypotheses that we've been forming that have influenced how we're approaching this problem. The first one, which is probably uncontroversial to everyone here, is that the pace of innovation is incredible. At every layer of this stack, the models themselves, the supporting infrastructure like the databases, the Python libraries, the number of incredibly smart people who are focused on every one of these problems is a little mind-boggling, and there's going to constantly be new stuff coming out at every layer of this stack.

As a result of that, the cutting-edge technology here is being adopted faster than ever by top teams. Joey mentioned this earlier, about how the gap between what's happening in research and academia and what's happening in industry is smaller than ever. Um, and that plays out when you see that people are taking models like Vicuna that are just a couple months old, or even when they were a couple days old, and really starting to experiment and build applications.

And combining both of those, the last hypothesis is that teams are going to need this infrastructure, all of these surrounding components, these boxes in blue, and they're gonna pick the tools that allow them to move fast and keep up with the times. We don't really have the space for opinionated infrastructure anymore.

You can't really be in a place where you can say, hey, I support this tool but not that tool. There's gonna be so much going on in this space that we're gonna need to keep up, both from an infrastructure and from an application perspective. So the goal here is really to start a conversation around some of these hypotheses and what this infrastructure looks like.

If you are interested in the applications here, or you're building some yourselves, we would absolutely love to connect. Please reach out through any of the communication media here, and thank you all for taking the time to listen.

I feel like because it's not an in-person conference, you don't get the massive applause, so it's just me, here I am. Um, I hope it counts just as much. Thank you. Um, we had one question, actually, from Jay. Jay, thanks so much, and also, Jay, you should message me after, 'cause we'll send you some swag. Jay asked how to reduce the latency associated with hosted models, and can you suggest strategies to increase the context length of LLMs?

Yeah, I can talk a lot about that. So a lot of our research in prediction serving has been around how to reduce latency. Um, it's often an interesting trade-off. Uh, we've been looking at how to play with different memory bandwidth tricks to improve throughput and allow for better batching.

Often, it doesn't actually decrease latency, but it gets better hardware utilization. Um, we've also been looking at a lot of work in quantization to get better memory bandwidth, which could translate to reductions in latency.
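As one concrete example of the quantization route, loading an open checkpoint in 8-bit with Hugging Face Transformers plus bitsandbytes looks roughly like this; the model name is just an example, and the exact flags vary across library versions.

```python
# Rough example of 8-bit quantized loading with transformers + bitsandbytes
# (pip install transformers accelerate bitsandbytes). The model name is just
# an example, and the exact flags vary across library versions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-13b-v1.3"   # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,     # weight quantization cuts memory traffic
    device_map="auto",     # spread layers across available GPUs
)

inputs = tokenizer("Aloha, fellow travelers!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```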

Um, on the topic of context, there's some really cool new research from, again, colleagues at Stanford and other places about how to deal with these larger contexts. Um, it's still pretty new, and it's unclear if it's gonna translate to, you know, models that could be used in practice. Um, what I've seen as a better solution for people thinking about context

is to really focus on retrieval methods and being more intelligent about what context you provide, um, in part because models get distracted. If you stuff too much in the context, that's often actually a bad sign. That's awesome. And maybe we can do one more, from Aaron Hoffman: do you see any languages other than Python becoming relevant in the near future?

That's a great question, and I think the short answer is yes, absolutely. Uh, one of the cool things about just the explosion of interest in LLMs is that I think it's attracted a much broader audience of people than were previously interested in, you know, the old world of machine learning. And if you just kind of Google around or go on GitHub, you'll see that people have built equivalents of tools like LangChain in languages like Go or Rust, or even, you know, LangChain itself I think has JavaScript/TypeScript bindings. And I think we're gonna see a lot of this stuff

really expand in scope. We're gonna start with Python because machine learning has historically been done in Python, at least for the last 10 years. But I think absolutely you're gonna want to see bindings in other languages, really allowing people to kind of more deeply connect LLMs back to the application stack that they're building.

Definitely. Awesome. Thank you guys so much. Well, definitely head over to the chat. I think people might have some other questions, but unfortunately we are gonna have to kick you out for the next folks on the agenda. Thank you so much. Thanks for having us.
