
LLMs and Beyond with Lepton // Yangqing Jia // DE4AI

Posted Sep 18, 2024 | Views 419
Yangqing Jia
Founder @ Lepton AI

Yangqing is the co-founder and CEO of Lepton AI, a startup helping enterprises run AI applications efficiently, at scale, and in minutes. Before this, Yangqing served as a VP at Alibaba overseeing Big Data and AI. He led AI infra at Facebook, and before that did research at Google Brain. He created and contributed to open-source libraries like Caffe, ONNX, and PyTorch. Yangqing graduated from UC Berkeley with a PhD in EECS.

SUMMARY

LLMs have become the de-facto standard in the modern AI toolchain, but they still come with a lot of confusion: quality, speed, cost, etc. In this talk, I will share a few observations we have about LLMs, from both an algorithm engineer's and an infra engineer's perspective, on how best to use LLMs in our daily operations. I'll also touch on how enterprises think about their IT and AI strategy, given that fast-changing computation patterns are disrupting the conventional cloud in unprecedented ways.

TRANSCRIPT

Adam Becker [00:00:08]: Next coming up is Yangqing. Yangqing, can you hear us?

Yangqing Jia [00:00:11]: Yes, I can hear you guys.

Adam Becker [00:00:12]: Okay. Yangqing, I'm stoked to hear about Lepton and what you've been up to. And you have your screen share. Let's see.

Yangqing Jia [00:00:20]: Okay, cool. Sounds great. Can you see the slides?

Adam Becker [00:00:24]: Yes. Okay, take it away. I'll be back soon.

Yangqing Jia [00:00:28]: Thanks. It's great to meet you guys. I'm not going to talk too much about our products; I feel it's a good time to talk in general about LLMs and the common experiences and findings we've gotten from running LLMs and applications ourselves. As someone of Chinese origin, I've heard quite a lot of interesting and sometimes scary stories about LLMs being AGI and superintelligence, things like that. I want to demystify that a little bit and share my thoughts about what LLMs are. We're all very familiar with typewriters, QWERTY and so on: 26 keys plus some. When I was a kid, I actually typed on Chinese typewriters, which is kind of weird, because they have thousands of keys.

Yangqing Jia [00:01:16]: You can't really learn to type on them with the typical typewriter setup. This is what a Chinese typewriter looks like. Under the hood it has all those keys as little pieces of foundry type, and a handle moves around a board of a couple thousand keys, finds the right one, and stamps it onto the scroll at the top. You can think of those keys as tokens. And one of the key challenges back in the day was how to type quickly. These are the little characters on the board, and the layout of that board determines whether you type fast or slow. Sounds like next-token prediction.

Yangqing Jia [00:01:59]: What we did back then was go to the libraries and start counting: which characters, which tokens, occur after which tokens, and then optimize the layout so that tokens that frequently occur together are grouped close by. If you look at the middle-bottom part, there's Zhongguo, which is China; Zhongyang zhengfu, which is the central government; and Zhonghua, which is the Chinese nation. They often come together, so you put all those characters close to each other, and when you hit one key you can move to the next key very quickly, so you type fast. History tells us that by carefully rearranging those keys, typing speed increased by something like 90%. This is exactly what LLMs do today. As a veteran in this AI world, I feel that when we think about AI being very intelligent, doing AGI and general intelligence, we should always remember that large models, at the very bottom, are just doing next-token prediction, similar to what those old typewriters did. The difference is that in the old days we only looked at the previous token, while today we can use GPUs and the like to compute over a really long context, using complex statistics and transformers to get much more intelligent token predictions.
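A minimal sketch in Python of the bigram idea described above; the corpus and tokens are illustrative, not the speaker's data. It counts how often each token follows each other token, then "predicts" the next token by picking the most frequent follower, which is exactly next-token prediction with a context of one token. A modern LLM replaces the lookup table with a learned function over a much longer context.

```python
from collections import Counter, defaultdict

# Count bigrams from a corpus: the same statistic the typesetters
# gathered by hand in the library, i.e. how often each token follows another.
def build_bigram_counts(tokens):
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

# "Next-token prediction" with a context of exactly one token:
# pick the most frequent follower of the previous token.
def predict_next(counts, prev):
    followers = counts.get(prev)
    return followers.most_common(1)[0][0] if followers else None

corpus = "the cat sat on the mat and the cat ran".split()
counts = build_bigram_counts(corpus)
print(predict_next(counts, "the"))  # -> 'cat', the most frequent follower

# The typewriter layout trick reads the same statistic the other way:
# the most frequent pairs are the ones to place closest together.
print(Counter(zip(corpus, corpus[1:])).most_common(1))
```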

Yangqing Jia [00:03:24]: That nature leads us to observe quite interesting LLM behaviors that can guide us on what can be done and what can't. LLMs can generate really reasonable responses, but they're still relatively silly, and hallucination is a very prominent issue. They run more slowly than conventional ML models and other systems, and there isn't really a standard way to serve them, even though the OpenAI-style API gives a somewhat false sense of interchangeability. So today, when you implement LLMs in your actual workloads, a lot of things have to be handled very carefully. One is how to reduce hallucination. Two is how to fit your prompts and expected runtime into the latency your application needs. And the third is that you want to incorporate a bunch of other knowledge, especially via RAG and the like.

Yangqing Jia [00:04:22]: That way LLMs can draw not only on the information baked in during training, but also on very fresh information, such as search results or your own enterprise data. One thing we definitely observe is that when the information is collected properly, and the AI is used properly, building applications becomes much easier than before. Lepton is an infrastructure company; we help application companies run AI algorithms, especially LLMs and other gen-AI algorithms, as easily as possible. But at some point we wanted to learn how to build an app ourselves, so we built a free app called Elmo Chat. What it does is a very interesting, simple job: we consume a lot of information through Chrome, and sometimes a website is really long and we don't want to spend ten minutes going through the whole page or watching a full YouTube video.

Yangqing Jia [00:05:20]: So we built this plugin that sits right next to your browser and does nothing until you need it. But if you want to read, say, a post on Reddit, you can call Elmo, and it extracts the information from the webpage, sends it to an LLM, and gives you summaries, opinions, and things like that. So imagine you're doing market research on documentation, or watching a really long YouTube video, like Jensen talking for an hour: you can get a very quick summary of the content, plus the key opinions and key takeaways. For YouTube videos, I think we also have timelines. For other things, such as keywords, you can use RAG; we call it Insight. It goes to Google, searches for related results, and gives you a summarization of what that is.
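A hypothetical sketch of that flow: pull the readable text out of a page, send it to an OpenAI-compatible chat endpoint, and get a summary back. The model name, prompt, and truncation limit below are illustrative assumptions, not Elmo's actual implementation.

```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

def summarize_page(url: str) -> str:
    # 1. Extract the readable text from the web page.
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

    # 2. Send it to the LLM with a summarization instruction.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable chat model works
        messages=[
            {"role": "system",
             "content": "Summarize this page: key points, opinions, takeaways."},
            # Crude truncation so the page fits the context window.
            {"role": "user", "content": text[:20000]},
        ],
    )
    return resp.choices[0].message.content

print(summarize_page("https://example.com/some-long-post"))
```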

Yangqing Jia [00:06:18]: Those kinds of things reduce your research time for a webpage or PDF from minutes down to a couple of seconds. In the old days it was really difficult to build an app like this, but we built this one within two days, and it came out really nicely. It was built by our head of frontend, who was the only frontend engineer on the team. He used the WizardLM model that we have an API for and built it in a very short period. We also tried multilingual support: we have a very nice gentleman living in Tokyo, and after a few quick iterations on some keywords in the prompts, we enabled multilingual support by just translating the secret sauce, which is basically the gist of what Elmo does, from English to Japanese.

Yangqing Jia [00:07:14]: And everything worked really well. We see quite a lot of similar application companies using prompt engineering and fine-tuning to achieve very similar results, reducing their application development time from months to days. Looking at those, we've summarized the practical guidance we find most useful when building an app today. Normally you start with the available LLM services; OpenAI and Anthropic are really nice closed-source ones. As a neutral company, we recommend people start with the closed-source models, which normally give the best performance, to see what state-of-the-art results look like. With that, you can then prepare your knowledge, either putting it in a knowledge base or doing whatever other conversion gets the information into a format that LLMs can understand.

Yangqing Jia [00:08:19]: I tend not to call it just RAG, but AG, augmented generation. The reason is that the information doesn't come only from retrieval or search. It can come, as in the Elmo case, from parsing the contents of the web page in Chrome; it can also come from taking a screenshot of your screen and doing OCR, and then sending that information to the LLM.
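A minimal sketch of that context-stuffing pattern, with hypothetical names: whatever the source of the fresh context (a retriever, a parsed web page, OCR of a screenshot), the "augmentation" is simply prepending it to the prompt before generation.

```python
# Hypothetical helper: the "augmentation" in augmented generation is just
# getting fresh context into the prompt, whatever its source.
def augment_prompt(question: str, fresh_context: str) -> list[dict]:
    return [
        {"role": "system",
         "content": "Answer using the provided context. "
                    "If the context is insufficient, say so."},
        {"role": "user",
         "content": f"Context:\n{fresh_context}\n\nQuestion: {question}"},
    ]

# The context could equally come from a retriever, a web-page parser,
# or an OCR pass over a screenshot; the generation step is identical.
messages = augment_prompt(
    "What does this page recommend?",
    fresh_context="...parsed page text, search results, or OCR output...",
)
```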

Yangqing Jia [00:08:58]: The loop kind of practice looks normally the best. And for Elmo, for example, we actually use ourselves as dog footers and basically getting these applications as quick as possible into people's hand and start basically doing sanity checks if it just makes sense, and then tracking qualitative measures so that we know how to improve the prompts, the contents, and things like that. The other one that's kind of relatively less known today, but is actually super important, is actually the quantitative evaluation that in the old days, the machine learning models used to do so sentiment analysis. Imagine if you're actually doing customer support. You might want to see from composition history if the user is actually happy or not. And those are classical quantitative results. And in those cases, I think we should basically just bring in the old school machine learning kind of knowledge and basically start basically doing like, say like cross validations and other kind of standardized ML ops things to make sure that we actually are getting these embedded into the rest of the logic as quick as possible. Normally when we actually find those approaches, data starts accumulating, and then you can basically either basically start building your own custom data or basically like on creating mode by going into vertical applications and improve the user experiences.

Yangqing Jia [00:10:20]: And we see people starting to use more and more fine-tuned, smaller models as the application becomes clearer and more verticalized. Given the time constraint, I just want to give a bit of a high-level overview. We feel that over the last year and a half, AI infra has become a new pillar that a lot of applications need, in addition to general web services and data services like Databricks and Snowflake. Who are we? We're a bunch of folks behind Caffe, PyTorch, ONNX, and cloud-native platforms. We help our clients build, starting from APIs, and eventually move to dedicated models through fine-tuning, deployments, and all the ops around them. In the crazy days of GPU shortages, we did a lot of work making the infra easier for our users, and recently we posted an H100 buying guide detailing everything you need to know about H100s and high-performance GPU operations.

Yangqing Jia [00:11:23]: So if you need LLM APIs, if you want to build your own models, and especially if you need GPUs with a top-notch cloud-native platform, feel free to reach out to us; we'll be more than happy to chat and help. I think that's ten minutes, so I'll stop here and see if there are any questions.

Adam Becker [00:11:45]: Nice. I don't think we have any questions quite yet, Yangqing, but I will tell you this much: the Chinese typewriter, this is brilliant. What a brilliant analogy.

Yangqing Jia [00:11:59]: Thanks.

Adam Becker [00:12:00]: I'm stunned by it. Do you have an idea of how they did it? Was there a person who just statistically evaluated the likely next word?

Yangqing Jia [00:12:12]: It was. It was crazy. Yeah, it was basically the old days of data collection and data cleaning. You go to the library, you sit there, and you keep a bigram count: you make tally marks, stroke by stroke, five strokes to a group, and just count.

Yangqing Jia [00:12:26]: There were no computers or anything like that. Nowadays you'd probably use Python, but back then it was just pencils.

Adam Becker [00:12:32]: Yeah, yeah. And I imagine the Latin typewriters never needed that sort of thing, simply because of the more phonetic nature of the language. Is that the idea?

Yangqing Jia [00:12:44]: Exactly, yeah. There was actually a very nice talk about this, and my thinking was inspired by it: if you go to YouTube and search for "the Chinese typewriter," there's a very nice gentleman talking about his research in comparative culture, in some sense. In the QWERTY world, we never thought about the difficulty of a "what you type is what you get" model.

Yangqing Jia [00:13:10]: You actually have to figure out alternative hardware, and software too. And he went on to talk about how, for non-phonetic languages, we solve these input/output interface problems. For me, the particularly interesting part is how much it resonates.

Adam Becker [00:13:32]: It's virtually one to one. Yangqing, thank you very much. Please stick around in the chat if folks have more questions. We appreciate you and your time, and best of luck with Lepton.
