Join us for two days of conversations with some of our favorite people at the forefront of using LLMs in the wild, plus an in-person workshop in San Francisco, hosted by Anyscale, on how to build and deploy LLM-based apps.
There will be over 50 speakers from Stripe, Meta, Canva, Databricks, Anthropic, Cohere, Redis, LangChain, Chroma, Humanloop, and many more.
This all started after we put together the LLM in Production survey and realized there are still lots of unknowns when working with LLMs, especially at scale. We open-sourced all the responses, and we decided that if no one else was going to talk about working with LLMs in a non-over-hyped way, we would have to.
Let's discover how to use these damn probabilistic models in the best ways possible without sacrificing the necessary software design building blocks.
Expect all the fun and learnings from the first one. DOUBLED.
And remember, there will be some sweeeet sweet swag giveaways.
Huge shoutout to all the sponsors of this event; find more info about them below.
Join us in San Francisco for this LLM-based applications workshop, hosted by Anyscale, where you'll use libraries like Ray, HuggingFace, and LangChain to build LLM-based applications on top of open-source code, models, and data. You'll learn about scaling, fine-tuning, and inference for LLMs, along with their trade-offs, and how to use embedding models and vector stores. This is a great opportunity to learn how modern deployment tools can run your application online and continually improve it.
Learn more here: https://home.mlops.community/public/events/ray-workshop-2023-06-15
Plus a little tl;dr summary of the LLM in Production survey report.
Large language models are fluent text generators, but they often make errors, which makes them difficult to deploy in high-stakes applications. Using them in more complicated pipelines, such as retrieval pipelines or agents, exacerbates the problem. In this talk, Matei will cover emerging techniques in the field of “LLMOps” — how to build, tune and maintain LLM-based applications with high quality. The simplest tools are ones to test and visualize LLM results, some of which are now being incorporated into MLOps frameworks like MLflow. However, there are also rich techniques emerging to “program” LLM pipelines and control LLMs’ outputs to achieve desired goals.
Matei will discuss Demonstrate-Search-Predict (DSP) from his group as an example programming framework that can automatically improve an LLM-based application based on feedback, along with other open-source tools for controlling outputs and generating better training and evaluation data for LLMs. This talk is based on Databricks' experience deploying LLMs in many applications, including the QA bot on their public website, internal QA bots, code assistants, and others, all of which are making their way into their MLOps products and MLflow.
What do we need to be aware of when building for production? In this talk, we will explore the key challenges that arise when taking an LLM to production.
The journey from LLM PoCs to production deployment is fraught with unique challenges, from maintaining model reliability to effectively managing costs. In this talk, we delve deep into these complexities, outlining design patterns for successful LLM production, the role of vector databases, strategies to enhance reliability, and cost-effective methodologies.
Language models are very complex, which introduces several challenges for interpretability. The large amounts of data required to train these black-box language models make it even harder to understand why a language model generates a particular output. In the past, transformer models were typically evaluated using perplexity, BLEU score, or human evaluation.
However, LLMs amplify the problem even further: their generative nature makes them more susceptible to hallucinations and factual inaccuracies. Evaluation therefore becomes an important concern.
Building a chatbot is not easy... Or is it? We need:
An embedding model that translates questions into vectors, a vector database to search, and an LLM to generate the answers.
We can orchestrate the job using LangChain with minimal development.
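As a rough sketch of how those three pieces fit together (a minimal example assuming the 2023-era LangChain API, OpenAI credentials, and FAISS installed; the documents and model choice are illustrative):

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Illustrative documents; in practice these come from your knowledge base.
docs = [
    "Our return policy lasts 30 days from the date of delivery.",
    "Support is available 24/7 via in-app chat.",
]

# 1) Embed the documents and index them in a vector store.
vectorstore = FAISS.from_texts(docs, OpenAIEmbeddings())

# 2) Wire a retriever and an LLM into a question-answering chain.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
)

# 3) Ask a question; the chain retrieves context and generates an answer.
print(qa.run("How long do I have to return an item?"))
```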
Using Wardley Maps, we can understand value chains and map out the landscape, then use that understanding to develop strategies and decide where to target our efforts.
Take a moment to randomly match with others in this event by participating in the networking sessions. To access the random introductions click on the match tab in the left sidebar.
Bring your prompts to the chat cause we will be improvising songs from the audience's suggestions!
Proprietary LLMs are difficult for enterprises to adopt because of security and data privacy concerns. Open-source LLMs can circumvent many of these problems. While open LLMs are incredibly exciting, they're also a nightmare to deploy and operate in the cloud. Aqueduct enables you to run open LLMs in a few lines of vanilla Python on any cloud infrastructure that you use.
In the last LLM in Production event, I spoke on some of the ways we've seen people use a vector database for large language models. This included use cases like information/context retrieval, conversational memory for chatbots, and semantic caching.
These are great and make for flashy demos; however, using them in production isn't trivial. Oftentimes, the less flashy side of these use cases presents huge challenges, such as: Advice on prompts? How do I chunk up text? What if I need HIPAA compliance? On-premise? What if I change my embeddings model? What index type? How do I do A/B tests? Which cloud platform or model API should I use? Deployment strategies? How can I inject features from my feature platform? LangChain or LlamaIndex or RelevanceAI???
This talk distills more than a year of deploying Redis for these customer use cases into 20 minutes.
This workshop focuses on the crucial task of constructing and managing datasets specifically designed for reinforcement learning from human feedback (RLHF) and large language model (LLM) fine-tuning. We will explore the utilization of Argilla, an open-source data platform that facilitates the integration of human and machine feedback. Participants will learn effective strategies for dataset construction, including techniques for data curation and annotation. The workshop aims to equip attendees with the necessary knowledge and skills to enhance the performance and adaptability of RLHF and LLM models through the use of Argilla's powerful data management capabilities.
As Foundation Models (FMs) continue to grow in size, innovations continue to push the boundaries of what these models can do on language and image tasks. This talk will describe our work on applying foundation models to structured data-wrangling tasks like data cleaning and integration. We will present our results to evaluate FMs' out-of-the-box capabilities on these tasks, as well as discuss challenges and solutions that these models present for production deployment.
There are key areas we must be aware of when working with LLMs. High costs and low latency requirements are just the tip of the iceberg. In this panel we will hear about common pitfalls and challenges we must keep in mind when building on top of LLMs.
How do you use a Large Language Model (LLM) to create memes? We'll discuss ImgFlip's unique dataset, the selection and fine-tuning of a commercially usable LLM, and the associated challenges. Of course, we'll also demonstrate the model prototype itself. We will also discuss the challenges we anticipate in productionizing an LLM used by millions of users.
Autonomous AI agents have gotten a lot of attention recently, but they're mostly just toys. What are the primitives that we need to build more reliable agents, and what are the main business use cases that agentic automation will enable over the next few years?
It’s silly to think of training and using large LANGUAGE models without any input from the study of language itself. Linguistics is not the only field of knowledge that can improve LLMs, since they sit at the intersection of several fields; however, it can help us not only improve current model performance but also see clearly where future improvements will come from.
Humanloop has now seen hundreds of companies go on the journey from playground to production. In this talk, we'll share case studies of what has and hasn't worked: the common pitfalls, emerging best practices, and suggestions for how to plan in such a quickly evolving space.
This session provides an overview of the evolving landscape of Generative AI, with a focus on the latest trends and technologies that shape this field. Designed with startups in mind, the talk offers practical insights on how to adapt and leverage these advancements to enhance their products. Attendees will acquire valuable knowledge to navigate the dynamic landscape of Generative AI, enabling them to stay up-to-date and harness untapped potential for the success of their startups.
Here’s the truth: troubleshooting models based on unstructured data is notoriously difficult. The measures typically used for drift in tabular data do not extend to unstructured data. The general challenge with measuring unstructured data drift is that you need to understand the change in relationships inside the unstructured data itself. In short, you need to understand the data in a deeper way before you can understand drift and performance degradation.
In this presentation, Claire Long will present findings from research on ways to measure vector/embedding drift for image and language models. With lessons learned from testing different approaches (including Euclidean and Cosine distance) across billions of streams and use cases, she will dive into how to detect whether two unstructured language datasets are different — and, if so, how to understand that difference using techniques such as UMAP.
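As a rough illustration of the kind of measurement involved (a minimal numpy sketch under simplifying assumptions, not the method presented in the talk), one way to start is comparing the centroids of a reference embedding set and a production window:

```python
import numpy as np

def centroid_drift(reference: np.ndarray, production: np.ndarray):
    """Compare two embedding sets (n_samples x dim) via their centroids."""
    ref_c, prod_c = reference.mean(axis=0), production.mean(axis=0)
    euclidean = np.linalg.norm(ref_c - prod_c)
    cosine = 1.0 - np.dot(ref_c, prod_c) / (
        np.linalg.norm(ref_c) * np.linalg.norm(prod_c)
    )
    return euclidean, cosine

# Illustrative data: a baseline set vs. a deliberately shifted production window.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(1000, 384))
production = rng.normal(0.3, 1.0, size=(1000, 384))
print(centroid_drift(reference, production))
```

Centroid distance is only one coarse signal; visualizing both sets with something like UMAP, as the talk describes, helps explain what actually changed.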
Put down the screen for a moment, close your eyes, and bliss out in between the sessions.
Take a moment to randomly match with others in this event by participating in the networking sessions. To access the random introductions click on the match tab in the left sidebar.
While we've seen great progress on Open Source LLMs, we haven't seen the same level of progress on systems to serve those LLMs in production contexts. In this presentation, I work through some of the challenges of taking open source models and serving them in production.
The rapid adoption of large language models (LLMs) is transforming how businesses communicate, learn, and work, making AI safety and security a priority. This captivating and insightful talk will delve into the challenges and risks associated with LLM adoption and unveil AIShield.GuArdIan, a game-changing technology that enables businesses to leverage ChatGPT-like AI without compromising compliance. AIShield.GuArdIan's unique approach ensures legal, policy, ethical, role-based, and usage-based compliance, allowing companies to harness the power of LLMs safely. Join us on this riveting journey as we reshape the future of AI, empowering industries to unlock the full potential of LLMs securely and responsibly. Don't miss this opportunity to be at the forefront of responsible AI usage: reserve your seat today and take the first step towards a secure AI-powered future!
Access to foundational models is at every developer’s fingertips through commercial solutions or in the open source. However, these models are not competent enough to perform specialized tasks. Differentiation becomes more challenging in this world. We’ll walk you through how you can develop custom models using fine-tuning and data-driven techniques such as self-refinement to create differentiated AI products that solve problems that were previously unattainable.
Large Language Models are an especially exciting opportunity for Operations: they excel at answering questions, completing sentences, and summarizing text while requiring ~100x less training data than the previous generation of models.
In this talk, Sophie will discuss lessons learned productionising Stripe's first application of large language models: providing answers to user questions for Stripe Support.
Large Language Models require a new set of tools... or do they? K8s is a beast and we like it that way. How can we best leverage all the battle-hardened tech that k8s has to offer to make sure our LLMs go brrrrrrr? Let's talk about it in this chat.
Document Question-Answering is a popular LLM use-case. LangChain makes it easy to assemble LLM components (e.g., models and retrievers) into chains that support question-answering. But it is not always obvious how to (1) evaluate the answer quality and (2) use that evaluation to guide improved QA chain settings (e.g., chunk size, retrieved docs count) or components (e.g., model or retriever choice). We recently released an open-source, hosted app to address these limitations (see blog post here). We have used it to compare the performance of various retrieval methods, including Anthropic's 100k context length model (blog post here). This talk will discuss our results and future plans.
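As a rough, library-agnostic illustration of that kind of evaluation (not the released app itself), one option is to have an LLM grade each predicted answer against a reference; the `call_llm` helper below is a hypothetical stand-in for whatever model you use as the grader:

```python
GRADING_PROMPT = """You are grading a question-answering system.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Reply with GRADE: CORRECT or GRADE: INCORRECT, then a one-line reason."""

def grade(call_llm, question: str, reference: str, prediction: str) -> bool:
    """Return True if the grader LLM judges the prediction correct."""
    verdict = call_llm(GRADING_PROMPT.format(
        question=question, reference=reference, prediction=prediction))
    return "GRADE: CORRECT" in verdict.upper()

# Sweep chain settings (chunk size, k, retriever, model) over a fixed eval set
# and compare the fraction of answers graded correct for each configuration.
```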
Retrieval augmented generation with embeddings and LLMs has become an important workflow for AI applications.
While embedding-based retrieval is very powerful for applications like 'chat with my documents', users and developers should be aware of key limitations, and techniques to mitigate them.
The impressive reasoning abilities of LLMs can be an attractive proposition for many businesses, but using foundational models through APIs can be slow, with unpredictable latency. Self-hosting models can be an attractive alternative, but how do you choose which model to use, and if you have a latency or inference budget, how do you make the model fit? We will discuss how pseudo-labeling, knowledge distillation, pruning, and quantization can ensure the highest efficiency possible.
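As one concrete example of those techniques, here is a minimal PyTorch sketch of the classic knowledge-distillation objective, where a small student model is trained to match a larger teacher's softened outputs alongside the usual hard labels (the temperature and mixing weight are illustrative assumptions, not recommendations from the talk):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend soft-target KL (teacher -> student) with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```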
You think you've got prompting skills? Been reading too many Reddit threads and thinking you can crack the code? Well, let's see what you are capable of!
Come join us on track one for some fun games and great swag giveaways.
Generalized models solve general problems. The real value comes from training a large language model (LLM) on your own data and fine-tuning it to deliver on your specific ML task.
Now you can build your own custom LLM, trained on your data and fine-tuned for your generative or predictive task, in ten lines of code with Predibase and Ludwig, the low-code deep learning framework developed and open-sourced by Uber, now maintained as part of the Linux Foundation. Using Ludwig's declarative approach to model customization, you can take a pre-trained large language model like LLaMA and tune it to output data specific to your organization, with outputs conforming to an exact schema. This makes building LLMs fast, easy, and economical.
In this session, Travis Addair, CTO of Predibase and co-maintainer of open-source Ludwig, will share how LLMs can be tailored to solve specific tasks from classification to content generation, and how you can get started building a custom LLM in just a few lines of code.
You can’t build robust systems with inconsistent, unstructured text output from LLMs. Moreover, LLM integrations scare corporate lawyers, finance departments, and security professionals due to hallucinations, cost, lack of compliance (e.g., HIPAA), leaked IP/PII, and “injection” vulnerabilities. This talk will cover some practical methodologies for getting consistent, structured output from compliant AI systems. These systems, driven by open access models and various kinds of LLM wrappers, can help you delight customers AND navigate the increasing restrictions on "GPT" models.
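By way of illustration, here is a hedged, library-agnostic sketch of one common pattern for consistent structured output: request JSON, validate it, and retry with the validation error appended. The `call_llm` helper and the required keys are hypothetical placeholders, not a specific product's API:

```python
import json

REQUIRED_KEYS = {"sentiment", "confidence"}  # hypothetical schema

def structured_call(call_llm, prompt: str, max_retries: int = 3) -> dict:
    """Keep asking until the model returns JSON with the required keys."""
    attempt = prompt
    for _ in range(max_retries):
        raw = call_llm(attempt)
        try:
            parsed = json.loads(raw)
            if REQUIRED_KEYS.issubset(parsed):
                return parsed
            error = f"missing keys: {REQUIRED_KEYS - parsed.keys()}"
        except json.JSONDecodeError as exc:
            error = f"invalid JSON: {exc}"
        attempt = (f"{prompt}\n\nYour previous reply failed validation "
                   f"({error}). Reply with valid JSON only.")
    raise ValueError("model never produced valid structured output")
```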
Copilots embedded within SaaS applications have become one of the dominant ways of leveraging LLMs within products. In this lightning talk, I’ll review some of the dominant UI paradigms and features, general design patterns and system architectures, and top challenges and future frontiers of production copilot systems.
Large Language Models (LLMs) have shown remarkable capabilities in domains such as question-answering and information recall, but every so often, they just make stuff up. In this talk, we'll take a look at “LLM Hallucinations" and explore strategies to keep LLMs grounded and reliable in real-world applications.
We’ll start by walking through an example implementation of an "LLM-powered Support Center" to illustrate the problems caused by hallucinations. Next, I'll demonstrate how leveraging a searchable knowledge base can ensure that the assistant delivers trustworthy responses. We’ll wrap up by exploring the scalability of this approach and its potential impact on the future of AI-driven applications.
Many researchers have recently proposed different approaches to building recommender systems using LLMs. These methods convert different recommendation tasks into either language understanding or language generation templates. This talk highlights some of the recent work done on this theme.
This talk covers the lessons my team has learned building Code Suggestions, with reference to the model, ML infra, evaluation, compute, and cost.
It's gonna be special! We promise!
How will we teach large models to behave in organisations at scale? We'll be discussing both the technical and the user-experience challenges of hundreds of humans influencing one agent. Who must it listen to? How must new learnings be represented? How can we make labeling LLMs an ongoing collaboration among the people using them?
What do MLOps and LLMOps have in common? What has changed? Are these just new buzzwords, or is there validity in calling this ops something new?
Writing art prompts can be challenging, and that's why LLMs are the best prompters for AI art. In this talk, we will explore how LLMs make fantastic prompt artists, capable of constructing very expressive art prompts that lead to striking works of art across use cases.
Take a moment to randomly match with others in this event by participating in the networking sessions. To access the random introductions click on the match tab in the left sidebar.
Get out your family game night skills
For two whole years working with a large LLM deployment, I always felt uncomfortable. How is my system performing? Do my users like the outputs? Who needs help? Probabilistic systems can make this really hard to understand. In this talk, we'll discuss practical, implementable steps to secure your LLM system and gain confidence while deploying to production.
This talk describes how we think about collecting RLHF data at Surge. We highlight the risks of collecting low-quality data for RLHF and describe some of the practical strategies we use in our full-stack RLHF data collection product.
There has been remarkable progress in harnessing the power of LLMs for complex applications. However, the development of LLMs poses several challenges, such as their inherent brittleness and the complexities of obtaining consistent and accurate outputs. In this presentation, we present Guardrails AI as a pioneering solution that empowers developers with a robust LLM development framework, enhanced control mechanisms, and improved model performance, fostering the creation of more effective and responsible applications.
In this quick talk, Omar will talk about RLHF, one of the techniques behind ChatGPT and other successful ML models. Omar will also talk about efficient training techniques (PEFT), on-device ML, and optimizations.
Take a moment to randomly match with others in this event by participating in the networking sessions. To access the random introductions click on the match tab in the left sidebar.
Cat Cow and some Down Dog to take your minds off the LLM hallucinations.
The main problems faced by LLMs such as hallucinations, lack of domain knowledge, or outdated info are all data problems. How do we fix these data problems? Add a layer on top of the LLM with the ability to search the data we need to use.
As new LLM-driven applications reach production we need to revisit some of our traditional AI Governance frameworks. Diego will provide a brief introduction on what is changing in a critical step to seeing more of these applications go live.
Evaluating the performance of language models (LLMs) is a pressing issue for companies working with generative AI. Defining what makes a model "good" and measuring its performance are challenging due to the diverse range of LLM applications. Existing evaluation methods, including benchmarks and user preference comparisons, have limitations in scalability and objectivity. The future of LLM evaluation lies in scaling testing with machine learning systems, such as reward models that capture user preferences, and simulating user sessions to generate comprehensive test cases. These approaches will help developers select models, create effective prompts, ensure compliance, and enhance LLM quality.
LLMs are tremendously flexible, but can they bring additional value for classification tasks on tabular datasets?
I investigated whether LLM-based label predictions can be an alternative to typical machine learning classification algorithms for tabular data, by translating the tabular data into natural language and fine-tuning an LLM on it.
This talk compares results of LLM and XGBoost predictions.
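As a rough illustration of that translation step (the column names and the churn task below are made up for the example, not the speaker's dataset), each tabular row can be serialized into a natural-language prompt that the fine-tuned LLM completes with a class label:

```python
def row_to_prompt(row: dict) -> str:
    """Serialize one tabular row into a natural-language classification prompt."""
    features = ", ".join(f"{col} is {val}" for col, val in row.items())
    return f"The customer's {features}. Will this customer churn? Answer yes or no:"

example = {"age": 42, "plan": "premium", "monthly_spend": 83.5, "support_tickets": 4}
print(row_to_prompt(example))
# The customer's age is 42, plan is premium, monthly_spend is 83.5,
# support_tickets is 4. Will this customer churn? Answer yes or no:
```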
We will talk about best practices to ensure security, reliability, scalability and speed of LLM deployments.
Perplexity AI is an answer engine that aims to deliver accurate answers to questions using LLMs. Perplexity's CEO Aravind Srinivas will introduce the product and discuss some of the challenges associated with building LLMs.
'Cause every once in a while you just gotta move ya body.
Take a moment to randomly match with others in this event by participating in the networking sessions. To access the random introductions click on the match tab in the left sidebar.
It’s clear that test-driven development plays a pivotal role in prompt engineering, potentially even more so than in traditional software engineering. By embracing TDD, product builders can effectively address the unique challenges presented by AI systems and create reliable, predictable, and high-performing products that harness the power of AI.
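As a minimal, pytest-style sketch of what test-driven prompt engineering can look like (the `summarize` function is a hypothetical wrapper around your own model call, and the assertions are illustrative):

```python
def summarize(text: str) -> str:
    """Hypothetical wrapper around the LLM call using the prompt under test."""
    raise NotImplementedError("call your model here")

def test_summary_is_short():
    summary = summarize("LLM observability matters because ... " * 20)
    assert len(summary.split()) <= 50

def test_summary_keeps_key_entity():
    summary = summarize("Acme Corp reported record revenue in Q2.")
    assert "Acme" in summary

def test_summary_does_not_invent_numbers():
    summary = summarize("The team shipped the feature last week.")
    assert not any(ch.isdigit() for ch in summary)
```

Running these on every prompt change gives the same fast feedback loop TDD provides in traditional software.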
Large language models (LLMs) have revolutionized AI, breaking down barriers to entry to cutting-edge AI applications, ranging from sophisticated chat-bots to content creation engines.
LLMs also provide an easy-to-use and efficient way to perform high quality feature extraction of any language tasks for downstream consumption. This significantly reduces the time to bring models into production as well as operational cost.
At Canva, part of our product development involves understanding the content of our users, both consumers (who search for our content) and creators (who produce it). This talk will explore two examples of how LLMs such as GPT-3.5 have been leveraged to help solve these tasks with higher accuracy, greater velocity, and reduced cost.
What are some of the key differences in using 100M vs 100B parameter models in production? In this talk, Denys from Voiceflow will cover how their MLOps processes have differed between smaller transformer models and LLMs. He'll walk through how the four main production models Voiceflow uses differ, and the processes and product planning behind each one. The talk will cover prompt testing, automated training, real-time inference, and more!
Track 1