Large Language Model (LLM) is such an existing topic. Since the release of ChatGPT, we saw a surge of innovation ranging from education mentorship to finance advisory. Each week is a new opportunity for addressing new kinds of problems, increasing human productivity, or improving existing solutions. Yet, we may wonder if this is just a new hype cycle or if organizations are truly adopting LLMs at scale …
On March 2023, the MLOps Community issued a survey about LLMs in production to picture the state of adoption. The survey is full of interesting insights, but there is a catch: 80% of the questions are open-ended, which means respondents answered the survey freely from a few keywords to full sentences. I volunteered to clean up the answers with the help of ChatGPT and let the community get a grasp of the survey experiences.
In this article, I present the steps and lessons learned from my journey to shed some light on the MLOps survey on LLMs. I’m first going to present the goal and questions of the survey. Then, I will explain how I used ChatGPT to review the data and standardize the content. Finally, I’m going to evaluate the performance of ChatGPT compared to a manual review.
The MLOps Community’s survey is composed of 17 questions about the use cases, tools, and concerns for adopting LLMs in production. 110 people replied anonymously to the survey, and the responses can be accessed at this address. The questions asked of the participants are listed below:
0. What is your position/title at your company?
1. How big is your organization? (number of employees)
2. Are you using LLM at in your organization?
3. What is your use case/use cases?
4. Have you integrated or built any internal tools to support LLMs in your org? If so what?
5. What are some of the main challenges you have encountered thus far when building with LLMs?
6. What are your main concerns with using LLMs in production?
7. How are you using LLMs?
8. What tools are you using with LLMs?
9. What areas are you most interested in learning more about?
10. How do you deal with reliability of output?
11. Any stories you have that are worth sharing about working with LLMs?
12. Any questions you have for the community about LLM in production?
13. What is the main reason for not using LLMs in your org?
14. What are some key questions you face when it comes to using LLM in prod?
15. Have you tried LLMs for different use cases in your org?
16. If yes, why did it not work out?
Except for questions 1, 2, and 7, participants were free to provide any text they wanted for the rest of the questions. Thus, we can find answers such as “Entity matching, customer service responses (souped up/targetted FAQ)” to Q3 (cell H71) or “Thta it will hallucinate something that we won’t pick up in the report editing phase” to Q6 (I67).
These answers are rich in information, but they are also tricky to analyze: How can we extract the relevant keywords? Can we summarize the content without losing too much information? Is it possible to automate this process and avoid time-consuming human reviews? I know it would be difficult to apply classical NLP techniques (e.g., TF-IDF, Spell Checkers, Named Entity Recognition, …) with this variety of answers, and this is where ChatGPT API comes to the rescue!
The survey analysis was my first time developing with ChatGPT API. I thought it would be a good match since the task was open-handed and required a lot of background knowledge to find and extract the information. On the other hand, I didn’t want to use or fine-tune smaller LLMs to avoid the pitfalls of creating a new model for a one-shot task.
To get more background on prompt engineering, I read the Prompting Guide and the Open AI Cookbook before jumping on this task. I found out that the API was easy to use, especially compared to more complex libraries such as deep learning frameworks. Moreover, it seems intuitive to express requirements in natural language. My main struggle was to understand how to convert ChatGPT outputs to programming data structures with Python.
I used Google Colab to clean up the survey. The notebook can be accessed at this address. In the rest of this section, I’m going to highlight the main structures that helped me work on this use case.
MLOps Survey: Preparation – Large Language Models (LLM) – 2023
The code snippet below shows the Open AI model used for this experiment. The “gpt-3.5-turbo” corresponds to the same model used by the ChatGPT application. The ChatGPT API exposed a single endpoint to create chat completions from user messages (i.e., POST https://api.openai.com/v1/chat/completions).
I used the following function to associate the user inputs with the model outputs. The function takes as arguments 1) the ChatGPT model, 2) the instructions to perform the task in natural language, 3) the user inputs from a single column, and 4) the size of the batch (i.e., how many user inputs are processed at a time). The function then converts the instructions and input batch to ChatGPT messages and sends them to the API endpoint. Finally, the model output is parsed to Python data structures and combined into a dataframe.
The text snippet below shows the prompt associated with Question 3: “What is your use case/use cases?”. The first four sentences described the task to be done by ChatGPT. The next two instructions explain the output format for the model. For simplicity’s sake, I choose to use JSON lines (i.e., JSON records separated by newlines) to easily parse the output with Python. Ultimately, the last sentences give an example of the expected input and output of the model.
Here we can see that the task is quite complex, as the model needs to understand both the fields associated with the answer (e.g., analyze logs -> Computer Security) and find common NLP tasks from the inputs (e.g., question and classify text -> Questions Answering, Text Classification).
During my development, I found out that ChatGPT gave me good results 80% of the time. Still, this was not sufficient to ship the model output as-is. Thus, I performed a manual review of all the user answers, which was clearly the most tedious part of the whole experience …
Out of this manual review, the most common errors I found were:
On the other hand, I found some pretty impressive benefits:
Following my manual review, I created another notebook to visualize the results of the survey. You can also find the spreadsheet generated by ChatGPT at this address, and the spreadsheet reviewed manually at this address.
MLOps Survey: Visualization – Large Language Models (LLM) – 2023
Let’s now check some visualizations to better grasp the final results. The following figures show the answers to Question 4: “Have you integrated or built any internal tools to support LLMs in your org? If so what?”. The users provided two types of information: the purpose for integrating LLMs (first figure), and the tools used to support the integration (second figure).
For the final step, I wanted to evaluate the performance of ChatGPT compared to my manual review. To do so, I extracted the values from the spreadsheet 1) based on the raw output of ChatGPT API and 2) following my manual review. I then performed a side-by-side comparison of all values.
Evaluation Process:
You can find the evaluation notebook at the following address. Note that this is a harsh evaluation for ChatGPT, as I didn’t take into consideration results that are partially good. Either the model gave good results (+1), or I had to change something to get the desired results (-1). There is no in-between, even if the model provides some added value in the process.
MLOps Survey: Evaluation Notebook
We can see in the plot below the number of good and bad answers per information extracted. Some information was easily fixed by ChatGPT (e.g., topics, tools, approaches, tasks, …) while others were more challenging to the model (e.g., challenges, reasons, title, …). My conclusions for this evaluation are 1) it’s easier for the model to extract common knowledge for low-variety answers (e.g., NLP tasks), and 2) high-variety answers such as titles and reasons are more diverse and thus more open to interpretation.
The plot below shows the final evaluation of the model for my use case. Overall, ChatGPT API gave more good answers (514) than bad ones (289). Qualitatively, even bad results contained relevant values that improved the process. This made the manual review less painful than I first expected.