Building Recommender Systems with Large Language Models
Sumit works as an MLE at Meta in the Recommender Systems domain. Previously, he worked in RecSys at TikTok, NLU at Amazon, and speech processing at Samsung.
At the moment, Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
Many researchers have recently proposed different approaches to building recommender systems using LLMs. These methods convert different recommendation tasks into either language understanding or language generation templates. This talk highlights some of the recent work done on this theme.
See, it is out, and I've got Sumit coming onto the stage. There he is. What's up, dude? Demetrios! Love the t-shirt. Thanks, man. So you are one of the few people that I hear talking about recommender systems, or just recommendation engines, with LLMs. And I am so excited by this topic because, again, going back to this LLM in Production report: in the use cases section, when we asked people what they are using LLMs for, recommendations was actually one of the strong use cases.
So it is. I would say, in my mind at least, the work that I see you sharing puts you at the forefront of this. And I love that you are here and you're gonna school us on a little bit of it. I'm gonna put 10 minutes on the clock. I'll be back and I will see you soon, man. Yep. Thank you so much. I hope you can see my screen.
Yeah. Awesome. So, hello everyone. LLMs have emerged as powerful tools for a wide range of NLP tasks, and recently there has been significant interest in the recommender systems community in using these LLMs to enhance various aspects of recommender systems. Today I'll briefly highlight how large language models are being used in recommender systems, why they should be used in the first place, and what some of the associated challenges are.

Before we begin, a little about myself: my name is Sumit Kumar. I work as a senior research engineer, a senior machine learning engineer, now at Meta, and I mainly work with content recommendation platforms. Previously, I worked as a recommender systems MLE at TikTok, a research scientist at Amazon, and a speech recognition engineer at Samsung.

One of the big motivations for using LLMs for recommendations is that LLMs encode a massive amount of external knowledge that can supplement the user behavior data that we commonly use in recommenders.
So, for example, because of its web-scale knowledge, an LLM can recommend that a user buy a turkey when it is Thanksgiving, but a traditional recommender system may not be able to do that if there is no logged click behavior that relates turkey with Thanksgiving. LLMs have also shown strong zero-shot and few-shot capabilities, which can help a lot in recommender systems, where we often deal with challenges like data sparsity and cold starts.
And we can also utilize the high-quality text feature representations from LLMs to more effectively model the data that we handle in recommendation systems, such as user profiles, item descriptions, and so on.

So one way to understand the current state of this line of work is to categorize it into discriminative and generative approaches to recommendation. In the discriminative paradigm, language models have mainly been used to provide embeddings for downstream tasks. BERT-series models, which are rather smaller language models, usually fall in this category. These approaches can be further classified into fine-tuning and prompt tuning. In fine-tuning, the pre-trained language model is tuned with data specific to the downstream task; for recommendations, this data usually contains user-item interactions, item descriptions, user profiles, and other contextual information. In prompt tuning, the tuning objective of the downstream task is aligned with the pre-training loss. For this presentation, our focus will be on the generative paradigm, which can be further categorized into non-tuning and tuning methods.
Non-tuning work includes prompting methods, where researchers assume that the LLMs already have recommendation capabilities and try to trigger those capabilities by introducing specific prompts; in in-context learning, these prompts also include some demonstrative examples.

Tuning work includes prompt tuning and instruction tuning. The delineation between the two is not very clear, but some of the literature calls it prompt tuning when the parameters of the LLM are fine-tuned on a specific task, and instruction tuning when they are tuned on multiple tasks with different types of instructions.
This is an example prompt from a research paper from Alibaba where they evaluated ChatGPT's zero-shot and few-shot recommendation capabilities. In this prompting method, the prompt consists of a task description that describes the recommendation task in natural language, a behavior injection component that injects user-item interaction information into the prompt, and an output format indicator.
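The three-part prompt structure described above could be assembled like this; the function name, field wording, and example items are illustrative, not taken from the paper:

```python
def build_recsys_prompt(task_description, user_history, candidates, output_format):
    """Assemble a zero-shot recommendation prompt from the three
    components described above: a task description, a behavior
    injection component, and an output format indicator."""
    history = "\n".join(f"- {item}" for item in user_history)
    cands = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(candidates))
    return (
        f"{task_description}\n\n"
        f"The user has recently interacted with:\n{history}\n\n"
        f"Candidate items:\n{cands}\n\n"
        f"{output_format}"
    )

prompt = build_recsys_prompt(
    task_description="You are a movie recommender. Rank the candidate movies for this user.",
    user_history=["The Matrix", "Inception", "Interstellar"],
    candidates=["Tenet", "Titanic", "Dune"],
    output_format="Answer with a ranked list of candidate numbers only, e.g. 2,1,3.",
)
print(prompt)
```

The resulting string would then be sent to the model as a single user message; numbering the candidates makes the output format indicator easy to enforce.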
The same paper further added some demonstrative examples into the prompt to get recommendations from ChatGPT in few-shot settings. This study, and a few others, have shown that zero-shot and few-shot recommendations can beat random guessing, and sometimes even some carefully designed heuristics, but they still cannot surpass the performance of a traditional recommendation model that is trained specifically for a given task on task-specific data. So, to overcome these shortcomings, several researchers have proposed frameworks to fine-tune large language models with recommendation data. These frameworks use user-item interactions to create instructions that are then used to fine-tune the LLMs.
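A minimal sketch of how logged user-item interactions might be turned into instruction-style training pairs; the template wording and the instruction/input/output field layout are invented for illustration, not taken from any specific framework:

```python
def interactions_to_instruction(history, next_item):
    """Turn one prefix of a user's interaction sequence into an
    instruction-tuning example: the instruction and input describe
    the history, and the target is the item the user chose next."""
    return {
        "instruction": "Given the user's purchase history, predict the next item they will buy.",
        "input": "; ".join(history),
        "output": next_item,
    }

# Build training examples by sliding over each logged sequence.
logs = [["phone", "phone case", "screen protector", "charger"]]
examples = []
for seq in logs:
    for t in range(1, len(seq)):
        examples.append(interactions_to_instruction(seq[:t], seq[t]))

print(len(examples))  # 3 examples from one sequence of length 4
```

Each example pairs an observed history prefix with the actual next interaction, which is the standard next-item supervision signal rephrased as natural language.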
There are also several frameworks that take a foundation model approach, such as the P5 model, which extensively pre-train the model on a number of recommendation tasks with the same language modeling objective, with everything cast in a unified text-to-text paradigm.

Zooming out a bit in the recommendation space, LLMs have been used for data augmentation and for encoding text features. They have been used as a conversational tool that also decides whether to continue talking to the user or to call a backend API to further refine the current set of candidates. Some researchers have also used them as re-rankers alongside the traditional retrieval model, and in many papers they have also been used directly for generating recommendation outputs.

So why should you use an LLM for recommendations? Well, LLMs' external world knowledge can supplement the behavior data in recommendations, and in few-shot settings they can adapt to new information without having to retrain or change the model architecture. Their zero-shot performance can help in mitigating some of the data sparsity and cold start issues that are very common in recommender systems. And through a chat-based interface, users can now directly interact with the recommender system in natural language, whereas in a traditional recommender system they are only passively involved, through implicit feedback. And as a byproduct of chain-of-thought reasoning, LLMs can justify specific recommendations in natural language, which can increase the transparency of the recommendation algorithms.
Using LLMs can also simplify some complex feature engineering steps, like some of the feature pre-processing and embedding methods.

And I believe it's equally important to be aware of some of the problems with this theme as well. LLMs may recommend items that are not present in the candidate set, and they can be highly sensitive to the design of the input prompt. They may give you an answer that is in an incorrect format, or very verbose, when you simply ask for a yes-or-no answer or a rating on a scale of one to five. Like I mentioned, they can be highly sensitive to the input prompts, and deciding how many demonstrations to include in the prompt, and what kind of demonstrations to include, is also an open problem right now. ID-like features have been very successful in traditional recommendation models, but incorporating them into prompts can be really challenging; there have been lots of research papers on that theme as well.
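One defensive pattern for the format problem mentioned above is to parse the model's free-form reply and reject anything outside the expected scale, so the caller can retry or fall back to a traditional model. This is a generic sketch, not a technique from any specific paper:

```python
import re

def parse_rating(reply, lo=1, hi=5):
    """Extract the first integer from a possibly verbose LLM reply;
    return None if it is missing or out of range, so the caller can
    retry the prompt or fall back to another ranker."""
    match = re.search(r"\b([1-9]\d*)\b", reply)
    if match:
        rating = int(match.group(1))
        if lo <= rating <= hi:
            return rating
    return None

print(parse_rating("I would rate this item 4 out of 5 because ..."))  # 4
print(parse_rating("Sorry, I cannot rate this item."))                # None
```

The same idea applies to ranked-list outputs: validate that every returned item actually belongs to the candidate set before serving it.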
And online recommender systems are real-time services, and they are extremely time-sensitive, but the prompt generation and LLM inference steps add a significant amount of time cost. There can also be a huge gap between the universal knowledge that is encoded within the parameters of an LLM and the specificity of the user behavior patterns that we see in private domain data.
And of course, data security is another related concern. The limited context lengths of some of these LLMs can make it really hard to incorporate a substantial amount of behavior information, or behavior sequences, into the prompts. Some studies have also shown that popular items, for example bestselling books, that appear more frequently in ChatGPT's pretraining corpus are also more likely to be ranked higher when ChatGPT is used for ranking tasks.
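A simple way to respect a context budget is to keep only the most recent interactions that fit, since recent behavior is usually the most predictive. In this sketch the token count is approximated by whitespace splitting; a real system would use the model's own tokenizer:

```python
def truncate_history(items, max_tokens):
    """Keep the most recent items whose combined (approximate) token
    count fits the budget, preserving chronological order."""
    kept, used = [], 0
    for item in reversed(items):          # walk newest-first
        cost = len(item.split())          # crude token estimate
        if used + cost > max_tokens:
            break
        kept.append(item)
        used += cost
    return list(reversed(kept))           # restore chronological order

history = ["wireless mouse", "mechanical keyboard", "usb c hub", "laptop stand"]
print(truncate_history(history, max_tokens=6))  # ['usb c hub', 'laptop stand']
```

More elaborate variants summarize or cluster older interactions instead of dropping them outright.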
Some LLMs have also been shown to have position bias, where changing the order of the input items significantly changes the LLM's ranking output, which sort of makes them suboptimal for re-ranking. And some of these LLMs have been shown to generate harmful content, reinforce social biases, and exhibit unfairness toward sensitive attributes like gender, race, and so on.
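One mitigation that has been explored for position bias is to query the ranker several times with different candidate orders and aggregate the results. A toy, deterministic sketch: `biased_rank` is a stand-in for an LLM call with an invented bias (it always keeps the first prompted item on top), and the aggregation averages ranks over every input order:

```python
from itertools import permutations

# True preference the hypothetical ranker holds: c > a > b.
scores = {"a": 2, "b": 1, "c": 3}

def biased_rank(order):
    """Stub for an LLM ranking call: keeps whatever item appeared
    first in the prompt at rank 1 (position bias), then orders the
    rest by its true preference."""
    rest = sorted(order[1:], key=lambda x: -scores[x])
    return [order[0]] + rest

def aggregate_rankings(candidates, rank_fn):
    """Average each item's rank position over every input order, so
    the ranker's preference for early prompt positions cancels out."""
    totals = {c: 0 for c in candidates}
    for order in permutations(candidates):
        for pos, item in enumerate(rank_fn(list(order))):
            totals[item] += pos
    return sorted(candidates, key=lambda c: totals[c])

print(aggregate_rankings(["a", "b", "c"], biased_rank))  # ['c', 'a', 'b']
```

Despite the bias, the aggregated order recovers the true preference. With a real LLM you would sample a handful of shuffles rather than enumerate all permutations, trading inference cost against bias reduction.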
And of course, there is already a lot of research work on how to potentially mitigate some of these problems. Using LLMs for recommender systems has been a very active recent research area, and there has been lots of exciting stuff happening. But that's all from my side for today. If you'd like to reach out, you can find me on the socials mentioned here. I also write a blog and a newsletter on various information retrieval topics. Thank you for listening, and thank you, Demetrios, for inviting me.

Dude, you're too kind, man. How could I not have you on here? It is my honor. It's so cool to see this, and ah, I love it. So, I'm gonna be in San Francisco in two weeks and I hope to see you in person.
And for now, I'm gonna kick you off the stage because it is super late where I am and I'm trying to finish at a decent hour. I appreciate this, Sumit. We'll be in touch, man. Thanks so much. Thank you so much.