LLM in Large-Scale Recommendation Systems // Aditya Gautam // AI in Production
SPEAKER

Aditya is a seasoned machine learning practitioner, currently leading the foundational integrity efforts for Llama models. He has led several LLM applications to enhance Facebook recommendation and ranking algorithms at scale. Some of his contributions in Reels include user interest exploration, trend detection, quality improvement, and policy safeguarding by detecting violations and mitigating misinformation. He holds a master's degree from Carnegie Mellon University, has worked on machine learning at Google, and was a founding engineer of an AI startup at Area 120 (Google's incubator). Aditya is active in the generative AI community, contributing through speaking, panel, and research engagements, and shares his expertise and work at prominent conferences and summits, including the AI in Production Conference, AI Agent Conference, Generative AI Summit, and the Databricks Data + AI Summit, among others.
SUMMARY
The advent of Large Language Models (LLMs) has significantly transformed the landscape of recommendation systems, marking a shift from traditional discriminative approaches to more generative paradigms. This transition has not only enhanced the performance of recommendation systems but also introduced a new set of challenges that need to be addressed. LLMs have several practical use cases in modern recommendation systems, including retrieval, ranking, embedding generation for users and items in diverse spaces, harmful content detection, user history representation, and interest exploration and exploitation. However, integrating LLMs into recommendation systems is not without its hurdles. On the algorithmic front, issues such as bias, integrity, explainability, freshness, cold start, and the integration with discriminative models pose significant challenges. Additionally, there are numerous production deployment and development challenges, including training, inference, cost management, optimal resource utilization, latency, and monitoring. Beyond these, there are unforeseen issues that often remain hidden during A/B testing but become apparent once the model is deployed in a production environment. These include impact dilution, discrepancies between pre-test and backtest results, and model dependency, all of which can affect the overall effectiveness and reliability of the recommendation system. Addressing these challenges is crucial for harnessing the full potential of LLMs in recommendation systems.
TRANSCRIPT
Click here for the Presentation Slides
Demetrios [00:00:05]: I'm gonna bring out our next speaker. Aditya, where you at dude? Where are you at dude? Let's see where. Hey, there he is.
Aditya Gautam [00:00:14]: Hit him.
Demetrios [00:00:15]: Aditya, bro. I'm excited for your talk, man, because this is something that I keep having conversations about and I really cannot figure out if it's something that we should be going down the rabbit hole on, or if it is a bit of a waste of time, because recommender systems are so dependent on time, like latency constraints, and LLMs are notoriously bad at latency, and you're going to be talking about LLMs in large-scale recommender systems. I'm so excited. I'm going to hand it over to you and I'm going to figure out who won these headphones in the meantime.
Aditya Gautam [00:00:52]: Okay, sounds good. Okay, thank you very much. So I'll get started. I'm Aditya, I'm currently a machine learning lead at Meta. I've been working in the FB recommendation and ranking space for quite a while. Before this I was working at Google in the computer vision and machine learning space. In this talk we will be talking primarily about the use cases and the challenges of LLMs in the recommendation space. To get started... hold on.
Aditya Gautam [00:01:23]: Okay, so for the overview: we will go through how recommendation systems have evolved in the architectural space, and the different use cases and modes in which LLMs are used — we'll briefly go through that and the current state of the most advanced models we have — then the production challenges and solutions. This is where we are going to spend the majority of our time, talking about different kinds of challenges on the data, modeling and infrastructure side. A lot of them are talked about very little in the community, and there is not a lot of research done in those areas. So that's what I think will be interesting, because we talk a lot about recommendation systems in general, but what are the hidden issues that actually come up when you are deploying these very large-scale models on infrastructure, and what are the algorithmic biases and other things that come with it? To give a high-level overview: the recommendation problem is basically, given millions or billions of pieces of content, how do you figure out the top N — like 10, 20, 30 — items that should be shown to a particular user based on the user history and context? Generally there are two stages. One of them is the retrieval stage, which is candidate generation, where you say that from the pool of these millions of items you are going to bring out some thousands or hundreds of pieces of content which you think would be pretty good for the user, depending on the history.
Aditya Gautam [00:02:54]: And what signals you have. This is a model which is optimizing for recall — usually a two-tower model, and a model which is not very complicated. When you have those hundreds or thousands of candidates, then you go to a somewhat more comprehensive, high-precision model to do the ranking. And then there is another controller and other things specific to the user history before you actually give the items to the users. So this is just the high level. To go through the evolution from the architecture point of view: earlier, before neural networks, we had matrix factorization — an architecture that was used for a while, in which the users and the items are represented in a latent space, and then the inference, or basically the recommendations, were made on that. They were pretty naive in that way. And then comes the second phase: neural networks.
Aditya Gautam [00:03:54]: And in the neural networks we have like this two tower system which is widely used in industry. In one tower you have the user embeddings and all the features and everything, whatever comes from the user part. Then you have the item tower which is taking all the item features, item embeddings and everything. What happens is that the final layer or the embedding layer that comes out of this architecture are then interacted in a cosine similarity or any similarity matrix. Whether it's a click through or any other objective function. What you have, that's how the models are optimized and trained on. Then once you have the embedding, you can cluster it out, figure out different use cases in retrieval and ranking space. Then comes a new paradigm shift a couple of years back, which is LLM.
Aditya Gautam [00:04:44]: On the left side, what you are seeing are the ways in which LLMs are used: as a feature encoder, for feature engineering, for ranking in the scoring function (basically next-item prediction), and as a pipeline controller, where it decides whether to use a tool or just keep chatting with the chatbot. On the right side is basically how it is adapted in the training and inference phases. We'll go through that in detail in the LLM use cases. Primarily there are three ways in which LLMs are used. One is to get a comprehensive, conceptual understanding and then give it to the traditional recommendation system — say, this is what I'm extracting, please use that as a feature. The second one is an upgraded version of that, where you are using the token embeddings, giving them to the recommendation system — a two-tower model or whatever you have in place — and using that as a feature.
Aditya Gautam [00:05:38]: So you are taking the world knowledge from the LLM, putting it to there which was currently missing in a lot of recommendation system. Third one is the one where primarily like a paradigm shift I would say where LLMs are basically recommending the content. They are not like an auxiliary or supporting the main like the existing legacy system. So in the first case what you are seeing on the left hand side is architecture where LLM is trained. Like there are embeddings in the standard manner. You have all the items in the user embeddings or different features in that textual domain. But you can also, I would say you can extend it to the multi domain structure also if you have images, videos and all. So in this particular thing you are just using one embedding, like the conceptual embedding and then giving it for the additional information gained to the existing recommendation system.
Aditya Gautam [00:06:35]: Now in the second one architecture what you are doing is you instead of giving this one conceptual embedding, you are giving all the token embedding. That means you are extending this information phase or the information which is being given is not conceptualized on one embedding. But you are giving different paradigms and different things for model to learn. But still this is just feeding to the either a neural network or existing recommended system for like your objective problem. What you have in the third one which is which is basically like a shift in the domain which there is a lot of research work that has happened in this particular especially in last one year around. So in this LLMs are given everything. Basically you are a recommendation system like it's more about prompting fine tuning on this side. And then you are given the user all the user filters or basically the user features.
Aditya Gautam [00:07:31]: It is also given the items which are generated or retrieved by a retriever. And you say: okay, you are a recommendation system; you are given these particular features and these candidate items; please tell me what the best item is, what the recommendation is, what the next item for this particular user is, given this history. I want to re-emphasize this: here we have a retriever, which could be a traditional model or something else — you can also have an LLM as the retriever. And then once you have retrieved the top N candidates and you want to figure out how to re-rank them, or basically pick the top candidate for the recommendation, that's what you are providing in a prompt to the LLM. So one part is the prompt, the second is the history of the user, and the third is the candidates that you want a recommendation on from the LLM.
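A minimal sketch of that third mode: the LLM re-ranks candidates that a retriever has already produced, given a prompt, the user history, and the candidate list. `call_llm` is a hypothetical stand-in for whatever inference endpoint is in use, and the prompt wording is purely illustrative:

```python
def build_rerank_prompt(user_history, candidates):
    history = "\n".join(f"- {h}" for h in user_history[-20:])       # keep only recent history
    cands = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates))
    return (
        "You are a recommendation system.\n"
        f"The user recently engaged with:\n{history}\n\n"
        f"Candidate items:\n{cands}\n\n"
        "Return the candidate indices ranked from best to worst for this user, "
        "as a comma-separated list, with a one-line reason for the top choice."
    )

prompt = build_rerank_prompt(
    user_history=["NBA highlights", "football trick shots", "sports science explainer"],
    candidates=["cooking reel", "soccer skills tutorial", "crypto news"],
)
# response = call_llm(prompt)   # hypothetical endpoint; parse the ranked indices from it
```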
Aditya Gautam [00:08:26]: When it comes to fine-tuning and prompting, the standard techniques are still applicable to recommendation systems: in-context learning, prompt tuning and instruction tuning, which are pretty standard — I would say everyone is well aware of these. Now I'll spend more time on the production challenges and the different solutions. Each of them is pretty comprehensive in its own right — each could be a whole talk on its own — but we will not go into that much detail. We will stay at a fairly high level, outlining the problems and how we find solutions at different phases within our stack. The first one is data and systemic challenges.
Aditya Gautam [00:09:08]: We'll start with the user consumption. What happened in any recommendation system is that you have this bunch of users which are very small percentage 1020 percentage but their consumption is approximately 50 to 80 to 90%. So what happens because of that? This particular very frequently users are having a much higher representation, much higher representation in the data. Basically the training data and the model performance and accuracy for this user is also comparatively much higher. We might say that okay, that's good, it's 80% doing a better job on 80% of click through data for 20% of users. But from any organizations or the recommendation perspective, the gap in our north matrices when you're talking about monthly active, daily active users, it's come from that infrequent users. That's where the delta comes in. So bringing this making our models and data everything inconsistent consistent towards this infrequent user would actually help you a lot.
Aditya Gautam [00:10:09]: So what are the solutions we have? We have. You can do the feature engineering do add. Add more user based features. On the time span engagement category in the training stack there is a sample reweighting of that an auxiliary task paying a penalty to our popularity items during in the modeling aspect we can have a common trunk and then you have a different kind of like a neural network or any kind of transformer for each of these users depending on similar to like moe what we are seeing in LLM. But like with that we are doing the same kind of concept for different user category. And then at the end like you want to evaluate your system for each of these users to see that if there is an existing bias or not. Like yeah. Then the second kind of bias that comes is the recency bias where this recommendation system forgets the interest watch user has been engaged in a long term history.
Aditya Gautam [00:11:05]: This is something you can try on YouTube: just find a new interest, search for it, it will come up, and you click a little bit more on it. What you will see is that some of the interests you have probably been following for a good five, six months or a year will go down; some will still be there, and some will completely disappear from your feed. So how do you make sure this kind of thing does not happen in your recommendation system? First, maintain proper data logging for as long as you can, depending on retention, legal policy and so on. Then, during training, make sure the user's historical embeddings or historical context are well represented in your model along with the current interactions and the recent history. That makes sure it is not taking only the recent items into consideration, but also the long-term history. If you have been looking at machine learning content for a long time and you start looking at football, or you suddenly go into politics, that doesn't mean you are losing interest in the old topic — there is just more resurfacing happening for the new one. And then in ranking, maintain diversity: make sure you are re-ranking the candidates with both the old interests and the current new interests, so that both are in balance.
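One minimal way to illustrate keeping long-term interests alive next to recent clicks: build the retrieval query vector as a weighted blend of a long-term profile embedding and a short-term session embedding. The 0.6/0.4 split and the embedding sources are illustrative assumptions, not tuned values:

```python
import numpy as np

def blended_user_query(long_term_emb: np.ndarray,
                       recent_session_emb: np.ndarray,
                       long_term_weight: float = 0.6) -> np.ndarray:
    # Blend long-term and recent interest vectors, then re-normalize.
    q = long_term_weight * long_term_emb + (1 - long_term_weight) * recent_session_emb
    return q / (np.linalg.norm(q) + 1e-8)

long_term = np.random.randn(64)   # e.g. averaged over months of engagement
recent = np.random.randn(64)      # e.g. the last 15 minutes of clicks
query = blended_user_query(long_term, recent)
# `query` drives candidate retrieval, so older interests keep resurfacing.
```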
Aditya Gautam [00:12:25]: This is the new thing that comes up where a lot of content is now AI generated. You are seeing that a lot on TikTok and other social platform and this is a very ongoing research problem. Like this is creating a lot of imbalance, misinformation. Like a lot of monetary aspect is also getting changed when it comes to the distribution for creators and all. So how do you solve this? One is to of course like one is a policy aspect where you make sure that platform policies are not doing or doing this thing depending on how you want to do project your platform. Second would be to keep human in loop along with the machine learning classifiers. And this human loop will make sure that before the content reaches to this very viral post, it is being automated or basically some way to block it or reduce the content. Including human of course the LLM based multimodal classifier where you can detect the content whether it's AI generated or not and having a precision recall standard way of doing it.
Aditya Gautam [00:13:25]: But again this is a hard problem. I would say this is a very interesting research for next one year. I would say freshness. This is another good problem where what happens is that let's suppose if you engage in a particular content or something. And then we want to understand how long does it take for a new model deployed to have those interactions and this new behavior captured into the main ranking or a retrieval model. When you click on a device, from your device, the data go to the backend system. Data warehouse processing happens in a mini batch or a macro batch depending on the framework and the systematic pipelines you have. It goes into the training data set.
Aditya Gautam [00:14:05]: Then the retraining of the model happened, then online validation happened. Candidate deployment to make sure the model is good. And then the full model rollout happens. This delta, if we increase this delta, the system is going to lag. If we reduce this one, you are going to see that it's more responsive, that system is getting more interaction, more behaviors is captured in a better way. You would see that a lot of your top line matrices would also go up. It has a substantial impact if you reduce this time to get into some of the solutions. I would say logging, make sure that you are logging near real time features.
Aditya Gautam [00:14:45]: And when you're training the model, you are not just taking the historical feature. What the users have seen in last one month, one day, what is the interaction or sessions or interest and other things that happen in last 15 minutes. You cannot go to a minute latency. That's pretty hard that you can do for a retrieval. If you like football, retrieving that football content is possible. But when it comes to model thing and incorporating that behavior into model, it's a little bit hard. So you can do 5 or 10 minutes depending on the systematic like how optimized you can make your pipelines and everything. And then doing a mini, frequent mini batch training where you are actually kind of like frozen a lot of last layer, just removing, just keeping unfrozen for some of the layers to make sure that nitty gritty behaviors is captured.
Aditya Gautam [00:15:35]: And then do a full training on, let's say, a daily schedule, or whatever schedule you have on your side. Then deployment: when you're running a canary, make sure the rollout is already staged, and when the canary passes, the prod deployment switch happens right away, rather than the canary passing and then things just sitting there. So coming to the models and algorithms side, one of the main problems, which is a very well-known problem, is popularity bias. You might have seen this on TikTok and other places: content which is going viral gets even more viral — it gets more click-through. What this does is create a problem of unfair content distribution, plus ecosystem imbalance, plus monetary effects. It also makes it harder for newly surfaced content that is actually pretty good to get an even distribution, and for small creators to have a fair say if they are actually good. How do you make sure this doesn't happen? This is again a hard problem.
Aditya Gautam [00:16:40]: I would say it's basically a battle in a way. How do you want to optimize your recommendation system? In the training side you can have the upsampling and down sampling of this content. On the algorithm side do some regularization and penalty for popularity or something. On the post ranking, make sure you are able to re rank or do some mechanism to see if this problem is getting bigger or better depending on users. What is the user history? What is the content which is coming up? What is the diversity of the content? If a user is basically just competitively new, you would want to still show the popularity because that's a very well vetted content. That is a content which has been upvoted and has a high like rate and other kind of things. So yeah. And on the model evolution side I would suggest make sure all these content buckets, you have the long tail, short head kind of thing.
Aditya Gautam [00:17:39]: These are all very well evaluated in terms of like whether what is the performance coming of what if there is any like are we seeing any kind of like degradation in the problem for the long tail or distinct tail problem the content then like it makes it gives you an understanding and insight if your model is actually having a popularity bias or not and how much of that problem is already out there. Oh, another thing which is kind of like very less talked about especially in the recommendation system is like model dependency. So you have you based on the diagram you might be seeing. Okay, this is like a retrieval. We have a millions of content, there is a retrieval model and there is like a ranking model. And then you are just doing some kind of re ranking later on and giving it to the users. But this is not how it actually happens. What happens is that each of these models are using a feature.
Aditya Gautam [00:18:28]: Let's suppose for the content signals, content understanding, integrity signals and other kind of thing. And all those signals are coming from those specific models. Any change or blockages or performance issues or anything happens in one model you would see that being propagated is basically a stack of thoughts which happens a lot of other model gets inflicted because of issue in one model. How do you make sure this would happen? Of course, model isolation and replacement of the features input what is coming from other models in case the model fails or if there's any issue in the model and then doing a proper integration testing this is easier said than done. But if the model is using features from four of them during your evaluation phase, if one of the model goes down, what is the impact of that? If there is a higher impact then you would want to have an on call, an alert or something set up on your side to make sure that if any dependency is coming up it is impacting the model which you are responsible for, your team is responsible for. There is some kind of like I would say SLA kind of thing happening with the dependency in the model. Clickbait and the misinformation is another aspect where what happens is a lot of controversial content. Get more engagement because people like it.
Aditya Gautam [00:19:51]: People like all the spicy stuff and everything. With that you have more click through more data goes into training data set the models, learn more of those content and it consider that as a good content like as. As oh people are liking it. It's like one of the best content like let me recommend more of it. Let me retrieve more of those content because similar users are liking it. Similar item item like John like retrieval are also like recommending the same thing. So again this for kind of like coming up with some of the solutions for this one. There are tools to do the grounding of LLM and riot like for the content and the misinformation making sure this is the right content.
Aditya Gautam [00:20:32]: This is there is some ground truth, some betting is done on that having the human in loop is important for this because this could have like the PR issues and other kind of like a political, geopolitical and other sensitive issues that usually happens in like some recommendation system and having a specific small language models like to fine tune on a particular objective function. Misinformation, clickbait, spam, those kind of things. And in the post ranking removing those kind of content or removing it from the start of the funnel. So whatever works in your system. But this is one important problem. And then there are existing LLM common issue which I would not go into detail. I think everyone here is aware of this thing. Hallucination is their non deterministic output.
Aditya Gautam [00:21:17]: You ask LLM the same thing again it's might give something different finite context length which is pretty important for recommendation system because you are giving a user history in all the text content and some users especially the power user that's where you have this needle in a haystack kind of problem where a lot of those contents are not very well represented and the history User is not well represented within a prompt framework. So how do you get rid of or reduce some of the impact of these issues would be like you use rag some kind of temperature variation. Context parallelism is a really good way to make high context length. But again it doesn't fully solve the problem. It just helps in doing a parallel execution solve the input problem, especially when it is big. And red teaming would be another of course. And human in loop is like almost everywhere because it's sensitive matter in the infrastructure side. I would say the first problem is in any organization is of course like the GPU allocation.
Aditya Gautam [00:22:25]: There is like whole Nvidia thing happening. Everyone is like want to get grab as much of GPUs as possible. So once you have a GPU because you know that at least in present time this is a very kind of like considered like a goal for gold diggers are like in a way for finding the content or basically for training your model. When you have the limited gpu. There are different kind of operations that you want to do. One is you want to experiment with those models. You want to fine tune your LLMs. You have a production traffic that you are serving and this production traffic cannot be disrupted.
Aditya Gautam [00:23:04]: With that you are doing a lot of R and D work. How do you make sure that there is a proper allocation of GPUs for each of these teams? One is to give a static things you said okay, 20% is for development work, 10% is for RAD and experimentation. And then of course for the production traffic which is important, you want to solve the production in the same way make sure there is no like any issues on that side. So on the solution side I would say rather than having a static like to a dynamic allocation. If the users let's suppose in the daytime or in the evening has a much higher usage during night it goes down. Based on that dynamic allocation you are making sure that it's making a little bit more optimal. Another way to solve this problem would be RL based optimization algorithm which is a little bit harder because you need a lot of data from your organizations to say that what is the optimal strategy? If you are doing 10 experiment and you are getting some input on those experiments and you have used let's say x amount of GPUs. So there is like the problem formulation would be okay, you have a x GPU amount.
Aditya Gautam [00:24:17]: These are the model architecture. You are doing data training or whatever the. Whatever the objective you're trying to solve what is the output of that. So some way to learn these Things and make sure when you're doing experimentation and R D you are optimizing or just kind of like getting some understanding from the previous data and experiment run and like yeah, basically just helps with a little bit of more optimizing. What are the experiments to run and not. Yeah, in the training. LLM training is definitely hard, especially if you are training the foundational model. We all are well aware of that there are certain number of companies which can do that.
Aditya Gautam [00:24:57]: And orchestration of these hundreds and thousands of GPUs across the multi data centers is a hard problem and it is expensive. There are prediction issues that comes in the training. If the model breaks in between, how do you make sure what like what should be the ideal checkpointing thing? And if one of the multiple machines breaks because of the memory issues, how do you recover from those things and make sure the training is not disrupted? It takes the checkpoint and just resume it from there. There are a lot of really good solutions and optimization that has been done in the space and is already like a very active research area. I would say that if you are interested in this space go to the hugging phase. They have a really really good blog they have open about approximately two weeks ago. So the strategies are the data parallelism. So you're not just doing the parallelism with respect to having a mini batch of data or the layers or within a layer within Tensor.
Aditya Gautam [00:25:55]: You are doing every possible parallelization you can think of. It's like a 5D parallelism that happens in this particular thing. There is another aspect which is more like a 03 which is a pretty well known way to do better training and also inference on that side in which you distribute the gradients optimizer and the params of the models of certain number of layers to a dedicated gpu. So you say that okay, GPU one like take one two three layer. GPU two take two, three, four layer. So in the forward and the backward pass these machines just collaborate with each other to make sure that you are able to like do the proper training. Because nowadays it's not possible to fit LLM into one gpu. Even though there is a lot of advancement that has happened in the RAM and like the your NIC and other kind of stuff.
Aditya Gautam [00:26:51]: Gradient checkpointing and MLA which recently came out from deep SEQ or like other definitely good ways to make sure you are doing an optimized training and like basically just make sure that your cpu, the model cost, everything is like in line and you are doing, you are squeezing each and every CPU and the memory out of GPUs for your training. Another problem is of course the latency and the inference where you cannot fit the LLM into one particular gpu. That's a thing of a past. There are small LLM coming up in this particular thing which are very good capability but for certain problem like reasoning and all you still need those high memory models which are really good and which cannot be compressed directly into the small models. And of course like there is a problem for the high latency with this like very like layers of transformers inference happening and then there are concurrencies issues when you're doing inference. So I would say like some of the solution which is being used in the industry right now would be like of course to have like the multi GPU pipeline for inference making sure if the model is not able to fill one machine you have multiple GPUs like what we talked about in the training. Similar to that you can have different GPUs doing inference. Of course that this would add latency.
Aditya Gautam [00:28:16]: There are optimization like mixed precision where the precisions of the optimizers PARAM and gradient states would be different or even within a model within param sometime the attention would have a different precision and your feed neural network would have a different precision. So there are different kind of ways to do this. Then there is a quantization for post quantization training. Training aware quantization knowledge distillation to condense knowledge from a big model to the small one. KV coaching okay, that's a little bit of mistake but yeah KV caching is one of the area where like a lot of work has been done and this really proved to be very effective. Fuzzy lozenge catching that means of a prompt has been done and it's a similar like does it make sense to call the backend system or can we return the results based on existing similar query or something speculative Decoding is another really good work that has been done in the area where you have a smaller model which is basically doing the next token prediction and then it is being given to the big models on that side. So yeah I would say if you are more interested in the field like we have these resources, there are really really good resources on the side, very good diagrams and like a lot of medium blogs along with a survey paper that has been done on this field. So feel free to like take a look and get more into detail.
Aditya Gautam [00:29:47]: Sorry. And yeah, if you want to get in touch, I would love to connect with more folks from this meeting.
Demetrios [00:29:55]: Dude, so awesome. I've got one quick question for you, because I've been doing a really bad job at keeping time and have got to keep it moving quick. Have you seen — and if so, how — both LLMs and traditional ML recommender systems being used together? Because maybe you have the LLM that is more powerful and can help with cold starts, but then traditional ML recommender systems can be more useful and not fall over for some of those reasons that you mentioned in the talk.
Aditya Gautam [00:30:30]: Yeah, so I think the way the space is moving is that right now LLMs are taking on more of the ranking stage — the focus is a lot on that. But when it comes to retrieval, it's not there yet. You have millions or billions of pieces of content; right now, even to store or retrieve that content, LLMs cannot do it within a context window. We can't just say: okay, there are billions of items, you are a retriever, can you give me the top hundred candidates? That's not going to work.
Aditya Gautam [00:31:02]: Retrieval is one phase or I would say an area in the recommendation system where a traditional system would still have a lot of value. The embeddings that would be coming from the content recommendation, the clustering, how you figure out similar items or user user similarities, those would still be sought by traditional systems. Like LLM can be used as an embedding systems. But those aspect is something which is like I would say I haven't seen a lot of like research or kind of like a replacement that is happening on that area but there is this new paradigm coming out where you see that oh, these are the content retrieved by traditional recommendation systems. Now please tell me more about it. Rather than going in a discriminative way of like oh this is the cosine similarity. Let's find like the click through and all those objectives.
Demetrios [00:31:48]: So excellent man. Yeah, it's a fascinating field to think about how like the retrieval doesn't really work at least at this point. So you almost have to like hack it in a way and figure out other ways that you can make it work. So yeah, there. There's a few questions coming through in the chat and a lot of folks are asking about the resources slide. Can we drop that into the chat? Or maybe if you want to just drop the links in here then I can port it over to the chat also in the interest of time though, I'm gonna get you out of here and thank you. Good old Paco on but dude, thank you. This is such a fascinating topic for me.
Demetrios [00:32:39]: I. I really appreciate you going deep on it.
