This post was originally written by Jake Noble for the tecton.ai blog.
Have you ever wondered how TikTok can recommend videos to you that were uploaded minutes ago? Or how YouTube can pick up on your brand-new interest immediately after you watched one video about it? Or how Amazon can recommend products based on what you currently have in your shopping cart?
The answer is online recommendation systems (or recommender systems or recsys). These systems can generate recommendations for users based on real-time contextual information, such as the latest item catalog, user behaviors, and session context. If you’re interested in building an online recommendation system or trying to take your existing system to the next level, then this blog post is for you.
I worked as a tech lead on the YouTube Homepage Recommendations team from 2018 to 2021, focusing on long-term user satisfaction metrics. While at YouTube, I learned from some of the best minds in the business, who had been building and refining the world’s largest recommendation engine for close to a decade. Since 2021, I’ve been with Tecton and have worked with many smaller companies building and refining their search and recommendation systems.
In this blog post, I will give an overview of online recommendation systems, describe various approaches for building their subcomponents, and offer some guidance to help you reduce costs, manage complexity, and enable your team to ship ideas.
While online recommendation systems can vary substantially, they are all generally composed of the same high-level components. Those are:

- Candidate generation
- Feature retrieval
- Filtering
- Model inference and pointwise ranking
- Listwise ranking
Candidate generation is responsible for quickly and cheaply narrowing the set of possible candidates down to a small enough set that can be ranked. For example, narrowing a billion possible YouTube videos down to ~hundreds based on a user’s subscriptions and recent interests. A good candidate generation system produces a relatively small number of diverse, high-quality candidates for the rest of the system.
There are many different options that can make effective candidate generators, from collaborative filtering models to simple operational database queries (e.g., for recently added or popular items).
Every approach will have its pros and cons with respect to complexity, freshness, candidate diversity, candidate quality, and cost. A good practice is to use a combination of approaches and then "union" the results in the online system. For example, collaborative filtering models can generate high-quality candidates for users and items the model has already seen, but they suffer from the cold-start problem. Your system could supplement a collaborative filtering model with recently added, popular items fetched from an operational database.
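As a minimal sketch of this unioning pattern, here is how two candidate sources might be merged online. Both source functions are hypothetical placeholders, not a real API:

```python
def cf_candidates(user_id):
    # Placeholder: nearest neighbors from a collaborative filtering model.
    # Suffers from cold start: returns nothing useful for brand-new users/items.
    return ["item_42", "item_17", "item_99"]

def fresh_popular_candidates():
    # Placeholder: recently added, popular items from an operational database,
    # covering the collaborative filtering model's cold-start gap.
    return ["item_17", "item_1001", "item_1002"]

def generate_candidates(user_id, limit=500):
    # Union the sources, de-duplicating while preserving order.
    seen, merged = set(), []
    for item in cf_candidates(user_id) + fresh_popular_candidates():
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged[:limit]
```

In a real system the sources would be fetched in parallel, and the union would also enforce per-source quotas so one generator cannot crowd out the others.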
At one or more points, the recommendation system will need to look up or compute data/features for the user and the candidates being considered. This data will fall into three categories:

- Item features; e.g., `product_category=clothing` or `average_review_rating_past_24h=4.21`
- User features; e.g., `user_favorite_product_categories=[clothing, sporting_goods]` or `user_site_spend_past_month=408.10`
- User-item cross features; e.g., `user_has_bought_this_product_before=False` or `product_is_in_user_favorite_categories=True`
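Cross features like the last example are typically derived at request time from already-fetched user and item features. A minimal sketch, with all feature names hypothetical:

```python
def cross_features(user_features, item_features):
    # Derive a user-item cross feature from a user feature and an item
    # feature at request time.
    return {
        "product_is_in_user_favorite_categories":
            item_features["product_category"]
            in user_features["user_favorite_product_categories"],
    }

user = {"user_favorite_product_categories": ["clothing", "sporting_goods"]}
item = {"product_category": "clothing"}
```

Because cross features depend on the (user, item) pair, they must be computed per candidate per request rather than precomputed and cached like plain user or item features.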
Your system does not necessarily need to use a purpose-built "feature store" for feature retrieval, but the data service needs to handle the following:

- Low-latency, high-throughput reads, often with the help of caching. (It's typically fine if the `product_average_review_rating` feature is a minute or two stale.) Feature caching may be done at multiple levels; e.g., a first-level, in-process cache and a second-level, out-of-process cache like Redis.
- Consistency between offline training and online serving. If `user_site_spend_past_month` is pre-tax during offline training and post-tax during online inference, you may get inaccurate online model predictions.
- Streaming aggregation features. Features computed over recent events (e.g., `product_average_review_rating_past_24h` or `user_recent_purchase_ids`) are a popular and powerful class of features, and your data service should have a scalable approach to building and serving them.

Filtering is the process of removing candidates based on fetched data or model predictions. Filters primarily act as system guardrails against bad user experiences. They fall into the following categories:
- Data-based filters; e.g., an `item_out_of_stock_filter` that filters out-of-stock products.
- User-history-based filters; e.g., a `recently_viewed_products_filter` that filters out products that the user has viewed in the past day or in the current session.
- Model-based filters; e.g., an `unhelpful_review_filter` that filters out item reviews if a personalized model predicts that the user would rate the review as unhelpful.

Even though filters act as guardrails and are typically very simple, they themselves may need some guardrails. Here are some recommended practices:
- Be careful with filters that depend on fetched data. A filter like the `item_out_of_stock_filter` can cause outages if the upstream data service has an outage.

Finally, we get to the "normal" machine learning part of the system: model inference. As with other ML systems, online model inference boils down to sending feature vectors to a model service to get predictions. Typically these models predict easily measurable downstream events, like click-through rate, probability of a purchase, video watch time, etc. The exact ML method, model architecture, and feature engineering for these models are huge topics, and fully covering them is outside the scope of this blog post.
Instead, we will focus on a couple of practical topics that are covered less in data science literature.
“Pointwise” ranking is the process of scoring and ranking items in isolation; i.e., without considering other items in the output. The item pointwise score may be as simple as a single model prediction (e.g., the predicted click through rate) or a combination of multiple predictions and heuristics.
For example, at YouTube, the ranking score is a simple algebraic combination of many different predictions, like predicted “click through rate” and predicted “watch time”, and occasionally some heuristics, like a “small creator boost” to slightly bump scores for smaller channels.
```
ranking_score = f(pCTR, pWatchTime, isSmallChannel)
              = pCTR ^ X * pWatchTime ^ Y * if(isSmallChannel, Z, 1.0)
```
Tuning the parameters X, Y, and Z will shift the system’s bias from one objective to another. If your model scores are calibrated, then this ranking function can be tuned separately from model launches. The parameters can even be tuned in a personalized way; e.g., new users may use different ranking function parameters than power users.
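A concrete sketch of that scoring function in code (the parameter values here are illustrative defaults, not YouTube's actual values):

```python
def ranking_score(p_ctr, p_watch_time, is_small_channel,
                  x=0.6, y=1.0, z=1.2):
    # Algebraic combination of calibrated model predictions plus a
    # heuristic boost for small channels. x, y, and z are tunable
    # independently of model launches, and could even be personalized
    # per user segment.
    boost = z if is_small_channel else 1.0
    return (p_ctr ** x) * (p_watch_time ** y) * boost
```

Raising `x` biases the system toward clicks; raising `y` biases it toward watch time; `z > 1.0` gives small channels a slight edge at equal predicted engagement.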
Have you ever noticed that your YouTube feed rarely has two similar videos next to each other? The reason is that item diversity is a major objective of YouTube’s “listwise” ranking. (The intuition being if a user has scrolled past two “soccer” videos, then it’s probably sub-optimal to recommend another “soccer” video in the third position.)
Listwise ranking is the process of ordering items in the context of other items in the list. ML-based and heuristic-based approaches can both be very effective for listwise optimization—YouTube has successfully used both. A simple heuristic approach that YouTube had success with is to greedily rank items based on their pointwise rank but apply a penalty to items that are too similar to the preceding N items.
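A minimal sketch of that greedy heuristic, assuming a hypothetical per-item category label as the similarity signal and an illustrative penalty factor:

```python
def greedy_diverse_rerank(items, scores, category, window=2, penalty=0.5):
    # items: candidate ids; scores: pointwise score per item;
    # category: item id -> topic label used as a crude similarity signal.
    # Greedily pick the highest-scoring remaining item, penalizing items
    # whose category matches any of the preceding `window` picks.
    ranked = []
    remaining = set(items)
    while remaining:
        recent = {category[i] for i in ranked[-window:]}
        best = max(
            remaining,
            key=lambda i: scores[i] * (penalty if category[i] in recent else 1.0),
        )
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

With three "soccer" videos and one "cooking" video, the cooking video gets pulled up above higher-scoring soccer videos once a soccer video has already been placed, which matches the intuition described above. A real system would use a richer similarity function (e.g., embedding distance) instead of a single category label.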
If item diversity is an objective of your final output, keep in mind that listwise ranking can only be impactful if there are high-quality, diverse candidates in its input. This means that tuning the upstream system, and in particular candidate generation, to source diverse candidates is critical.
This blog post did not cover major recsys topics like feature engineering, experimentation, and metric selection, but I hope that it gave you some ideas about how to build or improve your system. If you’re interested in learning more, I recommend (pun intended) signing up for apply(recsys), a free, virtual, half-day event focusing on everything recsys, with guest speakers from major recsys powerhouses such as Slack and ByteDance. You can also reach out to me ([email protected]) in the Tecton/Feast Slack if you want to chat more about recommendation system design.