Recommendation Systems Walkthrough - Popularity Recommendations
This post discusses a different approach to recommending movies based on the movie’s popularity.
Unlocking Recommendations
The Foundational Power of Popularity-Based Systems
In this post, we will use the average rating available in the movies database. This approach is for building a more generalized recommendation widget based on the movie’s popularity.
This document outlines the concept and implementation of popularity-based recommendation systems, a foundational approach in movie recommendation engines. These systems prioritize items based on their broad appeal, making them essential for generalized recommendations and addressing the cold start problem.
Core Concept: Why Popularity Matters
Popularity-based recommendations suggest items based on their broad appeal to the general audience, rather than personalized user preferences. The underlying principle is that items with higher popularity are more likely to be enjoyed by a larger user base.
This method is crucial for creating generalized recommendation widgets (e.g., “Top 10 Movies This Week”) and for addressing the cold start problem (recommending to new users or new items with limited interaction data).
Mechanism & Challenges: Beyond Simple Averages
A simple approach involves sorting items by a pre-calculated aggregate metric, such as an average rating. However, a simple arithmetic average can be misleading. For instance, a movie with one 5-star rating is not as statistically reliable as a movie with 10,000 ratings averaging 4.0 stars. This highlights the need for more sophisticated weighting.
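To make the problem concrete, here is a minimal sketch that ranks two movies by plain arithmetic mean. The titles and ratings are made up for illustration and do not come from the movies database.

```python
from statistics import mean

# Hypothetical ratings on a 1-5 scale; not real data.
ratings = {
    "Obscure Film": [5.0],                            # one enthusiastic vote
    "Crowd Favourite": [4.0, 4.5, 3.5, 4.0] * 2500,   # 10,000 votes averaging 4.0
}


def rank_by_raw_average(ratings_by_title):
    """Sort titles by plain arithmetic mean, highest first."""
    averages = {title: mean(votes) for title, votes in ratings_by_title.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_raw_average(ratings)
for title, avg in ranking:
    print(f"{title}: {avg:.2f} ({len(ratings[title])} votes)")
```

The single-vote movie lands on top even though there is far less evidence behind its score, which is exactly the distortion a weighted formula is meant to correct.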
The approach here is pretty simple: we sort the movies based on the pre-calculated average rating, which is collected from different users.
IMDb’s Weighted Rating Formula (Bayesian Estimate)
IMDb uses a robust statistical approach for its “Top 250” lists, employing a Bayesian estimate for its weighted rating to stabilize ratings for items with fewer votes and prevent outliers from disproportionately influencing perceived popularity.
For the sake of demonstration, consider IMDb’s list of top-rated movies, which is built from the ratings of many different users. How does IMDb actually calculate these ratings?
The formula for calculating the top rated 250 titles gives a true Bayesian estimate:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C where:
- R = average for the movie (mean) = (Rating)
- v = number of votes for the movie = (votes)
- m = minimum votes required to be listed in the Top 250 (currently 25000)
- C = the mean vote across the whole report (currently 7.0)
Intuition: This formula blends a movie’s individual average rating (R) with the global average (C). When v (votes) is small relative to m, WR is pulled closer to C. When v equals m (minimum votes), WR sits exactly halfway between R and C. As v grows far beyond m, WR approaches R.
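A quick numeric check of that intuition, using the m = 25000 and C = 7.0 values quoted above and a hypothetical movie whose own average is R = 9.0. The vote counts are chosen only for illustration.

```python
def weighted_rating(r, v, m, c):
    """Weighted rating: WR = (v / (v + m)) * R + (m / (v + m)) * C."""
    return (v / (v + m)) * r + (m / (v + m)) * c


M = 25000   # minimum votes, as quoted in the formula's definition
C = 7.0     # global mean vote, as quoted in the formula's definition
R = 9.0     # hypothetical movie average

few_votes = weighted_rating(R, 100, M, C)          # v much smaller than m
equal_votes = weighted_rating(R, 25_000, M, C)     # v equal to m
many_votes = weighted_rating(R, 2_500_000, M, C)   # v much larger than m

print(f"v = 100:       WR = {few_votes:.3f}")    # close to C
print(f"v = 25,000:    WR = {equal_votes:.3f}")  # midpoint of R and C
print(f"v = 2,500,000: WR = {many_votes:.3f}")   # close to R
```

With 100 votes the score barely moves from 7.0, at 25,000 votes it is exactly 8.0, and only with millions of votes does it come close to the movie’s own 9.0.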
Python Implementation Example
def compute_weighted_average_rate(r, v, m, c):
    """
    Calculates the IMDb-style weighted rating for an item.
    Args:
        r (float): The average rating for the item.
        v (int): The number of votes for the item.
        m (int): The minimum votes required for consideration (regularization constant).
        c (float): The mean vote across the entire dataset (prior mean).
    Returns:
        float: The calculated weighted rating.
    """
    if v == 0:  # No votes: fall back to the global mean (also avoids 0/0 when m is 0)
        return c
    wr = (v / (v + m) * r) + (m / (v + m) * c)
    return wr
This criterion forms a solid basis for a generalized recommendation widget, useful when individual user profiles are unavailable or insufficient. Scores derived from this formula are typically recomputed periodically to reflect changing preferences and new data.
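As a rough sketch of how such a widget could rank a catalogue, the snippet below scores a few hypothetical movies and returns the top entries. `MovieStats`, `top_n`, the sample data and the choice of m = 50 are all assumptions made for illustration. C is estimated here as the vote-weighted mean of the sample, which is one reasonable option rather than the only one.

```python
from dataclasses import dataclass


@dataclass
class MovieStats:
    name: str
    rating_avg: float   # R: the movie's own average rating
    vote_count: int     # v: number of votes behind that average


def weighted_rating(r, v, m, c):
    """Weighted rating; falls back to c when there is nothing to weigh."""
    if v + m == 0:
        return c
    return (v / (v + m)) * r + (m / (v + m)) * c


def top_n(movies, m, n=10):
    """Return the n best (name, score) pairs by weighted rating."""
    total_votes = sum(movie.vote_count for movie in movies)
    if total_votes == 0:
        return []
    # C: mean vote across the whole sample, weighted by vote count.
    c = sum(movie.rating_avg * movie.vote_count for movie in movies) / total_votes
    scored = [
        (movie.name, weighted_rating(movie.rating_avg, movie.vote_count, m, c))
        for movie in movies
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical catalogue on a 1-5 scale; not real data.
catalogue = [
    MovieStats("One-Vote Wonder", 5.0, 1),
    MovieStats("Solid Hit", 4.2, 800),
    MovieStats("Mixed Bag", 3.1, 500),
]

ranking = top_n(catalogue, m=50)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this sketch m only acts as a smoothing constant. A list like the Top 250 additionally excludes titles with fewer than m votes, which would be a separate filtering step.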
Data Representation & SQL Queries
Based on the previous criteria, we can build a suggestion widget that recommends movies to all users, given the IMDb popularity info or the users’ average ratings, which will be recomputed from time to time.
Of course, this is not the best way to make a recommendation on its own, but I think it can be used along with other recommender algorithms.
A movies_app_movie table might store movie names, rates, and a pre-calculated popularity score. Assume for a moment that our movies database returns the following table for this SQL query:
SELECT name, rate, COALESCE(popularity, 0) AS popularity
FROM movies_app_movie
ORDER BY popularity DESC;
| Name | Rate | Popularity |
|---|---|---|
| Minions | 6.4 | 547.4882980000001 |
| Wonder Woman | 7.2 | 294.337037 |
| Beauty and the Beast | 6.8 | 287.253654 |
| Baby Driver | 7.2 | 228.032744 |
| Big Hero 6 | 7.8 | 213.84990699999997 |
| Deadpool | 7.4 | 187.860492 |
| Guardians of the Galaxy Vol. 2 | 7.6 | 185.33099199999998 |
| Avatar | 7.2 | 185.070892 |
| John Wick | 7.0 | 183.870374 |
| Gone Girl | 7.9 | 154.80100900000002 |
| The Hunger Games: Mockingjay - Part 1 | 6.6 | 147.098006 |
| War for the Planet of the Apes | 6.7 | 146.161786 |
| Captain America: Civil War | 7.1 | 145.882135 |
| Pulp Fiction | 8.3 | 140.95023600000002 |
| Pirates of the Caribbean: Dead Men Tell No Tales | 6.6 | 133.82782 |
| The Dark Knight | 8.3 | 123.167259 |
| Blade Runner | 7.9 | 96.272374 |
| The Avengers | 7.4 | 89.887648 |
| Captain Underpants: The First Epic Movie | 6.5 | 88.561239 |
| The Circle | 5.4 | 88.439243 |
COALESCE(popularity, 0) ensures a default value of 0 if popularity is NULL. ORDER BY popularity DESC sorts from highest to lowest popularity. A movies_app_rating table stores the individual movie ratings, which are aggregated per movie later in this section.
Based on the previous table, we deduce that movies are sorted in descending order according to their popularity. We will go through how we can compute such info for every film in our database.
Computing the popularity property is a bit tricky, since it may need some of the following information, which will be stored in a separate table to JOIN on later.
Let’s take a look at a separate table that holds the movie ratings. Some aggregation is required to calculate the rating count and average of every movie.
SELECT movie_id,
COUNT(movie_id),
AVG(rate) AS rate_avg
FROM movies_app_rating
GROUP BY movie_id
ORDER BY rate_avg DESC
LIMIT 20;
| Movie ID | Name | Count | Rate Avg |
|---|---|---|---|
| 2284 | Mr. Magorium’s Wonder Emporium | 1 | 5 |
| 4459 | Night Without Sleep | 1 | 5 |
| 5473 | De Dominee | 1 | 5 |
| 2636 | The Specialist | 1 | 5 |
| 36931 | On the Edge | 1 | 5 |
| 64278 | Interceptor Force 2 | 1 | 5 |
| 183 | The Wizard | 1 | 5 |
| 845 | Strangers on a Train | 1 | 5 |
| 26791 | Brigham City | 1 | 5 |
| 43267 | 29th Street | 1 | 5 |
| 31413 | Innocence | 1 | 5 |
| 4201 | The Fifth Musketeer | 1 | 5 |
| 2984 | A Countess from Hong Kong | 1 | 5 |
| 4140 | Blindsight | 1 | 5 |
| 6107 | Murder in Three Acts | 1 | 5 |
| 1563 | Sunless | 1 | 5 |
| 65216 | Bloody Cartoons | 1 | 5 |
| 1933 | The Others | 1 | 5 |
| 8675 | Orgazmo | 1 | 5 |
| 2897 | Around the World in Eighty Days | 1 | 5 |
COUNT(movie_id) calculates v (vote count). AVG(rate) calculates R (average rating). This aggregation highlights the issue of movies with high average ratings but very few votes (e.g., 5.0 with 1 vote), reinforcing the need for weighted formulas.
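The aggregation and the weighted formula can be combined in a few lines. The sketch below uses Python’s built-in sqlite3 module with an in-memory table that mirrors the movie_id and rate columns of movies_app_rating. The rows, the schema types and the value M = 10 are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies_app_rating (movie_id INTEGER, rate REAL)")

# Hypothetical ratings: movie 1 has a single 5.0, movies 2 and 3 have many votes.
rows = [(1, 5.0)] + [(2, 4.0)] * 30 + [(2, 5.0)] * 10 + [(3, 2.0)] * 20
conn.executemany("INSERT INTO movies_app_rating VALUES (?, ?)", rows)

M = 10  # assumed smoothing constant (minimum votes), chosen for this toy data

# C: the mean vote across all ratings.
C = conn.execute("SELECT AVG(rate) FROM movies_app_rating").fetchone()[0]

# Per-movie vote count (v) and average rating (R).
stats = conn.execute(
    """
    SELECT movie_id, COUNT(*) AS v, AVG(rate) AS r
    FROM movies_app_rating
    GROUP BY movie_id
    """
).fetchall()

# (v * R + M * C) / (v + M) is algebraically the same as the weighted rating formula.
ranked = sorted(
    ((movie_id, v, r, (v * r + M * C) / (v + M)) for movie_id, v, r in stats),
    key=lambda row: row[3],
    reverse=True,
)

for movie_id, v, r, wr in ranked:
    print(f"movie {movie_id}: votes={v}, avg={r:.2f}, weighted={wr:.2f}")

conn.close()
```

Sorted by raw average, movie 1 would come first on the strength of one vote. Sorted by the weighted score, it drops behind the movie with 40 votes.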
Factors for Computing Popularity Scores
A robust popularity score often uses a weighted sum of dynamic and static factors, providing a comprehensive view of an item’s current and long-term appeal. The specific blend and weights are often proprietary and fine-tuned for the platform.
There are many ways to determine the popularity of a movie, and there is no standard way of computing such a score. For example, we can take the following factors into consideration:
- Number of votes for the day.
- Number of views for the day.
- Number of users who marked it as a “favorite” for the day.
- Number of users who added it to their “watchlist” for the day.
- Number of comments.
- Number of ratings (negative vs. positive).
- Number of total votes.
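One way to combine the factors above is a plain weighted sum. The sketch below is only an illustration: the `DailySignals` fields, the `WEIGHTS` values and the way negative ratings are subtracted from positive ones are all assumptions, and a real platform would tune or learn its own blend.

```python
from dataclasses import dataclass


@dataclass
class DailySignals:
    votes_today: int
    views_today: int
    favorites_today: int
    watchlist_adds_today: int
    comments: int
    positive_ratings: int
    negative_ratings: int
    total_votes: int


# Hypothetical weights; the numbers carry no special meaning.
WEIGHTS = {
    "votes_today": 1.0,
    "views_today": 0.5,
    "favorites_today": 2.0,
    "watchlist_adds_today": 1.5,
    "comments": 0.25,
    "net_positive_ratings": 0.5,
    "total_votes": 0.01,
}


def popularity_score(signals, weights=WEIGHTS):
    """Weighted sum of the signals; higher means more popular."""
    features = {
        "votes_today": signals.votes_today,
        "views_today": signals.views_today,
        "favorites_today": signals.favorites_today,
        "watchlist_adds_today": signals.watchlist_adds_today,
        "comments": signals.comments,
        "net_positive_ratings": signals.positive_ratings - signals.negative_ratings,
        "total_votes": signals.total_votes,
    }
    return sum(weights[name] * value for name, value in features.items())


example = DailySignals(
    votes_today=10, views_today=200, favorites_today=5, watchlist_adds_today=4,
    comments=8, positive_ratings=30, negative_ratings=10, total_votes=1000,
)
score = popularity_score(example)
print(f"popularity score: {score:.1f}")
```

The daily counters capture what is trending now, while the small weight on total votes keeps long-standing favourites from disappearing entirely.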
The Strategic Role of Popularity in Recommender Systems Architecture
- Baseline and Fallback: Popularity rankings act as a robust baseline model and an excellent fallback mechanism for the cold-start problem for new users.
- Addressing Cold-Start for Items (Partially): The IMDb weighted rating formula partially mitigates the cold-start problem for new items with few ratings by pulling them towards the global average.
- Limitations and Trade-offs: Popularity-based systems inherently lack personalization. Every user sees the same list, and already-popular items keep gaining exposure (popularity bias), which conflicts with “beyond accuracy” metrics like diversity and novelty.
- Component in Hybrid Systems: Popularity recommendations can be used alongside other algorithms in hybrid recommendation systems, serving as a candidate generator or a feature.
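The fallback role can be sketched as follows. The `recommend` function, the `min_interactions` threshold and the stub standing in for a personalized model are hypothetical. The popular titles are taken from the top of the popularity table earlier in this post.

```python
def recommend(user_history, personalized_fn, popular_items, k=5, min_interactions=3):
    """Use the personalized model when there is enough history and pad
    (or fully replace) its output with popular items otherwise."""
    seen = set(user_history)
    if len(user_history) >= min_interactions:
        candidates = list(personalized_fn(user_history))
    else:
        candidates = []  # cold start: nothing to personalize on yet

    result = []
    for item in candidates + list(popular_items):
        if item in seen or item in result:
            continue  # skip already-watched titles and duplicates
        result.append(item)
        if len(result) == k:
            break
    return result


# Already sorted by popularity, highest first.
popular = ["Minions", "Wonder Woman", "Beauty and the Beast", "Baby Driver", "Big Hero 6"]


def stub_model(history):
    """Stand-in for a real personalized recommender (hypothetical output)."""
    return ["Blade Runner", "Gone Girl"]


new_user = recommend([], stub_model, popular, k=3)
returning_user = recommend(["Minions", "Deadpool", "Avatar"], stub_model, popular, k=4)
print(new_user)
print(returning_user)
```

A new user receives the popularity list as is, while a returning user gets personalized picks first and popular titles only as padding.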
Conclusion
The “Popularity Recommendations” method, though simple, embodies the data science principle of extracting meaningful signals from noisy data, especially when that data is sparse. The Bayesian estimation in the IMDb formula turns raw averages into more statistically reliable popularity scores. While modern systems focus on personalization through techniques like collaborative filtering and deep learning, popularity-based approaches remain vital for providing strong baselines, effective cold-start solutions for new users, and crucial components in sophisticated hybrid architectures. The method’s enduring appeal stems from its ease of implementation and its robust, generalized insight into collective preference.