Building a Recommender System

Providing Recommendations based on Collaborative Filtering

Anyone who spends time on the web is all too familiar with recommender systems. They are all around us, and depending on your point of view they are either pesky ads that listen in on all of our online activity or a holy grail for discovering new information and purchases. Love them or hate them, recommender systems are widespread for a reason: they have played an integral part in the success of many large online businesses such as Amazon, Google and Spotify, and reportedly generate up to 30% of total revenue for some of these companies.

The goal of this project is to build a basic recommender model in order to become more familiar with aspects of Machine Learning. To that end, the popular MovieLens dataset was used to make movie recommendations to users based on how they rate certain movies. The MovieLens dataset contains over 100,000 real-world ratings covering many different films and users. It was produced by having volunteers rate movies on a scale from 1 to 5 based on how much they enjoyed them.

The model works by having new users rate several films upon arrival to figure out what kind of films they enjoy, and then making suggestions based on those initial ratings. The movies suggested to a new user are movies that were also enjoyed by people who rated similarly to them. This user-user type of recommendation system is called Collaborative Filtering. The model in this project was produced using Alternating Least Squares (ALS) in PySpark.
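To make the idea of "people who rated similarly" concrete, here is a minimal sketch of user-user similarity in plain Python. It is only an illustration: the toy ratings, the user names and the helper functions `cosine_similarity`, `most_similar_user` and `recommend` are all made up for this example, and the project's actual model uses ALS rather than this nearest-neighbour lookup.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating on a 1-5 scale}
ratings = {
    "new_user": {"Toy Story": 5, "Heat": 1, "Alien": 4},
    "alice": {"Toy Story": 5, "Heat": 2, "Alien": 5, "Up": 5},
    "bob": {"Toy Story": 1, "Heat": 5, "Alien": 2, "Casino": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def most_similar_user(target, ratings):
    """Return the other user whose ratings are closest to the target's."""
    others = [u for u in ratings if u != target]
    return max(others, key=lambda u: cosine_similarity(ratings[target], ratings[u]))


def recommend(target, ratings):
    """Movies the nearest neighbour rated that the target has not, best first."""
    neighbour = most_similar_user(target, ratings)
    unseen = {m: r for m, r in ratings[neighbour].items() if m not in ratings[target]}
    return sorted(unseen, key=unseen.get, reverse=True)


neighbour = most_similar_user("new_user", ratings)
suggestions = recommend("new_user", ratings)
print(neighbour, suggestions)
```

In this toy data the new user's tastes line up with alice's far more than with bob's, so the suggestion comes from the movies alice rated that the new user has not seen yet.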

Starting the Project and First Impressions

I selected the small MovieLens dataset for this project. Two dataframes were imported, one from ratings.csv and one from movies.csv, and the information each contains is described below.

In the ratings.csv file, the columns are a UserID to keep track of each person and how they rated movies, a movieID to identify which movie was rated, the rating given by the user, and a timestamp recording when the rating was made. The timestamp was not important for the scope of this project, so that column was dropped.

In the movies.csv file, we see a movieID, which is how the two dataframes relate to one another, a title giving the name of each movie, and the genre labels for each movie.
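The loading step can be sketched with the standard library alone. This is not the project's PySpark code: the rows below are hypothetical samples, and the exact column names (`userId`, `movieId`, `rating`, `timestamp`, `title`, `genres`) are an assumption about the file headers.

```python
import csv
import io

# Hypothetical sample rows laid out like ratings.csv and movies.csv
RATINGS_CSV = """userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
2,1,3.0,1445714835
"""
MOVIES_CSV = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Keep only the columns needed here; the timestamp is dropped
ratings = [
    {"userId": int(r["userId"]), "movieId": int(r["movieId"]), "rating": float(r["rating"])}
    for r in load_rows(RATINGS_CSV)
]

# Index movies by movieId, the key the two tables share
movies = {int(m["movieId"]): m for m in load_rows(MOVIES_CSV)}

# Attach title and genres to every rating whose movieId is known
joined = [
    {**r, "title": movies[r["movieId"]]["title"], "genres": movies[r["movieId"]]["genres"]}
    for r in ratings
    if r["movieId"] in movies
]

for row in joined:
    print(row)
```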

Exploratory Data Analysis

Question 1: What are the counts for User Ratings?

Looking at the value counts for the user ratings, we can see that most movies receive a rating of 3 or higher, with the most common rating being 4.
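As a rough sketch of how such counts can be produced, the snippet below tallies a small list of ratings with `collections.Counter`. The `sample_ratings` list is hypothetical and only stands in for the real rating column.

```python
from collections import Counter

# Hypothetical ratings on the 1-5 scale, standing in for the rating column
sample_ratings = [4, 5, 3, 4, 2, 4, 3, 5, 1, 4]

counts = Counter(sample_ratings)

# The single most frequent rating value
most_common_rating, _ = counts.most_common(1)[0]

# Fraction of all ratings that are 3 or higher
share_three_plus = sum(n for r, n in counts.items() if r >= 3) / len(sample_ratings)

for rating in sorted(counts):
    print(rating, counts[rating])
```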

Question 2: What are the average ratings for each genre?

Most genres were rated around 3.5 on average, although War, Documentary and Film-Noir tended to score higher than the rest. Horror received lower ratings on average than the other genres.
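One way to compute per-genre averages is to split each movie's genre string and credit the rating to every genre listed. The sketch below assumes the genres are pipe-separated, and the `rated_movies` pairs are hypothetical rather than taken from the dataset.

```python
from collections import defaultdict

# Hypothetical (rating, genres) pairs; genres are assumed to be pipe-separated
rated_movies = [
    (5.0, "Drama|War"),
    (4.0, "Documentary"),
    (2.0, "Horror"),
    (3.0, "Horror|Thriller"),
    (4.0, "Drama"),
]

# genre -> [sum of ratings, number of ratings]
totals = defaultdict(lambda: [0.0, 0])
for rating, genres in rated_movies:
    for genre in genres.split("|"):
        totals[genre][0] += rating
        totals[genre][1] += 1

genre_means = {genre: total / count for genre, (total, count) in totals.items()}

# Print genres from highest to lowest average rating
for genre, mean in sorted(genre_means.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, round(mean, 2))
```

Note that a movie with several genres contributes its rating to each of them, so the genre averages are not independent of one another.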

Question 3: Which movies received the highest ratings?

I computed the average rating for each movie, then took the top 100 in descending order. The insight here is that many movies in this dataset have an average rating of a perfect 5, which most likely reflects movies that were rated only a handful of times.
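The ranking step can be sketched as follows. The data and the `top_movies` helper are hypothetical, and the `min_ratings` parameter is an addition of this sketch rather than something the project used; it shows how requiring a minimum number of ratings changes which movies reach the top.

```python
from collections import defaultdict

# Hypothetical (title, rating) pairs
movie_ratings = [
    ("Movie A", 5),
    ("Movie B", 4), ("Movie B", 5), ("Movie B", 3),
    ("Movie C", 5), ("Movie C", 4),
    ("Movie D", 2),
]


def top_movies(movie_ratings, n, min_ratings=1):
    """Average each movie's ratings and return the n best, highest first."""
    by_movie = defaultdict(list)
    for title, rating in movie_ratings:
        by_movie[title].append(rating)
    averages = {
        title: sum(rs) / len(rs)
        for title, rs in by_movie.items()
        if len(rs) >= min_ratings  # skip movies with too few ratings
    }
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:n]


top_all = top_movies(movie_ratings, n=2)
top_filtered = top_movies(movie_ratings, n=2, min_ratings=2)
print(top_all)
print(top_filtered)
```

With no minimum, the movie rated once with a 5 tops the list; with `min_ratings=2` it drops out and better-supported averages take its place.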

Building the Model

The model was built using ALS, and its parameters were tuned using cross-validation. The model takes in a user's movie ratings and returns a given number of recommendations based on those ratings. To do so, it predicts the score the user would give each candidate movie and recommends the movies with the highest predicted ratings.
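To show the alternating idea behind ALS without a Spark dependency, here is a deliberately tiny pure-Python sketch. It is not the project's PySpark model: it uses a single latent factor per user and per movie (rank 1), hypothetical ratings, and made-up names (`als_rank1`, `objective`, `recommend`, `lam`). Parameters like the regularization strength `lam` and the number of iterations are the kind of values that cross-validation would tune.

```python
# Hypothetical observed ratings: (user, movie) -> rating
observed = {
    ("u1", "A"): 5, ("u1", "B"): 4,
    ("u2", "A"): 4, ("u2", "B"): 3, ("u2", "C"): 2,
    ("u3", "B"): 5, ("u3", "C"): 3,
}


def objective(x, y, observed, lam):
    """Squared error on observed ratings plus L2 regularization."""
    err = sum((r - x[u] * y[i]) ** 2 for (u, i), r in observed.items())
    reg = lam * (sum(v * v for v in x.values()) + sum(v * v for v in y.values()))
    return err + reg


def als_rank1(observed, lam=0.1, iterations=10):
    """Rank-1 ALS: alternately solve for user and movie factors in closed form."""
    users = sorted({u for u, _ in observed})
    items = sorted({i for _, i in observed})
    x = {u: 1.0 for u in users}  # one factor per user
    y = {i: 1.0 for i in items}  # one factor per movie
    history = [objective(x, y, observed, lam)]
    for _ in range(iterations):
        # Hold movie factors fixed; each user's factor has a closed-form solution
        for u in users:
            num = sum(r * y[i] for (uu, i), r in observed.items() if uu == u)
            den = sum(y[i] ** 2 for (uu, i) in observed if uu == u) + lam
            x[u] = num / den
        # Hold user factors fixed; solve for each movie's factor
        for i in items:
            num = sum(r * x[u] for (u, ii), r in observed.items() if ii == i)
            den = sum(x[u] ** 2 for (u, ii) in observed if ii == i) + lam
            y[i] = num / den
        history.append(objective(x, y, observed, lam))
    return x, y, history


def recommend(user, x, y, observed, n=1):
    """Predict ratings for movies the user has not rated and return the top n."""
    unseen = [i for i in y if (user, i) not in observed]
    ranked = sorted(unseen, key=lambda i: x[user] * y[i], reverse=True)
    return [(i, x[user] * y[i]) for i in ranked[:n]]


x, y, history = als_rank1(observed)
recs = recommend("u1", x, y, observed)
print(recs)
```

Because each half-step exactly minimizes the objective for one set of factors while the other is held fixed, the values in `history` never increase. A real model would use more than one latent factor per user and movie.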

Modernizing the Model

To push the project further, I attempted to modernize the model by converting the ratings to a binary scale to match a Thumbs Up or Thumbs Down system, a format that has become popular among recommender engines. Any rating under 3 was treated as a 0, or Thumbs Down, while a rating of 3 or above was treated as a 1, or Thumbs Up.
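The conversion rule is simple enough to sketch in a few lines. The function name `to_thumbs` and the sample list are hypothetical; the threshold of 3 follows the rule described above.

```python
def to_thumbs(rating, threshold=3):
    """Return 1 (Thumbs Up) for ratings at or above the threshold, else 0 (Thumbs Down)."""
    return 1 if rating >= threshold else 0


sample_ratings = [1, 2, 3, 4, 5]
thumbs = [to_thumbs(r) for r in sample_ratings]
print(thumbs)
```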

Future Work

My future goals for this project are to improve the system, either by trying different parameters on the current model or by trying model types other than ALS altogether. More intensive EDA could also yield further insight into how the model could be improved. Finally, the project would be a good opportunity to learn how to build a user interface, which would make it much easier for the average person to use the code and receive film recommendations.

This project can be found at:
