Online streaming platforms such as Netflix hold large catalogs of movies. If we can build a recommendation system that suggests relevant movies to users based on their historical interactions, we can improve customer satisfaction and, in turn, revenue. The techniques covered here are not limited to movies; they apply to any kind of item for which you want to build a recommendation system. For this case study, you can find the dataset here.

In this project we will build several recommendation systems based on the **ratings** dataset:

- Knowledge/Rank based recommendation system
- Similarity-Based Collaborative filtering
- Matrix Factorization Based Collaborative Filtering

The **ratings** dataset contains the following attributes:

- userId
- movieId
- rating
- timestamp

In [ ]:

```
# uncomment if you are using google colab
#from google.colab import drive
#drive.mount('/content/drive')
```

In [2]:

```
# installing surprise library, only do it for first time
!pip install surprise
```

In [3]:

```
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from surprise import accuracy
# class is used to parse a file containing ratings, data should be in structure - user ; item ; rating
from surprise.reader import Reader
# class for loading datasets
from surprise.dataset import Dataset
# for model tuning model hyper-parameters
from surprise.model_selection import GridSearchCV
# for splitting the rating data in train and test dataset
from surprise.model_selection import train_test_split
# for implementing similarity based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic
# for implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD
from collections import defaultdict
# for implementing cross validation
from surprise.model_selection import KFold
```

In [4]:

```
rating = pd.read_csv('ratings.csv')
```

Let's check the **info** of the data

In [5]:

```
rating.info()
```

- There are **100,004 observations** and **4 columns** in the data.
- All the columns are of **numeric data type**.
- The `timestamp` column is stored as int64, which is not a natural type for a date. We could convert it to a DateTime format, but **we don't need the timestamp for our analysis**. Hence, **we can drop this column**.

In [6]:

```
#Dropping timestamp column
rating = rating.drop(['timestamp'], axis=1)
```

In [7]:

```
# printing the top 5 rows of the dataset
rating.head()
```

Out[7]:

| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 31 | 2.5 |
| 1 | 1 | 1029 | 3.0 |
| 2 | 1 | 1061 | 3.0 |
| 3 | 1 | 1129 | 2.0 |
| 4 | 1 | 1172 | 4.0 |

In [69]:

```
plt.figure(figsize = (12, 4))
sns.countplot(x = "rating", data=rating)
plt.tick_params(labelsize = 10)
plt.title("Distribution of Ratings ", fontsize = 10)
plt.xlabel("Ratings", fontsize = 10)
plt.ylabel("Number of Ratings", fontsize = 10)
plt.show()
```

This plot shows the distribution of all of the different ratings. This plot indicates that the ratings of 3 and greater are more common in the dataset with a rating of 4 being the most common rating.

In [9]:

```
#Finding number of unique users
len(rating['userId'].unique())
```

Out[9]:

671

There are 671 total unique users.

In [10]:

```
#Finding number of unique movies
len(rating['movieId'].unique())
```

Out[10]:

9066

There are 9066 total unique movies.

In [11]:

```
rating.groupby(['userId', 'movieId']).count()
```

Out[11]:

| userId | movieId | rating |
|---|---|---|
| 1 | 31 | 1 |
| | 1029 | 1 |
| | 1061 | 1 |
| | 1129 | 1 |
| | 1172 | 1 |
| ... | ... | ... |
| 671 | 6268 | 1 |
| | 6269 | 1 |
| | 6365 | 1 |
| | 6385 | 1 |
| | 6565 | 1 |

100004 rows × 1 columns

In [12]:

```
rating.groupby(['userId', 'movieId']).count()['rating'].sum()
```

Out[12]:

100004

No movie has been rated more than once by the same user. Grouping by `userId` and `movieId` yields 100004 groups, which equals the total number of ratings in the dataset, so every (user, movie) pair occurs exactly once and each group has a count of 1.
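The same check can be sketched on a toy list of interactions, without pandas, to make the logic explicit: count each (user, movie) pair and compare the number of distinct pairs with the number of rows. The data below is made up for illustration and deliberately contains one repeated pair.

```python
from collections import Counter

# Toy interactions as (userId, movieId, rating); values are illustrative only.
toy_ratings = [
    (1, 31, 2.5),
    (1, 1029, 3.0),
    (2, 31, 4.0),
    (2, 31, 3.5),   # user 2 rates movie 31 a second time
]

# Count how often each (user, movie) pair occurs.
pair_counts = Counter((user, movie) for user, movie, _ in toy_ratings)

# Pairs that occur more than once are repeated interactions.
repeated_pairs = {pair: n for pair, n in pair_counts.items() if n > 1}

# With no repeats, the number of distinct pairs equals the number of rows.
has_repeats = len(pair_counts) != len(toy_ratings)

print(repeated_pairs)
print(has_repeats)
```

On the actual DataFrame, `rating.duplicated(subset=['userId', 'movieId']).any()` should answer the same question in one line.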

In [13]:

```
rating['movieId'].mode()
```

Out[13]:

0 356 dtype: int64

The movie that is most interacted with is the movie with movieId 356.

In [14]:

```
#Plotting distributions of ratings for 341 interactions with movieid 356
plt.figure(figsize=(7,7))
rating[rating['movieId'] == 356]['rating'].value_counts().plot(kind='bar')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()
```

This plot shows the distribution of all interactions with movieid 356.

In [15]:

```
rating['userId'].mode()
```

Out[15]:

0 547 dtype: int64

User 547 has rated the most movies of any user in the dataset.

In [16]:

```
#Finding user-movie interactions distribution
count_interactions = rating.groupby('userId').count()['movieId']
count_interactions
```

Out[16]:

userId 1 20 2 76 3 51 4 204 5 100 ... 667 68 668 20 669 37 670 31 671 115 Name: movieId, Length: 671, dtype: int64

In [62]:

```
#Plotting user-movie interactions distribution
plt.figure(figsize=(15,7))
sns.histplot(count_interactions)
plt.xlabel('Number of Interactions by Users')
plt.show()
```

This plot depicts the distribution of the number of interactions per user. The distribution is concentrated at the low end: most users have rated roughly 0-50 movies. A typical user is therefore likely to have fewer than 100 interactions, and counts approaching 500 are rare.

Rank-based recommendation systems provide recommendations based on the most popular items. This kind of recommendation system is useful when we face the **cold start** problem: a new user joins the system and, because that user has no historical interactions in the dataset, the system cannot produce personalized recommendations. In those cases, we can use a rank-based recommendation system to recommend movies to the new user.

To build the rank-based recommendation system, we take **average** of all the ratings provided to each movie and then rank them based on their average rating.

In [17]:

```
#Calculating average ratings
average_rating = rating.groupby('movieId').mean().rating
#Calculating the count of ratings
count_rating = rating.groupby('movieId').count().rating
#Making a dataframe with the count and average of ratings
final_rating = pd.DataFrame({'avg_rating':average_rating, 'rating_count':count_rating})
```

In [18]:

```
final_rating.head()
```

Out[18]:

| movieId | avg_rating | rating_count |
|---|---|---|
| 1 | 3.872470 | 247 |
| 2 | 3.401869 | 107 |
| 3 | 3.161017 | 59 |
| 4 | 2.384615 | 13 |
| 5 | 3.267857 | 56 |

Now, let's create a function to find the **top n movies** for a recommendation based on the average ratings of movies. We can also add a **threshold for a minimum number of interactions** for a movie to be considered for recommendation.

In [19]:

```
def top_n_movies(data, n, min_interaction=100):
    # Keeping only movies with more than min_interaction ratings
    recommendations = data[data['rating_count'] > min_interaction]
    # Sorting values w.r.t. average rating, highest first
    recommendations = recommendations.sort_values(by='avg_rating', ascending=False)
    # Returning the movieIds (the index) of the top n movies
    return recommendations.index[:n]
```

We can **use this function with different n's and minimum interactions** to get movies to recommend

In [20]:

```
list(top_n_movies(final_rating,5,50))
```

Out[20]:

[858, 318, 913, 1221, 50]

In [21]:

```
list(top_n_movies(final_rating,5,100))
```

Out[21]:

[858, 318, 1221, 50, 527]

In [22]:

```
list(top_n_movies(final_rating,5,200))
```

Out[22]:

[318, 50, 527, 608, 296]
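The lists above change as the interaction threshold grows. A toy example, using made-up movies and ratings in plain Python, may help show why the threshold matters: without it, a movie with a single 5-star rating outranks a movie that many users rated well.

```python
# Toy ratings per movie; the ids and values are made up for illustration.
toy_ratings = {
    "A": [5.0],                      # a single enthusiastic rating
    "B": [4.5, 4.0, 4.5, 4.0, 4.5],  # consistently well rated
    "C": [3.0, 3.5, 2.5, 3.0],
}

def top_n(ratings_by_movie, n, min_interaction=0):
    """Rank movies by average rating, keeping those with more than min_interaction ratings."""
    averages = {
        movie: sum(values) / len(values)
        for movie, values in ratings_by_movie.items()
        if len(values) > min_interaction
    }
    return sorted(averages, key=averages.get, reverse=True)[:n]

no_threshold = top_n(toy_ratings, 2)
with_threshold = top_n(toy_ratings, 2, min_interaction=3)

print(no_threshold)
print(with_threshold)
```

Movie "A" leads the ranking only while the threshold is off; once a minimum number of ratings is required, it drops out and the better-supported movies move up.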

Now that we have seen how to apply the Rank-Based Recommendation System, let's create Collaborative Filtering Based Recommendation Systems.

In this above interactions matrix - out of the users B and C, which user is most likely to interact with the movie - The Terminal?

In this type of recommendation system, **we do not need any information** about the users or items. We only need user-item interaction data to build a collaborative recommendation system. For example:

- **Ratings** provided by users, such as ratings of books on Goodreads or movie ratings on IMDb
- **Likes** of users on Facebook posts or YouTube videos
- **Use/buying** of a product by users, such as purchases on e-commerce sites
- **Reading** of articles by readers on various blogs

There are two types of collaborative filtering:

- Similarity/Neighborhood based
- Model based

Below we build a similarity-based recommendation system that uses `cosine` similarity and KNN to find the users who are the nearest neighbors of a given user.
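To make the similarity measure concrete, here is a minimal sketch of cosine similarity between two users in plain Python. It assumes the common convention of comparing users only on the movies both have rated; the function name and the toy ratings are hypothetical, and this is not the library's implementation.

```python
import math

def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity between two users, using only the items both have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # no overlap, so no evidence of similarity
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy users as {movieId: rating}; values are illustrative only.
user_a = {10: 4.0, 20: 2.0, 30: 5.0}
user_b = {10: 2.0, 20: 1.0, 40: 3.0}   # proportional to user_a on movies 10 and 20
user_c = {10: 1.0, 20: 5.0}            # opposite preference on movies 10 and 20

sim_ab = cosine_similarity(user_a, user_b)
sim_ac = cosine_similarity(user_a, user_c)

print(round(sim_ab, 4), round(sim_ac, 4))
```

Users whose co-rated ratings point in the same direction score close to 1, so `user_b` would be picked as a nearer neighbor of `user_a` than `user_c`.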

We will be using a new library, `surprise`, to build the remaining models. The necessary classes and functions from this library were imported at the top of the notebook.

Below we load the `rating` dataset, which is a pandas DataFrame, into a different format called `surprise.dataset.DatasetAutoFolds`, which is required by this library. To do this we use the classes `Reader` and `Dataset`.

In [28]:

```
# instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5))
# loading the rating dataset
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)
# splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
```

In [29]:

```
sim_options = {'name': 'cosine',
'user_based': True
}
#defining Nearest neighbour algorithm
algo_knn_user = KNNBasic(sim_options = sim_options, verbose=False, random_state=1)
# Train the algorithm on the trainset or fitting the model on train dataset
algo_knn_user.fit(trainset)
#predict ratings for the testset
predictions = algo_knn_user.test(testset)
# Then compute RMSE
accuracy.rmse(predictions)
```

RMSE: 0.9925

Out[29]:

0.9924509041520163

The RMSE is 0.9925.
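For readers who want to see what this number measures, the following is a small plain-Python sketch of RMSE computed over (actual, predicted) pairs. The pairs are made up for illustration and are not taken from the test set.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) rating pairs."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative (actual, predicted) pairs; not taken from the notebook's test set.
toy_predictions = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
toy_rmse = rmse(toy_predictions)

print(round(toy_rmse, 4))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones. An RMSE close to 1 means the predictions are off by about one rating point in the root-mean-square sense.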

Let us now predict the rating for the user with `userId=4` and `movieId=10`:

In [30]:

```
algo_knn_user.predict(4, 10, r_ui=4, verbose=True)
```

user: 4 item: 10 r_ui = 4.00 est = 3.62 {'actual_k': 40, 'was_impossible': False}

Out[30]:

Prediction(uid=4, iid=10, r_ui=4, est=3.6244912065910952, details={'actual_k': 40, 'was_impossible': False})

The predicted rating for a user with userId=4 and for movieId=10 is 3.62.

Let's predict the rating for the same `userId=4`, but for a movie which this user has not interacted with before, i.e. `movieId=3`:

In [31]:

```
algo_knn_user.predict(4, 3, verbose=True)
```

user: 4 item: 3 r_ui = None est = 3.20 {'actual_k': 40, 'was_impossible': False}

Out[31]:

Prediction(uid=4, iid=3, r_ui=None, est=3.202703552548654, details={'actual_k': 40, 'was_impossible': False})

The predicted rating for a user with userId=4 and for movieId=3 is 3.20.

Below we tune the hyperparameters of the `KNNBasic` algorithm. Let's first understand its hyperparameters:

- **k** (int): The (max) number of neighbors to take into account for aggregation (see this note). Default is 40.
- **min_k** (int): The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all ratings. Default is 1.
- **sim_options** (dict): A dictionary of options for the similarity measure. There are four similarity measures available in surprise:
  - cosine
  - msd (default)
  - pearson
  - pearson baseline

For more details please refer to the official documentation: https://surprise.readthedocs.io/en/stable/knn_inspired.html
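The roles of `k` and `min_k` can be illustrated with a short sketch of a neighborhood prediction: take the `k` most similar neighbors who rated the movie, average their ratings weighted by similarity, and fall back to the global mean when fewer than `min_k` usable neighbors are available. This is a simplified illustration of the idea under those assumptions, not the library's code; `knn_predict` and the toy values are hypothetical.

```python
import heapq

def knn_predict(neighbor_info, k=40, min_k=1, global_mean=3.5):
    """Similarity-weighted average of the k most similar neighbors' ratings.

    neighbor_info: list of (similarity, rating) pairs, one per user who rated the item.
    Falls back to global_mean when fewer than min_k usable neighbors remain.
    """
    top_k = heapq.nlargest(k, neighbor_info, key=lambda pair: pair[0])
    # Only neighbors with positive similarity contribute to the average.
    usable = [(sim, r) for sim, r in top_k if sim > 0]
    if not usable or len(usable) < min_k:
        return global_mean
    return sum(sim * r for sim, r in usable) / sum(sim for sim, _ in usable)

# Toy (similarity, rating) pairs for one target user and one movie.
neighbors = [(0.9, 4.0), (0.6, 5.0), (0.3, 2.0), (0.1, 1.0)]

est_k2 = knn_predict(neighbors, k=2)                  # uses the two closest neighbors
est_fallback = knn_predict(neighbors, k=2, min_k=3)   # too few neighbors, global mean

print(round(est_k2, 2), est_fallback)
```

A small `k` trusts only the closest neighbors, while a large `min_k` makes the model more conservative by falling back to the global mean more often.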

In [107]:

```
# setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
'sim_options': {'name': ["cosine",'pearson',"pearson_baseline"],
'user_based': [True], "min_support":[2,4]}}
# performing 3-fold cross validation to tune the hyperparameters
grid_obj = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
# fitting the data
grid_obj.fit(data)
# best RMSE score
print(grid_obj.best_score['rmse'])
# combination of parameters that gave the best RMSE score
print(grid_obj.best_params['rmse'])
```

Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above
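To get a feel for the size of this search, the sketch below expands the same parameter grid into individual combinations, assuming that every value is combined with every other value, including the options nested under `sim_options`. The `expand` helper is hypothetical and written only for this illustration.

```python
from itertools import product

# Same grid as in the cell above, copied here so the sketch is self-contained.
param_grid = {
    'k': [10, 20, 30],
    'min_k': [3, 6, 9],
    'sim_options': {
        'name': ['cosine', 'pearson', 'pearson_baseline'],
        'user_based': [True],
        'min_support': [2, 4],
    },
}

def expand(grid):
    """Return every combination of a grid whose values are lists or nested grids."""
    keys = list(grid)
    choices = [expand(grid[key]) if isinstance(grid[key], dict) else grid[key] for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*choices)]

combinations = expand(param_grid)
n_fits = len(combinations) * 3   # one fit per combination per fold, with cv=3

print(len(combinations), n_fits)
```

Under that assumption the grid holds 54 combinations, so 3-fold cross validation trains 162 models, which is why adding even one more value per hyperparameter noticeably lengthens the search.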

Below we look at the evaluation metrics, RMSE and MAE, on each split to analyze the impact of each hyperparameter value.

In [102]:

```
results_df = pd.DataFrame.from_dict(grid_obj.cv_results)
results_df.head()
```

Out[102]:

| | split0_test_rmse | split1_test_rmse | split2_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | split2_test_mae | mean_test_mae | std_test_mae | rank_test_mae | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_k | param_min_k | param_sim_options |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.999601 | 1.007935 | 1.004073 | 1.003869 | 0.003406 | 42 | 0.773227 | 0.773754 | 0.776122 | 0.774368 | 0.001259 | 21 | 0.355385 | 0.013596 | 2.113928 | 0.012148 | {'k': 10, 'min_k': 3, 'sim_options': {'name': ... | 10 | 3 | {'name': 'cosine', 'user_based': True, 'min_su... |
| 1 | 0.998780 | 1.006857 | 1.001463 | 1.002366 | 0.003359 | 39 | 0.772787 | 0.773139 | 0.774894 | 0.773607 | 0.000922 | 16 | 0.309440 | 0.026484 | 2.092452 | 0.010118 | {'k': 10, 'min_k': 3, 'sim_options': {'name': ... | 10 | 3 | {'name': 'cosine', 'user_based': True, 'min_su... |
| 2 | 1.002915 | 1.013767 | 1.006427 | 1.007703 | 0.004521 | 49 | 0.781391 | 0.785870 | 0.783849 | 0.783703 | 0.001832 | 48 | 0.511408 | 0.014796 | 2.106024 | 0.007105 | {'k': 10, 'min_k': 3, 'sim_options': {'name': ... | 10 | 3 | {'name': 'pearson', 'user_based': True, 'min_s... |
| 3 | 1.002333 | 1.011460 | 1.002914 | 1.005569 | 0.004172 | 44 | 0.781702 | 0.783972 | 0.781196 | 0.782290 | 0.001207 | 47 | 0.416535 | 0.012565 | 2.098976 | 0.012688 | {'k': 10, 'min_k': 3, 'sim_options': {'name': ... | 10 | 3 | {'name': 'pearson', 'user_based': True, 'min_s... |
| 4 | 0.993738 | 1.004707 | 0.996401 | 0.998282 | 0.004671 | 25 | 0.774667 | 0.779212 | 0.775501 | 0.776460 | 0.001976 | 32 | 0.479957 | 0.013051 | 2.077184 | 0.017852 | {'k': 10, 'min_k': 3, 'sim_options': {'name': ... | 10 | 3 | {'name': 'pearson_baseline', 'user_based': Tru... |

Now we will build the final model using the tuned hyperparameter values obtained from grid search cross-validation.

In [63]:

```
sim_options = {'name': 'cosine',
'user_based': True, "min_support":2}
# using the optimal similarity measure for user-user based collaborative filtering
# creating an instance of KNNBasic with optimal hyperparameter values
similarity_algo_optimized_user = KNNBasic(sim_options=sim_options, k=30, min_k=3, random_state=1,verbose=False)
# training the algorithm on the trainset
similarity_algo_optimized_user.fit(trainset)
# predicting ratings for the testset
predictions = similarity_algo_optimized_user.test(testset)
# computing RMSE on testset
accuracy.rmse(predictions)
```

RMSE: 0.9871

Out[63]:

0.9871266024277001

The RMSE is 0.9871.

Let us now predict the rating for the user with `userId=4` and `movieId=10` with the optimized model:

In [105]:

```
similarity_algo_optimized_user.predict(4,10, r_ui=4, verbose=True)
```

user: 4 item: 10 r_ui = 4.00 est = 3.58 {'actual_k': 30, 'was_impossible': False}

Out[105]:

Prediction(uid=4, iid=10, r_ui=4, est=3.583535324429299, details={'actual_k': 30, 'was_impossible': False})

The predicted rating for a user with userId=4 and for movieId=10 is 3.58.

Below we predict the rating for the same `userId=4`, but for a movie which this user has not interacted with before, i.e. `movieId=3`, using the optimized model:

In [106]:

```
similarity_algo_optimized_user.predict(4,3, verbose=True)
```

user: 4 item: 3 r_ui = None est = 3.17 {'actual_k': 30, 'was_impossible': False}

Out[106]:

Prediction(uid=4, iid=3, r_ui=None, est=3.170232402310352, details={'actual_k': 30, 'was_impossible': False})

The predicted rating for a user with userId=4 and for movieId=3 is 3.17.

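We can also find the nearest neighbors of a given user with this KNNBasic model. Below we find the 5 users most similar to the user with id 4, based on the `cosine` similarity that the optimized model was trained with. Note that `get_neighbors` expects and returns surprise's inner ids, which are not necessarily the same as the raw `userId` values; `trainset.to_inner_uid` and `trainset.to_raw_uid` convert between the two.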

In [108]:

```
similarity_algo_optimized_user.get_neighbors(4, k=5)
```

Out[108]:

[357, 220, 590, 491, 647]

Below we will be implementing a function where the input parameters are -

- data: a rating dataset
- user_id: the id of the user for whom we want the recommendations
- top_n: the number of movies we want to recommend
- algo: the algorithm we want to use to predict the ratings

In [32]:

```
def get_recommendations(data, user_id, top_n, algo):
    # creating an empty list to store the recommended movie ids
    recommendations = []
    # creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')
    # extracting the movie ids which user_id has not interacted with yet
    non_interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    # looping through each movie id which user_id has not interacted with yet
    for item_id in non_interacted_movies:
        # predicting the rating of this movie for this user
        est = algo.predict(user_id, item_id).est
        # appending the predicted rating
        recommendations.append((item_id, est))
    # sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)
    # returning the top n movies with the highest predicted ratings for this user
    return recommendations[:top_n]
```

In [110]:

```
recommendations = get_recommendations(rating,4,5,similarity_algo_optimized_user)
```

In [111]:

```
recommendations
```

Out[111]:

[(309, 5), (3038, 4.999999999999999), (98491, 4.899347045407786), (6273, 4.839859025263867), (116, 4.753206589295344)]
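The steps behind these recommendations (find the movies the user has not rated, score each one, keep the best few) can also be sketched without building the full user-item matrix. In the sketch below, `recommend` and the lookup-based predictor are hypothetical stand-ins for the notebook's function and a fitted model, and the data is made up.

```python
import heapq

def recommend(ratings, user_id, top_n, predict):
    """Top-n unseen items for user_id.

    ratings: iterable of (user, item, rating) tuples.
    predict: callable (user, item) -> estimated rating; a stand-in for a fitted model.
    """
    all_items = {item for _, item, _ in ratings}
    seen = {item for user, item, _ in ratings if user == user_id}
    candidates = all_items - seen
    scored = ((item, predict(user_id, item)) for item in candidates)
    # Highest estimate first; ties are broken by item id so the result is deterministic.
    return heapq.nsmallest(top_n, scored, key=lambda pair: (-pair[1], pair[0]))

toy = [(1, 'a', 4.0), (1, 'b', 3.0), (2, 'a', 5.0), (2, 'c', 4.5), (3, 'd', 2.0)]

# Hypothetical predictor: a fixed lookup instead of a trained algorithm.
fake_scores = {'c': 4.2, 'd': 2.5}
top = recommend(toy, 1, 2, lambda user, item: fake_scores[item])

print(top)
```

Working with sets of ids avoids materializing a mostly empty users-by-movies table, and an explicit tie-break makes the order stable when several estimates share the same value.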

In [34]:

```
#defining similarity measure
sim_options = {'name': 'pearson',
'user_based': False}
#defining Nearest neighbour algorithm
algo_knn_item = KNNBasic(sim_options = sim_options,verbose=False)
# Train the algorithm on the trainset or fitting the model on train dataset
algo_knn_item.fit(trainset)
#predict ratings for the testset
predictions = algo_knn_item.test(testset)
# Then compute RMSE
accuracy.rmse(predictions)
```

RMSE: 0.9964

Out[34]:

0.9964454065946875

The RMSE is 0.9964.

Let us now predict the rating for the user with `userId=4` and `movieId=10`:

In [35]:

```
algo_knn_item.predict(4,10, r_ui=4, verbose=True)
```

user: 4 item: 10 r_ui = 4.00 est = 4.42 {'actual_k': 40, 'was_impossible': False}

Out[35]:

Prediction(uid=4, iid=10, r_ui=4, est=4.420788161822849, details={'actual_k': 40, 'was_impossible': False})

The predicted rating for a user with userId=4 and for movieId=10 is 4.42.

Let us now predict the rating for the same `userId=4`, but for a movie which this user has not interacted with before, i.e. `movieId=3`:

In [36]:

```
algo_knn_item.predict(4,3, verbose=True)
```

user: 4 item: 3 r_ui = None est = 4.06 {'actual_k': 40, 'was_impossible': False}

Out[36]:

Prediction(uid=4, iid=3, r_ui=None, est=4.064635736744944, details={'actual_k': 40, 'was_impossible': False})

The predicted rating for a user with userId=4 and for movieId=3 is 4.06.

In [37]:

```
# setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
'sim_options': {'name': ["cosine",'pearson',"pearson_baseline"],
'user_based': [False], "min_support":[2,4]}
}
# performing 3-fold cross validation to tune the hyperparameters
grid_obj = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
# fitting the data
grid_obj.fit(data)
# best RMSE score
print(grid_obj.best_score['rmse'])
# combination of parameters that gave the best RMSE score
print(grid_obj.best_params['rmse'])
```

Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above:

In [39]:

```
results_df = pd.DataFrame.from_dict(grid_obj.cv_results)
results_df.head()
```

Out[39]:

| | split0_test_rmse | split1_test_rmse | split2_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | split2_test_mae | mean_test_mae | std_test_mae | rank_test_mae | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_k | param_min_k | param_sim_options |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.008200 | 1.018035 | 1.005463 | 1.010566 | 0.005398 | 49 | 0.780272 | 0.784731 | 0.776986 | 0.780663 | 0.003174 | 45 | 19.228845 | 0.144610 | 11.522912 | 0.115562 | {'k': 10, 'min_k': 3, 'sim_options': {'name': ... | 10 | 3 | {'name': 'cosine', 'user_based': False, 'min_s... |
| 1 | 1.002260 | 1.010876 | 0.996875 | 1.003337 | 0.005766 | 42 | 0.768556 | 0.773272 | 0.764388 | 0.768739 | 0.003629 | 38 | 13.712201 | 2.123800 | 11.927781 | 0.527770 | {'k': 10, 'min_k': 3, 'sim_options': {'name': ... | 10 | 3 | {'name': 'cosine', 'user_based': False, 'min_s... |
| 2 | 1.025800 | 1.035418 | 1.025767 | 1.028995 | 0.004542 | 54 | 0.799728 | 0.806096 | 0.798486 | 0.801437 | 0.003333 | 52 | 21.718961 | 2.769808 | 11.538013 | 0.114138 | {'k': 10, 'min_k': 3, 'sim_options': {'name': ... | 10 | 3 | {'name': 'pearson', 'user_based': False, 'min_... |
| 3 | 1.011432 | 1.014203 | 1.005328 | 1.010321 | 0.003708 | 48 | 0.781175 | 0.784392 | 0.776458 | 0.780675 | 0.003258 | 46 | 12.749396 | 0.179462 | 11.176713 | 0.269833 | {'k': 10, 'min_k': 3, 'sim_options': {'name': ... | 10 | 3 | {'name': 'pearson', 'user_based': False, 'min_... |
| 4 | 0.968087 | 0.975972 | 0.958479 | 0.967512 | 0.007153 | 10 | 0.732377 | 0.737830 | 0.723254 | 0.731154 | 0.006013 | 7 | 10.298205 | 0.670899 | 10.387762 | 0.071025 | {'k': 10, 'min_k': 3, 'sim_options': {'name': ... | 10 | 3 | {'name': 'pearson_baseline', 'user_based': Fal... |

In [40]:

```
sim_options = {'name': 'pearson_baseline',
'user_based': False, 'min_support': 2}
# creating an instance of KNNBasic with optimal hyperparameter values
similarity_algo_optimized_item = KNNBasic(sim_options=sim_options, k=30, min_k=6,verbose=False)
# training the algorithm on the trainset
similarity_algo_optimized_item.fit(trainset)
# predicting ratings for the testset
predictions = similarity_algo_optimized_item.test(testset)
# computing RMSE on testset
accuracy.rmse(predictions)
```

RMSE: 0.9495

Out[40]:

0.9494691122446014

The RMSE is 0.9495.

Let us now predict the rating for the user with `userId=4` and `movieId=10` with the optimized model as shown below:

In [41]:

```
similarity_algo_optimized_item.predict(4,10, r_ui=4, verbose=True)
```

user: 4 item: 10 r_ui = 4.00 est = 4.18 {'actual_k': 30, 'was_impossible': False}

Out[41]:

Prediction(uid=4, iid=10, r_ui=4, est=4.176345863496977, details={'actual_k': 30, 'was_impossible': False})

The predicted rating for a user with userId=4 and for movieId=10 is 4.18.

Let us now predict the rating for the same `userId=4`, but for a movie which this user has not interacted with before, i.e. `movieId=3`, by using the optimized model:

In [42]:

```
similarity_algo_optimized_item.predict(4, 3, verbose=True)
```

user: 4 item: 3 r_ui = None est = 4.36 {'actual_k': 30, 'was_impossible': False}

Out[42]:

Prediction(uid=4, iid=3, r_ui=None, est=4.3587850293101775, details={'actual_k': 30, 'was_impossible': False})

The predicted rating for a user with userId=4 and for movieId=3 is 4.36.

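We can also find the nearest neighbors of a given item with this KNNBasic model. Because the model is item-based (`user_based` is `False`), the call below finds the 5 items most similar to the item with inner id 4, based on the `pearson_baseline` similarity that the optimized model was trained with. Note that `get_neighbors` expects and returns surprise's inner ids rather than raw `movieId` values; `trainset.to_inner_iid` and `trainset.to_raw_iid` convert between the two.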

In [43]:

```
similarity_algo_optimized_item.get_neighbors(4, k=5)
```

Out[43]:

[1347, 311, 1445, 778, 108]

In [44]:

```
recommendations = get_recommendations(rating, 4, 5, similarity_algo_optimized_item)
```

In [45]:

```
recommendations
```

Out[45]:

[(190, 5), (449, 5), (1046, 5), (1365, 5), (1398, 5)]

Model-based collaborative filtering is a **personalized recommendation system**: the recommendations are based on the past behavior of the user and do not depend on any additional information. We use **latent features** to find recommendations for each user.

**Latent Features:** The features that are not present in the empirical data but can be inferred from the data. For example:

Now if we notice the above movies closely:

Here **Action**, **Romance**, **Suspense** and **Comedy** are latent features of the corresponding movies. Similarly, we can compute the latent features for users as shown below:

SVD is used to compute the latent features from the user-item matrix that we learned about earlier. However, SVD does not work when there are missing values in the user-item matrix.

First we need to convert the below movie-rating dataset:

into a user-item matrix as shown below:

We have already done this above while computing cosine similarities.

**SVD decomposes this above matrix into three separate matrices:**

- U matrix
- Sigma matrix
- V transpose matrix

The U matrix is an n x k matrix, where:

- n is the number of users
- k is the number of latent features

The Sigma matrix is a k x k matrix, where:

- k is the number of latent features
- Each diagonal entry is a singular value of the original interaction matrix

The V transpose matrix is a k x m matrix, where:

- k is the number of latent features
- m is the number of items

In [46]:

```
# using SVD matrix factorization
algo_svd = SVD()
# training the algorithm on the trainset
algo_svd.fit(trainset)
# predicting ratings for the testset
predictions = algo_svd.test(testset)
# computing RMSE on the testset
accuracy.rmse(predictions)
```

RMSE: 0.9023

Out[46]:

0.9023308579642529

The RMSE is 0.9023.

Let us now predict the rating for the user with `userId=4` and `movieId=10`:

In [47]:

```
algo_svd.predict(4, 10, r_ui=4, verbose=True)
```

user: 4 item: 10 r_ui = 4.00 est = 4.09 {'was_impossible': False}

Out[47]:

Prediction(uid=4, iid=10, r_ui=4, est=4.087940020019515, details={'was_impossible': False})

The predicted rating for a user with userId=4 and for movieId=10 is 4.09.

Let us now predict the rating for the same `userId=4`, but for a movie which this user has not interacted with before, i.e. `movieId=3`:

In [48]:

```
algo_svd.predict(4, 3, verbose=True)
```

user: 4 item: 3 r_ui = None est = 3.79 {'was_impossible': False}

Out[48]:

Prediction(uid=4, iid=3, r_ui=None, est=3.78670657499283, details={'was_impossible': False})

The predicted rating for a user with userId=4 and for movieId=3 is 3.79.

In SVD, the rating is predicted as:

$$\hat{r}_{u i}=\mu+b_{u}+b_{i}+q_{i}^{T} p_{u}$$

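Here $\mu$ is the global mean of all ratings, $b_{u}$ and $b_{i}$ are the user and item biases, and $p_{u}$ and $q_{i}$ are the latent factor vectors of user $u$ and item $i$. If user $u$ is unknown, then the bias $b_{u}$ and the factors $p_{u}$ are assumed to be zero. The same applies for item $i$ with $b_{i}$ and $q_{i}$.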
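As a quick worked example of the prediction formula, the sketch below evaluates it in plain Python for one known user and for an unknown user. All numbers are made up for illustration; a fitted model would learn the biases and factors from the training set.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i):
    """Estimate r_ui as mu + b_u + b_i + q_i . p_u."""
    dot = sum(q * p for q, p in zip(q_i, p_u))
    return mu + b_u + b_i + dot

# Illustrative values only; a fitted model would learn these from the training set.
mu = 3.5
known = predict_rating(mu, b_u=0.25, b_i=-0.5, p_u=[0.5, 1.0], q_i=[1.0, 0.5])

# Unknown user: the user bias and user factors are taken to be zero.
unknown_user = predict_rating(mu, b_u=0.0, b_i=-0.5, p_u=[0.0, 0.0], q_i=[1.0, 0.5])

print(known, unknown_user)
```

For the unknown user the estimate reduces to the global mean plus the item bias, which is the best the model can do without any history for that user.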

To estimate all the unknowns, we minimize the following regularized squared error:

$$\sum_{r_{u i} \in R_{\text {train }}}\left(r_{u i}-\hat{r}_{u i}\right)^{2}+\lambda\left(b_{i}^{2}+b_{u}^{2}+\left\|q_{i}\right\|^{2}+\left\|p_{u}\right\|^{2}\right)$$

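The minimization is performed by a very straightforward **stochastic gradient descent**, where $e_{u i}=r_{u i}-\hat{r}_{u i}$ is the prediction error, $\gamma$ is the learning rate and $\lambda$ is the regularization strength: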

$$\begin{aligned} b_{u} & \leftarrow b_{u}+\gamma\left(e_{u i}-\lambda b_{u}\right) \\ b_{i} & \leftarrow b_{i}+\gamma\left(e_{u i}-\lambda b_{i}\right) \\ p_{u} & \leftarrow p_{u}+\gamma\left(e_{u i} \cdot q_{i}-\lambda p_{u}\right) \\ q_{i} & \leftarrow q_{i}+\gamma\left(e_{u i} \cdot p_{u}-\lambda q_{i}\right) \end{aligned}$$
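The update rules can be traced with a small plain-Python sketch that applies them repeatedly to a single observed rating. The function name, the starting values and the learning rate and regularization values are illustrative assumptions; a real training run loops over all ratings in the training set for several epochs.

```python
def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One stochastic gradient descent update for a single observed rating r_ui."""
    dot = sum(q * p for q, p in zip(q_i, p_u))
    err = r_ui - (mu + b_u + b_i + dot)          # e_ui
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the old values of p_u and q_i.
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i, err

# Illustrative starting point: zero biases and small factors.
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, -0.1], [0.1, 0.1]

errors = []
for _ in range(50):
    b_u, b_i, p_u, q_i, err = sgd_step(4.5, mu, b_u, b_i, p_u, q_i)
    errors.append(abs(err))

print(round(errors[0], 3), round(errors[-1], 3))
```

Each step moves the biases and factors a little in the direction that reduces the error, while the regularization term pulls them back toward zero; `lr` plays the role of $\gamma$ and `reg` the role of $\lambda$.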

There are many hyperparameters to tune in this algorithm, you can find a full list of hyperparameters here

Below we will be tuning only three hyperparameters:

- **n_epochs**: The number of iterations of the SGD algorithm
- **lr_all**: The learning rate for all parameters
- **reg_all**: The regularization term for all parameters

In [49]:

```
# set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
'reg_all': [0.2, 0.4, 0.6]}
# performing 3-fold gridsearch cross validation
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
# fitting data
gs.fit(data)
# best RMSE score
print(gs.best_score['rmse'])
# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
```

0.8938749155641837 {'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.2}

In [50]:

```
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df.head()
```

Out[50]:

| | split0_test_rmse | split1_test_rmse | split2_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | split2_test_mae | mean_test_mae | std_test_mae | rank_test_mae | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_n_epochs | param_lr_all | param_reg_all |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.946593 | 0.943374 | 0.939202 | 0.943057 | 0.003026 | 25 | 0.739959 | 0.737531 | 0.736375 | 0.737955 | 0.001494 | 25 | 3.379065 | 0.007287 | 0.390260 | 0.010506 | {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.2} | 10 | 0.001 | 0.2 |
| 1 | 0.951260 | 0.947063 | 0.943786 | 0.947370 | 0.003059 | 26 | 0.745191 | 0.742214 | 0.741280 | 0.742895 | 0.001668 | 26 | 3.358689 | 0.022910 | 0.367266 | 0.018548 | {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.4} | 10 | 0.001 | 0.4 |
| 2 | 0.955922 | 0.952691 | 0.949034 | 0.952549 | 0.002814 | 27 | 0.750363 | 0.748347 | 0.747016 | 0.748576 | 0.001376 | 27 | 3.424530 | 0.093007 | 0.361025 | 0.010906 | {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.6} | 10 | 0.001 | 0.6 |
| 3 | 0.910209 | 0.906131 | 0.903036 | 0.906459 | 0.002937 | 10 | 0.702858 | 0.700852 | 0.700916 | 0.701542 | 0.000931 | 9 | 3.546953 | 0.013166 | 0.371510 | 0.009060 | {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.2} | 10 | 0.005 | 0.2 |
| 4 | 0.917776 | 0.913339 | 0.910198 | 0.913771 | 0.003109 | 15 | 0.710949 | 0.709001 | 0.708324 | 0.709425 | 0.001113 | 15 | 3.365576 | 0.032710 | 0.357900 | 0.017146 | {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4} | 10 | 0.005 | 0.4 |

In [52]:

```
# building the optimized SVD model using optimal hyperparameter search
svd_algo_optimized = SVD(n_epochs=30, lr_all=0.01, reg_all=0.2)
# training the algorithm on the trainset
svd_algo_optimized.fit(trainset)
# predicting ratings for the testset
predictions = svd_algo_optimized.test(testset)
# computing RMSE
accuracy.rmse(predictions)
```

RMSE: 0.8955

Out[52]:

0.8954630064689425

**Let us now predict the rating for the user with userId=4 and movieId=10 with the optimized model**

In [53]:

```
svd_algo_optimized.predict(4, 10, r_ui=4, verbose=True)
```

user: 4 item: 10 r_ui = 4.00 est = 3.99 {'was_impossible': False}

Out[53]:

Prediction(uid=4, iid=10, r_ui=4, est=3.987473187515993, details={'was_impossible': False})

The predicted rating for a user with userId=4 and for movieId=10 is 3.99.

In [54]:

```
svd_algo_optimized.predict(4, 3, verbose=True)
```

user: 4 item: 3 r_ui = None est = 3.61 {'was_impossible': False}

Out[54]:

Prediction(uid=4, iid=3, r_ui=None, est=3.61423730206565, details={'was_impossible': False})

In [55]:

```
get_recommendations(rating, 4, 5, svd_algo_optimized)
```

Out[55]:

[(1192, 5), (926, 4.950016862333721), (1948, 4.946539035975806), (3310, 4.945746618737897), (116, 4.930743682667831)]
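The output is a list of (movieId, predicted rating) pairs sorted from highest to lowest estimate. `get_recommendations` is defined earlier in the notebook; the sketch below is not that function, only a minimal illustration of the general idea of keeping the top n estimates among items the user has not rated yet. The name `top_n_unseen` and the toy values are assumptions.

```python
def top_n_unseen(predicted, already_rated, n=5):
    """Return the n highest (item_id, estimate) pairs for items not yet rated."""
    candidates = [(item, est) for item, est in predicted.items()
                  if item not in already_rated]
    # highest estimated rating first
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:n]

# Hypothetical estimates for one user, for illustration only
toy_predicted = {10: 3.9, 20: 4.8, 30: 4.1, 40: 4.6, 50: 2.7}
toy_rated = {20}  # the user has already rated item 20
top_two = top_n_unseen(toy_predicted, toy_rated, n=2)
print(top_two)  # [(40, 4.6), (30, 4.1)]
```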

Below we compare the predicted ratings with the actual ratings for movies that a user has already watched. This helps us understand how close our predictions are to the ratings the user actually gave.

In [56]:

```
def predict_already_interacted_ratings(data, user_id, algo):
    # creating an empty list to store (movieId, actual rating, predicted rating) tuples
    recommendations = []
    # creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')
    # extracting the movie ids which user_id has already interacted with
    interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].notnull()].index.tolist()
    # looping through each movie id which user_id has already interacted with
    for item_id in interacted_movies:
        # extracting the actual rating
        actual_rating = user_item_interactions_matrix.loc[user_id, item_id]
        # predicting the rating for this already-rated movie
        predicted_rating = algo.predict(user_id, item_id).est
        # appending the actual and predicted ratings
        recommendations.append((item_id, actual_rating, predicted_rating))
    # sorting by actual rating (index 1 of each tuple) in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)
    # returning actual and predicted ratings for every movie this user has rated
    return pd.DataFrame(recommendations, columns=['movieId', 'actual_rating', 'predicted_rating'])
```

Here we compare the ratings predicted by the `similarity based recommendation` system against the actual ratings for `userId=7`

In [57]:

```
predicted_ratings_for_interacted_movies = predict_already_interacted_ratings(rating, 7, similarity_algo_optimized_item)
df = predicted_ratings_for_interacted_movies.melt(id_vars='movieId', value_vars=['actual_rating', 'predicted_rating'])
sns.displot(data=df, x='value', hue='variable', kde=True);
```

Below we compare the ratings predicted by the `matrix factorization based recommendation` system against the actual ratings for `userId=7`

In [58]:

```
predicted_ratings_for_interacted_movies = predict_already_interacted_ratings(rating, 7, svd_algo_optimized)
df = predicted_ratings_for_interacted_movies.melt(id_vars='movieId', value_vars=['actual_rating', 'predicted_rating'])
sns.displot(data=df, x='value', hue='variable', kde=True);
```
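The distribution plots give a visual comparison. A single number such as the mean absolute error between actual and predicted ratings can complement them. The sketch below works on plain (movieId, actual, predicted) tuples with made-up values. The helper name `mean_absolute_error` is an assumption, and with the DataFrame returned above the rows could be obtained with `predicted_ratings_for_interacted_movies.itertuples(index=False)`.

```python
def mean_absolute_error(rows):
    """Average absolute gap between actual and predicted ratings.

    rows: iterable of (movie_id, actual_rating, predicted_rating) tuples.
    """
    errors = [abs(actual - predicted) for _, actual, predicted in rows]
    return sum(errors) / len(errors)

# Hypothetical rows, for illustration only
toy_rows = [(1, 5.0, 4.5), (2, 3.0, 3.5), (3, 4.0, 4.0), (4, 2.0, 3.0)]
toy_mae = mean_absolute_error(toy_rows)
print(toy_mae)  # 0.5
```

A lower value means the model's predictions for this user sit closer to the ratings the user actually gave.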

In [59]:

```
# instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5))
# loading the rating dataset
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)
# splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
```

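RMSE is not the only metric we can use here. We can also examine two fundamental measures, precision and recall, both computed at a cutoff k. Precision@k is the proportion of a user's top-k recommended items that are relevant, and recall@k is the proportion of the user's relevant items that appear among the top-k recommendations. An item counts as relevant when its actual rating is at or above a threshold, and as recommended when its predicted rating is at or above the same threshold.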

See the Precision and Recall @ k section of your notebook and follow the instructions to compute various precision/recall values at various values of k.

To learn more about precision and recall in recommendation systems, refer to these links:

https://surprise.readthedocs.io/en/stable/FAQ.html

https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54
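To make precision@k and recall@k concrete before applying them to whole prediction sets, here is a small worked example for a single user. It follows the same logic as the function from the Surprise FAQ, but the helper name `precision_recall_single_user` and the toy ratings are assumptions made for illustration.

```python
def precision_recall_single_user(ratings, k=10, threshold=3.5):
    """ratings: list of (estimated, actual) rating pairs for one user."""
    # rank the user's items by estimated rating, highest first
    ranked = sorted(ratings, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    # relevant items: actual rating at or above the threshold
    n_rel = sum(actual >= threshold for _, actual in ranked)
    # recommended items in the top k: estimate at or above the threshold
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    # items in the top k that are both recommended and relevant
    n_rel_and_rec_k = sum(est >= threshold and actual >= threshold
                          for est, actual in top_k)
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0
    return precision, recall

# Hypothetical (estimated, actual) ratings for one user
toy = [(4.8, 5.0), (4.2, 3.0), (3.9, 4.0), (3.6, 4.5), (3.0, 4.0), (2.5, 2.0)]
precision, recall = precision_recall_single_user(toy, k=3)
print(round(precision, 3), recall)
```

With k=3, all three top items are recommended and two of them are relevant, so precision is 2/3. The user has four relevant items overall, so recall is 2/4.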

In [64]:

```
#function can be found on surprise documentation FAQs
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""
    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))
    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])
        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    return precisions, recalls
```

In [66]:

```
# A basic cross-validation iterator.
kf = KFold(n_splits=5)
# Make list of k values
K = [5, 10]
# Make list of models
models = [algo_knn_user, similarity_algo_optimized_user, algo_knn_item, similarity_algo_optimized_item, algo_svd, svd_algo_optimized]
for k in K:
    for model in models:
        # the class name alone does not distinguish baseline from tuned models;
        # results are printed in the same order as the models list
        print('> k={}, model={}'.format(k, model.__class__.__name__))
        p = []
        r = []
        # note: this loop rebinds the trainset and testset variables created earlier
        for trainset, testset in kf.split(data):
            model.fit(trainset)
            predictions = model.test(testset, verbose=False)
            precisions, recalls = precision_recall_at_k(predictions, k=k, threshold=3.5)
            # Precision and recall can then be averaged over all users
            p.append(sum(prec for prec in precisions.values()) / len(precisions))
            r.append(sum(rec for rec in recalls.values()) / len(recalls))
        # average the per-fold values over the 5 folds
        print('-----> Precision: ', round(sum(p) / len(p), 3))
        print('-----> Recall: ', round(sum(r) / len(r), 3))
```

**Answers:**

7.1) The baseline user-user model "algo_knn_user" had a slightly lower RMSE than the baseline item-item model "algo_knn_item": 0.9925 versus 0.9964.

The predictions for userId=4 and movieId=10 were 3.62 (user-user) and 4.42 (item-item), so for this particular prediction the user-user model was closer to the actual rating of 4.

Also, for this prediction the item-item model over-predicts while the user-user model under-predicts.

7.2) The baseline models have a greater RMSE than the tuned models. However, the baseline user-user model predicted a rating (3.62) that was closer to the actual rating than the tuned user-user model, which predicted 3.58. The item-item models behaved the opposite way: the tuned model was closer to the actual rating than the baseline model.

7.3) Both approaches are forms of collaborative filtering. The difference is that the matrix factorization model learns latent features for each user and each movie from past ratings (such features may correspond to attributes like genre), whereas the similarity-based models predict a rating from the ratings of related users or related items rather than from a learned representation of the user.

The respective RMSE values for the algorithms:

- algo_knn_user: 0.9925
- similarity_algo_optimized_user: 0.9871
- algo_knn_item: 0.9964
- similarity_algo_optimized_item: 0.9495
- algo_svd: 0.9023
- svd_algo_optimized: 0.8955

The precision and recall for all models are shown in Question 6. Precision is highest for the baseline user-user model with k=5 and lowest for the baseline item-item model with k=10. Recall is highest for the baseline user-user model with k=10 and lowest for the baseline item-item model with k=5.

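7.4) Judged by RMSE, the tuned models and the SVD models did improve on the baselines: RMSE decreased with each new algorithm, and a lower RMSE means the predicted ratings are closer to the actual ratings. Some of the tuned and SVD models also predicted better for the specific userId=4. Precision, however, decreased as well, which is a step backwards for that metric, and it remained highest for the baseline user-user model. So the gain in rating accuracy did not carry over to precision at k.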

In this case study, we saw three different ways of building recommendation systems:

- rank-based using averages
- similarity-based collaborative filtering
- model-based (matrix factorization) collaborative filtering

We also looked at the advantages and disadvantages of these recommendation systems and when to use each kind. Once we have built these recommendation systems, we can use **A/B Testing** to measure their effectiveness.

Here is an article explaining how Amazon uses **A/B Testing** to measure the effectiveness of its recommendation systems.