Content-based movie recommendation system for Netflix using natural language processing
All of us at some point have wondered how we end up receiving such precise recommendations from the streaming site Netflix. The data users share with the company, and the interests they express through their content choices, are used to build personalized recommendations. Companies in the industry generally use one of two filtering approaches: content-based filtering or collaborative filtering. Although the names are largely self-explanatory, I have gone ahead and explained what we mean by content-based filtering for the purpose of understanding our project.
This type of recommendation filter uses an algorithm that simply selects content similar to the user's previous choices. Let us take an example to better understand content-based filters: suppose a user recently subscribed to Netflix and just watched a comedy movie. If Netflix used a content-based filter (the content feature being genre in our example) to provide recommendations, the user would most likely be recommended more comedy movies. To get recommendations from other genres, the user would have to try out movies from those genres on their own. Unlike collaborative filtering, this type of filter does not use data from other users.
Understanding the Mathematical Underpinnings:
- Text vectorization methods
- Cosine distance
1. Text vectorization methods:
Bag of words :
Using CountVectorizer from sklearn we convert a string of words (text) to a vector. CountVectorizer forms an array with features (columns) as words from the corpus of words present in the text data and sentences as data points (rows). If the word exists in the sentence, the feature value for that sentence becomes 1, else 0. This method ends up creating a vector of 1's and 0's with dimensions equal to the number of words (features) in our input text corpus. The vector representations of sentences are usually sparse arrays, as each sentence contains very few words compared to the entire corpus of words.
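A minimal sketch of this with scikit-learn (the sentences here are just illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "a mind bending thriller about dreams",
    "a heartwarming comedy about family",
]

# Each column corresponds to one word from the corpus,
# each row is the bag-of-words vector for one sentence.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(sentences)

print(bow.get_feature_names_out())
print(bow_matrix.toarray())
```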
TF-IDF :
Using TfidfVectorizer from sklearn we can convert a string of words to a vector. TfidfVectorizer forms an array with features (columns) as words from the corpus of words present in the text data and sentences as data points (rows). If a word exists in a sentence, the TF-IDF value for that feature in the sentence (data point) is the term frequency of the word in that sentence multiplied by its inverse document frequency. Term frequency (TF) is the number of times a word (feature) occurs in a sentence (data point) divided by the total number of words present in that sentence. If a word occurs more than once in a sentence it gets a higher term frequency, and hence more weight; this makes sense, as a higher weight for a feature signifies that the word is more important to that sentence. Inverse document frequency (IDF) is the log of the total number of documents (sentences) in the input data divided by the number of documents (sentences) in which the feature (word) occurs. Since log is an increasing function, the IDF value is high when the word occurs in only a few sentences (a small denominator) and low for words that appear in many sentences. This makes sense because a word occurring in only a few sentences is likely to carry more of the meaning of those sentences than a word that occurs everywhere, so we give it more weight.
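In formula form (scikit-learn's TfidfVectorizer adds smoothing on top of this, so the exact values differ slightly):

TF(word, sentence) = (number of times the word occurs in the sentence) / (total number of words in the sentence)
IDF(word) = log( total number of sentences / number of sentences containing the word )
TF-IDF(word, sentence) = TF(word, sentence) × IDF(word)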
IDF :
We build an IdfVectorizer function to convert our sentences to vectors. IdfVectorizer forms an array with features (columns) as words from the corpus of words present in the text data and sentences as data points (rows). The IDF value is calculated in the same way as in the TfidfVectorizer model above. We assign only the IDF value to the cells and ignore the TF values, to avoid the bias term frequency has towards sentences with fewer words: the fewer words a sentence contains, the higher the term frequency of each word in it, because the denominator (the total number of words in the sentence) is smaller (refer to the TF-IDF description above for the formula).
Average Word2Vec :
We use Google's pretrained word2vec model, which contains vector representations of 3 million words. It was built using neural network models to learn word associations from a large corpus of text. Each word has a 300-dimensional vector representation, and words with similar semantic meanings are assigned similar vectors by the model. We write a function that gives us the vector representation of a sentence (data point) by adding the vectors of all the words in the sentence and dividing the resulting vector by the number of words in the sentence. The resulting 300-dimensional vector is the average vector representation of all the words in the sentence. This method is called Average Word2Vec.
2. Cosine distance:
Cosine distance measures the angle between two vectors in an N-dimensional space. N in our case is the number of features, i.e. the number of words present in the corpus of words formed from all the sentences in our data. We calculate cosine distance using the formula mentioned below:
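cosine similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)
cosine distance(A, B) = 1 − cosine similarity(A, B)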
The vectors converted from our text are highly multi-dimensional because of the large number of words in the corpus. For ease of understanding, let us consider 2-dimensional vectors for a moment, as they are easiest to picture. Let's refresh the concept of the dot product: the dot product of two vectors is the projection of one vector onto the other. Therefore the dot product of two perpendicular vectors (i.e. vectors sharing no common direction) is zero. In general, we can calculate the dot product of n-dimensional vectors as shown below.
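A · B = A₁B₁ + A₂B₂ + … + AₙBₙ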
In distance-based similarity recommendations, we use cosine distance as the distance metric when the magnitude of the vectors does not matter. Let us understand this with an example. Suppose we are working with text data represented by word counts, and assume that when a word such as 'physics' occurs more frequently in sentence 1 than in sentence 2, the magnitude of the vector tells us sentence 1 is more related to the topic of physics. However, the sentences may simply have unequal lengths: 'physics' probably occurred more often in sentence 1 because that sentence is longer than sentence 2. Cosine similarity corrects this bias. Text data is the most typical example of when to use this metric; you would want to apply cosine similarity in cases where feature weights can be larger without meaning anything different. If we used Euclidean distance instead of cosine distance, the magnitude of the distance between the vectors of sentence 1 and sentence 2 would make them seem far apart, because Euclidean distance is affected by the magnitude of the vectors.
To summarize: if we plot the sentences in an N-dimensional space, where each dimension represents a word, cosine similarity captures the angle between the sentences (vectors) and not the magnitude of the distance between them. If we need the magnitude, we should compute the Euclidean distance instead.
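A quick sketch of this effect (the sentences are made up; any two texts of very different lengths on the same topic would do):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Same topic, very different lengths.
docs = [
    "physics physics physics quantum physics relativity physics",
    "physics quantum",
]

vectors = CountVectorizer().fit_transform(docs)

# Euclidean distance is inflated by the longer sentence's larger counts,
# while cosine distance only looks at the angle between the vectors.
print(euclidean_distances(vectors)[0, 1])
print(cosine_distances(vectors)[0, 1])
```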
Now that we are clear on the concepts, let's start coding.
Program Flow:
Extract Transform Load (ETL) Pipeline:
In this project we have created an extract-transform-load pipeline. The ETL pipeline carries out the tasks mentioned below and gives us transformed data. The code snippets shown below are only for the reader's understanding and may not be a part of the class created in the actual pipeline. Check out the actual ETL Pipeline.
1. Data collection:
The data used is from Kaggle; it consists of movie and TV series data with 7787 rows (data points) and 12 columns (features). Though there are 12 features in the data, we only use the pertinent features: 'title', 'country', 'director', 'cast', 'listed_in' and 'description' of the movie/TV series.
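A minimal sketch of this step, assuming the Kaggle export is saved as netflix_titles.csv (the filename is an assumption):

```python
import pandas as pd

# Hypothetical filename for the Kaggle Netflix dataset.
df = pd.read_csv("netflix_titles.csv")

# Keep only the features used for the recommendations.
features = ["title", "country", "director", "cast", "listed_in", "description"]
df = df[features]

print(df.shape)
```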
We need a row for each movie containing the pertinent features, as shown above. Later we are going to merge the features of each data point (row) into a single string containing all the features of that movie, and perform vectorization on this string using natural language processing. That's right, using these vectors we find similarities based on the cosine distances between them! But of course, there's a bit of cleaning we need to do first.
2. Data cleaning :
Missing data:
We will remove rows (data points) from the data if the movie title or the description is missing, for the following reasons (a small sketch follows after this list):
‘description’ : The feature describing the movie will play a major role in vector formation because of the plethora of words describing the movie's characteristics. A missing description will affect the final sparsity of the vector formed from the merged sentence after vectorization (refer to the vectorization methods above). Thus it makes sense to delete rows of movies with missing descriptions.
‘title’ : It makes sense to remove data points without a title as we cannot recommend a movie if it does not exist.
We do not remove data points with missing director or cast names. Even with these features missing, the combined string of all features still contains enough words describing the movie to form vectors significant enough (in terms of sparsity) for distance-based similarity recommendations.
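A minimal sketch of this cleaning step (continuing from the df loaded above; filling the remaining gaps with empty strings is my assumption, not necessarily what the actual pipeline does):

```python
# Drop movies with a missing title or description;
# keep rows with missing director/cast/country.
df = df.dropna(subset=["title", "description"])

# Replace the remaining missing values with empty strings
# so the features can later be merged into one text column.
df = df.fillna("")
```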
Duplicate data:
No duplicate movie titles exist in any rows, so all movies are unique and we do not remove any rows from the data. It does not make sense to remove rows with a duplicate 'director' or 'cast' feature, as different movies can share the same director and actors.
3. Text preprocessing :
‘description’ :
We use the Natural Language Toolkit (NLTK) to remove stopwords from the movie 'description' column. Stopwords are the most commonly used words present in almost every sentence, like 'they', 'hey', 'the' etc., which do not add much meaning to the sentence.
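A minimal sketch of the stopword removal, assuming NLTK's English stopword list (lowercasing the description here is part of the sketch):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    # Lowercase the description and drop the stopwords.
    return " ".join(w for w in text.lower().split() if w not in stop_words)

df["description"] = df["description"].apply(remove_stopwords)
```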
‘cast’, ‘director’ & ‘country’:
We merge the first and last names of cast members and directors into one unique word, and also lowercase the text. Let me explain why with an example: suppose we have movie X where the lead actor is Jennifer Aniston, and movie Y where the director is Jennifer Lawrence. The recommender would detect a similarity simply because of the shared first name, which is something we want to avoid. We would prefer the recommender to consider a similarity only when the person associated with two different movies is exactly the same person. We also keep only the first three cast members of each movie, rather than the entire cast, to avoid using irrelevant features. Likewise, we lowercase and merge the words of the country name into a single unique word to avoid the same problem mentioned above.
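A minimal sketch of this preprocessing, assuming the 'cast', 'director' and 'country' columns are comma-separated strings (as in the Kaggle data); the helper name is illustrative:

```python
def merge_names(value, limit=None):
    # "Jennifer Aniston, Adam Sandler" -> "jenniferaniston adamsandler"
    names = [n.strip().replace(" ", "").lower() for n in value.split(",") if n.strip()]
    if limit:
        names = names[:limit]
    return " ".join(names)

df["cast"] = df["cast"].apply(lambda v: merge_names(v, limit=3))  # keep the top 3 cast members
df["director"] = df["director"].apply(merge_names)
df["country"] = df["country"].apply(merge_names)
```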
Merging features:
The next step is merging our features into a string.
After merging all the columns, we save the result in a new column called 'text'. This column, containing all the features merged into a single string, is ready for vectorization. We carry out vectorization and model implementation using the modelling pipeline.
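A minimal sketch of the merge, continuing from the cleaned columns above (the exact order of the merged features is illustrative):

```python
# Combine the pertinent features of each movie into one string.
df["text"] = (
    df["listed_in"] + " "
    + df["director"] + " "
    + df["cast"] + " "
    + df["country"] + " "
    + df["description"]
)

print(df[["title", "text"]].head())
```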
Modelling Pipeline:
The modelling pipeline carries out the various types of vectorization, implements the distance-based similarity machine learning model, and returns the results of these implementations. Some of the code snippets shown below are only for the reader's understanding and may not be a part of the class created in the actual pipeline. Check out the modelling pipeline.
1. Vectorization :
We use distance-based similarities of vectors to make recommendations. In order to make recommendations, we first need to vectorize the 'text' string corresponding to each movie. I have described the various ways of going about this vectorization in the mathematical underpinnings; I'll just mention the code to do so in this section.
Bag of words : from the scikit-learn library import the CountVectorizer tool.
Term Frequency * Inverse Document Frequency (Tfidf) : from the scikit-learn library import the TfidfVectorizer tool.
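A minimal sketch of both vectorizations applied to the merged 'text' column:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag-of-words vectors.
bow_matrix = CountVectorizer().fit_transform(df["text"])

# TF-IDF vectors.
tfidf_matrix = TfidfVectorizer().fit_transform(df["text"])

print(bow_matrix.shape, tfidf_matrix.shape)
```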
Inverse Document Frequency (IDF): We need to create our own functions for IDF vectorization, as no inbuilt IDF tool exists in the scikit-learn library.
First, we create a matrix with words as our features (columns) and sentences as our data points (rows) with the help of CountVectorizer.
Second, we use an IDF function to change the values of the cells in this matrix from CountVectorizer counts to IDF values.
The resultant matrix is the IDF sentence-to-vector representation of our data. A sketch of both steps follows.
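A minimal sketch of both steps (continuing from the df with the merged 'text' column):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Step 1: word counts per sentence.
cv = CountVectorizer()
counts = cv.fit_transform(df["text"]).toarray()

# Step 2: replace the counts with IDF values.
n_docs = counts.shape[0]
docs_with_word = (counts > 0).sum(axis=0)   # number of sentences containing each word
idf = np.log(n_docs / docs_with_word)        # IDF per word
idf_matrix = (counts > 0) * idf              # assign the IDF value wherever the word occurs

print(idf_matrix.shape)
```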
AverageWord2Vec : from the gensim library we import Google's pretrained word2vec model, which has vector representations of 3 million words, each represented by a 300-dimensional vector.
First, we create a function that adds the vector representations of all the words in a sentence and divides the result by the number of words, giving an average vector representing the sentence in the 300-dimensional space.
Second, we apply the Avg Word2Vec function to each sentence.
The resultant matrix represents the Average Word2Vec vectorization of our sentences. A sketch of these steps follows.
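A minimal sketch of these steps, assuming the GoogleNews vectors have been downloaded locally (the file name is the standard Google release name):

```python
import numpy as np
from gensim.models import KeyedVectors

# Google's pretrained word2vec model (300-dimensional vectors).
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def avg_word2vec(sentence, model, dim=300):
    # Words missing from the vocabulary contribute a zero vector.
    vectors = [model[w] if w in model else np.zeros(dim) for w in sentence.split()]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

avg_w2v_matrix = np.vstack([avg_word2vec(s, w2v) for s in df["text"]])
print(avg_w2v_matrix.shape)
```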
2. Modelling :
Once we have the vector representations of the sentences in the 'text' feature, we can find similar sentences based on the distance between their respective vectors. The tool we use to find the distances between two sentences is pairwise_distances from scikit-learn. We recommend the movies whose sentences are closest to each other based on the cosine distance (not Euclidean distance), for the reasons mentioned under the mathematical underpinnings (2. Cosine distance).
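A minimal sketch of such a method (the function name and signature are illustrative, not necessarily those of the actual modelling class), applied here to the TF-IDF vectors computed earlier:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

def recommend(movie_index, vectors, n_recommendations=5, metric="cosine"):
    # Distances between the selected movie and every other movie.
    distances = pairwise_distances(vectors[movie_index].reshape(1, -1), vectors, metric=metric)[0]
    # Closest movies first; the first entry is the selected movie itself, so skip it.
    closest = np.argsort(distances)[1:n_recommendations + 1]
    return closest

indices = recommend(0, tfidf_matrix.toarray())
print(df["title"].iloc[indices])
```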
Above is the method we create in the modelling class to get a list of indices of similar movies for the distance-based recommendations. We define visualization functions separately for a better understanding of the results.
3. Results:
Bag of words:
Tfidf:
Average word to vector:
Euclidean distance:
In the Average Word2Vec model each word is assigned a 300-dimensional vector based on its semantic meaning. These vectors and features are pre-assigned by Google and have nothing to do with the word corpus of our sentences, unlike the other vectorization methods. As a cell's magnitude no longer represents the mere presence of a word in a sentence, the magnitude of the vector does matter in the Average Word2Vec case. It therefore makes more sense to use Euclidean distance instead of cosine distance for distance-based similarity recommendations when the magnitude of the vector plays a role in measuring the distance between vectors (refer to the cosine distance description above).
A different visualization function is used to better understand the results of the Average Word2Vec vectorizer's distance-based similarity recommendations. In this visualization function we use the heatmap plot from the seaborn library. The heatmap shows the Euclidean distances between the vectors of each word of the selected and recommended movie strings. The labels on the Y axis represent the words from the selected movie string and the labels on the X axis represent the recommended movie strings.
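A minimal sketch of such a heatmap (the word lists here are illustrative; in the pipeline they come from the selected and recommended movie strings, and the w2v model is the one loaded above):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances

selected_words = "a mind bending thriller".split()
recommended_words = "a dark crime drama".split()

# Word vectors from the Average Word2Vec model; missing words get zero vectors.
sel_vecs = np.vstack([w2v[w] if w in w2v else np.zeros(300) for w in selected_words])
rec_vecs = np.vstack([w2v[w] if w in w2v else np.zeros(300) for w in recommended_words])

distances = euclidean_distances(sel_vecs, rec_vecs)
sns.heatmap(distances, xticklabels=recommended_words, yticklabels=selected_words, annot=True)
plt.show()
```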
Recommended content :
The reason for so many different words showing zero Euclidean distance from each other in the heatmap is as follows:
When a word does not exist in Google's word2vec vocabulary we assign it a 300-dimensional zero vector. As many of our unique words do not exist in the word2vec vocabulary, all these words are assigned the zero vector. Hence, the Euclidean distance between these words becomes zero, because they have identical vector representations.
7. Conclusion & Improvements:
Conclusion:
Average Word2Vec model:
The results of the Average Word2Vec vectorization based model are not very relevant. The reason is our decision to combine first and last names, like 'samueljackson', into a unique word in the text string, to stop the distance-based similarity model from perceiving people with the same first name as similar. This technique works wonderfully well for the other vectorization models. Unfortunately, these merged words do not exist in the word2vec model provided by Google. To make up for this discrepancy we assign zeros to all 300 dimensions of the vector representing a word like 'samueljackson'. Many such unique words in our corpus end up as zero vectors and do not contribute to the average word-to-vector conversion of the sentences. As a result, the vector representing a movie can land in a region of vectors belonging to not very similar movies, due to the zero-vector representation of certain words in the strings.
Tfidf & Bag of words:
Looking at our results for the bag of words and TF-IDF vectorization models, we cannot determine which vectorization model gives better results. To find out which model is better we would have to live test the recommendation models. We would divide our customers into random test groups of equal size and live test one model on each group. The second step would be to collect various response or business parameters, such as the purchase conversion rate of recommendations, the recommendation selection rate, etc. The third step would be to perform multivariate A/B testing using these parameters, after collecting a significant amount of data, to select the better performing model.
Improvements :
In order to make our Word2Vec model work we need to create an alternate string of words in which the first and last names of cast and directors are not joined, and in which the words of the country names are not merged either. The Google word2vec model may not have a vector representation for 'samueljackson', but it does have vector representations for the words 'samuel' and 'jackson'. We also do not need to worry about the model perceiving other actors with the same first name 'samuel' but a different last name as similar: in the Average Word2Vec method the vector representing a sentence comes from the summation of the vectors of all its words, and the sum of the vectors for 'samuel' and 'jackson' will differ from the sum of the vectors for 'samuel' and some other last name. Hence, with more suitable text data we can solve the zero-vector issue with Google's word2vec vectorization mentioned in the conclusion.
Github link.