mathjax

Thursday, April 24, 2014

Movie and Genre Similarity using Link Analysis

Here we again measure movie genre similarity, but instead of simply counting genre co-occurrences, as in the previous post, we use link analysis. That is, we construct a bipartite graph, where the first part represents movies and the second part consists of genres, and . See the plot below for a sample from the imdb dataset. 

Each movie is connected to one or more genres. The average number of genres per movie is about 2.7. The data above represents a small portion of the total dataset. 
For this analysis below, we partition the dataset by decade, as before in the previous post. The set of movies and genres are denoted A and B, respectively Define sA(X, Y) to be the similarity between movies, and sB(x, y) to be the similarity between genres. We use the following equations to calculate the similarity between two movies X,Y ∈ A:
Similarly, we use the following equations to calculate the similarity between two genres x,y ∈ B:
Note that if X = Y, then sA(X, Y) = 1. Also, if x = y, then sB(x, y) = 1. That is, a movie X is 100% similar to itself; likewise, a genre x is perfectly similar to itself. 
C1 and C2 are decay constants in the range (0, 1). O(X) is the number of edges of movie X, and Oi(X) is the genre, x ∈ B, of the ith edge of X. The terms that use Y can be treated the same. I(x) is the number of edges of genre x, and Ii(x) is the movie, X ∈ A, of the ith edge of x. The terms that use y can be treated the same. 
We initialize all sA(X, Y) and sB(x, y) to 0 where X ≠ Y and x ≠ y; otherwise, sA(X, Y) and sB(x, y) are set to 1. 
All combinations of elements of A must be calculated, as must all combinations of elements of B. There are $ \binom{|A|}{2}} $  pairs of X,Y ∈ A, and \$ |B| \choose 2 \$ pairs of x,y ∈ B. The two equations above, for sA(X, Y) and sB(x, y), must be calculated for each pair, resulting in $ \binom{|A|}{2} + \binom{|B|}{2} $ calculations. These calculations must be repeated until the values of sA and sB converge. This may take several iterations. Here the criteria is that the total change over all pairs is less than 0.1.
After  sA and sB have been calculated
One advantage of the method above over the co-occurrence method is that we calculate movie similarity as well as genre similarity. This allows us to give movie recommendations. If we know a user likes movie X, then all we need to do is find movies similar to X and recommend them. For example, the results of the above algorithm gives that the four most similar movies to The Shawshank Redemption are:
What Becomes of the Broken Hearted?
Hidden Agenda
South Central
Some Mother's Son
This seems right. These movies definitely share some characteristics of The Shawshank Redemption. 
Another example should suffice. Here are the 4 most similar movies to The Matrix:
Dragon Inn
American Ninja V
Operation Delta Force 3: Clear Target
The Defender
A third example should further suffice things up. X = Pulp Fiction. Recommendations =
Safe House
The Rich Man's Wife
Darr
Incognito
Note: these similar movies are calculated by the decade; that is, if the query is a movie from the 1990's, the recommendations will only be movies from the 1990's. Doing all movies together takes too much time with the algorithm above.

Below is a side-by-side comparison of the results of this post and the results in Co-occurrences. Again, each column is for a decade, and the higher genres are the ones closest (most similar to) the Comedy genre.



So now we have genre-genre similarity and movie-movie similarity. In order to determine genre-movie similarity, we will employ personalized PageRank, a topic for another post.

No comments:

Post a Comment