The Binarium: Movie and Genre Similarity using Link Analysis

Here we again measure movie genre similarity, but instead of simply counting genre co-occurrences, as in the previous post, we use link analysis. That is, we construct a bipartite graph, where the first part represents movies and the second part consists of genres, and . See the plot below for a sample from the imdb dataset.

Each movie is connected to one or more genres. The average number of genres per movie is about 2.7. The data above represents a small portion of the total dataset.
For this analysis below, we partition the dataset by decade, as before in the previous post. The set of movies and genres are denoted A and B, respectively Define s_A(X, Y) to be the similarity between movies, and s_B(x, y) to be the similarity between genres. We use the following equations to calculate the similarity between two movies X,Y ∈ A:

Similarly, we use the following equations to calculate the similarity between two genres x,y ∈ B:

Note that if X = Y, then s_A(X, Y) = 1. Also, if x = y, then s_B(x, y) = 1. That is, a movie X is 100% similar to itself; likewise, a genre x is perfectly similar to itself.
C₁ and C₂ are decay constants in the range (0, 1). O(X) is the number of edges of movie X, and O_i(X) is the genre, x ∈ B, of the ith edge of X. The terms that use Y can be treated the same. I(x) is the number of edges of genre x, and I_i(x) is the movie, X ∈ A, of the ith edge of x. The terms that use y can be treated the same.
We initialize all s_A(X, Y) and s_B(x, y) to 0 where X ≠ Y and x ≠ y; otherwise, s_A(X, Y) and s_B(x, y) are set to 1.
All combinations of elements of A must be calculated, as must all combinations of elements of B. There are $ \binom{|A|}{2}} $ pairs of X,Y ∈ A, and \$ |B| \choose 2 \$ pairs of x,y ∈ B. The two equations above, for s_A(X, Y) and s_B(x, y), must be calculated for each pair, resulting in $ \binom{|A|}{2} + \binom{|B|}{2} $ calculations. These calculations must be repeated until the values of s_A and s_B converge. This may take several iterations. Here the criteria is that the total change over all pairs is less than 0.1.
After s_A and s_B have been calculated
One advantage of the method above over the co-occurrence method is that we calculate movie similarity as well as genre similarity. This allows us to give movie recommendations. If we know a user likes movie X, then all we need to do is find movies similar to X and recommend them. For example, the results of the above algorithm gives that the four most similar movies to The Shawshank Redemption are:
What Becomes of the Broken Hearted?

Hidden Agenda

South Central

Some Mother's Son

This seems right. These movies definitely share some characteristics of The Shawshank Redemption.

Another example should suffice. Here are the 4 most similar movies to The Matrix:

Dragon Inn

American Ninja V

Operation Delta Force 3: Clear Target

The Defender

A third example should further suffice things up. X = Pulp Fiction. Recommendations =

Safe House

The Rich Man's Wife

Darr

Incognito

Note: these similar movies are calculated by the decade; that is, if the query is a movie from the 1990's, the recommendations will only be movies from the 1990's. Doing all movies together takes too much time with the algorithm above.

Below is a side-by-side comparison of the results of this post and the results in Co-occurrences. Again, each column is for a decade, and the higher genres are the ones closest (most similar to) the Comedy genre.

So now we have genre-genre similarity and movie-movie similarity. In order to determine genre-movie similarity, we will employ personalized PageRank, a topic for another post.

The Binarium

mathjax

Thursday, April 24, 2014

Movie and Genre Similarity using Link Analysis

No comments:

Post a Comment

kjh