
Thursday, May 15, 2014

Shuffling a List

Here are two algorithms that randomly shuffle a list. There are also two implementations of the swap function. One is standard, the other is bitwise. There is a demo() function that you should run on short lists to convince yourself that shuffle is indeed doing what it claims to do.
#! /usr/bin/env python

from random import randint
import time
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

def swap(lst, i, j):
 temp = lst[i]
 lst[i] = lst[j]
 lst[j] = temp

def swap_bitwise(lst, i, j):
 # XOR swap; when i == j the first XOR would zero the element, so skip that case
 if i == j:
  return
 lst[i] = lst[i] ^ lst[j]
 lst[j] = lst[i] ^ lst[j]
 lst[i] = lst[i] ^ lst[j]

def shuffle(lst, swap_func, should_print=False):
 for i in xrange(len(lst)):
  j = randint(0, len(lst)-1)
  swap_func(lst, i, j)
  if should_print:
   print i, j, lst

# more or less due to Fisher, Yates, Durstenfeld, and Knuth
def shuffle2(lst, swap_func, should_print=False):
 for i in range(len(lst))[::-1]:
  j = randint(0, i)
  swap_func(lst, i, j)
  if should_print:
   print i, j, lst

def demo(shuffle_func):
 lst = range(10)
 print 'list before:', lst
 print 'shuffling ...'
 shuffle_func(lst, swap, True)
 print '...done'
 print 'list after:', lst

def function_times(fs, maxlen):
 res = []
 for f in fs:
  r = []
  for i in xrange(maxlen):
   data = range(i)
   start = time.time()
   shuffle(data, f)
   end = time.time()
   r.append((i, end-start))
  res.append(r)
 return res

def function_times2(fs, maxlen):
 res = []
 for f in fs:
  r = []
  for i in [1000*i for i in xrange(maxlen)]:
   data = range(i)
   start = time.time()
   shuffle(data, f)
   end = time.time()
   r.append((i, end-start))
  res.append(r)
 return res

def plot_prob():
 tests = 10000
 n_elements = 20
 m = np.array([[0.]*n_elements]*n_elements)
 for k in xrange(tests):
  initial = range(n_elements)
  shuffle2(initial, swap)  # change shuffle2 to shuffle here to reproduce the biased colormap for shuffle()
  for i in range(len(initial)):
   m[i][initial[i]] += 1
 m = m / tests
 plt.imshow(m)
 plt.colorbar()
 plt.title('Color Map')
 plt.show()
 plt.clf()
 #print m.reshape(-1,).tolist()
 plt.title('Histogram')
 plt.hist(m.reshape(-1,).tolist())
 plt.show()
 plt.clf()

def plot_times():
 functions = [swap, swap_bitwise]
 mlen = 1000
 res = function_times(functions, mlen)
 fig = plt.figure(dpi=90)
 ax = fig.add_subplot(111)
 for r in res:
  ax.plot([rr[0] for rr in r], [rr[1] for rr in r])
 ax.legend(['swap', 'bitwise swap'], 
  loc='upper center', 
  #bbox_to_anchor=(0.5, 1.05),
  ncol=2, fancybox=True, shadow=True)
 #ax.set_yscale('log')
 plt.title('Shuffle function runtimes versus list length')
 plt.xlabel('list length')
 plt.ylabel('time (seconds)')
 plt.show()

if __name__ == '__main__':
 #demo(shuffle)
 #plot_times()
 plot_prob()

Below is the printout of the demo() function. 
Below is a plot of the time to shuffle, the output of the plot_times() function. As you can see, the bitwise swap is a little slower than the regular swap. 
Now, we change the function function_times() to shuffle much larger lists. Since it takes quite a while to iterate through every size 1, 2, ..., n, we will change the list sizes to 1000, 2000, ..., 200000. This is accomplished by changing one line in the function_times function, from
for i in xrange(maxlen):
to 
for i in [1000*i for i in xrange(maxlen)]:
This functionality is incorporated into the function_times2() function. Here is the plot:
Notice the fluctuations near x = 150,000. This is most likely because the list has grown too large to fit in cache, so elements are being read from RAM. If we kept increasing the list size, there would eventually be another spike when the list no longer fits in RAM and has to be swapped to disk, which is slower still. If this doesn't make sense to you, look up 'paging' and the 'principle of locality' to see how data is moved between cache, RAM, and disk. 
Finally, let's compare shuffle() and shuffle2(). The only difference is how the swap index j is chosen: shuffle() picks j uniformly from the whole list on every step, while shuffle2() only picks j from the positions that have not yet been finalized. To see why that matters, let's look at the output in two ways: a colormap and a histogram. First, the colormap for shuffle():
Since we are using 20 values, the probability that position i ends up holding value j should be 1/20 = 0.05 everywhere. The distribution, however, is visibly non-uniform. This is expected: the naive shuffle() makes n independent choices of j, giving n^n equally likely swap sequences, and since n^n is not divisible by n! (the number of permutations) for n > 2, some permutations must occur more often than others. The histogram below confirms this: if there were no bias, the values would assume a roughly normal (gaussian) distribution around 0.05, but the data is skewed right. 
The histogram for shuffle():
Here's the colormap for shuffle2():
This looks much more uniform. Indeed, as the histogram for shuffle2() below shows, the data is close to normally distributed around 0.05. Notice the rounder shape and the reduced skew. 
For this reason, I recommend using shuffle2().  
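For what it's worth, the standard library's random.shuffle implements this same Fisher-Yates approach, so outside of exercises like this you can simply call it. A minimal example (lib_shuffle is just a rename so it does not shadow the shuffle() defined above):

from random import shuffle as lib_shuffle  # renamed to avoid clashing with the shuffle() above

lst = range(10)
lib_shuffle(lst)
print lst  # a random permutation of 0..9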

Permutations

Below are five functions to calculate all permutations of a list. Some are mine, some are borrowed, all are valid. 
#! /usr/bin/env python

import time
import matplotlib
import matplotlib.pyplot as plt

def perm1(lst):
 if len(lst) == 0:
  return []
 elif len(lst) == 1:
  return [lst]
 else:
  l = []
  for i in range(len(lst)):
   x = lst[i]
   xs = lst[:i] + lst[i+1:]
   for p in perm1(xs):
    l.append([x] + p)
  return l

def perm2(lst):
 if len(lst) == 0:
  yield []
 elif len(lst) == 1:
  yield lst
 else:
  for i in range(len(lst)):
   x = lst[i]
   xs = lst[:i] + lst[i+1:]
   for p in perm2(xs):
    yield [x] + p

# http://stackoverflow.com/questions/2710713/algorithm-to-generate-all-possible-permutations-of-a-list
# from zhywu at stackoverflow.com
def perm3(lst):
 if len(lst) <= 2:
  yield lst
  lst = lst[::-1]
  yield lst
 else:
  l = []
  for i in range(len(lst)):
   x = lst[i]
   xs = lst[:i] + lst[i+1:]
   for p in perm3(xs):
    yield [x]+p

# from Peter Breuer at stackoverflow (ported from his Haskell implementation)
def perm4(lst):
 if len(lst) <= 0:
  yield []
 elif len(lst) == 1:
  yield lst
 else:
  for p in perm4(lst[1:]):
   for a, b in splits(p):
    yield a + [lst[0]] + b

# list of ways of splitting a list into two parts
def splits(lst):
 for i in xrange(len(lst)+1):
  yield [lst[:i], lst[i:]]
# found this on the internet, not sure where
def perm5(xs, low=0):
    if low + 1 >= len(xs):
        yield xs[:]
    else:
        for p in perm5(xs, low + 1):
            yield p
        for i in range(low + 1, len(xs)):
            xs[low], xs[i] = xs[i], xs[low]
            for p in perm5(xs, low + 1):
                yield p        
            xs[low], xs[i] = xs[i], xs[low]

def function_times(fs, maxlen=5):
 res = []
 for f in fs:
  r = []
  for i in range(maxlen):
   data = range(i)
   start = time.time()
   for p in f(data):
    pass
   end = time.time()
   r.append(end-start)
  res.append(r)
 return res

def all_results_equal(fs, data):
 # run every permutation function on a copy of the data and check that the
 # sorted outputs agree
 results = [sorted([list(p) for p in f(list(data))]) for f in fs]
 return all(r == results[0] for r in results)

if __name__ == '__main__':
 functions = [perm1, perm2, perm3, perm4, perm5]
 assert all_results_equal(functions, range(4))

 res = function_times(functions, 9)
 fig = plt.figure(dpi=90)
 ax = fig.add_subplot(111)
 for r in res:
  ax.plot(r)
 ax.legend(['perm1', 'perm2', 'perm3', 'perm4', 'perm5'], 
  loc='upper center', 
  #bbox_to_anchor=(0.5, 1.05),
  ncol=2, fancybox=True, shadow=True)
 ax.set_yscale('log')
 plt.title('Permutation function runtimes versus list length')
 plt.xlabel('list length')
 plt.ylabel('time (seconds)')
 plt.show()
Below are two plots of the same data: one with a linear y-axis, the other with a logarithmic y-axis. As you can see, the fourth function (teal line) performs slightly better than the rest on large lists, although none of the functions does particularly well there; the number of permutations grows as n!, so every approach slows down quickly. 
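As a sanity check (and a practical alternative), itertools.permutations generates the same permutations as tuples and is implemented in C, so it will generally outpace all five functions above. A quick comparison against perm1, assuming the script above is in scope:

from itertools import permutations

print sorted(perm1(range(3))) == sorted(list(p) for p in permutations(range(3)))  # True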


Tuesday, May 13, 2014

Combinations

A combination is an instance of one possible way to choose k things from n things. For example, all combinations of size 2 from the list ['a', 'b', 'c'] are ['a', 'b'], ['b', 'c'], and ['a', 'c']. The number of combinations is just the famed n-choose-k function: $ \binom{n}{k} = \frac{n!}{k!(n-k)!} $. In the example above, n = 3 and k = 2, so the number of combinations is $ \frac{3!}{2!(3-2)!} = \frac{3!}{2!\,1!} = \frac{3!}{2!} = \frac{3 \cdot 2 \cdot 1}{2 \cdot 1} = 3 $. The following functions compute all combinations of a list. Each has advantages and disadvantages, depending on your needs. The first function, combs1(), makes copies of the list in the recursive calls, which will affect performance for large lists. The second function, combs2(), makes no copies, which is why it requires a little more manipulation.
#! /usr/bin/env python

def combs1(k, available, used=[]):
 if len(used) == k:
  yield tuple(used)
 elif len(available) <= 0:
  pass
 else:
  for c in combs1(k, available[1:], used[:]+[available[0]]):
   yield c
  for c in combs1(k, available[1:], used[:]):
   yield c

def combs2(k, available, used=[]):
 if len(used) == k:
  yield tuple(used)
 elif len(available) <= 0:
  pass
 else:
  head = available.pop(0)
  used.append(head)
  for c in combs2(k, available, used):
   yield c
  used.pop(-1)
  for c in combs2(k, available, used):
   yield c
  available.insert(0, head)

def testfunctions(testlist, k=2):
 mycombs = []
 testfunctions = [combs1, combs2]
 for tf in testfunctions:
  mycombs.append([s for s in tf(k, testlist)])
  
 # this is for validation
 import itertools
 combs_from_lib = [i for i in itertools.combinations(testlist, k)]
 
 results = []
 for mycomb in mycombs:
  if mycomb == combs_from_lib:
   results.append('correct')
  else:
   results.append('wrong')
 return results

if __name__ == '__main__':
 print testfunctions(list('abc'))

For validation, I import the itertools.combinations and compare the results from my functions. They match!
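For the three-letter example, both functions (and itertools.combinations) yield the pairs in the same order. Assuming combs1 from the script above is in scope:

print [c for c in combs1(2, list('abc'))]
# [('a', 'b'), ('a', 'c'), ('b', 'c')]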

Friday, May 9, 2014

Majority Algorithm

This algorithm finds the majority item in a list or stream of items. Here, for simplicity, all items are integers. It is known as the Majority Algorithm and was first published by Fischer and Salzberg in 1982. The point of the algorithm is to read a stream of n items and return the majority item if it exists, where a majority item is defined as an item that occurs more than n/2 times. The algorithm does this with a single counter. 
The algorithm first initializes the counter to 0 and the majority element to null. Then it looks at each item in the stream. For each item, if the counter is 0, the item is set to the majority and the counter is set to 1. If the item is the current majority item, the counter is incremented by 1. If the item is not the current majority item, the counter is decremented by 1. 
For testing, I have created a little stream whose majority item is the integer 5. 
#! /usr/bin/env python

from random import choice

def find_majority(data_stream):
 majority = None
 counter = 0
 for item in data_stream:
  if item == majority:
   counter += 1
  else:
   if counter == 0:
    majority = item
    counter = 1
   else:
    counter -= 1
  print 'item:', item, 'majority:', majority, 'counter:', counter
 return majority

def stream(items, maxi=2000):
 i = 0
 while True:
  yield choice(items)
  i += 1
  if i > maxi:
   break

if __name__ == '__main__':
 l = range(10) + [5]*10
 print 'stream:', l
 s = stream(l, len(l))
 maj = find_majority(s)
 print 'majority:', maj

Sample:

Since 5 is barely the majority, the counter hovers around 0 and 1, but the final result is correct. If 5 were the standout majority, the counter would climb, like so:
Note that if the counter is 0 at the end of the algorithm, the current candidate could have occurred up to n/2 times, but not more, so there is no true majority. Also note that the stream() function above is stochastic, i.e., it takes a list and emits random values from that list. If you do not want a stochastic stream, just pass a regular old list or array to the algorithm. It can also easily be extended to read streams from disk: just open a file and pass the file object to the majority algorithm. In python, the code
f = open('myfile')
for item in f:
    # do something

is different than


f = open('myfile')
for item in f.readlines():
    # do something

The first code snippet iterates over the file lazily, one line at a time (Python buffers the underlying reads), so memory requirements stay low; this is the right choice when the data stream is too large to fit into RAM. The second code snippet reads the entire file into memory first and then iterates, which is only a good idea if the whole file fits into RAM. 
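One caveat worth adding: the single pass only guarantees the right answer when a majority item actually exists; otherwise it returns an arbitrary candidate. If the data can be replayed (or held in a list), a quick second pass confirms the candidate. A minimal sketch (find_majority_verified is my name, not part of the code above):

def find_majority_verified(data):
 data = list(data)  # materialize so we can make two passes
 candidate = find_majority(data)
 count = sum(1 for item in data if item == candidate)
 return candidate if count > len(data) / 2 else None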

Anagrams using MapReduce and mrjob

This code discovers anagrams. The input is a text file dictionary with one word per line. The output is a text file listing groups of words that are anagrams of one another. 
#! /usr/bin/env python

from mrjob.job import MRJob

class Anagram(MRJob):
 def mapper(self, _, word):
  letters = list(word)
  letters.sort()
  if word != '':
   yield letters, word

 def reducer(self, _, words):
  anagrams = [w for w in words]
  if len(anagrams) >= 2:
   yield len(anagrams), anagrams

if __name__ == '__main__':
 Anagram.run()

Call it like this:
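Assuming the class above is saved as anagram.py and the word list is in dict.txt (both file names are just placeholders), mrjob reads the input file given on the command line and writes the results to stdout:

python anagram.py dict.txt > anagrams.txt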

Sample output:
The output file consists of two columns: the first column is the number of words in the set, and the second column is the set itself. 

Wednesday, April 30, 2014

Perceptron

The perceptron algorithm attempts to find a line of separation between two classes of data--hence it is a classification algorithm. 
Below is the python code for the perceptron algorithm. We first generate a dataset with two-dimensional input and one-dimensional output. The input is normally distributed along both axes. The output, y, is either +1 or -1. The training set is a list of tuples of the form ([x1, x2], y), where the first element of the tuple is a list (actually a numpy array in the code, but for simplicity I write it as a list) and the second element is the output. 
Then we feed the training set to the learning algorithm. The perceptron_weights function initializes the weight vector, w, randomly. It then iterates over the training set. At each iteration, we take the dot product of w and x. If the sign of the dot product matches the label, i.e., y(w · x) > 0, the point x is on the correct side of the dividing line. If y(w · x) ≤ 0, the classifier is wrong, and we correct it by nudging the line slightly: w is updated to w + lrate · y · x.  

#! /usr/bin/env python

import math
import numpy as np
from numpy.random import normal
from pylab import plot, cla, show, norm, rand, xlabel, ylabel, title, legend, get_cmap, savefig, xlim, ylim
from random import shuffle, random
import matplotlib.pyplot as plt

def generateDataNormal(n):
 data = []
 mu, sigma = 0.15, 0.15 # mean and standard deviation
 x1 = np.random.normal(mu, sigma, n)
 y1 = np.random.normal(mu, sigma, n)
 data.extend([(np.array([x1[i], y1[i]]), 1) for i in range(n)])
 mu, sigma = -0.15, 0.15 # mean and standard deviation
 x2 = np.random.normal(mu, sigma, n)
 y2 = np.random.normal(mu, sigma, n)
 data.extend([(np.array([x2[i], y2[i]]), -1) for i in range(n)])
 return data

def test_error(w, test_data):
 correct = 0
 for x_i, y_i in test_data:
  y_i_prime = np.dot(w, x_i)
  if y_i * y_i_prime >= 0:
   correct += 1
 e = (len(test_data)-float(correct))/len(test_data)
 return e

def perceptron_weights(training_data, lrate=0.2):
 w = []
 xlen = len(training_data[0][0])
 for i in xrange(xlen):
  w.append(random())
 w = np.array(w)
 for x_i, y_i in training_data:
  y_i_prime = np.dot(w, x_i)
  if y_i * y_i_prime <= 0.0:
   w = w + lrate*y_i*x_i
   yield w.copy()

def perceptron_weight(training_data, lrate=0.2):
 w = []
 xlen = len(training_data[0][0])
 for i in xrange(xlen):
  w.append(random())
 w = np.array(w)
 for x_i, y_i in training_data:
  y_i_prime = np.dot(w, x_i)
  if y_i * y_i_prime <= 0.0:
   w = w + lrate*y_i*x_i
 return w

def plot_data(training_data, ws, lrate):
 for x in training_data:
  if x[-1] == 1:
   plot(x[0][0], x[0][1], 'ob')
  else:
   plot(x[0][0], x[0][1], 'or')
 colors = np.linspace(0.1, 1, len(ws))
 mymap = get_cmap("Greys")
 mycolors = mymap(colors)
 i = 0
 for w in ws:
  n = norm(w)
  ww = w/n
  ww1 = [ww[1], -ww[0]]
  ww2 = [-ww[1], ww[0]]
  plot([ww1[0], ww2[0]],[ww1[1], ww2[1]],color=mycolors[i])
  i += 1
 xlim(-1, 1)
 ylim(-1, 1)
 savefig('perceptron_'+str(lrate)+'.png')
 cla()
 #show()

if __name__ == '__main__':
 training_data = generateDataNormal(100)
 shuffle(training_data)
 for lrate in [0.1, 0.2, 0.4, 0.8]:
  weights = [w for w in perceptron_weights(training_data, lrate)]
  plot_data(training_data, weights, lrate)

Below are plots of how w changes on each iteration where the classifier is wrong and w must be updated, for different learning rates, represented by the Greek letter lambda (lrate in the code). A note about the color mapping for the lines (which are actually the lines orthogonal to w): there is a line for each time w is updated; the first line is the lightest and the last line is the darkest. As you can see, small learning rates like 0.1 produce small changes in w, while larger learning rates produce larger changes. Caution should be taken when choosing lambda, to help ensure convergence and avoid thrashing or cycling of w.
The blue circles are one class and the red circles are the other. 
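The script above defines test_error() but never calls it. A quick way to use it is to hold out part of the generated data and score the final weight vector from perceptron_weight(); the 80/20 split below is arbitrary:

data = generateDataNormal(100)
shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]
w = perceptron_weight(train, lrate=0.2)
print 'held-out error:', test_error(w, test)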




Tuesday, April 29, 2014

Counting Words using MapReduce

This class counts words in a document.
#! /usr/bin/env python
import re
from mrjob.job import MRJob

class WordCount(MRJob):
 def mapper(self, key, value):
  for word in value.split():
   yield word.lower(), 1
 def reducer(self, key, value):
  yield key, sum(value)

if __name__ == '__main__':
 WordCount.run()

Assuming you have Jane Austen's Sense and Sensibility in a .txt file like I do, call it like so: 
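For example, assuming the class is saved as wordcount.py and the novel is in sense_and_sensibility.txt (both file names are placeholders):

python wordcount.py sense_and_sensibility.txt > wordcounts.txt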

Each output line contains two tab-delimited elements: the word, and the count of that word. 

Oh, dear, that's a lot of 'dear's. As you can see, the mapper function simply treats a word as anything separated by whitespace. Additional work can clean that up a bit. The following fix uses regular expressions (regex). The special character '\w' means 'a word character', i.e., a-z, A-Z, 0-9, or underscore. The '+' means 'one or more'. Putting these together, '\w+' matches one or more word characters, which comprises a word. The 'ws' variable contains all matching words in the 'word' string. So, for example, if 'word' is "dear--sure!", then 'ws' contains every run of one or more word characters, which for this example is "dear" and "sure".
def mapper(self, key, value):
  for word in value.split():
   ws = re.findall(r'\w+', word)
   for w in ws:
    yield w.lower(), 1
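To see the regex by itself, outside of mrjob:

import re
print re.findall(r'\w+', 'dear--sure!')  # ['dear', 'sure']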



The occurrences of the word 'dear' and its kin have been stripped of the non-word characters, giving a better picture of the words themselves, but losing some of the peripheral information contained in the punctuation; that is, 'dear' does not equal 'dear!'

Monday, April 28, 2014

Coin Changing

This code finds all ways of making change for a given amount using the specified denominations. For example, if we have an infinite supply of pennies, nickels, dimes, and quarters, the best way to make change for 33 cents is 1 quarter, 1 nickel, and 3 pennies.
def change(n, coins_available, coins_so_far):
 if sum(coins_so_far) == n:
  yield coins_so_far
 elif sum(coins_so_far) > n:
  pass
 elif coins_available == []:
  pass
 else:
  for c in change(n, coins_available[:], coins_so_far+[coins_available[0]]):
   yield c
  for c in change(n, coins_available[1:], coins_so_far):
   yield c

if __name__ == '__main__':
 n = 33
 coins = [1, 5, 10, 25]

 solutions = [s for s in change(n, coins, [])]
 for s in solutions:
  print s

 print 'optimal solution:', min(solutions, key=len)

Below is the output, showing all ways of making change for 33 cents using only coins worth 1, 5, 10, and 25 cents.
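If you only need the number of ways rather than the combinations themselves, a classic dynamic-programming count avoids enumerating them. A minimal sketch (count_ways is my own name, not part of the code above):

def count_ways(n, coins):
 # ways[a] = number of ways to make amount a using the coins considered so far
 ways = [1] + [0] * n
 for c in coins:
  for amount in xrange(c, n + 1):
   ways[amount] += ways[amount - c]
 return ways[n]

print count_ways(33, [1, 5, 10, 25])  # matches the number of solutions printed above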

Sunday, April 27, 2014

Counting Letter Frequencies with MapReduce

This code shows how to use the mrjob library in python to count the occurrence of letters in a document.
#! /usr/bin/env python

from mrjob.job import MRJob

class LetterCount(MRJob):
 def mapper(self, key, value):
  for word in value.split():
   for letter in list(word):
    yield letter.lower(), 1
 def reducer(self, key, value):
  yield key, sum(value)
if __name__ == '__main__':
 LetterCount.run()
The input and output files can be passed in like so:
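For example, assuming the class is saved as lettercount.py (the data files are the ones described below):

python lettercount.py < dict.txt > lettercounts.txt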
We pass dict.txt--just a text file with about 118,000 words, one per line--as the input and we specify lettercounts.txt as the output. The '<' and '>' symbols are called 'redirect' operators for stdin and stdout, respectively. Below is some sample output. Observe that 'e' is the most frequent letter, a fact that makes code breaking slightly easier. 

Saturday, April 26, 2014

merging k sorted lists

Function to merge $k$ sorted lists. This algorithm uses a priority queue (a min-heap, via Python's heapq module). The keys into the queue are the first element (the head) of each sorted list, and the values are the remaining elements of the list (the tail). Using a min-heap guarantees that the list with the smallest head is always used first. The first for loop runs in $O(k \log k)$ time, since the heads of the $k$ lists must be heapified; the while loop runs in $O(k n \log k)$ time, where there are $k$ lists, each assumed to have $n$ elements. (Strictly speaking, list.pop(0) is itself linear in the list length; using collections.deque or an index pointer instead of pop(0) keeps the stated bound.)
import heapq

def merge_k_sorted_lists(lists):
 h = []
 for l in lists:
  head = l[0]
  l.pop(0)
  rest = l
  h.append((head, rest))
 heapq.heapify(h)
 r = []
 while h != []:
  head, rest = heapq.heappop(h)
  r.append(head)
  if rest != []:
   head = rest.pop(0)
   heapq.heappush(h, (head, rest))
 return r

lists = [range(5), range(-5,1), range(5,9)]
print lists
listsmerged = merge_k_sorted_lists(lists)
print listsmerged
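For what it's worth, the standard library already provides this as heapq.merge (Python 2.6+), which returns an iterator and does not destroy the input lists:

import heapq
print list(heapq.merge(range(5), range(-5, 1), range(5, 9)))
# [-5, -4, -3, -2, -1, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8]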

Friday, April 25, 2014

merging two sorted lists

Function to merge two sorted lists in linear time. This function is half of the mergesort algorithm.
#! /usr/bin/env python

def merge_sorted_lists(list1, list2):
 r = []
 i, j = 0, 0
 while i < len(list1) and j < len(list2):
  if list1[i] < list2[j]:
   r.append(list1[i])
   i += 1
  else:
   r.append(list2[j])
   j += 1
 if i == len(list1):
  r.extend(list2[j:])
 elif j == len(list2):
  r.extend(list1[i:])
 else:
  print 'something\'s wrong'
 return r

print merge_sorted_lists([1,3,5], [2,4,6])

Sample output: [1, 2, 3, 4, 5, 6]

converting between decimal and hexadecimal

Two functions for converting between decimal and hexadecimal.
def decimal_to_hexadecimal(decimal_num):
 if decimal_num == 0:
  return '0'
 hexvals = map(str, range(10)) + ['A', 'B', 'C', 'D', 'E', 'F']
 h = ''
 while decimal_num > 0:
  mod = decimal_num % 16
  h = hexvals[mod] + h
  decimal_num /= 16
 return h

def hexadecimal_to_decimal(hex_num):
 hexvals = map(str, range(10)) + ['A', 'B', 'C', 'D', 'E', 'F']
 i, d = len(hex_num)-1, 0
 while i >= 0:
  #print i
  d = d + hexvals.index(hex_num[i]) * 16**(len(hex_num)-1-i)
  i -= 1
 return d

def test_hex():
 originals = range(20)
 hexes = [decimal_to_hexadecimal(d) for d in originals]
 decimals = [hexadecimal_to_decimal(b) for b in hexes]
 assert originals == decimals
 print 'originals:', originals
 print 'hexes:', hexes
 print 'decimals:', decimals

if __name__ == '__main__':
 test_hex()
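For comparison, Python's built-ins do the same conversions (note that hex() adds a '0x' prefix and uses lowercase digits):

print hex(255)       # 0xff
print int('FF', 16)  # 255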

Below is some sample output.


Thursday, April 24, 2014

converting between decimal and octal

Two functions to perform conversion between the decimal and octal representation of integers.
#! /usr/bin/env python

def decimal_to_octal(decimal_num):
 i, o = 1, 0
 while decimal_num > 0:
  mod = decimal_num % 8
  o += (i * mod)
  decimal_num /= 8
  i *= 10
 return o

def octal_to_decimal(octal_num):
 i, d = 0, 0
 while octal_num > 0:
  mod = octal_num % 10
  d += (mod * 8**i)
  octal_num /= 10
  i += 1
 return d

if __name__ == '__main__':
 originals = range(10)
 octals = [decimal_to_octal(d) for d in originals]
 decimals = [octal_to_decimal(b) for b in octals]
 assert originals == decimals
 print 'originals:', originals
 print 'octals:', octals
 print 'decimals:', decimals

Sample output below.

converting between decimal and binary

Two functions for number conversion.
#! /usr/bin/env python

def decimal_to_binary(decimal_num):
 i, b = 1, 0
 while decimal_num > 0:
  mod = decimal_num % 2
  b += (i * mod)
  decimal_num /= 2
  i *= 10
 return b

def binary_to_decimal(binary_num):
 i, d = 0, 0
 while binary_num > 0:
  mod = binary_num % 10
  d += (mod * 2**i)
  binary_num /= 10
  i += 1
 return d

if __name__ == '__main__':
 originals = range(10)
 binaries = [decimal_to_binary(d) for d in originals]
 decimals = [binary_to_decimal(b) for b in binaries]
 assert originals == decimals
 print 'originals:', originals
 print 'binaries:', binaries
 print 'decimals:', decimals
Sample output below.

Movie and Genre Similarity using Link Analysis

Here we again measure movie genre similarity, but instead of simply counting genre co-occurrences, as in the previous post, we use link analysis. That is, we construct a bipartite graph in which one part represents movies, the other part represents genres, and edges connect each movie to its genres. See the plot below for a sample from the IMDB dataset. 

Each movie is connected to one or more genres. The average number of genres per movie is about 2.7. The data above represents a small portion of the total dataset. 
For the analysis below, we partition the dataset by decade, as in the previous post. The sets of movies and genres are denoted A and B, respectively. Define sA(X, Y) to be the similarity between movies, and sB(x, y) to be the similarity between genres. We use the following equation to calculate the similarity between two movies X,Y ∈ A:
$ s_A(X, Y) = \frac{C_1}{O(X)\,O(Y)} \sum_{i=1}^{O(X)} \sum_{j=1}^{O(Y)} s_B(O_i(X), O_j(Y)) $
Similarly, we use the following equation to calculate the similarity between two genres x,y ∈ B:
$ s_B(x, y) = \frac{C_2}{I(x)\,I(y)} \sum_{i=1}^{I(x)} \sum_{j=1}^{I(y)} s_A(I_i(x), I_j(y)) $
Note that if X = Y, then sA(X, Y) = 1. Also, if x = y, then sB(x, y) = 1. That is, a movie X is 100% similar to itself; likewise, a genre x is perfectly similar to itself. 
C1 and C2 are decay constants in the range (0, 1). O(X) is the number of edges of movie X, and Oi(X) is the genre, x ∈ B, of the ith edge of X. The terms that use Y can be treated the same. I(x) is the number of edges of genre x, and Ii(x) is the movie, X ∈ A, of the ith edge of x. The terms that use y can be treated the same. 
We initialize all sA(X, Y) and sB(x, y) to 0 where X ≠ Y and x ≠ y; otherwise, sA(X, Y) and sB(x, y) are set to 1. 
All combinations of elements of A must be calculated, as must all combinations of elements of B. There are $ \binom{|A|}{2} $ pairs of X,Y ∈ A, and $ \binom{|B|}{2} $ pairs of x,y ∈ B. The two equations above, for sA(X, Y) and sB(x, y), must be calculated for each pair, resulting in $ \binom{|A|}{2} + \binom{|B|}{2} $ calculations per iteration. These calculations must be repeated until the values of sA and sB converge, which may take several iterations. Here the criterion is that the total change over all pairs is less than 0.1.
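The post's code for this step is not shown, but here is a minimal sketch of the iterative update on a toy movie-genre graph, following the bipartite SimRank formulation that the definitions above describe. The identity initialization and the 0.1 convergence threshold come from the text; the function name, the toy edges, and the choice of 0.8 for the decay constants C1 and C2 are mine.

from itertools import combinations

def simrank_bipartite(edges, c1=0.8, c2=0.8, tol=0.1):
 # edges is a list of (movie, genre) pairs
 movies = sorted(set(m for m, g in edges))
 genres = sorted(set(g for m, g in edges))
 O = dict((m, [g for mm, g in edges if mm == m]) for m in movies)  # genres of each movie
 I = dict((g, [m for m, gg in edges if gg == g]) for g in genres)  # movies of each genre
 sA = dict(((a, b), 1.0 if a == b else 0.0) for a in movies for b in movies)
 sB = dict(((a, b), 1.0 if a == b else 0.0) for a in genres for b in genres)
 while True:
  change = 0.0
  newA, newB = dict(sA), dict(sB)
  for X, Y in combinations(movies, 2):
   s = c1 / (len(O[X]) * len(O[Y])) * sum(sB[(a, b)] for a in O[X] for b in O[Y])
   change += abs(s - sA[(X, Y)])
   newA[(X, Y)] = newA[(Y, X)] = s
  for x, y in combinations(genres, 2):
   s = c2 / (len(I[x]) * len(I[y])) * sum(sA[(a, b)] for a in I[x] for b in I[y])
   change += abs(s - sB[(x, y)])
   newB[(x, y)] = newB[(y, x)] = s
  sA, sB = newA, newB
  if change < tol:  # stop when the total change over all pairs is small
   return sA, sB

edges = [('Movie1', 'Action'), ('Movie1', 'Sci-Fi'),
         ('Movie2', 'Sci-Fi'), ('Movie2', 'Horror'),
         ('Movie3', 'Action'), ('Movie3', 'Comedy')]
sA, sB = simrank_bipartite(edges)
print sB[('Action', 'Sci-Fi')], sB[('Action', 'Comedy')]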
After sA and sB have been calculated, we can query them directly for the most similar genres or movies.
One advantage of the method above over the co-occurrence method is that we calculate movie similarity as well as genre similarity. This allows us to give movie recommendations. If we know a user likes movie X, then all we need to do is find movies similar to X and recommend them. For example, the results of the above algorithm gives that the four most similar movies to The Shawshank Redemption are:
What Becomes of the Broken Hearted?
Hidden Agenda
South Central
Some Mother's Son
This seems right. These movies definitely share some characteristics of The Shawshank Redemption. 
Another example should suffice. Here are the 4 most similar movies to The Matrix:
Dragon Inn
American Ninja V
Operation Delta Force 3: Clear Target
The Defender
A third example: X = Pulp Fiction. The recommendations are:
Safe House
The Rich Man's Wife
Darr
Incognito
Note: these similar movies are calculated by the decade; that is, if the query is a movie from the 1990's, the recommendations will only be movies from the 1990's. Doing all movies together takes too much time with the algorithm above.

Below is a side-by-side comparison of the results of this post and the results in Co-occurrences. Again, each column is a decade, and the higher a genre appears, the closer it is (the more similar) to the Comedy genre.



So now we have genre-genre similarity and movie-movie similarity. In order to determine genre-movie similarity, we will employ personalized PageRank, a topic for another post.

Sunday, April 20, 2014

Movie Genre Similarity using Genre Co-occurrences

This page shows a few ways of visualizing movie genre similarity. The similarity scores were obtained by iterating through an IMDB dataset of 10,000 movies from the 1950's to the 2010's, and counting the number of times each genre appeared with the other genres. For example, if a movie's genres were Action, Adventure, and Comedy, the co-occurrence counts would go up for the following pairs: (Action, Adventure), (Action, Comedy), and (Adventure, Comedy). The resulting co-occurrence count matrix is shown below. 
The color map is as follows: 1.0 is perfect similarity, and 0 is total non-similarity. 
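A minimal sketch of the counting step (the three genre lists here are made up; in the post the lists come from the IMDB dataset):

from itertools import combinations
from collections import defaultdict

movie_genres = [['Action', 'Adventure', 'Comedy'],
                ['Comedy', 'Romance'],
                ['Action', 'Comedy']]

cooccurrence = defaultdict(int)
for genres in movie_genres:
 for g1, g2 in combinations(sorted(genres), 2):  # every unordered pair of genres in this movie
  cooccurrence[(g1, g2)] += 1

for pair, count in sorted(cooccurrence.items()):
 print pair, count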

The next three visualizations focus on individual genres. For the Comedy genre plot below, the more similar a genre, the higher the label. The plot below is not to scale. The genres are simply ordered most-to-least similar, top-to-bottom. The data is also time sliced by decade, as indicated at the bottom of the plot. Observe the changes in rank over time. 




This plot is to scale, although that scale is not shown. 

The plot below shows essentially the same thing as above, but using line plots, showing scale, and with lower similarity genres filtered out. 


There are 24 genres in total, and not enough room for them here. Feel free to request a particular genre, and I'll try to post it. 
For those interested in a more detailed view, below is a plot using 1-year time slices (as opposed to the 10-year slices used above), for the Sci-Fi genre. 
The plot below, for the Mystery genre, is for 2-year time slices. Notice the rise in similarity to the Thriller genre, and the jagged wave shape of the Horror genre. Apparently, intense feelings of fear and shock, combined with puzzlement, are popular approximately once per generation. 


This analysis was done with Python, specifically numpy, pandas, and matplotlib.