
Thursday, May 15, 2014

Shuffling a List

Here are two algorithms that randomly shuffle a list, along with two implementations of the swap function: one standard, one bitwise (XOR). There is also a demo() function that you can run on short lists to convince yourself that the shuffle is doing what it claims to do.
#! /usr/bin/env python

from random import randint
import time
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

def swap(lst, i, j):
 # standard swap using a temporary variable
 temp = lst[i]
 lst[i] = lst[j]
 lst[j] = temp

def swap_bitwise(lst, i, j):
 # XOR swap: only valid for integers, and i must differ from j
 # (when i == j, the first XOR would zero out the element)
 if i == j:
  return
 lst[i] = lst[i] ^ lst[j]
 lst[j] = lst[i] ^ lst[j]
 lst[i] = lst[i] ^ lst[j]

# naive shuffle: j is drawn from the whole list every time, which turns out to be biased
def shuffle(lst, swap_func, should_print=False):
 for i in xrange(len(lst)):
  j = randint(0, len(lst)-1)
  swap_func(lst, i, j)
  if should_print:
   print i, j, lst

# more or less due to Fisher, Yates, Durstenfeld, and Knuth
def shuffle2(lst, swap_func, should_print=False):
 for i in range(len(lst))[::-1]:
  j = randint(0, i)
  swap_func(lst, i, j)
  if should_print:
   print i, j, lst

def demo(shuffle_func):
 lst = range(10)
 print 'list before:', lst
 print 'shuffling ...'
 shuffle_func(lst, swap, True)
 print '...done'
 print 'list after:', lst

def function_times(fs, maxlen):
 res = []
 for f in fs:
  r = []
  for i in xrange(maxlen):
   data = range(i)
   start = time.time()
   shuffle(data, f)
   end = time.time()
   r.append((i, end-start))
  res.append(r)
 return res

def function_times2(fs, maxlen):
 res = []
 for f in fs:
  r = []
  for i in [1000*i for i in xrange(maxlen)]:
   data = range(i)
   start = time.time()
   shuffle(data, f)
   end = time.time()
   r.append((i, end-start))
  res.append(r)
 return res

def plot_prob():
 tests = 10000
 n_elements = 20
 m = np.array([[0.]*n_elements]*n_elements)
 for k in xrange(tests):
  initial = range(n_elements)
  shuffle2(initial, swap)  # change to shuffle() here to reproduce the biased colormap
  for i in range(len(initial)):
   m[i][initial[i]] += 1
 m = m / tests
 plt.imshow(m)
 plt.colorbar()
 plt.title('Color Map')
 plt.show()
 plt.clf()
 #print m.reshape(-1,).tolist()
 plt.title('Histogram')
 plt.hist(m.reshape(-1,).tolist())
 plt.show()
 plt.clf()

def plot_times():
 functions = [swap, swap_bitwise]
 mlen = 1000
 res = function_times(functions, mlen)
 fig = plt.figure(dpi=90)
 ax = fig.add_subplot(111)
 for r in res:
  ax.plot([rr[0] for rr in r], [rr[1] for rr in r])
 ax.legend(['swap', 'bitwise swap'], 
  loc='upper center', 
  #bbox_to_anchor=(0.5, 1.05),
  ncol=2, fancybox=True, shadow=True)
 #ax.set_yscale('log')
 plt.title('Shuffle function runtimes versus list length')
 plt.xlabel('list length')
 plt.ylabel('time (seconds)')
 plt.show()

if __name__ == '__main__':
 #demo(shuffle)
 #plot_times()
 plot_prob()

Below is the printout of the demo() function. 
Below is a plot of the time to shuffle versus list length, the output of the plot_times() function. As you can see, the bitwise swap is a little slower than the regular swap.
Now, we change the function function_times() to shuffle much larger lists. Since it takes quite a while to iterate through every size 1, 2, ..., n, we will change the list sizes to 1000, 2000, ..., 200000. This is accomplished by changing one line in the function_times function, from
for i in xrange(maxlen):
to 
for i in [1000*i for i in xrange(maxlen)]:
This functionality is incorporated into the function_times2() function. Here is the plot:
Notice the fluctuations near x = 150,000. This is most likely because the list is too large to fit in cache, i.e., elements of the list are being read from RAM. If we kept increasing the list size past what fits in RAM, there would be another, larger spike once pages start being read from disk, which is slower still. If this doesn't make sense to you, look up 'paging' and the 'principle of locality' to see how data is moved between cache, RAM, and disk.
Finally, let's compare shuffle() and shuffle2(). The code differences are subtle, but the difference in output is easy to see in two ways: a colormap and a histogram. First, the colormap for shuffle():
Since we are using 20 values, the probability that the element starting at position i ends up at position j should be 1/20 = 0.05 for every (i, j). The distribution of values is, however, clearly non-uniform. The histogram below confirms this: if there were no bias, the values would be approximately normally distributed around 0.05 (just sampling noise), but the data is skewed right.
The histogram for shuffle():
Here's the colormap for shuffle2():
This looks much more uniform. Indeed, as the histogram for shuffle2() below shows, the data is close to normally distributed around 0.05. Notice the symmetric shape and the lack of (or at least much smaller) skew.
For this reason, I recommend using shuffle2().  
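As an aside, there is a simple counting argument for why the naive shuffle() cannot be uniform: it makes n independent draws of j, giving $n^n$ equally likely draw sequences, and $n^n$ is not divisible by $n!$ (for n > 2), so the $n!$ permutations cannot all be equally likely. The sketch below (my addition, separate from the timing code above) enumerates all $3^3 = 27$ draw sequences for a 3-element list and shows that each of the 6 permutations is produced either 4 or 5 times.
#! /usr/bin/env python

from itertools import product

def naive_shuffle_fixed(lst, js):
 # run the naive shuffle with a predetermined sequence of "random" indices js
 for i, j in enumerate(js):
  lst[i], lst[j] = lst[j], lst[i]
 return lst

counts = {}
for js in product(range(3), repeat=3):  # all 3**3 = 27 possible draw sequences
 result = tuple(naive_shuffle_fixed(range(3), js))
 counts[result] = counts.get(result, 0) + 1

for perm in sorted(counts):
 print perm, counts[perm]  # each permutation shows up 4 or 5 times, never 27/6 = 4.5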

Permutations

Below are five functions to calculate all permutations of a list. Some are mine, some are borrowed, all are valid. 
#! /usr/bin/env python

import time
import matplotlib
import matplotlib.pyplot as plt

def perm1(lst):
 if len(lst) == 0:
  return []
 elif len(lst) == 1:
  return [lst]
 else:
  l = []
  for i in range(len(lst)):
   x = lst[i]
   xs = lst[:i] + lst[i+1:]
   for p in perm1(xs):
    l.append([x] + p)
  return l

def perm2(lst):
 if len(lst) == 0:
  yield []
 elif len(lst) == 1:
  yield lst
 else:
  for i in range(len(lst)):
   x = lst[i]
   xs = lst[:i] + lst[i+1:]
   for p in perm2(xs):
    yield [x] + p

# http://stackoverflow.com/questions/2710713/algorithm-to-generate-all-possible-permutations-of-a-list
# from zhywu at stackoverflow.com
def perm3(lst):
 # note: for lists of length 0 or 1 this yields the same list twice
 if len(lst) <= 2:
  yield lst
  lst = lst[::-1]
  yield lst
 else:
  l = []
  for i in range(len(lst)):
   x = lst[i]
   xs = lst[:i] + lst[i+1:]
   for p in perm3(xs):
    yield [x]+p

# from Peter Breuer at stackoverflow (ported from his Haskell implementation)
def perm4(lst):
 if len(lst) <= 0:
  yield []
 elif len(lst) == 1:
  yield lst
 else:
  for p in perm4(lst[1:]):
   for a, b in splits(p):
    yield a + [lst[0]] + b

# list of ways of splitting a list into two parts
def splits(lst):
 for i in xrange(len(lst)+1):
  yield [lst[:i], lst[i:]]

# found this on the internet, not sure where
def perm5(xs, low=0):
 if low + 1 >= len(xs):
  yield xs[:]
 else:
  for p in perm5(xs, low + 1):
   yield p
  for i in range(low + 1, len(xs)):
   xs[low], xs[i] = xs[i], xs[low]
   for p in perm5(xs, low + 1):
    yield p
   xs[low], xs[i] = xs[i], xs[low]

def function_times(fs, maxlen=5):
 res = []
 for f in fs:
  r = []
  for i in range(maxlen):
   data = range(i)
   start = time.time()
   for p in f(data):
    pass
   end = time.time()
   r.append(end-start)
  res.append(r)
 return res

def all_results_equal(fs, data):
 # normalize every generator's output to a sorted list of lists and compare
 results = [sorted(list(p) for p in f(data)) for f in fs]
 return all(r == results[0] for r in results)

if __name__ == '__main__':
 functions = [perm1, perm2, perm3, perm4, perm5]
 assert all_results_equal(functions, range(4))

 res = function_times(functions, 9)
 fig = plt.figure(dpi=90)
 ax = fig.add_subplot(111)
 for r in res:
  ax.plot(r)
 ax.legend(['perm1', 'perm2', 'perm3', 'perm4', 'perm5'], 
  loc='upper center', 
  #bbox_to_anchor=(0.5, 1.05),
  ncol=2, fancybox=True, shadow=True)
 ax.set_yscale('log')
 plt.title('Permutation function runtimes versus list length')
 plt.xlabel('list length')
 plt.ylabel('time (seconds)')
 plt.show()
Below are two plots of the same data--one using a linear y axis, the other a logarithmic y axis. As you can see, the fourth function (the teal line) performs slightly better than the rest on large lists, although none of the functions performs particularly well there.
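As an extra sanity check (a short snippet added here, which assumes the perm functions from the script above are in scope), the standard library's itertools.permutations can validate the generators and is typically much faster, since it is implemented in C:
from itertools import permutations

data = range(4)
lib = sorted(list(p) for p in permutations(data))  # permutations yields tuples
assert lib == sorted(perm4(data))                  # perm4 yields lists
print len(lib), 'permutations of', data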


Tuesday, May 13, 2014

Combinations

A combination is one possible way of choosing k things from n things. For example, all combinations of size 2 from the list ['a', 'b', 'c'] are ['a', 'b'], ['b', 'c'], and ['a', 'c']. The number of combinations is just the famed n-choose-k function: $ \binom{n}{k} = \frac{n!}{k!(n-k)!} $. In the example above, n = 3 and k = 2, so the number of combinations is $ \frac{3!}{2!(3-2)!} = \frac{3!}{2! \, 1!} = \frac{3!}{2!} = \frac{3 \cdot 2 \cdot 1}{2 \cdot 1} = 3 $. The following functions compute all combinations of a list. Each has advantages and disadvantages, depending on your needs. The first function, combs1(), makes copies of the list in the recursive calls, which will affect performance for large lists. The second function, combs2(), makes no copies, which is why it requires a little more manipulation.
#! /usr/bin/env python

def combs1(k, available, used=[]):
 if len(used) == k:
  yield tuple(used)
 elif len(available) <= 0:
  pass
 else:
  for c in combs1(k, available[1:], used[:]+[available[0]]):
   yield c
  for c in combs1(k, available[1:], used[:]):
   yield c

def combs2(k, available, used=[]):
 if len(used) == k:
  yield tuple(used)
 elif len(available) <= 0:
  pass
 else:
  head = available.pop(0)
  used.append(head)
  for c in combs2(k, available, used):
   yield c
  used.pop(-1)
  for c in combs2(k, available, used):
   yield c
  available.insert(0, head)

def testfunctions(testlist, k=2):
 mycombs = []
 functions_to_test = [combs1, combs2]
 for tf in functions_to_test:
  mycombs.append([s for s in tf(k, testlist)])
  
 # this is for validation
 import itertools
 combs_from_lib = [i for i in itertools.combinations(testlist, k)]
 
 results = []
 for mycomb in mycombs:
  if mycomb == combs_from_lib:
   results.append('correct')
  else:
   results.append('wrong')
 return results

if __name__ == '__main__':
 print testfunctions(list('abc'))

For validation, I import itertools.combinations and compare its results with those from my functions. They match!
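For concreteness, here is what combs1() yields for the three-letter example from the introduction (the same three pairs as above, returned as tuples):
>>> [c for c in combs1(2, list('abc'))]
[('a', 'b'), ('a', 'c'), ('b', 'c')]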

Friday, May 9, 2014

Majority Algorithm

This algorithm finds the majority item in a list or stream of items; here, for simplicity, all items are integers. It is known as the Majority Algorithm and was published by Fischer and Salzberg in 1982. The point of the algorithm is to read a stream of n items and return the majority item if one exists, where a majority item is defined as an item that occurs more than n/2 times in a stream of n items. The algorithm does this with a single counter.
The algorithm first initializes the counter to 0 and the majority candidate to null. It then looks at each item in the stream. If the item equals the current majority candidate, the counter is incremented by 1. Otherwise, if the counter is 0, the item becomes the new majority candidate and the counter is set to 1; if the counter is not 0, it is decremented by 1.
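One caveat: if the stream is not guaranteed to contain a majority item, this single pass only produces a candidate, and a second pass is needed to confirm the candidate really occurs more than n/2 times. A minimal sketch of that check (my addition; it assumes the data can be replayed, e.g., it is stored in a list rather than consumed from a one-shot stream):
def is_majority(data, candidate):
 # second pass: count the candidate's occurrences and compare against n/2
 count = sum(1 for item in data if item == candidate)
 return count > len(data) / 2.0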
For testing, I have created a little stream whose majority item is the integer 5. 
#! /usr/bin/env python

from random import choice

def find_majority(data_stream):
 majority = None
 counter = 0
 for item in data_stream:
  if item == majority:
   counter += 1
  else:
   if counter == 0:
    majority = item
    counter = 1
   else:
    counter -= 1
  print 'item:', item, 'majority:', majority, 'counter:', counter
 return majority

def stream(items, maxi=2000):
 i = 0
 while True:
  yield choice(items)
  i += 1
  if i > maxi:
   break

if __name__ == '__main__':
 l = range(10) + [5]*10
 print 'stream:', l
 s = stream(l, len(l))
 maj = find_majority(s)
 print 'majority:', maj

Sample:

Since 5 is barely the majority, the counter hovers around 0 and 1, but the final result is correct. If 5 were the clear majority, the counter would climb higher, like so:
Note that if the counter is 0 at the end of the algorithm, the final candidate occurs at most n/2 times, so there is no majority item. Also note that the stream() function above is stochastic, i.e., it takes a list and emits random values drawn from that list. If you do not want a stochastic stream, just pass a regular list or array to the algorithm. The algorithm can also easily read streams from disk: just open a file and pass the file object to it. In Python, the code
f = open('myfile')
for item in f:
    # do something

is different from


f = open('myfile')
for item in f.readlines():
    # do something

The first snippet reads the file one line at a time, so memory requirements stay low, at the cost of repeated reads from disk. This is slower, but if the data stream is too large to fit into RAM, it is the right approach. The second snippet reads the entire file into memory first and then iterates, which only works well if the whole file fits into RAM.
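To make that concrete, here is a small sketch (my addition, assuming each line of 'myfile' holds one integer item) of feeding a file directly to find_majority():
f = open('myfile')
maj = find_majority(int(line) for line in f if line.strip())  # one line at a time
f.close()
print 'majority:', maj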

Anagrams using MapReduce and mrjob

This code discovers anagrams using MapReduce via the mrjob library. The input is a dictionary text file with one word per line; the output is a text file listing the sets of anagrams found.
#! /usr/bin/env python

from mrjob.job import MRJob

class Anagram(MRJob):
 def mapper(self, _, word):
  # key on the word's sorted letters so that anagrams of each other
  # all land under the same key
  letters = list(word)
  letters.sort()
  if word != '':
   yield letters, word

 def reducer(self, _, words):
  # any key that collected two or more words is a set of anagrams
  anagrams = [w for w in words]
  if len(anagrams) >= 2:
   yield len(anagrams), anagrams

if __name__ == '__main__':
 Anagram.run()

Call it like this:
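For example, assuming the script is saved as anagram.py and the word list is words.txt (both names are placeholders), a typical mrjob invocation is:
python anagram.py words.txt > anagrams.txt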

Sample output:
The output file consists of two columns: the first column is the number of words in the set, and the second column is the set itself.