mathjax

Friday, May 9, 2014

Majority Algorithm

This algorithm is for finding the majority item in a list or stream of items. Here, for simplicity, all items are integers. This algorithm is known as the Majority Algorithm, and was first published by Fisher and Salzberg in 1982. The point of this algorithm is to read a stream of n items, and return the majority item if it exists. A majority item is defined as an item that occurs more than n/2 times in a stream with n items. The algorithm does this with one counter. 
The algorithm first initializes the counter to 0 and the majority element to null. Then it looks at each item in the stream. For each item, if the counter is 0, the item is set to the majority and the counter is set to 1. If the item is the current majority item, the counter is incremented by 1. If the item is not the current majority item, the counter is decremented by 1. 
For testing, I have created a little stream whose majority item is the integer 5. 
#! /usr/bin/env python

from random import choice

def find_majority(data_stream):
 majority = None
 counter = 0
 for item in data_stream:
  if item == majority:
   counter += 1
  else:
   if counter == 0:
    majority = item
    counter = 1
   else:
    counter -= 1
  print 'item:', item, 'majority:', majority, 'counter:', counter
 return majority

def stream(items, maxi=2000):
 i = 0
 while True:
  yield choice(items)
  i += 1
  if i > maxi:
   break

if __name__ == '__main__':
 l = range(10) + [5]*10
 print 'stream:', l
 s = stream(l, len(l))
 maj = find_majority(s)
 print 'majority:', maj

Sample:

Since 5 is barely the majority, the counter hovers around 0 and 1, but the final result is correct. If 5 was the standout majority, the counter would increase, as so:
Note that if the counter is 0 at the end of the algorithm, the last item seen by the algorithm could have occurred up to n/2 times, but not more than n/2. Also note that the stream() function above is stochastic, i.e., it takes a list and emits random values from that list. If you do not want a stochastic stream, just pass a regular old list or array to the algorithm. It can also be easily extended to read streams from disk. Just open a file and pass the reference to the majority algorithm. In python, the code
f = open('myfile')
for item in f:
    # do something

is different than


f = open('myfile')
for item in f.readlines():
    # do something

The first code snippet will read from disk one line at a time, therefore memory requirements will be low, but at the cost of repeated disk accesses. This is slower, but if the data stream is so large that it does not fit into RAM, then this is a good solution. The second code snippet reads the entire file into memory first, then starts iterating, which is only good if the entire file can fit into RAM. 

No comments:

Post a Comment