The algorithm first initializes the counter to 0 and the majority candidate to null, then examines each item in the stream. If the counter is 0, the current item becomes the majority candidate and the counter is set to 1. Otherwise, if the item matches the current candidate, the counter is incremented by 1; if it does not match, the counter is decremented by 1.
For testing, I have created a little stream whose majority item is the integer 5.
#!/usr/bin/env python
from random import choice

def find_majority(data_stream):
    majority = None
    counter = 0
    for item in data_stream:
        if item == majority:
            counter += 1
        else:
            if counter == 0:
                majority = item
                counter = 1
            else:
                counter -= 1
        print('item:', item, 'majority:', majority, 'counter:', counter)
    return majority

def stream(items, maxi=2000):
    i = 0
    while True:
        yield choice(items)
        i += 1
        if i > maxi:
            break

if __name__ == '__main__':
    l = list(range(10)) + [5] * 10
    print('stream:', l)
    s = stream(l, len(l))
    maj = find_majority(s)
    print('majority:', maj)
Since 5 is only barely the majority in this stream, the counter hovers around 0 and 1, but the final result is still correct. If 5 were the standout majority, the counter would climb steadily instead.
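Because stream() draws items at random, its output varies between runs, so a deterministic sketch may be clearer. The helper below (my addition, not from the original post) mirrors the loop in find_majority without the prints, and adds a second counting pass, since the single-pass algorithm only guarantees a correct answer when a strict majority actually exists.

```python
# Minimal copy of the majority-vote loop from the listing above,
# without the per-item prints.
def majority_candidate(data_stream):
    majority, counter = None, 0
    for item in data_stream:
        if item == majority:
            counter += 1
        elif counter == 0:
            majority, counter = item, 1
        else:
            counter -= 1
    return majority

def is_true_majority(data, candidate):
    # Second pass: confirm the candidate occurs more than len(data)/2 times.
    return data.count(candidate) > len(data) // 2

data = [5, 1, 5, 2, 5, 5, 3, 5]            # 5 occurs 5 of 8 times
cand = majority_candidate(iter(data))
print(cand, is_true_majority(data, cand))   # 5 True

no_majority = [1, 2, 3, 1, 2, 3]            # no item exceeds half
cand2 = majority_candidate(iter(no_majority))
print(cand2, is_true_majority(no_majority, cand2))  # 2 False
```

The second example shows why the verification pass matters: with no strict majority present, the algorithm still returns some candidate, and only the recount reveals that it is not a true majority.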
Note that the algorithm only guarantees a correct answer when a true majority exists, i.e., when some item occurs more than n/2 times. If the counter is 0 at the end, the last candidate seen may have occurred at most n/2 times, so a second pass over the data is needed to verify the result. Also note that the stream() function above is stochastic: it takes a list and emits random values drawn from that list. If you do not want a stochastic stream, just pass a regular list or array to the algorithm. It can also be easily extended to read streams from disk: simply open a file and pass the file object to the majority algorithm. In Python, the code
f = open('myfile')
for item in f:
    # do something

is different from

f = open('myfile')
for item in f.readlines():
    # do something
The first code snippet reads from disk one line at a time, so its memory requirements stay low, at the cost of repeated disk accesses. This can be slower, but if the data stream is so large that it does not fit into RAM, it is a good solution. The second snippet reads the entire file into memory first and then iterates, which is only viable when the whole file fits into RAM.
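To make the file-based variant concrete, here is a small self-contained sketch (the temporary file and its one-item-per-line format are my assumptions for the demo): the open file object is passed directly to the majority loop, which never holds more than one line in memory.

```python
import tempfile

# Minimal copy of the majority-vote loop from the listing above,
# without the per-item prints.
def majority_candidate(data_stream):
    majority, counter = None, 0
    for item in data_stream:
        if item == majority:
            counter += 1
        elif counter == 0:
            majority, counter = item, 1
        else:
            counter -= 1
    return majority

# Write a throwaway file with one item per line (this format is an
# assumption for the demo, not from the original post).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(['5', '1', '5', '2', '5', '5', '3', '5']))
    path = f.name

# The file object yields one line at a time, so memory use stays low
# no matter how large the file is.
with open(path) as f:
    maj = majority_candidate(line.strip() for line in f)

print('majority:', maj)  # majority: 5
```

Note that the items arrive as strings here; if the stream holds numbers, parse each line (e.g. with int()) inside the generator expression.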

