Sunday, April 27, 2014

Counting Letter Frequencies with MapReduce

This code shows how to use the mrjob library in Python to count the occurrences of each letter in a document.
#!/usr/bin/env python

from mrjob.job import MRJob


class LetterCount(MRJob):
    def mapper(self, key, value):
        # value is one line of input; emit (letter, 1) for each character of each word
        for word in value.split():
            for letter in word:
                yield letter.lower(), 1

    def reducer(self, key, values):
        # values iterates over every count emitted for this letter
        yield key, sum(values)


if __name__ == '__main__':
    LetterCount.run()
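To see what the mapper and reducer are doing without running mrjob at all, here is a minimal plain-Python sketch of the same map, shuffle, and reduce steps. The function names (map_letters, shuffle, reduce_counts, count_letters) are made up for this illustration and are not part of mrjob; the real framework handles the grouping step for you.

```python
from collections import defaultdict


def map_letters(line):
    """Mimic the mapper: emit (letter, 1) for every character of every word."""
    for word in line.split():
        for letter in word:
            yield letter.lower(), 1


def shuffle(pairs):
    """Mimic the shuffle step: collect all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Mimic the reducer: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_letters(lines):
    """Run the three steps in sequence over an iterable of input lines."""
    pairs = (pair for line in lines for pair in map_letters(line))
    return reduce_counts(shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical two-line input, just to show the shape of the result.
    print(count_letters(["Tree", "tea"]))
```

For the two-line input above, 't' is counted twice, 'e' three times, and 'r' and 'a' once each. Like the job itself, this counts every non-whitespace character, so punctuation would be counted too if the input contained any.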
The input and output files are passed in with shell redirection on the command line: '< dict.txt' feeds the input to the script and '> lettercounts.txt' captures the output. Here dict.txt is just a text file with about 118,000 words, one per line. The '<' and '>' symbols are the redirection operators for stdin and stdout, respectively. In the output, 'e' is the most frequent letter, a fact that makes code breaking slightly easier, since frequency analysis of a simple substitution cipher starts from the most common letters.
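If you want to rank the letters once the job has finished, something like the following sketch could read lettercounts.txt back in. It assumes mrjob's default output protocol, which writes each result as a JSON-encoded key and value separated by a tab; if the job sets a different output protocol, the parsing would need to change. The helper names parse_counts and most_frequent are made up for this example, and the sample lines are invented to show the format, not real results.

```python
import json


def parse_counts(lines):
    """Parse lines of the form '"e"<TAB>123' into a dict of letter -> count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: key and value are each JSON-encoded and tab-separated.
        key_json, value_json = line.split("\t", 1)
        counts[json.loads(key_json)] = json.loads(value_json)
    return counts


def most_frequent(counts, n=5):
    """Return the n (letter, count) pairs with the highest counts."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    # Invented sample lines; with real output you would instead do:
    #   with open("lettercounts.txt") as f:
    #       counts = parse_counts(f)
    sample = ['"a"\t3', '"e"\t7', '"t"\t5']
    print(most_frequent(parse_counts(sample), n=2))
```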
