Tuesday, April 29, 2014

Counting Words using MapReduce

This class counts words in a document.
#! /usr/bin/env python
import re
from mrjob.job import MRJob


class WordCount(MRJob):
    def mapper(self, key, value):
        # value is one line of input text; emit (word, 1) for each token
        for word in value.split():
            yield word.lower(), 1

    def reducer(self, key, value):
        # value iterates over every 1 emitted for this word
        yield key, sum(value)


if __name__ == '__main__':
    WordCount.run()
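
To see what happens between the mapper and the reducer, here is a rough sketch that imitates the same flow in plain Python, without mrjob. The function names (map_line, reduce_word, run_local) and the sample lines are made up for illustration; a real MapReduce run does the grouping step across machines, so treat this only as a mental model.

```python
from collections import defaultdict


def map_line(line):
    # Same logic as WordCount.mapper: one (word, 1) pair per whitespace-separated token.
    for word in line.split():
        yield word.lower(), 1


def reduce_word(key, values):
    # Same logic as WordCount.reducer: add up the 1s collected for one word.
    return key, sum(values)


def run_local(lines):
    # Stand-in for the shuffle step: group every mapper value under its key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    # One reducer call per distinct key.
    return dict(reduce_word(key, values) for key, values in grouped.items())


# Made-up sample input, not taken from the novel.
sample = ["Oh dear", "dear me, oh dear!"]
counts = run_local(sample)
print(counts)
```

Note that 'me,' and 'dear!' come out as their own keys, because splitting on whitespace leaves the punctuation attached.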

Assuming you have Jane Austen's Sense and Sensibility in a .txt file like I do, run the script from the command line and pass it the path to that file.

Each output line contains two tab-delimited elements: the word, and the count of that word. 
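
If you save that output to a file, reading it back takes only a few lines. This is a minimal sketch that assumes mrjob's default output format, in which the word and the count are each JSON-encoded (so the word appears in double quotes). The helper name parse_output_line and the two sample lines are invented for illustration and are not real counts from the novel.

```python
import json


def parse_output_line(line):
    # Split on the first tab only, then decode each JSON-encoded half.
    word_json, count_json = line.rstrip("\n").split("\t", 1)
    return json.loads(word_json), json.loads(count_json)


# Invented sample lines in the assumed format: "word"<TAB>count
sample_lines = ['"dear"\t3\n', '"sense"\t1\n']

parsed = [parse_output_line(line) for line in sample_lines]
most_common = max(parsed, key=lambda pair: pair[1])
print(parsed)
print(most_common)
```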

Oh, dear, that's a lot of 'dear's. As you can see, the mapper function simply defines a word as anything between two whitespace characters, so punctuation stays attached to it. Additional work can clean that up a bit. The following fix uses regular expressions (regex). The special sequence '\w' means 'a word character', which for ASCII text is a-z, A-Z, 0-9, or an underscore. The '+' means 'one or more'. Putting these together, '\w+' matches a run of one or more word characters, which comprises a word. The 'ws' variable contains all such matches in the 'word' string. So for example, if 'word' is "dear--sure!", then 'ws' holds "dear" and "sure".
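# Replacement for the mapper method of WordCount
def mapper(self, key, value):
    for word in value.split():
        # keep only the runs of word characters inside each token
        ws = re.findall(r'\w+', word)
        for w in ws:
            yield w.lower(), 1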
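
To check what the regex does outside of a MapReduce job, the short sketch below runs re.findall on the "dear--sure!" example and then compares the two tokenizing approaches on one line of text. The sample line and the variable names are made up for illustration.

```python
import re

# The example from the text: one token containing two words.
token = "dear--sure!"
ws = re.findall(r'\w+', token)
print(ws)

# Made-up sample line comparing the two mappers' tokenization.
line = "Oh, dear! My dear--sure!"

# First mapper: split on whitespace only, punctuation stays attached.
split_tokens = [t.lower() for t in line.split()]

# Second mapper: split on whitespace, then keep only runs of word characters.
regex_tokens = [w.lower() for t in line.split() for w in re.findall(r'\w+', t)]

print(split_tokens)
print(regex_tokens)
```

With whitespace splitting, 'dear!' and 'dear--sure!' are separate keys; with the regex, both contribute to the count for 'dear'.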



The occurrences of the word 'dear' and its kin have been stripped of their non-word characters. This gives a better count of the words themselves, but loses some of the peripheral information carried by the punctuation; that is, 'dear' does not equal 'dear!', yet both are now counted together.
