#! /usr/bin/env python
import re

from mrjob.job import MRJob


class WordCount(MRJob):

    def mapper(self, key, value):
        for word in value.split():
            yield word.lower(), 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    WordCount.run()
Assuming you have Jane Austen's Sense and Sensibility in a .txt file, as I do, call it like so:
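The exact command depends on your filenames; a plausible invocation, assuming the script is saved as word_count.py and the novel as sense_and_sensibility.txt (both names illustrative), looks like this:

```shell
# Run the job locally and capture the results; mrjob writes one
# key/value pair per line to stdout.
python word_count.py sense_and_sensibility.txt > counts.txt
```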
Each output line contains two tab-delimited elements: the word, and the count of that word.
Oh, dear, that's a lot of 'dear's. As you can see, the mapper simply defines a word as anything between two space characters, so punctuation sticks to the words it touches. A little additional work can clean that up. The following fix uses a regular expression (regex). The special sequence '\w' means 'a word character': a-z, A-Z, 0-9, or the underscore. The '+' means 'one or more'. Putting these together, '\w+' matches one or more word characters, which is a reasonable working definition of a word. The 'ws' variable holds every match found in the 'word' string. For example, if 'word' is "dear--sure!", then 'ws' contains each run of word characters, which here is "dear" and "sure".
    def mapper(self, key, value):
        for word in value.split():
            ws = re.findall(r'\w+', word)
            for w in ws:
                yield w.lower(), 1
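To see the regex doing its work on the example above, here is a quick check in plain Python, independent of mrjob:

```python
import re

# The mapper's cleanup step: pull runs of word characters out of a token.
# Punctuation such as '--' and '!' is dropped, leaving just the words.
words = re.findall(r'\w+', "dear--sure!")
print(words)  # → ['dear', 'sure']
```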
The occurrences of the word 'dear' and its kin have now been stripped of their non-word characters, giving a better count of the words themselves, but losing some of the peripheral information carried by the punctuation; that is, 'dear' is not the same as 'dear!'


