I have a document in which each line is a string. A line may contain digits, non-English letters and words, and symbols (such as ! and *). I want to extract the English words from each line (English words are separated by spaces).
My code below is the map function of my map-reduce job. However, judging by the final result, this mapper only produces frequency counts for single letters (such as a, b, c). Can anyone help me find the bug? Thanks.
import sys
import re
for line in sys.stdin:
    line = re.sub("[^A-Za-z]", "", line.strip())
    line = line.lower()
    words = ' '.join(line.split())
    for word in words:
        print '%s\t%s' % (word, 1)
Solution
You've actually got two problems.
First, this:
line = re.sub("[^A-Za-z]", "", line.strip())
This removes all non-letters from the line, spaces included. That means you no longer have any whitespace to split on, and therefore no way to separate the line into words.
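To see the effect concretely, here is a small check you could run in the interpreter. The sample line is made up for illustration and is not from the original data:

```python
import re

# Hypothetical sample line, not from the asker's input.
sample = "Hello, world! 42 times"

# The original pattern deletes every non-letter, spaces included.
stripped = re.sub("[^A-Za-z]", "", sample.strip())
print(stripped)          # Helloworldtimes
print(stripped.split())  # ['Helloworldtimes']
```

With the spaces gone, split() has nothing to split on and returns the whole line as one "word".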
Next, even if you didn't do that, you do this:
words = ' '.join(line.split())
This doesn't give you a list of words; it gives you a single string, with all those words joined back together. (Basically, the original line with every run of whitespace converted into a single space.)
So, in the next line, when you do this:
for word in words:
You're iterating over a string, which means each word is a single character. Because that's what strings are: iterables of characters.
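The difference between iterating over the joined string and over the list from split() can be sketched like this (the sample line is an assumption chosen for illustration):

```python
# Hypothetical sample line with irregular spacing.
line = "the  quick   fox"

joined = ' '.join(line.split())  # a str, not a list
print(type(joined).__name__)     # str
print(joined)                    # the quick fox

# Iterating over a str yields one character at a time, spaces included.
print(list(joined)[:5])          # ['t', 'h', 'e', ' ', 'q']

# Iterating over the list from split() yields whole words.
print(line.split())              # ['the', 'quick', 'fox']
```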
If you want each word (as your variable names imply), you already had them; the problem is that you joined them back into a string. Just skip the join:
words = line.split()
for word in words:
Or, if you want to strip out things besides letters and whitespace, use a regular expression that strips out everything besides letters and whitespace, not one that strips out everything besides letters, including whitespace:
line = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = line.split()
for word in words:
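As a quick check of the whitespace-preserving pattern, again on a made-up sample line:

```python
import re

# Hypothetical sample line, not from the asker's input.
sample = "Hello, world! 42 times"

# Keep letters and whitespace; delete everything else.
cleaned = re.sub(r"[^A-Za-z\s]", "", sample.strip())
print(repr(cleaned))    # 'Hello world  times'
print(cleaned.split())  # ['Hello', 'world', 'times']
```

The digits disappear but the spaces around them survive, so split() can still find the word boundaries.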
However, that pattern is still probably not what you want. Do you really want to turn 'abc1def' into the single string 'abcdef', or into the two strings 'abc' and 'def'? You probably want either this:
line = re.sub(r"[^A-Za-z]", " ", line.strip())
words = line.split()
for word in words:
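… or just split on runs of non-letters, dropping the empty strings that re.split leaves at the edges of the line:
words = [w for w in re.split(r"[^A-Za-z]+", line.strip()) if w]
for word in words: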
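Putting the pieces together, a minimal sketch of the whole mapper might look like the following. The helper names extract_words and run_mapper are hypothetical, and the sketch uses re.findall to collect the letter runs directly, which is a swapped-in alternative to the re.split call above rather than the code from the question. The print call is written so that it works on both Python 2 and Python 3.

```python
import re
import sys

def extract_words(line):
    """Return the lowercase runs of ASCII letters found in line."""
    return re.findall(r"[A-Za-z]+", line.lower())

def run_mapper(stream=sys.stdin):
    """Emit 'word<TAB>1' for every word, as in the question's mapper."""
    for line in stream:
        for word in extract_words(line):
            print('%s\t%s' % (word, 1))

# Deleting non-letters glues fragments together;
# replacing them with a space keeps the fragments apart.
print(re.sub(r"[^A-Za-z]", "", "abc1def").split())   # ['abcdef']
print(re.sub(r"[^A-Za-z]", " ", "abc1def").split())  # ['abc', 'def']
print(extract_words("Hello, world! abc1def"))        # ['hello', 'world', 'abc', 'def']
```

To use this as the actual map step, call run_mapper() at the bottom of the script so that it reads from standard input.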