How to extract English words in Python: extracting English words from a string in Python

Latest recommended article published on 2023-04-02 08:25:30

offer大虾

I have a document in which each line is a string. A line may contain digits, non-English letters and words, and symbols (such as ! and *). I want to extract the English words from each line (English words are separated by spaces).

My code is below; it is the map function of my map-reduce job. However, judging by the final result, this mapper only produces a frequency count of single letters (such as a, b, c). Can anyone help me find the bug? Thanks.

import sys
import re

for line in sys.stdin:
    line = re.sub("[^A-Za-z]", "", line.strip())
    line = line.lower()
    words = ' '.join(line.split())
    for word in words:
        print '%s\t%s' % (word, 1)

Solution

You've actually got two problems.

First, this:

line = re.sub("[^A-Za-z]", "", line.strip())

This removes every non-letter from the line, spaces included. That means you no longer have any whitespace to split on, and therefore no way to separate the line into words.
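A quick check makes the effect visible. This is only a sketch: the sample line is made up for illustration, and the snippet is written so that it also runs under Python 3.

```python
import re

# Hypothetical sample line, not taken from the original question.
sample = "Hello, world! 123 foo"

# The original pattern deletes every non-letter, including the spaces.
stripped = re.sub("[^A-Za-z]", "", sample.strip())
print(stripped)          # Helloworldfoo
print(stripped.split())  # ['Helloworldfoo']
```

With the spaces gone, split() has nothing to split on and returns the whole line as one "word".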

Next, even if you didn't do that, you do this:

words = ' '.join(line.split())

This doesn't give you a list of words; it gives you a single string, with all those words joined back together. (Basically, the original line with every run of whitespace converted into a single space.)

So, in the next line, when you do this:

for word in words:

You're iterating over a string, which means each word is a single character. Because that's what strings are: iterables of characters.
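The difference between the joined string and the list of words can be seen in a short experiment. The input line here is an invented example, and the snippet is written so that it also runs under Python 3.

```python
line = "the  quick brown"   # hypothetical input with a double space

# Joining the split result produces one string again.
words = ' '.join(line.split())
print(repr(words))        # 'the quick brown'
print(list(words)[:5])    # ['t', 'h', 'e', ' ', 'q']

# Splitting without re-joining gives the list of words instead.
print(line.split())       # ['the', 'quick', 'brown']
```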

If you want each word (as your variable names imply), you already had them; the problem is that you joined them back into a string. Just skip the join:

words = line.split()
for word in words:
    print '%s\t%s' % (word, 1)

Or, if you want to strip out everything except letters and whitespace, use a regular expression that keeps whitespace as well as letters, not one that removes the whitespace along with everything else:

line = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = line.split()
for word in words:
    print '%s\t%s' % (word, 1)

However, that pattern is still probably not what you want. Do you really want to turn 'abc1def' into the single string 'abcdef', or into the two strings 'abc' and 'def'? You probably want either this:

line = re.sub(r"[^A-Za-z]", " ", line.strip())
words = line.split()
for word in words:
    print '%s\t%s' % (word, 1)

… or just:

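# Split on runs of non-letters and drop the empty strings that
# re.split leaves when the line starts or ends with a non-letter.
words = [w for w in re.split(r"[^A-Za-z]+", line.strip()) if w]
for word in words:
    print '%s\t%s' % (word, 1)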
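Another way to get the words is to match runs of letters directly instead of splitting on everything else. The following is a minimal sketch of a mapper built that way, not the original poster's code: the helper names extract_words and map_lines and the sample line are assumptions made for illustration, and it is written so that it also runs under Python 3.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")  # a run of one or more ASCII letters

def extract_words(line):
    """Return the ASCII-letter runs in a line, lowercased (hypothetical helper)."""
    return [w.lower() for w in WORD_RE.findall(line)]

def map_lines(lines):
    """Yield one 'word<TAB>1' record per extracted word (hypothetical helper)."""
    for line in lines:
        for word in extract_words(line):
            yield "%s\t%d" % (word, 1)

# Made-up sample input; a real mapper would pass sys.stdin instead.
sample = ["abc1def  Hello, WORLD! *42*"]

# Splitting on one non-letter at a time leaves empty strings behind.
print(re.split(r"[^A-Za-z]", sample[0].strip()))

# Matching the letter runs directly yields only the words.
print(extract_words(sample[0]))   # ['abc', 'def', 'hello', 'world']

for record in map_lines(sample):
    print(record)
```

Keep in mind that a letters-only pattern also keeps any ASCII letters that happen to appear inside non-English words, so it selects by alphabet, not by language.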