How to extract English words in Python: extracting English words from a string in Python

I have a document in which each line is a string. A line may contain digits, non-English letters and words, and symbols (such as ! and *). I want to extract the English words from each line (English words are separated by spaces).

My code is below; it is the map function of my map-reduce job. However, judging by the final result, this mapper only produces frequency counts for single letters (such as a, b, c). Can anyone help me find the bug? Thanks.

import sys
import re

for line in sys.stdin:
    line = re.sub("[^A-Za-z]", "", line.strip())
    line = line.lower()
    words = ' '.join(line.split())
    for word in words:
        print('%s\t%s' % (word, 1))

Solution

You've actually got two problems.

First, this:

line = re.sub("[^A-Za-z]", "", line.strip())

This removes all non-letters from the line. Which means you no longer have any spaces to split on, and therefore no way to separate it into words.
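
A quick check in the interpreter makes this visible. This is only a sketch: the sample line is made up for illustration and is not from the original data.

```python
import re

# Hypothetical sample line, not from the asker's document.
line = "Hello world, 123 foo!"

# Deleting every non-letter also deletes the spaces between the words.
stripped = re.sub("[^A-Za-z]", "", line.strip())

print(stripped)          # Helloworldfoo
print(stripped.split())  # ['Helloworldfoo']
```

With the spaces gone, split() has nothing to split on and returns the whole line as one "word".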

Next, even if you didn't do that, you do this:

words = ' '.join(line.split())

This doesn't give you a list of words; it gives you a single string, with all those words joined back together by single spaces. (Basically, the original line with all runs of whitespace converted into a single space.)

So, in the next line, when you do this:

for word in words:

You're iterating over a string, which means each word is a single character. Because that's what strings are: iterables of characters.
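
The difference between the list and the re-joined string can be seen in a small sketch. The sample line and the variable names words_list and joined are assumptions chosen for illustration.

```python
# Hypothetical sample line with irregular whitespace.
line = "foo   bar baz"

words_list = line.split()        # a list of words
joined = ' '.join(line.split())  # a single string again

print(words_list)  # ['foo', 'bar', 'baz']
print(joined)      # foo bar baz

# Iterating over a string yields one character at a time, spaces included.
chars = [ch for ch in joined]
print(chars[:4])   # ['f', 'o', 'o', ' ']
```

In the question's code the spaces have already been removed by re.sub, so the loop sees only letters, which matches the per-letter counts in the output.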

If you want each word (as your variable names imply), you already had those; the problem is that you joined them back into a string. Just skip the join:

words = line.split()
for word in words:
    print('%s\t%s' % (word, 1))

Or, if you want to strip out things besides letters and whitespace, use a regular expression that strips out everything besides letters and whitespace, not one that strips out everything besides letters, including whitespace:

line = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = line.split()
for word in words:
    print('%s\t%s' % (word, 1))
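
As a rough check of the whitespace-preserving pattern, here it is applied to a made-up line (the input is an assumption, not the asker's data):

```python
import re

# Hypothetical sample line mixing punctuation, digits and a symbol inside a token.
line = "Hello, world! 42 foo*bar"

# Keep letters and whitespace, delete everything else.
cleaned = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = cleaned.split()

print(words)  # ['Hello', 'world', 'foobar']
```

Note that foo*bar collapses into foobar, because the * is deleted rather than treated as a separator.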

However, that pattern is still probably not what you want. Do you really want to turn 'abc1def' into the single string 'abcdef', or into the two strings 'abc' and 'def'? You probably want either this:

line = re.sub(r"[^A-Za-z]", " ", line.strip())
words = line.split()
for word in words:
    print('%s\t%s' % (word, 1))

… or just:

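words = [w for w in re.split(r"[^A-Za-z]+", line.strip()) if w]
for word in words:
    print('%s\t%s' % (word, 1))

(The + and the "if w" filter are needed because re.split returns empty strings wherever the line starts or ends with a non-letter, or, without the +, wherever two non-letters are adjacent.)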
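
Putting the pieces together, one possible shape for the whole mapper is sketched below. It swaps in re.findall, which returns only the runs of letters and so never produces empty strings. The names WORD_RE, extract_words and map_lines are hypothetical helpers introduced for this sketch, and the demo input is made up; in the real job you would pass sys.stdin to map_lines instead of the sample list.

```python
import re

# A "word" here is assumed to be a run of ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]+")

def extract_words(line):
    """Return the lowercased runs of ASCII letters found in line."""
    return [w.lower() for w in WORD_RE.findall(line)]

def map_lines(lines):
    """Yield one 'word<TAB>1' record per extracted word."""
    for line in lines:
        for word in extract_words(line):
            yield '%s\t%s' % (word, 1)

# Demo on made-up input; the second line contains no letters at all.
sample = ["abc1def  Hello!", "** 42 **"]
for record in map_lines(sample):
    print(record)
```

Here 'abc1def' is split into 'abc' and 'def'. If you would rather treat it as the single word 'abcdef', delete the unwanted characters first, as in the whitespace-preserving version above.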
