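Counting words in Python: Python exercise 3, count the number of words

Exercise 3: Given a plain-text file in English, count the number of words in it.

What text should we count? Let's try the Python Easter egg, import this. (Save the text below as "test.txt".)

>>> import this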

The Zen of Python, by Tim Peters

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Sparse is better than dense.

Readability counts.

Special cases aren't special enough to break the rules.

Although practicality beats purity.

Errors should never pass silently.

Unless explicitly silenced.

In the face of ambiguity, refuse the temptation to guess.

There should be one-- and preferably only one --obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.

Now is better than never.

Although never is often better than *right* now.

If the implementation is hard to explain, it's a bad idea.

If the implementation is easy to explain, it may be a good idea.

Namespaces are one honking great idea -- let's do more of those!

1. Analysis (the findall method of Python's re module returns all matching substrings as a list.)

Reference: the re module, re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
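To make the rule about groups concrete, here is a small sketch. The sample string and the three patterns are my own illustration, not taken from the documentation.

```python
import re

text = "a1 b22 c333"

# No groups: the full text of each match is returned.
no_group = re.findall(r"[a-z]\d+", text)

# One group: only the group's text is returned for each match.
one_group = re.findall(r"[a-z](\d+)", text)

# Two groups: each match becomes a tuple with one item per group.
two_groups = re.findall(r"([a-z])(\d+)", text)

print(no_group)    # ['a1', 'b22', 'c333']
print(one_group)   # ['1', '22', '333']
print(two_groups)  # [('a', '1'), ('b', '22'), ('c', '333')]
```

For word counting we want the full match, so the patterns used later contain no groups.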

2. Experiments

First, let's see how re.findall is used.

>>> import re

>>> re.findall('cat','dog cat dog')

['cat']

>>> re.findall('3','2,4,6,ss,gg')

[]

>>> re.findall('3','2,4,6,ss,gg,3')

['3']

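As you can see, re.findall returns every substring that matches the pattern.

The next step is to find an expression that represents letters.

Square brackets specify a character range, and several ranges can be combined. For example, the regular expression [A-Za-z] matches any single letter, upper or lower case; [A-Za-z][A-Za-z]* matches one letter followed by zero or more letters (upper or lower case). The metacharacter + does the same job: [A-Za-z]+ is exactly equivalent to [A-Za-z][A-Za-z]*. Note, however, that not every program that supports regular expressions supports the + metacharacter; Python's re module does.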
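A quick check of that equivalence, as a minimal sketch on one line of the Zen (the variable names are my own):

```python
import re

sample = "Beautiful is better than ugly."

# One letter followed by zero or more letters ...
star_form = re.findall(r"[A-Za-z][A-Za-z]*", sample)
# ... finds the same substrings as one or more letters.
plus_form = re.findall(r"[A-Za-z]+", sample)

print(plus_form)               # ['Beautiful', 'is', 'better', 'than', 'ugly']
print(star_form == plus_form)  # True
```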

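[^a-zA-Z] is, simply put, any single non-letter character. It can match any character other than a letter, but only one of them, not several.

To match several non-letter characters, add a quantifier after it, for example:

[^a-zA-Z]+ means one or more non-letter characters

[^a-zA-Z]{5,10} means 5 to 10 non-letter characters

^[a-z] matches text that begins with a lowercase letter; [^a-z] matches any single character that is not a lowercase letter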
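The difference between these forms is easy to see on a short string. This is only a sketch: the sample string is my own, and I use {2,3} instead of {5,10} so that the example stays small.

```python
import re

sample = "one-- and 42 ways"

# A single non-letter character per match.
single = re.findall(r"[^a-zA-Z]", sample)

# One or more non-letter characters per match (whole runs).
runs = re.findall(r"[^a-zA-Z]+", sample)

# Between 2 and 3 non-letter characters per match.
bounded = re.findall(r"[^a-zA-Z]{2,3}", sample)

# Outside brackets, ^ anchors the match to the start of the text.
starts_lower = re.findall(r"^[a-z]", sample)
starts_upper = re.findall(r"^[a-z]", "One way")

print(single)        # ['-', '-', ' ', ' ', '4', '2', ' ']
print(runs)          # ['-- ', ' 42 ']
print(bounded)       # ['-- ', ' 42']
print(starts_lower)  # ['o']
print(starts_upper)  # []
```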

test.txt contains no digits, so we can use [^a-zA-Z] to describe everything that is not part of a word.

3. Code

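import re

def count(filepath):
    # Read the whole file; the with-statement closes it afterwards.
    with open(filepath, 'r') as f:
        s = f.read()
    # Note: [^a-zA-Z]+ matches runs of NON-letter characters, so this
    # counts the separators between words, not the words themselves.
    words = re.findall(r'[^a-zA-Z]+', s)
    return len(words)

if __name__ == '__main__':
    num = count('test.txt')
    print(num)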

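This version gives 208. I also found several other versions posted by more experienced people, which are worth a look, but their final results are different.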

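import re

with open('test.txt', 'r') as f:
    data = f.read()

# Split on every single non-letter character, then drop the empty
# strings that appear between adjacent separators.
result = re.split(r"[^a-zA-Z]", data)
print(len([x for x in result if x != '']))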

A very concise version. The result is 149, and it uses re.split.

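import re

with open('test.txt', 'r') as f:
    data = f.read()

# Note: this pattern matches runs of non-letter characters, so the
# number printed is a count of separator runs in the whole file.
result = re.findall(r"[^a-zA-Z]+", data)
print("the number of words in the file is: %s" % len(result))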

The result is 150?

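import re

def get_num():
    num = 0
    # Count matches line by line; the with-statement closes the file.
    with open('test.txt', 'r') as f:
        for line in f:
            # Runs of non-letter characters within this line only.
            num += len(re.findall(r'[^a-zA-Z]+', line))
    return num

if __name__ == '__main__':
    print(get_num())

The result is 174.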

So here is the question: why are the results different?

More solutions: zhangslob/RE
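One likely explanation, offered as a sketch rather than a definitive diagnosis: the versions above do not count the same thing. re.findall with [^a-zA-Z]+ counts runs of separators, not words. The re.split version, after dropping empty strings, counts runs of letters. The line-by-line version counts separator runs per line, so a run that spans a line break (a full stop followed by blank lines, say) is counted more than once. The sample text below is my own small stand-in for test.txt, and the last pattern, which keeps apostrophes inside a word, is just one possible definition of "word". I have not re-run the author's file, so this does not account for the 208.

```python
import re

# A tiny stand-in for test.txt (an assumption, not the author's file).
sample = ">>> import this\nReadability counts.\n\nSpecial cases aren't special enough.\n"

# Letter runs: what the re.split version effectively counts.
letter_runs = [x for x in re.split(r"[^a-zA-Z]", sample) if x != ""]

# Separator runs over the whole text: what findall with [^a-zA-Z]+ counts.
separator_runs = re.findall(r"[^a-zA-Z]+", sample)

# Separator runs counted line by line: a run that spans a line break
# (such as ".\n\n") is split up and counted once per line.
per_line = sum(len(re.findall(r"[^a-zA-Z]+", line))
               for line in sample.splitlines(keepends=True))

# One possible word pattern that keeps "aren't" as a single word.
words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", sample)

print(len(letter_runs))     # 10
print(len(separator_runs))  # 11
print(per_line)             # 12
print(len(words))           # 9
```

The first three numbers follow the same pattern as 149, 150 and 174 above: the whole-file separator count is one more than the letter-run count because the text both starts and ends with a separator, and the line-by-line count is higher again because of the line breaks.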


References: re; Regular expression (Baidu Baike). (Written in a hurry, so there may be mistakes; corrections are welcome.)
