Baidu Pilot Team Zero-Basics Python: [Function Basics (Major Assignment)] Walkthrough

Latest recommended article published on 2021-02-10 13:08:47

catltr

Article link: https://blog.csdn.net/catltr/article/details/113770736


This assignment comes from the Baidu PaddlePaddle Pilot Team's zero-basics Python course.
Course link: https://aistudio.baidu.com/aistudio/course/introduce/7073

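Assignment

Count the frequency of every word in a CET-6 (College English Test Band 6) exam paper and return a dictionary in the following form:

{'and':100,'abandon':5}

The exam text is located at ./artical.txt

Tip: how to read the file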

def get_artical(artical_path):
    with open(artical_path) as fr:
        data = fr.read()
    return data

get_artical('./artical.txt')

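Processing requirements

(a) '\n' is the newline character and must be removed
(b) Punctuation must be handled
['.', ',', '!', '?', ';', '\'', '\"', '/', '-', '(', ')']
(c) Arabic numerals must be handled
['1','2','3','4','5','6','7','8','9','0']
(d) Mind the case: some words are capitalized because they start a sentence, so every word must be converted to lowercase
'String'.lower()
(e) Bonus
Look up and learn regular expressions on your own, and use them (the re module) in your code

Reference: https://docs.python.org/3.7/library/re.html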

Approximate contents of artical.txt

[Screenshot of artical.txt in the original post]

Thought process

Dependencies

The requirements call for regular expressions, so the re module is needed.

import re

Reading the file

def get_artical(artical_path):
    with open(artical_path) as fr:
        data = fr.read()
    return data

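This uses the read method:
read([size]) reads up to size characters from the current position of a file opened in text mode (bytes in binary mode). If size is omitted, it reads to the end of the file. In text mode the result is a str object.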
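To make the behaviour of read concrete, here is a minimal sketch. It writes a throwaway temporary file instead of using artical.txt, and the sample text and variable names are assumptions made purely for illustration.

```python
import os
import tempfile

# Create a small temporary file so the example does not depend on artical.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Hello world\nSecond line\n")
    sample_path = tmp.name

with open(sample_path, encoding="utf-8") as fr:
    first_five = fr.read(5)   # reads 5 characters from the current position
    rest = fr.read()          # no size given: reads everything that is left

os.remove(sample_path)

print(repr(first_five))  # 'Hello'
print(repr(rest))        # ' world\nSecond line\n'
```

Passing encoding explicitly is optional, but it avoids depending on the platform's default encoding when the file contains non-ASCII characters.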

Applying the processing requirements

raw_artical = get_artical('./artical.txt')
print(raw_artical)

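Output:
The places where the printed text wraps (marked with a red box in the original screenshot) are \n characters, which need to be removed.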

raw_artical = re.sub(r'[\d\n\.\-,!\?")(:;\/]+', ' ', raw_artical.lower())

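Python's re module provides re.sub for replacing matches in a string.

Syntax:
re.sub(pattern, repl, string, count=0, flags=0)
Parameters:

pattern : the regular expression pattern.
repl : the replacement string; it can also be a function.
string : the original string to search and replace in.
count : the maximum number of replacements to make; the default 0 replaces every match.
flags : optional regex flags (for example re.IGNORECASE); the default is 0.

This one call removes digits (\d), newlines (\n) and the other special characters, and lowercases everything. The matches are replaced with a space rather than an empty string because deleting \n, or the - in a-leave, would glue the neighbouring words together. Note that the character class also includes : and does not include the apostrophe, so contractions such as don't stay as a single token.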
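The difference between replacing with a space and replacing with an empty string can be seen in a short sketch. The sample sentence is made up for illustration; the pattern is the one used above.

```python
import re

sample = "Read-only files.\nDon't stop 2 times!"
pattern = r'[\d\n\.\-,!\?")(:;\/]+'

# Replace each run of unwanted characters with a single space.
cleaned = re.sub(pattern, ' ', sample.lower())
print(cleaned.split())  # ['read', 'only', 'files', "don't", 'stop', 'times']

# Replace with an empty string instead: neighbouring words get glued together.
glued = re.sub(pattern, '', sample.lower())
print(glued.split())    # ['readonly', "filesdon't", 'stop', 'times']
```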

Collecting the distinct words

raw_artical_set = set(re.split(r'\s+', raw_artical))
print(raw_artical_set)

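Python's re module provides re.split for splitting a string; its usage is similar to re.sub and to the string split method.
A set is used to remove duplicates.

The result contains an empty string ''

{'', 'only' ... }

This happens because the string has whitespace at its start and end.
Stripping that whitespace with str.strip() solves the problem.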

raw_artical_set = set(re.split(r'\s+', raw_artical.strip()))
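A minimal sketch of where the empty string comes from, using a made-up sample string. It also shows that str.split() with no arguments discards leading and trailing whitespace on its own, which is an alternative to the re.split plus strip combination used here.

```python
import re

text = " only the  best "

# Leading and trailing whitespace produce empty strings at both ends.
print(re.split(r'\s+', text))          # ['', 'only', 'the', 'best', '']

# Stripping first removes them.
print(re.split(r'\s+', text.strip()))  # ['only', 'the', 'best']

# str.split() without arguments splits on whitespace runs and drops empty strings.
print(text.split())                    # ['only', 'the', 'best']
```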

Counting word frequencies

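words = {}
for word in raw_artical_set:
    words[word] = len(re.findall(rf'(?<!\w){re.escape(word)}(?!\w)', raw_artical))

re.findall finds every non-overlapping match of the pattern and returns the matches as a list.
The pattern (?<!\w){word}(?!\w) matches the word only when it is neither preceded nor followed by a word character, so it also matches at the very start or end of the string and does not count sub inside submit. Lookarounds are used instead of (?:[^\w]|^){word}(?:[^\w]|$) because that form consumes the neighbouring character and therefore misses the second of two occurrences separated by a single space (for example "the the"). The rf prefix keeps \w from being read as a string escape, and re.escape guards against words that contain regex metacharacters.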
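The following sketch compares a pattern that matches the neighbouring characters, such as (?:[^\w]|^){word}(?:[^\w]|$), with one that only asserts them through lookarounds. The sample text and variable names are assumptions chosen for illustration.

```python
import re

text = "the the cat sat on the mat"
word = "the"

# Consumes the character before and after the word as part of the match.
consuming = rf'(?:[^\w]|^){re.escape(word)}(?:[^\w]|$)'

# Only asserts that no word character sits before or after the word.
lookaround = rf'(?<!\w){re.escape(word)}(?!\w)'

print(len(re.findall(consuming, text)))   # 2
print(len(re.findall(lookaround, text)))  # 3

# Neither pattern counts "the" inside a longer word.
print(len(re.findall(lookaround, "then other")))  # 0
```

The consuming pattern returns 2 because the space after the first "the" is already part of the first match, so the second "the" has no preceding non-word character left to match.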

Alternatively, the list returned by re.split() can be counted directly:

raw_artical_list = re.split(r'\s+', raw_artical.strip())

# Count how many times each distinct word occurs
words = {}
for word in set(raw_artical_list):
    words[word] = raw_artical_list.count(word)
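As a possible variation that is not part of the original post, collections.Counter from the standard library tallies all words in a single pass over the list instead of calling count once per distinct word. The helper name count_words and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def count_words(text):
    """Return a dict mapping each lowercase word to its frequency."""
    # Same cleaning step as in the post: lowercase, then replace digits,
    # newlines and punctuation with spaces.
    cleaned = re.sub(r'[\d\n\.\-,!\?")(:;\/]+', ' ', text.lower())
    tokens = cleaned.split()       # split() also drops leading/trailing whitespace
    return dict(Counter(tokens))   # one pass over the token list

sample = "And then, and only then: abandon ship!"
print(count_words(sample))
# {'and': 2, 'then': 2, 'only': 1, 'abandon': 1, 'ship': 1}
```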

See the complete code for the full implementation.

Complete code

import re

def get_artical(artical_path):
    with open(artical_path) as fr:
        data = fr.read()
    return data

raw_artical = get_artical('./artical.txt')

raw_artical = re.sub(r'[\d\n\.\-,!\?")(:;\/]+', ' ', raw_artical.lower())

words = {}
for word in set(re.split(r'\s+', raw_artical.strip())):
    # Lookarounds do not consume neighbouring characters, so repeated words are all counted
    words[word] = len(re.findall(rf'(?<!\w){re.escape(word)}(?!\w)', raw_artical))

print(words)

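This basically works, but it takes far too long to run.
Replacing the per-word regex search with a count over the split word list gives the following version: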

import re
# Complete the code in the area below according to the processing requirements.
def get_artical(artical_path):
    with open(artical_path) as fr:
        data = fr.read()
    return data

# get_artical() is a custom function that reads the exam text at the given path.
raw_artical = get_artical('./artical.txt')

# Lowercase, then remove newlines, punctuation and digits
raw_artical = re.sub(r'[\d\n\.\-,!\?")(:;\/]+', ' ', raw_artical.lower())

# Get all the words
raw_artical_list = re.split(r'\s+', raw_artical.strip())

# Count how many times each distinct word occurs
words = {}
for word in set(raw_artical_list):
    words[word] = raw_artical_list.count(word)

print(words)

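That is far from a small improvement: the bottleneck in the first version was running a separate regex search over the whole text for every distinct word.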
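A rough way to compare the two approaches is to time them with timeit. The sketch below uses synthetic text rather than artical.txt, and the function names are hypothetical. Absolute timings depend on the machine and the input, so no numbers are shown.

```python
import re
import timeit

# Synthetic input (an assumption for illustration): 400 distinct two-letter
# "words", each repeated 5 times.
letters = "abcdefghijklmnopqrst"
vocab = [a + b for a in letters for b in letters]
text = " ".join(vocab * 5)

def count_with_regex(text):
    """One regex search over the whole text per distinct word."""
    return {
        w: len(re.findall(rf'(?<!\w){re.escape(w)}(?!\w)', text))
        for w in set(text.split())
    }

def count_with_list(text):
    """One list.count call per distinct word."""
    tokens = text.split()
    return {w: tokens.count(w) for w in set(tokens)}

regex_counts = count_with_regex(text)
list_counts = count_with_list(text)
print(regex_counts == list_counts)  # True

print("regex:", timeit.timeit(lambda: count_with_regex(text), number=1))
print("list :", timeit.timeit(lambda: count_with_list(text), number=1))
```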
