Python Practice Problem (1): Word-Frequency Counting and Sorting in Text

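Article link: https://blog.csdn.net/weixin_57454642/article/details/141642722

Problem

The problem comes from https://edu.csdn.net/skill/practice/python-3-5/29?language=python&materialId=17346

A production project usually has a code style guide that specifies how code in the project should be written. For personal work you can generally follow whatever style you like, but in team development you are normally expected to follow the style the team has agreed on.

Most programming languages have some well-known code styles. For example, K&R refers to Kernighan and Ritchie, the authors of The C Programming Language, the first book on the C language; K&R style is the style in which the code in that book is written.

Google publishes a collection of style guides, the Google Style Guides, which covers coding conventions for most mainstream programming languages. The following Python triple-quoted string quotes the core description from that guide: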

google_style_guide = '''
Every major open-source project has its own style guide: a set of conventions (sometimes arbitrary) about how to write code for that project. It is much easier to understand a large codebase when all the code in it is in a consistent style.

“Style” covers a lot of ground, from “use camelCase for variable names” to “never use global variables” to “never use exceptions.” This project (google/styleguide) links to the style guidelines we use for Google code. If you are modifying a project that originated at Google, you may be pointed to this page to see the style guides that apply to that project.

This project holds the C++ Style Guide, C# Style Guide, Swift Style Guide, Objective-C Style Guide, Java Style Guide, Python Style Guide, R Style Guide, Shell Style Guide, HTML/CSS Style Guide, JavaScript Style Guide, TypeScript Style Guide, AngularJS Style Guide, Common Lisp Style Guide, and Vimscript Style Guide. This project also contains cpplint, a tool to assist with style guide compliance, and google-c-style.el, an Emacs settings file for Google style.
'''

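CodeChina hosts a Chinese mirror repository, zh-google-styleguide, where you can see that the Google Python style guide is quite short: python_style_rules

In addition, the official Python style document is PEP 8. What is a PEP? PEP stands for Python Enhancement Proposal. A PEP is a design document that provides information to the Python community or describes a new feature for Python; it is also the mechanism for proposing new features, collecting community feedback on an issue, and refining the technical specification.

In real-world development, you can configure editor plugins to check code style automatically. The following Python triple-quoted string describes a set of related resources: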

python_style_guides = '''
* Python 代码风格指南',
    * [google-python-styleguide_zh_cn](https://zh-google-styleguide.readthedocs.io/en/latest/google-python-styleguide/python_style_rules /)
    * [PEP8](https://legacy.python.org/dev/peps/pep-0008/)
* 代码风格和自动完成工具链
    * 基本工具
        * [pylint](https://pylint.org/)
        * [autopep8](https://pypi.org/project/autopep8/)
    * Visual Studio Code Python 开发基本插件
        * Pylance
        * Python Path
        * Python-autopep8
'''

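Write a piece of Python code that counts the frequency of the English words in the two Python triple-quoted strings above. Requirements:

Ignore case when comparing words.
Split words using the list splits = ['\n', ' ', '-', ':', '/', '*', '_', '(', ')', '"', '”', '“',']','[',',','.','\n'].
Print the 5 most frequent words together with their counts.

The basic code skeleton is as follows: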

# -*- coding: UTF-8 -*-
def top_words(splits, text, top_n=5):
    i = 0
    word_dict = {}
    chars = []
    while i < len(text):
        c = text[i]
        if c in splits:
            # Skip over the rest of a run of separator characters
            while i+1 < len(text) and text[i+1] in splits:
                i += 1
            word = ''.join(chars).lower()

            # Count the word frequency
            # TODO(You): add your code here

            chars = []
        else:
            chars.append(c)

        i += 1

    word_list = list(word_dict.values())
    top_n = min(top_n, len(word_list))
    word_list.sort(key=lambda word_info: word_info['count'], reverse=True)
    return word_list[0:top_n]

if __name__ == '__main__':
    google_style_guide = ...
    python_style_guides = ...
    splits = [' ', '-', ':', '/', '*', '_', '(', ')', '"', '”', '“', ',', '.', '\n']

    tops = top_words(splits, google_style_guide+python_style_guides)

    print('单词排行榜')
    print('--------')
    i = 0
    while i < len(tops):
        top = tops[i]
        word = top['word']
        count = top['count']
        print(f'{i+1}. 单词：{word}, 词频：{count}')
        i += 1

The expected output is:

单词排行榜
--------
1. 单词：style, 词频：23
2. 单词：guide, 词频：16
3. 单词：to, 词频：9
4. 单词：python, 词频：9
5. 单词：project, 词频：8

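The options below are different implementations of the TODO section in the code (they are discussed as options A to D, in the same order, in the analysis further down). Can you spot the one that is implemented incorrectly?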

word_info = word_dict.get(word, {'word': word, 'count': 0})
word_info['count'] += 1
word_dict[word] = word_info

word_info = word_dict.get(word)
if word_info is None:
    word_info = {'word': word, 'count': 1}
    word_dict[word] = word_info
else:
    word_info['count'] += 1

word_info = word_dict.get(word)
if word_info is None:
    word_info = {'word': word, 'count': 0}
    word_dict[word] = word_info

word_info['count'] += 1

if not word in word_dict:
    word_info = {'word': word, 'count': 0}
    word_dict[word] = word_info

word_info['count'] += 1

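Problem Analysis

This program finds the most frequent words in a given text. The caller specifies the separator characters and gets back the N words that occur most often.

Features

1. Text splitting and word extraction: split the input text into words using the specified separators (spaces, punctuation, and so on).

2. Word-frequency counting: count how many times each word occurs in the text and store the counts in a dictionary.

3. Sorting by frequency: sort by count in descending order and return the N most frequent words.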
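
The three steps above can also be sketched with the standard library. This is an alternative to the hand-written loop in the exercise, not the approach the exercise asks for: it assumes re.split for tokenizing and collections.Counter for counting, and the helper name top_words_counter and the sample text are hypothetical.

```python
import re
from collections import Counter


def top_words_counter(splits, text, top_n=5):
    """Hypothetical helper: split, count, and rank words with the stdlib."""
    # Build a regex character class from the separator characters.
    pattern = '[' + ''.join(re.escape(ch) for ch in splits) + ']+'
    # Step 1: split on runs of separators, lower-case, drop empty strings.
    words = [w.lower() for w in re.split(pattern, text) if w]
    # Steps 2 and 3: count occurrences and take the top_n most frequent.
    return Counter(words).most_common(top_n)


sample = 'Style guide: the Python style guide, the style.'
result = top_words_counter([' ', ':', ',', '.'], sample, top_n=2)
print(result)  # [('style', 3), ('guide', 2)]
```

Note that most_common() returns (word, count) tuples, whereas the exercise's top_words returns dictionaries with 'word' and 'count' keys.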

Function Description

top_words(splits, text, top_n=5)

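Purpose: find the most frequent words in a text.

Parameters: splits: a list of characters used to split the text; every character in the list is treated as a separator.

text: the string to process.

top_n: the number of top-frequency words to return; defaults to 5.

The main Section

Calls top_words to analyze the word frequencies in the text and prints the five most frequent words to the console.

Code Walkthrough

1. Initializing variables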

i = 0
word_dict = {}
chars = []

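i is an index used to step through each character of the text.

word_dict is a dictionary that maps each word to a record holding the word and its count.

chars is a list that temporarily holds the characters of the word currently being parsed.

2. Iterating over the text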

while i < len(text):
    c = text[i]
    if c in splits:
        ...
    else:
        chars.append(c)
    i += 1

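A while loop visits each character c in text.

If the current character c is a separator (that is, it is in the splits list), the current word is complete and processing moves on to the next step. Otherwise, c is appended to the chars list and parsing of the word continues.

3. Processing a word and counting its frequency

When a separator is encountered, a word has been fully parsed and needs to be added to the dictionary with its count updated.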

if c in splits:
    while i+1 < len(text) and text[i+1] in splits:
        i += 1
    word = ''.join(chars).lower()

    # Count the word frequency
    # TODO(You): add your code here

    chars = []

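First, runs of consecutive separators are handled so that the separators between two words are treated as a single boundary.

''.join(chars).lower() joins the collected characters into one word and converts it to lowercase so that counting is case-insensitive.

The frequency-counting step is the TODO that the question asks us to judge; each option is analyzed in turn later on.

Finally, chars is reset to an empty list, ready for the next word.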
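
To see the effect of the skip loop in isolation, the sketch below copies the tokenizing part of the skeleton into a hypothetical helper, tokenize (not part of the exercise), that returns the words in order instead of counting them. It also illustrates two edge cases of the loop as written: a text that starts with a separator yields one empty word, and a trailing word with no separator after it is never emitted.

```python
def tokenize(splits, text):
    """Hypothetical helper mirroring the skeleton's loop, minus the counting."""
    words = []
    chars = []
    i = 0
    while i < len(text):
        c = text[i]
        if c in splits:
            # Skip the rest of a run of separators so it ends only one word.
            while i + 1 < len(text) and text[i + 1] in splits:
                i += 1
            words.append(''.join(chars).lower())
            chars = []
        else:
            chars.append(c)
        i += 1
    return words


splits = [' ', ',', '.']
print(tokenize(splits, 'Hello,  World. '))  # ['hello', 'world']
print(tokenize(splits, ' leading space.'))  # ['', 'leading', 'space']
print(tokenize(splits, 'no final dot'))     # ['no', 'final']
```

In the exercise both strings begin and end with a newline, so the last word is always followed by a separator, while the leading newline means an empty word is recorded once.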

4. Sorting and returning the result

word_list = list(word_dict.values())
top_n = min(top_n, len(word_list))
word_list.sort(key=lambda word_info: word_info['count'], reverse=True)
return word_list[0:top_n]

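The first line converts all the word records in word_dict into a list, word_list.

The second line makes sure the number of words returned does not exceed the number of distinct words actually found in the text.

word_list is then sorted by count in descending order, and the top_n most frequent words are returned.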
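
One detail worth knowing about this step: list.sort() is stable and a dict preserves insertion order (Python 3.7+), so words with equal counts keep the order in which they were first inserted. That is presumably why to is listed before python in the expected output even though both have a count of 9. A small sketch with made-up counts:

```python
# Made-up data: insertion order is 'to', 'style', 'python'.
word_dict = {
    'to': {'word': 'to', 'count': 9},
    'style': {'word': 'style', 'count': 23},
    'python': {'word': 'python', 'count': 9},
}

word_list = list(word_dict.values())
# reverse=True still keeps equal-count items in their original order.
word_list.sort(key=lambda word_info: word_info['count'], reverse=True)

ranked = [info['word'] for info in word_list]
print(ranked)  # ['style', 'to', 'python']
```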

5. The main program

google_style_guide = ...
python_style_guides = ...
splits = [' ', '-', ':', '/', '*', '_', '(', ')', '"', '”', '“', ',', '.', '\n']

tops = top_words(splits, google_style_guide+python_style_guides)

This sets up the target text and the separators, then calls top_words to get the list of word statistics.

print('单词排行榜')
print('--------')
i = 0
while i < len(tops):
    top = tops[i]
    word = top['word']
    count = top['count']
    print(f'{i+1}. 单词：{word}, 词频：{count}')
    i += 1

This prints the five most frequent words to the console.
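
Putting the pieces together, here is a minimal runnable sketch of the whole program. It fills the TODO with option B from the question and, because the two triple-quoted strings are elided in the skeleton, runs on a short made-up sample text with a reduced separator list; both are assumptions for illustration only.

```python
def top_words(splits, text, top_n=5):
    i = 0
    word_dict = {}
    chars = []
    while i < len(text):
        c = text[i]
        if c in splits:
            # Skip the rest of a run of separator characters.
            while i + 1 < len(text) and text[i + 1] in splits:
                i += 1
            word = ''.join(chars).lower()

            # TODO section filled in with option B.
            word_info = word_dict.get(word)
            if word_info is None:
                word_info = {'word': word, 'count': 1}
                word_dict[word] = word_info
            else:
                word_info['count'] += 1

            chars = []
        else:
            chars.append(c)
        i += 1

    word_list = list(word_dict.values())
    top_n = min(top_n, len(word_list))
    word_list.sort(key=lambda word_info: word_info['count'], reverse=True)
    return word_list[0:top_n]


# Made-up sample text (not the strings from the exercise).
sample_text = 'The style guide. The Python style guide: the style.\n'
sample_splits = [' ', ':', '.', '\n']

tops = top_words(sample_splits, sample_text, top_n=3)
for rank, top in enumerate(tops, start=1):
    print(f"{rank}. word: {top['word']}, count: {top['count']}")
```

On this sample, the and style both occur three times; the comes first because it was inserted into word_dict first.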

Option Analysis

Option A

word_info = word_dict.get(word, {'word': word, 'count': 0})
word_info['count'] += 1
word_dict[word] = word_info

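The first line looks up the current word's record in word_dict; if the word is not in the dictionary yet, get() returns a newly created record {'word': word, 'count': 0} instead.

The word's count is then incremented by 1 and the record is stored back into word_dict.

Option B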

word_info = word_dict.get(word)
if word_info is None:
    word_info = {'word': word, 'count': 1}
    word_dict[word] = word_info
else:
    word_info['count'] += 1

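Look up the current word's record in word_dict.

If the word is not in the dictionary, create a new record {'word': word, 'count': 1} and store it in word_dict, with word as the key and word_info as the value.

If it is, increment 'count' in word_info by 1; because word_info is the same object that word_dict holds, the count in word_dict is updated as well.

Option C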

word_info = word_dict.get(word)
if word_info is None:
    word_info = {'word': word, 'count': 0}
    word_dict[word] = word_info

word_info['count'] += 1

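Look up the current word's record in word_dict.

If the word is not in the dictionary, create a new record {'word': word, 'count': 0} and store it in word_dict, with word as the key and word_info as the value.

Then, in either case, increment 'count' in word_info by 1; the count in word_dict is updated as well because both names refer to the same object.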
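
As a quick check that options A, B, and C are equivalent, the sketch below wraps each one in a hypothetical function (count_a, count_b, count_c are names introduced here, not part of the exercise) that counts a plain list of words, then compares the results.

```python
def count_a(words):
    word_dict = {}
    for word in words:
        # Option A: the default record is built on every call to get().
        word_info = word_dict.get(word, {'word': word, 'count': 0})
        word_info['count'] += 1
        word_dict[word] = word_info
    return word_dict


def count_b(words):
    word_dict = {}
    for word in words:
        # Option B: new words start at 1, known words are incremented.
        word_info = word_dict.get(word)
        if word_info is None:
            word_info = {'word': word, 'count': 1}
            word_dict[word] = word_info
        else:
            word_info['count'] += 1
    return word_dict


def count_c(words):
    word_dict = {}
    for word in words:
        # Option C: new words start at 0, then every word is incremented.
        word_info = word_dict.get(word)
        if word_info is None:
            word_info = {'word': word, 'count': 0}
            word_dict[word] = word_info
        word_info['count'] += 1
    return word_dict


words = ['style', 'guide', 'style', 'python', 'style']
results = [count_a(words), count_b(words), count_c(words)]
print(results[0]['style'])                       # {'word': 'style', 'count': 3}
print(results[0] == results[1] == results[2])    # True
```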

Option D

if not word in word_dict:
    word_info = {'word': word, 'count': 0}
    word_dict[word] = word_info

word_info['count'] += 1

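This option does not count word frequencies correctly.

The reason is that word_info is assigned only inside the if branch, yet it is used after the if statement. (An if statement does not create a new scope in Python, so the variable stays visible afterwards with whatever value it last had.) Two cases arise:

1. If word appears for the first time, word_info is initialized in the if branch, and word_info['count'] += 1 then works as intended.

2. If word is already in word_dict, the if branch is skipped and word_info is not reassigned. It still refers to the record of the most recently added new word rather than the current word's record, so the wrong count is incremented.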
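
The miscount is easy to reproduce. The sketch below wraps option D in a hypothetical function, count_d (a name introduced here for illustration), and feeds it a three-word list in which the repeated word is not adjacent to its first occurrence.

```python
def count_d(words):
    word_dict = {}
    for word in words:
        if not word in word_dict:
            word_info = {'word': word, 'count': 0}
            word_dict[word] = word_info
        # word_info still refers to whatever the if branch assigned last.
        word_info['count'] += 1
    return word_dict


counts = count_d(['style', 'guide', 'style'])
print(counts['style']['count'])  # 1, although 'style' occurs twice
print(counts['guide']['count'])  # 2, although 'guide' occurs once
```

When the same word repeats back to back, option D happens to give the right answer, which is why the bug can go unnoticed on small inputs.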

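Knowledge Notes

1. Basic usage of the string method join()

In Python, join() is a string method that concatenates the elements of an iterable (such as a list, a tuple, or a string) into one string, inserting the given separator between adjacent elements.

Basic syntax

separator.join(iterable)

separator: the string used as the separator; it is inserted between adjacent elements of iterable.

iterable: an iterable of string elements (such as a list, a tuple, or a string).

Common uses

(1) Joining the elements of a list into a string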

words = ['Python', 'is', 'fun']
sentence = ' '.join(words)
print(sentence)  # Output: Python is fun

Here a space ' ' is the separator, and the words in the list are joined into a sentence.

(2) Joining the elements of a tuple

numbers = ('1', '2', '3')
result = '-'.join(numbers)
print(result)  # Output: 1-2-3

A hyphen '-' is the separator, and the digit strings in the tuple are joined into one string.

(3) Joining the characters of a string

letters = 'abc'
result = ','.join(letters)
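print(result)  # Output: a,b,c

A comma is inserted between each pair of adjacent characters.

Notes

join() requires every element of iterable to be a string; otherwise it raises a TypeError.

separator can be any string, including the empty string '' (as used in this problem), in which case the elements are concatenated directly.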
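
A short sketch of both notes, using made-up values: joining a list that contains integers raises TypeError, converting each element with str() first avoids it, and an empty separator simply concatenates the pieces.

```python
values = ['page', 3, 'of', 10]

error_message = None
try:
    ' '.join(values)
except TypeError as exc:
    # join() refuses non-string elements.
    error_message = str(exc)
    print('TypeError:', error_message)

# Convert each element to str before joining.
joined = ' '.join(str(value) for value in values)
print(joined)  # page 3 of 10

# An empty separator concatenates the pieces directly, as in the exercise.
word = ''.join(['s', 't', 'y', 'l', 'e'])
print(word)  # style
```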

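2. Basic usage of the list method append()

In Python, append() is a list method that adds one element to the end of a list. It is used very often, especially when a list has to be built up dynamically.

Basic syntax

list.append(element)

list: the target list.

element: the element to add to the end of the list.

Common uses

Add a single element to the end of a list; the elements may be of different types.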

mixed_list = [1, 2, 3]
mixed_list.append('four')
mixed_list.append([5, 6])
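print(mixed_list)  # Output: [1, 2, 3, 'four', [5, 6]]

You can append an element of any type, including a string, a number, or even another list.

Notes

append() modifies the list in place and returns None; it does not create a new list.

It adds exactly one element to the end of the list. To add several elements, use a loop or another method such as extend().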
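
The difference between append() and extend(), and the fact that append() returns None, can be seen in a few lines (the variable names are illustrative):

```python
chars = ['a', 'b']
# append() adds its argument as ONE element, even if it is a list.
chars.append(['c', 'd'])
print(chars)  # ['a', 'b', ['c', 'd']]

letters = ['a', 'b']
# extend() adds each element of the iterable separately.
letters.extend(['c', 'd'])
print(letters)  # ['a', 'b', 'c', 'd']

# append() modifies the list in place and returns None.
returned = letters.append('e')
print(returned)  # None
print(letters)   # ['a', 'b', 'c', 'd', 'e']
```

A common mistake that follows from this is writing letters = letters.append('e'), which replaces the list with None.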

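3. Basic usage of the dictionary method values()

values() is a dictionary method that returns a view object containing all the values in the dictionary. The view holds the value for each key, but not the keys themselves.

Suppose word_dict is as follows: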

word_dict = {
    'hello': {'word': 'hello', 'count': 3},
    'world': {'word': 'world', 'count': 2},
    'python': {'word': 'python', 'count': 5}
}

Calling word_dict.values() then gives a view object like this:

dict_values([
    {'word': 'hello', 'count': 3},
    {'word': 'world', 'count': 2},
    {'word': 'python', 'count': 5}
])
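
A view is not a snapshot: it reflects later changes to the dictionary, and it has no sort() method, which is why the exercise converts it with list() before sorting. A small sketch with made-up entries:

```python
word_dict = {'hello': {'word': 'hello', 'count': 3}}
values_view = word_dict.values()
size_before = len(values_view)
print(size_before)  # 1

# Adding a key later is visible through the existing view.
word_dict['world'] = {'word': 'world', 'count': 2}
size_after = len(values_view)
print(size_after)  # 2

# list() takes a snapshot that no longer tracks the dictionary.
snapshot = list(values_view)
word_dict['python'] = {'word': 'python', 'count': 5}
print(len(values_view), len(snapshot))  # 3 2

# A view cannot be sorted in place; a list can.
print(hasattr(values_view, 'sort'), hasattr(snapshot, 'sort'))  # False True
```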

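4. Basic usage of the function list()

list() is a built-in that builds a list from any iterable.

When word_dict.values() is passed to list(), the view object is converted into a real list.

So list(word_dict.values()) yields a list containing all the values in word_dict: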

word_list = [
    {'word': 'hello', 'count': 3},
    {'word': 'world', 'count': 2},
    {'word': 'python', 'count': 5}
]

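5. Basic usage of the list method sort()

sort() is a built-in list method that sorts a list in place.

Common usage and main parameters

(1) Basic sorting

By default, sort() arranges the elements of the list in ascending order (smallest to largest).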

numbers = [3, 1, 4, 1, 5, 9]
numbers.sort()
print(numbers)  # Output: [1, 1, 3, 4, 5, 9]

(2) Descending order

Pass reverse=True to sort the list in descending order.

numbers = [3, 1, 4, 1, 5, 9]
numbers.sort(reverse=True)
print(numbers)  # Output: [9, 5, 4, 3, 1, 1]

(3) Custom sorting with key

① Sorting by string length

words = ["banana", "apple", "cherry", "date"]
words.sort(key=len)
print(words)  # Output: ['date', 'apple', 'banana', 'cherry']

Here key=len means the strings are sorted by their length.

② Sorting by absolute value

numbers = [-6, 3, -2, 8, -1, 5]
numbers.sort(key=abs)
print(numbers)  # Output: [-1, -2, 3, 5, -6, 8]

With key=abs, the elements are sorted by their absolute value.

③ Sorting by a field of each record

students = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 20},
    {'name': 'Charlie', 'age': 23}
]
students.sort(key=lambda student: student['age'])
print(students)
# Output: [{'name': 'Bob', 'age': 20}, {'name': 'Charlie', 'age': 23}, {'name': 'Alice', 'age': 25}]

key=lambda student: student['age'] sorts the students by age.

(4) Combining key and reverse

key and reverse can be used together for more complex sorts, for example sorting by string length in descending order:

words = ["banana", "apple", "cherry", "date"]
words.sort(key=len, reverse=True)
print(words)  # Output: ['banana', 'cherry', 'apple', 'date']

(5) Sorting a list of tuples

# Suppose we have a list of tuples
pairs = [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]

# Sort with the default tuple comparison (first element first)
pairs.sort()
print(pairs)  # Output: [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]

# Sort by the second element of each tuple
pairs.sort(key=lambda pair: pair[1])
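print(pairs)  # Output: [(4, 'four'), (1, 'one'), (3, 'three'), (2, 'two')]

(6) Comparison with the sorted() function

sort() sorts the original list in place, so the list itself is modified.

sorted() returns a new sorted list and leaves the original unchanged.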

numbers = [3, 1, 4, 1, 5, 9]
sorted_numbers = sorted(numbers)
print(sorted_numbers)  # Output: [1, 1, 3, 4, 5, 9]
print(numbers)  # The original list is unchanged. Output: [3, 1, 4, 1, 5, 9]

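6. References and assignment with dictionary entries

In Python, a variable holds a reference to an object; assigning a dictionary (or another mutable object such as a list) to a variable does not copy it. So when you modify a value such as word_info that was obtained from another dictionary such as word_dict, the change is visible through the original dictionary. Specifically, word_info refers to the same object as word_dict[word], not to a copy, so any in-place modification made through word_info also affects word_dict[word]. Rebinding word_info to a different object, by contrast, does not change word_dict.

# Example dictionary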
word_dict = {
    'hello': {'word': 'hello', 'count': 1},
    'world': {'word': 'world', 'count': 2}
}

# Get one entry from the dictionary
word_info = word_dict.get('hello')

# Modify the entry
word_info['count'] += 1

# Inspect the original dictionary
print(word_dict)

In this example, word_info and word_dict['hello'] are the same object, so when word_info['count'] is modified, word_dict['hello']['count'] changes with it.
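
For contrast, the sketch below compares an alias with a shallow copy made with dict(); the names alias and copied are illustrative. The is operator confirms which of them shares identity with the entry stored in word_dict.

```python
word_dict = {'hello': {'word': 'hello', 'count': 1}}

alias = word_dict['hello']         # another name for the same object
copied = dict(word_dict['hello'])  # a new dictionary with the same items

alias['count'] += 1    # visible through word_dict
copied['count'] += 10  # affects only the copy

print(alias is word_dict['hello'])   # True
print(copied is word_dict['hello'])  # False
print(word_dict['hello']['count'])   # 2
print(copied['count'])               # 11
```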