[Educoder-Python] Self-Study Guide to Sets

谛凌

Published 2024-05-29 06:00:00


Article link: https://blog.csdn.net/qq_45801887/article/details/139257232

Reposting prohibited. Original: https://blog.csdn.net/qq_45801887/article/details/139257232
Reference tutorial: video walkthrough on Bilibili: https://space.bilibili.com/3546616042621301

If you find a problem in the code, please point it out.
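Self-Study Guide to Sets

Level 1: Counting the Words in a Novel

Task description
Task for this level: write a small program that counts the number of words in an English novel.

Background
To complete this level, you need to know how to:

Read a file into a string
Split a string into a list
Count the words

Reading a file into a string

Concatenating lines while iterating over the file
Iterating over a file object yields one line at a time as a string; the "+" operator joins these lines into a single string.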

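def file_to_str(file):
    """Read the file named by file into a string and convert all letters to lowercase."""
    txt = ''  # start with an empty string
    with open(file, 'r', encoding='utf-8') as fr:  # create the file object
        for row in fr:        # iterate over the file object line by line
            txt = txt + row   # append the current line, keeping its trailing newline
    return txt.lower()        # return the string with all letters lowercased


if __name__ == '__main__':
    filename = '../data/txt/Hemingway/The Old Man and the Sea.txt'
    print(file_to_str(filename))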

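Reading the whole file into one string with the file object's read() method

def file_to_str(file):
    """Read the file named by file into a string and convert all letters to lowercase."""
    with open(file, 'r', encoding='utf-8') as fr:  # create the file object
        txt = fr.read()  # read the whole file into one string
    return txt.lower()   # return the string with all letters lowercased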
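
As a quick check, the two approaches above should return the same string. The sketch below writes a small temporary file and compares the results; the file contents and the helper names `read_by_loop` and `read_by_read` are made up for illustration and are not part of the exercise.

```python
import os
import tempfile


def read_by_loop(file):
    """Concatenate the file line by line, then lowercase the result."""
    txt = ''
    with open(file, 'r', encoding='utf-8') as fr:
        for row in fr:
            txt += row
    return txt.lower()


def read_by_read(file):
    """Read the whole file at once with read(), then lowercase the result."""
    with open(file, 'r', encoding='utf-8') as fr:
        return fr.read().lower()


# Create a throwaway sample file (hypothetical content).
with tempfile.NamedTemporaryFile('w', suffix='.txt', encoding='utf-8',
                                 delete=False) as tmp:
    tmp.write('The Old Man\nand the Sea\n')
    sample_path = tmp.name

loop_result = read_by_loop(sample_path)
read_result = read_by_read(sample_path)
os.remove(sample_path)   # clean up the temporary file

print(loop_result == read_result)  # True
print(repr(read_result))           # 'the old man\nand the sea\n'
```

For a whole novel, read() is usually the simpler choice, since it avoids building the string piece by piece.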

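Splitting a string into a list
str.split(sep=None) splits the string str into a list at the separator given by sep.
This level counts the words in a novel, so first replace every punctuation character in the string with a space, then split on whitespace. The punctuation characters are available as string.punctuation. For a long string, looping over the small punctuation set and calling replace() is generally more efficient than looping over the string character by character.
Called without arguments, split() splits on whitespace and treats a run of whitespace characters as a single separator, which avoids empty strings in the result.
Example: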

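import string


def file_to_lst(txt):
    """Replace the punctuation in the string with spaces, split on whitespace, and return the list."""
    for c in string.punctuation:   # iterate over the punctuation characters
        txt = txt.replace(c, ' ')  # replace each punctuation character with a space
    words_ls = txt.split()         # split on whitespace into a list
    return words_ls                # return the list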
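
To see what this cleaning step does to real text, here is a small sketch on a made-up sentence (the helper name `text_to_words` and the sample string are illustrative assumptions). Note that a hyphenated word such as "flat-topped" becomes two words once the hyphen is replaced by a space.

```python
import string


def text_to_words(txt):
    """Replace every punctuation character with a space, then split on whitespace."""
    for c in string.punctuation:
        txt = txt.replace(c, ' ')
    return txt.split()


sample = "he said, 'flat-topped head.'  and left"
words = text_to_words(sample)
print(words)
# ['he', 'said', 'flat', 'topped', 'head', 'and', 'left']

# split(' ') keeps empty strings where spaces are adjacent; split() does not.
print('a  b'.split(' '))  # ['a', '', 'b']
print('a  b'.split())     # ['a', 'b']
```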

Counting the words
len(ls) returns the length of the list ls, that is, the number of elements in it.

s = 'But the shark came up fast with his head out and the old man hit him squarely in the center of his flat-topped head as his nose came out of water and lay against the fish.'
words_ls = s.split()
print(len(words_ls))  # 36

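This project analyzes several of Hemingway's novels. The novels are text files at "/data/bigfiles/***.txt", where the asterisks stand for the file name.

Programming requirements
Following the hints, complete the code in the editor on the right: read a novel's file name, then count and print the number of words in the novel.
Files this project may use: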
Green Hills of Africa.txt
For Whom the Bell Tolls.txt
Death in the Afternoon.txt
A Farewell to Arms.txt
The Sun Also Rises.txt
To Have and Have Not.txt
A Moveable Feast.txt
Men without Women.txt
Winner Take Nothing.txt
In Our Time.txt
The Old Man and the Sea.txt
The Torrents of Spring.txt

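Test notes (the output illustrates the format only; the data may not be real)
The platform will test the code you write:

Test input:

The Old Man and the Sea.txt

Expected output:

Now start your task. Good luck!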

Reference code

import string


def file_to_str(file):
    """Read the file named by file into a string, lowercase all letters, and return the string."""
    # add your code here
    with open(file, 'r', encoding='utf-8') as f:
        txt = f.read()   # read the whole file into one string
    return txt.lower()   # return the string with all letters lowercased


def file_to_lst(txt):
    """Replace the punctuation in txt with spaces, split on whitespace, and return the list."""
    # add your code here
    for c in string.punctuation:   # iterate over the punctuation characters
        txt = txt.replace(c, ' ')  # replace each punctuation character with a space
    words_ls = txt.split()         # split on whitespace into a list
    return words_ls                # return the list


if __name__ == '__main__':
    filename = input()                   # read the file name
    path = '/data/bigfiles/'             # directory containing the files
    text = file_to_str(path + filename)  # read the file into a string
    words_lst = file_to_lst(text)        # split the string into a list
    print(len(words_lst))                # print the length of the list

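Level 2: Counting the Distinct Words in a Novel

Task description
Task for this level: write a small program that counts the number of distinct words in an English novel.

Background
To complete this level, you need to know how to:

Remove duplicate elements

Removing duplicate elements
1. Converting a sequence to a set removes duplicates
Set elements are unique, so set(seq) converts the sequence seq into a set and drops the repeated elements. Here, pass the word list to set() and take the length of the result to get the number of distinct words.

Example: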

def no_repeat(words_ls):
    """Take a list, remove the repeated words in it, and return a set."""
    words_no_repeat = set(words_ls)  # drop the repeated words; the result is a set
    return words_no_repeat  # return the set
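
A minimal sketch of the idea on a made-up word list (the data is illustrative only): the list keeps every occurrence, while the set keeps each word once.

```python
words_ls = ['again', 'and', 'again', 'time', 'and', 'again']
unique_words = set(words_ls)      # duplicates collapse into single elements

print(len(words_ls))              # 6
print(len(unique_words))          # 3
print('time' in unique_words)     # True
```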

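This project analyzes several of Hemingway's novels. The novels are text files at "/data/bigfiles/***.txt", where the asterisks stand for the file name.

Programming requirements
Following the hints, complete the code in the editor on the right: read a novel's file name, then count and print the number of distinct words in the novel.

Test notes (the output illustrates the format only; the data may not be real)
The platform will test the code you write:

Test input:

The Old Man and the Sea.txt

Expected output:

Now start your task. Good luck!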

Reference code

import string


def file_to_str(file):
    """Read the file named by file into a string and convert all letters to lowercase."""
    # add your code here
    with open(file, 'r', encoding='utf-8') as f:  # create the file object
        txt = f.read()   # read the whole file into one string
    return txt.lower()   # return the string with all letters lowercased


def file_to_lst(txt):
    """Replace the punctuation in txt with spaces, split on whitespace, and return the list."""
    # add your code here
    for c in string.punctuation:   # iterate over the punctuation characters
        txt = txt.replace(c, ' ')  # replace each punctuation character with a space
    words_ls = txt.split()         # split on whitespace into a list
    return words_ls                # return the list


def no_repeat(words_ls):
    """Take a list, remove the repeated words in it, and return a set (order is not preserved)."""
    # add your code here
    return set(words_ls)


if __name__ == '__main__':
    filename = input()                   # read the file name
    path = '/data/bigfiles/'             # directory containing the files
    text = file_to_str(path + filename)  # read the file into a string
    words_lst = file_to_lst(text)        # split the string into a list
    print(len(no_repeat(words_lst)))     # print the size of the set

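Level 3: Removing Duplicates from a List While Preserving Order of Appearance

Task description
Task for this level: write a small program that removes the repeated words in a novel and prints the remaining words in order of first appearance.

Background
To complete this level, you need to know how to:

Produce sorted output from a set

Sorted output from a set

Sets are unordered.
sorted(s) converts a set s into a sorted list; passing a key function lets you order the elements by where they first appear in the original list.

Example: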

txt = 'repeatedly; again and again; time and again; over and over again'
txt = txt.replace(';', ' ')  # replace the punctuation in the string with spaces
words_ls = txt.split()       # split on whitespace into a list
print(words_ls)              # print the list
words_set = set(words_ls)    # convert the list to a set
print(words_set)             # print the set (element order may vary between runs)
print(sorted(words_set, key=lambda x: words_ls.index(x)))  # sort by each element's first index in the list
# ['repeatedly', 'again', 'and', 'again', 'time', 'and', 'again', 'over', 'and', 'over', 'again']
# {'and', 'time', 'over', 'repeatedly', 'again'}
# ['repeatedly', 'again', 'and', 'time', 'over']
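
One caveat with the sorting approach: words_ls.index(x) scans the list from the start for every distinct word, so the cost grows roughly quadratically on a full novel. As an alternative sketch (not the exercise's prescribed method), dict.fromkeys() keeps only the first occurrence of each key, and dictionaries preserve insertion order in Python 3.7 and later. The helper name `no_repeat_ordered` is hypothetical.

```python
def no_repeat_ordered(words_ls):
    """Drop repeated words while keeping first-occurrence order (hypothetical helper)."""
    return list(dict.fromkeys(words_ls))


txt = 'repeatedly; again and again; time and again; over and over again'
words_ls = txt.replace(';', ' ').split()

result = no_repeat_ordered(words_ls)
print(result)  # ['repeatedly', 'again', 'and', 'time', 'over']

# Same answer as sorting the set by first index, in a single pass over the list.
print(result == sorted(set(words_ls), key=words_ls.index))  # True
```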

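This project analyzes several of Hemingway's novels. The novels are text files at "/data/bigfiles/***.txt", where the asterisks stand for the file name.

Programming requirements
Following the hints, complete the code in the editor on the right: read a novel's file name and a positive integer n, and return a list of the first n distinct words, in the same order in which they appear in the novel.

Test notes
The platform will test the code you write:

Test input:

The Old Man and the Sea.txt
10

Expected output:

['a', 'distributed', 'proofreaders', 'canada', 'ebook', 'this', 'is', 'made', 'available', 'at']

Now start your task. Good luck!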

Reference code

import string


def file_to_str(file):
    """Read the file named by file into a string and convert all letters to lowercase."""
    # add your code here
    with open(file, 'r', encoding='utf-8') as f:  # create the file object
        txt = f.read()   # read the whole file into one string
    return txt.lower()   # return the string with all letters lowercased


def file_to_lst(txt):
    """Replace the punctuation in the string with spaces, split on whitespace, and return the list."""
    # add your code here
    for c in string.punctuation:   # iterate over the punctuation characters
        txt = txt.replace(c, ' ')  # replace each punctuation character with a space
    words_ls = txt.split()         # split on whitespace into a list
    return words_ls                # return the list


def no_repeat(words_ls):
    """Take a list, remove the repeated words, keep the original order of appearance, and return a list."""
    # add your code here
    words_set = set(words_ls)      # distinct words, in no particular order
    return sorted(words_set, key=lambda x: words_ls.index(x))  # order by first appearance


if __name__ == '__main__':
    filename = input()                   # read the file name
    n = int(input())                     # read a positive integer n
    path = '/data/bigfiles/'             # directory containing the files
    text = file_to_str(path + filename)  # read the file into a string
    words_lst = file_to_lst(text)        # split the string into a list
    print(no_repeat(words_lst)[:n])      # print the first n distinct words
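
Since only the first n distinct words are printed, the full list does not have to be deduplicated. The sketch below is one possible variation, not the reference solution: a generator yields each word the first time it is seen, and itertools.islice stops after n results. The name `iter_unique` and the sample words are assumptions for illustration.

```python
from itertools import islice


def iter_unique(words):
    """Yield each word the first time it is seen, in order of appearance."""
    seen = set()
    for word in words:
        if word not in seen:   # set membership tests are fast on average
            seen.add(word)
            yield word


words_ls = 'the old man and the sea and the boy'.split()

first_three = list(islice(iter_unique(words_ls), 3))  # stops after 3 distinct words
print(first_three)                 # ['the', 'old', 'man']
print(list(iter_unique(words_ls)))
# ['the', 'old', 'man', 'and', 'sea', 'boy']
```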

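Level 4: Counting the Words That Appear in Both Novels

Task description
Task for this level: write a small program that counts the words that appear in both of two novels.

Background
To complete this level, you need to know:

Set intersection

Set intersection
Get the elements that two sets have in common

s.intersection(t) or s & t: returns a new set containing the intersection of the two sets
s.intersection_update(t): updates s in place to the intersection of the two sets (the operator form is s &= t); s = s & t gives the same contents by building a new set and rebinding the name s to it

Example: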

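s = {'the', 'sail', 'was', 'patched'}
t = {'was', 'like', 'the', 'flag', 'of', 'permanent', 'defeat'}
print(s.intersection(t))  # new set holding the intersection of s and t
print(s & t)              # same as above: new set holding the intersection of s and t
s.intersection_update(t)  # update s in place to the intersection of the original s and t
print(s)                  # print s to see the result
s = {'the', 'sail', 'was', 'patched'}
t = {'was', 'like', 'the', 'flag', 'of', 'permanent', 'defeat'}
s = s & t                 # rebind s to a new set holding the intersection of the original s and t
print(s)                  # print s to see the result
# All of the statements above print {'the', 'was'} (element order may vary); any one of the four approaches will do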
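
The difference between updating in place and rebinding only shows up when a second name refers to the same set. A small sketch, with the alias names invented for illustration:

```python
t = {'was', 'like', 'the', 'flag'}

s = {'the', 'sail', 'was', 'patched'}
alias = s      # a second name for the same set object
s = s & t      # builds a new set and rebinds the name s
print(s == {'the', 'was'})                          # True
print(alias == {'the', 'sail', 'was', 'patched'})   # True: the original object is untouched

s2 = {'the', 'sail', 'was', 'patched'}
alias2 = s2
s2 &= t        # in place, same effect as s2.intersection_update(t)
print(alias2 is s2)               # True: still one object
print(alias2 == {'the', 'was'})   # True: the shared object was modified
```

For this exercise either form gives the same count; the distinction matters only when other code holds a reference to the original set.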

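This project analyzes several of Hemingway's novels. The novels are text files at "/data/bigfiles/***.txt", where the asterisks stand for the file name.

Programming requirements
Following the hints, complete the code in the editor on the right: read the file names of two novels on two separate lines, then count and print the number of words that appear in both novels. Repeated words count only once.

Test notes
The platform will test the code you write:

Test input:

The Old Man and the Sea.txt
The Torrents of Spring.txt

Expected output:

Now start your task. Good luck!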

Reference code

import string


def file_to_set(file):
    """Read the file named by file into a string and lowercase all letters.
    Replace the punctuation with spaces, split on whitespace, and return the words as a set."""
    path = '/data/bigfiles/'       # directory containing the files
    with open(path + file, 'r', encoding='utf-8') as fr:  # create the file object
        txt = fr.read().lower()    # read the whole file and lowercase it
    # add your code here
    for c in string.punctuation:
        txt = txt.replace(c, ' ')
    return set(txt.split())        # convert the word list to a set


def words_both(file1, file2):
    """Take two file names and return the set of words found in both files; each word counts once."""
    # add your code here
    return file_to_set(file1) & file_to_set(file2)


if __name__ == '__main__':
    filename1 = input()  # read the first file name
    filename2 = input()  # read the second file name
    print(len(words_both(filename1, filename2)))

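Level 5: Counting All the Words That Appear in Two Novels

Task description
Task for this level: write a small program that counts all the words that appear in either of two novels.

Background
To complete this level, you need to know:

Set union

Set union
Get all the elements of two sets

s.union(t) or s | t: returns a new set containing the union of the two sets
s.update(t): updates s in place to the union of the two sets (the operator form is s |= t); s = s | t gives the same contents by building a new set and rebinding the name s to it

Example: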

s = {'the', 'sail', 'was', 'patched'}
t = {'was', 'like', 'the', 'flag', 'of', 'permanent', 'defeat'}
print(s.union(t))  # new set holding the union of s and t
print(s | t)       # same as above: new set holding the union of s and t
s.update(t)        # update s in place to the union of the original s and t
print(s)           # print s to see the result
s = {'the', 'sail', 'was', 'patched'}
t = {'was', 'like', 'the', 'flag', 'of', 'permanent', 'defeat'}
s = s | t          # rebind s to a new set holding the union of the original s and t
print(s)           # print s to see the result
# All of the statements above print {'the', 'of', 'sail', 'permanent', 'was', 'patched', 'defeat', 'flag', 'like'} (element order may vary)
# Any one of the four approaches will do
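
One practical difference between the method and the operator: union() and update() accept any iterable, such as a word list, while the | operator requires both operands to be sets. A short sketch with made-up data:

```python
s = {'the', 'sail'}
extra_words = ['was', 'the', 'flag']   # a list, not a set

# The method form converts the iterable for us.
print(s.union(extra_words) == {'the', 'sail', 'was', 'flag'})  # True

# The operator form raises TypeError for a non-set operand.
try:
    s | extra_words
except TypeError:
    operator_failed = True
else:
    operator_failed = False
print(operator_failed)  # True
```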

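This project analyzes several of Hemingway's novels. The novels are text files at "/data/bigfiles/***.txt", where the asterisks stand for the file name.

Programming requirements
Following the hints, complete the code in the editor on the right: read the file names of two novels on two separate lines, then count and print the number of all words that appear in the two novels. Repeated words count only once.

Test notes
The platform will test the code you write:

Test input:

The Old Man and the Sea.txt
The Torrents of Spring.txt

Expected output:

Now start your task. Good luck!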

Reference code

import string


def file_to_set(file):
    """Read the file named by file into a string and lowercase all letters.
    Replace the punctuation with spaces, split on whitespace, and return the words as a set."""
    path = '/data/bigfiles/'       # directory containing the files
    with open(path + file, 'r', encoding='utf-8') as fr:  # create the file object
        txt = fr.read().lower()    # read the whole file and lowercase it
    # add your code here
    for c in string.punctuation:
        txt = txt.replace(c, ' ')
    return set(txt.split())        # convert the word list to a set


def words_all(file1, file2):
    """Take two file names and return the set of all words found in either file; each word counts once."""
    # add your code here
    return file_to_set(file1) | file_to_set(file2)


if __name__ == '__main__':
    filename1 = input()  # read the first file name
    filename2 = input()  # read the second file name
    print(len(words_all(filename1, filename2)))

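Level 6: Counting the Words That Appear Only in the First Novel

Task description
Task for this level: write a small program that counts the words that appear only in the first of two novels.

Background
To complete this level, you need to know:

Set difference

Set difference
Get the elements that are in set s but not in set t

s.difference(t) or s - t: returns a new set containing the difference of the two sets
s.difference_update(t): updates s in place to the difference of the two sets (the operator form is s -= t); s = s - t gives the same contents by building a new set and rebinding the name s to it

Example: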

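s = {'the', 'sail', 'was', 'patched'}
t = {'was', 'like', 'the', 'flag', 'of', 'permanent', 'defeat'}
print(s.difference(t))  # new set holding the difference of s and t
print(s - t)            # same as above: new set holding the difference of s and t
s.difference_update(t)  # update s in place to the difference of the original s and t
print(s)                # print s to see the result
s = {'the', 'sail', 'was', 'patched'}
t = {'was', 'like', 'the', 'flag', 'of', 'permanent', 'defeat'}
s = s - t               # rebind s to a new set holding the difference of the original s and t
print(s)                # print s to see the result
# All of the statements above print {'patched', 'sail'} (element order may vary)
# Any one of the four approaches will do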
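
Unlike intersection and union, difference depends on the order of the operands, which is why the task says "only in the first novel". A minimal sketch using the same sample sets:

```python
s = {'the', 'sail', 'was', 'patched'}
t = {'was', 'like', 'the', 'flag', 'of', 'permanent', 'defeat'}

only_s = s - t   # in s but not in t
only_t = t - s   # in t but not in s

print(only_s == {'sail', 'patched'})                               # True
print(only_t == {'like', 'flag', 'of', 'permanent', 'defeat'})     # True
print(only_s == only_t)                                            # False
```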

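This project analyzes several of Hemingway's novels. The novels are text files at "/data/bigfiles/***.txt", where the asterisks stand for the file name.

Programming requirements
Following the hints, complete the code in the editor on the right: read the file names of two novels on two separate lines, then count and print the number of words that appear in the first novel but not in the second. Repeated words count only once.

Test notes
The platform will test the code you write:

Test input:

The Old Man and the Sea.txt
The Torrents of Spring.txt

Expected output:

Now start your task. Good luck!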

Reference code

import string


def file_to_set(file):
    """Read the file named by file into a string and lowercase all letters.
    Replace the punctuation with spaces, split on whitespace, and return the words as a set."""
    path = '/data/bigfiles/'       # directory containing the files
    with open(path + file, 'r', encoding='utf-8') as fr:  # create the file object
        txt = fr.read().lower()    # read the whole file and lowercase it
    # add your code here
    for c in string.punctuation:
        txt = txt.replace(c, ' ')
    return set(txt.split())        # convert the word list to a set


def only_in_first(file1, file2):
    """Take two file names and return the set of words found in the first novel but not in the second; each word counts once."""
    # add your code here
    return file_to_set(file1) - file_to_set(file2)


if __name__ == '__main__':
    filename1 = input()  # read the first file name
    filename2 = input()  # read the second file name
    print(len(only_in_first(filename1, filename2)))

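Level 7: Counting the Words That Do Not Appear in Both Novels

Task description
Task for this level: write a small program that counts the words that appear in one of two novels but not in both.

Background
To complete this level, you need to know:

Set symmetric difference

Set symmetric difference
Get the elements that are in s or in t, but not in both

s.symmetric_difference(t) or s ^ t: returns a new set containing the symmetric difference of the two sets
s.symmetric_difference_update(t): updates s in place to the symmetric difference of the two sets (the operator form is s ^= t); s = s ^ t gives the same contents by building a new set and rebinding the name s to it

Example: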

s = {'the', 'sail', 'was', 'patched'}
t = {'was', 'like', 'the', 'flag', 'of', 'permanent', 'defeat'}
print(s.symmetric_difference(t))  # new set holding the symmetric difference of s and t
print(s ^ t)                      # same as above: new set holding the symmetric difference of s and t
s.symmetric_difference_update(t)  # update s in place to the symmetric difference of the original s and t
print(s)                          # print s to see the result
s = {'the', 'sail', 'was', 'patched'}
t = {'was', 'like', 'the', 'flag', 'of', 'permanent', 'defeat'}
s = s ^ t                         # rebind s to a new set holding the symmetric difference of the original s and t
print(s)                          # print s to see the result
# Output (element order may vary)
# {'defeat', 'of', 'patched', 'sail', 'flag', 'like', 'permanent'}
# {'defeat', 'of', 'patched', 'sail', 'flag', 'like', 'permanent'}
# {'flag', 'like', 'permanent', 'defeat', 'of', 'patched', 'sail'}
# {'defeat', 'of', 'patched', 'sail', 'flag', 'like', 'permanent'}
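
The symmetric difference can also be expressed through the operations from the earlier levels, which is a handy way to check a result. A short sketch on the same sample sets:

```python
s = {'the', 'sail', 'was', 'patched'}
t = {'was', 'like', 'the', 'flag', 'of', 'permanent', 'defeat'}

sym = s ^ t
print(sym == (s | t) - (s & t))   # True: union minus intersection
print(sym == (s - t) | (t - s))   # True: the two one-sided differences combined
print(len(sym))                   # 7
```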

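This project analyzes several of Hemingway's novels. The novels are text files at "/data/bigfiles/***.txt", where the asterisks stand for the file name.

Programming requirements
Following the hints, complete the code in the editor on the right: read the file names of two novels on two separate lines, then count and print the number of words that appear in the two novels but not in both of them. Repeated words count only once.

Test notes
The platform will test the code you write:

Test input:

The Old Man and the Sea.txt
The Torrents of Spring.txt

Expected output:

Now start your task. Good luck!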

Reference code

import string


def file_to_set(file):
    """Read the file named by file into a string and lowercase all letters.
    Replace the punctuation with spaces, split on whitespace, and return the words as a set."""
    path = '/data/bigfiles/'       # directory containing the files
    with open(path + file, 'r', encoding='utf-8') as fr:  # create the file object
        txt = fr.read().lower()    # read the whole file and lowercase it
    # add your code here
    for c in string.punctuation:
        txt = txt.replace(c, ' ')
    return set(txt.split())        # convert the word list to a set


def only_in_one(file1, file2):
    """Take two file names and return the set of words found in exactly one of the two novels; each word counts once."""
    # add your code here
    return file_to_set(file1) ^ file_to_set(file2)


if __name__ == '__main__':
    filename1 = input()  # read the first file name
    filename2 = input()  # read the second file name
    print(len(only_in_one(filename1, filename2)))
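
To tie Levels 4 to 7 together without needing the novel files, the sketch below applies all four operations to two made-up snippets of text and checks that the counts are consistent with each other. The helper `text_to_set` and both sample strings are assumptions for illustration, not part of the exercise.

```python
import string


def text_to_set(txt):
    """Lowercase, replace punctuation with spaces, split, and convert to a set."""
    txt = txt.lower()
    for c in string.punctuation:
        txt = txt.replace(c, ' ')
    return set(txt.split())


# Two made-up snippets standing in for the novel files.
a = text_to_set('The old man was thin. The sail was patched.')
b = text_to_set('The flag was old, the flag of defeat.')

both = a & b        # Level 4: in both
all_words = a | b   # Level 5: in either
only_a = a - b      # Level 6: only in the first
only_b = b - a      # the reverse difference
either = a ^ b      # Level 7: in exactly one

print(len(both), len(all_words), len(only_a), len(only_b), len(either))
# 3 10 4 3 7

# The counts fit together: union = both + the two one-sided differences.
print(len(all_words) == len(both) + len(only_a) + len(only_b))  # True
print(len(either) == len(only_a) + len(only_b))                 # True
```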