Counting the 20 most frequent words in Steve Jobs's Stanford commencement speech

zhanghongyi_cpp

Last modified 2024-05-31 16:52:01


First published 2023-05-26 23:55:37

Article link: https://blog.csdn.net/zhanghongyi_cpp/article/details/130854495


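【Problem description】The commencement address that Steve Jobs, CEO of Apple and Pixar Animation Studios, gave at Stanford University is saved in the file "Steve Jobs.txt". Write a program that finds the 20 most frequent words in this file and saves the result to "Steve Jobs.csv" in the same directory.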

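Note: the text mixes upper and lower case and contains punctuation as well as stop words of little statistical value. Remove the punctuation characters ",.;?-:'|" from the text and delete the meaningless stop words {'the', 'and', 'to', 'of', 'a', 'be', 'it', 'is', 'not', 'but', 'with', 't'}.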
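
The cleaning rules in the note can be sketched on a single made-up sentence before touching the file. The function name `clean_words` and the sample sentence are assumptions for illustration; the punctuation string and the stop-word set are copied from the problem statement.

```python
# Punctuation and stop words come from the problem statement;
# `clean_words` and the sample sentence are hypothetical.
PUNCTUATION = ",.;?-:'|"
STOP_WORDS = {'the', 'and', 'to', 'of', 'a', 'be', 'it', 'is', 'not', 'but', 'with', 't'}

# Translation table that maps every listed punctuation character to a space.
TABLE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))

def clean_words(text):
    """Lowercase the text, blank out the listed punctuation, drop stop words."""
    words = text.lower().translate(TABLE).split()
    return [w for w in words if w not in STOP_WORDS]

sample = "Stay hungry. Stay foolish; it isn't easy, but it is worth it!"
print(clean_words(sample))
# ['stay', 'hungry', 'stay', 'foolish', 'isn', 'easy', 'worth', 'it!']
```

The stray `t` left behind by "isn't" once the apostrophe becomes a space is presumably why `'t'` appears in the stop-word list. The final `it!` survives because `!` is not among the listed punctuation characters.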


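【Input format】Read the data from the .txt file

【Output format】Save the 20 most frequent words to a CSV file

【Sample input】

Input from the txt file
【Sample output】

Output to the csv file

(Image: Steve Jobs.png)

【Sample explanation】
【Grading criteria】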

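import csv

# Read the whole speech and lowercase it so that counting ignores case.
with open("Steve_Jobs.txt", "r") as z:
    x = z.read().lower()

# Keep only the letters a-z; every other character becomes a space.
s = ''
for m in x:
    if 96 < ord(m) < 123:
        s += m
    else:
        s += ' '

# Split on single spaces; the empty strings this produces are filtered out below.
s = s.split(' ')

# Count every word that is neither a stop word nor an empty string.
n = {}
for i in s:
    if i not in ('the', 'and', 'to', 'of', 'a', 'be', 'it', 'is', 'not', 'but', 'with', 't', ''):
        if i in n:
            n[i] += 1
        else:
            n[i] = 1

# Sort by frequency, highest first, and keep the top 20.
l = sorted(n.items(), key=lambda x: x[1], reverse=True)
l = l[:20]

# The header row is Chinese for "word", "frequency"; gbk can encode it.
with open('Steve_Jobs.csv', 'w', encoding='gbk', newline='') as f1:
    w = csv.writer(f1)
    w.writerow(['单词', '词频'])
    w.writerows(l)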

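(The above is Solution 1, which the teacher's checker accepts as correct.) (Update 2024/5/31: as long as the file names above match the ones the problem asks for when you submit, this now gets AC normally.)
(Below is Solution 2, which is easier to follow but which the system marks wrong for some Schrödinger-like, unpredictable reason.)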

import csv

def r(arr):
    # Insertion sort, descending by the first element of each tuple.
    n = len(arr)
    for i in range(1, n):
        j = i
        while j > 0 and arr[j - 1][0] < arr[j][0]:
            arr[j], arr[j - 1] = arr[j - 1], arr[j]
            j -= 1
    return arr


stopwords = ['the', 'and', 'to', 'of', 'a', 'be', 'it', 'is', 'not', 'but', 'with', 't']
stopcharacters = [',', '.', ';', '?', '-', ':', '\'', '\\', '|']

# Read the speech; the with block closes the file afterwards.
with open('Steve_Jobs.txt', 'r', encoding='gbk') as f:
    text = f.read()

text = text.lower()
# Replace each listed punctuation character with a space.
for a in stopcharacters:
    text = text.replace(a, ' ')
words = text.split()
# Remove every occurrence of every stop word.
for b in stopwords:
    while b in words:
        words.remove(b)
# Count the remaining words.
c = {}
for d in words:
    if d in c:
        c[d] += 1
    else:
        c[d] = 1

# Sort (count, word) pairs by count, then flip them back to (word, count).
pairs = [(v, k) for k, v in c.items()]
pairs = r(pairs)
ranked = [(v, k) for k, v in pairs]
top20_words = ranked[0:20]

# The header row is Chinese for "word", "frequency".
with open('Steve_Jobs.csv', 'w', newline='', encoding='gbk') as out:
    writer = csv.writer(out)
    writer.writerow(['单词', '词频'])
    for word, count in top20_words:
        writer.writerow([word, count])
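
Solutions 1 and 2 do not tokenize the text the same way: Solution 1 turns every character outside a-z into a space, while Solution 2 only replaces the listed punctuation. The sketch below compares the two on a made-up sentence. The function names and the sentence are hypothetical, and whether this difference is what the judge reacts to is only a guess.

```python
def letters_only(text):
    """Solution 1 style: anything that is not a-z becomes a space."""
    lowered = text.lower()
    return ''.join(ch if 'a' <= ch <= 'z' else ' ' for ch in lowered).split()

def listed_punctuation_only(text):
    """Solution 2 style: only the listed characters become spaces."""
    lowered = text.lower()
    for ch in [',', '.', ';', '?', '-', ':', "'", '\\', '|']:
        lowered = lowered.replace(ch, ' ')
    return lowered.split()

# Hypothetical sentence with digits, parentheses, quotes and an exclamation mark.
sample = 'I was 17 (lucky!) and said "no".'
print(letters_only(sample))
# ['i', 'was', 'lucky', 'and', 'said', 'no']
print(listed_punctuation_only(sample))
# ['i', 'was', '17', '(lucky!)', 'and', 'said', '"no"']
```

Digits, brackets and quotation marks stay attached to words in the second tokenizer, so the same word can be counted under several spellings.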

(The teacher's reference answer)

# Find the 20 most frequent words in Steve Jobs's speech
import csv
def getText(text):
    text = text.lower()
    for ch in ",.;?-:\'|":
        text = text.replace(ch, " ")
    return text

# Count how often each word occurs.
# text is the text to analyse; topn is how many of the most frequent words to return.
def wordFreq(text, topn):
    words = text.split()    # split the text into words
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    excludes = {'the','and','to','of','a','be','it','is','not','but','with','t'}
    for word in excludes:
        counts.pop(word, None)    # pop with a default avoids a KeyError if a stop word never occurs
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    return items[:topn]

# Main program: call the functions
try:
    with open("Steve_Jobs.txt", 'r') as file:
        text = file.read()
        text = getText(text)
        freqs = wordFreq(text, 20)
except IOError:
    print("File not found, please check!\n")
else:
    try:
        with open("Steve_Jobs.csv", 'w', encoding="gbk", newline='') as fileFreq:
            writer = csv.writer(fileFreq)
            writer.writerow(['单词', "词频"])    # header row: "word", "frequency"
            writer.writerows(freqs)
    except IOError:
        print("Error writing file")
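
The counting, sorting and CSV writing done by hand above can also be expressed with `collections.Counter` from the standard library. This is a sketch of an alternative, not part of the teacher's answer; the function name `top_n_csv` and the sample word list are hypothetical, and the CSV is written to an in-memory buffer instead of a file so the result can be inspected directly.

```python
import csv
import io
from collections import Counter

def top_n_csv(words, topn=20):
    """Return CSV text listing the topn most frequent words in `words`."""
    # most_common sorts by count, highest first; ties keep first-seen order.
    rows = Counter(words).most_common(topn)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['单词', '词频'])  # same header as the solutions above
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical, already cleaned word list.
sample_words = "stay hungry stay foolish stay".split()
print(top_n_csv(sample_words, topn=2))
```

To produce the file the problem asks for, the returned text could be written out with `open('Steve_Jobs.csv', 'w', encoding='gbk', newline='')`, as in the solutions above.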