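【Problem Description】The commencement address that Steve Jobs, CEO of Apple and of Pixar Animation Studios, gave at Stanford University is saved in the file "Steve Jobs.txt". Write a program that finds the 20 most frequent words in this file and saves the result to "Steve Jobs.csv" in the same directory.
Note: the text mixes upper and lower case and contains punctuation as well as stop words of little statistical value. Remove the punctuation marks ",.;?-:'|" from the text and delete the meaningless stop words {'the', 'and', 'to', 'of', 'a', 'be', 'it', 'is', 'not', 'but', 'with', 't'}.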
(Attachment: Steve_Jobs.txt.txt)
【Input Format】Read the data from the .txt file
【Output Format】Save the 20 most frequent words to a csv file
【Sample Input】
txt file input
【Sample Output】
csv file output
(Image: Steve Jobs.png)
【Sample Explanation】
【Grading Criteria】
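import csv

# Read the whole speech and convert it to lowercase.
z = open("Steve_Jobs.txt", "r")
x = z.read().lower()
z.close()
# Keep only the letters a-z; every other character becomes a space.
s = ''
for m in x:
    if 96 < ord(m) < 123:
        s += m
    else:
        s += ' '
s = s.split(' ')
# Count every word that is neither a stop word nor an empty string.
n = {}
for i in s:
    if i not in ('the', 'and', 'to', 'of', 'a', 'be', 'it', 'is', 'not', 'but', 'with', 't', ''):
        if i in n:
            n[i] += 1
        else:
            n[i] = 1
# Sort by count in descending order and keep the first 20 entries.
l = sorted(n.items(), key=lambda x: x[1], reverse=True)
l = l[:20]
# The header row means "word", "frequency"; gbk is needed to encode it.
with open('Steve_Jobs.csv', 'w', encoding='gbk', newline='') as f1:
    w = csv.writer(f1)
    w.writerow(['单词', '词频'])
    w.writerows(l)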
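(The above is Solution 1, which the teacher's checker accepts as correct.) (Update 2024/5/31: as long as the file names above are changed to match the problem statement when submitting, it now passes (AC) normally.)
(Below is Solution 2, which is easier to follow but which the judging system rejects for Schrödinger-style, that is, unpredictable, reasons.)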
import csv

# Insertion sort, descending by the first element of each tuple (the count).
def r(arr):
    n = len(arr)
    for i in range(1, n):
        j = i
        while j > 0 and arr[j - 1][0] < arr[j][0]:
            arr[j], arr[j - 1] = arr[j - 1], arr[j]
            j -= 1
    return arr

stopwords = ['the', 'and', 'to', 'of', 'a', 'be', 'it', 'is', 'not', 'but', 'with', 't']
stopcharacters = [',', '.', ';', '?', '-', ':', '\'', '\\', '|']
with open('Steve_Jobs.txt', 'r', encoding='gbk') as f:
    text = f.read()
text = text.lower()
# Replace each listed punctuation mark with a space, then split on whitespace.
for a in stopcharacters:
    text = text.replace(a, ' ')
words = text.split()
# Remove every occurrence of every stop word.
for b in stopwords:
    while b in words:
        words.remove(b)
# Count the remaining words.
c = {}
for d in words:
    if d in c:
        c[d] += 1
    else:
        c[d] = 1
# Sort (count, word) pairs, then flip them back to (word, count).
pairs = [(v, k) for k, v in c.items()]
ranked = r(pairs)
top20_words = [(word, count) for count, word in ranked][0:20]
# The header row means "word", "frequency"; gbk is needed to encode it.
with open('Steve_Jobs.csv', 'w', newline='', encoding='gbk') as f:
    writer = csv.writer(f)
    writer.writerow(['单词', '词频'])
    for word, count in top20_words:
        writer.writerow([word, count])
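It is not confirmed why the judge rejects Solution 2. One difference that can be read directly from the code is tokenization: Solution 1 turns every character outside a-z into a space, while Solution 2 replaces only the listed punctuation marks, so digits, quotation marks, parentheses and exclamation marks stay attached to words. The sketch below reproduces both strategies on a made-up sample string; the function names and the sample are assumptions for illustration and are not part of either solution.

```python
def letters_only_tokens(text):
    """Solution 1 style: lowercase, then turn every non a-z character into a space."""
    cleaned = ''.join(ch if 'a' <= ch <= 'z' else ' ' for ch in text.lower())
    return cleaned.split()


def listed_punctuation_tokens(text, punctuation=",.;?-:'\\|"):
    """Solution 2 style: lowercase, then replace only the listed punctuation marks."""
    text = text.lower()
    for ch in punctuation:
        text = text.replace(ch, ' ')
    return text.split()


# Hypothetical sample text, chosen to contain characters outside the listed set.
sample = 'Stay Hungry. Stay Foolish! (It was 2005.)'
print(letters_only_tokens(sample))
print(listed_punctuation_tokens(sample))
```

On this sample the second function keeps tokens such as 'foolish!' and '(it', which the first one never produces. Whether this explains the judge's verdict depends on the actual speech file and on the checker, neither of which is shown here.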
(The teacher's reference answer)
# Find the 20 most frequent words in Steve Jobs's speech
import csv

def getText(text):
    # Lowercase the text and replace the listed punctuation marks with spaces.
    text = text.lower()
    for ch in ",.;?-:\'|":
        text = text.replace(ch, " ")
    return text

# Count word frequencies.
# text is the text to analyze; topn is how many of the most frequent words to return.
def wordFreq(text, topn):
    words = text.split()  # split the text into words
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    excludes = {'the', 'and', 'to', 'of', 'a', 'be', 'it', 'is', 'not', 'but', 'with', 't'}
    for word in excludes:
        counts.pop(word, None)  # pop() avoids a KeyError if a stop word never occurs
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    return items[:topn]

# Main program: call the functions above
try:
    with open("Steve_Jobs.txt", 'r') as file:
        text = file.read()
    text = getText(text)
    freqs = wordFreq(text, 20)
except IOError:
    print("File not found, please check!\n")
else:
    try:
        # The header row means "word", "frequency"; gbk is needed to encode it.
        with open("Steve_Jobs.csv", 'w', encoding="gbk", newline='') as fileFreq:
            writer = csv.writer(fileFreq)
            writer.writerow(['单词', "词频"])
            writer.writerows(freqs)
    except IOError:
        print("Error writing the file")
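For comparison, the same counting step can be sketched more compactly with the standard library's collections.Counter and str.translate. This is an alternative sketch, not one of the submitted solutions; the names top_words, PUNCTUATION and STOPWORDS and the sample sentence are assumptions made for illustration, and it has not been checked against the judge.

```python
from collections import Counter

PUNCTUATION = ",.;?-:'|"  # the marks listed in the problem statement
STOPWORDS = {'the', 'and', 'to', 'of', 'a', 'be', 'it', 'is', 'not', 'but', 'with', 't'}


def top_words(text, topn=20):
    """Return the topn most frequent (word, count) pairs, ignoring stop words."""
    # Map every listed punctuation mark to a space in a single pass.
    table = str.maketrans(PUNCTUATION, ' ' * len(PUNCTUATION))
    words = text.lower().translate(table).split()
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(topn)


# Hypothetical sample sentence, not taken from the speech file.
sample = "The dots connect; the dots, the dots and a path."
print(top_words(sample, 2))
```

The returned list of pairs can be passed directly to csv.writer's writerows, as in the solutions above. Filtering stop words before counting also avoids having to delete keys that may not exist.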