Web Crawler Final Project

Latest recommended article published on 2024-07-26 10:45:15

weixin_34143774

Tags: crawler, python

Original link: http://www.cnblogs.com/mimimi/p/8932989.html

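1. Choose a topic or website that interests you. (No two students may pick the same one.)

2. Write a crawler in Python to collect data on that topic from the web.

3. Run text analysis on the crawled data and generate a word cloud.

4. Explain and interpret the results of the text analysis.

5. Write a complete blog post describing the implementation process, the problems encountered and how they were solved, and the reasoning and conclusions behind the data analysis.

6. Finally, submit all crawled data together with the crawler and data-analysis source code.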


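# Crawling and HTML parsing
import requests
from bs4 import BeautifulSoup
# Keyword extraction for Chinese text
import jieba.analyse
# Image loading, plotting, and word cloud rendering
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator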


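# Fetch the news article page and parse it.
url = "https://item.btime.com/36i90hfhkt3838be1gof3cla1ka?from=haozcxw"
res = requests.get(url, timeout=10)
res.raise_for_status()  # stop early on an HTTP error status
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')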
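
The fetch step can also be wrapped in a small reusable function. The following is only a sketch: `fetch_html`, the `User-Agent` string, and the 10-second timeout are assumptions for illustration and do not come from the original post.

```python
import requests

# Hypothetical helper; the header value and the timeout are illustrative assumptions.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; coursework-crawler)"}


def fetch_html(url, session=None, timeout=10):
    """Return the decoded HTML of `url`, raising on HTTP error statuses."""
    session = session or requests.Session()
    res = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    res.raise_for_status()   # raises requests.HTTPError on 4xx/5xx responses
    res.encoding = "utf-8"   # same explicit encoding as the script
    return res.text
```

Accepting an optional `session` makes it possible to reuse one connection across several pages, or to pass in a stand-in object when trying the function without network access.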


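# Extract the title, body text, and editor line via CSS class selectors.
title = soup.select('.title')[0].text
content = soup.select('.content-text')[0].text
info = soup.select('.edit-info')[0].text
# The editor's name is the first token after the "责任编辑：" label.
au = info.split('责任编辑：', 1)[-1].split()[0]
print(title, content, au)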
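
One detail worth knowing when pulling the editor's name out of the info line: `str.lstrip` removes any leading characters that belong to the given set, not a literal prefix, so it can eat the start of a name. Here is a minimal sketch of a safer extraction; `extract_editor` and the sample string are made up for illustration.

```python
def extract_editor(info, label="责任编辑："):
    """Return the first whitespace-delimited token after `label`, or '' if absent."""
    _, found, tail = info.partition(label)
    if not found:
        return ""
    tokens = tail.split()
    return tokens[0] if tokens else ""


sample = "来源：示例网 责任编辑：责小明 2018-04-24"  # made-up sample text
print(extract_editor(sample))                     # → 责小明
print("责任编辑：责小明".lstrip("责任编辑："))      # → 小明 (the leading 责 is lost)
```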

# Append the article text to a local file for the analysis step.
with open('content.txt', 'a', encoding='utf-8') as f:
    f.write(content)

# Replace common punctuation with spaces, carrying the result across iterations.
strl = '''，。、‘’ '''
ls = content
for i in strl:
    ls = ls.replace(i, " ")
print(ls)
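
The same punctuation cleanup can be done in a single pass with `str.translate`. This is a sketch under assumptions: the `PUNCTUATION` set below is a guess at what the crawled text contains and would need to be adjusted, and `strip_punctuation` is a hypothetical name.

```python
# The punctuation set is an assumption; extend it to match the crawled text.
PUNCTUATION = "，。、‘’“”！？；：（）"
_TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def strip_punctuation(text):
    """Replace each listed punctuation mark with a space in a single pass."""
    return text.translate(_TO_SPACE)


print(strip_punctuation("你好，世界。"))
```

A translation table avoids creating one intermediate string per punctuation mark, which matters little for one article but adds up when many pages are crawled.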


# Read the saved text back in a single call.
with open('content.txt', 'r', encoding='utf-8') as f:
    lyric = f.read()


# Rank the top 50 keywords with TextRank; each result is a (word, weight) pair.
result = jieba.analyse.textrank(lyric, topK=50, withWeight=True)
keywords = dict(result)
print(keywords)
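
To make the shape of `keywords` concrete, here is a sketch that builds a similar word-to-weight dictionary from a plain frequency count. Note that this is not TextRank, just a simpler stand-in for illustration; `frequency_weights`, its parameters, and the token list are all hypothetical.

```python
from collections import Counter


def frequency_weights(tokens, top_k=50, min_len=2):
    """Map the top_k most common tokens to weights scaled so the maximum is 1.0."""
    # Skip very short tokens, which are mostly particles in Chinese text.
    counts = Counter(t for t in tokens if len(t) >= min_len)
    top = counts.most_common(top_k)
    if not top:
        return {}
    max_count = top[0][1]
    return {word: count / max_count for word, count in top}


tokens = ["数据", "分析", "数据", "的", "爬虫", "数据", "分析"]  # made-up tokens
print(frequency_weights(tokens))
```

A dictionary of this form can be passed to `generate_from_frequencies` in the same way as the TextRank result.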

# Use the image as both the word cloud's shape mask and its color source.
image = Image.open('t01c9f26bac34842d0d.jpg')
graph = np.array(image)
wc = WordCloud(font_path='./fonts/simhei.ttf', background_color='White', max_words=50, mask=graph)
wc.generate_from_frequencies(keywords)
image_color = ImageColorGenerator(graph)
# Recolor the words from the image, then display and save the result.
plt.imshow(wc.recolor(color_func=image_color))
plt.axis("off")
plt.show()
wc.to_file('d.jpg')
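
A note on the mask: `WordCloud` treats pure white pixels as masked out and draws words everywhere else. A JPEG background that looks white may contain slightly off-white pixels, so it can help to snap near-white pixels to pure white first. This is a minimal sketch assuming an RGB image; `whiten_background` and the threshold of 240 are illustrative choices, not part of the original script.

```python
import numpy as np


def whiten_background(rgb, threshold=240):
    """Return a copy of an H x W x 3 uint8 array with near-white pixels set to pure white."""
    out = rgb.copy()
    near_white = (rgb >= threshold).all(axis=-1)  # True where every channel is bright
    out[near_white] = 255
    return out


demo = np.array([[[250, 248, 252], [10, 20, 30]]], dtype=np.uint8)  # made-up 1 x 2 image
print(whiten_background(demo))
```

The result could then be passed as `mask=` in place of the raw image array.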

Reposted from: https://www.cnblogs.com/mimimi/p/8932989.html
