I've been listening to the song 《认真的老去》 a lot lately, so let's scrape its Douban comments as a dataset and run a clustering analysis on them.
1. Getting the data
The scrape yields 140 comments~~
Here's the code~~
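import requests
from bs4 import BeautifulSoup

# Hot-comments URL; {} is filled in with the page number
url = 'https://music.douban.com/subject/26979930/comments/hot?p={}'
start_page = 1
end_page = 7
data = []
while start_page <= end_page:
    # Fetch one page of comments and parse it with the built-in HTML parser
    html = BeautifulSoup(requests.get(url=url.format(start_page)).text, 'html.parser')
    # The text of each comment sits in a <span class="short">
    data += [content.text for content in html.find_all('span', {'class': 'short'})]
    start_page += 1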
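Now let's start the clustering analysis.
- The text is segmented into words with jieba,
- then passed to CountVectorizer, which counts term frequencies,
- and then passed to TfidfTransformer, which computes the tf-idf values,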