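I have been learning web scraping recently. After picking up regular expressions I could scrape simple page content, but I wanted to do something more fun, so I decided to try scraping the danmaku (bullet comments) from a certain video site. After reading code from other people online and combining their style with my own understanding, I ended up with the following code: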
import re
import requests

def getHtml(url):
    headers = {
        # Send a browser-like User-Agent to get past basic anti-scraping checks
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'
    }
    html = requests.get(url, headers=headers)
    html.encoding = 'utf-8'
    return html.text

def getcid(html):  # take the page HTML directly and extract the cids
    reg = r'{"cid":(.*?),"page"'  # regular expression that captures each cid
    reg = re.compile(reg, re.S)
    cidlist = reg.findall(html)
    print('Found %d videos' % len(cidlist))
    return cidlist

def getComments(cidlist):
    comments_list = []
    n = 0  # number of videos processed
    length = 0  # total number of comments
    for cid in cidlist:
        comments_url = "https://comment.bilibili.com/" + str(cid) + ".xml"
        # print(comments_url)
        comments_html = requests.get(comments_url)
        comments_html.encoding = 'utf-8'
        comments_html = comments_html.text
        reg = r'">(.*?)</d>'  # the text of each <d> element is one comment
        reg = re.compile(reg)
        comments = reg.findall(comments_html)
        comments_list.append(comments)
        n += 1
        length += len(comments)
        print('>>>> Fetched danmaku for %d videos <<<<' % n)
    print('Finished reading danmaku, %d comments in total' % length)
    return comments_list

url = 'https://www.bilibili.com/video/BV1Bq4y127iz?p=2'  # this page's source contains the cid of every video
html = getHtml(url)
cidlist = getcid(html)
comments = getComments(cidlist)
# Save as a txt file in preparation for the word cloud; write UTF-8 because the file is read back as UTF-8 later
with open("comments.txt", 'w', encoding='utf-8') as f:
    for i in comments:
        f.write(str(i) + '\n')
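The regex in getComments returns the raw text between the tags, so XML entities such as `&amp;` stay escaped. As a possible alternative, here is a minimal sketch that parses the same kind of document with the standard library's xml.etree.ElementTree. The helper name parse_danmaku and the sample XML are made up for illustration; the sample only assumes that each comment is the text of a `<d>` element, as the regex above implies.

```python
import xml.etree.ElementTree as ET

def parse_danmaku(xml_text):
    """Return the text of every non-empty <d> element in the document."""
    root = ET.fromstring(xml_text)
    # iter('d') walks every <d> element; empty elements have text None
    return [d.text for d in root.iter('d') if d.text]

# Made-up sample, not real data from the site
SAMPLE_XML = (
    '<i>'
    '<d p="1.0,1,25,16777215">hello</d>'
    '<d p="2.5,1,25,16777215">a &amp; b</d>'
    '<d p="3.0,1,25,16777215"></d>'
    '</i>'
)

sample_comments = parse_danmaku(SAMPLE_XML)
print(sample_comments)
```

In getComments, a function like this could stand in for the `reg.findall(comments_html)` call; the rest of the loop would stay the same.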
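So we only need to change the url to scrape the danmaku for every episode on a page. Here I use a dating variety show that has been very popular lately as the example:
You can open the page source and have a look. Any page whose source contains the cid of each episode, like this one, will work.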
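To see what the pattern in getcid picks up, here is a small sketch run against a made-up fragment shaped the way the pattern expects. The fragment and the names SAMPLE_SOURCE, CID_PATTERN and extract_cids are hypothetical; the real page source is much longer and its exact layout is not shown here. As a variation on the original, this version captures digits only with `\d+` and drops repeated cids in case the same one appears more than once.

```python
import re

# Same idea as the pattern in getcid, but restricted to digits
CID_PATTERN = re.compile(r'\{"cid":(\d+),"page"')

def extract_cids(source):
    """Return cids in order of first appearance, without duplicates."""
    seen = []
    for cid in CID_PATTERN.findall(source):
        if cid not in seen:
            seen.append(cid)
    return seen

# Made-up fragment, not copied from a real page
SAMPLE_SOURCE = (
    '"pages":[{"cid":111,"page":1,"part":"ep1"},'
    '{"cid":222,"page":2,"part":"ep2"}],'
    '"other":[{"cid":111,"page":1}]'
)

print(extract_cids(SAMPLE_SOURCE))
```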
Run the code:
You can see that it scraped 8 videos and nearly 24,000 danmaku. Let's take a look:
These are indeed danmaku. Now let's make the word cloud:
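You first need to install two libraries, jieba and wordcloud. I use Anaconda, but a plain conda install does not work for these two. I found a simple command online that installs both from the conda-forge channel without changing the mirror source.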
conda install -c conda-forge jieba
conda install -c conda-forge wordcloud
Once they are installed:
import jieba
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from PIL import Image
import numpy as np

with open('comments.txt', 'r', encoding="UTF-8") as file1:
    content = "".join(file1.readlines())
# Segment the text with jieba and join the tokens with spaces so WordCloud can tell them apart
content_after = " ".join(jieba.cut(content, cut_all=True))
images = Image.open("7.png")  # this image is the mask, i.e. the shape of the word cloud; a high-contrast image works best
maskImages = np.array(images)
# Prepare a ttf font file beforehand; download one or look for one on your computer
wc = WordCloud(font_path="SIMYOU.TTF", background_color="white", max_words=100, max_font_size=500, width=1000, height=1000, mask=maskImages).generate(content_after)
plt.imshow(wc)
plt.axis("off")  # hide the axes around the image
plt.show()
wc.to_file('wolfcodeTarget2.png')  # export the image
Final result:
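The script above lets WordCloud do the counting itself. If you want to check which words dominate before drawing anything, one option is to count the segmented tokens yourself. Below is a minimal sketch using only the standard library; count_words, its min_length and stopwords parameters, and the hand-written token list are all assumptions for illustration, with the list standing in for the output of jieba.cut.

```python
from collections import Counter

def count_words(tokens, min_length=2, stopwords=()):
    """Count tokens, ignoring whitespace, short tokens and stopwords."""
    cleaned = (t.strip() for t in tokens)
    return Counter(
        t for t in cleaned
        if len(t) >= min_length and t not in stopwords
    )

# Hand-written tokens standing in for the output of jieba.cut(content)
sample_tokens = ["哈哈", "好看", "哈哈", "的", " ", "好看", "哈哈", "[", "'"]
frequencies = count_words(sample_tokens)
print(frequencies.most_common(2))
```

Single-character tokens, including the brackets and quotes left over from saving each list with str(), are dropped by the length filter. A mapping like this can then be passed to WordCloud's generate_from_frequencies method instead of calling generate on the raw text.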