<Python Web Scraping Study 1> Scraping Danmaku from a Certain Video Site and Making a Word Cloud

Latest recommended article published 2022-03-28 17:28:55

苦涩柠檬香

Categories: python, web scraping, artificial intelligence. Tags: python, web scraping, data mining

Article link: https://blog.csdn.net/qq_45633093/article/details/122281102

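I have been learning web scraping recently. After learning regular expressions I could already scrape some simple page content, but I wanted to do something more interesting, so I decided to try scraping the danmaku (bullet comments) from a certain video site. After reading code by more experienced people online and combining their style with my own understanding, I ended up with the following code: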

import re
import requests

def getHtml(url):
    # Send a browser-like User-Agent so the request is less likely to be blocked
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'
    }
    html = requests.get(url, headers=headers)
    html.encoding = 'utf-8'
    text = html.text  # read the body before releasing the connection
    html.close()
    return text

def getcid(html):  # takes the page HTML and returns the cid of every video
    reg = r'{"cid":(.*?),"page"'  # regular expression that captures each cid
    reg = re.compile(reg, re.S)
    cidlist = reg.findall(html)
    print('Found %d videos' % len(cidlist))
    return cidlist
    
def getComments(cidlist):
    comments_list = []  # one list of danmaku per video
    n = 0  # number of videos processed
    length = 0  # total number of danmaku
    for cid in cidlist:
        comments_url = "https://comment.bilibili.com/" + str(cid) + ".xml"
        # print(comments_url)
        comments_html = requests.get(comments_url)
        comments_html.encoding = 'utf-8'
        comments_html = comments_html.text
        reg = r'">(.*?)</d>'  # the text between the end of the opening <d ...> tag and </d>
        reg = re.compile(reg)
        comments = reg.findall(comments_html)
        comments_list.append(comments)
        n += 1
        length += len(comments)
        print('>>>> Fetched danmaku for %d videos <<<<' % n)
    print('Finished reading danmaku, %d in total' % length)
    return comments_list
        
    

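url = 'https://www.bilibili.com/video/BV1Bq4y127iz?p=2'  # the source of this page contains the cid of every video
html = getHtml(url)
cidlist = getcid(html)
comments = getComments(cidlist)

# Save as a UTF-8 text file, one danmaku per line, in preparation for the word cloud
with open("comments.txt", 'w', encoding='utf-8') as f:
    for video_comments in comments:
        for comment in video_comments:
            f.write(comment + '\n')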
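
The regex above grabs whatever sits between `">` and `</d>`, so XML entities such as `&amp;` stay escaped in the output. As a possible alternative, here is a minimal sketch that parses the same kind of document with the standard library's `xml.etree.ElementTree`. The sample XML, including the placeholder `p` attribute values, and the helper name `parse_danmaku` are assumptions for illustration, not the site's documented format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the danmaku file: one <d> element per comment.
# The "p" attribute values are placeholders, not real data.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<i>
  <d p="0">first comment</d>
  <d p="0">a &amp; b</d>
  <d p="0"></d>
</i>"""


def parse_danmaku(xml_text):
    """Return the non-empty text of every <d> element (hypothetical helper)."""
    # Encode to bytes so the parser honours the encoding in the XML declaration.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [d.text for d in root.iter("d") if d.text]


print(parse_danmaku(SAMPLE_XML))  # ['first comment', 'a & b']
```

With a parser the entities are decoded and empty comments are skipped, at the cost of failing loudly if the response is not well-formed XML.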

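So to scrape the danmaku of every episode on a page, we only need to change the url. Here I use a dating variety show that has been very popular recently as the example.
You can open the page source and take a look: any page whose source contains the cid of each episode, like this one, will work.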
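
To make it concrete what "contains the cid of each episode" means, here is a small sketch that runs a cid regex over a made-up fragment. The fragment and the name `extract_cids` are hypothetical; the real page source may be laid out differently. The sketch also uses `\d+` rather than `.*?` for the capture group, on the assumption that a cid is always numeric.

```python
import re

# Hypothetical fragment shaped like what the regex expects; not copied from a real page.
SAMPLE_SOURCE = '[{"cid":111,"page":1,"part":"ep1"},{"cid":222,"page":2,"part":"ep2"}]'

# Capture the digits between '{"cid":' and ',"page"'.
CID_PATTERN = re.compile(r'\{"cid":(\d+),"page"')


def extract_cids(source):
    """Return every cid found in the page source, as strings."""
    return CID_PATTERN.findall(source)


print(extract_cids(SAMPLE_SOURCE))  # ['111', '222']
```

If this returns an empty list for a page, that page's source does not embed the cids in this form and the script will find no videos.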
Run the code:

As you can see, it scraped 8 videos and nearly 24,000 danmaku. Let's take a look:

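They are indeed danmaku, so now we can make the word cloud.
First install the two libraries jieba and wordcloud. I use Anaconda, but a plain conda install could not find these two packages. I found a simple command online that installs them from the conda-forge channel without changing the mirror source.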

conda install -c conda-forge jieba
conda install -c conda-forge wordcloud

Once they are installed:

import jieba
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from PIL import Image
import numpy as np

with open('comments.txt', 'r', encoding="UTF-8") as file1:
    content = file1.read()
# Segment the text with jieba and join the words with spaces so WordCloud can tell them apart
content_after = " ".join(jieba.cut(content, cut_all=True))

images = Image.open("7.png")  # this image is the mask, i.e. the shape of the word cloud; a high-contrast image works best
maskImages = np.array(images)

# Prepare a TTF font file in advance; download one or look for one on your computer
wc = WordCloud(font_path="SIMYOU.TTF", background_color="white", max_words=100, max_font_size=500, width=1000, height=1000, mask=maskImages).generate(content_after)
plt.imshow(wc)
plt.axis("off")
plt.show()

wc.to_file('wolfcodeTarget2.png')  # export the image
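
Word clouds built from raw segmentation output tend to be dominated by single-character particles and whitespace. One possible refinement, sketched here with only the standard library, is to filter the tokens and count them before drawing. The function name `word_frequencies`, the `min_length` threshold, and the sample tokens and stop word are all assumptions for illustration.

```python
from collections import Counter


def word_frequencies(tokens, min_length=2, stopwords=frozenset()):
    """Count tokens, dropping blanks, very short tokens and stop words."""
    cleaned = (t.strip() for t in tokens)
    kept = [t for t in cleaned if len(t) >= min_length and t not in stopwords]
    return Counter(kept)


# Made-up tokens standing in for the output of jieba.cut
tokens = ["哈哈", "的", "好甜", "哈哈", " ", "了", "好甜", "哈哈"]
freqs = word_frequencies(tokens, stopwords={"了"})
print(freqs.most_common(2))  # [('哈哈', 3), ('好甜', 2)]
```

The resulting mapping could then be passed to `WordCloud.generate_from_frequencies` in place of `generate`, so the cloud is drawn from the filtered counts.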