Fetching Bilibili User Comments with Python
Scraping approach
Overall idea
Use requests to send a request, receive the returned JSON data, and extract the fields we want from that JSON.
Tools used:
- Browser: Google Chrome, version 103.0.5060.114
- Libraries:
import requests  # send HTTP requests
import xlwt      # save data as .xls, which Excel can open
import re        # regular-expression matching
import time      # delay between requests
Getting started
Prerequisite: scraping data with the requests module takes three main steps:
① find the URL that serves the target data
② work out where that URL's request parameters come from
③ send the request and extract the data we want
1. Find the request that carries the comments
First, type the target page's address into the browser's address bar, press F12 to open the developer tools, then press Enter to load the page.
With the element inspector (F12) you can see that the comments are in the HTML, but you would have to refresh manually each time to get the comment area's content, and you still could not get the users' other details, such as gender, Bilibili level, and so on.
Because the comment area reloads every time you scroll, scroll the page to the bottom, click the Network tab in the developer tools, and a request shows up.
The concrete steps (5 in total):
① click Network
② select Fetch/XHR in the toolbar
③ click the Clear button to empty the captured requests
④ scroll the Bilibili page so the comment area refreshes
⑤ the request we want to capture now appears
Another way is to press Ctrl+F and search for a comment's text, which also locates the request that carries it; that won't be demonstrated here.
2. Find the comments' URL
Click that request.
Under General you can find the Request URL.
The comments' URL is the part before the question mark (everything after it is the parameters):
https://api.bilibili.com/x/v2/reply/main
3. Find the request parameters
Click Payload.
There are five parameters in total:
mode: 3
next: 2
oid: 215358709
plat: 1
type: 1
Analysis shows that mode, plat, and type never change.
next is the page number of this request; it is tied to how many comments there are.
oid is the key to building the request. Analysis shows it is tied to the video's ID; one option is to trace back through the JavaScript to find where it originates.
Here we take a different route: right-click, choose "View page source", and search with Ctrl+F. The value of oid turns out to sit right in the page source, which means that by requesting the page itself we can pull oid out of its source.
The code to obtain the oid parameter:
def findoid(resource_url):
    url = resource_url  # the page the video lives on
    res = requests.get(url=url)
    html = res.text
    # pull the value of oid out with a regular expression
    # (re.findall returns a list, so take the first match)
    oid = re.findall('window.__INITIAL_STATE__={"aid":(.*?),', html)
    return oid[0]
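Since re.findall returns an empty list when the pattern is absent, it can be safer to factor the extraction into its own helper and use re.search, which returns None on no match (a sketch; extract_oid is a name introduced here, and the pattern is the same one used above):

```python
import re

# Hypothetical helper: pull the oid value out of a page's HTML source,
# returning None instead of raising when the pattern is missing.
def extract_oid(html):
    m = re.search(r'window\.__INITIAL_STATE__=\{"aid":(\d+),', html)
    return m.group(1) if m else None

print(extract_oid('window.__INITIAL_STATE__={"aid":215358709,"bvid":"..."}'))  # 215358709
print(extract_oid('<html>no state object here</html>'))                        # None
```

This also makes the regex testable against a saved copy of the page, without sending a request.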
4. Send the request
First, confirm what type the response body is.
As the screenshot shows, the response type is JSON.
Send a request to test whether it returns the content we want:
def get_Comments():
    # the comments' URL, with the trailing parameters removed
    url1 = "https://api.bilibili.com/x/v2/reply/main"
    # the request parameters
    data = {
        "mode": "3",
        "next": "1",
        "oid": oid,
        "plat": "1",
        "type": "1"
    }
    # fake a browser request header if needed
    # headers = {
    #     "User-Agent": "xxxx"
    # }
    res = requests.get(url=url1, params=data)
    obj = res.json()
    print(obj)
The result:
The content we want is indeed in this response.
5. Analyze the response
1. Find the field that holds the comment text
The response data is JSON, and in Preview we can find the comment text we are after; the main content all lives in the data field.
In this example we want, for each comment under the video: its text, the commenting user's details, and its like count.
On inspection, inside data:
replies holds the comments (and related info) returned by this request
top_replies holds the pinned comments
member holds a comment's user details
{code: 0, message: "0", ttl: 1, data: {,…}}
code: 0
data: {,…} # everything we want lives in this data dict
assist: 0
blacklist: 0
callbacks: null
cm: {}
cm_info: {ads: null}
config: {showtopic: 1, show_up_flag: true, read_only: false}
control: {input_disable: false, root_input_text: "发一条友善的评论", child_input_text: "",…}
cursor: {all_count: 3607, is_begin: false, prev: 2, next: 3, is_end: false, mode: 3, show_type: 1,…}
effects: {preloading: ""}
folder: {has_folded: false, is_folded: false, rule: "https://www.bilibili.com/blackboard/foldingreply.html"}
hots: null
note: 1
notice: null
# replies is where the comments live
replies: [,…]
top: {admin: null,…}
top_replies: [{rpid: 118311801488, oid: 215358709, type: 1, mid: 253992474, root: 0, parent: 0, dialog: 0,…}]
up_selection: {pending_count: 0, ignore_count: 0}
upper: {mid: 253992474}
vote: 0
message: "0"
ttl: 1
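Getting from the parsed JSON down to these fields is plain dictionary and list indexing. A quick sketch against a stubbed-down response (only the keys discussed above, with made-up values):

```python
import json

# A stub with the same shape as the real response, trimmed to the keys we use
stub = json.loads('''{
  "code": 0,
  "data": {
    "cursor": {"all_count": 3607},
    "replies": [
      {"content": {"message": "first comment"}, "like": 12,
       "member": {"uname": "alice", "sex": "secret",
                  "level_info": {"current_level": 6}}}
    ],
    "top_replies": []
  }
}''')

data = stub["data"]
print(data["cursor"]["all_count"])   # prints 3607
for rep in data["replies"]:
    print(rep["content"]["message"], rep["like"], rep["member"]["uname"])
```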
2. Analyze the data in replies and in member
2.1 The replies field
replies holds every comment returned by this request, normally 20 per page; here the crawl has reached the end, so there are only 19.
0: {rpid: 118554421952, oid: 215358709, type: 1, mid: 102616106, root: 0, parent: 0, dialog: 0, count: 34,…}
1: {rpid: 118323161120, oid: 215358709, type: 1, mid: 65629049, root: 0, parent: 0, dialog: 0, count: 11,…}
2: {rpid: 118355970304, oid: 215358709, type: 1, mid: 20274471, root: 0, parent: 0, dialog: 0, count: 30,…}
3: {rpid: 119212346064, oid: 215358709, type: 1, mid: 30263910, root: 0, parent: 0, dialog: 0, count: 2,…}
4: {rpid: 118629424368, oid: 215358709, type: 1, mid: 7633899, root: 0, parent: 0, dialog: 0, count: 0,…}
5: {rpid: 118909473232, oid: 215358709, type: 1, mid: 363840296, root: 0, parent: 0, dialog: 0, count: 0,…}
6: {rpid: 118841838528, oid: 215358709, type: 1, mid: 1394590, root: 0, parent: 0, dialog: 0, count: 1,…}
7: {rpid: 118452707808, oid: 215358709, type: 1, mid: 15325878, root: 0, parent: 0, dialog: 0, count: 2,…}
8: {rpid: 118398405232, oid: 215358709, type: 1, mid: 287822219, root: 0, parent: 0, dialog: 0, count: 12,…}
9: {rpid: 118564094016, oid: 215358709, type: 1, mid: 1696242541, root: 0, parent: 0, dialog: 0, count: 4,…}
10: {rpid: 118524055216, oid: 215358709, type: 1, mid: 12362257, root: 0, parent: 0, dialog: 0, count: 5,…}
11: {rpid: 118625420992, oid: 215358709, type: 1, mid: 2066390, root: 0, parent: 0, dialog: 0, count: 0,…}
12: {rpid: 118855029920, oid: 215358709, type: 1, mid: 452097799, root: 0, parent: 0, dialog: 0, count: 2,…}
13: {rpid: 118444697568, oid: 215358709, type: 1, mid: 354134911, root: 0, parent: 0, dialog: 0, count: 1,…}
14: {rpid: 118768023904, oid: 215358709, type: 1, mid: 1976004, root: 0, parent: 0, dialog: 0, count: 1,…}
15: {rpid: 118411006704, oid: 215358709, type: 1, mid: 1218247382, root: 0, parent: 0, dialog: 0, count: 5,…}
16: {rpid: 118496421232, oid: 215358709, type: 1, mid: 337283560, root: 0, parent: 0, dialog: 0, count: 1,…}
17: {rpid: 118680897712, oid: 215358709, type: 1, mid: 17846755, root: 0, parent: 0, dialog: 0, count: 2,…}
18: {rpid: 118747512480, oid: 215358709, type: 1, mid: 384167476, root: 0, parent: 0, dialog: 0, count: 0,…}
2.2 The fields of each item in replies
Comparing entries shows that:
message holds the comment text we want
like is the comment's like count
the nested replies here are that comment's child replies
0: {
rpid: 118554421952, oid: 215358709, type: 1, mid: 102616106, root: 0, parent: 0, dialog: 0, count: 34,…}
action: 0
assist: 0
attr: 0
content: {,…}
device: ""
jump_url: {}
max_line: 6
members: []
# message is this item's comment text
message: "小缇娜的奇幻之地(159)\n无主之地3(50)\n怪物猎人世界/崛起(101)(152)\n恐怖黎明/泰坦之旅(16)(19.7)\n深岩银河(29.7)\n影子武士2(16.8)\n战神/鬼泣5(223)(65)\n破晓传说(164)\n女神异闻录5(260)\n极2/如龙7(65)(175)\n死亡之门(44)\ngta5(59.1)\n荒野大镖客2(125)\n巫师3/上古卷轴5(25)(42)\n奥德赛/枭雄/起源(74)(50)(60)\n2077(149)\n地平线5(198)\n幸福工厂(59.4)\n消逝的光芒1/2(43.6)(199)\n木筏求生(57.8)\n英灵神殿(17)(56)\n森林/绿色地狱(159)\n七日杀(23)\n饥荒(6)(12)\n小小梦魇(74)\n恐鬼症(37)\nCat Museum(21)\n开拓者:正义之怒(94)\n神界原罪2/极乐迪斯科(53)(40)\n霓虹深渊/挺进地牢(29)(19.2)\n哈迪斯(40)\n地痞街区(17.5)\n暖雪(46)\n循环英雄(23.1)\n雨中冒险2/杀戮尖塔(40)(27)\n光环士官长合集(46.4)\n泰坦陨落2(25)\n狙击精英4(15.8)\n幽灵线:东京(97)\n杀手3(110)\n死亡细胞/空洞骑士(40)(24)\n盐与避难所(15)\n过山车之星(30)\n动物园之星(54)\n侏罗纪世界:进化2(81)\n海岛大亨6(79.6)\n双点医院(39)\n足球经理2022(124)\n底特律:变人/夏日口袋(64)(63)\n主播女孩重度依赖(49)\n史莱姆农场(17)\n全面战争模拟器(24)\n双人成行(79)\n人类一败涂地(17)\n模拟大鹅(35)"
plat: 0
count: 34
ctime: 1656389162
dialog: 0
fansgrade: 0
folder: {has_folded: false, is_folded: false, rule: ""}
invisible: false
like: 1681
member: {mid: "102616106", uname: "静音い伶", sex: "保密", sign: "初音殿下赛高!",…}
mid: 102616106
oid: 215358709
parent: 0
parent_str: "0"
rcount: 32
replies: [,…]
reply_control: {sub_reply_entry_text: "共32条回复", sub_reply_title_text: "相关回复共32条", time_desc: "7天前发布"}
root: 0
root_str: "0"
rpid: 118554421952
rpid_str: "118554421952"
show_follow: false
state: 0
type: 1
up_action: {like: false, reply: false}
}
2.3 The member field
Within it:
uname is the username
sex is the user's gender
current_level inside level_info is the user's Bilibili level
member: {
mid: "102616106", uname: "静音い伶", sex: "保密", sign: "初音殿下赛高!",…}
DisplayRank: "0"
avatar: "http://i2.hdslb.com/bfs/face/e20a0fc211983fd3cd2653c06d76b9742a9429ec.jpg"
contract_desc: ""
face_nft_new: 0
fans_detail: null
following: 0
is_contractor: false
is_followed: 0
is_senior_member: 1
level_info: {current_level: 6, current_min: 0, current_exp: 0, next_exp: 0}
mid: "102616106"
nameplate: {nid: 74, name: "大会员2018年度勋章",…}
nft_interaction: null
official_verify: {type: -1, desc: ""}
pendant: {pid: 0, name: "", image: "", expire: 0, image_enhance: "", image_enhance_frame: ""}
rank: "10000"
sex: "保密"
sign: "初音殿下赛高!"
uname: "静音い伶"
}
6. Extract the data
With the analysis done, and the response being JSON, we can pull out each value with plain dictionary lookups; a loop then extracts every comment and its user's details from each request.
The code:
def getComment():
    url1 = "https://api.bilibili.com/x/v2/reply/main"
    data = {
        "mode": "3",
        "next": "1",
        "oid": oid,
        "plat": "1",
        "type": "1"
    }
    res = requests.get(url=url1, params=data)
    obj = res.json()
    data = obj['data']
    replies = data['replies']
    # if there are no comments, return right away
    if replies is None:
        return
    for rep in replies:
        # the comment itself
        content = rep['content']
        like = rep['like']
        msg = content['message']
        # the user's details
        member = rep['member']
        sex = member['sex']
        username = member['uname']
        level_info = member['level_info']
        level = level_info['current_level']  # an int
        reply_control = rep['reply_control']
        time_desc = reply_control['time_desc']
        lis = [username, sex, level, like, msg]
        print("username-----" + username)
        print("sex----------" + sex)
        print("level--------" + str(level))
        print("like---------" + str(like))
        print("msg----------" + msg)
The result:
7. Save the data
Save with the xlwt library.
The files it writes open in Excel, which makes the data easy to inspect.
The code:
# create a workbook
book = xlwt.Workbook()
# create a sheet
# the first argument of add_sheet is the sheet name;
# cell_overwrite_ok=True allows cells to be overwritten
sheet = book.add_sheet('comments', cell_overwrite_ok=True)
row0 = ['username', 'sex', 'level', 'like', 'comment']
for i in range(0, len(row0)):
    sheet.write(0, i, row0[i])  # write to row 0, column i -- this is the header
# obtain the oid
def findoid(resource_url):
    url = resource_url
    res = requests.get(url=url)
    html = res.text
    oid = re.findall('window.__INITIAL_STATE__={"aid":(.*?),', html)
    return oid[0]
# store the data
# fetch the pinned comments
def getTopReplies():
    url1 = "https://api.bilibili.com/x/v2/reply/main"
    data = {
        "mode": "3",
        "next": "1",
        "oid": oid,
        "plat": "1",
        "type": "1"
    }
    res = requests.get(url=url1, params=data)
    obj = res.json()
    data = obj['data']
    top_replies = data['top_replies']
    cursor = data['cursor']
    all_comments = cursor['all_count']  # total number of comments
    row = 1
    for rep in top_replies:
        content = rep['content']
        like = rep['like']
        msg = content['message']
        # the user's details
        member = rep['member']
        sex = member['sex']
        username = member['uname']
        level_info = member['level_info']
        level = level_info['current_level']  # an int
        reply_control = rep['reply_control']
        time_desc = reply_control['time_desc']
        lis = [username, sex, level, like, msg]
        print("username-----" + username)
        print("sex----------" + sex)
        print("level--------" + str(level))
        print("like---------" + str(like))
        print("msg----------" + msg)
        # store the row here
        for j in range(0, 5):
            sheet.write(row, j, lis[j])
        row += 1
if __name__ == '__main__':
    # enter the target video's address to start crawling its comments
    resource_url = input("Enter the URL: ")
    oid = findoid(resource_url)
    getTopReplies()
    # save the .xls workbook
    book.save(filename_or_stream="new.xls")
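As a side note, if you would rather skip the xlwt dependency, the same five columns can be written with the standard-library csv module instead (an alternative sketch, not what this article uses; the utf-8-sig encoding is so Excel detects UTF-8 correctly):

```python
import csv

rows = [
    ['username', 'sex', 'level', 'like', 'comment'],  # header, same as row0 above
    ['alice', 'secret', 6, 12, 'a sample comment'],   # made-up data row
]
# newline='' is required so the csv module controls line endings itself
with open('new.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(rows)
```

Excel opens .csv files directly, and there is no 65,536-row limit as with the legacy .xls format.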
8. The full code
import requests
import xlwt
import re
import time

book = xlwt.Workbook()
sheet = book.add_sheet('comments', cell_overwrite_ok=True)
row0 = ['username', 'sex', 'level', 'like', 'comment']
for i in range(0, len(row0)):
    sheet.write(0, i, row0[i])

oid = 0

def findoid(resource_url):
    url = resource_url
    res = requests.get(url=url)
    html = res.text
    oid = re.findall('window.__INITIAL_STATE__={"aid":(.*?),', html)
    return oid[0]

def getTopReplies():
    url1 = "https://api.bilibili.com/x/v2/reply/main"
    data = {
        "mode": "3",
        "next": "1",
        "oid": oid,
        "plat": "1",
        "type": "1"
    }
    res = requests.get(url=url1, params=data)
    obj = res.json()
    data = obj['data']
    top_replies = data['top_replies']
    cursor = data['cursor']
    all_comments = cursor['all_count']  # total number of comments
    row = 1
    for rep in top_replies:
        content = rep['content']
        like = rep['like']
        msg = content['message']
        # the user's details
        member = rep['member']
        sex = member['sex']
        username = member['uname']
        level_info = member['level_info']
        level = level_info['current_level']  # an int
        reply_control = rep['reply_control']
        time_desc = reply_control['time_desc']
        lis = [username, sex, level, like, msg]
        print("username-----" + username)
        print("sex----------" + sex)
        print("level--------" + str(level))
        print("like---------" + str(like))
        print("msg----------" + msg)
        for j in range(0, 5):
            sheet.write(row, j, lis[j])
        row += 1
    lis = [len(top_replies), all_comments]
    return lis

# index is the page number of the request
# length is the number of pinned comments
def getComment(index, length):
    url1 = "https://api.bilibili.com/x/v2/reply/main"
    data = {
        "mode": "3",
        "next": f"{index}",
        "oid": oid,
        "plat": "1",
        "type": "1"
    }
    res = requests.get(url=url1, params=data)
    obj = res.json()
    data = obj['data']
    replies = data['replies']
    # return if there are no comments
    if replies is None:
        return
    row = 1
    for rep in replies:
        # the comment itself
        content = rep['content']
        like = rep['like']
        msg = content['message']
        # the user's details
        member = rep['member']
        sex = member['sex']
        username = member['uname']
        level_info = member['level_info']
        level = level_info['current_level']  # an int
        reply_control = rep['reply_control']
        time_desc = reply_control['time_desc']
        lis = [username, sex, level, like, msg]
        print("username-----" + username)
        print("sex----------" + sex)
        print("level--------" + str(level))
        print("like---------" + str(like))
        print("msg----------" + msg)
        for j in range(0, 5):
            sheet.write((index - 1) * 20 + row + length, j, lis[j])
        row += 1

def get_Comments():
    url1 = "https://api.bilibili.com/x/v2/reply/main"
    data = {
        "mode": "3",
        "next": "1",
        "oid": oid,
        "plat": "1",
        "type": "1"
    }
    res = requests.get(url=url1, params=data)
    obj = res.json()
    print(obj)

if __name__ == '__main__':
    # enter the target video's address to start crawling its comments
    resource_url = input("Enter the URL: ")
    oid = findoid(resource_url)
    lis = getTopReplies()
    length = lis[0]
    # how many pages there are
    page_num = lis[1] / 20
    page_num = int(page_num)
    print("length=" + str(length))
    # adjust the crawl range here
    for i in range(1, page_num):
        getComment(i, length)
        time.sleep(1)  # throttle the loop, or your IP may get banned
        print("====================================================")
    book.save(filename_or_stream="new.xls")
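One detail in the main block above: page_num = int(lis[1] / 20) truncates, and range(1, page_num) also stops one short, so the last pages are skipped (the comment there marks the crawl range as something to adjust yourself). If you want every page, a rounded-up count does it (a sketch using the all_count of 3607 seen earlier):

```python
import math

all_count = 3607                      # taken from cursor['all_count'] above
page_num = math.ceil(all_count / 20)  # 20 comments per page, round up
print(page_num)                       # 181, including the partial last page
# iterate pages 1..page_num inclusive:
pages = range(1, page_num + 1)
print(len(pages))                     # 181
```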
The result looks like this: