Part 19: scraping bilibili danmaku with lxml and hitting ValueError: Unicode strings with encoding declaration are not supported


萌新求大佬


Category: Python web scraping. Tags: python, web scraping, bilibili

Article link: https://blog.csdn.net/weixin_43779803/article/details/103083343


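I wrote this post after reading someone else's post, "python爬虫：bilibili弹幕爬取+词云生成" (Python scraping: bilibili danmaku scraping + word cloud generation). Since that author parsed the data with BeautifulSoup, I figured lxml shouldn't be left behind.
Here is the problem I ran into while scraping the danmaku of a bilibili video: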

    html = etree.HTML(text)
  File "src\lxml\etree.pyx", line 3170, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

The error is raised at the point where we parse with lxml.
First, the code we had before that point:

import requests
from lxml import etree
url = 'https://comment.bilibili.com/128589248.xml'
response = requests.get(url)
print(response.content.decode('utf-8'))  # decode the response bytes as UTF-8

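Where does the URL above come from? Open bilibili, click any video, and open the developer tools (F12) while it starts playing.
Under the Network tab, the XHR filter shows many heartbeat requests once playback begins. Pick any one of them, look at its parameters, copy the value of cid, and put it into the following URL: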

https://comment.bilibili.com/<cid>.xml, for example https://comment.bilibili.com/128589248.xml
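The substitution described above can be sketched as a tiny helper. The function name `danmaku_url` is made up for illustration; only the URL pattern and the sample cid come from this post.

```python
def danmaku_url(cid: int) -> str:
    """Build the danmaku XML URL for a cid, using the pattern shown in this post."""
    # danmaku_url is a hypothetical helper name, not part of any library.
    return f"https://comment.bilibili.com/{cid}.xml"


url = danmaku_url(128589248)
print(url)
```

Keeping the cid as a parameter makes it easy to reuse the same code for another video once you have copied its cid from a heartbeat request.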

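Then open that URL:
It lists all the danmaku for this video. The few lines of code above fetch that response. With the response in hand, import the etree module from lxml as usual (from lxml import etree; requests has to be imported too, which I won't dwell on), then store the parsed document in a variable html: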

text = response.content.decode('utf-8')
html = etree.HTML(text)

Running this raises the error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
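The failure can be reproduced without touching the network. This is a minimal sketch, assuming a small hand-written XML string stands in for the decoded response (the real payload is larger and its exact layout is not shown here). It uses etree.fromstring rather than etree.HTML, on the assumption that both reach the same check in _parseMemoryDocument seen in the traceback.

```python
from lxml import etree

# Hypothetical stand-in for response.content.decode('utf-8').
text = '<?xml version="1.0" encoding="UTF-8"?><i><d>第一条弹幕</d><d>hello</d></i>'

# A str that still carries an encoding declaration is expected to be rejected.
try:
    etree.fromstring(text)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# The same document passed as bytes is parsed normally.
root = etree.fromstring(text.encode("utf-8"))
danmu = [str(t) for t in root.xpath("//d/text()")]
print(danmu)
```

The only difference between the two calls is the input type, which is why the error message suggests bytes input.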
So naturally I searched Baidu and found:
lxml简明教程 (A Concise lxml Tutorial)
The very first example in it was an eye-opener:

>>> xml_string = '<root><foo id="foo-id" class="foo zoo">Foo</foo><bar>中文</bar><baz></baz></root>'
>>> root = etree.fromstring(xml_string.encode('utf-8'))  # best to pass a byte string
>>> etree.tostring(root)
# returns a byte string by default
b'<root><foo id="foo-id" class="foo zoo">Foo</foo><bar>&#20013;&#25991;</bar><baz/></root>'
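One detail of etree.tostring is worth seeing in isolation: by default it serializes to ASCII bytes, so non-ASCII text such as Chinese comes out as numeric character references. The sketch below uses a shortened sample document of my own and shows that passing encoding='unicode' returns a readable str instead.

```python
from lxml import etree

# Shortened sample document, made up for illustration.
root = etree.fromstring('<root><bar>中文</bar></root>'.encode('utf-8'))

# Default serialization: ASCII bytes, non-ASCII characters escaped.
ascii_bytes = etree.tostring(root)
print(ascii_bytes)

# encoding='unicode' returns a str with the characters intact.
as_text = etree.tostring(root, encoding='unicode')
print(as_text)
```

For printing danmaku while debugging, the str form is usually the easier one to read.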

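This approach encodes the str to UTF-8 bytes before handing it to the parser. lxml rejects a str that carries an encoding declaration because already-decoded text has no byte encoding left to declare, whereas with bytes the parser can read the declaration and decode the input itself. Let's try it: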

html = etree.fromstring(text.encode('utf-8'))
danmu = html.xpath('//d/text()')
print(danmu)
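Since response.content is already bytes, decoding it and encoding it again is a round trip that can probably be skipped. The following is a sketch under that assumption; `parse_danmaku` is a hypothetical helper name, and the sample bytes are made up so the block runs without a network request.

```python
from typing import List

from lxml import etree


def parse_danmaku(xml_bytes: bytes) -> List[str]:
    """Return the text of every <d> element in a danmaku XML document."""
    # Bytes input lets lxml honor the document's own encoding declaration.
    root = etree.fromstring(xml_bytes)
    return [str(t) for t in root.xpath("//d/text()")]


# Made-up sample standing in for response.content.
sample = '<?xml version="1.0" encoding="UTF-8"?><i><d>前方高能</d><d>2333</d></i>'.encode("utf-8")
danmu = parse_danmaku(sample)
print(danmu)

# With a live request (not run here): parse_danmaku(requests.get(url).content)
```

Wrapping the parse in a function also makes it straightforward to test the XPath against a saved copy of the XML before hitting the site again.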

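Sure enough, it worked. Looking back at the response we fetched earlier, it begins with an XML declaration that names an encoding, which is exactly what the error message complains about.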
