Python: character-encoding problems when scraping Bilibili danmaku XML responses

Latest recommended article published on 2024-04-26 01:45:18

hhjiamei


Category: python. Tags: python, xpath, xml

Article link: https://blog.csdn.net/hhjiamei/article/details/105086761


While learning to scrape, I found two ways to parse an XML response: lxml and the BeautifulSoup (bs) parsers.

from bs4 import BeautifulSoup  # the BeautifulSoup class is the main entry point

In practice, an HTML document, its tag tree, and a BeautifulSoup object can be treated as equivalent.

Parsers available to the Beautiful Soup library:

Python's built-in HTML parser: BeautifulSoup(mk, 'html.parser') (requires only the bs4 library)

lxml's HTML parser: BeautifulSoup(mk, 'lxml') (requires pip install lxml)

lxml's XML parser: BeautifulSoup(mk, 'xml') (requires pip install lxml)

html5lib's parser: BeautifulSoup(mk, 'html5lib') (requires pip install html5lib)
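----------------
Copyright notice: the passage above is an original article by CSDN blogger 「禾如月」, licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/xiu_star/article/details/70157924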
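To make the parser choice concrete, here is a minimal sketch that pulls the text out of every d tag. The helper name extract_danmu and the SAMPLE string are made up for illustration and only mimic the shape of a danmaku file. It defaults to 'html.parser' so that it runs with bs4 alone; passing 'lxml' or 'xml' instead should work the same way once lxml is installed.

```python
from bs4 import BeautifulSoup


def extract_danmu(markup, parser="html.parser"):
    """Return the text of every <d> tag found in the markup.

    `parser` is any parser name Beautiful Soup accepts, for example
    'html.parser', 'lxml', 'xml' or 'html5lib' (the last three need
    the corresponding package to be installed).
    """
    soup = BeautifulSoup(markup, parser)
    return [tag.get_text() for tag in soup.find_all("d")]


# Hypothetical sample, not a real Bilibili response.
SAMPLE = '<i><d p="1.5,1">hello</d><d p="3.0,1">世界</d></i>'

danmu = extract_danmu(SAMPLE)
print(danmu)
```

Because SAMPLE is already a Python str, no decoding happens here; encoding only becomes an issue when the markup arrives as bytes from the network.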

I tried both approaches.

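The first approach (lxml) uses response.content. (response.text returns the body decoded into a Unicode string, while response.content returns the raw body as bytes.) The bytes are then passed to etree.HTML(), which parses the markup, automatically repairs malformed HTML, and returns the root element that XPath queries can be run against. etree.tostring() serializes the repaired tree; by default its result is of type bytes.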
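The difference between text and bytes is where the garbling comes from. The sketch below uses only the standard library and a made-up sample string to show what happens when UTF-8 bytes are decoded with the wrong charset. ISO-8859-1 is used as the wrong charset on the assumption that the server does not declare one, which is one common cause of this symptom with requests; the real cause on a given endpoint may differ.

```python
def decode_body(raw, encoding):
    """Decode a raw response body (bytes) with the given charset."""
    return raw.decode(encoding)


# Hypothetical danmaku fragment, encoded the way a server would send it.
original = "<d>你好，弹幕</d>"
raw = original.encode("utf-8")

# Wrong charset: every byte becomes one Latin-1 character (mojibake).
garbled = decode_body(raw, "iso-8859-1")

# Right charset: the Chinese text comes back intact.
correct = decode_body(raw, "utf-8")

# Text garbled this way can be repaired, because Latin-1 maps
# every byte to exactly one character and back again.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(correct)
```

This is why handing the raw bytes to the parser, or setting the encoding before reading the decoded text, avoids the problem.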

Partial code:

Here danmu, the result of tree.xpath(), prints as a list; tree itself is an element object. (Reference: https://segmentfault.com/a/1190000012645691)

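import requests
from lxml import etree

def danmure(self, response):
    title = response.meta["title"]  # title passed along in the request meta
    xmlurl = response.url
    resp = requests.get(xmlurl)  # fetch the danmaku XML again with requests
    # Pass the raw bytes (resp.content), not the decoded resp.text
    tree = etree.HTML(resp.content)

    danmu = tree.xpath("//d/text()")  # text of every <d> element, as a list
    print(danmu)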
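Since the response is XML rather than HTML, the same bytes can also go through lxml's XML parser. The following is a minimal sketch under that assumption: etree.fromstring() is used here in place of etree.HTML(), and the RAW sample and the helper name danmu_texts are invented for illustration. Given bytes, the parser reads the encoding from the XML declaration itself.

```python
from lxml import etree

# Hypothetical sample bytes, shaped like a danmaku XML file.
RAW = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<i><d p="1.5,1">你好</d><d p="3.0,1">world</d></i>'
).encode("utf-8")


def danmu_texts(raw):
    """Parse XML bytes and return the text of every <d> element."""
    root = etree.fromstring(raw)  # bytes in, so the declared encoding is honored
    # xpath() returns a list of string-like results; convert to plain str.
    return [str(text) for text in root.xpath("//d/text()")]


texts = danmu_texts(RAW)
print(texts)
```

Unlike etree.HTML(), the XML parser does not repair malformed markup, so it raises an error on a broken document instead of guessing.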

The second approach (bs):

import requests
from bs4 import BeautifulSoup

def danmure(self, response):
    title = response.meta["title"]
    xmlurl = response.url
    resp = requests.get(xmlurl)  # fetch the page
    resp.encoding = 'utf8'  # set the encoding explicitly; otherwise the Chinese text in resp.text comes out garbled
    soup = BeautifulSoup(resp.text, 'lxml')  # same usage as in the parser list at the top
    results = soup.find_all('d')  # find all 'd' tags
    comments = [comment.text for comment in results]
    print(comments)