Problems parsing non-standard HTML with BeautifulSoup

Latest recommended article published on 2023-02-26 11:50:42

weixin_30457065

Tags: python c#

Original link: http://www.cnblogs.com/tara/p/3425036.html

The problem:

BeautifulSoup version: 4.3.2

While searching HTML with BeautifulSoup.find_all(), I ran into the following markup:

<a href="/shipin/donghuapian/2012-07-25/23404.html"title="谦谦君子" target="_blank">温润如玉</a>

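Note that there is no space between the href attribute and the title attribute of the a tag.

Analysis:

Running BeautifulSoup's diagnostic tool, diagnose (available since version 4.2):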

from bs4.diagnose import diagnose

# Read the page to inspect; test.html contains the problematic <a> tag.
with open('test.html') as f:
    html_doc = f.read()
diagnose(html_doc)

shows that the line was parsed as:

<a href="/shipin/donghuapian/2012-07-25/23404.html"> title="谦谦君子" target="_blank"&gt;温润如玉</a>

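See it? The a tag is malformed: it is closed right after href, so title and target end up as plain text outside the tag instead of as attributes. When BeautifulSoup.find_all() reaches this line, matching on title fails.

The cause is the parser. With no other parser installed, BeautifulSoup falls back to Python's built-in HTML parser (html.parser), which, at least in the version used here, does not tolerate malformed pages well.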
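To see how each installed parser treats the malformed line, the attributes of the parsed tag can be compared side by side. This is a minimal sketch, not part of the original post: the names SNIPPET and attrs_by_parser are made up for illustration, and the result depends on which parsers and which Python version are installed, so no expected output is shown.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The malformed line from the page: no space between href="..." and title="...".
SNIPPET = ('<a href="/shipin/donghuapian/2012-07-25/23404.html"'
           'title="谦谦君子" target="_blank">温润如玉</a>')


def attrs_by_parser(markup, parsers=('html.parser', 'lxml', 'html5lib')):
    """Return {parser name: attributes of the first <a> tag} for installed parsers.

    Hypothetical helper for illustration only.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; skip it.
            continue
        tag = soup.find('a')
        results[name] = dict(tag.attrs) if tag is not None else None
    return results


results = attrs_by_parser(SNIPPET)
for name, attrs in results.items():
    print(name, attrs)
```

A parser that handles the line correctly reports href, title and target; one that does not reports only href, which matches the diagnose output above.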

Solution:

Give BeautifulSoup a different HTML parser (the original post links to the details). I chose lxml:

sudo pip install lxml

When creating the BeautifulSoup object, pass one extra argument:

# coding=utf-8
import re
from bs4 import BeautifulSoup

with open('test.html') as f:
    html_doc = f.read()
soup = BeautifulSoup(html_doc, 'lxml')  # Use lxml as the HTML parser.
# Find <a> tags whose title attribute contains '君子'.
tags = soup.find_all('a', {'title': re.compile(u'君子')})

And that does it.
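If lxml cannot be guaranteed on every machine, the parser can be chosen at run time. The following is a rough sketch under that assumption, not the author's code: make_soup and SNIPPET are hypothetical names, and the preference order (lxml, then html5lib, then html.parser) is just one reasonable choice.

```python
import re
from bs4 import BeautifulSoup, FeatureNotFound

# The malformed line from the page, inlined so the example runs without test.html.
SNIPPET = ('<a href="/shipin/donghuapian/2012-07-25/23404.html"'
           'title="谦谦君子" target="_blank">温润如玉</a>')


def make_soup(markup, preferred=('lxml', 'html5lib', 'html.parser')):
    """Parse markup with the first installed parser in `preferred`.

    Hypothetical helper; returns (soup, name of the parser that was used).
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # Parser not installed; try the next one.
            continue
    raise RuntimeError('no usable HTML parser found')


soup, parser_name = make_soup(SNIPPET)
# Same search as above: <a> tags whose title contains '君子'.
tags = soup.find_all('a', title=re.compile('君子'))
for tag in tags:
    print(parser_name, tag['href'], tag['title'], tag.get_text())
```

Naming the parser explicitly, as the post does, also keeps results consistent across machines, since the parser BeautifulSoup picks on its own depends on what happens to be installed.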

Reposted from: https://www.cnblogs.com/tara/p/3425036.html
