C libxml2 HTML parsing: in Python, which is better for parsing malformed HTML, lxml or libxml2?

Latest recommended article published 2024-02-11 21:22:44

weixin_39755712


The libxml2 page carries the following note: "Note that some of the Python purist dislike the default set of Python bindings, rather than complaining I suggest they have a look at lxml the more pythonic bindings for libxml2 and libxslt and check the mailing-list."

The lxml page has another: "The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API."

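So in essence, lxml gives you exactly the same functionality, but through a Pythonic API that is compatible with the ElementTree library in the standard library (which means the standard library documentation also helps when learning lxml). That is why lxml is preferred over libxml2, even though the underlying implementation is the same.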
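The ElementTree compatibility described here can be illustrated with a minimal sketch. It assumes only that the calls used (`fromstring`, `findall`, `iter`, `get`) behave the same way in `lxml.etree` and in the standard library's `xml.etree.ElementTree`; the sample document and the variable names are made up for illustration.

```python
# Prefer lxml when it is installed; otherwise fall back to the standard library.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical well-formed input: both implementations accept it
# through the same ElementTree-style calls.
root = etree.fromstring(
    "<root><item id='1'>first</item><item id='2'>second</item></root>"
)

# findall() matches direct children by tag; iter() walks the whole tree.
texts = [item.text for item in root.findall("item")]
ids = [item.get("id") for item in root.iter("item")]

print(texts)
print(ids)
```

Because the import falls back to the standard library, the same script runs whether or not lxml is installed.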

Edit: as other answers explain, the best option for parsing malformed HTML is BeautifulSoup. One point worth noting is that if lxml is installed, BeautifulSoup will use it, as described in the documentation for the new version: "If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser."
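The parser ranking quoted from the documentation can be made explicit in code. The following is only a sketch, not how Beautiful Soup chooses internally: the helper `make_soup`, the tuple `PREFERRED_PARSERS`, and the sample markup are hypothetical names introduced for illustration. It assumes the `bs4` package is installed and relies on `FeatureNotFound`, the exception Beautiful Soup raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Order taken from the documentation quote: lxml, then html5lib,
# then Python's built-in parser.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")


def make_soup(markup):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in PREFERRED_PARSERS:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("no HTML parser available")


# Hypothetical malformed input: neither <p> nor <b> is closed.
soup, parser_used = make_soup("<p>Unclosed paragraph <b>bold")

print(parser_used)
print(soup.b.get_text())
```

Naming the parser explicitly, as this helper does, makes it visible which library actually handled the markup on a given machine.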

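In any case, even though BeautifulSoup uses lxml behind the scenes, it lets you parse broken HTML that cannot be parsed directly as XML. For example:

>>> import lxml.etree
>>> lxml.etree.fromstring('<html>')
...
XMLSyntaxError: Premature end of data in tag html line 1, line 1, column 7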

BeautifulSoup, however, accepts the same input.
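The difference between strict XML parsing and lenient HTML parsing can also be seen inside lxml itself. The sketch below swaps in `lxml.html.fromstring`, lxml's own HTML entry point, rather than BeautifulSoup, so it is not the snippet from the original answer; the helper names and the sample markup are hypothetical, and it assumes lxml is installed.

```python
from lxml import etree, html

# Hypothetical malformed input: none of the opened tags is closed.
BROKEN = "<html><body><p>unclosed paragraph"


def parse_as_xml(markup):
    """Return the root element, or None if the strict XML parser rejects it."""
    try:
        return etree.fromstring(markup)
    except etree.XMLSyntaxError:
        return None


def parse_as_html(markup):
    """Parse with lxml's HTML parser, which recovers from missing end tags."""
    return html.fromstring(markup)


xml_root = parse_as_xml(BROKEN)
html_root = parse_as_html(BROKEN)

print(xml_root)                     # the XML parser gave up on this input
print(html_root.tag)                # the HTML parser still built a tree
print(html_root.findtext(".//p"))
```

Which of the two entry points to use depends on whether the input is expected to be well-formed XML or real-world HTML.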

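Finally, note that lxml also provides an interface to the old version of BeautifulSoup, as follows:

>>> import lxml.html.soupparser
>>> lxml.html.soupparser.fromstring('<html>')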

So at the end of the day you will probably be using both lxml and BeautifulSoup. The only thing left to choose is which API you like best.
