As the saying goes, a good memory is no match for a worn pen: scattered bits of knowledge, if never summarized and organized into a system, feel chaotic and give little sense of progress. So here I once again compare the common usage patterns of three parsing libraries.
Main references:
BeautifulSoup official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
pyquery official documentation: https://pythonhosted.org/pyquery/index.html
XPath tutorial (W3Schools): https://www.w3schools.com/xml/xpath_intro.asp
1. The BeautifulSoup library
When analyzing a web page, the usual workflow is to locate the target nodes and then extract their data.
The following HTML snippet is used throughout to illustrate the similarities and differences of the three parsing approaches:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Instantiation
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
"lxml"是解析方式,常用的有
Locating a node: soup.<tag name> returns the first matching tag. For example:
soup.title → <title>The Dormouse's story</title>
If you don't pass a parser, BeautifulSoup falls back to the best one available, but it is better to specify one explicitly; otherwise you get a warning like:
UserWarning: No parser was explicitly specified, so I’m using the best available HTML parser for this system (“lxml”). This usually isn’t a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
soup.p → <p class="title"><b>The Dormouse's story</b></p>
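Putting the pieces together, here is a minimal runnable sketch of locating nodes in the sample document (tag-name access returns the first match; find_all returns every match):

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# soup.<tag> returns the FIRST matching tag in the tree
print(soup.title)   # <title>The Dormouse's story</title>
print(soup.p)       # the <p class="title"> paragraph

# find_all() returns a list of every match
for a in soup.find_all('a'):
    print(a['id'], a['href'], a.get_text())
```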