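Parsing with PyQuery
Reading guide:
This article explains PyQuery's parsing methods in detail. Each method comes with a code example, so readers can gain a deeper understanding of PyQuery. Because the article is long, we recommend using the table of contents to jump to the parts that interest you.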
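1. Initializing PyQuery
1.1 Initializing from a string
With this approach, the HTML content is passed directly as the argument that initializes the PyQuery object; elements are then selected through that object.
Example:
<!-- Assume this is the source of some web page whose address is URL -->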
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
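Code:
from pyquery import PyQuery as pq
import requests

url = "URL"  # placeholder for the page address
# Fetch the page and pass its HTML text (not the Response object) to PyQuery
html = requests.get(url).text
doc = pq(html)
print(doc('li'))
Output: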
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
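1.2 Initializing from a URL
Pass the page's URL to pq, for example through the url keyword argument. pq sends a request to the page, retrieves its source, and initializes the object from it. The remaining steps are the same as for string initialization, so they are not repeated here.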
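1.3 Initializing from a file
File initialization passes a file path to pq through the filename keyword argument; pq then initializes from the file's contents. Everything after that works the same as with string initialization.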
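2 Using CSS selectors
Pass a CSS selector (for example an id, a class name, or a tag name) to the PyQuery object to get the matching elements and their content.
Example: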
<!--html-->
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
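Code:
from pyquery import PyQuery as pq

# html holds the HTML snippet shown above
doc = pq(html)
print(doc('#container .list li'))
print(type(doc('#container .list li')))
print()
# items() yields each matched li as its own PyQuery object
for item in doc('#container .list li').items():
    print(item.text())
Output: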
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<class 'pyquery.pyquery.PyQuery'>
first item
second item
third item
fourth item
fifth item
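Explanation:
doc('#container .list li') first finds the element whose id is container, then the element with class list beneath it, and finally the li elements inside that element, which are then printed. The result is itself a PyQuery object. Calling text() on each item returns only the text the li contains, without any of the surrounding HTML tags.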
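3 Finding nodes
3.1 Child nodes
PyQuery offers two methods for finding child nodes:
- find() returns all descendants of an element that match the selector
- children() returns only an element's direct children
HTML to parse: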
<!--html-->
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
- find() example:
Get the ul element and the li elements inside it
Code:
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('ul')
print(items)
lis = items.find('li')
print(lis)
Output:
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
- children() example:
Get the direct children whose class attribute contains active (items is the ul selected in the find() example)
Code:
lis = items.children('.active')
print(lis)
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
3.2 Parent nodes
There are also two methods for getting parent nodes:
- parent() returns the node's direct parent
- parents() returns all of the node's ancestors
HTML to parse:
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
- parent() example:
Code:
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parent()
print(container)
Output:
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
- parents() example:
Code:
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
parents = items.parents()
print(type(parents))
print(parents)
Output:
<class 'pyquery.pyquery.PyQuery'>
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
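3.3 Iterating over nodes
When a selection matches many nodes, for example a ul element that contains several li elements, you can iterate over the matched nodes to process them one by one.
Example:
This example reuses the HTML from section 3.2.
Code:
from pyquery import PyQuery as pq
doc = pq(html)
# items() returns a generator of PyQuery objects, one per matched li
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li)
Output:
<class 'generator'>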
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
4 Getting data
4.1 Getting attribute values
Use the attr() method to get the value of an element's attribute.
Example:
HTML to parse:
<!--html-->
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
Code:
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr('href'))
Output:
<a href="link3.html"><span class="bold">third item</span></a>
link3.html
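To get the attribute values of several nodes, loop over them and read each value in turn, because attr() only returns the attribute of the first matched node.
Example:
Code:
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('a')
for item in a.items():
    print(item.attr('href'))
Output: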
link2.html
link3.html
link4.html
link5.html
4.2 Getting text
PyQuery provides two methods that return different kinds of text:
- text() returns the plain text of a node
- html() returns the inner HTML of a node
HTML to parse:
<!--html-->
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
- text() example:
Get the plain text of the target a element
Code:
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())
Output:
<a href="link3.html"><span class="bold">third item</span></a>
third item
- html() example:
Get the inner HTML of the target li element
Code:
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>