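A powerful and flexible web page parsing library. If regular expressions feel too tedious to write and BeautifulSoup's syntax is hard to remember, but you already know jQuery syntax, PyQuery is an excellent choice.
Installation: pip3 install pyquery
Initialization
Initializing from a string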
from pyquery import PyQuery as pq
html = '''
<div>
<ul>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link3.html'><span class='bold'>third item</span></a></li>
</ul>
</div>
'''
doc = pq(html)
print(doc('li'))
# Selection works like CSS selectors: prefix a class with a dot, an id with #, and write a tag name with no prefix
Output:
<li class="item-0">first item</li>
<li class="item-1"><a href="link3.html"><span class="bold">third item</span></a></li>
Initializing from a URL
from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')
print(doc('head'))
Output:
<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下，你就知道</title></head>
Here a URL is passed in. PyQuery requests that URL automatically and builds a PyQuery object from the page source it gets back.
Initializing from a file
from pyquery import PyQuery as pq
doc = pq(filename='1.html')
print(doc('ul'))
Output:
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
------------------------
Contents of 1.html:
<div>
<ul>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link3.html'><span class='bold'>third item</span></a></li>
</ul>
</div>
Basic CSS selectors:
from pyquery import PyQuery as pq
html = '''
<div id='container'>
<ul class='list'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
'''
doc = pq(html)
print(doc('#container .list li'))
Output:
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
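In a CSS selector, an id is prefixed with #, a class is prefixed with a dot, and a tag name takes no prefix.
Finding elements
Finding child elements
The find method: finds elements nested anywhere inside the current element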
from pyquery import PyQuery as pq
html = '''
<div id='container'>
<ul class='list'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
'''
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)
Output:
<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
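The children method finds direct children only, whereas find matches any element nested inside, at any depth. find is the more commonly used of the two.
Finding parent elements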
from pyquery import PyQuery as pq
html = '''
<div id='container'>
<ul class='list'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
'''
doc = pq(html)
items = doc('.list')
print(items.parent())
Output:
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>
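There is also the parents method, which finds ancestor nodes: not only the parent, but the parent's parent and so on up the tree.
As with finding elements, you can pass an argument (similar to a CSS selector) to these methods to filter the result further, for example: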
from pyquery import PyQuery as pq
html = '''
<div id='container'>
<ul class='list'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
'''
doc = pq(html)
items = doc('.list')
print(items.parent('#container'))
# Keep only the parent whose id is container
Output:
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>
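Sibling elements
The siblings method
## In a selector such as doc('.list .item-0.active'), a space means descending one level at a time, while no space means the conditions apply to the same element, that is, an element that has both the item-0 class and the active class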
from pyquery import PyQuery as pq
html = '''
<div id='container'>
<ul class='list'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
'''
doc = pq(html)
items = doc('.list .item-0.active')
print(items)
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
Calling items.siblings() then outputs its sibling elements:
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
from pyquery import PyQuery as pq
html = '''
<div id='container'>
<ul class='list'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
'''
doc = pq(html)
items = doc('.list .item-0.active')
print(items.siblings())
print(items.siblings('.active'))
# A selector can be passed in to filter the siblings further
Output:
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
Traversal
The items() method: it returns a generator, which you then iterate over with a for loop
from pyquery import PyQuery as pq
html = '''
<div id='container'>
<ul class='list'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
'''
doc = pq(html)
lis = doc('li').items()
for li in lis:
    print(li)
Output:
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
Getting information
Getting attributes
For example, to get an attribute of an element named item:
item.attr('attribute_name'), or:
item.attr.attribute_name
from pyquery import PyQuery as pq
html = '''
<div id='container'>
<ul class='list'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
'''
doc = pq(html)
li = doc('.item-0.active a')
print(li.attr.href)
print(li.attr('href'))
Output:
link3.html
link3.html
Getting text
The text() method
Getting HTML
The html() method, for example:
from pyquery import PyQuery as pq
html = '''
<div id='container'>
<ul class='list'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>
# Printing li gives the tag itself together with everything inside it
# Calling html() gives only the HTML inside the tag
DOM manipulation
That is, operating on nodes
addClass, removeClass: add and remove classes
from pyquery import PyQuery as pq
html = '''
<div id='container'>
<ul class='list'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.removeClass('active'))
print(li.addClass('active'))
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
attr, css: set attributes and inline styles
from pyquery import PyQuery as pq
html = '''
<div id='container'>
<ul class='list'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.attr('name','link'))
print(li.css('font-size','14px'))
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
# There was no name attribute before, so one is added; if a name attribute had already existed, its value would be overwritten
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
# After calling css, a style attribute appears on the element
remove: delete nodes
from pyquery import PyQuery as pq
html = '''
<div class='wrap'>
hello world
<p>this is a paragraph</p>
</div>
'''
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
print(wrap.find('p'))
wrap.find('p').remove()
print(wrap.text())
Output:
hello world
this is a paragraph
<p>this is a paragraph</p>
hello world
Other DOM methods
http://pyquery.readthedocs.io/en/latest/api.html
Pseudo-class selectors