PyQuery Study Notes

pyquery

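pyquery lets you run jQuery-style queries on XML documents. Its API is kept as close to jQuery as possible, and it uses lxml for fast XML and HTML manipulation.

pyquery is a powerful and flexible HTML parsing library for Python. If regular expressions feel too tedious to write and the BeautifulSoup syntax is hard to remember, and you already know jQuery syntax, pyquery is an excellent choice.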

Initialization

String initialization (with a tag selector)

html = '''
<div>
    <ul>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/001004">同城互助</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''

from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))
<li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/001004">同城互助</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>

URL initialization

from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
print(doc('head'))
<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head> 

File initialization (filename)

from pyquery import PyQuery as pq
doc = pq(filename='hello.html')

print(doc)
print('---' * 10)
print(doc('li'))
<div>
    <ul>
        <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">dfgdd</a></li>
        <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">gdsfeew</a></li>
        <li class="Sq_leftNav_forum1"><a href="/shuo/forum/001004">kuikuik</a></li>
        <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">qe23rw</a></li>
        <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">fgdfggb</a></li>
    </ul>
</div>
------------------------------
<li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">dfgdd</a></li>
        <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">gdsfeew</a></li>
        <li class="Sq_leftNav_forum1"><a href="/shuo/forum/001004">kuikuik</a></li>
        <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">qe23rw</a></li>
        <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">fgdfggb</a></li>

A pyquery selector always returns every matching element. This differs from bs4, where find() returns only the first match.

Basic CSS selectors

html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/001004">同城互助</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)

print(doc('li'))
print(doc('#container .list li'))
print(doc('#container'))
<li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/001004">同城互助</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                
<li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/001004">同城互助</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                
<div id="container">
    <ul class="list">
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/001004">同城互助</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>

Finding elements

Child elements

html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)

items = doc('.list li')
# print(items)

lis = doc.find('li')
# print(lis)

dfg = lis('span')
print(dfg)
<span class="bold">同城互助</span>
items = doc('.list')
print(items)
lis = items.children()
print(lis)
<ul class="list">
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>

<li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
lis = items.children('.active')
print(lis)
<li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>

Parent element

html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery
doc = PyQuery(html)

items = doc('.list')
print(items)
container = items.parent()
print(container)
<ul class="list">
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>

<div id="container">
    <ul class="list">
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>

Sibling elements

html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery
doc = PyQuery(html)

li = doc('.list .Sq_leftNav_forum1.active')
print(li)
print(li.siblings())  # get the sibling elements
<li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    
<li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>

Traversal

Single element

html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery
doc = PyQuery(html)

li = doc('.Sq_leftNav_forum1.active')  # 'Sq_leftNav_forum1 active' (with a space) will not work; chain both classes with dots
print(li)
<li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery
doc = PyQuery(html)

lis = doc('li')
# print(lis)
# for li in lis:
#     print(li)


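# Call .items() to iterate: it yields PyQuery objects. Iterating directly yields raw lxml elements, which print as memory addresses.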
lis = doc('li').items()
# print(lis)
for li in lis:
    print(li)
<li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    
<li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    
<li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    
<li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    
<li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>

Getting information

attr: getting attributes

This matters: you need an image's link before you can download and save its binary data.

html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery
doc = PyQuery(html)
a = doc('.Sq_leftNav_forum1.active a')
# print(a)
print(a.attr('href'))
/shuo/forum/001004
# Alternative syntax
print(a.attr.href)
/shuo/forum/001004

text(): getting text

html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery
doc = PyQuery(html)
a = doc('.Sq_leftNav_forum1.active a')
print(a)
<a href="/shuo/forum/001004"><span class="bold">同城互助</span></a>
print(a.text())
同城互助

DOM manipulation

addClass, removeClass

Adding or removing classes helps us quickly find or filter data.

html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery
doc = PyQuery(html)
li = doc('.Sq_leftNav_forum1.active')
print(li)
<li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
li.removeClass('active')
print(li)
<li class="Sq_leftNav_forum1"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
li.addClass('active')
print(li)
<li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>

attr, css

html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery
doc = PyQuery(html)
li = doc('.Sq_leftNav_forum1.active')

li.attr('name', 'link')
print(li)

li.css('font-size', '14px')
print(li)
<li class="Sq_leftNav_forum1 active" name="link"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    
<li class="Sq_leftNav_forum1 active" name="link" style="font-size: 14px"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>

remove

Removes the tag together with its content.

html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery
doc = PyQuery(html)

wrap = doc('.list')
# print(wrap)
# print(wrap.text())
wrap.find('a').remove()
print(wrap)
<ul class="list">
                    <li class="Sq_leftNav_forum1"/>
                    <li class="Sq_leftNav_forum2"/>
                    <li class="Sq_leftNav_forum1 active"/>
                    <li class="Sq_leftNav_forum2"/>
                    <li class="Sq_leftNav_forum1"/>
                </ul>

Other DOM methods

http://pyquery.readthedocs.io/en/latest/api.html

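Pseudo-class selectors

pyquery supports jQuery-style selectors, including pseudo-classes such as :last, :eq, :gt, :lt, and :contains.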

html = '''
<div id='container'>
    <ul class='list'>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00B002">找对象</a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/001002">新鲜事</a></li>
                    <li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    <li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                </ul>
</div>
'''
from pyquery import PyQuery
doc = PyQuery(html)

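li = doc('li:last')  # the last li among the matches
print(li)

li = doc('li:last-child')  # an li that is the last child of its parent
print(li)

li = doc('li:nth-child(3)')  # nth-child counts from 1: the third child
print(li)

li = doc('li:gt(2)')  # filter by index, counting from 0; gt means greater than, lt means less than
print(li)

li = doc('li:eq(4)')  # eq means equal to the given index
print(li)

li = doc('li:contains(虞城)')  # contains matches elements containing the given text; mainly used to filter by content
print(li)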
print(li)
<li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                
<li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                
<li class="Sq_leftNav_forum1 active"><a href="/shuo/forum/001004"><span class="bold">同城互助</span></a></li>
                    
<li class="Sq_leftNav_forum2"><a href="/shuo/forum/007005">同城活动</a></li>
                    <li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                
<li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>
                
<li class="Sq_leftNav_forum1"><a href="/shuo/forum/00D001">虞城有爱</a></li>