Python Web Scraping: Basic Usage of PyQuery

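PyQuery is another powerful and flexible HTML parsing library. If you have front-end development experience you have almost certainly used jQuery, and in that case PyQuery is an excellent choice: it is a Python library modeled closely on jQuery, and its syntax is nearly identical, so there are no unfamiliar methods to memorize.
Official documentation: http://pyquery.readthedocs.io/en/latest/
jQuery reference: http://jquery.cuishifeng.cn/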

 

1. Initializing from a string

from pyquery import PyQuery as pq

html = '''<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
print(doc)
print(type(doc))
print(doc('li'))

Output:

<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>


 

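2. Loading an HTML file

  Note: check the file path. A relative path is resolved against the current working directory.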

from pyquery import PyQuery as pq

doc = pq(filename='index.html')
print(doc)
print(doc('head'))

Output:

    <title>Title</title>
</head>
<body>
    <div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''
</body>
</html>
<head>
    <meta charset="UTF-8"/>
    <title>Title</title>
</head>


 

3. Loading a website

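from pyquery import PyQuery as pq

doc = pq('https://www.baidu.com')  # a string starting with http(s):// is fetched as a URL
# doc1 = pq(url='https://www.baidu.com')  # same thing, with the URL passed explicitly
print(doc)
print(doc('head'))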

  

4. Finding elements with CSS selectors

from pyquery import PyQuery as pq

html = '''<div>
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
print(doc)
# The span inside an a inside an element with class item-0, under the element with id haha
# (each level of the hierarchy is separated by a space)
print(doc('#haha .item-0 a span'))

Output:

<div>
    <ul id="haha">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>
<span class="bold">third item</span>


 

 

5. Once you have found an element, you can look up its children or its parent from there, without searching again from the root.

from pyquery import PyQuery as pq

html = '''<div class='content'>
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
item = doc('div ul')
print(item)
# Starting from an element we have already found, move to its parent or its children
print(item.parent())
print(item.children())

Output:

<ul id="haha">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
<div class="content">
    <ul id="haha">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>
<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>

 

from pyquery import PyQuery as pq

html = '''<div class='content'>
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
item = doc('div ul')
print(item)
# children() looks only at the direct children of ul (the li elements); here we keep
# those that have a class attribute. Replacing [class] with [href] would match nothing,
# because children() covers direct children only, not all descendants.
print(item.children('[class]'))

 

6. Getting attribute values

from pyquery import PyQuery as pq

html = '''<div class='content'>
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
# class="item-0 active" is a single class attribute holding two class names. In a selector
# a space means "descendant", so ".item-0 .active a" would look for an a inside an .active
# element inside an .item-0 element. Join the two class names with a dot instead.
item = doc(".item-0.active a")
print(type(item))
print(item)
# Two ways to read an attribute value
print(item.attr.href)
print(item.attr('href'))

Output:

<class 'pyquery.pyquery.PyQuery'>
<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html

 

7. Getting an element's text content

from pyquery import PyQuery as pq

html = '''<div class='content'>
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
a = doc("a").text()
# Interesting result: text() collects the text of every matched element and joins it
# with spaces into a single string, like one sentence
print(a)

Output:

second item third item fourth item fifth item

 

 

8. DOM manipulation

1. Adding and removing classes

from pyquery import PyQuery as pq

html = '''<div class='content'>
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
li = doc('.item-0.active')
print(li)
# Remove the class "active"
print(li.removeClass('active'))
# Add the class "haha"
print(li.addClass('haha'))

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 haha"><a href="link3.html"><span class="bold">third item</span></a></li>

 

2. attr and css

  Note: these calls update the attribute or style if it already exists, and add it if it does not.

from pyquery import PyQuery as pq

html = '''<div class='content'>
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.attr('id','id_test'))
print(li.css('font-size','20px'))

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         
<li class="item-0 active" id="id_test"><a href="link3.html"><span class="bold">third item</span></a></li>
         
<li class="item-0 active" id="id_test" style="font-size: 20px"><a href="link3.html"><span class="bold">third item</span></a></li>


 

 

3. Removing elements. When scraping, the tags or content you pull down often include elements you do not want; you can remove them with something like the following.

from pyquery import PyQuery as pq

html = '''<div class='content'>
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
data = doc('.content')
print(data.text())
# Remove every a element
data.find('a').remove()
# Print again
print(data.text())

Output:

first item second item third item fourth item fifth item
first item

 

 

Reposted from: https://www.cnblogs.com/lei0213/p/7676254.html

 

 

Example:

# coding=utf-8
from pyquery import PyQuery as pq  # the pyquery module

# Perform the update
content = '''<p class=MsoNormal><p class=MsoNormal><span lang="EN-US" style='font-family:"Times New Roman","serif"'>
<span lang="EN-US" style='font-family:"Times New Roman","serif"'>
<img height="68" id="图片 1786" src="/alEngin/upload/word/4028803a2c408e3e012c409026b60005/2c2880432c7d6f51012c7e300e9c0021/2c2880432dc13f98012e4c90f56401d6/2c2880432dc13f98012e4c9171c801d7/2c2880432dc13f98012e4c9171e701d8.files/image001.jpg" width="94"/>
</span></span></p>
</p>'''
doc = pq(content)
# span = doc('span')
# doc('p').remove_class('MsoNormal')
# doc('span').css('font-family','').attr('lang','')

# Remove the MsoNormal class from every element that has it
doc('[class=MsoNormal]').remove_class('MsoNormal')
# Blank out font-family and lang on every element in doc
doc('*').css('font-family','').attr('lang','')
print(doc)

 
