Python web scraping from scratch (5): the pyQuery library

Latest recommended article published on 2024-05-11 22:30:00

weixin_39583162

Article tags: python, pyquery

Article link: https://blog.csdn.net/weixin_39583162/article/details/111856842


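What is pyQuery:

A powerful and flexible web page parsing library. If regular expressions feel too tedious to write (I can't write them myself), if BeautifulSoup's syntax is too hard to remember, and if you already know jQuery's syntax, then PyQuery is your best choice.

Installing pyQuery: running pip3 install pyquery is all it takes.

Basic usage of pyQuery:

Initialization:

Initializing from a string:

#!/usr/bin/env python
# -*- coding: utf-8 -*-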

html= """

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters;and thier names were

Lacie and

Title; and they lived at the boottom of a well.

...

"""

from pyquery import PyQuery as pq

doc = pq(html)
print(doc('a'))

Initializing from a URL:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# URL initialization

from pyquery import PyQuery as pq

doc = pq('http://www.baidu.com')
print(doc('input'))

Initializing from a file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# File initialization

from pyquery import PyQuery as pq

doc = pq(filename='baidu.html')
print(doc('title'))

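Selection works the same way as in jQuery: selecting by id, by tag name, and by class all behave the same, and so do many other selectors.

Basic CSS selectors:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# CSS selectors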

html= """

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters;and thier names were

Lacie and

Title; and they lived at the boottom of a well.

...

"""

from pyquery import PyQuery as pq

doc = pq(html)
print(doc('.title'))

Finding elements:

Child elements:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Child elements (find)

html= """

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters;and thier names were

Lacie and

Title; and they lived at the boottom of a well.

...

"""

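from pyquery import PyQuery as pq

doc = pq(html)

items = doc('.title')
print(type(items))
print(items)

p = items.find('b')
print(type(p))
print(p)

This code selects the elements whose class is title. Two elements carry that class: a p tag and an a tag. The find method is then used to pick out the b tag we need.

Note that find searches for matching tags inside each of the selected elements.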

children:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Child elements (children)

html= """

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters;and thier names were

Lacie and

Title; and they lived at the boottom of a well.

...

"""

from pyquery import PyQuery as pq

doc = pq(html)

items = doc('.title')
print(items.children())

A selector can also be passed to children() as a filter:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Child elements (children with a selector)

html= """

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters;and thier names were

Lacie and

Title; and they lived at the boottom of a well.

...

"""

from pyquery import PyQuery as pq

doc = pq(html)

items = doc('.title')
print(items.children('b'))

The output is the same as above.

Parent elements:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Parent element

html= """

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters;and thier names were

Lacie and

Title; and they lived at the boottom of a well.

...

"""

from pyquery import PyQuery as pq

doc = pq(html)

items = doc('#link1')
print(items)
print(items.parent())

Only the direct parent is returned here. The parents method returns all ancestor elements instead:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Ancestor elements

html= """

The Dormouse's story

Once upo a time were three little sister;and theru name were

Elsie

Lacie

and

Title

...

"""

from pyquery import PyQuery as pq

doc = pq(html)

items = doc('#link1')
print(items)
print(items.parents('body'))

Sibling elements:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sibling elements

html= """

The Dormouse's story

Once upo a time were three little sister;and theru name were

Elsie

Lacie

and

Title

...

"""

from pyquery import PyQuery as pq

doc = pq(html)

items = doc('#link1')
print(items)
print(items.siblings('#link2'))

That covers the methods for finding elements. Next, let's look at how to iterate over elements.

Iteration

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Iteration

html= """

The Dormouse's story

Once upo a time were three little sister;and theru name were

Elsie

Lacie

and

Title

...

"""

from pyquery import PyQuery as pq

doc = pq(html)

items = doc('a')
for k, v in enumerate(items.items()):
    print(k, v)

Getting information:

Getting attributes:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Getting attributes

html= """

The Dormouse's story

Once upo a time were three little sister;and theru name were

Elsie

Lacie

and

Title

...

"""

from pyquery import PyQuery as pq

doc = pq(html)

items = doc('a')
print(items)
print(items.attr('href'))
print(items.attr.href)

Getting text:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Getting text

html= """

The Dormouse's story

Once upo a time were three little sister;and theru name were

Elsie

Lacie

and

Title

...

"""

from pyquery import PyQuery as pq

doc = pq(html)

items = doc('a')
print(items)
print(items.text())
print(type(items.text()))

Getting HTML:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Getting HTML

html= """

The Dormouse's story

Once upo a time were three little sister;and theru name were

Elsie

Lacie

and

Title

...

"""

from pyquery import PyQuery as pq

doc = pq(html)

items = doc('a')
print(items.html())

DOM manipulation:

addClass, removeClass

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: addClass, removeClass

html= """

The Dormouse's story

Once upo a time were three little sister;and theru name were

Elsie

Lacie

and

Title

...

"""

from pyquery import PyQuery as pq

doc = pq(html)

items = doc('#link2')
print(items)

items.addClass('addStyle')  # same as add_class
print(items)

items.remove_class('sister')  # same as removeClass
print(items)

attr, css:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: attr, css

html= """

The Dormouse's story

Once upo a time were three little sister;and theru name were

Elsie

Lacie

and

Title

...

"""

from pyquery import PyQuery as pq

doc = pq(html)

items = doc('#link2')

items.attr('name', 'addname')
print(items)

items.css('width', '100px')
print(items)

This sets a new attribute; if the attribute already exists, its previous value is overwritten.

remove:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: remove

html= """

Hello World

This is a paragraph.

"""

from pyquery import PyQuery as pq

doc = pq(html)

wrap = doc('.wrap')
print(wrap.text())

wrap.find('p').remove()
print("Data after remove")
print(wrap)

Pseudo-class selectors:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Pseudo-class selectors

html= """

The Dormouse's story

Once upo a time were three little sister;and theru name were

Elsie

Lacie

and

Title

...

"""

from pyquery import PyQuery as pq

doc = pq(html)
# print(doc)

wrap = doc('a:first-child')  # the first tag
print(wrap)

wrap = doc('a:last-child')  # the last tag
print(wrap)

wrap = doc('a:nth-child(2)')  # the second tag
print(wrap)

wrap = doc('a:gt(2)')  # tags with an index greater than 2; indices start at 0, not 1
print(wrap)

wrap = doc('a:nth-child(2n)')  # tags at positions that are multiples of 2
print(wrap)

wrap = doc('a:contains(Lacie)')  # tags containing the text Lacie
print(wrap)

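Not every selector is listed here. To learn more about CSS selectors, see the W3School CSS tutorial: http://www.w3school.com.cn/css/index.asp

That roughly covers how to use pyQuery. For more detail, read the official documentation: https://pyquery.readthedocs.io/en/latest/

Thanks for reading. If anything here is incorrect, corrections are welcome. Thank you 🙏.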
