Common PyQuery Usage for Python Web Scraping

Latest recommended article published on 2022-07-17 10:31:12

一口木桶饭

Article link: https://blog.csdn.net/weixin_44415928/article/details/104272051

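Installation is simple as always: pip install pyquery. For installing it in PyCharm, see the article "pycharm安装第三方库" (installing third-party libraries in PyCharm).
First, we define an HTML snippet that serves as the example for everything below.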

<html>
	<head>
		<title>this is a title</title>
	</head>
	<body>
		<p class="first" name="first">this is a p label</p>
		<p class="second" name="second"><b>this is a p label, too</b></p>
		<p class="third" name="third">also, a p label</p>
		<a href="http://www.baidu.com" class="first" id="one">this is an a label</a>
		<a href="http://www.baidu.com" class="second" id="two">this is an a label, too</a>
		<a href="http://www.baidu.com" class="third" id="three">an a label also</a>
	</body>
</html>

1. Initialization

1. String initialization

from pyquery import PyQuery as pq   # import the third-party library first

# define the test string
text = '''<html>   
<head>
    <title>this is a title</title>
</head>
<body>
    <p class="first" name="first">this is a p label</p>
    <p class="second" name="second"><b>this is a p label, too</b></p>
    <p class="third" name="third">also, a p label</p>
    <a href="http://www.baidu.com" class="first" id="one">this is an a label</a>
    <a href="http://www.baidu.com" class="second" id="two">this is an a label, too</a>
    <a href="http://www.baidu.com" class="third" id="three">an a label also</a>
</body>
</html>
'''

res = pq(text)
print(res('p'))
(Output)<p class="first" name="first">this is a p label</p>
    <p class="second" name="second"><b>this is a p label, too</b></p>
    <p class="third" name="third">also, a p label</p>

'''All the p tags are printed'''
print(type(res('p')))
(Output)<class 'pyquery.pyquery.PyQuery'>
'''The result is a PyQuery object rather than a plain list; the generator returned by items() lets us iterate over the matches one by one'''
for r in res('p').items():
	print(r)
(Output)
<p class="first" name="first">this is a p label</p>
    
<p class="second" name="second"><b>this is a p label, too</b></p>
    
<p class="third" name="third">also, a p label</p>
'''Each p tag is printed in turn'''

2. URL initialization

from pyquery import PyQuery as pq

res = pq(url='http://www.baidu.com')
print(res('head'))
(Output)<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下，你就知道</title></head>

'''The fairly long head of the Baidu home page is printed:
pq requests the page automatically and prints its head tag'''

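3. File initialization

File initialization is much the same as string initialization, except that the string text is replaced by a file to read, as follows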

from pyquery import PyQuery as pq

res = pq(filename="index.html")
print(res('head'))

2. Basic CSS selectors

from pyquery import PyQuery as pq
res = pq(text)  # text is the string defined above
print(res("body p b"))  # selectors can be nested
(Output)<b>this is a p label, too</b>
'''Only the b tag is printed'''

1. Finding child elements

temp = pq(text)("body")
child_list = temp.children()
print(child_list)
(Output)<p class="first" name="first">this is a p label</p>
    <p class="second" name="second"><b>this is a p label, too</b></p>
    <p class="third" name="third">also, a p label</p>
    <a href="http://www.baidu.com" class="first" id="one">this is an a label</a>
    <a href="http://www.baidu.com" class="second" id="two">this is an a label, too</a>
    <a href="http://www.baidu.com" class="third" id="three">an a label also</a>

'''Only the direct children of body are returned; the b tag is not a child in its own right and appears only inside its parent p'''

'''You can also filter for specific child tags'''
child_list = temp.children('a')
print(child_list)
(Output)
<a href="http://www.baidu.com" class="first" id="one">this is an a label</a>
    <a href="http://www.baidu.com" class="second" id="two">this is an a label, too</a>
    <a href="http://www.baidu.com" class="third" id="three">an a label also</a>
'''Only the children that are a tags are printed'''

2. Finding parent elements

temp = pq(text)('b')
print(temp.parent())
(Output)<p class="second" name="second"><b>this is a p label, too</b></p>
'''The direct parent of the b tag is printed'''
print(temp.parents())
(Output)
<html>
<head>
    <title>this is a title</title>
</head>
<body>
    <p class="first" name="first">this is a p label</p>
    <p class="second" name="second"><b>this is a p label, too</b></p>
    <p class="third" name="third">also, a p label</p>
    <a href="http://www.baidu.com" class="first" id="one">this is an a label</a>
    <a href="http://www.baidu.com" class="second" id="two">this is an a label, too</a>
    <a href="http://www.baidu.com" class="third" id="three">an a label also</a>
</body>
</html><body>
    <p class="first" name="first">this is a p label</p>
    <p class="second" name="second"><b>this is a p label, too</b></p>
    <p class="third" name="third">also, a p label</p>
    <a href="http://www.baidu.com" class="first" id="one">this is an a label</a>
    <a href="http://www.baidu.com" class="second" id="two">this is an a label, too</a>
    <a href="http://www.baidu.com" class="third" id="three">an a label also</a>
</body>
<p class="second" name="second"><b>this is a p label, too</b></p>

'''All the ancestors are printed: html, body, and p'''

'''siblings() returns the sibling elements of the selected node'''
temp = pq(text)('a#one')
print(temp.siblings())
(Output)<p class="third" name="third">also, a p label</p>
    <p class="second" name="second"><b>this is a p label, too</b></p>
    <p class="first" name="first">this is a p label</p>
    <a href="http://www.baidu.com" class="second" id="two">this is an a label, too</a>
    <a href="http://www.baidu.com" class="third" id="three">an a label also</a>

'''Note the selector syntax: when two qualifiers describe the same tag, they are written together with no space.
Here the a tag and id=one belong to one tag, so the selector is a#one; for a nested relationship, put a space between them
'''

3. Traversing elements

from pyquery import PyQuery as pq
temp = pq(text)('a')
print(type(temp))
print(temp)
(Output)<class 'pyquery.pyquery.PyQuery'>
<a href="http://www.baidu.com" class="first" id="one">this is an a label</a>
    <a href="http://www.baidu.com" class="second" id="two">this is an a label, too</a>
    <a href="http://www.baidu.com" class="third" id="three">an a label also</a>
'''The type is PyQuery and the printed result is not a list, so how do we iterate over it?'''

res = temp.items()
print(res)
(Output)<generator object PyQuery.items at 0x00000283AF353348>
'''items() returns a generator,
which we can loop over to get each match
'''
for r in res:
	print(r)
(Output)
<a href="http://www.baidu.com" class="first" id="one">this is an a label</a>
    
<a href="http://www.baidu.com" class="second" id="two">this is an a label, too</a>
    
<a href="http://www.baidu.com" class="third" id="three">an a label also</a>
'''Every element is printed this way'''

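4. Getting attributes and content

temp = pq(text)('body p.first')  # the p tag with class "first" inside body
print(temp.attr('name'))
(Output)first
'''The value of the p tag's name attribute is printed'''
print(temp.attr.name)
(Output)first  # same effect

print(temp.text())
(Output)this is a p label
'''text() prints the text inside the tag'''

temp = pq(text)('body p.second')  # select the second p tag; note that it contains a b tag
print(temp.text())
(Output)this is a p label, too
'''text() returns only the text and drops the b markup; to get the b tag itself, use html()'''
print(temp.html())
(Output)<b>this is a p label, too</b>
'''html() prints the inner HTML, including the b tag'''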

3. DOM manipulation

DOM manipulation means modifying nodes dynamically, as follows

temp = pq(text)('body a#one')
print(temp)
temp.remove_class('first')
print(temp)
temp.add_class("first")
print(temp)
(Output)
<a href="http://www.baidu.com" class="first" id="one">this is an a label</a>
    
<a href="http://www.baidu.com" class="" id="one">this is an a label</a>
    
<a href="http://www.baidu.com" class="first" id="one">this is an a label</a>

'''remove_class removes a class and add_class adds one.
Note that remove_class leaves an empty class attribute behind rather than deleting the attribute.
To change other attributes, use attr, as follows
'''
print(temp)
temp.attr('id', 'test')
print(temp)
(Output)
<a href="http://www.baidu.com" class="first" id="one">this is an a label</a>
    
<a href="http://www.baidu.com" class="first" id="test">this is an a label</a>
'''The id is changed; the css method can likewise add CSS properties'''
temp.css('font-size', '14px')
print(temp)
(Output)
<a href="http://www.baidu.com" class="first" id="test" style="font-size: 14px">this is an a label</a>
'''A style attribute has been added'''

'''Now let's create some new content'''
s = '''
<body>
	<p>
    	this is a test
    	<b>a test, too</b>
    </p>
</body>
'''
temp = pq(s)('p')
print(temp.text())
(Output)this is a test a test, too
'''The text inside the b tag is printed as well. If we don't want it,
we can remove the tag with remove, as follows
'''
temp.find('b').remove()
print(temp.text())
(Output)this is a test  # the b tag and its text are gone

4. Pseudo-class selectors

temp = pq(text)
res = temp("p:first-child")
print(res)
(Output)<p class="first" name="first">this is a p label</p>
'''p:first-child selects a p tag that is the first child of its parent, so the first p tag is selected'''

res = temp("a:last-child")
print(res)
(Output)<a href="http://www.baidu.com" class="third" id="three">an a label also</a>
'''The last a tag is selected. Note that
p:last-child and a:first-child return nothing, because the p and a tags are siblings under body:
the first child is a p tag and the last child is an a tag
'''

res = temp("p:nth-child(1)")
print(res)
(Output)<p class="first" name="first">this is a p label</p>
'''The first p tag is printed; note that the index starts at 1'''

res = temp("a:nth-child(1)")
print(res)  # no result: p and a are siblings and are counted together
print(temp("a:nth-child(4)"))
(Output)<a href="http://www.baidu.com" class="first" id="one">this is an a label</a>
'''Only now is the first a tag printed: nth-child counts every sibling, so the first a tag is the fourth child of body'''

That covers the basic usage of pyquery. For BeautifulSoup, see the article "python爬虫之BeautifulSoup的简单用法" (simple BeautifulSoup usage for Python web scraping).
