Getting to Know Web Scraping: pyquery, an Excellent Scraping Tool. Just How Concise Is It?

Python 集中营

Category: python. Tags: python, web scraping, pyquery

Article link: https://blog.csdn.net/chengxuyuan_110/article/details/115473171


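Having covered HTML parsing libraries such as BeautifulSoup and lxml with its XPath syntax, let's now look at pyquery. The name alone looks a lot like jQuery, and pyquery is in fact modeled on jQuery's syntax, so usage is nearly identical. It is a gift to scraper authors with a front-end background: if you already know some jQuery, pyquery will feel very easy to use.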

1. Install and import the pyquery library

pip install -i https://pypi.mirrors.ustc.edu.cn/simple/ pyquery

# -*- coding: UTF-8 -*-

# Import the PyQuery class (the examples below refer to it by this name)
from pyquery import PyQuery

2. Sending web requests with pyquery (rarely used)

'''
A PyQuery object can send the web request itself and parse the response
'''

# GET request
print(PyQuery(url='http://www.baidu.com/', data={}, headers={'user-agent': 'pyquery'}, method='get'))

# POST request
print(PyQuery(url='http://httpbin.org/post', data={'name': u"Python 集中营"}, headers={'user-agent': 'pyquery'}, method='post', verify=True))

3. Parsing page source with pyquery (commonly used)

Initializing the parser object

# Start from page source that the page downloader has already fetched
# Here we simply reuse the example from the official documentation
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Initialize the parser object
pyquery_obj = PyQuery(html_doc)

Extracting elements and their text with CSS selectors

# Get the a elements and their text
print(pyquery_obj('a'))
print(pyquery_obj('a').text())

# Get the elements with class=story and their text
print(pyquery_obj('.story'))
print(pyquery_obj('.story').text())

# Get the element with id=link3 and its text
print(pyquery_obj('#link3'))
print(pyquery_obj('#link3').text())

# Get the p elements under body and their text
print(pyquery_obj('body p'))
print(pyquery_obj('body p').text())

# Get all p and a elements and their text
print(pyquery_obj('p,a'))
print(pyquery_obj('p,a').text())

# Get the elements whose class attribute equals 'story' and their text
print(pyquery_obj("[class='story']"))
print(pyquery_obj("[class='story']").text())

Extracting further information from the selected elements

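# Extract text, HTML, relatives and attributes from the selection
print("......further extraction......")
print("Text of all a elements", pyquery_obj('a').text())
print("Inner HTML of the first a element", pyquery_obj('a').html())
print("Parent element of the a elements", pyquery_obj('a').parent())
print("Child elements of the a elements", pyquery_obj('a').children())
print("The a element whose id is link3", pyquery_obj('a').filter('#link3'))
# attr reads from the first matched element, not the last
print("href attribute of the first a element", pyquery_obj('a').attr.href)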

DOM manipulation

# Get an attribute value with the attr() function
print(pyquery_obj('a').filter('#link3').attr('href'))
# Get an attribute value with attr.<name>
print(pyquery_obj('a').filter('#link3').attr.href)
print(pyquery_obj('a').filter('#link3').attr.class_)
# Add the value w to the class attribute
pyquery_obj('a').filter('#link3').add_class('w')
print(pyquery_obj('a').filter('#link3').attr('class'))

# Remove the value sister from the class attribute
pyquery_obj('a').filter('#link3').remove_class('sister')
print(pyquery_obj('a').filter('#link3').attr('class'))
# Remove all a tags from the document
pyquery_obj('html').remove('a')
print(pyquery_obj)
