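PyQuery is another powerful and flexible library for parsing web pages. If you have front-end experience you have almost certainly used jQuery, and in that case PyQuery is an excellent choice: it is a Python library modeled closely on jQuery, and its syntax is almost identical, so there are no unfamiliar methods to memorize.
Official documentation: http://pyquery.readthedocs.io/en/latest/
jQuery reference (in Chinese): http://jquery.cuishifeng.cn/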
1. Initializing from a string
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
2. Opening an HTML file
Make sure the file path is correct.
<title>Title</title>
</head>
<body>
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>'''
</body>
</html>
<head>
<meta charset="UTF-8"/>
<title>Title</title>
</head>
3. Opening a website
4. Finding elements with CSS selectors
<div>
<ul id="haha">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>
<span class="bold">third item</span>
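5. Starting from an element you have already found, you can look up its children or its parent without searching from the top of the document again.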
<ul id="haha">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
<div class="content">
<ul id="haha">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
6. Getting attribute values
<class 'pyquery.pyquery.PyQuery'>
<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html
7. Getting the content of a tag
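# The result is interesting: it collects the text of every matched tag and joins the pieces together, so the output reads like a sentence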
second item third item fourth item fifth item
8. DOM manipulation
1. Adding and removing attributes
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 haha"><a href="link3.html"><span class="bold">third item</span></a></li>
2. attr and css
Note: both operations overwrite the value if it already exists and add it if it does not.
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" id="id_test"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" id="id_test" style="font-size: 20px"><a href="link3.html"><span class="bold">third item</span></a></li>
3. Removing a tag. When scraping, the tags or content you extract often include elements you do not want; PyQuery can delete them from the parsed document.
first item second item third item fourth item fifth item
first item
Reposted from: https://www.cnblogs.com/lei0213/p/7676254.html
Example:
# coding=utf-8
from pyquery import PyQuery as pq  # the pyquery module
# Perform the update
content = '''<p class=MsoNormal><p class=MsoNormal><span lang="EN-US" style='font-family:"Times New Roman","serif"'>
<span lang="EN-US" style='font-family:"Times New Roman","serif"'>
<img height="68" id="图片 1786" src="/alEngin/upload/word/4028803a2c408e3e012c409026b60005/2c2880432c7d6f51012c7e300e9c0021/2c2880432dc13f98012e4c90f56401d6/2c2880432dc13f98012e4c9171c801d7/2c2880432dc13f98012e4c9171e701d8.files/image001.jpg" width="94"/>
</span></span></p>
</p>'''
doc = pq(content)
# span = doc('span')
# doc('p').remove_class('MsoNormal')
# doc('span').css('font-family','').attr('lang','')
# Remove the MsoNormal class from every element whose class attribute is exactly MsoNormal
doc('[class=MsoNormal]').remove_class('MsoNormal')
# Set font-family and lang to empty values on every element in doc
doc('*').css('font-family','').attr('lang','')
print(doc)