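PyQuery lets you run jQuery-style queries against XML documents. Its API is kept as close to jQuery's as possible, and it relies on lxml for fast XML and HTML manipulation.
1. Introduction to PyQuery
(1) A PyQuery object can be initialized from a string, from a URL, or from a file: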
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30
# import requests
from pyquery import PyQuery as pq
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc('div'))
# Initialize from a URL
# Roughly equivalent to: url_doc = pq(requests.get('http://news.baidu.com/').text)
url_doc = pq(url='http://news.baidu.com/')
print(url_doc('title'))
# Initialize from a file
txt_doc = pq(filename='test.html')
print(txt_doc('title'))
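(2) CSS selectors: https://www.w3school.com.cn/cssref/css_selectors.asp
In CSS, a selector is a pattern that picks out the elements to be styled. The example below uses the descendant selector '.subject-item div' to select every div nested inside the node in html_doc whose class is "subject-item".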
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30
from pyquery import PyQuery as pq
import requests
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc('.subject-item div'))
Output:
<div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div>
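(3) Finding nodes
- Child nodes: call find() with a CSS selector to search all descendants of the current node (below, the img nodes under .nbg); use children() to restrict the match to direct children.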
print(string_doc('.nbg').find('img'))
print(type(string_doc('.nbg').find('img')))
# Output:
<img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/>
<class 'pyquery.pyquery.PyQuery'>
# ------------------------------------------
print(string_doc('.cart-actions').children())
# Output:
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span>
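- Parent nodes: parent() returns the direct parent and parents() returns all ancestors; both accept a CSS selector. Below, .buy-info selects the node whose class is buy-info, parent() returns its direct parent, and parents() returns every ancestor. To keep only a particular ancestor, pass a CSS selector to parents(), as in the last call, which keeps the ancestor whose class is cart-actions.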
print(string_doc('.buy-info').parent())
# Output:
<div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div>
print(string_doc('.buy-info').parents())
# Output:
<html><body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li></body></html><body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li></body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li><div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div>
print(string_doc('.buy-info').parents('.cart-actions'))
# Output:
<div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div>
- Sibling nodes: siblings() returns the sibling nodes; it also accepts a CSS selector to keep only the matching siblings.
print(string_doc('.cart-actions').siblings())
# Output:
<div class="collect-info"> </div>
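(4) Traversal: a result that contains several nodes is traversed with items(), which returns a generator of PyQuery objects. Below, string_doc('span').items() iterates over all span elements.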
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30
from pyquery import PyQuery as pq
import requests
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
spans = string_doc('span').items()
# print(spans)
print(type(spans))
for span in spans:
    print(span)
Output:
<class 'generator'>
<span class="allstar45"/>
<span class="rating_nums">9.0</span>
<span class="pl"> (
561845人评价) </span>
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span>
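(5) Getting attributes: call attr() to read an attribute. When the selection contains several nodes, attr() returns the attribute of the first node only, so loop over items() to read it from every node.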
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30
from pyquery import PyQuery as pq
import requests
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
a = string_doc('div')
# print(a, type(a))
for item in a.items():
    print(item.attr('class'))
Output:
pic
info
pub
star clearfix
ft
collect-info
cart-actions
(6) Getting text: call text(). No traversal is needed; it returns the text inside all selected nodes:
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30
from pyquery import PyQuery as pq
import requests
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc.text())
Output:
小王子
[法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元
9.0 ( 561845人评价)
小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...
纸质版47.30元起
You can also use a CSS selector to print the text of specific nodes only:
a = string_doc('span')
print(a.text())
Output: 9.0 ( 561845人评价) 纸质版47.30元起
(7) Node manipulation: add_class() (alias addClass()) adds a class to a node, and remove_class() (alias removeClass()) removes one.
# Initialize from a string
string_doc = pq(html_doc)
span = string_doc('.buy-info')
print(span)
span_mv = span.remove_class('buy-info')
print(span_mv)
span_add = span_mv.add_class('buy-info')
print(span_add)
# Output:
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span>
<span class=""> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span>
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span>
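attr(), text(), and html() can also set an attribute value, the text content, and the inner HTML:
# Initialize from a string
string_doc = pq(html_doc)
span = string_doc('.buy-info')
# Change the class attribute from buy-info to price
print(span.attr('class', 'price'))
# Replace the text inside the span with "价格:47.30"
print(span.text('价格:47.30'))
# Replace the inner HTML of the span with "<a>价格:47.30</a>"
print(span.html('<a>价格:47.30</a>'))
# Output: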
<span class="price"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span>
<span class="price">价格:47.30</span>
<span class="price"><a>价格:47.30</a></span>
For other methods and their usage, see https://pyquery.readthedocs.io/en/latest/api.html
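(8) Pseudo-class selectors: for the CSS3 pseudo-class selectors see https://www.w3school.com.cn/css/css_pseudo_classes.asp; the list at https://www.runoob.com/css/css-pseudo-classes.html is somewhat more detailed.
- Pseudo-class syntax: selector:pseudo-class {property: value}
- A CSS class combined with a pseudo-class: selector.class:pseudo-class {property: value}
- first-child: matches an element only if it is the first child of its parent.
- last-child: matches an element that is the last child of its parent.
- only-child: matches an element that is the only child of its parent.
- nth-child(n): matches an element that is the n-th child of its parent.
- nth-last-child(n): matches an element that is the n-th child of its parent, counting from the last child.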
# Initialize from a string
string_doc = pq(html_doc)
div1 = string_doc('div:first-child')
print(div1)
div2 = string_doc('div:last-child')
print(div2)
div3 = string_doc('div:only-child')
print(div3)
div4 = string_doc('div:nth-child(3)')
print(div4)
div5 = string_doc('div:nth-last-child(4)')
print(div5)
# Output:
<div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="collect-info"> </div>
<div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div>
<div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div>
<div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div>
2. A simple PyQuery example
Create the table tb_movie_comments to store the scraped comments:
CREATE TABLE `tb_movie_comments` (
    `cid` int(11) NOT NULL AUTO_INCREMENT COMMENT 'ID',
    `commentator` varchar(100) DEFAULT NULL COMMENT 'Commenter',
    `comments` varchar(2000) DEFAULT NULL COMMENT 'Comment text',
    `votes` varchar(20) DEFAULT NULL COMMENT 'Number of upvotes',
    `createdate` datetime DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation time',
    `ctype` char(2) DEFAULT NULL COMMENT 'Comment type: h = positive, m = neutral, l = negative',
    PRIMARY KEY (`cid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
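Before pointing a scraper at this table, it helps to try the insert logic offline. The sketch below is a stand-in, not the MySQL setup used in this post: it swaps in Python's built-in sqlite3 with an in-memory database and a simplified copy of the table, and the two sample rows are invented. Note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

# Invented sample rows: (commentator, comments, votes, createdate, ctype)
rows = [
    ('user_a', 'Great film', '120', '2020-05-15 10:00:00', 'h'),
    ('user_b', 'Not for me', '3', '2020-05-15 11:30:00', 'l'),
]

conn = sqlite3.connect(':memory:')  # stand-in for the MySQL connection
conn.execute(
    'CREATE TABLE tb_movie_comments ('
    'cid INTEGER PRIMARY KEY AUTOINCREMENT, commentator TEXT, comments TEXT, '
    'votes TEXT, createdate TEXT, ctype TEXT)'
)
try:
    # "with conn" commits if the block succeeds and rolls back if it raises
    with conn:
        conn.executemany(
            'INSERT INTO tb_movie_comments '
            '(commentator, comments, votes, createdate, ctype) VALUES (?, ?, ?, ?, ?)',
            rows,
        )
    count = conn.execute('SELECT COUNT(*) FROM tb_movie_comments').fetchone()[0]
    types = [r[0] for r in conn.execute('SELECT ctype FROM tb_movie_comments ORDER BY cid')]
finally:
    conn.close()

print(count, types)
```

Passing the values as parameters, rather than formatting them into the SQL string, keeps quotes inside a comment from breaking the statement.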
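The script below scrapes the short reviews of the film 《少年的你》 (Better Days). The plan was to store them in the database and then run some big-data analysis to extract something valuable, ideally with a flashy word cloud or similar. In reality the anti-scraping mechanism brought me back down to earth: it was game over after ten pages. Still, as a keepsake of my first time using PyQuery, here it is: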
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_comment.py
# @Project: Python Notes
# @CreateTime : 2020/5/15 14:52:37
import urllib
from pyquery import PyQuery as pq
import requests
import pymysql
import random
import time
def login(url):
    user_agents = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/81.0.4044.122 Safari/537.36']
    # For the header values, open the URL in the browser, press F12, and look at the Network tab
    headers = {
        'Cookie': 'gr_user_id=36d2fea0-91b3-4445-b0c4-2f1eec5e681e; bid=3VpjSZO1pLI; douban-fav-remind=1; '
                  '__yadk_uid=AwRZnSg2z94qiZ0ziZx8rRTJx0GARPvJ; '
                  'trc_cookie_storage=taboola%2520global%253Auser-id%3D54ee53eb-ce52-4f1e-b503-f2b4ba820774'
                  '-tuct2359b57; __gads=ID=953ce3860eb89d60:T=1571272451:S=ALNI_MYayAKeBBq7vr_NBvFfsaRTVepXaw; '
                  '_vwo_uuid_v2=D2CFD349D628C78D38815D8765A3EB401|d8942a02c6249450bd209b499e64d81c; ll="118297"; '
                  'douban-profile-remind=1; _ga=GA1.2.2128425525.1488504434; push_doumail_num=0; push_noty_num=0; '
                  '__utmv=30149280.19762; ct=y; __utmc=30149280; '
                  '_pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1589252865%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl'
                  '%3DjxWgT7kJtprsF-uyr7ziX2Rid2J_n9ZVC9_Qu-JHCj9InQNIG3Ew5bcMZK8paZow%26wd%3D%26eqid'
                  '%3Dae210c5000009ced000000065eba12fc%22%5D; '
                  '_pk_id.100001.8cb4=a9140f060c7b64ae.1488504433.95.1589252865.1589247679.; '
                  'viewed="25811418_25904568_4849666_27069880_27608412_2086633_11535042_33413575_34430051_1469051"; '
                  'dbcl2="77249558:xmnxDXaS+r8"; ck=h_ZU; '
                  '__utma=30149280.2128425525.1488504434.1589768146.1589771266.143; '
                  '__utmz=30149280.1589771266.143.73.utmcsr=accounts.douban.com|utmccn=('
                  'referral)|utmcmd=referral|utmcct=/passport/login',
        'User-Agent': random.choice(user_agents),
        'Referer': 'https://accounts.douban.com/passport/login',
        'Connection': 'keep-alive'
    }
    # Send the request with the headers of the logged-in browser session
    req = requests.get(url, headers=headers)
    return req
# Fetch one page of comments; ctype is the comment type and page is the zero-based page number
def comment(ctype, page):
    num = page * 20
    url = 'https://movie.douban.com/subject/30166972/comments?start=' + str(num) + '&limit=20&sort=new_score' \
          '&status=P&percent_type=' + ctype
    html = login(url)
    html_doc = pq(html.text)
    data_all = html_doc('.comment-item').items()
    for data in data_all:
        commentator = data('.comment-info a').text()
        comments = data('.short').text()
        votes = data('.votes').text()
        createdate = data('.comment-time').text()
        # print(commentator)
        # Store the row in the database (one connection per comment)
        db = pymysql.connect(host='192.183.3.***', port=3306, user='nn',
                             password='******', database='nntest', charset='utf8mb4')
        cur = db.cursor()
        sql = 'INSERT INTO tb_movie_comments(commentator, comments, votes, createdate, ctype) ' \
              'VALUES(%s, %s, %s, %s, %s)'
        try:
            cur.execute(sql, (commentator, comments, votes, createdate, ctype))
            db.commit()
            print('Insert Successful!')
        except Exception:
            print('Sorry,Failed!')
            db.rollback()
        cur.close()
        db.close()
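# To scrape in bulk and store everything in the database:
ctypes = ['h', 'm', 'l']
for ctype in ctypes:
    # Because of the anti-scraping limits, 10 pages are enough; page starts at 0
    for page in range(0, 10, 1):
        try:
            comment(ctype, page)
            print(ctype + ': page ' + str(page) + ' scraped and stored in the database')
        except Exception:
            print(ctype + ': page ' + str(page) + ' failed to scrape or store')
        time.sleep(10)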
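Whoever holds the data rules the world. One last thing, important enough to say three times: scrape with a throwaway account! A throwaway account! A throwaway account!!! If you don't court disaster, you won't meet it, and a thief should not work in broad daylight. The price this little data thief paid, one that no rollback can undo, is shown below (this is my real account, QAQ):
Simulated login, proxies, and the like are next on the to-do list.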