Python Notes 11: Scraping Douban Movie Comments with PyQuery

Contents

1. PyQuery Overview

2. Basic PyQuery Usage


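PyQuery lets you run jQuery-style queries on XML documents. Its API is kept as close to jQuery's as possible, and it uses lxml for fast XML and HTML manipulation.

1. PyQuery Overview

(1) Initializing a PyQuery object: from a string, from a URL, or from a file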

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30

# import requests
from pyquery import PyQuery as pq

html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc('div'))
# Initialize from a URL
# Equivalent to: url_doc = pq(requests.get('http://news.baidu.com/').text)
url_doc = pq(url='http://news.baidu.com/')
print(url_doc('title'))
# Initialize from a file
txt_doc = pq(filename='test.html')
print(txt_doc('title'))

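(2) CSS selectors: https://www.w3school.com.cn/cssref/css_selectors.asp

In CSS, a selector is a pattern that picks out the elements to style. The example below selects every div that is a descendant of the node with class "subject-item" in html_doc.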

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30

from pyquery import PyQuery as pq
import requests
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc('.subject-item div'))

Output:

<div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> 

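(3) Finding nodes

  • Child nodes: call find() with a CSS selector to search all descendants of the current selection (below, the img inside the .nbg node); use children() to restrict the search to direct children.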

print(string_doc('.nbg').find('img'))
print(type(string_doc('.nbg').find('img')))
# Output:
<img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> 
<class 'pyquery.pyquery.PyQuery'>
# ------------------------------------------
print(string_doc('.cart-actions').children())
# Output:
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> 
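  • Parent nodes: parent() returns the direct parent and parents() returns all ancestors. In the example below, .buy-info selects the node with class buy-info; parent() then returns its direct parent and parents() returns every ancestor. To keep only a particular ancestor, pass a CSS selector to parents(); here it keeps the ancestor with class cart-actions.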

print(string_doc('.buy-info').parent())
# Output:
<div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> 

print(string_doc('.buy-info').parents())
# Output:
<html><body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li></body></html><body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li></body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li><div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> 

print(string_doc('.buy-info').parents('.cart-actions'))
# Output:
<div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> 
  • Sibling nodes: siblings() returns the sibling nodes; it also accepts a CSS selector to keep only matching siblings.

print(string_doc('.cart-actions').siblings())
# Output:
<div class="collect-info"> </div> 

(4) Iteration: when a query matches several nodes, call items() to iterate over them. Below, string_doc('span').items() iterates over all span elements.

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30

from pyquery import PyQuery as pq
import requests

html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
divs = string_doc('span').items()
# print(divs)
print(type(divs))
for div in divs:
    print(div)

Output:

<class 'generator'>
<span class="allstar45"/> 
<span class="rating_nums">9.0</span> 
<span class="pl"> (
561845人评价) </span> 
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> 

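(5) Getting attributes: call attr() to read an attribute. When the selection contains several nodes, attr() returns only the first node's attribute, so loop over items() to read it from every node.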

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30

from pyquery import PyQuery as pq
import requests

html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
a = string_doc('div')
# print(a, type(a))
for item in a.items():
    print(item.attr('class'))

Output:

pic
info
pub
star clearfix
ft
collect-info
cart-actions

(6) Getting text: call text(). No loop is needed; it returns the text inside all selected nodes, as shown below:

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30

from pyquery import PyQuery as pq
import requests

html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc.text())

Output:

小王子
[法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元
9.0 ( 561845人评价)
小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...
纸质版47.30元起

You can also use a CSS selector to print the text of specific nodes only:

    a = string_doc('span')
    print(a.text())

Output: 9.0 ( 561845人评价) 纸质版47.30元起

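(7) Manipulating nodes: addClass() adds a class to a node and removeClass() removes one. The code below calls them by their snake_case names, add_class() and remove_class().

# Initialize from a string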
string_doc = pq(html_doc)
span = string_doc('.buy-info')
print(span)
span_mv = span.remove_class('buy-info')
print(span_mv)
span_add = span_mv.add_class('buy-info')
print(span_add)
# Output:
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> 
<span class=""> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> 
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> 

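attr(), text(), and html() can also set an attribute value, the text content, and the inner HTML:

# Initialize from a string
string_doc = pq(html_doc)
span = string_doc('.buy-info')
# Change the class attribute from buy-info to price
print(span.attr('class', 'price'))
# Replace the text inside the span with "价格:47.30"
print(span.text('价格:47.30'))
# Replace the inner HTML of the span with "<a>价格:47.30</a>"
print(span.html('<a>价格:47.30</a>'))
# Output: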
<span class="price"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> 
<span class="price">价格:47.30</span> 
<span class="price"><a>价格:47.30</a></span> 

For other methods and their usage, see: https://pyquery.readthedocs.io/en/latest/api.html

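(8) Pseudo-class selectors: for CSS3 pseudo-class selectors see https://www.w3school.com.cn/css/css_pseudo_classes.asp; the list at https://www.runoob.com/css/css-pseudo-classes.html is somewhat more detailed.

  • Pseudo-class syntax: selector:pseudo-class {property: value}
  • Combining a CSS class with a pseudo-class: selector.class:pseudo-class {property: value}
  • first-child: matches an element only when it is the first child of its parent.
  • last-child: matches an element that is the last child of its parent.
  • only-child: matches an element that is the only child of its parent.
  • nth-child(n): matches an element that is the nth child of its parent.
  • nth-last-child(n): matches an element that is the nth child of its parent, counting from the last child.
# Initialize from a string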
string_doc = pq(html_doc)
div1 = string_doc('div:first-child')
print(div1)
div2 = string_doc('div:last-child')
print(div2)
div3 = string_doc('div:only-child')
print(div3)
div4 = string_doc('div:nth-child(3)')
print(div4)
div5 = string_doc('div:nth-last-child(4)')
print(div5)
# Output:
<div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="collect-info"> </div> 
<div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> <p>小王子是一个超凡脱俗的仙童,他住在一颗只比他大一丁点儿的小行星上。陪伴他的是一朵他非常喜爱的小玫瑰花。但玫瑰花的虚荣心伤害了小王子对她的感情。小王子告别小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">纸质版47.30元起</a> </span> </div> 

<div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人评价) </span> </div> 
<div class="pub"> [法]圣埃克苏佩里/马振聘/人民文学出版社/2003-8/22.00元 </div> 

2. Basic PyQuery Usage

Create the tb_movie_comments table to store the scraped comments:

CREATE TABLE `tb_movie_comments` (
  `cid` int(11) NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `commentator` varchar(100) DEFAULT NULL COMMENT 'Commenter',
  `comments` varchar(2000) DEFAULT NULL COMMENT 'Comment text',
  `votes` varchar(20) DEFAULT NULL COMMENT 'Number of upvotes',
  `createdate` datetime DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation time',
  `ctype` char(2) DEFAULT NULL COMMENT 'Comment type: 1. positive, 2. neutral, 3. negative',
  PRIMARY KEY (`cid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

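Next, scrape the short comments for "Better Days" (《少年的你》). The plan was to store everything in the database and then run some big-data analysis to extract something useful, ideally with flashy extras like a word cloud. In reality the anti-scraping mechanism brought me back to earth: it was game over after ten pages. Still, as my first attempt at using PyQuery, here it is as a keepsake: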

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_comment.py
# @Project: Python Notes
# @CreateTime : 2020/5/15 14:52:37

import urllib
from pyquery import PyQuery as pq
import requests
import pymysql
import random
import time


def login(url):

    user_agents = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/81.0.4044.122 Safari/537.36']
    # For the header values, open the URL, press F12 and look at the Network tab
    headers = {
        'Cookie': 'gr_user_id=36d2fea0-91b3-4445-b0c4-2f1eec5e681e; bid=3VpjSZO1pLI; douban-fav-remind=1; '
                  '__yadk_uid=AwRZnSg2z94qiZ0ziZx8rRTJx0GARPvJ; '
                  'trc_cookie_storage=taboola%2520global%253Auser-id%3D54ee53eb-ce52-4f1e-b503-f2b4ba820774'
                  '-tuct2359b57; __gads=ID=953ce3860eb89d60:T=1571272451:S=ALNI_MYayAKeBBq7vr_NBvFfsaRTVepXaw; '
                  '_vwo_uuid_v2=D2CFD349D628C78D38815D8765A3EB401|d8942a02c6249450bd209b499e64d81c; ll="118297"; '
                  'douban-profile-remind=1; _ga=GA1.2.2128425525.1488504434; push_doumail_num=0; push_noty_num=0; '
                  '__utmv=30149280.19762; ct=y; __utmc=30149280; '
                  '_pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1589252865%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl'
                  '%3DjxWgT7kJtprsF-uyr7ziX2Rid2J_n9ZVC9_Qu-JHCj9InQNIG3Ew5bcMZK8paZow%26wd%3D%26eqid'
                  '%3Dae210c5000009ced000000065eba12fc%22%5D; '
                  '_pk_id.100001.8cb4=a9140f060c7b64ae.1488504433.95.1589252865.1589247679.; '
                  'viewed="25811418_25904568_4849666_27069880_27608412_2086633_11535042_33413575_34430051_1469051"; '
                  'dbcl2="77249558:xmnxDXaS+r8"; ck=h_ZU; '
                  '__utma=30149280.2128425525.1488504434.1589768146.1589771266.143; '
                  '__utmz=30149280.1589771266.143.73.utmcsr=accounts.douban.com|utmccn=('
                  'referral)|utmcmd=referral|utmcct=/passport/login',
        'User-Agent': str(random.choice(user_agents)),
        'Referer': 'https://accounts.douban.com/passport/login',
        'Connection': 'keep-alive'
    }
    req = requests.get(url, headers=headers)
    return req


# Takes the comment type and the page number used to build the URL
def comment(ctype, page):
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/81.0.4044.122 Safari/537.36'}
    num = page * 20
    url = 'https://movie.douban.com/subject/30166972/comments?start=' + str(num) + '&limit=20&sort=new_score' \
                                                                                   '&status=P&percent_type=' + ctype
    html = login(url)
    html_doc = pq(html.text)
    data_all = html_doc('.comment-item').items()
    for data in data_all:
        commentator = data('.comment-info a').text()
        comments = data('.short').text()
        votes = data('.votes').text()
        createdate = data('.comment-time').text()
        # print(commentator)
        # Store the row in the database
        db = pymysql.connect(host='192.183.3.***', port=3306, user='nn',
                             password='******', database='nntest', charset='utf8')
        cur = db.cursor()
        # pymysql uses %s placeholders; the driver escapes the values itself
        sql = 'INSERT INTO tb_movie_comments(commentator, comments, votes, createdate, ctype) ' \
              'VALUES(%s, %s, %s, %s, %s)'
        try:
            cur.execute(sql, (commentator, comments, votes, createdate, ctype))
            db.commit()
            print('Insert Successful!')
        except Exception:
            print('Sorry,Failed!')
            db.rollback()
        cur.close()
        db.close()


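# To scrape in batches and store the results in the database, use the following:
ctypes = ['h', 'm', 'l']
for ctype in ctypes:
    # Because of anti-scraping limits, 10 pages are enough; page starts at 0
    for page in range(0, 10, 1):
        try:
            comment(ctype, page)
            print(ctype + ': page ' + str(page) + ' scraped and stored in the database')
        except Exception:
            print(ctype + ': page ' + str(page) + ' failed to scrape or store')
    time.sleep(10)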

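He who holds the data rules the world. And finally, the important thing said three times: scrape from a throwaway account! A throwaway account! A throwaway account!!! No zuo no die. A thief cannot work in broad daylight, and this little data thief paid a price that no Rollback can undo, on my real account no less (QAQ).

Simulated login, proxies, and the like are next on the list.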
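
As a starting point for that, here is a rough sketch of a reusable requests.Session with an optional proxy and a randomized pause between requests. The proxy address, the delay values, and the helper names make_session and polite_delay are all assumptions made up for illustration; nothing here has been tested against Douban, and no request is sent.

```python
import random

import requests


def make_session(user_agent, proxy_url=None):
    """Build a session that reuses cookies and headers across requests."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    if proxy_url:
        # Route both schemes through the same (hypothetical) proxy.
        session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


def polite_delay(base=5.0, jitter=3.0):
    """Return a pause length in seconds with some random jitter added."""
    return base + random.uniform(0, jitter)


# Hypothetical values for illustration only.
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36")
session = make_session(USER_AGENT, proxy_url="http://127.0.0.1:8080")
delay = polite_delay()
print(session.proxies, round(delay, 2))
```

In the batch loop, calling time.sleep(polite_delay()) after each page would space the requests out unevenly rather than at a fixed rhythm.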
