Scrapy Crawlers from Getting Started to Worked Examples (Final Part)

Latest recommended article published on 2024-05-10 19:39:09

Wilson_Iceman

Category: Crawlers. Tags: web crawler, Scrapy, Coco (寻梦环游记)

Article link: https://blog.csdn.net/Wilson_Iceman/article/details/79190356

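This is the last article in the series, so let's build something a bit bigger and more complex.

Today we crawl Douban again, this time its movie reviews. As the test subject we pick Pixar's animated film Coco (《寻梦环游记》), which was hugely popular a while ago, and we will scrape every review of it on Douban. A short digression: I saw it in the cinema recently and it is a sincere piece of work. I did not expect the subject of death could be told this way, and I cried my eyes out in the theater. I strongly recommend it.

Back to business. The film has a little over 4,000 reviews, and fetching all of them requires being logged in to Douban. I suggest crawling with a throwaway account: if Douban ever detects the crawler on your main account, that account may get banned.

The first step is preparation. Today we need to install the Faker library; its role is explained shortly.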

pip install Faker

Once it is installed, we can start today's test.

scrapy startproject douban

With the project created, we first define the data structure to return, in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class DoubanMovieCommentItem(scrapy.Item):
    """One Douban movie review."""
    useful_num = scrapy.Field()        # number of "useful" votes
    no_help_num = scrapy.Field()       # number of "not helpful" votes
    people = scrapy.Field()            # reviewer name
    people_url = scrapy.Field()        # reviewer profile URL
    star = scrapy.Field()              # rating given by the reviewer
    comment = scrapy.Field()           # review body
    title = scrapy.Field()             # review title
    comment_page_url = scrapy.Field()  # URL of the review detail page

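We define the fields to return: mainly comment and title, plus the "useful" and "not helpful" vote counts. You can of course add any other fields you want.

pipelines.py needs no changes, because all data parsing is done in the spider. pipelines.py: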

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class DoubanPipeline(object):
	def process_item(self, item, spider):
		return item

class DoubanMovieCommentPipeline(object):
	def process_item(self, item, spider):
		return item

The crawler itself is the main part. Create a new file, douban_coco_spider.py. It contains quite a lot of code, so we go through it piece by piece.

# -*- coding: utf-8 -*-

import scrapy
from faker import Factory
from douban.items import DoubanMovieCommentItem
# Python 3: the old urlparse module now lives in urllib.parse
from urllib import parse as urlparse

f = Factory.create()


class MailSpider(scrapy.Spider):
	"""Logs in to Douban and scrapes the reviews of Coco."""
	name = 'douban-comment'
	allowed_domains = ['accounts.douban.com', 'douban.com']
	start_urls = ['https://www.douban.com']

	headers = {
		'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
		'Accept-Encoding' : 'gzip, deflate, br',
		'Accept-Language'	: 'zh-CN,zh;q=0.8,en;q=0.6',
		# 'Cache-Control' : 'max-age=0',
		'Connection' : 'keep-alive',
		'Host' : 'accounts.douban.com',
		# Random UA from Faker so the server sees a browser rather than a crawler.
		# This expression runs once, when the class is defined.
		'User-Agent': f.user_agent()
	}

	formdata = {
		'form_mail' : '*********', # your Douban user name
		'form_password' : '***********', # the password for that account
		'login' : '登录',
		'redir' : 'https://www.douban.com',
		'source' : 'None'
	}

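First come the imports. The one to point out is faker: this library generates fake data, and here we use it to produce a random User-Agent string for the request headers, so that the requests look like they come from a browser. Note that f.user_agent() in the class body is evaluated once, when the class is defined, so all requests in one run share that value. User-Agent is part of the HTTP protocol and is not covered further here.

The code then defines a headers dict and a form dict, formdata, which is used to log in to Douban.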
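If a fresh User-Agent per request is wanted, the headers can be rebuilt each time a request is created. The following is only a sketch using the standard library: the `USER_AGENTS` pool, `BASE_HEADERS` and the `build_headers` helper are hypothetical names introduced for illustration, and with Faker installed a call to `f.user_agent()` could stand in for `random.choice`.

```python
import random

# Hypothetical pool of User-Agent strings (assumption, not from the article).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 "
    "(KHTML, like Gecko) Version/11.0.2 Safari/604.4.7",
    "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0",
]

# Headers shared by every request.
BASE_HEADERS = {
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
    "Connection": "keep-alive",
}


def build_headers(host, rng=random):
    """Return a new headers dict with a randomly chosen User-Agent."""
    headers = dict(BASE_HEADERS)  # copy, so the shared base is never mutated
    headers["Host"] = host
    headers["User-Agent"] = rng.choice(USER_AGENTS)
    return headers


print(build_headers("movie.douban.com"))
```

In the spider, each `scrapy.Request` could then receive `headers=build_headers('movie.douban.com')` instead of the shared `self.headers`.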

	def start_requests(self):
		# Open the login page first; the cookiejar key keeps the session cookies together
		return [scrapy.Request(url = 'https://www.douban.com/accounts/login',
								headers = self.headers,
								meta = {'cookiejar' : 1},
								callback = self.parse_login)]

	def parse_login(self, response):
		# The login page only contains a captcha image when Douban asks for one
		if 'captcha_image' in response.text:
			print ("Copy the link")
			link = response.xpath('//img[@class="captcha_image"]/@src').extract()[0]
			print (link)
			# Python 3: input() replaces raw_input()
			captcha_solution = input('captcha_solution:')
			# parse_qs returns a list per key, so take the first value
			captcha_id = urlparse.parse_qs(urlparse.urlparse(link).query, True)['id'][0]
			self.formdata['captcha-solution'] = captcha_solution
			self.formdata['captcha-id'] = captcha_id
		return [scrapy.FormRequest.from_response(response,
							formdata = self.formdata,
							headers = self.headers,
							meta = {'cookiejar' : response.meta['cookiejar']},
							callback = self.after_login
							)]

	def after_login(self, response):
		print ('after_login: %d' % response.status)
		# The review pages live on movie.douban.com, so the Host header must match
		self.headers['Host'] = 'movie.douban.com'
		yield scrapy.Request(url = 'https://movie.douban.com/subject/20495023/reviews',
							meta = {'cookiejar' : response.meta['cookiejar']},
							headers = self.headers,
							callback = self.parse_comment_url,
							dont_filter = True)

		yield scrapy.Request(url = 'https://movie.douban.com/subject/20495023/reviews',
							meta = {'cookiejar' : response.meta['cookiejar']},
							headers = self.headers,
							callback = self.parse_next_page,
							dont_filter = True)

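Three functions are defined here: start_requests, parse_login and after_login. start_requests issues the first request with the headers and a cookie jar, and the returned page is handled by parse_login. If that page contains a captcha image (captcha_image), the captcha has to be dealt with. The handling here is simple: the spider prints the image link, you open it in a browser by hand, and then type the captcha you see into the terminal prompt. There are many other ways to handle captchas, some involving image processing, which are not covered here. I had logged in to Douban recently, so I was not asked for a captcha. Once the captcha is known, it is sent together with the other fields as a POST form to the site. After validation the site returns the logged-in page, which after_login handles.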
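The captcha id is taken from the query string of the image link. A small standalone sketch of that step follows; the helper name `captcha_id_from_link` and the sample link are made up for illustration and are not Douban's real URL format. The point it shows is that `parse_qs` returns a list of values per key, so the first element has to be selected.

```python
from urllib.parse import urlparse, parse_qs


def captcha_id_from_link(link):
    """Return the `id` query parameter of a captcha image URL, or None."""
    query = urlparse(link).query
    values = parse_qs(query, keep_blank_values=True).get("id", [])
    return values[0] if values else None


# Hypothetical captcha link, invented for this example.
link = "https://www.douban.com/misc/captcha?id=abc123:en&size=s"
print(captcha_id_from_link(link))  # abc123:en
```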

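after_login sends two requests to the same page. You could merge them into a single Request; they are kept separate here only to make the explanation easier. The page requested is the Coco review list. The first response is handled by parse_comment_url, which collects the URLs of all review detail pages on that list page. The second response is handled by parse_next_page, which finds the link to the next page. One reminder: the requests pass dont_filter = True so that repeated requests to the same address are not filtered out. By default Scrapy drops later requests to an address it has already requested, and here we do not want that.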
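The effect of dont_filter can be pictured with a toy model. This is not Scrapy's implementation (Scrapy compares request fingerprints rather than raw URL strings); the `SeenFilter` class below is a hypothetical simplification that only shows why the second request to the same URL would otherwise disappear.

```python
class SeenFilter:
    """Toy duplicate filter: remembers URLs and drops repeats."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        if dont_filter:
            return True  # the duplicate check is skipped entirely
        if url in self.seen:
            return False  # already requested once, so drop it
        self.seen.add(url)
        return True


flt = SeenFilter()
url = "https://movie.douban.com/subject/20495023/reviews"
print(flt.should_schedule(url))                    # True
print(flt.should_schedule(url))                    # False
print(flt.should_schedule(url, dont_filter=True))  # True
```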

	def parse_next_page(self, response):
		print ('parse_next_page: %d' % response.status)
		try:
			next_url = response.urljoin(response.xpath('//span[@class="next"]/a/@href').extract()[0])
		except IndexError:
			# No "next" link was found, so this is the last page
			print("Next page error")
			return
		print ("Next page")
		print (next_url)
		# Same pattern as after_login: one request for the review links,
		# one request to keep following the pagination
		yield scrapy.Request(url = next_url,
					meta = {'cookiejar' : response.meta['cookiejar']},
					headers = self.headers,
					callback = self.parse_comment_url,
					dont_filter = True)
		yield scrapy.Request(url = next_url,
					meta = {'cookiejar' : response.meta['cookiejar']},
					headers = self.headers,
					callback = self.parse_next_page,
					dont_filter = True)

	def parse_comment_url(self, response):
		print ('parse_comment_url: %d' % response.status)
		# Each review on the list page links to its own detail page
		for item in response.xpath('//div[@class="main review-item"]'):
			comment_url = item.xpath('div[@class="main-bd"]/h2/a/@href').extract()[0]
			comment_title = item.xpath('div[@class="main-bd"]/h2/a/text()').extract()[0]
			print(comment_title)
			print(comment_url)
			yield scrapy.Request(url = comment_url,
						meta = {'cookiejar' : response.meta['cookiejar']},
						headers = self.headers,
						callback = self.parse_comment)

	def parse_comment(self, response):
		print ('parse_comment: %d' % response.status)
		for item in response.xpath('//div[@id="content"]'):
			comment = DoubanMovieCommentItem()
			comment['useful_num'] = item.xpath('//div[@class="main-panel-useful"]/button[1]/text()').extract()[0].strip()
			comment['no_help_num'] = item.xpath('//div[@class="main-panel-useful"]/button[2]/text()').extract()[0].strip()
			comment['people'] = item.xpath('//span[@property="v:reviewer"]/text()').extract()[0]
			comment['people_url'] = item.xpath('//header[@class="main-hd"]/a[1]/@href').extract()[0]
			comment['star'] = item.xpath('//header[@class="main-hd"]/span[1]/@title').extract()[0]

			# data-original selects between two page layouts for the review body
			data_type = item.xpath('//div[@id="link-report"]/div/@data-original').extract()[0]
			print("data_type: " + data_type)
			if data_type == '0':
				comment['comment'] = "\t#####\t".join(map(lambda x:x.strip(), item.xpath('//div[@id="link-report"]/div/p/text()').extract()))
			elif data_type == '1':
				comment['comment'] = "\t#####\t".join(map(lambda x:x.strip(), item.xpath('//div[@id="link-report"]/div[1]/text()').extract()))
			# Title and URL apply to both layouts, so set them outside the if/elif
			comment['title'] = item.xpath('//span[@property="v:summary"]/text()').extract()[0]
			comment['comment_page_url'] = response.url
			yield comment

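This is the last part of douban_coco_spider.py. parse_next_page works the same way as before: it requests the same page twice. The two responses are again handled separately, one to collect the URLs of all review detail pages on that page and the other to get the link to the next page. The try block is there so that the program does not fail on the last page, where there is no next link. parse_comment_url sends a request to each review detail URL, and parse_comment handles the returned page and extracts the data we want, which is mostly XPath work. Finally it returns the comment object.

Finally, settings.py. A few things need to be configured there, mainly so that requests look like they come from a browser.

First, use faker to generate a random user-agent: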
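The pagination step can be tried outside Scrapy. The sketch below assumes the "next" link is a relative href such as `?start=20` (an assumed shape, not confirmed by the article) and resolves it against the current page with `urllib.parse.urljoin`, which is roughly what `response.urljoin` does. The helper `next_page_url` is a hypothetical name; it returns None on the last page instead of relying on an exception.

```python
from urllib.parse import urljoin


def next_page_url(current_url, hrefs):
    """Resolve the first 'next' href against the current page.

    `hrefs` is the list an XPath query would return; it is empty on the last page.
    """
    if not hrefs:
        return None
    return urljoin(current_url, hrefs[0])


page = "https://movie.douban.com/subject/20495023/reviews"
print(next_page_url(page, ["?start=20"]))
print(next_page_url(page, []))
```

Checking for an empty list makes the last-page case explicit, while the try block in the spider reaches the same result by catching the failed `[0]` lookup.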

from faker import Factory
f = Factory.create()
USER_AGENT = f.user_agent()

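Next comes the cookies setting. Disabling cookies is a common way to make an anonymous crawl harder to track, but this spider carries its login session through the cookiejar meta key, and that only works while Scrapy's cookies middleware is enabled. So for this login-based crawl cookies stay on:

# Cookies are enabled by default; the cookiejar-based login depends on them
COOKIES_ENABLED = True

Then set a download delay: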

DOWNLOAD_DELAY = 1

We also need to configure the pipelines to tell Scrapy which pipeline this crawl uses (even though the pipeline does nothing this time):

ITEM_PIPELINES = {
	#'douban.pipelines.DoubanBookPipeline': 300,
	#'douban.pipelines.DoubanMailPipeline': 600,
	'douban.pipelines.DoubanMovieCommentPipeline': 900
}

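We can also set default request headers. Headers passed explicitly on a Request, as the spider does, take precedence over these defaults: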

DEFAULT_REQUEST_HEADERS = {
	'Host': 'book.douban.com',
	'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
	'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
	'Accept-Encoding': 'gzip, deflate, br',
	'Connection': 'keep-alive'
}

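That covers all of the code. Now we can run it from the command line:

scrapy crawl douban-comment -o commentInfo.csv

The results are saved in commentInfo.csv. The original post showed a screenshot of the run results here, for reference only.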
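Once the CSV exists, it can be read back with the standard csv module. The spider joins review paragraphs with the separator `\t#####\t`, so splitting on it recovers the paragraphs. The sketch below runs on a made-up in-memory sample instead of a real export; the `load_reviews` helper and the sample row are hypothetical.

```python
import csv
import io

SEPARATOR = "\t#####\t"  # same separator the spider uses when joining paragraphs

# Made-up sample standing in for commentInfo.csv.
sample = io.StringIO()
writer = csv.DictWriter(sample, fieldnames=["title", "star", "comment"])
writer.writeheader()
writer.writerow({
    "title": "Sample review",
    "star": "recommended",
    "comment": "first paragraph" + SEPARATOR + "second paragraph",
})
sample.seek(0)


def load_reviews(fileobj):
    """Read review rows and split each comment back into paragraphs."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["paragraphs"] = [p for p in row["comment"].split(SEPARATOR) if p]
        rows.append(row)
    return rows


reviews = load_reviews(sample)
print(reviews[0]["title"], len(reviews[0]["paragraphs"]))
```

For the real file, something like `open('commentInfo.csv', newline='', encoding='utf-8')` could be passed to `load_reviews`, assuming the export was written as UTF-8.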

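We have used three articles to explain how to use the Scrapy framework. To summarize:

First, run scrapy startproject [project name] to create and initialize a crawler project
Create your own spider file under the spiders directory (you can create several); this file is what parses the data
Define the data to return in items.py
Define how the data is processed in the pipeline. In large projects the processing is done here, and the spider only downloads pages
Configure request-related settings in settings.py, including which pipeline to use, default headers, user-agent, cookies and so on
Finally, start the crawler from the command line: scrapy crawl [spider name] -o xxxxxx.csv

These are my notes from a few days of learning. There were plenty of challenges along the way, including code that would not run while debugging. A language is something you learn by writing and practicing a lot; with time it becomes familiar.

Let's keep each other going!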
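As an illustration of moving processing into a pipeline, the sketch below turns the vote-count strings into integers. It is an assumption-laden example: the class name `CleanCountsPipeline` is hypothetical, and the sample text `useful 128` is invented, since the article does not show what the button text looks like. The class only relies on dict-style access, so it can be tried with a plain dict.

```python
class CleanCountsPipeline:
    """Hypothetical pipeline: convert vote-count strings to integers."""

    count_fields = ("useful_num", "no_help_num")

    def process_item(self, item, spider):
        for field in self.count_fields:
            raw = str(item.get(field, ""))
            # Keep only ASCII digits; anything else (labels, spaces) is dropped
            digits = "".join(ch for ch in raw if ch in "0123456789")
            item[field] = int(digits) if digits else 0
        return item


# Try it with a plain dict standing in for a scraped item.
pipeline = CleanCountsPipeline()
cleaned = pipeline.process_item({"useful_num": " useful 128 ", "no_help_num": ""}, spider=None)
print(cleaned)  # {'useful_num': 128, 'no_help_num': 0}
```

To use such a class in the project it would need to live in pipelines.py and be listed in ITEM_PIPELINES, for example under the assumed path 'douban.pipelines.CleanCountsPipeline'.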
