How the Scrapy Framework Works
This project is built on the Scrapy framework, an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.
A web crawler is, loosely speaking, a program that roams the web, either freely or along a defined path, and scrapes data as it goes; more precisely, it fetches the HTML of specific web pages.
The usual approach is to define an entry page. Since every page contains URLs to other pages, the crawler extracts those URLs from the current page, adds them to its crawl queue, and then repeats the same procedure recursively on each new page. In essence this is just a depth-first or breadth-first traversal.
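To make that concrete, here is a bare-bones breadth-first crawl loop using only the Python standard library. The regex link extraction is deliberately naive and for illustration only; Scrapy replaces all of this with far more robust machinery:

from collections import deque
import re
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(entry_url, max_pages=10):
    queue = deque([entry_url])          # crawl queue, seeded with the entry page
    seen = {entry_url}                  # URLs already queued, to avoid revisits
    while queue and len(seen) < max_pages:
        url = queue.popleft()           # popleft() = breadth-first; pop() would be depth-first
        html = urlopen(url).read().decode('utf-8', errors='ignore')
        # naive link extraction; a real crawler parses the HTML properly
        for link in re.findall(r'href="(.*?)"', html):
            link = urljoin(url, link)   # resolve relative URLs
            if link.startswith('http') and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen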
Installing the Scrapy Framework
Scrapy is not one of Python's standard modules, so we use pip to download and install it locally. The process is simple: just run the following commands in the DOS console (important: install the Twisted library first):
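pip install Twisted
pip install Scrapy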
Creating the Project
At the command prompt, change into the directory where you want to keep the project.
Then run the command: scrapy startproject <project name>
That creates the Scrapy project. Afterwards, the DOS command tree /f shows the project's folder structure.
Taking Douban as the example:
items.py # entity class that wraps the scraped data
pipelines.py # post-processing of the scraped data
settings.py # the framework's core configuration file
spiders # folder for the main spider scripts
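For reference, a freshly generated project also contains a deployment config and a middleware module; the standard template looks roughly like this:

doubanmovie
│  scrapy.cfg              # deployment configuration
│
└─doubanmovie
   │  items.py             # entity class that wraps the scraped data
   │  middlewares.py       # spider and downloader middleware
   │  pipelines.py         # post-processing of the scraped data
   │  settings.py          # the framework's core configuration file
   │  __init__.py
   │
   └─spiders               # folder for the main spider scripts
      └─ __init__.py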
Creating the Core Spider Script
Again, a single command is enough to create a spider script template for us.
Create the spider template:
Change into the project folder: cd <project folder name>
scrapy genspider <spider script name> <domain of the site to crawl>
Testing the Connection to the Site
Command: scrapy shell <site URL>
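For this project that is:
scrapy shell https://movie.douban.com/top250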
This produces an error. How do we deal with it? We need to set the request headers, disguising the request with a user-agent.
rotate_useragent.py, below, keeps a list of user-agent strings and picks one at random for each request.
rotate_useragent.py
# Import the random module
import random
# Import the UserAgentMiddleware class from Scrapy's user-agent downloader middleware
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

# RotateUserAgentMiddleware extends the UserAgentMiddleware parent class.
# Purpose: keep a pool of user-agent strings and pick one at random to
# disguise each request, attaching it to every request the spider sends.
# Anti-crawler measures: many sites refuse requests that identify themselves
# as crawlers, so we disguise the spider as a browser when visiting pages.
class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Rotate the user-agent: pick one at random for this request
        ua = random.choice(self.user_agent_list)
        if ua:
            # Print the user-agent chosen for this request
            print(ua)
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape.
    # More user-agent strings can be found at http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
Copy rotate_useragent.py into our project package, the doubanmovie folder.
Then adjust settings.py to wire rotate_useragent.py into the framework; after that, every request the framework sends automatically carries one of the user-agent strings from the list, chosen at random.
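Concretely, the relevant entries (the full settings.py appears at the end of this post) disable Scrapy's built-in user-agent middleware and register ours:

DOWNLOADER_MIDDLEWARES = {
    # turn off the built-in user-agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # register our rotating user-agent middleware
    'doubanmovie.rotate_useragent.RotateUserAgentMiddleware': 400,
}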
Run the scrapy shell command again and check the response the Scrapy framework gets from the server.
The request succeeds!
We have now successfully set up our Scrapy project.
Main steps:
- Create the Scrapy project
scrapy startproject doubanmovie
- Create the spider template
cd doubanmovie
scrapy genspider moviespider douban.com
- Set the user-agent by configuring settings.py
Now we can write the spider itself. First we analyze the target URL, https://movie.douban.com/top250, decide which data we want, and work out how to get it.
Implementing the data collection:
- Edit items.py to define the fields to collect
- Edit moviespider.py
- Edit pipelines.py to print the data to the console
Start by inspecting the page: open the Douban Top 250 in Firefox or Chrome (recommended), press F12 to open the developer tools, and find the HTML behind the data we want:
The fields we want to collect are the movie's rank and its title.
Open items.py in the project and define these two fields, rank and title.
**Note:** In Scrapy, items.py models the scraped record as an object: each field you collect becomes an attribute of the item class, and every attribute is created uniformly with the scrapy.Field() function, which is very convenient.
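Given the two fields the spider below uses, rank and title, items.py comes out to just a few lines:

# -*- coding: utf-8 -*-
import scrapy

class DoubanmovieItem(scrapy.Item):
    # every collected field is an attribute created with scrapy.Field()
    rank = scrapy.Field()    # the movie's rank
    title = scrapy.Field()   # the movie's title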
Set the URL the spider visits in the project's moviespider.py.
Use XPath to parse the HTML tags and extract the values we need. For an introduction to XPath and how to use it, see my other article on scraping the Douban movie Top 250: https://blog.csdn.net/qq_41251963/article/details/81605331
# -*- coding: utf-8 -*-
import scrapy
# Import the DoubanmovieItem class from items.py
from doubanmovie.items import DoubanmovieItem

class MoviespiderSpider(scrapy.Spider):
    name = 'moviespider'
    allowed_domains = ['douban.com']
    # The URL the spider starts crawling from
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Grab every <div class="item"></div> on the current page
        movie_items = response.xpath('//div[@class="item"]')
        # Loop over each movie block and parse out its rank and title,
        # e.g. //*[@id="content"]/div/div[1]/ol/li[1]/div/div[1]/em
        for item in movie_items:
            # Create an empty DoubanmovieItem object
            movie = DoubanmovieItem()
            # Fill in the item's fields; extract_first() returns the first
            # matched value as a string (extract() would return a list)
            movie['rank'] = item.xpath('div[@class="pic"]/em/text()').extract_first()
            # title: the movie's name
            movie['title'] = item.xpath('div[@class="info"]/div[@class="hd"]/a/span[@class="title"][1]/text()').extract_first()
            # Hand the populated item over to the pipelines
            yield movie

        # Follow the "next page" link to crawl the site in depth automatically
        nextPage = response.xpath('//span[@class="next"]/a/@href')
        # Only continue if a next page exists
        if nextPage:
            # Build the absolute URL of the next page
            url = response.urljoin(nextPage[0].extract())
            # Request the next page and parse it with this same method
            yield scrapy.Request(url, self.parse)
pipelines.py is the Scrapy framework's output pipeline (its output stage); the simplest option is to print the data straight to the console.
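A console-only pipelines.py can be as small as this sketch:

# -*- coding: utf-8 -*-
class DoubanmoviePipeline(object):
    def process_item(self, item, spider):
        # print each scraped item straight to the console
        print(dict(item))
        return item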
To use an output pipeline, we have to register it under ITEM_PIPELINES in settings.py:
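For this project the registered pipelines are (copied from the full settings.py at the end of the post):

ITEM_PIPELINES = {
    'doubanmovie.pipelines.DoubanmoviePipeline': 300,
    'doubanmovie.pipelines2json.DoubanmoviePipeline': 301,
    'doubanmovie.pipelines2excel.DoubanmoviePipeline': 302,
    'doubanmovie.pipelines2mysql.DoubanmoviePipeline': 303,
}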
The last three entries are the other output modes I set up: pipelines2json saves the data to a JSON file, pipelines2excel to a spreadsheet, and pipelines2mysql to a MySQL database.
The source files below only illustrate how to save the data in these different formats; whichever one you want to use has to be registered as an output in settings.py.
pipelines2excel.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Working with Excel in Python requires third-party libraries:
# xlwt: lets Python write Excel files
# xlrd: lets Python read Excel files
# xlutils: utility toolkit for Excel files
import time
import xlwt
import xlrd
from xlutils.copy import copy

class DoubanmoviePipeline(object):
    # Constructor: create the Excel file and its header row
    def __init__(self):
        folder_name = 'output'
        current_name = time.strftime("%Y%m%d", time.localtime())
        file_name = 'doubanmovieTop250_' + current_name + '.xls'
        # Full path of the output file
        self.excelpath = folder_name + '/' + file_name
        # Build the workbook
        self.workbook = xlwt.Workbook(encoding='UTF-8')
        # Create the worksheet
        self.sheet = self.workbook.add_sheet(u'豆瓣电影数据')
        # Column headers (rank, movie title)
        headers = ['排名', '电影名称']
        # Style for the header text
        headstyle = xlwt.easyxf('font: color-index black, bold on')
        # Write the headers in the chosen style
        for colIndex in range(0, len(headers)):
            self.sheet.write(0, colIndex, headers[colIndex], headstyle)
        # Save the newly created Excel file
        self.workbook.save(self.excelpath)
        # Row counter, shared across items
        self.rowIndex = 1

    def process_item(self, item, spider):
        print("-->Excel:write to excel file...........")
        # Re-open the Excel file we created (formatting_info expects a bool)
        oldwb = xlrd.open_workbook(self.excelpath, formatting_info=True)
        # Make a writable copy of it
        newwb = copy(oldwb)
        # Get the worksheet to write into
        sheet = newwb.get_sheet(0)
        # Turn the scraped item into a list of cell values
        line = [item['rank'], item['title']]
        # Write each value into its (row, column) cell
        for colIndex in range(0, len(line)):
            sheet.write(self.rowIndex, colIndex, line[colIndex])
        # Save the workbook, overwriting the original file
        newwb.save(self.excelpath)
        # Advance the row counter
        self.rowIndex = self.rowIndex + 1
        return item
pipelines2json.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Import the os module
import os
import time
import json

class DoubanmoviePipeline(object):
    # Constructor: create the folder that holds all of the project's output files
    def __init__(self):
        # Name of the output folder
        self.folderName = "output"
        # Create the folder if it does not exist yet
        if not os.path.exists(self.folderName):
            os.mkdir(self.folderName)

    def process_item(self, item, spider):
        # Progress message
        print(">>write to json file....")
        # Current date as a string
        now = time.strftime('%Y%m%d', time.localtime())
        # Name of the JSON file
        jsonFileName = 'doubanmovie_' + now + '.json'
        try:
            # Open the JSON file in append mode; the with-block closes it for us
            with open(self.folderName + os.sep + jsonFileName, 'a') as jsonfile:
                # Serialize the current item as one line of JSON
                data = json.dumps(dict(item), ensure_ascii=False) + '\n'
                # Write it to the JSON file
                jsonfile.write(data)
        except IOError as err:
            # Wrap the error in a readable message and re-raise it
            raise IOError('json file error: {0}'.format(str(err)))
        return item
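The source of pipelines2mysql.py is not reproduced here, so below is a minimal sketch of what it could look like. It assumes the third-party pymysql driver (pip install pymysql) and an existing MySQL database named doubanmovie with a movies table; the host, user and password are placeholders for your own setup.
pipelines2mysql.py (sketch)
# -*- coding: utf-8 -*-
# Minimal MySQL pipeline sketch, assuming:
#   pip install pymysql
#   CREATE TABLE movies (`rank` VARCHAR(10), title VARCHAR(255));
# Connection parameters below are placeholders.
import pymysql

class DoubanmoviePipeline(object):
    def open_spider(self, spider):
        # Connect to MySQL once, when the spider starts
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='root', database='doubanmovie',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        print(">>write to mysql....")
        # Insert one row per scraped item; the driver escapes the parameters
        self.cursor.execute(
            "INSERT INTO movies (`rank`, title) VALUES (%s, %s)",
            (item['rank'], item['title']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the cursor and connection when the spider finishes
        self.cursor.close()
        self.conn.close()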
That is the whole project.
The data-collection workflow:
- Define the fields to collect in items.py
- Write the spider from the generated template
- Write the pipelines.py files to choose the output format
- Launch the project
scrapy crawl <spider name>
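For this project that is, run from inside the doubanmovie folder:
scrapy crawl moviespider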
The full directory structure of the project:
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for doubanmovie project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'doubanmovie'
SPIDER_MODULES = ['doubanmovie.spiders']
NEWSPIDER_MODULE = 'doubanmovie.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'doubanmovie (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'doubanmovie.middlewares.DoubanmovieSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'doubanmovie.middlewares.DoubanmovieDownloaderMiddleware': 543,
    # disable Scrapy's built-in user-agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # register the rotating user-agent middleware
    'doubanmovie.rotate_useragent.RotateUserAgentMiddleware': 400,
}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'doubanmovie.pipelines.DoubanmoviePipeline': 300,
    'doubanmovie.pipelines2json.DoubanmoviePipeline': 301,
    'doubanmovie.pipelines2excel.DoubanmoviePipeline': 302,
    'doubanmovie.pipelines2mysql.DoubanmoviePipeline': 303,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
middlewares.py
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals

class DoubanmovieSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class DoubanmovieDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
And that is a complete Scrapy project, with the code of every file written out in this post!