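Scrapy: Basic Concepts and Usage
Framework
A framework is essentially a half-finished project: a highly general-purpose project template that already integrates a range of features (high-performance asynchronous downloading, queues, distributed crawling, parsing, persistence, and so on).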
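Installation
Linux:
pip3 install scrapy  # use pip or pip3, whichever one belongs to your Python 3 installation
Windows:
a. Install wheel
pip3 install wheel
b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
c. Change into the directory containing the downloaded file, then install Twisted
pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl  # cp35 is the Python version (CPython 3.5); pick the file that matches yours
d. Install pywin32
pip3 install pywin32
e. Install scrapy
pip3 install scrapy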
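To confirm that the installation steps above worked, a quick check from Python can help. This is a minimal sketch that uses only the standard library; the helper name is_installed is made up for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found by Python."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Report on the two packages the installation steps deal with.
    for name in ("scrapy", "twisted"):
        status = "installed" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If either package is reported as missing, rerun the matching pip3 command with the same interpreter you used to run this check.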
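Usage
Create a project
scrapy startproject xxoo  # xxoo is the project name
Create a spider file
First change into the project directory
cd xxoo  # xxoo is the project name
Then create the spider file
scrapy genspider ooxx www.xxoo.com  # ooxx is the spider name, www.xxoo.com is the domain used for the start URL
The spider file is created automatically in the spiders folder.
After running the commands above, the project has the following file structure: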
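-- xxoo
   |-- xxoo
   |   |-- spiders  # where spider files live; it can hold several spider files
   |   |   |-- __init__.py
   |   |   |-- ooxx.py  # the spider file created above
   |   |-- __init__.py
   |   |-- items.py  # item definitions, used together with the pipelines
   |   |-- middlewares.py  # middlewares
   |   |-- pipelines.py  # pipelines: receive the parsed data and persist it
   |   |-- settings.py  # project settings
   |-- scrapy.cfg  # Scrapy's own configuration file; best left alone unless you know what you are changing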
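To make the role of the pipelines file more concrete, here is a minimal sketch of a pipeline that persists items to disk. The class name JsonLinesPipeline and the output file items.jl are assumptions made for illustration; the method names open_spider, process_item and close_spider are the hooks Scrapy calls on a pipeline class.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: writes every item as one line of JSON."""

    def __init__(self, path="items.jl"):
        self.path = path  # output file name is an assumption
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields. Returning the item
        # lets any later pipeline process it as well.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()
```

A pipeline like this only runs once it is listed in ITEM_PIPELINES in settings.py.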
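Contents of the spider file ooxx.py:
# -*- coding: utf-8 -*-
import scrapy


# Scrapy provides several spider base classes; Spider is one of them.
# This class crawls and parses the data.
class OoxxSpider(scrapy.Spider):  # the class name comes from the spider file name: the capitalized name followed by Spider
    name = 'ooxx'  # the spider name; used to locate the spider to run
    allowed_domains = ['www.xxoo.com']  # allowed domains
    start_urls = ['https://www.xxoo.com/']  # list of start URLs; filled in from the genspider command, can be changed

    # Used for parsing: response is the response object for a start URL
    def parse(self, response):
        print(response)
        print(response.text)  # response content as a string
        print(response.body)  # response content as bytes
        response.xpath('//title/text()')  # example expression; put your own XPath expression between the quotes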
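allowed_domains is usually commented out. While it is active, requests that the spider issues to hosts outside the listed domains (and their subdomains) are filtered out. The image URLs found in a page are often hosted on a different domain, so requests for them would be dropped; that is why allowed_domains is usually commented out.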
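The domain rule can be sketched in a few lines of plain Python. This is a simplified approximation for illustration, not Scrapy's actual implementation; the function name is_allowed and the image URL are assumptions.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Approximate the domain check: the host itself or any subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["www.xxoo.com"]
print(is_allowed("https://www.xxoo.com/list?page=2", allowed))  # page on the allowed host
print(is_allowed("https://img.xxoo-cdn.com/a.jpg", allowed))    # image hosted elsewhere
```

Note that with 'www.xxoo.com' in the list, even the bare domain xxoo.com does not match, which is another reason the setting is often loosened or removed.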
start_urls can list several URLs; parse() is called once for each of their responses. Usually start_urls holds a single URL, typically the site's home page.
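Running
Run the following command in cmd (a terminal):
scrapy crawl ooxx  # ooxx is the spider name
This prints the output of the print calls together with the log messages. Usually we only care about log messages at the WARNING and ERROR levels.
scrapy crawl ooxx --nolog  # print only the results and suppress all log messages, which also saves a little CPU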
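Since --nolog hides errors as well, an alternative when only WARNING and ERROR messages matter is to raise the log threshold with Scrapy's LOG_LEVEL setting. A minimal settings.py fragment, offered as a sketch:

```python
# settings.py
# Only show log messages at WARNING level and above (ERROR, CRITICAL).
LOG_LEVEL = 'WARNING'
```

The same threshold can be set for a single run from the command line with scrapy crawl ooxx -L WARNING.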
Configuring settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for firstblood project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'firstblood'

SPIDER_MODULES = ['firstblood.spiders']
NEWSPIDER_MODULE = 'firstblood.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Setting USER_AGENT disguises the requests as coming from a browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'

# Obey robots.txt rules
# True: follow the robots.txt rules; False: ignore them
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# False: cookies are not handled; True: cookies are handled. While this line
# stays commented out, cookies are handled by default. Handling cookies on
# every request takes resources and slows the crawl down.
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'firstblood.middlewares.FirstbloodSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'firstblood.middlewares.FirstbloodDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'firstblood.pipelines.FirstbloodPipeline': 300,  # 300 is the priority; the lower the value, the higher the priority
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
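The priority numbers in ITEM_PIPELINES decide the order in which items pass through the pipelines. The fragment below is a sketch with two hypothetical pipeline classes (CleanPipeline and SavePipeline do not exist in the project above); the last two lines only demonstrate the resulting order and do not belong in settings.py.

```python
# settings.py (the class paths are hypothetical)
ITEM_PIPELINES = {
    'firstblood.pipelines.SavePipeline': 400,   # higher value, runs later
    'firstblood.pipelines.CleanPipeline': 300,  # lower value, runs earlier
}

# Demonstration only: items pass through pipelines in ascending order of value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Leaving gaps between the numbers (300, 400) makes it easy to slot another pipeline in between later.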