1. Common Scrapy Framework Commands

Command        Description                                  Format
startproject   Create a new project                         scrapy startproject <name> [dir]
genspider      Create a new spider                          scrapy genspider [options] <name> <domain>
settings       Get the spider configuration values          scrapy settings [options]
crawl          Run a spider                                 scrapy crawl <spider>
list           List all spiders in the project              scrapy list
shell          Start the interactive URL-debugging shell    scrapy shell [url]
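Of these, shell is the handiest for trying out selectors before writing any spider code. A minimal session sketch (the URL is just the stock-list page used in section 5; any reachable page works):

scrapy shell 'http://quote.eastmoney.com/stocklist.html'
>>> response.status                               # HTTP status code of the fetch
>>> response.css('a::attr(href)').extract()[:5]   # first few link targets
>>> exit()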
2. Using Scrapy

scrapy startproject demo                     # create a new project named demo
scrapy genspider spiderdemo www.baidu.com    # create a spider for Baidu
scrapy crawl spiderdemo                      # run the spider
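After scrapy startproject demo, the generated layout is roughly the standard Scrapy scaffolding below (the spider file only appears once genspider has been run):

demo/
    scrapy.cfg                # deployment configuration
    demo/
        __init__.py
        items.py              # item definitions
        middlewares.py        # spider and downloader middlewares
        pipelines.py          # item pipelines
        settings.py           # project settings
        spiders/
            __init__.py
            spiderdemo.py     # created by the genspider command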
3. The Request Class

Attribute/Method   Description
.url               URL of the request
.method            Request method: 'GET', 'POST', etc.
.headers           Dictionary-style request headers
.body              Request body content, string type
.meta              User-added extra information, passed between Scrapy components
.copy()            Copy the request
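A short sketch of these fields in use; .meta is the usual way to carry data from a request into its response callback (the URL and key names here are illustrative, not from the original project):

import scrapy

# Build a request and attach extra information via .meta;
# the same dict reappears as response.meta in the callback.
req = scrapy.Request(
    url='https://example.com/stock/sh600000.html',  # illustrative URL
    method='GET',
    meta={'stock_code': 'sh600000'},
)
print(req.url, req.method)   # inspect .url and .method
req2 = req.copy()            # .copy() returns an equivalent request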
4. The Response Class

Attribute/Method   Description
.url               URL of the response
.status            HTTP status code
.headers           Response headers
.body              Response body content, string type
.flags             A set of flags
.request           The Request object that produced this response
.copy()            Copy the response
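Inside a callback these attributes are read directly off the response; a minimal sketch (the spider below is illustrative, not part of the stock project):

import scrapy

class DetailSpider(scrapy.Spider):
    name = 'detail_demo'
    start_urls = ['https://example.com/']   # illustrative URL

    def parse(self, response):
        if response.status != 200:           # .status: HTTP status code
            return
        # .request points back at the Request that produced this response,
        # so anything attached to its .meta earlier is available again
        stock = response.request.meta.get('stock_code', '--')
        self.logger.info('%s: %d bytes, stock=%s',
                         response.url, len(response.body), stock)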
5. A Targeted Stock-Data Crawler Built on Scrapy
# pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class Python123DemoPipeline(object):
    def process_item(self, item, spider):
        return item


class BaidustocksPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.f = open('Stock.txt', 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file
        self.f.close()

    def process_item(self, item, spider):
        # Write each item as one line of text; skip anything that fails
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item
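Two design points worth noting: open_spider and close_spider bracket the whole crawl, so the output file is opened once rather than once per item, and the pipeline only runs at all after it is registered under ITEM_PIPELINES in settings.py (done below with priority 300).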
# demo.py
# -*- coding: utf-8 -*-
import scrapy
import re


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['python123.io']
    start_urls = ['http://quote.eastmoney.com/stocklist.html']
    stock_info_url = 'https://gupiao.baidu.com/stock/'

    def parse(self, response):
        # Find stock codes such as sh600000 / sz000001 in every link target
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = self.stock_info_url + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except:
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            # Strip the surrounding <dt>/<dd> markup to get plain text
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except:
                val = '--'
            infoDict[key] = val
        infoDict.update(
            {'股票名称': re.findall(r'\s.*\(', name)[0].split()[0]
                       + re.findall(r'\>.*\<', name)[0][1:-1]})
        yield infoDict
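To recap the flow: parse() scans every link on the stock-list page for codes of the form sh/sz plus six digits and schedules one request per stock, while parse_stock() scrapes the dt/dd key-value pairs from the detail page and yields a plain dict, which Scrapy then hands to BaidustocksPipeline.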
# settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for python123demo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'python123demo'
SPIDER_MODULES = ['python123demo.spiders']
NEWSPIDER_MODULE = 'python123demo.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'python123demo (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'python123demo.middlewares.Python123DemoSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'python123demo.middlewares.Python123DemoDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'python123demo.pipelines.BaidustocksPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
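With the pipeline registered above, scrapy crawl demo drives the whole chain and appends one line per stock to Stock.txt. Note that the gupiao.baidu.com pages this example targets may no longer be online, so treat the selectors and URLs as a template to adapt rather than something to run verbatim.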