Scrapy
On Windows, installing Scrapy may fail while building its Twisted dependency with:
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools"
The workaround is to download a prebuilt Twisted wheel matching your Python version and architecture from
https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
and install it directly before retrying Scrapy:
pip install D:\Python\Python37\Scripts\Twisted-18.9.0-cp37-cp37m-win_amd64.whl
C:\Windows\System32>pip install scrapy
C:\Windows\System32>scrapy -h
Scrapy 1.5.1 - no active project
Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
Engine
- Controls the data flow between all components
- Triggers events based on conditions
Downloader
- Downloads web pages according to requests
Scheduler
- Schedules and manages all crawl requests
Downloader Middleware
- Purpose: user-configurable control over the flow between the Engine, the Scheduler, and the Downloader
- Functions: modify, discard, or add requests or responses (see the sketch after this list)
Spider
- Parses the responses returned by the Downloader
- Produces scraped items
- Produces additional crawl requests (Request)
Item Pipelines
- Process the scraped items produced by Spiders in pipeline fashion
- Consist of a sequence of operations, like an assembly line; each operation is an Item Pipeline class
- Possible operations include cleaning, validating, and deduplicating the HTML data in scraped items, and storing the data in a database
Spider Middleware
- Purpose: reprocess requests and scraped items
- Functions: modify, discard, or add requests or scraped items
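As a concrete illustration of those middleware hooks, here is a minimal sketch of a custom Downloader Middleware (class name and header value are hypothetical); it would be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py:

from scrapy.exceptions import IgnoreRequest

class DemoDownloaderMiddleware:
    def process_request(self, request, spider):
        # Modify the request before the Downloader fetches it.
        request.headers.setdefault('User-Agent', 'demo-bot/0.1')
        return None  # None means: continue normal processing

    def process_response(self, request, response, spider):
        # Discard blocked responses; otherwise pass the response through.
        if response.status == 403:
            raise IgnoreRequest('blocked: %s' % request.url)
        return response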
scrapy <command> [options] [args]
startproject   Create a new project                  scrapy startproject <name> [dir]
genspider      Create a new spider                   scrapy genspider [options] <name> <domain>
settings       Get spider settings values            scrapy settings [options]
crawl          Run a spider                          scrapy crawl <spider>
list           List all spiders in the project       scrapy list
shell          Start the interactive scraping shell  scrapy shell [url]
http://python123.io/ws/demo.html
C:\Windows\System32>D:
D:\>scrapy startproject python123demo
New Scrapy project 'python123demo', using template directory 'd:\\python\\python37\\lib\\site-packages\\scrapy\\templates\\project', created in:
D:\python123demo
You can start your first spider with:
cd python123demo
scrapy genspider example example.com
scrapy.cfg        Deployment configuration file for the Scrapy project
__init__.py       Initialization script
items.py          Items code template (inherit the class)
middlewares.py    Middlewares code template (inherit the class)
pipelines.py      Pipelines code template (inherit the class)
settings.py       Scrapy settings file
spiders/          Spiders code template directory (inherit the class)
    __init__.py   Init file, no need to modify
    __pycache__/  Cache directory, no need to modify
D:\>cd \python123demo
D:\python123demo>scrapy genspider demo python123.io
Created spider 'demo' using template 'basic' in module:
python123demo.spiders.demo
D:\python123demo\python123demo\spiders
demo.py
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/']

    def parse(self, response):
        pass
Edit demo.py so that it downloads and saves the demo page:
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        # Save the response body under the last segment of the URL.
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % fname)
D:\python123demo>scrapy crawl demo
The yield keyword
- A generator is a function that keeps producing values
- Any function containing a yield statement is a generator
- A generator produces one value at a time (at the yield statement); the function is then frozen, and produces the next value when resumed
Generator version:
>>> def gen(n):
        for i in range(n):
            yield i ** 2

>>> for i in gen(5):
        print(i, " ", end="")

0 1 4 9 16
Regular version:
>>> def square(n):
        ls = [i ** 2 for i in range(n)]
        return ls

>>> for i in square(5):
        print(i, " ", end="")

0 1 4 9 16
Full version. Because a generator produces values lazily instead of materializing the whole list, a Scrapy parse callback yields its requests and items one at a time:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        # Generator equivalent of the start_urls attribute.
        urls = [
            'http://python123.io/ws/demo.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % fname)
The Request class
class scrapy.http.Request()
- A Request object represents an HTTP request
- Generated by a Spider, executed by the Downloader
.url      The URL of the request
.method   The HTTP method: 'GET', 'POST', etc.
.headers  Dictionary-style request headers
.body     The request body, string type
.meta     User-added extension info, used to pass information between Scrapy components
.copy()   Copy this request
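A small sketch of building a Request by hand (the header and meta values are only illustrative):

import scrapy

req = scrapy.Request(
    url='http://python123.io/ws/demo.html',
    method='GET',
    headers={'User-Agent': 'demo'},
    meta={'page': 1},  # carried through Scrapy and visible on the response
)
print(req.url, req.method)  # http://python123.io/ws/demo.html GET
req2 = req.copy()           # an independent copy of the request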
Response类
calss scrapy.http.Response()
- Response对象表示一个HTTP响应
- 由Downloader生成,由Spider处理
.url Response对应的URL地址
.status HTTP状态码,默认是200
.headers Response对应的头部信息
.body Response对应的内容信息,字符串类型
.flags 一组标记
.request 产生Response类型对应的Request对象
.copy 复制该响应
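A parse callback typically touches these attributes like this (a sketch along the lines of the demo spider above):

def parse(self, response):
    self.log('%s returned %d' % (response.url, response.status))
    if response.status == 200:
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)  # raw payload of the response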
The Item class
class scrapy.item.Item()
- An Item object represents a piece of information extracted from an HTML page
- Generated by a Spider, processed by Item Pipelines
- An Item is dict-like and can be operated on like a dictionary
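For instance, a hypothetical item with two fields, showing the dict-style access:

import scrapy

class StockItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

item = StockItem(name='demo')
item['price'] = '10.5'  # dict-style assignment
print(dict(item))       # {'name': 'demo', 'price': '10.5'}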
CSS Selector
<HTML>.css('a::attr(href)').extract()
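Inside scrapy shell <url> this looks like the following (the output shown is only illustrative):

>>> response.css('a::attr(href)').extract()   # all href attribute values
['http://example.com/page1', 'http://example.com/page2']
>>> response.css('a::text').extract_first()   # text of the first <a> element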
Stock-data Scrapy crawler
East Money stock list: http://quote.eastmoney.com/stocklist.html
Baidu Stocks: https://gupiao.baidu.com/stock/
A single stock: https://gupiao.baidu.com/stock/xxx.html
C:\Windows\System32>cd ..\..\Python3.7.0
C:\Python3.7.0>scrapy startproject BaiduStocks
New Scrapy project 'BaiduStocks', using template directory 'd:\\python\\python37\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Python3.7.0\BaiduStocks
You can start your first spider with:
cd BaiduStocks
scrapy genspider example example.com
C:\Python3.7.0>cd BaiduStocks
C:\Python3.7.0\BaiduStocks>scrapy genspider stocks baidu.com
Created spider 'stocks' using template 'basic' in module:
BaiduStocks.spiders.stocks
CONCURRENT_REQUESTS             Maximum number of requests the Downloader downloads concurrently, default 32
CONCURRENT_ITEMS                Maximum number of items the Item Pipelines process concurrently, default 100
CONCURRENT_REQUESTS_PER_DOMAIN  Maximum number of concurrent requests per target domain, default 8
CONCURRENT_REQUESTS_PER_IP      Maximum number of concurrent requests per target IP, default 0 (only effective when non-zero)
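These options are tuned in the project's settings.py, e.g. (the values here are just the defaults listed above):

# settings.py
CONCURRENT_REQUESTS = 32             # global cap on concurrent downloads
CONCURRENT_ITEMS = 100               # concurrent items in the pipelines
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain cap
CONCURRENT_REQUESTS_PER_IP = 0       # 0 disables the per-IP cap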
# -*- coding: utf-8 -*-
import scrapy
import re


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    start_urls = ['http://quote.eastmoney.com/stocklist.html']

    def parse(self, response):
        # Collect stock codes such as sh600000 / sz000001 from the list page.
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except:
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            # Strip the surrounding <dt>...</dt> / <dd>...</dd> markup.
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except:
                val = '--'
            infoDict[key] = val
        infoDict.update(
            {'股票名称': re.findall(r'\s.*\(', name)[0].split()[0] +
                         re.findall(r'\>.*\<', name)[0][1:-1]})
        yield infoDict
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class BaidustocksPipeline(object):
    def process_item(self, item, spider):
        return item


class BaidustocksInfoPipeline(object):
    def open_spider(self, spider):
        # Called when the spider starts: open the output file.
        self.f = open('BaiduStockInfo.txt', 'w')

    def close_spider(self, spider):
        # Called when the spider finishes: close the output file.
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'BaiduStocks.pipelines.BaidustocksInfoPipeline': 300,
}
On Windows, running the spider also requires the pywin32 bindings:
C:\Python3.7.0\BaiduStocks>pip install pypiwin32
C:\Python3.7.0\BaiduStocks>scrapy crawl stocks