Scrapy components
- Engine: responsible for the overall scheduling and coordination
- Scheduler:
  a. receives Request objects sent over by the engine (originating from the spiders) and stores them
  b. pops Request objects and hands them to the engine (to be passed to the downloader)
- Downloader: receives Request objects from the engine (from the scheduler), sends the network request, gets the response, and hands the response back to the engine (to be passed to the spiders)
- Spiders: receive the Response passed over by the engine (from the downloader) and parse it
  a. extracted data is handed to the engine (to be passed to the pipeline)
  b. extracted URLs are built into new Requests and handed to the engine (to be passed to the scheduler)
- Pipeline: saves the data (a minimal sketch of how this flow looks in spider code follows this list)
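To make the flow concrete, here is a minimal sketch (the spider name and URL are made up for illustration and are not part of the demo project below): yielding a Request sends it through the engine to the scheduler, while yielding an item (or dict) sends it through the engine to the pipeline.
import scrapy

class FlowSketchSpider(scrapy.Spider):       # hypothetical spider, for illustration only
    name = 'flow_sketch'
    start_urls = ['http://example.com/']     # fetched by the downloader via the engine/scheduler

    def parse(self, response):
        # yield data -> engine -> pipeline (saved)
        yield {'url': response.url}
        # yield a new Request -> engine -> scheduler (queued, later downloaded)
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse)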
Usage
1. Installation
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy
pip installs Scrapy's dependencies automatically; the -i mirror (Tsinghua PyPI) just speeds up the download.
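A quick way to check that the installation worked is to run the Scrapy command-line tool:
scrapy version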
2. Create a project
In a terminal (cmd), run: scrapy startproject project_name
Example: scrapy startproject demo
Change into the project directory:
cd project_name
Create a new spider:
scrapy genspider spidername xxx.com
Example:
cd demo
scrapy genspider spider1 xxx.com
demo                        // demo package folder
├── __init__.py
├── items.py                // defines the structure of the scraped data
├── middlewares.py          // middlewares
├── pipelines.py            // pipelines
├── settings.py             // project settings
└── spiders                 // spiders folder
    ├── __init__.py
    └── spider1.py          // the spider created above
Edit the spider1.py file:
# -*- coding: utf-8 -*-
import scrapy


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['xxx.com']     # allowed domains
    start_urls = ['http://xxx.com/']  # start URLs

    def parse(self, response):  # the response downloaded for each start URL is passed to parse
        pass
The initial requests generated from start_urls are sent with Scrapy's default User-Agent; to customize the initial requests (for example to set your own User-Agent), you can override start_requests as shown below.
import scrapy


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['baidu.com']     # allowed domains
    # start_urls = ['http://xxx.com/']  # start URLs (not needed when start_requests is overridden)

    def start_requests(self):  # override the initial requests
        url = 'https://www.baidu.com/'
        # add a User-Agent header
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"}
        # yield the request object
        yield scrapy.Request(url, headers=headers, callback=self.parse)

    def parse(self, response):  # parsing logic for the initial response
        print(response.text)
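As an alternative to passing headers on every request, a project-wide default User-Agent can also be set in settings.py (a sketch; the UA string is just an example):
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'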
Next, let's grab some simple data from the Baidu homepage and save it.
First, whether to obey the site's robots.txt is configured in settings.py (note that if the target site's robots.txt disallows the page you want, Scrapy will filter the request while this is True):
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
Define the fields of the Item that will hold the scraped data in items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DemoItem(scrapy.Item):
    # define the fields for your item here like:
    titles = scrapy.Field()
    # pass
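A DemoItem behaves like a dict, but only the fields declared with scrapy.Field() can be assigned; a quick sketch of the expected behaviour:
item = DemoItem()
item['titles'] = ['a', 'b']   # fine: 'titles' is declared above
# item['other'] = 'x'         # would raise KeyError, because 'other' is not a declared field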
In spider1.py, import the DemoItem class from items.py, parse the response, and store the extracted data in the corresponding field:
# -*- coding: utf-8 -*-
import scrapy
from ..items import DemoItem


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['baidu.com']     # allowed domains
    # start_urls = ['http://xxx.com/']  # start URLs (not needed when start_requests is overridden)

    def start_requests(self):  # override the initial requests
        url = 'https://www.baidu.com/'
        # add a User-Agent header
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"}
        # yield the request object
        yield scrapy.Request(url, headers=headers, callback=self.parse)

    def parse(self, response):  # parsing logic for the initial response
        # extract the hot-search titles from the homepage
        titles = response.xpath('//ul[@id="hotsearch-content-wrapper"]/li/a/span[2]/text()').getall()
        item = DemoItem()
        item['titles'] = titles
        yield item  # hand the item to the engine (and on to the pipeline)
        # print(response.text)
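The XPath above can be developed interactively with scrapy shell before putting it in the spider (note that the page served to Scrapy's default client may differ from what a browser sees, and the hot-search markup may change over time):
scrapy shell "https://www.baidu.com/"
>>> response.xpath('//ul[@id="hotsearch-content-wrapper"]/li/a/span[2]/text()').getall()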
Enable the pipeline in settings.py (uncomment the ITEM_PIPELINES setting):
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'demo.pipelines.DemoPipeline': 300,
}
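The number (300 here) is the pipeline's order: items flow through enabled pipelines from the lowest value to the highest, with values conventionally kept in the 0-1000 range. For example, with a hypothetical second pipeline:
ITEM_PIPELINES = {
    'demo.pipelines.CleanPipeline': 200,   # hypothetical pipeline, would process items first
    'demo.pipelines.DemoPipeline': 300,    # then the pipeline defined below
}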
Define how the data is saved in pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json


class DemoPipeline(object):
    def __init__(self):
        # open the output file once when the pipeline is created
        self.file = open('data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each item as one JSON object per line
        self.file.write(json.dumps(dict(item), ensure_ascii=False, indent='') + ',\n')
        return item

    def __del__(self):
        self.file.close()
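Relying on __del__ to close the file works, but Scrapy pipelines also support open_spider/close_spider hooks, which make the file's lifetime explicit; an equivalent sketch:
import json

class DemoPipeline(object):
    def open_spider(self, spider):
        # called once when the spider is opened
        self.file = open('data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + ',\n')
        return item

    def close_spider(self, spider):
        # called once when the spider is closed
        self.file.close()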
Flow summary
- Override start_requests to build the initial Request objects, specify the callback that will parse their responses, and yield the requests
- Define the parsing logic, store the extracted data in an item, and yield the item back to the engine
- Enable the pipeline in settings.py; the pipeline receives the item objects from the engine and saves them to the file path you defined
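Finally, run the spider from inside the project directory; the scraped titles are written to data.txt:
scrapy crawl spider1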