Introduction to the Feapder Framework
feapder is a simple, fast, lightweight spider framework. The name is an abbreviation of fast, easy, air, pro, spider. Built over four years with the goals of rapid development, fast crawling, simple usage, and powerful features, it supports lightweight spiders, distributed spiders, batch spiders, and spider integration, along with a complete spider alerting mechanism.
Framework Flow Diagram
Module descriptions:
- spider: the framework's scheduling core
- parser_control: the parser controller, responsible for scheduling parsers
- collector: the task collector; pulls tasks from the task queue into memory in batches, reducing how often and how concurrently the spider hits the task-queue database
- parser: the data parser
- start_request: the function that issues the initial tasks
- item_buffer: the data buffer queue; writes data to the database in batches
- request_buffer: the request buffer queue; writes request tasks to the task queue in batches
- request: the downloader; wraps the requests library to fetch data from the internet
- response: the wrapped response; supports xpath, css, and re extraction and handles garbled Chinese encodings automatically
Flow description:
- spider schedules start_request to produce the seed tasks
- start_request sends the tasks down to request_buffer
- spider schedules request_buffer to write the tasks to the task-queue database in batches
- spider schedules collector to pull tasks from the task queue into an in-memory queue in batches
- spider schedules parser_control to take tasks from collector's in-memory queue
- parser_control schedules request to fetch the data
- request requests and downloads the data
- request hands the downloaded data to response for further wrapping
- the wrapped response is returned to parser_control (the diagram shows several parser_control instances, i.e. multiple threads)
- parser_control schedules the matching parser to parse the response (the multiple parser groups in the diagram are parsers for different sites)
- parser_control dispatches the items the parser extracted, plus any newly produced requests, to item_buffer and request_buffer
- spider schedules item_buffer and request_buffer to write their contents to the database in batches
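To make the loop concrete, here is a minimal sketch (my own illustration, not from the feapder docs): everything yielded from start_requests or parse flows back through the buffers described above.

import feapder


class FlowDemo(feapder.AirSpider):
    def start_requests(self):
        # seed tasks: routed through request_buffer into the task queue
        yield feapder.Request("https://www.baidu.com")

    def parse(self, request, response):
        # parser_control hands us the wrapped response; extract something
        title = response.xpath("//title/text()").extract_first()
        print(title)
        # yielding another Request here would send it back to
        # request_buffer, closing the loop in the diagram above


if __name__ == "__main__":
    FlowDemo().start()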
Installing feapder
pip3 install feapder
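To confirm the install, pip itself can show the package metadata:

pip3 show feapder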
Creating the Project
feapder create -p Amazon-spider
Amazon-spider 项目生成成功 (project created successfully)
I use PyCharm, so after creating the project I right-click the project name and mark it as a source root (Mark Directory as -> Sources Root) so that imports resolve from the project directory.
Creating the Spider
cd Amazon-spider/spiders
feapder create -s list_spider
ListSpider 生成成功 (spider created successfully)
The generated code looks like this:
import feapder


class ListSpider(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://www.baidu.com")

    def parse(self, request, response):
        print(response)


if __name__ == "__main__":
    ListSpider().start()
Writing the Spider
Issue the seed task:
def start_requests(self):
    url = "https://www.amazon.com/s?k=power+strips"
    yield feapder.Request(url, render=True)
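render=True tells feapder to fetch the page with a real browser instead of plain requests, which matters on JavaScript-heavy pages like Amazon search results. Rendering is configured through the WEBDRIVER block in setting.py; the keys below come from feapder's default settings template, so check them against your installed version:

# setting.py -- browser rendering used when render=True
WEBDRIVER = dict(
    pool_size=1,           # number of browser instances
    load_images=False,     # skip images to speed rendering up
    headless=True,         # no visible browser window
    driver_type="CHROME",  # requires a matching chromedriver on PATH
    timeout=30,            # seconds to wait for a page load
)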
Writing the parse function
Inspect the page structure and write the XPath selectors:
def parse(self, request, response):
    # locate the list of result cards on the search page
    DivShopS = response.xpath('//*[@id="search"]/div[1]/div/div[1]/div/span[3]/div[2]/div')
    print('DivShopS -->>> ', len(DivShopS))
    for DivShop in DivShopS:
        # title and review-count nodes, relative to each card
        ShopTitle = DivShop.xpath('./div/span/div/div/div[2]/div[2]/div/div[1]/div/div/div[1]/h2/a/span')
        ShopComNum = DivShop.xpath('./div/span/div/div/div[2]/div[2]/div/div[1]/div/div/div[2]/div/span[2]/a/span')
        print('ShopComNum -->>> ', ShopComNum, ShopTitle)
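A side note, not from the original post: absolute XPaths like these break whenever Amazon shuffles a div. Amazon tags each search result card with a data-component-type attribute, so a sturdier alternative (an assumption about Amazon's current markup; verify against the live page) looks like:

def parse(self, request, response):
    # each result card carries data-component-type="s-search-result"
    results = response.xpath('//div[@data-component-type="s-search-result"]')
    for result in results:
        # relative query inside the card; the h2 holds the product title
        title = result.xpath('.//h2//span/text()').extract_first()
        print(title)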
Storing the Data
Create the table amazon_shop_list:
CREATE TABLE `amazon_shop_list` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`ShopTitle` varchar(255) DEFAULT NULL,
`ShopComNum` varchar(255) DEFAULT NULL,
`ShopUrl` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Storage options
There are several ways to get the data into the database:
- Import pymysql directly and build the INSERT statements yourself.
- Use the framework's built-in MysqlDB helper (a minimal sketch follows this list).
- Use feapder's most convenient option: automatic storage.
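For the second option, a minimal sketch of the MysqlDB route; MysqlDB() reads the MYSQL_* settings from setting.py, and add_smart builds an INSERT from a dict (method name per feapder's docs; double-check it against your installed version):

from feapder.db.mysqldb import MysqlDB

db = MysqlDB()  # connection info comes from the MYSQL_* values in setting.py
# add_smart generates the INSERT statement from the dict keys and values
db.add_smart(
    "amazon_shop_list",
    {"ShopTitle": "demo", "ShopComNum": "1,234", "ShopUrl": "https://example.com"},
)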
AirSpider does not support automatic storage: to keep it lightweight, the author has left that feature out of it for now.
The distributed Spider does support it, so we just change the base class from AirSpider to Spider:
class ListSpider(feapder.AirSpider):
becomes
class ListSpider(feapder.Spider):
At the same time, set up the Redis connection and the seed request; the seed request is just the start_requests shown earlier, and the Redis settings go into setting.py as sketched below.
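A minimal Redis block for setting.py, assuming a local Redis instance; the setting names come from feapder's default settings template:

# setting.py -- Redis task queue used by feapder.Spider
REDISDB_IP_PORTS = "localhost:6379"  # host:port of the Redis instance
REDISDB_USER_PASS = ""               # password; leave empty if none
REDISDB_DB = 0                       # Redis database index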
Generating the Item
The item maps one-to-one to a database table and is what drives automatic storage; it can be generated with the feapder CLI.
First, configure the database connection info in setting.py:
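A minimal connection block, with placeholder values to replace; the setting names are from feapder's default settings template:

# setting.py -- MySQL connection used for item storage
MYSQL_IP = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "crawler_project"
MYSQL_USER_NAME = "root"
MYSQL_USER_PASS = "your_password"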
Then generate the item:
cd items
feapder create -i amazon_shop_list
MainThread|2021-03-25 13:04:20,004|mysqldb.py|__init__|line:91|DEBUG| 连接到mysql数据库 122.51.82.213 : crawler_project
amazon_shop_list_item.py 生成成功
(The log shows feapder connecting to MySQL and generating amazon_shop_list_item.py successfully.)
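The generated file should look roughly like this (a sketch based on what feapder's item generator produces from the table columns; the exact header comment may differ):

from feapder import Item


class AmazonShopListItem(Item):
    """
    Generated by: feapder create -i amazon_shop_list
    """

    def __init__(self, *args, **kwargs):
        # id is auto-increment, so it is left out
        self.ShopTitle = None
        self.ShopComNum = None
        self.ShopUrl = None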
Automatic storage
Return the data to the framework with yield item, and the framework batch-inserts it automatically:
def parse(self, request, response):
    # locate the list of result cards on the search page
    DivShopS = response.xpath('//*[@id="search"]/div[1]/div[2]/div/span[3]/div[2]/div')
    print('DivShopS -->>> ', len(DivShopS))
    for DivShop in DivShopS:
        Shoplist_item = amazon_shop_list_item.AmazonShopListItem()
        ShopTitle = DivShop.xpath('./div/span/div/div/div[2]/div[2]/div/div[1]/div/div/div[1]/h2/a/span/text()').extract_first()
        ShopComNum = DivShop.xpath('./div/span/div/div/div[2]/div[2]/div/div[1]/div/div/div[2]/div/span[2]/a/span/text()').extract_first()
        print('ShopComNum -->>> ', ShopComNum, ShopTitle)
        Shoplist_item.ShopComNum = ShopComNum
        Shoplist_item.ShopTitle = ShopTitle
        Shoplist_item.ShopUrl = response.url
        yield Shoplist_item  # yield it straight back; the framework batch-inserts
Full Code
import feapder

from items import amazon_shop_list_item


class ListSpider(feapder.Spider):
    # seed request
    def start_requests(self):
        url = "https://www.amazon.com/s?k=power+strips"
        headers = {
            'Connection': 'keep-alive',
            'Pragma': 'no-cache',
            'Cache-Control': 'no-cache',
            'rtt': '50',
            'downlink': '10',
            'ect': '4g',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-User': '?1',
            'Sec-Fetch-Dest': 'document',
            'Referer': 'https://www.amazon.cn/s?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&i=aps&k=power%20strip&ref=nb_sb_noss_1&url=search-alias%3Daps',
            'Accept-Language': 'zh-CN,zh;q=0.9',
        }
        # yield feapder.Request(url, headers=headers)  # without browser rendering
        yield feapder.Request(url, headers=headers, render=True)

    def parse(self, request, response):
        # locate the list of result cards on the search page
        DivShopS = response.xpath('//*[@id="search"]/div[1]/div[2]/div/span[3]/div[2]/div')
        print('DivShopS -->>> ', len(DivShopS))
        for DivShop in DivShopS:
            Shoplist_item = amazon_shop_list_item.AmazonShopListItem()
            ShopTitle = DivShop.xpath('./div/span/div/div/div[2]/div[2]/div/div[1]/div/div/div[1]/h2/a/span/text()').extract_first()
            ShopComNum = DivShop.xpath('./div/span/div/div/div[2]/div[2]/div/div[1]/div/div/div[2]/div/span[2]/a/span/text()').extract_first()
            print('ShopComNum -->>> ', ShopComNum, ShopTitle)
            Shoplist_item.ShopComNum = ShopComNum
            Shoplist_item.ShopTitle = ShopTitle
            Shoplist_item.ShopUrl = response.url
            yield Shoplist_item  # yield it straight back; the framework batch-inserts


if __name__ == "__main__":
    spider = ListSpider(redis_key="amazon:shop_list")
    spider.start()
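Run the file directly (python list_spider.py). Since the class now inherits from feapder.Spider, redis_key names the Redis key prefix under which this spider's tasks are queued, and the scraped rows are flushed to the amazon_shop_list table in batches.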