Full-site data crawling in Scrapy via manual request sending
- yield scrapy.Request(url, callback) issues a GET request
    - callback specifies the parse function used to extract the data
- yield scrapy.FormRequest(url, callback, formdata) issues a POST request
    - formdata: a dict holding the request parameters
- Why are the URLs in the start_urls list automatically sent as GET requests?
    - Because the parent-class method start_requests iterates over the list and issues a GET request for each URL
    # Parent-class method: this is roughly its original implementation
    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(url=u, callback=self.parse)
- How can the URLs in start_urls be sent as POST requests by default?
    # Override the parent method so the start URLs are POSTed instead
    def start_requests(self):
        for u in self.start_urls:
            # Note: FormRequest only switches to POST when formdata is supplied
            yield scrapy.FormRequest(url=u, callback=self.parse, formdata={'key': 'value'})  # placeholder parameters
Getting started
Create a Scrapy project: scrapy startproject proName
Enter the project directory and create a spider source file: scrapy genspider spiderName www.xxx.com
Run the project: scrapy crawl spiderName
Configure the pipelines.py file
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
class GpcPipeline:
    def process_item(self, item, spider):
        # Receives every item yielded by the spider; must return the item
        # so any lower-priority pipelines can process it as well.
        return item
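As the template comment above notes, the pipeline only runs once it is registered in settings.py. A minimal sketch, assuming the project is named gpcPro (the module path must match your actual project name):

```python
# settings.py -- enable the pipeline so Scrapy actually calls process_item.
# "gpcPro" is an assumed project name; the integer is the pipeline's priority
# (0-1000, lower values run earlier when several pipelines are enabled).
ITEM_PIPELINES = {
    "gpcPro.pipelines.GpcPipeline": 300,
}
```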