This chapter mainly introduces how to use the item pipeline. After creating the Scrapy project, open settings.py and add the following line to cut down on excessive log output:
LOG_LEVEL = "WARNING"
Ignore the robots.txt protocol:
ROBOTSTXT_OBEY = False
Add default request headers so the crawler masquerades as a regular browser:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
}
Configure the item pipeline; the number 300 is the pipeline's priority, and lower values run first:
ITEM_PIPELINES = {
    "caipiao.pipelines.CaipiaoPipeline": 300,
}
Open the items.py file and define the item's fields; this is similar to defining the columns of a database table:
import scrapy


class CaipiaoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    qihao = scrapy.Field()       # issue number of the draw
    red_balls = scrapy.Field()   # red balls
    blue_balls = scrapy.Field()  # blue ball
In the spider file, import the item class:
from caipiao.items import CaipiaoItem
Once that is configured, the spider can start filling items. The parse method below scrapes the 双色球 (two-color-ball lottery) results from the page:
def parse(self, response, **kwargs):
    # each <tr> in the chart table is one lottery draw
    trs = response.xpath("//tbody[@id='tdata']/tr")
    for tr in trs:
        qihao = tr.xpath("./td[@align='center']/text()").extract()
        red_balls = tr.xpath("./td[@class='chartBall01']/text()").extract()
        blue_balls = tr.xpath("./td[@class='chartBall02']/text()").extract()
        # pack the extracted lists into an item and hand it to the pipeline
        cai = CaipiaoItem()
        cai["qihao"] = qihao
        cai["red_balls"] = red_balls
        cai["blue_balls"] = blue_balls
        yield cai
Each yielded item is handed to pipelines.py, where you can print it to confirm the data is coming through:
class CaipiaoPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
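Printing is only a sanity check; a pipeline usually persists the data. Below is a minimal sketch of a second pipeline that writes each item to a CSV file (the class name CaipiaoFilePipeline and the filename shuangseqiu.csv are assumptions for illustration, not part of the original project):

class CaipiaoFilePipeline:
    def open_spider(self, spider):
        # runs once when the spider starts: open the output file
        # (the filename is an assumption)
        self.f = open("shuangseqiu.csv", mode="w", encoding="utf-8")

    def close_spider(self, spider):
        # runs once when the spider finishes: release the file handle
        self.f.close()

    def process_item(self, item, spider):
        # the spider stored extract() results, which are lists of strings,
        # so concatenate and join them into one comma-separated line
        line = ",".join(item["qihao"] + item["red_balls"] + item["blue_balls"])
        self.f.write(line + "\n")
        return item

To take effect, this sketch would also need its own entry in ITEM_PIPELINES, for example "caipiao.pipelines.CaipiaoFilePipeline": 301.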
Finally, run the start.py file to launch the crawl.
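The contents of start.py are not shown above; a minimal sketch, assuming the spider's name attribute is "caipiao", uses Scrapy's cmdline module so the crawl can be launched directly from the IDE:

from scrapy import cmdline

# equivalent to running "scrapy crawl caipiao" in the terminal
cmdline.execute("scrapy crawl caipiao".split())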