Pipeline usage
-----------------------------Pipeline usage-------------------------------------------
From the dict form of ITEM_PIPELINES you can see that more than one pipeline can be defined, and Scrapy does indeed support multiple pipelines.
Why multiple pipelines may be needed:
- 1. There may be several spiders, with different pipelines handling the items from different spiders
- 2. The items from one spider may need different processing, e.g. saving into different databases
Notes:
- 1. The smaller a pipeline's weight, the higher its priority
- 2. The method name process_item must not be changed to anything else
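The routing idea behind point 1 can be sketched in plain Python without running Scrapy: each pipeline checks spider.name and only touches items from its own spider, then always returns the item so later pipelines still see it. The class and spider names below are made up for illustration.

```python
# Hedged sketch: two pipelines, each handling items from only one spider.
# In a real project these classes live in pipelines.py and are enabled in
# settings.py via ITEM_PIPELINES (the smaller number runs first):
#
# ITEM_PIPELINES = {
#     'myspider.pipelines.QbPipeline': 300,
#     'myspider.pipelines.OtherPipeline': 400,
# }

class QbPipeline:
    def process_item(self, item, spider):
        if spider.name == 'qb':       # only touch items from the 'qb' spider
            item['source'] = 'qb'
        return item                   # always return the item for later pipelines

class OtherPipeline:
    def process_item(self, item, spider):
        if spider.name == 'other':
            item['source'] = 'other'
        return item

class FakeSpider:                     # stand-in for a real Spider object
    name = 'qb'

# Simulate what the engine does: pass the item through pipelines in weight order.
item = {'content': 'haha'}
for pipeline in (QbPipeline(), OtherPipeline()):
    item = pipeline.process_item(item, FakeSpider())
print(item)  # {'content': 'haha', 'source': 'qb'}
```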
------------------------Using the spider module-----------------------------------------
import scrapy
import logging

logger = logging.getLogger(__name__)

class QbSpider(scrapy.Spider):
    name = 'qb'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        for i in range(10):
            item = {}
            item['content'] = "haha"
            # logging.warning(item)
            logger.warning(item)
            yield item
------------------------The pipeline file---------------------------------
import logging

logger = logging.getLogger(__name__)

class MyspiderPipeline(object):
    def process_item(self, item, spider):
        # print(item)
        logger.warning(item)
        item['hello'] = 'world'
        return item
To save the log to a local file, set LOG_FILE = './log.log' in the settings file.
basicConfig format options:
- https://www.cnblogs.com/felixzh/p/6072417.html
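As a quick sketch of those style options (the format string and handler choice below are just one possibility, not from the link), a logging.Formatter can be attached to any handler; here a StringIO handler is used so the result can be inspected:

```python
import io
import logging

# Capture log output in a string buffer instead of stderr, so we can see
# exactly what the chosen format string produces.
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter('%(name)s %(levelname)s: %(message)s'))

logger = logging.getLogger('demo')   # 'demo' is an arbitrary example name
logger.addHandler(handler)
logger.setLevel(logging.WARNING)

logger.warning('something happened')
print(buf.getvalue().strip())  # demo WARNING: something happened
```

The same format string can be passed to logging.basicConfig(format=...), together with filename='./log.log' to write to a file, which is the stdlib analogue of Scrapy's LOG_FILE setting.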
The pipeline file
- In the spider file, use the yield keyword to hand the data to the pipeline
- The pipeline must be enabled in the settings file
。open_spider() runs when the spider starts; the method name must not change
。close_spider() runs when the spider closes
The yield keyword
- Generators save memory
- Flexible to use
。After return, the function ends; after yield, execution can resume and continue with the statements that follow
。yield an item to send it to the pipeline; for paging, yield a scrapy.Request object, which the engine receives and hands to the scheduler
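The return-versus-yield point can be seen in plain Python:

```python
def with_return():
    for i in range(3):
        return i          # the function exits on the very first iteration

def with_yield():
    for i in range(3):
        yield i           # the function pauses here and resumes on the next request

print(with_return())        # 0 -- only the first value
print(list(with_yield()))   # [0, 1, 2] -- all values, produced lazily
```

This is why a Scrapy parse method can yield many items (and many Requests) from a single call, with each value handed to the engine as it is produced.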
item
Note:
- For xxx = scrapy.Field(), the variable xxx is not a plain dict object
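This is why the pipelines below call dict(item) before json.dumps: a scrapy.Item is dict-like but not a dict. A minimal stand-in (an assumption, mimicking the behaviour without installing scrapy) shows how dict() consumes any object that exposes keys() and __getitem__:

```python
import json

# FakeItem is a hypothetical stand-in for scrapy.Item: dict-like, but not a dict.
class FakeItem:
    def __init__(self, **fields):
        self._values = fields

    def keys(self):                    # dict() uses keys() + __getitem__
        return self._values.keys()

    def __getitem__(self, key):
        return self._values[key]

item = FakeItem(title='test', author='someone')
# dict(item) converts the dict-like object into a real dict for serialization:
print(json.dumps(dict(item), ensure_ascii=False))  # {"title": "test", "author": "someone"}
```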
How to go to the next page
--------------------------------Prerequisites------------------------------------------
scrapy.Request essentials
- scrapy.Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None)
Commonly used parameters:
- callback: specifies which parse function the response for this URL is handed to
- meta: passes data between different parse functions; meta also carries some information by default, such as download delay and request depth
- dont_filter: keeps Scrapy's deduplication from filtering the current URL; Scrapy deduplicates URLs by default, so this is important for URLs that must be requested repeatedly
Introduction to item and its usage
items.py
import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    position = scrapy.Field()
    date = scrapy.Field()
Understanding Scrapy log output
Scrapy settings explained
The settings file holds shared variables (such as database addresses, usernames and passwords) so that you and others can change them easily. Variable names are usually ALL CAPS, e.g. SQL_HOST = '192.168.0.1'
Detailed settings reference: https://www.cnblogs.com/cnkai/p/7399573.html
Gushiwen (classical poetry site) paging
-------------------------------------Analyzing the site pages-------------------------------------
- Step 1: create the Scrapy project
scrapy startproject gsw
- Step 2: create the spider
cd gsw
scrapy genspider gs gushiwen.org
- Step 3: basic configuration in settings
LOG_LEVEL = 'WARNING'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'gsw.pipelines.GswPipeline': 300,
}
- Step 4: page analysis
。Required fields: title, dynasty, author, content, plus paging
https://www.gushiwen.org/default_1.aspx  page 1
https://www.gushiwen.cn/default_2.aspx   page 2
https://www.gushiwen.cn/default_3.aspx   page 3
gushiwen.org gushiwen.cn
From the page structure, all the content sits inside the element with class="left", and each poem inside an element with class="sons"
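That structural analysis can be checked with nothing but the stdlib html.parser; the HTML below is a made-up stand-in for the real page, kept only to mirror the class="left" / class="sons" nesting described above:

```python
from html.parser import HTMLParser

# Hypothetical sample mirroring the page structure: one "left" container,
# one "sons" block per poem.
SAMPLE = """
<div class="left">
  <div class="sons"><b>Poem 1</b></div>
  <div class="sons"><b>Poem 2</b></div>
</div>
"""

class SonsCounter(HTMLParser):
    """Count elements whose class attribute contains 'sons'."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "sons" in classes.split():
            self.count += 1

parser = SonsCounter()
parser.feed(SAMPLE)
print(parser.count)  # 2
```

In the actual project the same selection is done with response.xpath, which is far more convenient; this is only a sanity check of the page model.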
- Step 5: save the data (in the pipeline)
import json

class GswPipeline:
    def open_spider(self, spider):
        self.fp = open('gsw.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
- Step 6: paging
Step 1: find the URL of the next page
Step 2: yield scrapy.Request(url)
Handling the next page:
next_href = response.xpath('//a[@id="amore"]/@href').extract_first()
if next_href:
    next_url = response.urljoin(next_href)  # complete the URL: 1. concatenation, 2. urljoin()
    request = scrapy.Request(next_url)
    yield request
-------------------------------------------Code-------------------------------------
-------Spider code
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CgsSpider(CrawlSpider):
    name = 'cgs'
    allowed_domains = ['gushiwen.org', 'gushiwen.cn']
    start_urls = ['http://gushiwen.org/default_1.aspx']

    # Rule is a class that defines the rules for extracting URLs.
    # LinkExtractor is the link extractor.
    # allow=r'Items/' holds the URL pattern (regex, important); callback is the
    # callback function; follow=True keeps following links (e.g. to the next page).
    rules = (
        # list pages --> once the pattern uses the .cn domain, data keeps flowing in
        Rule(LinkExtractor(allow=r'https://www.gushiwen.cn/default_\d+.aspx'), follow=True),
        # detail pages
        Rule(LinkExtractor(allow=r'https://so.gushiwen.cn/shiwenv_\w+.aspx'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        # item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        # item['name'] = response.xpath('//div[@id="name"]').get()
        # item['description'] = response.xpath('//div[@id="description"]').get()
        content = response.xpath('//div[@class="contyishang"]/p/text()').extract()
        detail = ''.join(content).strip()
        item['detail_content'] = detail
        print(item)
        return item
-------Scrapy pipeline------
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json

class GswPipeline:
    def open_spider(self, spider):
        self.fp = open('gsw.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        # item here is an object, so it must be converted before use
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    # runs when the spider closes; the name close_spider(self, spider) is fixed
    # and must not be misspelled
    def close_spider(self, spider):
        self.fp.close()
-----Scrapy items module----
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class GswItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()            # title
    dynasty = scrapy.Field()          # dynasty
    author = scrapy.Field()           # author
    content = scrapy.Field()          # content
    detail_href = scrapy.Field()      # URL of the detail page
    detail_content = scrapy.Field()   # translation and annotations
-----Scrapy settings module
# Scrapy settings for gsw project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'gsw'
SPIDER_MODULES = ['gsw.spiders']
NEWSPIDER_MODULE = 'gsw.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'gsw (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL='WARNING'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'gsw.middlewares.GswSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'gsw.middlewares.GswDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'gsw.pipelines.GswPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
------Scrapy start module------
from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'gs'])
--------------------------------------------End-------------------------------------
Summary:
- When the domains differ, add each one, e.g. allowed_domains = ['gushiwen.org', 'gushiwen.cn']; you can also allow one broader domain, depending on the situation
- Handling empty lists: add a non-empty check (see the Douban example). In this case the scraped data is saved directly, so a try...except statement is used
- In the pipeline, watch the item object: if it was not defined in items.py it is just a dict; otherwise it is not a dict object and needs converting
- Paging
。 1. Look for a pattern in the page numbers
。 2. Find the URL of the next page, then yield scrapy.Request(url)
- When a URL is incomplete, there are two ways to complete it:
。 1. String concatenation
。 2. urljoin()
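The two URL-completion approaches above can be compared with the stdlib urljoin (Scrapy's response.urljoin delegates to it with the response URL as the base):

```python
from urllib.parse import urljoin

base = 'https://www.gushiwen.cn/default_1.aspx'   # the current page's URL

# Method 1: string concatenation (only works if you know the exact prefix)
print('https://www.gushiwen.cn' + '/default_2.aspx')

# Method 2: urljoin resolves a relative href against the current page's URL
print(urljoin(base, '/default_2.aspx'))  # https://www.gushiwen.cn/default_2.aspx
```

urljoin is the safer choice because it also handles relative paths like 'default_2.aspx' and hrefs that are already absolute.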
-----------------------------------------------------------------------------------
Scraping Tencent job postings
Requirements:
- Scrape the job titles and details from Tencent's careers site
- And handle paging
- Step 1: create the Scrapy project
scrapy startproject tencent
- Step 2: create the spider
cd tencent
scrapy genspider hr careers.tencent.com
- Step 3: basic configuration in settings
LOG_LEVEL = 'WARNING'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}
- Step 4: page analysis
- Start URL:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1611984804055&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
- Detail URL, which must be built from data extracted from the list URL:
https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1611992149963&postId=1&language=zh-cn
Extracting the job information from the page:
- item['job_name'] = job['RecruitPostName']  # job title
  post_id = job['PostId']
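Since the list URL is a JSON API, the extraction is json.loads plus dict lookups. The payload below is a made-up sample that only mirrors the field names used in these notes (Data, Posts, RecruitPostName, PostId):

```python
import json

# Hypothetical sample payload mirroring the shape of the jobs API response.
sample = '''
{"Data": {"Posts": [
    {"RecruitPostName": "Backend Engineer", "PostId": "1001"},
    {"RecruitPostName": "Data Analyst", "PostId": "1002"}
]}}
'''

data = json.loads(sample)
for job in data['Data']['Posts']:
    item = {'job_name': job['RecruitPostName']}  # job title
    post_id = job['PostId']                      # used to build the detail URL
    print(item, post_id)
```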
- Step 5: save in the pipeline
import json

class TencentPipeline:
    def open_spider(self, spider):
        self.fp = open('wx.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        # item here is an object, so it must be converted before use
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
- Step 6: paging
# paging
for page in range(2, 20):
    url = self.one_url.format(page)
    yield scrapy.Request(url=url)
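Because this API pages via a pageIndex query parameter, paging is just string formatting over the URL template. The sketch below uses a trimmed-down version of the real URL to keep it readable:

```python
# The list URL has a pageIndex={} placeholder; .format(page) fills it in.
# (Trimmed-down stand-in for the full careers.tencent.com query URL.)
one_url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
           '?pageIndex={}&pageSize=10')

urls = [one_url.format(page) for page in range(2, 5)]
for u in urls:
    print(u)
```

In the spider each of these URLs is wrapped in scrapy.Request and yielded, so the scheduler fetches every page.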
---------------------------------Code-----------------------------------------------
------Spider code
import scrapy
import json
from 爬虫.Day20.demo1.tencent.tencent.items import TencentItem

class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1611984804055&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    # URL of the detail page
    detail_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1611992149963&postId={}&language=zh-cn'
    start_urls = [one_url.format(1)]

    def parse(self, response):
        # parse the data
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = TencentItem()
            item['job_name'] = job['RecruitPostName']  # job title
            post_id = job['PostId']
            d_url = self.detail_url.format(post_id)
            # request the detail page
            yield scrapy.Request(
                url=d_url,
                callback=self.detail_content,
                meta={'item': item}  # pass the data along
            )

    def detail_content(self, response):
        # how to receive the passed data
        # item = response.meta['item']   # option 1
        item = response.meta.get('item')  # fetch the passed data
        data = json.loads(response.text)
        item['job_duty'] = data['Data']['Responsibility']
        yield item
        # paging
        for page in range(2, 20):
            url = self.one_url.format(page)
            yield scrapy.Request(url=url)

"""
scrapy.Request()
    url       the URL to request next
    callback  the callback function, which carries the spider's logic
    meta      passes data between different parse functions
"""
---------Pipeline code
import json

class TencentPipeline:
    def open_spider(self, spider):
        self.fp = open('wx.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        # item here is an object, so it must be converted before use
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
----Scrapy items module
import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_name = scrapy.Field()  # job title
    job_duty = scrapy.Field()  # job responsibilities
------Scrapy settings module
BOT_NAME = 'tencent'
SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'
LOG_LEVEL='WARNING'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tencent (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'tencent.middlewares.TencentSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'tencent.middlewares.TencentDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'tencent.pipelines.TencentPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Summary
scrapy.Request()
    url       the URL to request next
    callback  the callback function, which carries the spider's logic
    meta      passes data between different parse functions

yield scrapy.Request(
    url=d_url,                     # request to the detail page
    callback=self.detail_content,  # the function the data is passed to
    meta={'item': item}            # pass the data along
)

for page in range(2, 20):          # request the following pages
    url = self.one_url.format(page)
    yield scrapy.Request(url=url)