Scrapy -- 2: Pipelines

Using pipelines

----------------------------- Using pipelines -------------------------------------------
------ The dict form of ITEM_PIPELINES shows that there can be more than one pipeline, and indeed multiple pipelines can be defined ------

Why multiple pipelines may be needed:
  • 1. There may be several spiders, and different pipelines can handle the items of different spiders
  • 2. The items of one spider may need different processing, e.g. being stored in different databases (see the sketch after this list)
Note:
  • 1. The smaller a pipeline's weight (the number in ITEM_PIPELINES), the higher its priority
  • 2. The process_item method of a pipeline must not be renamed
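
A minimal sketch of how this looks in practice (the pipeline names CleanPipeline and SavePipeline and the module path are made up for illustration; the 'qb' spider and its 'content' field come from the example below): register both pipelines in settings.py, let the smaller number run first, and branch on spider.name inside process_item.

# settings.py -- the pipeline with the smaller number runs first
ITEM_PIPELINES = {
    'myspider.pipelines.CleanPipeline': 300,   # runs first
    'myspider.pipelines.SavePipeline': 400,    # runs second
}

# pipelines.py
class CleanPipeline:
    def process_item(self, item, spider):      # the method name must stay process_item
        if spider.name == 'qb':                # only touch items coming from the 'qb' spider
            item['content'] = item['content'].strip()
        return item                            # always return the item so the next pipeline receives it

class SavePipeline:
    def process_item(self, item, spider):
        # e.g. write to a different store depending on spider.name
        return item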

------------------------ Using the spider module -----------------------------------------

import scrapy 
import logging

logger = logging.getLogger(__name__) 
class QbSpider(scrapy.Spider): 
	name = 'qb' 
	allowed_domains = ['qiushibaike.com'] 
	start_urls = ['http://qiushibaike.com/'] 
	def parse(self, response): 
		for i in range(10): 
			item = {} 
			item['content'] = "haha" 
			# logging.warning(item) 
			logger.warning(item) 
			yield item

------------------------ The pipeline file ---------------------------------

import logging  
logger = logging.getLogger(__name__) 
class MyspiderPipeline(object): 
	def process_item(self, item, spider): 
		# print(item) 
		logger.warning(item) 
		item['hello'] = 'world' 
		return item

To save the log to a local file, set LOG_FILE = './log.log' in the settings file.

Setting the log format with logging.basicConfig:

  • https://www.cnblogs.com/felixzh/p/6072417.html
The pipeline file
  • In the spider, use the yield keyword to hand data to the pipeline
  • The pipeline must be enabled in the settings file
    。open_spider(): runs when the spider starts; the method name cannot be changed
    。close_spider(): runs when the spider finishes
The yield keyword
  • A generator uses less memory
  • It is flexible to use
    。return ends the function as soon as it runs, whereas yield can hand back a value and then continue with the rest of the work (see the small example below)
    。yield an item to send it to the pipeline; for pagination, yield a scrapy.Request object, which is handed to the engine and then passed on by the engine to the scheduler
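
A tiny illustration of the return-versus-yield difference (plain Python, nothing Scrapy-specific):

def with_return():
    for i in range(3):
        return i                 # the function ends on the first pass

def with_yield():
    for i in range(3):
        yield i                  # pauses here and resumes on the next request for a value

print(with_return())             # 0
print(list(with_yield()))        # [0, 1, 2]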
Item

Note:

  • xxx = scrapy.Field() — an item built from fields defined this way is not a plain dict object (see the sketch below)
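
A short hedged sketch (field names borrowed from the TencentItem further down): an Item defined with scrapy.Field() supports dict-style access, but it is not a plain dict, so it normally has to be converted with dict(item) before things like json.dumps:

import json
import scrapy

class TencentItem(scrapy.Item):
    title = scrapy.Field()
    position = scrapy.Field()
    date = scrapy.Field()

item = TencentItem()
item['title'] = 'engineer'        # dict-style access works
# item['salary'] = 1              # would raise KeyError: only declared fields are allowed
print(json.dumps(dict(item)))     # convert to a real dict before serializing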


How to paginate

-------------------------------- Prerequisites ------------------------------------------

Key points about scrapy.Request
  • scrapy.Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None)

Commonly used parameters (a combined sketch follows the list):

  • callback: specifies which parse function handles the response for the given URL
  • meta: passes data between different parse functions; meta also carries some information by default, such as the download latency and the request depth
  • dont_filter: stops Scrapy's deduplication from filtering out the current URL; Scrapy deduplicates URLs by default, so this matters for URLs that need to be requested repeatedly
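
A hedged sketch of how these three parameters are typically combined (the URLs and method names here are placeholders, not part of any project in this post):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com/list']

    def parse(self, response):
        item = {'title': response.xpath('//h1/text()').extract_first()}
        yield scrapy.Request(
            url='https://example.com/detail/1',
            callback=self.parse_detail,   # which parse function handles the response
            meta={'item': item},          # pass data on to that function
            dont_filter=True,             # do not let the duplicate filter drop this URL
        )

    def parse_detail(self, response):
        item = response.meta.get('item')  # pick the passed data back up
        item['body'] = response.xpath('//p/text()').extract_first()
        yield item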
Introduction to Item and how to use it
items.py 
import scrapy 
class TencentItem(scrapy.Item): 
	# define the fields for your item here like: 
	title = scrapy.Field() 
	position = scrapy.Field() 
	date = scrapy.Field()
Understanding Scrapy's log output


Scrapy settings: explanation and configuration

The settings file stores shared variables (such as a database address, username and password) so that you and others can change them in one place. Variable names are usually written in all uppercase, e.g. SQL_HOST = '192.168.0.1'.

Detailed information about the settings file: https://www.cnblogs.com/cnkai/p/7399573.html
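
For example (SQL_HOST is the made-up variable from above; inside a spider it can be read back through self.settings):

# settings.py
SQL_HOST = '192.168.0.1'

# somewhere in a spider
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def parse(self, response):
        host = self.settings.get('SQL_HOST')          # read the shared configuration value
        self.logger.warning('connecting to %s', host)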

-----

Gushiwen pagination (gushiwen.org)

------------------------------------- Analyzing the site pages -------------------------------------

  • Step 1: Create the Scrapy project

scrapy startproject gsw


  • Step 2: Create the spider

cd gsw
scrapy genspider gs gushiwen.org


  • Step 3: Basic configuration in settings

LOG_LEVEL = 'WARNING'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'gsw.pipelines.GswPipeline': 300,
}


  • Step 4: Page analysis

。Requirements: title, dynasty, author, content, pagination

https://www.gushiwen.org/default_1.aspx   (page 1)
https://www.gushiwen.cn/default_2.aspx   (page 2)
https://www.gushiwen.cn/default_3.aspx   (page 3)
Note the two domains: gushiwen.org and gushiwen.cn.

From the page structure, all of the content sits inside the element with class='left', and each poem sits inside an element with class='sons' (see the sketch below).
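A hedged sketch of what the gs spider's parse method could look like for this structure; the "source" and "contson" class names in the XPath below are assumptions based on the analysis above rather than verified selectors, the "amore" link id comes from Step 6 below, and GswItem is the item class defined later in this post:

import scrapy
from gsw.items import GswItem

class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.org', 'gushiwen.cn']
    start_urls = ['https://www.gushiwen.org/default_1.aspx']

    def parse(self, response):
        # every poem sits in a div with class "sons" inside the "left" column
        for poem in response.xpath('//div[@class="left"]/div[@class="sons"]'):
            item = GswItem()
            item['title'] = poem.xpath('.//b/text()').extract_first()
            # author and dynasty usually share one line under the title
            source = poem.xpath('.//p[@class="source"]/a/text()').extract()
            item['author'] = source[0] if source else None
            item['dynasty'] = source[1] if len(source) > 1 else None
            item['content'] = ''.join(
                poem.xpath('.//div[@class="contson"]//text()').extract()).strip()
            yield item

        # pagination: follow the "next page" link (see Step 6 below)
        next_href = response.xpath('//a[@id="amore"]/@href').extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href))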


  • Step 5: Save the data (in the pipeline)

import json

class GswPipeline:
    def open_spider(self, spider):
        self.fp = open('gsw.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()

  • Step 6: Pagination

First, find the URL of the next page.
Second, yield a scrapy.Request for it.
Handling the pagination:

next_href = response.xpath('//a[@id="amore"]/@href').extract_first()
if next_href:
    next_url = response.urljoin(next_href)  # complete the relative URL: 1) string concatenation 2) urljoin()
    request = scrapy.Request(next_url)
    yield request



------------------------------------------- Code -------------------------------------

------- Spider code

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CgsSpider(CrawlSpider):
    name = 'cgs'
    allowed_domains = ['gushiwen.org','gushiwen.cn']
    start_urls = ['http://gushiwen.org/default_1.aspx']
    # Rule is a class that defines a rule for extracting URLs
    # LinkExtractor is the link extractor
    # allow=r'Items/' holds the URL pattern (a regex, important); callback is the callback function; follow=True keeps following links (e.g. to the next page)
    rules = (
        # list pages ---> once the list-page pattern uses the .cn domain, data keeps coming in
        Rule(LinkExtractor(allow=r'https://www.gushiwen.cn/default_\d+.aspx'),follow=True),
        # detail pages
        Rule(LinkExtractor(allow=r'https://so.gushiwen.cn/shiwenv_\w+.aspx'), callback='parse_item')
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        content = response.xpath('//div[@class="contyishang"]/p/text()').extract()
        detail = ''.join(content).strip()
        item['detail_content'] = detail
        print(item)
        return item



------- Scrapy pipeline ------

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json


class GswPipeline:
    def open_spider(self,spider):
        self.fp=open('gsw.txt','w',encoding='utf-8')


    def process_item(self, item, spider):
        #print(item)
        # item here is an Item object, so it must be converted to a dict before use
        item_json=json.dumps(dict(item),ensure_ascii=False)
        self.fp.write(item_json+'\n')
        return item

    def close_spider(self, spider):  # runs when the spider closes; the method name is fixed and must not be misspelled
        self.fp.close()



----- Scrapy items module ----

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class GswItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()            # title
    dynasty = scrapy.Field()          # dynasty
    author = scrapy.Field()           # author
    content = scrapy.Field()          # content
    detail_href = scrapy.Field()      # URL of the detail page
    detail_content = scrapy.Field()   # translation and annotations




----- Scrapy settings module

# Scrapy settings for gsw project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'gsw'

SPIDER_MODULES = ['gsw.spiders']
NEWSPIDER_MODULE = 'gsw.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'gsw (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL='WARNING'


# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)      Chrome/86.0.4240.111 Safari/537.36",
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',

}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'gsw.middlewares.GswSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'gsw.middlewares.GswDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'gsw.pipelines.GswPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'



------ Scrapy launcher module -- start.py

from scrapy import cmdline
cmdline.execute(['scrapy','crawl','gs'])

-------------------------------------------- End -------------------------------------

Summary:

  • When the domains differ, add the extra one, e.g. allowed_domains = ['gushiwen.org', 'gushiwen.cn']; you can also allow one broader domain, depending on the situation
  • How to handle an empty extracted list: do a non-empty check (see the Douban example); in this project the scraped data is saved directly, so a try...except statement is used to handle it (see the sketch after this list)


  • Note in the pipeline what kind of object the item is: if it was not defined via items.py it is a plain dict; if it was, it is an Item object rather than a dict and needs converting
  • Pagination
    。1 Look for a pattern in the page numbers
    。2 Find the URL of the next page, then yield scrapy.Request(url)
  • When a URL is incomplete, there are two fixes
    。1 String concatenation
    。2 urljoin()
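
A small hedged illustration of the empty-list handling mentioned above (the selector and field names are placeholders):

def fill_author_and_dynasty(response, item):
    authors = response.xpath('.//p[@class="source"]/a/text()').extract()
    # option 1: test the list before indexing
    item['author'] = authors[0] if authors else None
    # option 2: try...except, which is what this example project uses
    try:
        item['dynasty'] = authors[1]
    except IndexError:
        item['dynasty'] = None
    return item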


-----------------------------------------------------------------------------------

Scraping Tencent job postings

Requirements:

  • Scrape the job titles and job details from Tencent's careers site
  • And handle pagination
  • Step 1: Create the Scrapy project

scrapy startproject tencent

  • Step 2: Create the spider

cd tencent
scrapy genspider hr careers.tencent.com

  • Step 3: Basic configuration in settings

LOG_LEVEL = 'WARNING'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

  • Step 4: Page analysis
  • Start URL:
    https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1611984804055&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
  • Detail URL; the postId it needs has to be extracted from the list-page data:
    https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1611992149963&postId=1&language=zh-cn

Extracting the job information from the page:

  • item['job_name'] = job['RecruitPostName']  # job title
    post_id = job['PostId']
  • Step 5: Save in the pipeline
import json

class TencentPipeline:
    def open_spider(self, spider):
        self.fp = open('wx.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        # item here is an Item object, so it must be converted to a dict before use
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()

  • Step 6: Pagination

        # pagination (this loop sits in detail_content, after yielding the item)
        for page in range(2, 20):
            url = self.one_url.format(page)
            yield scrapy.Request(url=url)

--------------------------------- Code -----------------------------------------------

------ Spider code

import scrapy
import json
from tencent.items import TencentItem  # import the item class defined in items.py

class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    one_url='https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1611984804055&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    # URL template for the detail page
    detail_url='https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1611992149963&postId={}&language=zh-cn'
    start_urls = [one_url.format(1)]


    def parse(self, response):
        # parse the JSON data
        data=json.loads(response.text)

        for job in data['Data']['Posts']:
            item = TencentItem()
            item['job_name']=job['RecruitPostName'] # job title
            post_id=job['PostId']
            d_url=self.detail_url.format(post_id)

            # request the detail page
            yield scrapy.Request(
                url=d_url,
                callback=self.detail_content,
                meta={'item':item} # pass the item along
            )

    def detail_content(self,response):
        # how to retrieve the passed data
        # item=response.meta['item']  # option 1
        item=response.meta.get('item')  # option 2: get the data passed via meta
        data=json.loads(response.text)
        item['job_duty']=data['Data']['Responsibility']

        yield item




        # pagination
        for page in range(2,20):
            url=self.one_url.format(page)
            yield scrapy.Request(url=url)

"""
scrapy.Request()
url: the URL to request next
callback: the callback function; it carries on the spider's parsing logic
meta: passes data between different parse functions

"""

--------- Pipeline code


import json

class TencentPipeline:
    def open_spider(self, spider):
        self.fp = open('wx.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        # item here is an Item object, so it must be converted to a dict before use
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()

---- Scrapy items module


import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    #name = scrapy.Field()
    job_name=scrapy.Field() # job title
    job_duty=scrapy.Field() # job responsibilities
    pass

------ Scrapy settings module

BOT_NAME = 'tencent'

SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'
LOG_LEVEL='WARNING'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tencent (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)      Chrome/86.0.4240.111 Safari/537.36",
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'tencent.pipelines.TencentPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Summary

scrapy.Request()
url: the URL to request next
callback: the callback function; it carries on the spider's parsing logic
meta: passes data between different parse functions

            yield scrapy.Request(
                url=d_url,                      # request the detail page
                callback=self.detail_content,   # the function the response data is passed to
                meta={'item':item}              # pass the item along
            )
        for page in range(2,20):                # pagination requests
            url=self.one_url.format(page)
            yield scrapy.Request(url=url)
