Pipeline usage
-----------------------------Pipeline usage-------------------------------------------
From the dict form of ITEM_PIPELINES you can see that more than one pipeline can be defined, and Scrapy does indeed support multiple pipelines.
Why multiple pipelines may be needed:
- 1. There may be several spiders, with different pipelines handling the items from different spiders
- 2. The items from one spider may need different processing, e.g. saving into different databases
Notes:
- 1. The smaller a pipeline's weight, the higher its priority
- 2. The method name process_item must not be changed to anything else
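The routing idea behind point 1 can be sketched in plain Python without running Scrapy: each pipeline checks spider.name and only touches items from its own spider, then always returns the item so later pipelines still see it. The class and spider names below are made up for illustration.

```python
# Hedged sketch: two pipelines, each handling items from only one spider.
# In a real project these classes live in pipelines.py and are enabled in
# settings.py via ITEM_PIPELINES (the smaller number runs first):
#
# ITEM_PIPELINES = {
#     'myspider.pipelines.QbPipeline': 300,
#     'myspider.pipelines.OtherPipeline': 400,
# }

class QbPipeline:
    def process_item(self, item, spider):
        if spider.name == 'qb':       # only touch items from the 'qb' spider
            item['source'] = 'qb'
        return item                   # always return the item for later pipelines

class OtherPipeline:
    def process_item(self, item, spider):
        if spider.name == 'other':
            item['source'] = 'other'
        return item

class FakeSpider:                     # stand-in for a real Spider object
    name = 'qb'

# Simulate what the engine does: pass the item through pipelines in weight order.
item = {'content': 'haha'}
for pipeline in (QbPipeline(), OtherPipeline()):
    item = pipeline.process_item(item, FakeSpider())
print(item)  # {'content': 'haha', 'source': 'qb'}
```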
------------------------Using the spider module-----------------------------------------
import scrapy
import logging

logger = logging.getLogger(__name__)

class QbSpider(scrapy.Spider):
    name = 'qb'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        for i in range(10):
            item = {}
            item['content'] = "haha"
            # logging.warning(item)
            logger.warning(item)
            yield item
------------------------The pipeline file---------------------------------
import logging

logger = logging.getLogger(__name__)

class MyspiderPipeline(object):
    def process_item(self, item, spider):
        # print(item)
        logger.warning(item)
        item['hello'] = 'world'
        return item
To save the log to a local file, set LOG_FILE = './log.log' in the settings file.
basicConfig format options:
- https://www.cnblogs.com/felixzh/p/6072417.html
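As a quick sketch of those style options (the format string and handler choice below are just one possibility, not from the link), a logging.Formatter can be attached to any handler; here a StringIO handler is used so the result can be inspected:

```python
import io
import logging

# Capture log output in a string buffer instead of stderr, so we can see
# exactly what the chosen format string produces.
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter('%(name)s %(levelname)s: %(message)s'))

logger = logging.getLogger('demo')   # 'demo' is an arbitrary example name
logger.addHandler(handler)
logger.setLevel(logging.WARNING)

logger.warning('something happened')
print(buf.getvalue().strip())  # demo WARNING: something happened
```

The same format string can be passed to logging.basicConfig(format=...), together with filename='./log.log' to write to a file, which is the stdlib analogue of Scrapy's LOG_FILE setting.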
The pipeline file
- In the spider file, use the yield keyword to hand the data to the pipeline
- The pipeline must be enabled in the settings file
。open_spider() runs when the spider starts; the method name must not change
。close_spider() runs when the spider closes
The yield keyword
- Generators save memory
- Flexible to use
。After return, the function ends; after yield, execution can resume and continue with the statements that follow
。yield an item to send it to the pipeline; for paging, yield a scrapy.Request object, which the engine receives and hands to the scheduler
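The return-versus-yield point can be seen in plain Python:

```python
def with_return():
    for i in range(3):
        return i          # the function exits on the very first iteration

def with_yield():
    for i in range(3):
        yield i           # the function pauses here and resumes on the next request

print(with_return())        # 0 -- only the first value
print(list(with_yield()))   # [0, 1, 2] -- all values, produced lazily
```

This is why a Scrapy parse method can yield many items (and many Requests) from a single call, with each value handed to the engine as it is produced.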
item
Note:
- For xxx = scrapy.Field(), the variable xxx is not a plain dict object
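This is why the pipelines below call dict(item) before json.dumps: a scrapy.Item is dict-like but not a dict. A minimal stand-in (an assumption, mimicking the behaviour without installing scrapy) shows how dict() consumes any object that exposes keys() and __getitem__:

```python
import json

# FakeItem is a hypothetical stand-in for scrapy.Item: dict-like, but not a dict.
class FakeItem:
    def __init__(self, **fields):
        self._values = fields

    def keys(self):                    # dict() uses keys() + __getitem__
        return self._values.keys()

    def __getitem__(self, key):
        return self._values[key]

item = FakeItem(title='test', author='someone')
# dict(item) converts the dict-like object into a real dict for serialization:
print(json.dumps(dict(item), ensure_ascii=False))  # {"title": "test", "author": "someone"}
```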
How to go to the next page
--------------------------------Prerequisites------------------------------------------
scrapy.Request essentials
- scrapy.Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None)
Commonly used parameters:
- callback: specifies which parse function the response for this URL is handed to
- meta: passes data between different parse functions; meta also carries some information by default, such as download delay and request depth
- dont_filter: keeps Scrapy's deduplication from filtering the current URL; Scrapy deduplicates URLs by default, so this is important for URLs that must be requested repeatedly
Introduction to item and its usage
items.py
import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    position = scrapy.Field()
    date = scrapy.Field()
Understanding Scrapy log output
Scrapy settings explained
The settings file holds shared variables (such as database addresses, usernames and passwords) so that you and others can change them easily. Variable names are usually ALL CAPS, e.g. SQL_HOST = '192.168.0.1'
Detailed settings reference: https://www.cnblogs.com/cnkai/p/7399573.html
Gushiwen (classical poetry site) paging
-------------------------------------Analyzing the site pages-------------------------------------
- Step 1: create the Scrapy project
scrapy startproject gsw
- Step 2: create the spider
cd gsw
scrapy genspider gs gushiwen.org
- Step 3: basic configuration in settings
LOG_LEVEL = 'WARNING'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'gsw.pipelines.GswPipeline': 300,
}
- Step 4: page analysis
。Required fields: title, dynasty, author, content, plus paging
https://www.gushiwen.org/default_1.aspx  page 1
https://www.gushiwen.cn/default_2.aspx   page 2
https://www.gushiwen.cn/default_3.aspx   page 3
gushiwen.org gushiwen.cn
From the page structure, all the content sits inside the element with class="left", and each poem inside an element with class="sons"
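That structural analysis can be checked with nothing but the stdlib html.parser; the HTML below is a made-up stand-in for the real page, kept only to mirror the class="left" / class="sons" nesting described above:

```python
from html.parser import HTMLParser

# Hypothetical sample mirroring the page structure: one "left" container,
# one "sons" block per poem.
SAMPLE = """
<div class="left">
  <div class="sons"><b>Poem 1</b></div>
  <div class="sons"><b>Poem 2</b></div>
</div>
"""

class SonsCounter(HTMLParser):
    """Count elements whose class attribute contains 'sons'."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "sons" in classes.split():
            self.count += 1

parser = SonsCounter()
parser.feed(SAMPLE)
print(parser.count)  # 2
```

In the actual project the same selection is done with response.xpath, which is far more convenient; this is only a sanity check of the page model.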
- Step 5: save the data (in the pipeline)
import json

class GswPipeline:
    def open_spider(self, spider):
        self.fp = open('gsw.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
- Step 6: paging
Step 1: find the URL of the next page
Step 2: yield scrapy.Request(url)
Handling the next page:
next_href = response.xpath('//a[@id="amore"]/@href').extract_first()
if next_href:
    next_url = response.urljoin(next_href)  # complete the URL: 1. concatenation, 2. urljoin()
    request = scrapy.Request(next_url)
    yield request
-------------------------------------------Code-------------------------------------
-------Spider code
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CgsSpider(CrawlSpider):
    name = 'cgs'
    allowed_domains = ['gushiwen.org', 'gushiwen.cn']
    start_urls = ['http://gushiwen.org/default_1.aspx']

    # Rule is a class that defines the rules for extracting URLs.
    # LinkExtractor is the link extractor.
    # allow=r'Items/' holds the URL pattern (regex, important); callback is the
    # callback function; follow=True keeps following links (e.g. to the next page).
    rules = (
        # list pages --> once the pattern uses the .cn domain, data keeps flowing in
        Rule(LinkExtractor(allow=r'https://www.gushiwen.cn/default_\d+.aspx'), follow=True),
        # detail pages
        Rule(LinkExtractor(allow=r'https://so.gushiwen.cn/shiwenv_\w+.aspx'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        # item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        # item['name'] = response.xpath('//div[@id="name"]').get()
        # item['description'] = response.xpath('//div[@id="description"]').get()
        content = response.xpath('//div[@class="contyishang"]/p/text()').extract()
        detail = ''.join(content).strip()
        item['detail_content'] = detail
        print(item)
        return item
-------Scrapy pipeline------
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json

class GswPipeline:
    def open_spider(self, spider):
        self.fp = open('gsw.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        # item here is an object, so it must be converted before use
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    # runs when the spider closes; the name close_spider(self, spider) is fixed
    # and must not be misspelled
    def close_spider(self, spider):
        self.fp.close()
-----Scrapy items module----
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class GswItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()            # title
    dynasty = scrapy.Field()          # dynasty
    author = scrapy.Field()           # author
    content = scrapy.Field()          # content
    detail_href = scrapy.Field()      # URL of the detail page
    detail_content = scrapy.Field()   # translation and annotations
-----Scrapy settings module
# Scrapy settings for gsw project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'gsw'
SPIDER_MODULES = ['gsw.spiders']
NEWSPIDER_MODULE = 'gsw.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'gsw (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL='WARNING'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'gsw.middlewares.GswSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'gsw.middlewares.GswDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'gsw.pipelines.GswPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
------Scrapy start module------
from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'gs'])
--------------------------------------------End-------------------------------------
Summary:
- When the domains differ, add each one, e.g. allowed_domains = ['gushiwen.org', 'gushiwen.cn']; you can also allow one broader domain, depending on the situation
- Handling empty lists: add a non-empty check (see the Douban example). In this case the scraped data is saved directly, so a try...except statement is used
- In the pipeline, watch the item object: if it was not defined in items.py it is just a dict; otherwise it is not a dict object and needs converting
- Paging
。 1. Look for a pattern in the page numbers
。 2. Find the URL of the next page, then yield scrapy.Request(url)
- When a URL is incomplete, there are two ways to complete it:
。 1. String concatenation
。 2. urljoin()
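The two URL-completion approaches above can be compared with the stdlib urljoin (Scrapy's response.urljoin delegates to it with the response URL as the base):

```python
from urllib.parse import urljoin

base = 'https://www.gushiwen.cn/default_1.aspx'   # the current page's URL

# Method 1: string concatenation (only works if you know the exact prefix)
print('https://www.gushiwen.cn' + '/default_2.aspx')

# Method 2: urljoin resolves a relative href against the current page's URL
print(urljoin(base, '/default_2.aspx'))  # https://www.gushiwen.cn/default_2.aspx
```

urljoin is the safer choice because it also handles relative paths like 'default_2.aspx' and hrefs that are already absolute.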
-----------------------------------------------------------------------------------
Scraping Tencent job postings
Requirements:
- Scrape the job titles and details from Tencent's careers site
- And handle paging
- Step 1: create the Scrapy project
scrapy startproject tencent
- Step 2: create the spider
cd tencent
scrapy genspider hr careers.tencent.com
- Step 3: basic configuration in settings
LOG_LEVEL = 'WARNING'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}
- Step 4: page analysis
- Start URL:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1611984804055&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
- Detail URL, which must be built from data extracted from the list URL:
https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1611992149963&postId=1&language=zh-cn
Extracting the job information from the page:
- item['job_name'] = job['RecruitPostName']  # job title
  post_id = job['PostId']
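Since the list URL is a JSON API, the extraction is json.loads plus dict lookups. The payload below is a made-up sample that only mirrors the field names used in these notes (Data, Posts, RecruitPostName, PostId):

```python
import json

# Hypothetical sample payload mirroring the shape of the jobs API response.
sample = '''
{"Data": {"Posts": [
    {"RecruitPostName": "Backend Engineer", "PostId": "1001"},
    {"RecruitPostName": "Data Analyst", "PostId": "1002"}
]}}
'''

data = json.loads(sample)
for job in data['Data']['Posts']:
    item = {'job_name': job['RecruitPostName']}  # job title
    post_id = job['PostId']                      # used to build the detail URL
    print(item, post_id)
```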
- Step 5: save in the pipeline
import json

class TencentPipeline:
    def open_spider(self, spider):
        self.fp = open('wx.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        # item here is an object, so it must be converted before use
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
- Step 6: paging
# paging
for page in range(2, 20):
    url = self.one_url.format(page)
    yield scrapy.Request(url=url)
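Because this API pages via a pageIndex query parameter, paging is just string formatting over the URL template. The sketch below uses a trimmed-down version of the real URL to keep it readable:

```python
# The list URL has a pageIndex={} placeholder; .format(page) fills it in.
# (Trimmed-down stand-in for the full careers.tencent.com query URL.)
one_url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
           '?pageIndex={}&pageSize=10')

urls = [one_url.format(page) for page in range(2, 5)]
for u in urls:
    print(u)
```

In the spider each of these URLs is wrapped in scrapy.Request and yielded, so the scheduler fetches every page.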
---------------------------------Code-----------------------------------------------
------Spider code
import scrapy
import json
from 爬虫.Day20.demo1.tencent.tencent.items import TencentItem

class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1611984804055&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    # URL of the detail page
    detail_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1611992149963&postId={}&language=zh-cn'
    start_urls = [one_url.format(1)]

    def parse(self, response):
        # parse the data
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = TencentItem()
            item['job_name'] = job['RecruitPostName']  # job title
            post_id = job['PostId']
            d_url = self.detail_url.format(post_id)
            # request the detail page
            yield scrapy.Request(
                url=d_url,
                callback=self.detail_content,
                meta={'item': item}  # pass the data along
            )

    def detail_content(self, response):
        # how to receive the passed data
        # item = response.meta['item']   # option 1
        item = response.meta.get('item')  # fetch the passed data
        data = json.loads(response.text)
        item['job_duty'] = data['Data']['Responsibility']
        yield item
        # paging
        for page in range(2, 20):
            url = self.one_url.format(page)
            yield scrapy.Request(url=url)

"""
scrapy.Request()
    url       the URL to request next
    callback  the callback function, which carries the spider's logic
    meta      passes data between different parse functions
"""
---------Pipeline code
import json

class TencentPipeline:
    def open_spider(self, spider):
        self.fp = open('wx.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(item)
        # item here is an object, so it must be converted before use
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
----Scrapy items module
import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_name = scrapy.Field()  # job title
    job_duty = scrapy.Field()  # job responsibilities
------Scrapy settings module
BOT_NAME = 'tencent'
SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'
LOG_LEVEL='WARNING'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tencent (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'tencent.middlewares.TencentSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'tencent.middlewares.TencentDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'tencent.pipelines.TencentPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Summary
scrapy.Request()
    url       the URL to request next
    callback  the callback function, which carries the spider's logic
    meta      passes data between different parse functions

yield scrapy.Request(
    url=d_url,                     # request to the detail page
    callback=self.detail_content,  # the function the data is passed to
    meta={'item': item}            # pass the data along
)

for page in range(2, 20):          # request the following pages
    url = self.one_url.format(page)
    yield scrapy.Request(url=url)