Scrapy framework (Part 1)

This article walks through installing the Scrapy crawler framework, its basic usage, the relevant settings, and persistent storage, including saving to a local file and to a MySQL database. It also shows how to download images with Scrapy, with concrete code examples, so that readers can get a full picture of Scrapy and quickly start crawling and storing web data.

Learning Scrapy

What Scrapy is

  • Scrapy is a crawler framework built on asynchronous networking (Twisted)

Environment setup

  • Linux/macOS
    • pip install scrapy
  • Windows
    • pip install wheel
    • Download the Twisted wheel matching your Python version and OS from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    • In the download directory, run pip install Twisted-20.3.0-cp36-cp36m-win_amd64.whl (the file name depends on your versions)
    • pip install pywin32
    • pip install scrapy

Basic usage

  1. Create a project: scrapy startproject ProName
  2. Enter the project: cd ProName
  3. Create a spider file: scrapy genspider spiderName www.xx.com
  4. Run the project: scrapy crawl spiderName
    • To save what parse yields to a local file: scrapy crawl spiderName -o filename.ext (the extension, e.g. .json or .csv, selects the export format); see the sketch below
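
A minimal sketch of a spider whose parse output works with the -o export. The site, XPaths, and field names here are illustrative assumptions, not part of the project built below:

    # quotes_spider.py  (hypothetical example)
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['https://quotes.toscrape.com/']

        def parse(self, response):
            # every dict (or Item) yielded here becomes one record in the exported file
            for quote in response.xpath('//div[@class="quote"]'):
                yield {
                    'text': quote.xpath('./span[@class="text"]/text()').get(),
                    'author': quote.xpath('.//small[@class="author"]/text()').get(),
                }

Running scrapy crawl quotes -o quotes.json then writes every yielded record into quotes.json; the file extension selects the export format.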

settings configuration

  • Set the User-Agent: USER_AGENT = '...'

  • Disable robots.txt compliance: ROBOTSTXT_OBEY = False

  • Set the log level: LOG_LEVEL = 'ERROR'

  • # Pipeline priority: the lower the number, the higher the priority
    ITEM_PIPELINES = {
       'ProName.pipelines.ProNamePipeline': 300,
    }
    
  • Set the image download directory: IMAGES_STORE = 'img'
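
These options normally live in settings.py, as shown in the examples below. As a sketch, if you only need them for a single spider, Scrapy also accepts a custom_settings class attribute on the spider; the values here are placeholders:

    import scrapy


    class FirstSpider(scrapy.Spider):
        name = 'first'
        # per-spider overrides; keys and values match the settings.py names
        custom_settings = {
            'USER_AGENT': 'Mozilla/5.0 ...',
            'ROBOTSTXT_OBEY': False,
            'LOG_LEVEL': 'ERROR',
        }

        def parse(self, response):
            pass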

Persistent storage example

  1. Create the project: scrapy startproject one

  2. Enter the project: cd one

  3. Create the spider file: scrapy genspider first www.xx.com

  4. # Directory structure
    one
        │  items.py
        │  middlewares.py
        │  pipelines.py
        │  settings.py
        │  __init__.py
        │
        ├─spiders
        │  │  first.py
        │  │  __init__.py
    
  5. Write the spider

    # first.py
    import scrapy
    from one.items import OneItem
    
    
    class FirstSpider(scrapy.Spider):
        name = 'first'
        # allowed_domains = ['www.xx.com']
        start_urls = ['http://www.xx.com/']  # fill in the real start URL
        url = 'http://www.xx.com/%d/'  # URL template for crawling additional pages
        page = 2  # number of the next page to request

        def parse(self, response):
            li_list = response.xpath("...")
            for li in li_list:
                title = li.xpath('...').extract_first()
                detail_url = li.xpath('...').extract_first()  # URL of the detail page

                item = OneItem()  # instantiate an OneItem object
                item['title'] = title
                # item['url'] is filled in by parse_detail below

                yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})  # pass the item along with the request

            if self.page <= 50:
                new_url = self.url % self.page
                self.page += 1
                # send the next page request manually
                yield scrapy.Request(new_url, callback=self.parse)  # recursion drives the paging
            
        def parse_detail(self, response):
            item = response.meta['item']
            url = response.xpath('...').extract_first()
            item['url'] = url
            yield item
     
    
  6. # items.py
    import scrapy
    
    
    class OneItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        url = scrapy.Field()
    
  7. Set up the database (a quick connection-check sketch follows this list)

    mysql -uroot -p123  # log in to MySQL
    create database scrapy;  # create the database
    use scrapy;  # switch to it
    create table first (title varchar(50), url varchar(128));  # create the table
    
  8. Persistence pipelines (pipelines.py)

    import pymysql
    
    
    class TextPipeline:
    
        fp = None  # file handle, opened when the spider starts

        def open_spider(self, spider):
            """ Called once when the spider opens """
            print("Spider started")
            self.fp = open('a.txt', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            self.fp.write(item['title'] + ": " + item['url'] + '\n')
            return item

        def close_spider(self, spider):
            """ Called once when the spider closes """
            print("Spider finished")
            self.fp.close()
    
    
    class MysqlPipeline:
    
        conn = None
        cursor = None
    
        def open_spider(self, spider):
            """ 爬虫开始触发 """
            print("爬虫开始了(Mysql存储)")
            self.conn = pymysql.Connect(
                host='127.0.0.1',
                port=3306,
                database='scrapy',
                user='root',
                password='123',
                charset='utf8'
            )
    
        def process_item(self, item, spider):
            self.cursor = self.conn.cursor()
            # parameterized query: let pymysql handle quoting and escaping
            sql = 'insert into first (title, url) values (%s, %s)'
            try:
                self.cursor.execute(sql, (item['title'], item['url']))
                self.conn.commit()
            except Exception as e:
                print(e)
                self.conn.rollback()
    
            return item
    
        def close_spider(self, spider):
            """ 爬虫结束触发 """
            print("爬虫结束了(Mysql存储)")
            self.cursor.close()
            self.conn.close()
            
    
  9. settings.py configuration

    # Add or modify the following settings
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
    
    ROBOTSTXT_OBEY = False
    LOG_LEVEL = 'ERROR'
    
    ITEM_PIPELINES = {  # pipeline priorities: lower number = higher priority
       'one.pipelines.MysqlPipeline': 300,
       'one.pipelines.TextPipeline': 301,
    }
    
  10. Run the project: scrapy crawl first
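
As mentioned in step 7, a quick way to confirm the database and table are reachable before running the spider is a short pymysql check. This sketch assumes the same host, user, password, and table as above:

    # check_db.py  (optional sanity check)
    import pymysql

    conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                           password='123', database='scrapy', charset='utf8')
    with conn.cursor() as cursor:
        cursor.execute('select count(*) from first')  # fails if the table is missing
        print(cursor.fetchone())
    conn.close()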

Image download example

  1. Create the project: scrapy startproject one

  2. Enter the project: cd one

  3. Create the spider file: scrapy genspider first www.xx.com

  4. # Directory structure
    one
        │  items.py
        │  middlewares.py
        │  pipelines.py
        │  settings.py
        │  __init__.py
        │
        ├─spiders
        │  │  first.py
        │  │  __init__.py
    
  5. Write the spider

    # first.py
    import scrapy
    from one.items import OneItem
    
    
    class FirstSpider(scrapy.Spider):
        name = 'first'
        # allowed_domains = ['www.xx.com']
        start_urls = ['http://www.xx.com/']  # fill in the real start URL
        url = 'http://www.xx.com/%d/'  # URL template for crawling additional pages
        page = 2  # number of the next page to request

        def parse(self, response):
            li_list = response.xpath("...")
            for li in li_list:
                title = li.xpath('...').extract_first() + '.jpg'  # used later as the image file name
                detail_url = li.xpath('...').extract_first()  # URL of the detail page

                item = OneItem()  # instantiate an OneItem object
                item['title'] = title
                # item['url'] is filled in by parse_detail below

                yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})  # pass the item along with the request

            if self.page <= 50:
                new_url = self.url % self.page
                self.page += 1
                # send the next page request manually
                yield scrapy.Request(new_url, callback=self.parse)  # recursion drives the paging
            
        def parse_detail(self, response):
            item = response.meta['item']
            url = response.xpath('...').extract_first()
            item['url'] = url
            yield item
     
    
  6. # items.py
    import scrapy
    
    
    class OneItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        url = scrapy.Field()
    
  7. Image download pipeline (pipelines.py); a sketch that also logs the stored file path follows this list

    import scrapy
    
    from scrapy.pipelines.images import ImagesPipeline
    
    
    class ImgPipeline(ImagesPipeline):
        # send a download request for each image URL
        def get_media_requests(self, item, info):
            yield scrapy.Request(item['url'], meta={'item': item})
    
        # return the file name (relative to IMAGES_STORE) under which the image is saved
        def file_path(self, request, response=None, info=None, *, item=None):
            item = request.meta['item']
            title = item['title']
            return title
    
  8. settings.py configuration

    # Add or modify the following settings
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
    
    ROBOTSTXT_OBEY = False
    LOG_LEVEL = 'ERROR'
    IMAGES_STORE = 'img'  # directory where downloaded images are saved
    
    ITEM_PIPELINES = {  # pipeline priorities: lower number = higher priority
       'one.pipelines.ImgPipeline': 300,
    }
    
  9. Run the project: scrapy crawl first
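
As noted in step 7, if you also want to see where each image was saved, ImagesPipeline calls item_completed once all image requests for an item have finished. A sketch extending the pipeline above:

    # pipelines.py  (sketch: same ImgPipeline, plus a hook that logs the stored path)
    import scrapy
    from scrapy.pipelines.images import ImagesPipeline


    class ImgPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            yield scrapy.Request(item['url'], meta={'item': item})

        def file_path(self, request, response=None, info=None, *, item=None):
            return request.meta['item']['title']

        def item_completed(self, results, item, info):
            # results is a list of (success, info) tuples;
            # info['path'] is the saved file path relative to IMAGES_STORE
            for ok, result in results:
                if ok:
                    print('stored image at:', result['path'])
            return item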
