Movie and TV Information Collection and Analysis Based on the Scrapy Framework

I. Scrapy Basics

  1. Definition
    Scrapy is an application framework written to crawl websites and extract structured data. It can be used in a wide range of programs, including data mining, information processing, and storing historical data.

  2. Crawlers and Scrapy
    A general-purpose crawler framework:
    (Figure: general crawler framework)

How the Scrapy framework runs:
(Figure: Scrapy framework data flow)

  3. The Scrapy framework consists of five components (connected by the downloader and spider middlewares):
  • Item Pipeline
  • Spider
  • Scheduler
  • Downloader
  • Engine
  4. Workflow:

1) Given the address of the target website, the Engine finds the Spider that handles that site and asks it for the first URL(s) to crawl.

2) The Engine receives the first URL(s) from the Spider and schedules them in the Scheduler as Requests.

3) The Engine asks the Scheduler for the next URL to crawl.

4) The Scheduler returns the next URL to the Engine, and the Engine forwards it to the Downloader through the downloader middleware (request direction).

5) Once the page finishes downloading, the Downloader generates a Response for that page and sends it back to the Engine through the downloader middleware (response direction).

6) The Engine receives the Response from the Downloader and sends it to the Spider for processing through the spider middleware (input direction).

7) The Spider processes the Response and returns scraped Items and new Requests to the Engine.

8) The Engine passes the scraped Items to the Item Pipeline and the Requests to the Scheduler.

9) The cycle repeats (from step 2) until there are no more Requests in the Scheduler, at which point the Engine closes the spider for that site.
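
To make steps 7 and 8 concrete, here is a minimal spider sketch (it uses the public practice site quotes.toscrape.com and is not part of this project): everything that parse yields goes back to the Engine, which routes Items to the Item Pipeline and Requests to the Scheduler.

      import scrapy


      class QuotesSpider(scrapy.Spider):
          name = 'quotes'
          start_urls = ['http://quotes.toscrape.com/']

          def parse(self, response):
              # yielded items are routed by the Engine to the Item Pipeline (step 8)
              for quote in response.css('div.quote'):
                  yield {'text': quote.css('span.text::text').extract_first()}
              # yielded Requests are routed by the Engine back to the Scheduler (step 8)
              next_page = response.css('li.next a::attr(href)').extract_first()
              if next_page:
                  yield response.follow(next_page, callback=self.parse)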

  5. Scrapy project structure

DouBan/
├── DouBan
│   ├── __init__.py
│   ├── items.py            # template (data model) for the items to be stored
│   ├── middlewares.py      # middleware definitions
│   ├── pipelines.py        # data storage and processing module
│   ├── __pycache__
│   ├── settings.py
│   └── spiders             # core directory holding the spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

II. Using Scrapy

  1. Create a project

      scrapy startproject project_name
    
  2. Generate a spider

     scrapy genspider example example.com
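
For reference, genspider creates a stub spider roughly like the one below (the exact template varies slightly across Scrapy versions):

      import scrapy


      class ExampleSpider(scrapy.Spider):
          name = 'example'
          allowed_domains = ['example.com']
          start_urls = ['http://example.com/']

          def parse(self, response):
              # fill in the extraction logic here
              pass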
    
  3. Write the individual files to complete the crawler

items.py: defines the crawler's data model, similar to an entity class.

pipelines.py: the pipeline file, responsible for processing the data returned by the spider.

spiders directory: holds the spider classes, which inherit from Scrapy's spider base classes.

scrapy.cfg: base Scrapy configuration.

__init__.py: initialization file.

settings.py: configuration for the crawler as a whole.

  4. Run (note that scrapy crawl takes the spider's name attribute, not the project name)

      scrapy crawl spider_name
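
To get the CSV file mentioned in the results at the end of this post, the simplest route is Scrapy's built-in feed exports, which need no custom pipeline:

      scrapy crawl douban -o douban.csv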
    

Example: crawling Douban's Top 250 movies

  • douban.py

      import scrapy
      from scrapy import Request
      from scrapy_redis.spiders import RedisSpider
      from DouBan.items import DoubanItem


      class DoubanSpider(RedisSpider):
          name = 'douban'
          allowed_domains = ['douban.com']

          # start URL(s) are read from this Redis list rather than start_urls
          redis_key = 'douban:start_urls'
          url = 'https://movie.douban.com/top250'

          def parse(self, response):
              movies = response.xpath('//ol[@class="grid_view"]/li')
              for movie in movies:
                  # create a fresh item per movie (a single shared item would
                  # be overwritten on every iteration)
                  item = DoubanItem()
                  title = movie.xpath('.//span[@class="title"]/text()').extract()[0]
                  rating_num = movie.xpath('.//span[@class="rating_num"]/text()').extract()[0]
                  name = movie.xpath('.//div[@class="bd"]/p/text()').extract()[0].strip()
                  author = name.split('主演')[0]
                  # the one-line review is missing for a few entries
                  evaluates = movie.xpath('.//span[@class="inq"]/text()').extract()
                  evaluate = evaluates[0] if evaluates else ''
                  # second text node under div.star, e.g. "123456人评价"
                  comment_num = movie.xpath('.//div[@class="star"]/span/text()').extract()[1]
                  img_url = movie.xpath('.//div[@class="pic"]/a/img/@src').extract()[0]
                  detail_url = movie.xpath('.//div[@class="pic"]/a/@href').extract()[0]

                  item['title'] = title
                  item['rating_num'] = rating_num
                  item['author'] = author
                  item['evaluate'] = evaluate
                  item['comment_num'] = comment_num
                  item['img_url'] = img_url
                  item['detail_url'] = detail_url

                  # fetch the detail page for the movie length, passing the
                  # partially filled item along in meta
                  yield Request(item['detail_url'], meta={'item': item}, callback=self.detailParser)

          def detailParser(self, response):
              item = response.request.meta['item']
              item['movie_length'] = response.xpath('//span[@property="v:runtime"]/text()').extract()[0]
              yield item
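
Because DoubanSpider inherits from scrapy_redis's RedisSpider, it does not start crawling on its own: it waits until a start URL is pushed to the Redis list named by redis_key. With a Redis server running (and scrapy-redis installed, e.g. via pip install scrapy-redis), the crawl is kicked off with:

      redis-cli lpush douban:start_urls https://movie.douban.com/top250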
    
  • items.py

      import scrapy
      
      
      class DoubanItem(scrapy.Item):
          # define the fields for your item here like:
          # name = scrapy.Field()
          title = scrapy.Field()
          rating_num = scrapy.Field()
          author = scrapy.Field()
          evaluate = scrapy.Field()
          comment_num = scrapy.Field()
          img_url = scrapy.Field()
          detail_url = scrapy.Field()
          img_path = scrapy.Field()
          movie_length = scrapy.Field()
    
  • pipelines.py

      import json
      import scrapy
      import pymysql
      from scrapy.exceptions import DropItem
      from scrapy.pipelines.images import ImagesPipeline


      class DoubanPipeline(object):
          def process_item(self, item, spider):
              return item


      class AddScoreNum(object):
          """Example processing step: add one to the scraped score."""
          def process_item(self, item, spider):
              if item['rating_num']:
                  score = float(item['rating_num'])
                  item['rating_num'] = str(score + 1)
                  return item
              else:
                  # drop items that were scraped without a rating
                  raise DropItem('rating_num missing, dropping item')


      class JsonWritePipeline(object):
          def open_spider(self, spider):
              self.file = open('douban.json', 'w', encoding='utf-8')

          def process_item(self, item, spider):
              line = json.dumps(dict(item), indent=4, ensure_ascii=False)
              self.file.write(line + '\n')
              return item

          def close_spider(self, spider):
              self.file.close()


      class MysqlPipeline(object):
          def open_spider(self, spider):
              self.connect = pymysql.connect(
                  host='127.0.0.1',
                  port=3306,
                  db='scrapyProject',
                  user='root',
                  password='westos',
                  charset='utf8',
                  use_unicode=True,
                  autocommit=True
              )
              self.cursor = self.connect.cursor()
              self.cursor.execute("create table if not exists doubanTop("
                                  "title varchar(50) unique,"
                                  "rating_num float,"
                                  "author varchar(100),"
                                  "comment_num int,"
                                  "evaluate varchar(100));")

          def process_item(self, item, spider):
              # use a parameterized query so values are escaped by the driver
              insert_sqli = ("insert into doubanTop"
                             "(title, rating_num, author, comment_num, evaluate) "
                             "values (%s, %s, %s, %s, %s)")
              try:
                  self.cursor.execute(insert_sqli, (item['title'], item['rating_num'],
                                                    item['author'], item['comment_num'],
                                                    item['evaluate']))
                  self.connect.commit()
              except Exception:
                  self.connect.rollback()
              return item

          def close_spider(self, spider):
              self.connect.commit()
              self.cursor.close()
              self.connect.close()


      class MyImagesPipeline(ImagesPipeline):
          def get_media_requests(self, item, info):
              # request the poster image; the result arrives in item_completed
              yield scrapy.Request(item['img_url'])

          def item_completed(self, results, item, info):
              img_paths = [x['path'] for isok, x in results if isok]

              if not img_paths:
                  raise DropItem('Item contains no images')

              item['img_path'] = img_paths[0]
              return item
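
One caveat worth flagging: comment_num is scraped as raw node text, which on the Top 250 page looks like "123456人评价", while the doubanTop column is declared int; depending on MySQL's strict mode, the insert will either be silently truncated or rejected. A small hypothetical helper to normalize the value before inserting:

      import re

      def to_comment_count(raw):
          # pull the leading digits out of text like '123456人评价',
          # falling back to 0 when no digits are present
          match = re.match(r'\d+', raw)
          return int(match.group()) if match else 0

Note also that MyImagesPipeline builds on Scrapy's ImagesPipeline, which requires Pillow (pip install Pillow) in addition to the IMAGES_STORE setting shown below.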
    
  • settings.py

      BOT_NAME = 'DouBan'

      SPIDER_MODULES = ['DouBan.spiders']
      NEWSPIDER_MODULE = 'DouBan.spiders'


      # Crawl responsibly by identifying yourself (and your website) on the user-agent
      #USER_AGENT = 'DouBan (+http://www.yourdomain.com)'
      # pick a random User-Agent (chosen once, at startup)
      from fake_useragent import UserAgent
      ua = UserAgent()
      USER_AGENT = ua.random

      # Obey robots.txt rules
      # ROBOTSTXT_OBEY = True
      ROBOTSTXT_OBEY = False

      ITEM_PIPELINES = {
          # 'scrapy.pipelines.images.ImagesPipeline': 1,
          # 'DouBan.pipelines.MyImagesPipeline': 2,
          # 'DouBan.pipelines.DoubanPipeline': 300,
          # 'DouBan.pipelines.JsonWritePipeline': 200,
          'scrapy_redis.pipelines.RedisPipeline': 100,
          # 'DouBan.pipelines.AddScoreNum': 100,
          # 'DouBan.pipelines.MysqlPipeline': 200
      }

      # keep the request queue, dupefilter state and scraped items in Redis
      DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
      SCHEDULER = "scrapy_redis.scheduler.Scheduler"
      SCHEDULER_PERSIST = True
      REDIS_HOST = '172.25.254.46'
      REDIS_PORT = 6379


      IMAGES_STORE = './images'
      IMAGES_THUMBS = {
          'small': (100, 100),
          'big': (270, 270)
      }
      IMAGES_MIN_HEIGHT = 110
      IMAGES_MIN_WIDTH = 110
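
One thing to note about the User-Agent setup above: ua.random is evaluated once when the settings module is imported, so the entire crawl runs with a single User-Agent. To rotate it per request, one common approach (not part of the original project; the class name and priority are illustrative) is a small downloader middleware:

      # DouBan/middlewares.py
      from fake_useragent import UserAgent


      class RandomUserAgentMiddleware(object):
          def __init__(self):
              self.ua = UserAgent()

          def process_request(self, request, spider):
              # pick a fresh User-Agent for every outgoing request
              request.headers['User-Agent'] = self.ua.random

enabled in settings.py with:

      DOWNLOADER_MIDDLEWARES = {
          'DouBan.middlewares.RandomUserAgentMiddleware': 400,
      }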
    

Run result:
The scraped information is saved to a CSV file; it can also be stored in the database.
(Figure: sample of the scraped results)
