The Scrapy framework

Structure of the Scrapy framework:

[Scrapy architecture diagram]

Hands-on:
To use Scrapy, first install it with pip install scrapy. After installing, create a project, for example one that scrapes Douban movies: scrapy startproject DouBan
Create the spider: cd DouBan/
scrapy genspider douban 'douban.com'

Run it with: scrapy crawl douban
settings.py holds the project configuration;
items.py declares the names of the cleaned-up data fields, used later when the data is written to the database;
pipelines.py contains the item pipelines: after the spider extracts an item, each enabled pipeline receives it in turn and can clean, validate, store, or drop it before handing it on;
douban.py (under spiders/) holds the actual scraping and data-cleaning logic. The project layout produced by the commands above is sketched below.
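For reference, this is roughly the layout that scrapy startproject DouBan followed by scrapy genspider douban produces (file names may vary slightly between Scrapy versions):

DouBan/
├── scrapy.cfg                # deploy configuration
└── DouBan/                   # the project's Python package
    ├── __init__.py
    ├── items.py              # item (field) definitions
    ├── middlewares.py        # spider / downloader middlewares
    ├── pipelines.py          # item pipelines
    ├── settings.py           # project settings
    └── spiders/
        ├── __init__.py
        └── douban.py         # the spider created by genspider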

Enough talk, here is the code.

douban.py:

# -*- coding: utf-8 -*-
import re
import scrapy
from scrapy import Request

from DouBan.items import DouBanMovieItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'  # spider name; can be anything, but must be unique
    allowed_domains = ['douban.com', 'doubanio.com']  # domains the spider is allowed to crawl
    # start_urls = ['http://douban.com/'] # seed URLs: the first URLs to crawl, handed to the scheduler via the engine
    start_urls = [
        'https://movie.douban.com/top250'
    ]
    url = 'https://movie.douban.com/top250'

    def parse(self, response):
        item = DouBanMovieItem()
        # <ol class="grid_view">
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            # movie title (title): <span class="title">肖申克的救赎</span>
            # extract() converts the matched selectors into strings
            item['title'] = movie.xpath(
                './/span[@class="title"]/text()'
            ).extract()[0]

            # movie score (score): <span class="rating_num" property="v:average">9.7</span>
            item['score'] = movie.xpath(
                './/span[@class="rating_num"]/text()'
            ).extract()[0]

            # movie quote (quote): some movies have no one-line quote, store an empty string for those
            quote = movie.xpath(
                './/span[@class="inq"]/text()'
            ).extract()
            item['quote'] = quote[0] if quote else ''

            # movie director (director)
            """
            info:
                ['导演: 奥利维·那卡什 Olivier Nakache / 艾力克·托兰达 Eric Toledano\xa0\xa0\xa0主...', '2011\xa0/\xa0剧, '', '\n                            ']
            """
            info = movie.xpath(
                './/div[@class="bd"]/p/text()'
            ).extract()

            director = info[0].split('主演')[0].strip()
            item['director'] = director

            # 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p1454261925.jpg'
            item['image_url'] = movie.xpath('.//div[@class="pic"]/a/img/@src').extract()[0]
            # print("image url: ", item['image_url'])


            item['detail_url'] = movie.xpath('.//div[@class="hd"]//a/@href').extract()[0]
            # print("detail url: ", item['detail_url'])
            # release date (release_date): not extracted here yet
            yield item

        """
        <span class="next">
        <link rel="next" href="?start=50&amp;filter=">
        <a href="?start=50&amp;filter=">后页&gt;</a>
        </span>
        """
        # nextLink = response.xpath('.//span[@class="next"]/link/@href').extract()  # returns a list
        # if nextLink:
        #     nextLink = nextLink[0]
        #     print('Next Link: ', nextLink)
        #     yield Request(self.url + nextLink, callback=self.parse)
        #
        #

items.py:

# -*- coding: utf-8 -*-

"""
# 1. item.py文件的功能?
item.py主要目标是从非结构化来源(通常是网页)提取结构化数据。Scrapy爬虫可以将提取的数据作为Python语句返回。

# 2. 为什么使用item.py?
虽然方便和熟悉,Python dicts缺乏结构:很容易在字段名称中输入错误或返回不一致的数据,特别是在与许多爬虫的大项目。

# 3. item.py文件的优势?
- 定义公共输出数据格式,Scrapy提供Item类。
- Item对象是用于收集所抓取的数据的简单容器。
- 提供了一个类似字典的 API,具有用于声明其可用字段的方便的语法。

"""
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


# class DoubanItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     title = scrapy.Field()
#     rating_num = scrapy.Field()


class DouBanMovieItem(scrapy.Item):
    """
    确定要爬取的数据的类型和名称,包含:
        电影名称( title) ;
        电影评分( score) ;
        电影评语( quote) ;
        电影导演( director) ,
        上映日期(release_date)
        评论数(comment_num)
    通过 Field( ) 方法来声明数据字段。
    """
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # 电影名称
    score = scrapy.Field()  # 电影评分
    quote = scrapy.Field()  # 电影评语
    director = scrapy.Field()  # 电影导演
    release_date = scrapy.Field()  # 上映日期
    comment_num = scrapy.Field()  # 评论数
    image_url = scrapy.Field()  # 图片的url地址
    detail_url = scrapy.Field()  # 电影详情页信息;
    image_path = scrapy.Field()  # 下載的封面本地存儲位置
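
# A hedged usage sketch (not part of the original file): an Item behaves like a dict,
# which is exactly how the spider above uses it:
#     item = DouBanMovieItem()
#     item['title'] = '肖申克的救赎'
#     dict(item)            # -> {'title': '肖申克的救赎'}
#     item['unknown'] = 1   # raises KeyError, because that field was never declared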

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for DouBan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'DouBan'

SPIDER_MODULES = ['DouBan.spiders']
NEWSPIDER_MODULE = 'DouBan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'DouBan (+http://www.yourdomain.com)'
# pick a random User-Agent via fake_useragent
from fake_useragent import UserAgent

ua = UserAgent()
USER_AGENT = ua.random
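# Note (added comment): ua.random is evaluated only once, when this settings module is
# loaded, so every request in a given run shares the same User-Agent; rotating it per
# request would need a downloader middleware instead.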

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False  # if True, the crawler obeys robots.txt and only fetches what the site allows;
                        # if False, it will also fetch pages the site asks crawlers not to crawl

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'DouBan.middlewares.DoubanSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'DouBan.middlewares.DoubanDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 2,
    'DouBan.pipelines.MyImagesPipeline': 2,
    'DouBan.pipelines.DoubanPipeline': 300,
    'DouBan.pipelines.JsonWriterPipeline': 200,  # the lower the number, the earlier the pipeline runs
    'DouBan.pipelines.AddScoreNum': 100,  # post-process the scraped data before it is stored
    'DouBan.pipelines.MysqlPipeline': 200,  # store the processed data in MySQL
}
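# Note (added comment, worth verifying against your Scrapy version): the stock FilesPipeline
# enabled above disables itself unless FILES_STORE is set (it is commented out below), and the
# stock ImagesPipeline looks for an 'image_urls' field that DouBanMovieItem does not define,
# so in practice only MyImagesPipeline downloads the posters here.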

# FILES_STORE = '/tmp/files/'  # file storage path
IMAGES_STORE = '/tmp/images/'  # image storage path
# 90 days of delay for files expiration
# FILES_EXPIRES = 90
# 30 days of delay for images expiration
IMAGES_EXPIRES = 30
# thumbnail sizes to generate for each image
IMAGES_THUMBS = {
    'small': (250, 250),
    'big': (270, 270),
}
# image filter: skip images smaller than this minimum height and width
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py:

# -*- coding: utf-8 -*-


# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

import pymysql
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item


class AddScoreNum(object):
    """Add 1 to the scraped score."""

    def process_item(self, item, spider):
        if item['score']:
            score = float(item['score'])
            item['score'] = str(score + 1)
            return item
        else:
            # no score was scraped; drop the item instead of storing it
            raise DropItem("no score scraped")


class JsonWriterPipeline(object):
    """Open the output file when the spider starts and close it when the spider finishes."""

    def open_spider(self, spider):
        self.file = open('douban.json', 'w')

    def process_item(self, item, spider):
        # dict(item): convert the Item object into a plain dict
        # json.dumps: serialize the dict into a JSON string
        # indent=4: pretty-print with a 4-space indent
        # ensure_ascii=False: keep Chinese characters readable instead of \u escapes
        line = json.dumps(dict(item), indent=4, ensure_ascii=False)
        self.file.write(line + '\n')  # one JSON object per item (the file is not a single JSON array)
        return item

    def close_spider(self, spider):
        self.file.close()


class MysqlPipeline(object):
    """A simple MySQL storage pipeline."""

    def open_spider(self, spider):
        # connect to the database
        self.connect = pymysql.connect(
            host='127.0.0.1',  # database host
            port=3306,  # database port
            db='scrapyProject',  # database name
            user='root',  # database user
            passwd='westos',  # database password
            charset='utf8',  # character encoding
            use_unicode=True,
            autocommit=True
        )
        # all inserts, deletes, queries and updates go through this cursor
        self.cursor = self.connect.cursor()
        self.cursor.execute("create table if not exists douBanTop("
                            "title varchar(50) unique, "
                            "score float , "
                            "quote varchar(100), "
                            "director varchar(100), "
                            "comment_num int, "
                            "release_date varchar(10));")

    def process_item(self, item, spider):
        # use a parameterized query so quotes or apostrophes in the data cannot break the SQL
        insert_sqli = "insert into douBanTop(title, score, quote, director) values (%s, %s, %s, %s)"
        print(insert_sqli)
        try:
            self.cursor.execute(insert_sqli, (item['title'], item['score'], item['quote'], item['director']))
            # commit the statement
            self.connect.commit()
        except Exception as e:
            self.connect.rollback()
        return item  # process_item must return the item

    def close_spider(self, spider):
        self.connect.commit()
        self.cursor.close()
        self.connect.close()


class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):  # called once per item
        """
        Issue the request that downloads this item's poster image.
        :param item:
        :param info:
        :return:
        """
        print("item: ", item)
        yield scrapy.Request(item['image_url'])



    #
    # def item_completed(self, results, item, info):
    #     """
    #     :param results:
    #         [(True,  {'url': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p1454261925.jpg',
    #             'path': 'full/e9cc62a6d6a0165314b832b1f31a74ca2487547a.jpg',
    #             'checksum': '5d77f59d4d634b795780b2138c1bf572'})]
    #     :param item:
    #     :param info:
    #     :return:
    #     """
    #     # for result in results:
    #     #     print("result: ", result)
    #     image_paths = [x['path'] for isok, x in results if isok]
    #     # print("image_paths: ", image_paths[0])
    #     if not image_paths:
    #         raise DropItem("Item contains no images")
    #
    #     item['image_path'] = image_paths[0]
    #     return item
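With the pipelines above enabled in ITEM_PIPELINES, a run of scrapy crawl douban should append each item to douban.json, insert a row into the douBanTop table (assuming MySQL is reachable with the configured credentials), and download the posters under /tmp/images/ as set by IMAGES_STORE. Note that item['image_path'] is only filled in if the commented-out item_completed method above is re-enabled.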