Project Directory Overview
Create a new directory, then Shift + right-click inside it and choose "Open command window here".
Run: scrapy startproject <project-name>
The resulting directory layout looks like this:
|- your-project-name
    |- scrapy.cfg
    |- your-project-name
        |- __init__.py
        |- items.py
        |- middlewares.py
        |- pipelines.py
        |- settings.py
        |- spiders
            |- __init__.py
What each file does:
scrapy.cfg: project configuration file
spiders: holds your Spider files, i.e. the .py files that do the crawling
items.py: defines containers for the scraped data; an Item behaves much like a dict
middlewares.py: defines the Downloader Middlewares and Spider Middlewares
pipelines.py: defines the Item Pipelines, which clean, validate, and store the scraped data
settings.py: project-wide settings
Scraping Qidian (起点中文网) with Scrapy
First, from the project directory, run scrapy genspider xs "qidian.com"
to generate a spider (our own crawler file) under the spiders directory. Here is the finished spider:
# -*- coding: utf-8 -*-
# Generated in the project directory with: scrapy genspider xs "qidian.com"
import scrapy
from python16.items import Python16Item


class XsSpider(scrapy.Spider):
    # Name of the spider
    name = 'xs'
    # Domains the spider is allowed to crawl; links pointing to other
    # domains (e.g. external links on the site) are ignored.
    allowed_domains = ['qidian.com']
    # Initial request URL(s) the spider starts from; there can be several.
    start_urls = ['https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1']

    # Called with the response of each start_urls request; this is where we
    # parse the page and extract what we want.
    # response: the downloaded page to parse.
    def parse(self, response):
        # Raw page data, if you want to inspect it:
        # print(response.body.decode("UTF-8"))
        # Parse the page with XPath (response provides this method directly)
        lis = response.xpath("//ul[contains(@class,'all-img-list')]/li")
        for li in lis:
            item = Python16Item()
            # Note: text() in Scrapy comes back as a SelectorList, e.g.
            # [<Selector xpath=".//div[@class='book-mid-info']/h4/a/text()" data='圣墟'>]
            # so we call .extract_first() to get the first result as a string.
            name = li.xpath(".//div[@class='book-mid-info']/h4/a/text()").extract_first()
            author = li.xpath(".//div[@class='book-mid-info']/p[@class='author']/a/text()").extract_first()
            # Strip surrounding whitespace
            content = str(li.xpath("./div[@class='book-mid-info']/p[@class='intro']/text()").extract_first()).strip()
            item['name'] = name
            item['author'] = author
            item['content'] = content
            # Hand each item to the pipeline one at a time. Do NOT collect
            # them into a list and return that: it keeps everything in memory.
            yield item
        # Follow the next-page link
        nextUrl = response.xpath("//a[contains(@class, 'lbf-pagination-next')]/@href").extract_first()
        if nextUrl and nextUrl != "javascript:;":
            yield scrapy.Request(url="http:" + nextUrl, callback=self.parse)
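The spider above leans on the fact that Scrapy's extract_first() returns the first match as a string, or None when nothing matches, which is why the next-page link is checked before use. A dependency-free sketch of that "first match or None" pattern, using only the standard library's ElementTree (the extract_first helper here is our own, for illustration):

```python
import xml.etree.ElementTree as ET

# A tiny well-formed snippet shaped like the listing page we parse above.
html = """<ul class="all-img-list">
  <li><div class="book-mid-info"><h4><a>圣墟</a></h4></div></li>
  <li><div class="book-mid-info"><h4><a>大道朝天</a></h4></div></li>
</ul>"""

def extract_first(nodes):
    # Mimics SelectorList.extract_first(): first result, or None if empty.
    return nodes[0].text if nodes else None

root = ET.fromstring(html)
names = [extract_first(li.findall(".//h4/a")) for li in root.findall("li")]
print(names)      # ['圣墟', '大道朝天']

# No author node in this snippet, so we get None instead of an IndexError.
missing = extract_first(root.findall(".//p[@class='author']/a"))
print(missing)    # None
```

This is the same reason the spider guards `if nextUrl and ...`: on the last page the XPath matches nothing and extract_first() hands back None.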
Defining the Item
An Item is the container that holds the scraped data; you use it much like a dict.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


# Essentially a model/entity class
class Python16Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
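Python16Item is used like a dict, but a scrapy.Item only accepts keys that were declared as Fields and raises KeyError otherwise. A minimal Scrapy-free sketch of that behavior (SketchItem is hypothetical, for illustration only):

```python
# Dict-like access, but only declared fields may be set, mirroring
# how scrapy.Item rejects undeclared keys.
class SketchItem(dict):
    fields = ("name", "author", "content")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key} is not a declared field")
        super().__setitem__(key, value)

item = SketchItem()
item["name"] = "圣墟"
item["author"] = "辰东"
print(dict(item))   # {'name': '圣墟', 'author': '辰东'}

# item["price"] = 10 would raise KeyError: price is not a declared field
```

That restriction is the practical difference from a plain dict: a typo in a field name fails loudly instead of silently creating a new key.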
pipelines.py is where we process the data:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import csv


# Stores the scraped data
class Python16Pipeline(object):
    def __init__(self):
        # newline="" prevents a blank line appearing between rows when writing
        self.f = open("起点中文网.csv", "w", newline="")
        self.writer = csv.writer(self.f)
        self.writer.writerow(["书名", "作者", "简介"])

    # Process each item
    def process_item(self, item, spider):
        # Save the item as a CSV row
        name = item["name"]
        author = item["author"]
        content = item["content"]
        self.writer.writerow([name, author, content])
        # If another pipeline follows, the returned value is what its
        # process_item receives as item; return nothing and the next
        # pipeline gets None instead.
        return item
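The newline="" argument matters because csv.writer already terminates each row with \r\n; without newline="", Python's text layer on Windows would translate that into \r\r\n, which shows up as a blank line between rows. A small stand-alone demonstration using an in-memory buffer instead of a file:

```python
import csv
import io

# StringIO never translates newlines, so we see exactly what csv.writer emits.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["书名", "作者", "简介"])
writer.writerow(["圣墟", "辰东", "一念可斩星河"])

# Each row already ends in \r\n; a file opened without newline="" would
# translate those and double them up on Windows.
print(repr(buf.getvalue()))   # '书名,作者,简介\r\n圣墟,辰东,一念可斩星河\r\n'
```

The same reasoning applies to any open() call whose handle is passed to csv.writer, which is why the pipeline's __init__ opens the file with newline="".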
Below is the code in settings.py:
# -*- coding: utf-8 -*-
# Scrapy settings for python16 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'python16'
SPIDER_MODULES = ['python16.spiders']
NEWSPIDER_MODULE = 'python16.spiders'
# Log output level
LOG_LEVEL = "WARNING"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# (this is where you set your own UA string)
#USER_AGENT = 'python16 (+http://www.yourdomain.com)'
# Obey robots.txt rules (whether to honor the site's robots.txt when crawling)
# ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers (these defaults can be replaced):
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'python16.middlewares.Python16SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'python16.middlewares.Python16DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Pipeline priority (e.g. the 300 below): the lower the value, the earlier
# the pipeline runs (valid range 0-1000).
# No pipeline is enabled by default, so this setting must be uncommented.
# Several pipelines can be configured here; the priorities decide their order.
# Whatever each pipeline's process_item returns is what the next pipeline
# receives as its item argument.
ITEM_PIPELINES = {
'python16.pipelines.Python16Pipeline': 300,
}
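To illustrate that ordering and hand-off without Scrapy (both pipeline classes below are hypothetical): pipelines run lowest number first, and each process_item return value becomes the next pipeline's item.

```python
# Hypothetical pipelines mimicking how Scrapy chains process_item calls.
class StripPipeline:
    def process_item(self, item, spider=None):
        item["name"] = item["name"].strip()
        return item

class TagPipeline:
    def process_item(self, item, spider=None):
        item["source"] = "qidian"
        return item

# Mirrors ITEM_PIPELINES: lower number runs first (valid range 0-1000).
pipelines = {StripPipeline(): 300, TagPipeline(): 500}

item = {"name": "  圣墟  "}
for pipeline, _priority in sorted(pipelines.items(), key=lambda kv: kv[1]):
    item = pipeline.process_item(item)
print(item)   # {'name': '圣墟', 'source': 'qidian'}
```

Swapping the two priorities would run TagPipeline first; because each stage returns the item it received, the end result here would be the same, but for pipelines that drop or transform items the order matters.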
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Testing:
Run the spider from the command line with:
scrapy crawl xs
That works, but it is not very convenient. Instead, you can add a start.py file to the project and run it to launch the crawler:
from scrapy import cmdline
# The command string must be split into a list of arguments
cmdline.execute("scrapy crawl xs".split())
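cmdline.execute() expects the command as an argv-style list rather than a single string, which is why the split is required:

```python
import shlex

cmd = "scrapy crawl xs"
print(cmd.split())       # ['scrapy', 'crawl', 'xs']

# shlex.split gives the same result here, and also handles quoted
# arguments correctly if the command ever contains them.
print(shlex.split(cmd))  # ['scrapy', 'crawl', 'xs']
```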