Web Crawlers (6): The Scrapy Crawler Framework


Project Directory Overview

Create a new directory, then hold Shift, right-click inside it, and choose "Open command window here".

Enter: scrapy startproject <project name>
The resulting folder structure looks like this:

|- your_project_name
   |- scrapy.cfg
   |- your_project_name        (the project's Python package)
      |- __init__.py
      |- items.py
      |- middlewares.py
      |- pipelines.py
      |- settings.py
      |- spiders
         |- __init__.py

What each file does:
scrapy.cfg: the project's configuration file

spiders: holds your Spider files, i.e. the .py files that do the actual crawling

items.py: defines containers for the scraped data, used much like dicts

middlewares.py: defines the Downloader Middlewares and Spider Middlewares

pipelines.py: defines the Item Pipelines, which clean, validate and store the data

settings.py: global configuration

Using Scrapy to Crawl Data from Qidian (qidian.com)

First, inside the project, run scrapy genspider xs "qidian.com" to generate a spider (our own crawler file) under the spiders directory. Below is the finished spider:

# -*- coding: utf-8 -*-  (this file was generated by running scrapy genspider xs "qidian.com" inside the project directory)
import scrapy
from python16.items import Python16Item

class XsSpider(scrapy.Spider):
    # the spider's name
    name = 'xs'
    # domains the spider is allowed to crawl; links whose domain differs from the target site are ignored
    allowed_domains = ['qidian.com']
    # the initial request URL(s) the spider starts from; more than one can be listed
    start_urls = ['https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1']

    # parse() is the Spider method called with the response of each start_url request;
    # it parses the page and extracts the data we want.
    # response: the content returned for the requested page, i.e. the page to be parsed.
    def parse(self, response):
        # inspect the raw page content if needed
        # print(response.body.decode("UTF-8"))
        # parse the page with XPath (response already provides an xpath() method)
        lis =  response.xpath("//ul[contains(@class,'all-img-list')]/li")
        for li in lis:
            item = Python16Item()
            # the raw result of text() in Scrapy looks like:
            # [<Selector xpath=".//div[@class='book-mid-info']/h4/a/text()" data='圣墟'>]
            # so we call .extract_first() to pull out the first matching string
            name = li.xpath(".//div[@class='book-mid-info']/h4/a/text()").extract_first()
            author = li.xpath(".//div[@class='book-mid-info']/p[@class='author']/a/text()").extract_first()
            # strip surrounding whitespace
            content = str(li.xpath("./div[@class='book-mid-info']/p[@class='intro']/text()").extract_first()).strip()
            item['name'] = name
            item['author'] = author
            item['content'] = content
            # yield each item to the pipeline one at a time (don't collect them into a list and return it all at once; holding everything in memory hurts performance)
            yield item
        # get the link to the next page
        nextUrl = response.xpath("//a[contains(@class, 'lbf-pagination-next')]/@href").extract_first()
        # when there is no next page the href is "javascript:;", so stop paginating there
        if nextUrl != "javascript:;":
            # the href is protocol-relative, so prepend the scheme
            yield scrapy.Request(url="http:" + nextUrl, callback=self.parse)
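
To make the .extract_first() call above easier to picture, here is a minimal, self-contained sketch using scrapy.Selector directly; the HTML snippet is made up purely for illustration:

from scrapy import Selector

html = "<div class='book-mid-info'><h4><a>圣墟</a></h4></div>"
sel = Selector(text=html)

# xpath() returns a SelectorList of Selector objects, not plain strings
print(sel.xpath("//div[@class='book-mid-info']/h4/a/text()"))
# extract_first() pulls out the first matched string (or None if nothing matched)
print(sel.xpath("//div[@class='book-mid-info']/h4/a/text()").extract_first())  # 圣墟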

Defining the Item
An Item is the container that holds the scraped data; it is used much like a dict.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


# works like an entity (model) class
class Python16Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy .Field()
    name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
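
Since a scrapy.Item behaves like a dict, here is a quick sketch of how its fields can be read and written (this snippet is for illustration only and is not part of the project files):

from python16.items import Python16Item

item = Python16Item()
item['name'] = '圣墟'       # example values
item['author'] = '辰东'
print(item['name'])          # 圣墟
print(item.get('content'))   # None, because 'content' has not been set yet
print(dict(item))            # {'name': '圣墟', 'author': '辰东'}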

pipelines.py is where we process the scraped data:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import csv


# pipeline class that writes the scraped data out
class Python16Pipeline(object):
    def __init__(self):
        # newline="" prevents blank lines from appearing between rows when writing the CSV
        self.f = open("起点中文网.csv", "w", newline="")
        self.writer = csv.writer(self.f)
        self.writer.writerow(["书名", "作者", "简介"])

    # process each item
    def process_item(self, item, spider):
        # write the item to the CSV file
        name = item["name"]
        author = item["author"]
        content = item["content"]
        self.writer.writerow([name, author, content])
        # the value returned here is what the next pipeline's process_item receives as item;
        # you can also return nothing, in which case the next pipeline receives None
        return item
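
Note that the pipeline above never closes the CSV file. Below is a minimal alternative sketch using Scrapy's open_spider/close_spider hooks; the class name QidianCsvPipeline and the utf-8 encoding are my own choices, not from the original project:

import csv


class QidianCsvPipeline(object):
    def open_spider(self, spider):
        # opened when the crawl starts; utf-8 is an assumption, adjust if needed
        self.f = open("起点中文网.csv", "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.f)
        self.writer.writerow(["书名", "作者", "简介"])

    def process_item(self, item, spider):
        self.writer.writerow([item["name"], item["author"], item["content"]])
        return item

    def close_spider(self, spider):
        # closed cleanly when the crawl finishes
        self.f.close()

If you swap this in, remember to point the ITEM_PIPELINES setting at the new class.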

Below is the code in settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for python16 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'python16'

SPIDER_MODULES = ['python16.spiders']
NEWSPIDER_MODULE = 'python16.spiders'
# set the log output level
LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent; this is where our own User-Agent is configured
#USER_AGENT = 'python16 (+http://www.yourdomain.com)'

# Obey robots.txt rules; decides whether the crawl respects the site's robots.txt
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers: these defaults can be replaced with our own
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'python16.middlewares.Python16SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'python16.middlewares.Python16DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# pipeline priority (e.g. 300 below): the smaller the number, the higher the priority (valid range 0-1000)
# no pipeline is enabled by default, so this setting has to be uncommented
# several pipelines can be configured here; their priorities decide the order they run in
# whatever a pipeline returns from process_item is passed as the item argument to the next pipeline
ITEM_PIPELINES = {
   'python16.pipelines.Python16Pipeline': 300,
}
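# A sketch of chaining a second pipeline by priority; "CleanPipeline" is a hypothetical
# name used only for illustration. Items would flow through Python16Pipeline (300) first,
# then through CleanPipeline (400):
# ITEM_PIPELINES = {
#    'python16.pipelines.Python16Pipeline': 300,
#    'python16.pipelines.CleanPipeline': 400,
# }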

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Testing:
Run scrapy crawl xs at the command line and the spider will start crawling.
Running it from the command line is not very convenient, though; instead we can add a start.py file to the project and launch the crawler by running that file:

from scrapy import cmdline

# the command string has to be split into a list of arguments
cmdline.execute("scrapy crawl xs".split())
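
Another way to start the crawl from a script is Scrapy's CrawlerProcess; this is just a sketch, and it assumes the spider module is python16.spiders.xs (the file genspider generated):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from python16.spiders.xs import XsSpider

# load the project's settings.py and run the spider in-process
process = CrawlerProcess(get_project_settings())
process.crawl(XsSpider)
process.start()  # blocks until the crawl is finished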