[Python crawler journey day 19:] Getting started with the Scrapy framework, day 1: scraping jokes from 百思不得姐 (budejie.com)

荏苒冬春去^

Category column: Crawler Beginner Study. Article tags: python, big data, middleware

Article link: https://blog.csdn.net/dinnersize/article/details/104941383

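It has been a while since I last studied web scraping, so today I am writing up my first steps with Scrapy.
Scrapy is aimed at large-scale crawling. It is simple and convenient to use, but a project is spread across several files, which are introduced below.
Writing a crawler involves a lot of work, for example:
sending network requests,
parsing data,
storing data,
countering anti-crawler measures (rotating IP proxies, setting request headers, and so on),
asynchronous requests, and so on.
Writing all of this from scratch every time wastes a lot of time. Scrapy packages up these basics, so building a crawler on top of it is more efficient, both in crawl speed and in development time. That is why crawlers that run at real volume in companies are commonly built with the Scrapy framework.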

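Architecture diagram

Scrapy Engine: the core of the Scrapy framework. It handles communication and passes data between the Spider, Item Pipeline, Downloader, and Scheduler.
Spider: sends the links to be crawled to the engine; the engine later hands the data fetched by the other components back to the spider, which parses out the data it wants. This is the part we developers write ourselves, because which links to crawl and which data on a page is needed are decided by the programmer.
Scheduler: receives the requests sent by the engine, queues and organizes them in a defined way, and decides the order in which requests are dispatched.
Downloader: receives download requests from the engine, fetches the corresponding data from the network, and returns it to the engine.
Item Pipeline: saves the data passed on by the Spider. Where exactly it is saved depends on the developer's needs.
Downloader Middlewares: middleware that can extend the communication between the downloader and the engine.
Spider Middlewares: middleware that can extend the communication between the engine and the spider.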
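
To make the data flow between these components concrete, here is a toy, synchronous simulation in plain Python. It is only a sketch of the idea: it does not use Scrapy, real Scrapy is asynchronous, and every name in it (FAKE_WEB, downloader, spider_parse, run_engine) and the example.com URLs are made up for illustration.

```python
from collections import deque

# Hypothetical "web": URL -> page content. Stands in for the real network.
FAKE_WEB = {
    "http://example.com/1": {"jokes": ["joke A", "joke B"], "next": "http://example.com/2"},
    "http://example.com/2": {"jokes": ["joke C"], "next": None},
}


def downloader(url):
    """Downloader: fetch the response for a request (here, a dict lookup)."""
    return FAKE_WEB[url]


def spider_parse(response):
    """Spider: yield scraped items and follow-up request URLs."""
    for joke in response["jokes"]:
        yield ("item", joke)
    if response["next"]:
        yield ("request", response["next"])


def run_engine(start_urls):
    """Engine: move requests and items between the other components."""
    scheduler = deque(start_urls)   # Scheduler: queue of pending requests
    seen = set(start_urls)          # avoid scheduling the same URL twice
    pipeline_output = []            # Item Pipeline: where items end up
    while scheduler:
        url = scheduler.popleft()
        response = downloader(url)
        for kind, value in spider_parse(response):
            if kind == "item":
                pipeline_output.append(value)
            elif value not in seen:
                seen.add(value)
                scheduler.append(value)
    return pipeline_output


collected = run_engine(["http://example.com/1"])
print(collected)
```

The spider only produces items and new requests; queuing, downloading, and storing are done by the other components, which is the division of labor described in the list above.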

1. Creating a project:

A Scrapy project has to be created with a command. First change into the directory where you want to keep the project, then run:

scrapy startproject [project name]

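The generated directory layout is as follows:
items.py: holds the models (Item classes) for the scraped data.
middlewares.py: holds the various middlewares.
pipelines.py: processes and stores the scraped items, for example by writing them to local disk.
settings.py: configuration for the crawler (for example request headers, the delay between requests, an IP proxy pool, and so on).
scrapy.cfg: the project's configuration file.
spiders package: all spiders written from now on are placed in this package.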

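Now for the concrete steps.
2. Creating a spider with a command:
Note: first change into the project directory:
cd qsbk
Then run the following command on the command line. The spider name must not be the same as the project name.

scrapy genspider qsbk_spider "budejie.com"

This creates a spider named qsbk_spider whose crawling is restricted to pages under the budejie.com domain.
The command generates code along these lines.
Walkthrough of the spider code:

import scrapy

class QsbkSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['budejie.com']
    start_urls = ['http://budejie.com/']

    def parse(self, response):
        pass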

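We could write this code entirely by hand instead of using the command; it is just more tedious to do so.
To create a Spider you must define a class that inherits from scrapy.Spider, and define three attributes and one method in it.

Notes:

name: the name of the spider. It must be unique within the project.
allowed_domains: the allowed domains. The spider only crawls pages under these domains; links to pages outside them are ignored automatically.
start_urls: the spider starts crawling from the URLs in this list.
parse: the engine passes the data fetched by the downloader to the spider, and the spider hands it to this parse method. The method name is fixed, since it is the default callback. The method has two jobs: first, extract the wanted data; second, generate the URL of the next request.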
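
As a rough illustration of what restricting a spider to allowed_domains means, the sketch below checks whether a URL's host is an allowed domain or one of its subdomains. It is a simplified stand-in written with the standard library, not Scrapy's actual filtering code, and the helper name is_allowed is hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["budejie.com"]
for url in ("http://www.budejie.com/1", "http://example.com/"):
    print(url, is_allowed(url, allowed))
```

Matching on "." + domain rather than on a bare suffix keeps a host such as notbudejie.com from being treated as part of budejie.com.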
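3. Modifying settings.py:

Before writing a crawler, remember to adjust the settings in settings.py. Two changes are strongly recommended.

Set ROBOTSTXT_OBEY to False. The settings.py generated by scrapy startproject sets it to True, which means the crawler obeys the robots exclusion protocol: Scrapy first downloads the site's robots.txt and then skips every request that the file disallows, so pages you want may never be fetched.
Add a User-Agent to DEFAULT_REQUEST_HEADERS. This makes the request look to the server like a normal browser request rather than one sent by a crawler.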
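
To show what obeying robots.txt amounts to, here is a small sketch using the standard library's urllib.robotparser. It is not how Scrapy implements the check internally; the rules, the user agent name "my-bot", and the example.com URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of being downloaded.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

blocked = parser.can_fetch("my-bot", "http://example.com/private/page")
permitted = parser.can_fetch("my-bot", "http://example.com/1")
print(blocked, permitted)
```

A crawler that obeys the file drops the first request and sends the second. With ROBOTSTXT_OBEY = False, Scrapy performs no such check.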

The finished spider code:
qsbk_spider.py

# -*- coding: utf-8 -*-
import scrapy
from qsbk.items import QsbkItem


class QsbkScrapySpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['budejie.com']
    start_urls = ['http://www.budejie.com/1']

    def parse(self, response):
        # xpath() returns a SelectorList; each element is a Selector
        duanzilis = response.xpath("//div[@class='j-r-list']//li")

        for duanzi in duanzilis:
            # get() returns None when nothing matches, so check before strip()
            author = duanzi.xpath(".//div[@class='u-txt']/a[@class='u-user-name']/text()").get()
            if author is not None:
                author = author.strip()
            duanzi_text = duanzi.xpath(".//div[@class='j-r-list-c-desc']/a/text()").get()
            # skip list entries that lack an author or a joke text
            if duanzi_text and author is not None:
                yield QsbkItem(author=author, duanzi_text=duanzi_text)

        # Request the next page once per response (outside the loop),
        # and only when a next-page link was actually found
        next_url = response.xpath('//div[@class="m-page m-page-sr m-page-sm"]/a[last()]/@href').get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
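
The spider has to turn the href of the next-page link into an absolute URL, and it has to cope with the case where no such link is found. The sketch below shows one way to do both with the standard library's urljoin. The helper name next_page_url and the sample href values are assumptions; the actual hrefs on the site are not shown in this post.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None if there is no link."""
    if not href:
        return None
    return urljoin(current_url, href)


current = "http://www.budejie.com/1"
print(next_page_url(current, "2"))    # relative href
print(next_page_url(current, "/2"))   # root-relative href
print(next_page_url(current, None))   # no next-page link found
```

Plain string concatenation such as base_url + next_url raises a TypeError when next_url is None and produces a double slash when the href starts with "/"; urljoin handles the relative and root-relative forms alike.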

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for qsbk project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'qsbk'

SPIDER_MODULES = ['qsbk.spiders']
NEWSPIDER_MODULE = 'qsbk.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'qsbk (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'qsbk.middlewares.QsbkSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'qsbk.middlewares.QsbkDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'qsbk.pipelines.QsbkPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

start.py
The spider can also be started from a terminal; here this script is used so that it can be run from PyCharm.

from scrapy import cmdline
cmdline.execute("scrapy crawl qsbk_spider".split())

pipelines.py
Three approaches are shown here.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# Method 1: the traditional approach, writing one JSON object per line with the json module
# import json
#
# class QsbkPipeline(object):
#     def __init__(self):
#         self.fp=open("duanzi.json","w",encoding="utf-8")
#
#     def open_spider(self,spider):
#         print("Spider started...")
#     def process_item(self, item, spider):
#         item_json=json.dumps(dict(item),ensure_ascii=False)
#         self.fp.write(item_json+'\n')
#         return item
#     def close_spider(self,spider):
#         self.fp.close()
#         print("Spider finished...")

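# Method 2: JsonItemExporter writes all items as one JSON array instead of one object per line
# from scrapy.exporters import JsonItemExporter
#
#
# class QsbkPipeline(object):
#     def __init__(self):
#         self.fp=open("duanzi.json","wb")
#         self.exporter=JsonItemExporter(self.fp,ensure_ascii=False,encoding='utf-8')
#     def open_spider(self,spider):
#         self.exporter.start_exporting()   # writes the opening "[" of the array
#         print("Spider started...")
#     def process_item(self, item, spider):
#         self.exporter.export_item(item)
#         return item
#     def close_spider(self,spider):
#         self.exporter.finish_exporting()  # writes the closing "]" of the array
#         self.fp.close()
#         print("Spider finished...")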

# Method 3: JsonLinesItemExporter writes one JSON object (dict) per line, like method 1

from scrapy.exporters import JsonLinesItemExporter


class QsbkPipeline(object):
    def __init__(self):
        # the exporter writes bytes, so the file is opened in binary mode
        self.fp=open("duanzi.json","wb")
        self.exporter=JsonLinesItemExporter(self.fp,ensure_ascii=False,encoding='utf-8')
    def open_spider(self,spider):
        print("Spider started...")
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
    def close_spider(self,spider):
        self.fp.close()
        print("Spider finished...")
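
The practical difference between the one-object-per-line output and the single-array output can be seen without Scrapy. The following sketch uses only the json module with two made-up items; it illustrates the two file layouts and is not the exporters' own code.

```python
import io
import json

# Made-up sample items with the same fields as QsbkItem.
items = [
    {"author": "user_a", "duanzi_text": "first joke"},
    {"author": "user_b", "duanzi_text": "second joke"},
]

# JSON Lines: one independent JSON object per line.
lines_buf = io.StringIO()
for item in items:
    lines_buf.write(json.dumps(item, ensure_ascii=False) + "\n")

# JSON array: a single document that is only valid once the closing "]" is written.
array_text = json.dumps(items, ensure_ascii=False)

# Reading back: JSON Lines can be parsed line by line ...
from_lines = [json.loads(line) for line in lines_buf.getvalue().splitlines()]
# ... while the array has to be parsed as a whole.
from_array = json.loads(array_text)

print(from_lines == from_array)
```

Because every line is complete on its own, the per-line layout stays readable even if the crawl is interrupted, which makes it the safer choice for long-running crawls; the array layout is convenient when the whole file is loaded at once.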

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class QsbkItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author=scrapy.Field()
    duanzi_text=scrapy.Field()

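Screenshot of the saved results: (image not included)
Parts of this post are adapted from the notes of the uploader 神奇的老黄, with cuts and changes; it is meant only as a reference for fellow learners and for my own review.
(Exhausted...)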
