My First Scrapy Spider: An Entry-Level Walkthrough of How It Works, with Full Source Code

I write a lot of small crawlers, and recently I came across a very simple site. I already covered the basic scraping approach in an earlier post, "Writing a small crawler to scrape 1,800 high-school essays": https://blog.csdn.net/hq606/article/details/115215425.
That version can batch-crawl dozens of index pages, and the post walks through the idea and the code in detail. After finishing it, a thought struck me: if this crawler is so easy to implement and there are so many pages, why not try Scrapy? I had wanted to learn Scrapy before but never really understood its approach. Still, my gut feeling was that this crawler could be dropped into the Scrapy framework fairly easily, and doing so would also be a good way to learn how the framework works.

Target site domain: http://www.zuowen.com
Index page URL pattern: http://www.zuowen.com/gaokaozw/manfen/index_ + page number + .shtml (a small snippet for building these follows the sample links below)
Each index page yields article URLs like the ones below, which are easy to extract:
http://www.zuowen.com/e/20201201/5fc644d87ed36.shtml
http://www.zuowen.com/e/20201201/5fc6453d1e41b.shtml
http://www.zuowen.com/e/20201201/5fc64653d9ea8.shtml
http://www.zuowen.com/e/20201201/5fc6451d5130f.shtml
http://www.zuowen.com/e/20201201/5fc646982fefb.shtml
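Based on the index-page pattern above, the list-page URLs can be built with a couple of lines; note that the first page appears to use index.shtml with no page number. A minimal sketch (the BASE constant and the index_url helper are my own names, not part of the project):

# Assumption from the URL pattern above: page 1 is index.shtml, page n >= 2 is index_n.shtml
BASE = 'http://www.zuowen.com/gaokaozw/manfen/index'

def index_url(page: int) -> str:
    return f'{BASE}.shtml' if page == 1 else f'{BASE}_{page}.shtml'

print(index_url(1))  # http://www.zuowen.com/gaokaozw/manfen/index.shtml
print(index_url(3))  # http://www.zuowen.com/gaokaozw/manfen/index_3.shtml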

Step 1: create the Scrapy project

# On the command line, first run cd desktop to switch to the desktop. That's a bad habit of mine; I just find it convenient to work there.
# Then run:  scrapy startproject zuowen

All of these files are generated automatically by Scrapy.
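For reference, the generated layout looks roughly like this (it may differ slightly between Scrapy versions):

zuowen/
├── scrapy.cfg          # deploy/config entry point
└── zuowen/
    ├── __init__.py
    ├── items.py        # item (data field) definitions
    ├── middlewares.py  # spider and downloader middlewares
    ├── pipelines.py    # item pipelines
    ├── settings.py     # project settings
    └── spiders/
        └── __init__.py # your spider files go in this folder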

__init__.py # initialization file; usually empty

middlewares.py # middlewares: I think of them as the framework's go-between components; I leave them alone for now

pipelines.py
# the item pipeline: it receives the scraped data and decides where it goes; here I keep it simple and just write each item to a text file and print a confirmation

settings.py # crawl settings: delays, concurrency, request headers, logging and so on

items.py # where you declare every field you want to scrape, one per piece of data. Only after writing the spider inside spiders do you know which fields you need, and those are what go into the item.

Here is the code.

__init__.py
# auto-generated and required, though it stays empty for now; presumably it exists for package or bookkeeping reasons

# wow, it's empty in here, nothing at all

Now for the spider code inside the spiders folder:
zuowen.py

import scrapy
import sys
#sys.path.append("..") 
#import items
from ..items import ZuowenItem
import requests
import bs4
from bs4 import BeautifulSoup
import time

# This spider works the same way as the one in https://blog.csdn.net/hq606/article/details/115215425: it mainly generates the page URLs and then parses them with BeautifulSoup

class GaokmfSpider(scrapy.Spider):  # class skeleton generated by Scrapy
    name = 'zuowen'  # the spider's name; pick whatever you like
    allowed_domains = ['www.zuowen.com']  # the allowed domain (domain only, no http:// prefix)
    
    def start_requests(self):  # this method produces all the URLs to be crawled
        start_urls = []  # will hold the article URLs,
                         # e.g. http://www.zuowen.com/e/20201201/5fc644d87ed36.shtml
                         # I build another list called result below, so this one is only a backup
        urlhead = 'http://www.zuowen.com/gaokaozw/manfen/index_'
        for n in range(1, 10):  # how many index pages to crawl; range(1, 10) covers pages 1-9 as a test
            if n == 1:
                url = urlhead[:-1] + '.shtml'  # the first index page has no page number
            else:
                url = urlhead + str(n) + '.shtml'  # pages 2 and up include the page number
            html = requests.get(url)  # fetch the index page with a GET request
            # print(html.status_code)  # check the status code (200 means the request succeeded)
            html.encoding = 'gbk'
            content = html.text

            
            # BeautifulSoup's built-in parser is html.parser; lxml also works and is more
            # forgiving, but it is a separate third-party install
            soup = BeautifulSoup(content, "lxml")
            words = soup.find_all('div', class_="artbox_l_c")
            # print(words)
            # global result
            # global drl
            result = []
            
            for div in words:
                drl = div.find('a')
                # print(type(drl))
                result.append(drl['href'])  # collect the 20 article URLs on this index page
                # print(drl['href'])
                
            for i in result:
                url = i
                start_urls.append(i)  # keep a copy in start_urls, purely as a backup
                # print(start_urls)
                yield scrapy.Request(url=url, callback=self.parse)
                # the way I understand it: yield hands each URL to a scrapy.Request,
                # and the framework downloads it and calls self.parse with the result
                
    def parse(self, response):  # by the time parse runs, the framework has already executed
                                # the Request for us; response is the downloaded page
        item = ZuowenItem()  # ZuowenItem holds the fields to scrape; it behaves much like a dict
        # it contains the two things I want: the title and the essay body cont2
        # title = scrapy.Field()
        # cont2 = scrapy.Field()
        # response.encoding = 'gbk'
        soup = bs4.BeautifulSoup(response.text, 'lxml')  # parse the article page with BeautifulSoup
        result2 = []
        words = soup.find_all('p')  # the essay body: a list of <p> tags, one sentence each
        item['title'] = soup.find('h1').text  # the title sits in a single <h1> tag
        # item['cont1'] = titl.text
       
        
        for th in words[1:-11]:  # slice off the leading/trailing <p> tags that only hold ad text
            result2.append(th.text)  # and collect the essay text into a list
        # items['titl'] = result2.append(titl.text)
            
        
        item['cont2'] = result2[:]  # all the essay sentences go into item['cont2']
        # print({item['title']: item['cont2']})
        yield item
#urlhead = 'http://www.zuowen.com/gaokaozw/manfen/index_'

#a=GaokmfSpider(scrapy.Spider)
#a.start_requests()

#print(a.start_requests())
    
    #a.parse(i)
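As an aside, a more "Scrapy-native" way to do the same job is to let the framework download the index pages too, instead of calling requests inside start_requests, and to use response.css instead of BeautifulSoup. The sketch below is only an illustration based on the selectors used above (div.artbox_l_c, h1, p); the class name GaokmfNativeSpider is made up, and the slice offsets may need adjusting:

import scrapy
from ..items import ZuowenItem

class GaokmfNativeSpider(scrapy.Spider):
    """Hypothetical variant that relies on Scrapy's own scheduler and downloader."""
    name = 'zuowen_native'
    allowed_domains = ['www.zuowen.com']
    # pages 1-9, matching range(1, 10) in the spider above
    start_urls = (
        ['http://www.zuowen.com/gaokaozw/manfen/index.shtml'] +
        [f'http://www.zuowen.com/gaokaozw/manfen/index_{n}.shtml' for n in range(2, 10)]
    )

    def parse(self, response):
        # pull the article links off each index page and follow them
        for href in response.css('div.artbox_l_c a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # same two fields as before: title from <h1>, body text from the <p> tags
        item = ZuowenItem()
        item['title'] = response.css('h1::text').get(default='').strip()
        item['cont2'] = response.css('p::text').getall()[1:-11]  # rough equivalent of the ad-trimming slice
        yield item

The main difference is that every request now goes through Scrapy's scheduler, so the concurrency and delay settings in settings.py apply to the index pages as well.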

To keep things simple, the item has just two fields, title and cont2: the essay title and the essay body, which are exactly what each crawl should produce. (An Item is the container for the scraped data; it is used much like a Python dict, but adds a safeguard: assigning to a field you never declared raises an error instead of silently creating it.)

Here is items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ZuowenItem(scrapy.Item):
    # define the fields for your item here like:
    # scrapy.Field()  -- as far as I can tell, the item behaves like a dict
    title = scrapy.Field()  # this is like declaring one key of that dict
    cont2 = scrapy.Field()  # and this declares another key
    
    pass
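To see that safeguard in action, you can poke at the item in a plain Python session. This is just a quick sketch, assuming the zuowen package is importable (for example, when run from the project's top-level folder):

from zuowen.items import ZuowenItem

item = ZuowenItem()
item['title'] = 'a sample title'                 # fine: 'title' is a declared field
item['cont2'] = ['sentence one', 'sentence two']

print(item['title'])   # reads like a dict
print(dict(item))      # can be converted to a plain dict

item['author'] = 'anonymous'   # raises KeyError: 'author' was never declared as a field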

pipelines.py (the item pipeline)

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
#from itemadapter import ItemAdapter

# a pipeline that writes the scraped data to a file

class ZuowenPipeline:
    def process_item(self, item, spider):  # write each item's text into a txt file
        
        with open('爬取作文.txt', 'a', encoding='gbk') as f:  # the filename means "scraped essays"
            f.write('\n' + str({item['title']: item['cont2']}) + '\n\n')  # written as a dict literal
            print('写入成功')  # prints "written successfully" for each item
        return item
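I left the pipeline as it is to keep the original behaviour, but for reference, a slightly more robust pattern is to open the file once when the spider starts and close it when the spider finishes, writing one JSON object per line. This is only a sketch (the class name and the zuowen.jl filename are mine), not what the project above uses:

import json

class ZuowenJsonLinesPipeline:
    """Hypothetical alternative: one JSON object per line, file opened only once."""

    def open_spider(self, spider):
        self.file = open('zuowen.jl', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps({'title': item['title'], 'cont2': item['cont2']},
                          ensure_ascii=False)
        self.file.write(line + '\n')
        return item

If you were to use it, it would need to be registered in ITEM_PIPELINES, just like ZuowenPipeline is in settings.py further down.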


middlewares.py  # auto-generated by Scrapy; I don't dare touch it

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class ZuowenSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ZuowenDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

settings.py  # crawl settings: concurrency, delays, speed, logging and so on
# the handful of uncommented lines below are parameters I set based on what I read online

# Scrapy settings for zuowen project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zuowen'

SPIDER_MODULES = ['zuowen.spiders']
NEWSPIDER_MODULE = 'zuowen.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 QIHU 360SE'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs

DOWNLOAD_DELAY = 0

CONCURRENT_REQUESTS = 100

CONCURRENT_REQUESTS_PER_DOMAIN = 100

CONCURRENT_REQUESTS_PER_IP = 100

COOKIES_ENABLED = False

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   'Accept': '*/*',
   'Accept-Language': 'zh-CN,zh;q=0.9',
   'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 QIHU 360SE'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zuowen.middlewares.ZuowenSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'zuowen.middlewares.ZuowenDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'zuowen.pipelines.ZuowenPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
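One caveat about the values above: 100 concurrent requests with zero delay is quite aggressive for a small site. If you want to crawl more politely, a gentler configuration might look like the sketch below; these numbers are only suggestions of mine, not what I actually used:

# politer alternative settings (sketch only; the project above uses the aggressive values)
DOWNLOAD_DELAY = 0.5
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10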

# Finally, drop a main.py file into the top-level project folder so the spider can be launched by running it
main.py  # note: I added this file myself; Scrapy does not generate it

from scrapy import cmdline
cmdline.execute(['scrapy','crawl','zuowen','-s','LOG_FILE=all.log'])


Writing main.py this way sends the log output to the all.log file instead of flooding the console with a wall of red log lines. The '-s', 'LOG_FILE=all.log' arguments can also be left out.
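Equivalently, you can skip main.py entirely and run the spider from a terminal in the project root; the standard Scrapy commands are:

scrapy crawl zuowen -s LOG_FILE=all.log
scrapy crawl zuowen -o zuowen.json   # optional: let Scrapy itself export the items as JSON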

Now let's see how it runs.

The Scrapy spider fires up and crawls away happily. So satisfying!
