Python Web Scraping Self-Study Guide: Development Steps

良木66

Article link: https://blog.csdn.net/qq_44503987/article/details/105046846

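Recap
The previous posts showed how to scrape the information we want from the web. Now we only need to move that test code into a spider under spiders to get a real crawler.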

I. Define the Item class

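This is the first task in building the crawler. The Item class only defines the attributes (values) the project needs to scrape, such as a blog post's title, category, and description.
My blog page looks like this:
[Image: the author's blog page]
This tutorial scrapes the title, view count, and creation date of each of my blog posts.
Open the project directory you created and open items.py:

The contents of items.py are as follows: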

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class DemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

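The class definition line, class DemoItem(scrapy.Item), shows that every item class must inherit from scrapy.Item. The next step is to define an attribute for each piece of information to scrape; each attribute is a scrapy.Field object.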

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class DemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title of the blog post.
    name = scrapy.Field()
    # Number of times the post has been read.
    red_number = scrapy.Field()
    # Date the post was published.
    publish_date = scrapy.Field()

This item class is just a data transfer object (DTO), so defining it is simple.

II. Write the Spider class

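The spider class is the key step in developing a crawler: it uses XPath or CSS selectors to extract the information of interest from HTML pages.
Scrapy provides the scrapy genspider command for creating a spider. Its syntax is: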

scrapy genspider [options] <name> <domain>  # [] marks optional arguments, <> marks required ones

Open a command prompt (DOS window), use cd /d to change into the project directory, and run the following command:

scrapy genspider demo_spider csdn.net

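After running the command, demo_spider.py appears under demo\spiders in the demo project:
[Image: demo_spider.py in the demo\spiders directory]
The file contains the following: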

# -*- coding: utf-8 -*-
import scrapy
class DemoSpiderSpider(scrapy.Spider):
    name = 'demo_spider'
    allowed_domains = ['csdn.net']
    start_urls = ['http://csdn.net/']
    def parse(self, response):
        pass

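The program above is the spider class template. The name attribute specifies the spider's name; allowed_domains restricts the domains the spider may crawl; start_urls lists the page URLs the spider crawls automatically at startup.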
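To make the effect of allowed_domains concrete, here is a minimal sketch of the kind of host check it implies. The helper name is_allowed is hypothetical and this is not Scrapy's own code; it only assumes that a URL is in scope when its host equals a listed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is a listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed_domains = ["csdn.net"]
print(is_allowed("https://me.csdn.net/qq_44503987", allowed_domains))    # subdomain of csdn.net
print(is_allowed("https://blog.csdn.net/qq_44503987", allowed_domains))  # subdomain of csdn.net
print(is_allowed("https://example.com/", allowed_domains))               # different domain
```

Under that assumption, both me.csdn.net and blog.csdn.net fall within csdn.net, while example.com does not.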
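A spider must inherit from scrapy.Spider and override its parse(self, response) method, as the modified demo_spider.py below shows. The class contains no code for sending requests or receiving responses, and that is the charm of Scrapy: once every page URL to crawl is listed in start_urls, Scrapy's downloader fetches the pages from the network and passes each response to parse(self, response) as the response argument.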
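This division of labour can be pictured with a toy simulation. Everything here (FakeResponse, ToySpider, fake_download, run) is a hypothetical stand-in written for illustration; real Scrapy schedules requests asynchronously and does far more. The point is only that the spider supplies start_urls and parse(), while the framework does the fetching and calls parse() with each response.

```python
class FakeResponse:
    """Stand-in for a downloaded page; real Scrapy passes its own Response object."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


class ToySpider:
    name = "toy"
    start_urls = ["https://example.com/a", "https://example.com/b"]

    def parse(self, response):
        # The spider only describes what to extract from a response.
        yield {"url": response.url, "length": len(response.body)}


def fake_download(url):
    # Pretend to fetch the page; no network access happens here.
    return FakeResponse(url, body="<html>" + url + "</html>")


def run(spider):
    """Toy 'engine': download every start URL and feed the response to parse()."""
    items = []
    for url in spider.start_urls:
        response = fake_download(url)
        items.extend(spider.parse(response))
    return items


for item in run(ToySpider()):
    print(item)
```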
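So how do we modify this spider?
Only three steps are needed:
1. Modify start_urls, putting the URLs of the pages you want to crawl inside the square brackets.
2. Modify parse(self, response) to extract the information of interest with XPath or CSS selectors.
3. Save the file.
Modify demo_spider.py as follows: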

# -*- coding: utf-8 -*-
import scrapy
from demo.items import DemoItem
class DemoSpiderSpider(scrapy.Spider):
    name = 'demo_spider'
    allowed_domains = ['csdn.net']
    start_urls = ['https://me.csdn.net/qq_44503987']
    def parse(self, response):
        for info in response.xpath('//div[@class="my_tab_page_con"]/dl[@class="tab_page_list"]'):
            item = DemoItem()
            # Title of the blog post; default='' avoids calling strip() on None when nothing matches.
            item['name'] = info.xpath('./dt/h3/a[@class="sub_title"]/text()').extract_first(default='').strip()
            # Number of times the post has been read.
            item['red_number'] = info.xpath('./dd[@class="tab_page_con_b clearfix"]/div[@class="tab_page_b_l fl"]/label/em/text()').extract_first(default='').strip()
            # Date the post was published.
            item['publish_date'] = info.xpath('./dd[@class="tab_page_con_b clearfix"]/div[@class="tab_page_b_r fr"]/text()').extract_first(default='').strip()
            yield item

To see why the XPath expressions are written this way, look at the source of the blog page.
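As an illustration of how those expressions map onto the page structure, the sketch below runs similar relative paths against a small hand-written fragment. The fragment is an assumption reconstructed from the XPath expressions in the spider, not the real page source, and the values in it are made up. It uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors, so only a subset of XPath is available: there is no text() step, and .text is read from the matched element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the structure the spider's XPath expects.
FRAGMENT = """
<div class="my_tab_page_con">
  <dl class="tab_page_list">
    <dt><h3><a class="sub_title"> Sample post title </a></h3></dt>
    <dd class="tab_page_con_b clearfix">
      <div class="tab_page_b_l fl"><label><em> 123 </em></label></div>
      <div class="tab_page_b_r fr"> 2020-03-23 </div>
    </dd>
  </dl>
</div>
"""

root = ET.fromstring(FRAGMENT.strip())

rows = []
# Outer loop: one <dl class="tab_page_list"> per blog post.
for info in root.findall("./dl[@class='tab_page_list']"):
    # Inner paths are relative to the current <dl>, like the './...' paths in the spider.
    name = info.find("./dt/h3/a[@class='sub_title']").text.strip()
    reads = info.find("./dd/div[@class='tab_page_b_l fl']/label/em").text.strip()
    date = info.find("./dd/div[@class='tab_page_b_r fr']").text.strip()
    rows.append((name, reads, date))

print(rows)
```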
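The program changes the start_urls list to redefine the first page the spider crawls, then overrides the spider's parse(self, response) method. extract_first() returns the first matched string (by default None when nothing matches), and strip() removes leading and trailing whitespace from a string. The last line uses a yield statement to hand the item to the Scrapy engine. return cannot be used here, because return would end the whole method and stop the loop, whereas yield turns the method into a generator. If it is still unclear why yield is used, see my post on the difference between yield and return. After the spider yields an item, the Scrapy engine collects it and passes it to the project's pipeline, which brings us to the third step of developing with Scrapy.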
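The difference between return and yield inside a loop is easy to check in plain Python. The two functions below are hypothetical examples, unrelated to Scrapy's API: the first leaves the loop on the first pass, while the second is a generator that produces one value per iteration.

```python
def first_only(names):
    for name in names:
        # return ends the function on the first iteration.
        return {"name": name}


def one_per_name(names):
    for name in names:
        # yield hands back one value and resumes the loop on the next request.
        yield {"name": name}


names = ["post A", "post B", "post C"]
print(first_only(names))          # a single dict, for "post A" only
print(list(one_per_name(names)))  # three dicts, one per name
```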

III. Write the pipelines.py file

This file is responsible for writing all scraped data to a file or a database.
Below is the unmodified pipelines.py.

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
class DemoPipeline(object):
    def process_item(self, item, spider):
        return item

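In process_item(self, item, spider), the item and spider arguments are both supplied by the Scrapy framework: the engine passes them in when it calls the method, so pipelines.py does not need to import anything for them.
Now modify pipelines.py. To keep things simple, it only prints the item data to the console (writing to a database is covered later).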

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
class DemoPipeline(object):
    def process_item(self, item, spider):
        print("Blog's name:" + item['name'] + ";", end="")
        print("The number of reads:" + item['red_number'] + ";", end="")
        print("The date of publication:", item['publish_date'])
        # Return the item so that any later pipeline component still receives it.
        return item

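The Scrapy engine automatically passes every item captured by the spider to process_item(self, item, spider) one at a time, so the method only has to handle a single item: no matter how many items the crawler scrapes in total, each call to process_item(self, item, spider) deals with just one.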
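For the file-writing case mentioned earlier, a minimal sketch of a pipeline that writes each item to a JSON Lines file might look like this. The class name JsonLinesPipeline and the file name items.jl are assumptions for illustration; open_spider and close_spider are the optional hooks Scrapy calls on a pipeline when the spider starts and finishes. The loop at the bottom only imitates the engine calling process_item once per item, using plain dicts in place of DemoItem objects.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write one JSON object per line."""

    def open_spider(self, spider):
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects alike.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline


# Imitate the engine: one process_item call per scraped item.
pipeline = JsonLinesPipeline()
pipeline.open_spider(spider=None)
for scraped in [{"name": "post A", "red_number": "10"}, {"name": "post B", "red_number": "20"}]:
    pipeline.process_item(scraped, spider=None)
pipeline.close_spider(spider=None)
```

To use such a class in the project, it would also have to be registered in ITEM_PIPELINES in settings.py, for example under 'demo.pipelines.JsonLinesPipeline'.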
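After these three steps the Scrapy-based crawler is basically complete. What remains is some simple configuration in settings.py, such as changing the User-Agent.
Open the project's settings file, settings.py: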

# -*- coding: utf-8 -*-
# Scrapy settings for demo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'demo'
SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'demo (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'demo.middlewares.DemoSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'demo.middlewares.DemoDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'demo.pipelines.DemoPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

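Modify the file as follows (the comments are removed here; in your own file, just find and edit the corresponding uncommented settings):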

# -*- coding: utf-8 -*-
BOT_NAME = 'demo'
SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'demo.pipelines.DemoPipeline': 300,
}
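
As an alternative sketch, Scrapy also has a dedicated USER_AGENT setting (it appears commented out in the generated file above), so the User-Agent does not have to go through DEFAULT_REQUEST_HEADERS. The lines below assume the same demo project. The trimmed file above no longer contains ROBOTSTXT_OBEY from the generated template, so it is restored here on the assumption that robots.txt should still be respected.

```python
# settings.py (sketch): same project, using the dedicated USER_AGENT setting.
BOT_NAME = 'demo'
SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'

USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = True  # kept from the generated template

ITEM_PIPELINES = {
    'demo.pipelines.DemoPipeline': 300,
}
```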

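Summary

Looking back at the process above, the core work of developing a crawler with Scrapy comes down to three steps:
1. Define the item class. An item is just a DTO, so defining it is simple: it is a plain container, and you only need to declare the attributes you want to scrape.
2. Develop the spider class. This step is the core, the core, the core! Put plainly, it does the scraping: you can use XPath or CSS selectors, and the scraped information ends up wrapped in item objects.
3. Develop the pipeline, which processes the scraped information. You can store it in a database or print it to the console; what happens to the data is up to you.
That covers the development steps.
To find out what happens next, stay tuned for the next installment.

Note: save all files as UTF-8 to avoid garbled characters.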
