Assumed prerequisites:
Python, Scrapy, MongoDB and the related environment are already installed on your PC.
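If any of these are still missing, a minimal setup sketch looks like this (assuming pip is available; MongoDB itself is installed separately, e.g. via the official installer or a package manager, and must be running on its default port 27017 before the crawl):
pip install scrapy pymongo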
1. Create the project
scrapy startproject tutorial
The project folder after creation looks like this:
The two files in the red boxes in the screenshot are created in later steps; the other files are generated by the command.
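For reference, the layout generated by scrapy startproject looks roughly like this (quotes.py and demo.py, the two highlighted files, are only added in the later steps):
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (step 3)
        middlewares.py
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings (step 5)
        spiders/          # spiders go here (step 2)
            __init__.py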
2. Create the spider
Enter the tutorial folder just created and run the genspider command:
scrapy genspider quotes quotes.toscrape.com
After it finishes, a new quotes.py file appears in the spiders folder (red box 1 in the screenshot above). Edit quotes.py:
import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
Notes:
- name is the unique name of the spider within the project and is used to tell different Spiders apart.
- allowed_domains lists the domains the spider is allowed to crawl; initial or follow-up request URLs outside these domains are filtered out.
- start_urls contains the list of URLs the Spider crawls at start-up; the initial requests are built from it.
- parse is a method of the Spider responsible for parsing the response. By default, once the requests built from the URLs in start_urls have finished downloading, the returned response is passed to this method as its only argument. The method parses the response, extracts data, and generates further requests to process.
- CSS selectors are used to extract the elements (a detailed HTML analysis is outside the scope of this article); see the scrapy shell sketch after this list for a quick way to try them out.
- QuoteItem is the item created in the next step; the content parsed from each quote on the page is assigned to a QuoteItem, where text, author and tags are the fields to extract (tags uses extract() rather than extract_first() because a quote usually carries several tags).
- next_page is the link to the next page obtained with a CSS selector; urljoin() builds an absolute url from it, and a new request is constructed from url and callback, with the callback still being the current parse() method. This repeats page after page until the last one, where the if next_page is not None check ends the loop.
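To experiment with these CSS selectors before writing them into the spider, Scrapy's interactive shell can be used; the selectors below mirror the ones in parse() and are run against the same start page:
scrapy shell "http://quotes.toscrape.com/"
# inside the shell:
>>> quote = response.css('.quote')[0]
>>> quote.css('.text::text').extract_first()
>>> quote.css('.author::text').extract_first()
>>> quote.css('.tags .tag::text').extract()
>>> response.css('.pager .next a::attr(href)').extract_first()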
3. Create the Item
items.py looks like this:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
It inherits from scrapy.Item and defines three fields: text, author and tags.
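An Item behaves much like a dict: declared fields are set and read with bracket syntax, and assigning an undeclared field raises a KeyError. A small sketch (the values are made up for illustration):
from tutorial.items import QuoteItem

item = QuoteItem()
item['text'] = 'Some quote text'   # declared field: OK
item['author'] = 'Some author'
item['tags'] = ['tag1', 'tag2']
print(item['text'])                # read back like a dict
print(dict(item))                  # plain dict, as the MongoDB pipeline builds it
# item['rating'] = 5               # would raise KeyError: field not declared in QuoteItem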
4. Create the Item Pipeline
pipelines.py looks like this:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
import pymongo
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class TutorialPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_port, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_port = mongo_port
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_port=crawler.settings.get('MONGO_PORT'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host=self.mongo_uri,
                                          port=self.mongo_port)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        print('\nInsert finished, querying the data...')
        # select the database
        db = self.client.tutorial
        # select the collection (similar to a table)
        collection = db.QuoteItem
        # query all documents
        results = collection.find()
        for ret in results:
            print('data == ' + str(ret))
        self.client.close()
An Item Pipeline is the project's item-processing pipeline. Once an Item is generated, it is automatically sent to the Item Pipeline for processing. Item Pipelines are commonly used to:
* clean HTML data
* validate the scraped data and check the scraped fields
* deduplicate and discard repeated content
* store the scraped results in a database
Implementing an Item Pipeline is simple: define a class and implement the process_item() method. Once the pipeline is enabled, this method is called automatically. process_item() must either return a dict or Item object containing the data, or raise a DropItem exception. It takes two arguments: item, the Item generated by the Spider each time, and spider, the Spider instance. Two pipelines are defined here:
TutorialPipeline: checks whether the item's text attribute is present; if not, it raises a DropItem exception; if it is, it checks whether the text is longer than 50 characters and, if so, truncates it and appends an ellipsis, then returns the item.
MongoPipeline: reads the MongoDB configuration from settings.py, connects to MongoDB and inserts the parsed item data; once the insertion is done, it queries the data back for verification.
- from_crawler: a class method marked with @classmethod, a form of dependency injection. Its argument is crawler, through which every entry of the global configuration can be read. In settings.py we define MONGO_URI, MONGO_PORT and MONGO_DB to specify the address, port and database name for the MongoDB connection; the method reads these settings and returns an instance of the class. In short, it exists to pull configuration out of settings.py.
- open_spider: called when the Spider is opened; here the initialization is done (creating the MongoDB client and selecting the database).
- close_spider: called when the Spider is closed; here the database connection is closed. Before closing the client, the previously inserted data is queried back for verification; a standalone version of that check is sketched after this list.
- process_item(): performs the actual insert of the data.
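The check done in close_spider can also be run outside the crawl. A minimal standalone sketch (assuming MongoDB on 127.0.0.1:27017 and the database/collection names used in this tutorial):
# verify_quotes.py - inspect the data inserted by MongoPipeline
import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
db = client['tutorial']           # database name from MONGO_DB
collection = db['QuoteItem']      # collection named after the item class

print('total documents:', collection.count_documents({}))
for doc in collection.find().limit(5):   # print a few documents
    print(doc)

client.close()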
5. Configure settings.py
# Scrapy settings for tutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'tutorial.middlewares.TutorialSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.MongoPipeline': 400,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# MongoDB settings
MONGO_URI = '127.0.0.1'
MONGO_PORT = 27017
MONGO_DB = 'tutorial'
The key additions are the following two pieces of configuration (everything else is generated by default):
1. Define the MongoDB connection information MONGO_URI, MONGO_PORT and MONGO_DB:
# MongoDB settings
MONGO_URI = '127.0.0.1'
MONGO_PORT = 27017
MONGO_DB = 'tutorial'
2. Assign the ITEM_PIPELINES dictionary. The keys are the pipeline class paths and the values are call priorities; the priority is a number, and the smaller the number, the earlier the corresponding pipeline is called. Because TutorialPipeline has the lower number (300), it runs first, and any item it drops with DropItem never reaches MongoPipeline.
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.MongoPipeline': 400,
}
6. Create a runner file demo.py
It is used to run the spider:
# coding=utf-8
from scrapy import cmdline

cmdline.execute("scrapy crawl quotes".split())
7. Run
Execute the command: python demo.py
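Alternatively, the spider can be started directly with Scrapy's own command-line tool from the project root (the directory containing scrapy.cfg):
scrapy crawl quotes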
- Run result:
Done.