Assumed prerequisites:
Python, Scrapy, MongoDB and the related environment are already installed on your PC.
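If any of these are still missing, a minimal setup sketch looks like this (assuming pip is available; MongoDB itself is installed separately, e.g. via the official installer or a package manager, and must be running on its default port 27017 before the crawl):
pip install scrapy pymongo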
1. Create the project
scrapy startproject tutorial
The project folder after creation looks like this:
The two files in the red boxes in the screenshot are created in later steps; the other files are generated by the command.
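For reference, the layout generated by scrapy startproject looks roughly like this (quotes.py and demo.py, the two highlighted files, are only added in the later steps):
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (step 3)
        middlewares.py
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings (step 5)
        spiders/          # spiders go here (step 2)
            __init__.py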
2. Create the spider
Enter the tutorial folder just created and run the genspider command:
scrapy genspider quotes quotes.toscrape.com
After it finishes, a new quotes.py file appears in the spiders folder (red box 1 in the screenshot above). Edit quotes.py:
import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
Notes:
- name is the unique name of the spider within the project and is used to tell different Spiders apart.
- allowed_domains lists the domains the spider is allowed to crawl; initial or follow-up request URLs outside these domains are filtered out.
- start_urls contains the list of URLs the Spider crawls at start-up; the initial requests are built from it.
- parse is a method of the Spider responsible for parsing the response. By default, once the requests built from the URLs in start_urls have finished downloading, the returned response is passed to this method as its only argument. The method parses the response, extracts data, and generates further requests to process.
- CSS selectors are used to extract the elements (a detailed HTML analysis is outside the scope of this article); see the scrapy shell sketch after this list for a quick way to try them out.
- QuoteItem is the item created in the next step; the content parsed from each quote on the page is assigned to a QuoteItem, where text, author and tags are the fields to extract (tags uses extract() rather than extract_first() because a quote usually carries several tags).
- next_page is the link to the next page obtained with a CSS selector; urljoin() builds an absolute url from it, and a new request is constructed from url and callback, with the callback still being the current parse() method. This repeats page after page until the last one, where the if next_page is not None check ends the loop.
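To experiment with these CSS selectors before writing them into the spider, Scrapy's interactive shell can be used; the selectors below mirror the ones in parse() and are run against the same start page:
scrapy shell "http://quotes.toscrape.com/"
# inside the shell:
>>> quote = response.css('.quote')[0]
>>> quote.css('.text::text').extract_first()
>>> quote.css('.author::text').extract_first()
>>> quote.css('.tags .tag::text').extract()
>>> response.css('.pager .next a::attr(href)').extract_first()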
3. Create the Item
items.py looks like this:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
It inherits from scrapy.Item and defines three fields: text, author and tags.
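An Item behaves much like a dict: declared fields are set and read with bracket syntax, and assigning an undeclared field raises a KeyError. A small sketch (the values are made up for illustration):
from tutorial.items import QuoteItem

item = QuoteItem()
item['text'] = 'Some quote text'   # declared field: OK
item['author'] = 'Some author'
item['tags'] = ['tag1', 'tag2']
print(item['text'])                # read back like a dict
print(dict(item))                  # plain dict, as the MongoDB pipeline builds it
# item['rating'] = 5               # would raise KeyError: field not declared in QuoteItem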
4. Create the Item Pipeline
pipelines.py looks like this:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
import pymongo
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class TutorialPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_port, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_port = mongo_port
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_port=crawler.settings.get('MONGO_PORT'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host=self.mongo_uri,
                                          port=self.mongo_port)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        print('\nInsert finished, querying the data...')
        # select the database
        db = self.client.tutorial
        # select the collection (similar to a table)
        collection = db.QuoteItem
        # query all documents
        results = collection.find()
        for ret in results:
            print('data == ' + str(ret))
        self.client.close()
An Item Pipeline is the project's item-processing pipeline. Once an Item is generated, it is automatically sent to the Item Pipeline for processing. Item Pipelines are commonly used to:
* clean HTML data
* validate the scraped data and check the scraped fields
* deduplicate and discard repeated content
* store the scraped results in a database
Implementing an Item Pipeline is simple: define a class and implement the process_item() method. Once the pipeline is enabled, this method is called automatically. process_item() must either return a dict or Item object containing the data, or raise a DropItem exception. It takes two arguments: item, the Item generated by the Spider each time, and spider, the Spider instance. Two pipelines are defined here:
TutorialPipeline: checks whether the item's text attribute is present; if not, it raises a DropItem exception; if it is, it checks whether the text is longer than 50 characters and, if so, truncates it and appends an ellipsis, then returns the item.
MongoPipeline: reads the MongoDB configuration from settings.py, connects to MongoDB and inserts the parsed item data; once the insertion is done, it queries the data back for verification.
- from_crawler: a class method marked with @classmethod, a form of dependency injection. Its argument is crawler, through which every entry of the global configuration can be read. In settings.py we define MONGO_URI, MONGO_PORT and MONGO_DB to specify the address, port and database name for the MongoDB connection; the method reads these settings and returns an instance of the class. In short, it exists to pull configuration out of settings.py.
- open_spider: called when the Spider is opened; here the initialization is done (creating the MongoDB client and selecting the database).
- close_spider: called when the Spider is closed; here the database connection is closed. Before closing the client, the previously inserted data is queried back for verification; a standalone version of that check is sketched after this list.
- process_item(): performs the actual insert of the data.
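The check done in close_spider can also be run outside the crawl. A minimal standalone sketch (assuming MongoDB on 127.0.0.1:27017 and the database/collection names used in this tutorial):
# verify_quotes.py - inspect the data inserted by MongoPipeline
import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
db = client['tutorial']           # database name from MONGO_DB
collection = db['QuoteItem']      # collection named after the item class

print('total documents:', collection.count_documents({}))
for doc in collection.find().limit(5):   # print a few documents
    print(doc)

client.close()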
5. Configure settings.py
# Scrapy settings for tutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'tutorial.middlewares.TutorialSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.MongoPipeline': 400,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# MongoDB settings
MONGO_URI = '127.0.0.1'
MONGO_PORT = 27017
MONGO_DB = 'tutorial'
The key additions are the following two pieces of configuration (everything else is generated by default):
1. Define the MongoDB connection information MONGO_URI, MONGO_PORT and MONGO_DB:
# MongoDB settings
MONGO_URI = '127.0.0.1'
MONGO_PORT = 27017
MONGO_DB = 'tutorial'
2. Assign the ITEM_PIPELINES dictionary. The keys are the pipeline class paths and the values are call priorities; the priority is a number, and the smaller the number, the earlier the corresponding pipeline is called. Because TutorialPipeline has the lower number (300), it runs first, and any item it drops with DropItem never reaches MongoPipeline.
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.MongoPipeline': 400,
}
6. Create a runner file demo.py
It is used to run the spider:
# coding=utf-8
from scrapy import cmdline

cmdline.execute("scrapy crawl quotes".split())
7. Run
Execute the command: python demo.py
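Alternatively, the spider can be started directly with Scrapy's own command-line tool from the project root (the directory containing scrapy.cfg):
scrapy crawl quotes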
- Run result:
Done.