1. Why use adbapi
Traditional synchronous database operations can block the Scrapy crawler while it waits for each database call to finish. adbapi lets Scrapy talk to the database asynchronously: while a write is still in flight, Scrapy can keep doing other work, such as fetching more pages or parsing data. This asynchronous model noticeably improves throughput when the project has to persist large amounts of data.
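As a quick illustration (a standalone sketch using the same placeholder credentials as below, not code from the project), an adbapi query returns a twisted Deferred immediately instead of blocking until MySQL replies:

from twisted.enterprise import adbapi

dbpool = adbapi.ConnectionPool('pymysql', host='127.0.0.1', port=3306,
                               user='your_user_name', passwd='your_password',
                               db='your_database')
d = dbpool.runQuery('SELECT 1')            # returns a Deferred right away
d.addCallback(lambda rows: print(rows))    # the result is delivered later via a callback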
2. Code implementation
First, define the database settings in settings.py; MySQL is used as the example here.
# MySQL settings
DB_NAME = 'your_database'
DB_HOST = '127.0.0.1'
DB_PORT = 3306
DB_USER = 'your_user_name'
DB_PASSWORD = 'your_password'
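For the pipeline to run at all, it must also be enabled in settings.py via ITEM_PIPELINES. A minimal sketch, where the module path 'dataextractor.pipelines' is assumed from the class name used below and should be adjusted to your project:

ITEM_PIPELINES = {
    # assumed module path; use your own project's pipelines module
    'dataextractor.pipelines.DataextractorPipeline': 300,
}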
Next, read these connection settings in pipelines.py when the pipeline is initialised, and create the connection pool.
class DataextractorPipeline:
    def __init__(self):
        settings = get_project_settings()
        self.db_host = settings.get('DB_HOST')
        self.db_port = settings.get('DB_PORT')
        self.db_user = settings.get('DB_USER')
        self.db_password = settings.get('DB_PASSWORD')
        self.db_name = settings.get('DB_NAME')
        # connection pool backed by pymysql; queries run in twisted's thread pool
        self.dbpool = adbapi.ConnectionPool(
            'pymysql', host=self.db_host, port=self.db_port,
            user=self.db_user, passwd=self.db_password, db=self.db_name
        )
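As a side note, Scrapy can also inject the settings through a from_crawler classmethod instead of calling get_project_settings(). A minimal sketch of that variant (its constructor signature differs from the code above; it is an alternative, not part of the original code):

from twisted.enterprise import adbapi

class DataextractorPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings is the same Settings object get_project_settings() returns
        return cls(crawler.settings)

    def __init__(self, settings):
        self.dbpool = adbapi.ConnectionPool(
            'pymysql', host=settings.get('DB_HOST'), port=settings.get('DB_PORT'),
            user=settings.get('DB_USER'), passwd=settings.get('DB_PASSWORD'),
            db=settings.get('DB_NAME')
        )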
The database write is then performed in process_item().
    def process_item(self, item, spider):
        # hand the insert off to the connection pool; runInteraction returns a
        # Deferred that completes in the background
        self.dbpool.runInteraction(self._do_insert, item)
        return item

    def _do_insert(self, txn, item):
        insert_sql = """
            insert into tb_news_detail(news_url, news_domain, news_module, news_author,
                                       news_title, news_publish_time, news_content)
            values (%s, %s, %s, %s, %s, %s, %s)
        """
        # txn is a transaction object (a cursor-like wrapper) used to execute SQL
        txn.execute(insert_sql, (item.get('url'), item.get('domain'), item.get('module'),
                                 item.get('author'), item.get('title'),
                                 item.get('publish_time'), item.get('content')))
The complete pipelines.py looks like this:
from scrapy.utils.project import get_project_settings
from twisted.enterprise import adbapi


# useful for handling different item types with a single interface
class DataextractorPipeline:
    def __init__(self):
        settings = get_project_settings()
        self.db_host = settings.get('DB_HOST')
        self.db_port = settings.get('DB_PORT')
        self.db_user = settings.get('DB_USER')
        self.db_password = settings.get('DB_PASSWORD')
        self.db_name = settings.get('DB_NAME')
        # connection pool backed by pymysql; queries run in twisted's thread pool
        self.dbpool = adbapi.ConnectionPool(
            'pymysql', host=self.db_host, port=self.db_port,
            user=self.db_user, passwd=self.db_password, db=self.db_name
        )

    def process_item(self, item, spider):
        # schedule the insert asynchronously and pass the item on immediately
        self.dbpool.runInteraction(self._do_insert, item)
        return item

    def _do_insert(self, txn, item):
        insert_sql = """
            insert into tb_news_detail(news_url, news_domain, news_module, news_author,
                                       news_title, news_publish_time, news_content)
            values (%s, %s, %s, %s, %s, %s, %s)
        """
        # txn is a transaction object (a cursor-like wrapper) used to execute SQL
        txn.execute(insert_sql, (item.get('url'), item.get('domain'), item.get('module'),
                                 item.get('author'), item.get('title'),
                                 item.get('publish_time'), item.get('content')))
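Because the insert runs in the background, an exception raised inside _do_insert would otherwise go unnoticed. A minimal sketch of how the pipeline could log failed inserts and release the pool when the crawl ends (handle_error and the logging choice are additions for illustration, not part of the original code):

    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._do_insert, item)
        d.addErrback(self.handle_error, item, spider)   # catch failed inserts
        return item

    def handle_error(self, failure, item, spider):
        # failure is a twisted Failure wrapping the original exception
        spider.logger.error('insert failed for %s: %s', item.get('url'), failure)

    def close_spider(self, spider):
        # shut down the adbapi connection pool when the crawl finishes
        self.dbpool.close()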
The item used above is defined in items.py; readers can adapt the fields to their own needs:
import scrapy


class NewsDetailItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # article URL
    url = scrapy.Field()
    # source site (domain)
    domain = scrapy.Field()
    # site section / module
    module = scrapy.Field()
    # article author
    author = scrapy.Field()
    # article title
    title = scrapy.Field()
    # publish time
    publish_time = scrapy.Field()
    # article body
    content = scrapy.Field()
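To tie the pieces together, a spider only needs to populate the item and yield it; the pipeline handles the write. A minimal sketch (the spider name, start URL, CSS selectors, and field values are placeholders, not taken from the original project):

import scrapy
from dataextractor.items import NewsDetailItem   # module path assumed from the project name

class NewsSpider(scrapy.Spider):
    name = 'news_detail'                          # hypothetical spider name
    start_urls = ['https://example.com/news/1']   # placeholder URL

    def parse(self, response):
        item = NewsDetailItem()
        item['url'] = response.url
        item['domain'] = 'example.com'            # placeholder values; extract as needed
        item['module'] = 'news'
        item['author'] = response.css('.author::text').get()
        item['title'] = response.css('h1::text').get()
        item['publish_time'] = response.css('.time::text').get()
        item['content'] = ' '.join(response.css('.content ::text').getall())
        yield item                                # handed to DataextractorPipeline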