Scrapy Crawler: Scraping CSDN Blogs (Part 2)

csd_ct

Category: Web Crawlers. Tags: python, scrapy, crawler

Article link: https://blog.csdn.net/csd_ct/article/details/109378880

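The previous post covered locating DOM nodes with XPath and structuring data with Items when crawling CSDN blogs. This post looks at data persistence in Scrapy, that is, saving the scraped data to a database (SQLite, MySQL, Elasticsearch, MongoDB). MySQL is used as the example here; the procedure for other databases is similar. First, install the Python driver for MySQL: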

pip install -U mysqlclient
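
One detail that is easy to trip over: the PyPI package is called mysqlclient, but the module it installs is imported as MySQLdb. A minimal sketch for checking the install from Python follows; the helper name `driver_available` is made up for this illustration and is not part of any library.

```python
import importlib.util


def driver_available(module_name: str = "MySQLdb") -> bool:
    """Return True if a top-level module can be located (it is not imported)."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # mysqlclient is imported as MySQLdb, not as mysqlclient.
    print("MySQLdb importable:", driver_available())
```

If this prints False, the pip install above most likely ran against a different Python environment than the one running the crawler.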

1. Create the pipelines.py file

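The pipelines.py file is generated automatically when the crawler project is created. As mentioned earlier, the structured items scraped in the spiders pass through the pipelines, so this is where we define our own processing. Create a MysqlPipeline class in pipelines.py with the following code: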

from datetime import datetime

import MySQLdb


class MysqlPipeline:
    '''Save the scraped items to a MySQL database.'''
    host = 'xx.xx.xx.xx'  # MySQL server IP
    port = 3306
    user = "xx"
    passwd = "xxxx"
    db = "xxxxx"
    # tb = 'OpsWeb_csdnblog' if "LINUX" in platform().upper() else 'opsweb_csdnblog'
    tb = 'OpsWeb_csdnblog'
    cur = None
    conn = None

    # Factory hook. Scrapy calls open_spider/close_spider on pipelines by itself,
    # so no signal has to be connected here.
    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    # Runs when the spider opens: connect to MySQL
    def open_spider(self, spider):
        try:
            self.conn = MySQLdb.connect(
                host=self.host, port=self.port, user=self.user, passwd=self.passwd, db=self.db, charset='utf8')
            self.cur = self.conn.cursor()
        except Exception as exc:
            spider.logger.error("error:%s", exc, exc_info=True)

    # Runs when the spider closes: mark the crawl task as finished and clean up
    def close_spider(self, spider):
        if self.conn is None:  # the connection was never opened
            return
        # customArgs is not a built-in Scrapy attribute; the spider must define it
        taskId = spider.customArgs.get('_job')
        spider.logger.info('task id is: %s', taskId)

        endTime = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        tableTask = 'OpsWeb_crawltask'
        # Values go in as query parameters; only the table name is interpolated
        sqlStr = f"UPDATE {tableTask} SET state='END',endTime=%s WHERE tId=%s"
        try:
            self.cur.execute(sqlStr, (endTime, taskId))
            self.conn.commit()  # commit before the connection is closed
        except Exception as exc:
            spider.logger.error('error: %s', exc, exc_info=True)
        finally:
            self.cur.close()
            self.conn.close()

    # Process each item: insert one row per item
    def process_item(self, item, spider):
        try:
            sqlStr = f"INSERT INTO {self.tb}({','.join(item.fields.keys())}) VALUES ({','.join([r'%s'] * len(item.fields.keys()))})"
            sqlValues = [str(item[key]) for key in item.fields.keys()]
            self.cur.execute(sqlStr, sqlValues)
            self.conn.commit()  # mysqlclient does not autocommit by default
        except Exception as exc:
            spider.logger.error("error:%s", exc, exc_info=True)
        return item  # hand the item on to any later pipeline
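
Scrapy only runs pipelines that are listed in the ITEM_PIPELINES setting, a step the class above leaves implicit. A sketch of the corresponding settings.py entry follows. The package name `csdnblog` is a placeholder assumed for illustration (use the package created by `scrapy startproject`), and 300 is an arbitrary priority; pipelines with lower numbers run first.

```python
# settings.py (fragment)
# "csdnblog" is a hypothetical project package name, not taken from the article.
ITEM_PIPELINES = {
    "csdnblog.pipelines.MysqlPipeline": 300,
}
```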

The crawled data can then be viewed in the MySQL database (see the figure below).
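
The stored rows can also be checked from code. The sketch below is not the author's pipeline: it reuses the same "one placeholder per column" INSERT technique against an in-memory SQLite database so that it runs without a MySQL server, which also illustrates the earlier remark that other databases work similarly. The columns `title` and `url` and the helper names are assumptions for the demo. Note that sqlite3 uses `?` as its placeholder, while MySQLdb uses `%s`.

```python
import sqlite3


def insert_item(conn, table, item, placeholder="?"):
    """Insert a dict-like item using one placeholder per column."""
    columns = list(item.keys())
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({', '.join([placeholder] * len(columns))})"
    )
    cur = conn.cursor()
    cur.execute(sql, [str(item[c]) for c in columns])
    conn.commit()
    cur.close()


def count_rows(conn, table):
    """Return the number of rows currently stored in the table."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    (n,) = cur.fetchone()
    cur.close()
    return n


def demo():
    # In-memory SQLite stands in for MySQL; the columns are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE OpsWeb_csdnblog (title TEXT, url TEXT)")
    insert_item(conn, "OpsWeb_csdnblog",
                {"title": "demo", "url": "https://example.com/post"})
    n = count_rows(conn, "OpsWeb_csdnblog")
    conn.close()
    return n
```

Against MySQL, the same two helpers should work with a MySQLdb connection and `placeholder="%s"`, provided the table already exists.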
