[Python] Big Data Mining Course Assignment 2: Storing Crawled Data in a Database with SQLAlchemy


RM -RF /星


Column: 一入Python深似海. Tags: python, SQLAlchemy, sqlite3, data mining, bilibili

Article link: https://blog.csdn.net/weixin_41429999/article/details/106892166


In the previous post we walked through crawling data from Bilibili in detail; now we store that data in a database.

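This post was written in 2020-06, while Bilibili was in the middle of moving from AV numbers to BV numbers. Bilibili's backend database design may change later and make the contents of this post no longer applicable, so readers should keep that in mind.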

Table definitions

For my course assignment I define four tables, representing uploaders (UP主), videos, comments, and danmaku (bullet comments).

Code that defines the tables

from sqlalchemy import create_engine, MetaData
from sqlalchemy import Table, Column, ForeignKey
from sqlalchemy import String, DateTime, Integer, Text

metadata = MetaData()

uploader = Table(
    'uploader', metadata,
    Column('uid', Integer(), primary_key=True),
    Column('name', String(255), nullable=False)
)

video = Table(
    'video', metadata,
    Column('av', Integer(), primary_key=True),
    Column('bv', String(20), nullable=False, index=True, unique=True),
    Column('comment_count', Integer(), nullable=False),
    Column('play_count', Integer(), nullable=False),
    Column('title', String(255), nullable=False),
    Column('description', Text(), nullable=False),
    Column('uploader_id', Integer(), ForeignKey(column='uploader.uid', ondelete='CASCADE'), nullable=False),
    Column('upload_time', DateTime(), nullable=False),
)

comments = Table(
    'comments', metadata,
    Column('rp_id', Integer(), primary_key=True),
    Column('video_id', Integer(), ForeignKey(column='video.av', ondelete='CASCADE'), nullable=False, index=True),
    Column('likes', Integer(), nullable=False),
    Column('root_comment', Integer(), ForeignKey(column='comments.rp_id', ondelete='CASCADE'), nullable=True),
    Column('content', String(255), nullable=False),
    Column('comment_time', DateTime(), nullable=False),
)

dm = Table(
    'dm', metadata,
    Column('id', Integer(), autoincrement=True, primary_key=True),
    Column('video_id', Integer(), ForeignKey(column='video.av', ondelete='CASCADE'), nullable=False, index=True),
    Column('content', String(255), nullable=False),
    Column('property', String(255), nullable=False),
)

if __name__ == '__main__':
    # The `encoding` argument is dropped: UTF-8 was already its default, and SQLAlchemy 2.0 removed it.
    engine = create_engine('sqlite:///../bilibili.db', echo=True)
    metadata.create_all(engine)

Fields of the uploader table

Field	Meaning
uid	The user's numeric UID
name	The user's nickname

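Fields of the video table

Field	Meaning
av	The video's AV number
bv	The video's BV number
comment_count	Total number of comments
play_count	Play count
title	Video title
description	Video description
uploader_id	The uploader's uid; foreign key
upload_time	Time the video was uploaded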

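Fields of the comments table

Field	Meaning
rp_id	The id that uniquely identifies each comment
video_id	AV number of the corresponding video; foreign key
likes	Number of likes on the comment
root_comment	If this comment is a reply under another comment, the rp_id of that comment; foreign key
content	The text of the comment
comment_time	Time the comment was posted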

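Fields of the dm table

Field	Meaning
id	Auto-increment id assigned by the database
video_id	AV number of the corresponding video; foreign key
content	The text of the danmaku
property	A string from the raw danmaku data that encodes the danmaku's various attributes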
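The foreign keys in the table definitions declare `ondelete='CASCADE'`, but SQLite only enforces foreign keys (and therefore cascades) when `PRAGMA foreign_keys=ON` is set on each connection. The following is a minimal sketch of switching it on through a SQLAlchemy connect-event listener. It uses an in-memory engine purely for illustration, and the listener name `_enable_sqlite_foreign_keys` is hypothetical, not part of the original project.

```python
from sqlalchemy import create_engine, event, text

# In-memory engine for illustration only; the post itself uses a file-based database.
engine = create_engine('sqlite:///:memory:')


@event.listens_for(engine, 'connect')
def _enable_sqlite_foreign_keys(dbapi_connection, connection_record):
    # Runs once for every new DBAPI connection opened by this engine.
    cursor = dbapi_connection.cursor()
    cursor.execute('PRAGMA foreign_keys=ON')
    cursor.close()


with engine.connect() as conn:
    # PRAGMA foreign_keys reports 1 when enforcement is active.
    fk_enabled = conn.execute(text('PRAGMA foreign_keys')).scalar()
    print(fk_enabled)
```

With enforcement on, every `root_comment` value must be either NULL or an existing `rp_id`, so a placeholder such as -1 for top-level comments would be rejected.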

Fetching the data and inserting it into the database

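Data is fetched with the code from the previous post. One thing to watch for: the comments under a video have to be fetched in many separate requests, and the data itself can change while we crawl (other users keep commenting and liking in the meantime), so the fetched data may contain duplicates. To avoid the errors that inserting duplicate rows would raise, we check before every insert. (Here I simply run a SELECT for each row about to be inserted to make sure it is not already present; readers with a deeper knowledge of databases can ignore this crude approach.)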
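As an alternative to the SELECT-before-INSERT check, SQLite can skip rows whose primary key already exists. The sketch below is not the approach used in this post: it relies on SQLite's `INSERT OR IGNORE`, emitted through SQLAlchemy's `prefix_with`, and the table `uploader_demo` and helper `insert_ignoring_duplicates` are hypothetical names introduced for illustration.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, insert, text)

metadata = MetaData()
# Stand-in table with the same shape as the uploader table.
uploader_demo = Table(
    'uploader_demo', metadata,
    Column('uid', Integer(), primary_key=True),
    Column('name', String(255), nullable=False),
)

engine = create_engine('sqlite:///:memory:')
metadata.create_all(engine)


def insert_ignoring_duplicates(connection, uid, name):
    # Compiles to "INSERT OR IGNORE INTO ..." (SQLite-specific syntax):
    # a row whose primary key already exists is silently skipped.
    stmt = insert(uploader_demo).prefix_with('OR IGNORE').values(uid=uid, name=name)
    connection.execute(stmt)


with engine.begin() as conn:
    insert_ignoring_duplicates(conn, 10330740, '观察者网')
    insert_ignoring_duplicates(conn, 10330740, '观察者网')  # duplicate, skipped
    row_count = conn.execute(text('SELECT COUNT(*) FROM uploader_demo')).scalar()
    print(row_count)
```

This only avoids duplicates; it does not refresh an existing row, so it would not replace the update branch that keeps counts such as `likes` current.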

Note that because of my assignment topic, I fix a specific list of uploaders and supply a list of tags used to filter for COVID-19-related videos.

The code is as follows:

from .CreateTable import uploader, video, comments, dm
from .CreateTable import metadata

from GetBilibiliData.GetBilibiliUploaderInfo import get_video_list_from_uploader_id
from GetBilibiliData.GetBilibiliVideoInfo import get_av_vid_comment_number_and_tags_from_bv
from GetBilibiliData.GetBilibiliVideoInfo import get_comments_and_replies_from_av_and_bv
from GetBilibiliData.GetBilibiliVideoInfo import get_dm_from_vid_and_bv

from sqlalchemy import create_engine
from sqlalchemy import insert, select, update, and_
from sqlalchemy.sql.dml import Insert, Update
from sqlalchemy.sql.selectable import Select
from sqlalchemy.engine.result import ResultProxy, RowProxy
from sqlalchemy.engine.base import Engine, Connection

import datetime


def gather_uploader_info(connection: Connection) -> None:
    """
    Insert the uploaders I need into the database.
    :param connection: a database connection; the tables (uploader, video, comments, dm) must already exist
    :return: None
    """
    up = {
        10330740: '观察者网',
        456664753: '央视新闻',
        10303206: '环球时报',
        483787858: '环球网',
        222103174: '小央视频',
        54992199: '观视频工作室',
    }

    for uid in up:
        name = up[uid]
        sel = select([uploader]).where(uploader.c.uid == uid)  # type: Select
        sel_rp = connection.execute(sel)  # type: ResultProxy
        if sel_rp.first():
            continue

        ins = insert(uploader).values(  # type: Insert
            uid=uid,
            name=name
        )
        res = connection.execute(ins)  # type: ResultProxy
        print('Inserted uploader: ' + str(res.inserted_primary_key))


def gather_video_info_for_single_uploader(connection: Connection, uid: int, required_tags: list,
                                          start_time: datetime.datetime, end_time: datetime.datetime) -> None:
    """
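    Given an uploader's UID, crawl the videos this uploader posted within a time range that carry the wanted tags, and store them.
    :param connection: a database connection; the tables must already exist
    :param uid: the uploader's UID
    :param required_tags: a video is stored only if at least one of its tags matches an entry in required_tags
    :param start_time: earliest upload time of interest
    :param end_time: latest upload time of interest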
    :return: None
    """
    def __filter_video_tags(bv: str, wanted_tags: list) -> bool:
        _, _, cnt, tags = get_av_vid_comment_number_and_tags_from_bv(bv=bv)
        if cnt == -1:
            return False
        real_tags = []
        for t in tags:  # type: dict
            real_tags.append(t['tag_name'])

        for t1 in real_tags:  # type: str
            for t2 in wanted_tags:  # type: str
                if t1.find(t2) != -1 or t2.find(t1) != -1:
                    print(real_tags)
                    return True
        return False

    res = get_video_list_from_uploader_id(uid=f'{uid}', start_time=start_time, end_time=end_time)

    for v in res:  # type: dict
        if __filter_video_tags(bv=v['bvid'], wanted_tags=required_tags):
            sel = select([video.c.av]).where(video.c.av == v['aid'])  # type: Select
            sel_rp = connection.execute(sel)  # type: ResultProxy

            if sel_rp.first():
                upd = update(video).values(  # type: Update
                    comment_count=v['comment'],
                    play_count=v['play'],
                    title=v['title'],
                    description=v['description'],
                )
                upd = upd.where(video.c.av == v['aid'])
                upd_rp = connection.execute(upd)  # type: ResultProxy
                print(upd_rp.last_updated_params())
            else:
                ins = insert(video).values(  # type: Insert
                    av=v['aid'],
                    bv=v['bvid'],
                    comment_count=v['comment'],
                    play_count=v['play'],
                    title=v['title'],
                    description=v['description'],
                    uploader_id=uid,
                    upload_time=datetime.datetime.fromtimestamp(v['created']),
                )

                ins_res = connection.execute(ins)  # type: ResultProxy
                print(ins_res.inserted_primary_key)


def gather_video_info_for_all_uploader(connection: Connection, start_time: datetime.datetime,
                                       end_time: datetime.datetime) -> None:
    """
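    For every uploader already in the database, crawl the videos they uploaded within a time range and store them.
    :param connection: a database connection; the tables must already exist
    :param start_time: start of the time range
    :param end_time: end of the time range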
    :return: None
    """
    up_sel = select([uploader.c.uid, uploader.c.name])  # type: Select
    rp = connection.execute(up_sel)  # type: ResultProxy
    required_tags = ['福奇', '肺炎', '新冠', '疫情', '病毒', '蝙蝠', 'COVID-19', 'COVID19']  # tags used to recognise COVID-19-related videos

    for r in rp:  # type: RowProxy
        print(f'Fetching the video list of {r.name}')
        gather_video_info_for_single_uploader(connection=connection, uid=r.uid, required_tags=required_tags,
                                              start_time=start_time, end_time=end_time)


def gather_comment_info_for_single_video(connection: Connection, av: int, bv: str, comment_total: int) -> None:
    """
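    Crawl and store all comments of a single video.
    :param connection: a database connection; the tables must already exist
    :param av: the video's AV number
    :param bv: the video's BV number
    :param comment_total: total number of comments on the video (used to tell whether all data has been fetched)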
    :return: None
    """

    def __insert_comment(__rp_id: int, __video_id: int, __likes: int, __root_comment: int, __content: str,
                         __comment_time: datetime.datetime) -> int:
        sel = select([comments]).where(comments.c.rp_id == __rp_id)  # type: Select
        rp = connection.execute(sel)  # type: ResultProxy
        if rp.first():
            upd = update(comments).values(  # type: Update
                likes=__likes,
            )
            upd = upd.where(comments.c.rp_id == __rp_id)  # type: Update
            connection.execute(upd)
            return __rp_id

        ins = insert(comments).values(  # type: Insert
            rp_id=__rp_id,
            video_id=__video_id,
            likes=__likes,
            root_comment=__root_comment,
            content=__content,
            comment_time=__comment_time,
        )
        rp = connection.execute(ins)  # type: ResultProxy
        return rp.inserted_primary_key[0]  # inserted_primary_key is a sequence; return the single rp_id as an int

    cts = get_comments_and_replies_from_av_and_bv(av=str(av), bv=bv, comment_total=comment_total)
    for c in cts:  # type: dict
        ins_id = __insert_comment(
            __rp_id=c['rpid'],
            __video_id=c['oid'],
            __likes=c['like'],
            __root_comment=None,  # top-level comment: store NULL, because -1 matches no comments.rp_id
            __content=c['content']['message'],
            __comment_time=datetime.datetime.fromtimestamp(c['ctime']),
        )
        print(ins_id)

        if c.get('replies'):
            for r in c['replies']:  # type: dict
                ins_id = __insert_comment(
                    __rp_id=r['rpid'],
                    __video_id=r['oid'],
                    __likes=r['like'],
                    __root_comment=r['root'],
                    __content=r['content']['message'],
                    __comment_time=datetime.datetime.fromtimestamp(r['ctime'])
                )
                print(ins_id)


def gather_comment_info_for_all_video(connection: Connection) -> None:
    """
    For every video already in the database, crawl its comments and store them.
    :param connection: a database connection; the tables must already exist
    :return: None
    """
    video_sel = select([video.c.av, video.c.bv, video.c.comment_count])  # type: Select
    video_rp = connection.execute(video_sel)  # type: ResultProxy

    for v in video_rp:  # type: RowProxy
        gather_comment_info_for_single_video(connection=connection, av=v.av, bv=v.bv, comment_total=v.comment_count)


def gather_dm_info_for_single_video(connection: Connection, av: int, bv: str) -> None:
    """
    Crawl and store the danmaku of a single video.
    :param connection: a database connection; the tables must already exist
    :param av: the video's AV number
    :param bv: the video's BV number
    :return: None
    """
    _, vid, _, _ = get_av_vid_comment_number_and_tags_from_bv(bv=bv)
    if vid == '':
        return
    dms = get_dm_from_vid_and_bv(vid=vid, bv=bv)
    for d in dms:
        text = d[0]
        prop = d[1]

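        # Scope the duplicate check to this video so identical danmaku on other videos are not skipped.
        sel = select([dm.c.content]).where(
            and_(dm.c.video_id == av, dm.c.content == text, dm.c.property == prop))  # type: Select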
        rp = connection.execute(sel)  # type: ResultProxy
        if rp.first():
            continue

        ins = insert(dm).values(  # type: Insert
            video_id=av,
            content=text,
            property=prop,
        )
        rp = connection.execute(ins)  # type: ResultProxy
        print(rp.inserted_primary_key)


def gather_dm_info_for_all_video(connection: Connection) -> None:
    """
    For every video already in the database, crawl its danmaku and store them.
    :param connection:
    :return:
    """
    video_sel = select([video.c.av, video.c.bv])  # type: Select
    video_rp = connection.execute(video_sel)  # type: ResultProxy

    for r in video_rp:  # type: RowProxy
        gather_dm_info_for_single_video(connection=connection, av=r.av, bv=r.bv)


if __name__ == '__main__':
    pass
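The tag rule inside `__filter_video_tags` treats a video as relevant when any of its tags contains a wanted tag or is contained in one. Pulled out of the crawler, the rule can be sketched as a small pure function that is easy to try on its own; the name `matches_any_tag` is hypothetical and does not appear in the original code.

```python
def matches_any_tag(video_tags, wanted_tags):
    """Return True if some video tag contains, or is contained in, some wanted tag."""
    return any(
        wanted in tag or tag in wanted
        for tag in video_tags
        for wanted in wanted_tags
    )


wanted = ['肺炎', '新冠', 'COVID-19']
hit = matches_any_tag(['新冠肺炎', '新闻'], wanted)   # '新冠' is a substring of '新冠肺炎'
miss = matches_any_tag(['美食', '旅行'], wanted)      # no substring relation in either direction
print(hit, miss)
```

Like the original loop, this also matches on an empty tag string, since the empty string is a substring of everything; filtering out empty tags first would avoid that.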

Miscellaneous notes

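While doing this assignment I used sqlite3 so that I could easily share data with my teammates (we could not return to campus during the pandemic), because it is entirely file-based. In practice, however, I found that intensive reads and writes put a heavy load on the disk; if the database file sits on a mechanical hard drive, the drive's performance may well become the bottleneck of the whole program.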
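One likely contributor to that disk load (an assumption on my part, not something measured in this post) is that each `connection.execute(...)` outside an explicit transaction is committed on its own, so SQLite commits to disk once per row. The sketch below groups many inserts into a single transaction with `engine.begin()`. The table `dm_demo` and the generated rows are made up for illustration, and an in-memory database stands in for the real file.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, insert, text)

metadata = MetaData()
# Stand-in for the dm table, without the foreign key to keep the sketch small.
dm_demo = Table(
    'dm_demo', metadata,
    Column('id', Integer(), primary_key=True, autoincrement=True),
    Column('content', String(255), nullable=False),
    Column('property', String(255), nullable=False),
)

engine = create_engine('sqlite:///:memory:')
metadata.create_all(engine)

# Made-up rows standing in for crawled danmaku.
rows = [{'content': f'danmaku {i}', 'property': f'{i}.0,1,25'} for i in range(1000)]

# engine.begin() opens one transaction and commits once when the block exits,
# so the 1000 rows cost a single commit instead of one commit per row.
with engine.begin() as conn:
    conn.execute(insert(dm_demo), rows)

with engine.connect() as conn:
    stored = conn.execute(text('SELECT COUNT(*) FROM dm_demo')).scalar()
    print(stored)
```

A natural granularity for the crawler would be one transaction per video, so that a failure mid-video rolls back only that video's rows.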

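When actually crawling, to guard against network or runtime failures, I had each run crawl only five days of data and write it to a separate db file (although in the end I crawled half a year of data without incident). This created the need to merge databases. I chose to merge them with SQLAlchemy (rather than in the sqlite command line). Along the way I learned that sqlite3 supports in-memory databases, so I kept intermediate results in an in-memory database and wrote everything to disk in one go once the merge had finished in memory, which brought some efficiency gain.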
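The merge code itself is not shown in this post. The sketch below illustrates the same idea (merge in memory, write to disk once) with Python's standard `sqlite3` module rather than SQLAlchemy, so it is a swapped-in technique and not the author's implementation. The single-table schema, the file names, and the helpers `make_part` and `merge_parts` are all hypothetical.

```python
import os
import sqlite3
import tempfile

SCHEMA = 'CREATE TABLE IF NOT EXISTS uploader (uid INTEGER PRIMARY KEY, name TEXT NOT NULL)'


def make_part(path, rows):
    """Create a small stand-in for one per-run .db file."""
    con = sqlite3.connect(path)
    con.execute(SCHEMA)
    con.executemany('INSERT INTO uploader VALUES (?, ?)', rows)
    con.commit()
    con.close()


def merge_parts(part_paths, out_path):
    """Merge every part file into an in-memory database, then write it to disk once."""
    mem = sqlite3.connect(':memory:')
    mem.execute(SCHEMA)
    for path in part_paths:
        mem.execute('ATTACH DATABASE ? AS part', (path,))
        # OR IGNORE skips rows whose primary key was already merged from an earlier part.
        mem.execute('INSERT OR IGNORE INTO main.uploader SELECT * FROM part.uploader')
        mem.commit()  # DETACH is not allowed while a transaction is open
        mem.execute('DETACH DATABASE part')
    disk = sqlite3.connect(out_path)
    mem.backup(disk)  # copy the whole in-memory database to the file in one pass
    disk.close()
    mem.close()


with tempfile.TemporaryDirectory() as tmp:
    part1 = os.path.join(tmp, 'part1.db')
    part2 = os.path.join(tmp, 'part2.db')
    make_part(part1, [(1, 'a'), (2, 'b')])
    make_part(part2, [(2, 'b'), (3, 'c')])  # uid 2 overlaps with part1
    merged = os.path.join(tmp, 'merged.db')
    merge_parts([part1, part2], merged)
    check = sqlite3.connect(merged)
    merged_rows = check.execute('SELECT uid, name FROM uploader ORDER BY uid').fetchall()
    check.close()
    print(merged_rows)
```

For the real schema the same loop would be repeated per table, inserting parent tables (uploader, video) before the tables that reference them.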
