Computer Science / Big Data Graduation Project: Design and Implementation of a Python-Based Anime Data Analysis and Visualization System

Article link: https://blog.csdn.net/m0_52163859/article/details/136464804

Design and Implementation of a Python-Based Anime Data Analysis and Visualization System

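A crawler program is designed to scrape anime data from Bilibili (哔哩哔哩). Note that the code excerpt below builds its detail-page URLs from bangumi.tv.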

The backend uses the Flask framework, the database is MySQL 8.0, and the visualizations are built with ECharts.
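
As a rough illustration of how scraped records could feed an ECharts chart, the sketch below counts records per category and builds a bar-chart option dictionary. The function name `build_bar_option`, the sample records, and the choice of the `类型` (type) field are assumptions made for this example, not the project's actual code. In a Flask view, a dictionary like this could be returned with `jsonify` and handed to `setOption` on the front end.

```python
from collections import Counter


def build_bar_option(records, key='类型', title='Anime count by type'):
    """Build an ECharts bar-chart option dict from a list of record dicts.

    Hypothetical helper, for illustration only.
    """
    # Count how many records share each value of `key`, skipping empty values
    counts = Counter(r[key] for r in records if r.get(key))
    ordered = counts.most_common()  # most frequent category first
    return {
        'title': {'text': title},
        'tooltip': {},
        'xAxis': {'type': 'category', 'data': [label for label, _ in ordered]},
        'yAxis': {'type': 'value'},
        'series': [{'type': 'bar', 'data': [n for _, n in ordered]}],
    }


# Made-up sample records that only carry the field being counted
sample = [{'类型': 'TV'}, {'类型': 'TV'}, {'类型': 'OVA'}]
option = build_bar_option(sample)
print(option['xAxis']['data'], option['series'][0]['data'])
```

Keeping the aggregation in plain Python means it can be checked without a running database or web server.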

Part of the crawler code is shown below:

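import json
import time

import requests
from bs4 import BeautifulSoup

# base_url (a URL template with one `{}` placeholder for the page number)
# and total_page are defined elsewhere and are not part of this excerpt.

# Buffer of serialized anime records waiting to be written
all_anime_infos = []
# Output file: one JSON object per line
file_writer = open('动漫信息1.json', 'w', encoding='utf8')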

# range() stops before total_page; use total_page + 1 if total_page is the last page number
for page in range(1, total_page):
    print('Fetching page {}'.format(page))
    url = base_url.format(page)
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Accept-Encoding': 'gzip, deflate, compress',
        'Accept-Language': 'en-us;q=0.5,en;q=0.3',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Referer': url
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf8'
    soup = BeautifulSoup(response.text, 'lxml')

    item_ul = soup.find(name='ul', attrs={'id': 'browserItemList'})
    # Treat a page without the list (e.g. an error page) as having no items
    items = item_ul.find_all(name='li') if item_ul else []

    for item in items:
        try:
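            # Cover image (the code assumes a protocol-relative src and prepends the scheme)
            img = item.find('img')['src']
            img = 'https:' + img

            # Title
            name = item.find('h3').a.text

            # Rank: keep the last space-separated token of the rank label
            rank = item.find('span', attrs={'class': 'rank'}).text
            rank = rank.split(' ')[-1]

            # Episode count, air date, director, etc., separated by '/'
            info = item.find('p', attrs={'class': 'info tip'}).text
            info = info.strip().replace(' ', '').split('/')

            # Episode count: drop the trailing unit character
            hua_count = info[0][:-1]
            # Air date
            date = info[1]
            peoples = info[2:] if len(info) > 2 else []
            # Director ('未知' means 'unknown')
            daoyan = peoples[0] if len(peoples) > 0 else '未知'
            # Script writers
            jiaoben = peoples[1:] if len(peoples) > 1 else []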

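            # Rating block (score and number of ratings)
            score = item.find('p', attrs={'class': 'rateInfo'})
            # Number of ratings: drop the leading character and everything from '人' onward
            score_count = score.find('span', attrs={'class': 'tip_j'}).text[1:].split('人')[0]
            # Score
            score = score.find('small', attrs={'class': 'fade'}).text

            # Link to the anime's detail page
            dm_url = 'https://bangumi.tv' + item.find('h3').a['href']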

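            # Fetch the detail page; a separate variable avoids overwriting the list-page soup
            resp = requests.get(dm_url, headers=headers)
            resp.encoding = 'utf8'
            detail_soup = BeautifulSoup(resp.text, 'lxml')
            header = detail_soup.find('div', attrs={'id': 'headerSubject'})
            # Type: text of the <small> tag in the page header
            leixing = header.small.text

            # Voice actors: collect the link texts from every div with class 'info'
            juese = detail_soup.find_all('div', attrs={'class': 'info'})
            cv_shengyou = []
            for js in juese:
                js = js.find_all('a')
                cv_shengyou.extend([j.text.strip() for j in js])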

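            # Keys (kept in Chinese): cover, name, type, rank, episodes, air date,
            # director, voice actors, script writers, score, number of ratings
            anime_info = {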
                '封面': img,
                '名称': name,
                '类型': leixing,
                '排名': int(rank),
                '话数': int(hua_count),
                '放送时间': date,
                '导演': daoyan,
                '声优': cv_shengyou,
                '脚本': jiaoben,
                '评分': float(score),
                '评分人数': int(score_count)
            }
            line_str = json.dumps(anime_info, ensure_ascii=False)
            print(line_str)
            all_anime_infos.append(line_str + '\n')

            if len(all_anime_infos) % 10 == 0:
                file_writer.writelines(all_anime_infos)
                all_anime_infos.clear()
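        except Exception as exc:
            # Skip items that fail to parse, but report why instead of hiding the error
            print('Skipped item: {}'.format(exc))

    time.sleep(1)

# Write any records still in the buffer, then close the file
file_writer.writelines(all_anime_infos)
file_writer.close()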
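
The field-splitting step inside the loop can also be pulled out into a small pure function, which makes it possible to check the parsing without any network access. The sketch below mirrors the splitting logic of the excerpt. The function name `parse_info` and the sample string are made up for illustration and do not come from the project or from real page data.

```python
def parse_info(text):
    """Split an 'episodes / date / staff...' string into named fields.

    Hypothetical helper that mirrors the splitting done in the crawler loop.
    """
    parts = text.strip().replace(' ', '').split('/')
    # The first field looks like '12话'; drop the trailing unit character
    episodes = int(parts[0][:-1])
    date = parts[1] if len(parts) > 1 else ''
    staff = parts[2:]
    # '未知' ("unknown") is the same fallback value the crawler uses
    director = staff[0] if staff else '未知'
    writers = staff[1:]
    return {'话数': episodes, '放送时间': date, '导演': director, '脚本': writers}


# Made-up sample string, not taken from a real page
sample = '12话 / 2020年1月1日 / 导演A / 脚本B'
result = parse_info(sample)
print(result)
```

One limitation carried over from the original logic: a date written with slashes (for example `2020/1/1`) would be split into several fields, so a stricter parser would be needed for such input.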