Data Collection and Analysis with Python: A Case Study of Platform Comments

Article link: https://blog.csdn.net/flk_9090/article/details/140990522

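Preface

In data science and machine learning, collecting and processing data is a critical step. This article works through an example of collecting and analyzing data with Python. It is intended for technical study and for building data-processing skills; do not use it for any illegal purpose.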

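Environment Setup

First, we set up the required environment and libraries. This article uses the requests library to send HTTP requests and a custom helper module, bag, to read and write the JSON data.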

#!/usr/bin/env python3
# coding:utf-8
import bag
import time
import os
import requests
import ssl
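
The bag module imported above is not shown in this article. As a rough stand-in, and assuming that bag.Bag.read_json and bag.Bag.save_json simply load and dump JSON files, the two helpers could be sketched with the standard json module. The function names and behavior below are assumptions for illustration, not the actual bag API.

```python
import json
import os
import tempfile


def read_json(path):
    """Load and return the JSON document stored at `path`."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_json(data, path):
    """Write `data` to `path` as UTF-8 JSON, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Round-trip a small list through a temporary file.
    demo_path = os.path.join(tempfile.mkdtemp(), "data.json")
    save_json([{"content": "hello"}], demo_path)
    print(read_json(demo_path))
```

Passing ensure_ascii=False keeps Chinese comment text human-readable in the saved file instead of escaping it as \uXXXX sequences.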

Main Program

The main program calls the data collection and processing functions and passes in the required parameters.

def main():
    note_id = '650ad2a6000000001e03d032'  # example note ID
    get_content(note_id)

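Simulating HTTP Requests

The custom function spider sends an HTTP GET request with browser-like headers and returns the response parsed from JSON.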

def spider(url):
    headers = {
        "Cookie": "your_cookie_here",   # fill in your own cookies
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }
    # The timeout keeps the call from hanging; raise_for_status surfaces HTTP errors
    response = requests.get(url=url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.json()
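
A single request fails outright on a dropped connection or a temporary server error. One way to make spider more tolerant is to wrap it in a small retry helper. The sketch below is hypothetical: fetch_with_retry, its parameters, and the fake flaky_fetch function are not part of the original code, and the stand-in fetcher exists only so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying up to `retries` times on any exception."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # real code would catch requests.RequestException
            last_error = exc
            if attempt < retries:
                time.sleep(delay * attempt)  # wait longer after each failure
    raise last_error


# Stand-in for spider(): fails twice, then returns a response-like dict.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"data": {"comments": [], "has_more": False}}


result = fetch_with_retry(flaky_fetch, "https://example.com/api/data", delay=0)
print(result["data"]["has_more"])
```

With the real function, the call would look like fetch_with_retry(spider, url), leaving spider itself unchanged.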

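Data Processing Function

The get_content function pages through the comments, restructures each comment and its replies, and saves the results to a local file.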

def get_content(note_id):
    cursor = ''
    page = 0
    save_path = './data.json'
    # Resume from an existing file, otherwise start with an empty list
    if os.path.exists(save_path):
        dit = bag.Bag.read_json(save_path)
    else:
        dit = []
        bag.Bag.save_json(dit, save_path)
    while True:
        time.sleep(1)  # pause between requests
        url = f"https://example.com/api/data?note_id={note_id}&cursor={cursor}"
        resp = spider(url)
        content_list = resp['data']['comments']
        for index in content_list:
            dic = {
                'comment': {'content': index['content'].strip(),
                            'like_count': index['like_count'],
                            'created': get_time(index['create_time']),
                            'nickname': index['user_info']['nickname'].strip(),
                            'avatar_url': index['user_info']['image'],
                            'user_id': index['user_info']['user_id'], },
                'replies': {}
            }
            # Replies are optional; collect them only when present
            judge = index.get('sub_comments')
            if judge:
                sub_list = []
                for sub_content in judge:
                    sub_list.append({'content': sub_content['content'].strip(),
                                     'like_count': sub_content['like_count'],
                                     'created': get_time(sub_content['create_time']),
                                     'nickname': sub_content['user_info']['nickname'].strip(),
                                     'avatar_url': sub_content['user_info']['image'],
                                     'user_id': sub_content['user_info']['user_id'], })
                dic['replies'] = sub_list
            dit.append(dic)
        # Save once per page instead of once per comment
        bag.Bag.save_json(dit, save_path)
        if not resp['data']['has_more']:
            break
        cursor = resp['data']['cursor']
        print(cursor)
        page = page + 1
        print(f'Page {page} saved:--------------------------------------')
        if page > 2:   # for debugging only; comment out these two lines for a normal run
            break
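
The loop in get_content is an instance of cursor-based pagination: each response carries a cursor value for the next request and a has_more flag that ends the loop. The pattern can be isolated as a generator. This is a minimal sketch under assumptions: iter_pages, fake_fetch, and the sample pages are made up for illustration, and the response shape mirrors the data.comments, data.cursor, and data.has_more fields that get_content reads.

```python
from urllib.parse import urlencode


def iter_pages(fetch, note_id, max_pages=None):
    """Yield the comment list of each page until has_more is false."""
    cursor = ""
    page = 0
    while max_pages is None or page < max_pages:
        # urlencode escapes the query parameters safely
        query = urlencode({"note_id": note_id, "cursor": cursor})
        data = fetch(f"https://example.com/api/data?{query}")["data"]
        yield data["comments"]
        page += 1
        if not data["has_more"]:
            break
        cursor = data["cursor"]


# Fake two-page API keyed by cursor, so the sketch runs offline.
PAGES = {
    "": {"comments": ["a", "b"], "has_more": True, "cursor": "c1"},
    "c1": {"comments": ["c"], "has_more": False, "cursor": ""},
}


def fake_fetch(url):
    cursor = url.rsplit("cursor=", 1)[1]
    return {"data": PAGES[cursor]}


all_comments = [c for page in iter_pages(fake_fetch, "demo") for c in page]
print(all_comments)
```

Separating the paging logic from parsing and saving makes each part easier to test, and the max_pages argument plays the role of the debugging limit in the original loop.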

Time Format Conversion

The custom get_time function converts a millisecond timestamp into a readable date string.

def get_time(ctime):
    # ctime is in milliseconds; convert to seconds before formatting
    time_array = time.localtime(int(ctime / 1000))
    return time.strftime("%Y.%m.%d", time_array)
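
Because time.localtime uses the time zone of the machine running the script, the same timestamp can map to different dates on different hosts. When reproducible output matters, the conversion can pin a zone explicitly. The sketch below assumes, as get_time does, that the timestamp is in milliseconds; format_ms is a hypothetical name, and UTC is chosen only for illustration.

```python
from datetime import datetime, timezone


def format_ms(ms, tz=timezone.utc):
    """Format a millisecond Unix timestamp as YYYY.MM.DD in the given zone."""
    return datetime.fromtimestamp(ms / 1000, tz=tz).strftime("%Y.%m.%d")


print(format_ms(0))              # 1970.01.01
print(format_ms(1695168000000))  # 2023.09.20
```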