Scraping Sina Weibo comments with Python and saving them to Excel

mshine0

Last modified 2024-04-11 10:33:05


Categories: Python crawlers, Security. Tags: python, Sina Weibo, excel, crawler

First published 2024-04-10 21:29:12

Article link: https://blog.csdn.net/u013021184/article/details/137611013


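Note: this post is intended only for learning about web crawlers. If it violates any rules or infringes on anyone's interests, please contact me and I will remove the content immediately.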

1. Analyzing the data

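I opened an arbitrary Weibo post to look at its comments; the post picked for this experiment is https://weibo.com/1275238313/O4uNhyk3F#comment.
Open the developer tools (F12), expand the comment section, copy the text of the first comment, and search for it in the captured network traffic. This locates the request that returns the comment data, which can then be analyzed.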
The analysis shows that it is a plain GET request with no encrypted data. Start by writing out the params; all of the values in it are fixed.

params = {
    'is_reload': '1',
    'id': '5010716878180259',
    'is_show_bulletin': '2',
    'is_mix': '0',
    'count': '10',
    'uid': '1275238313',
    'fetch_level': '0',
    'locale': 'zh-CN',
}
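The post does not say where the `id` and `uid` values come from. As a rough sketch: the `uid` is the first path segment of the post URL, and the numeric `id` can apparently be recovered from the short code `O4uNhyk3F` using a base-62 convention that is commonly described for Weibo links. That convention is an assumption and is not stated in the original article, although it does reproduce the `id` shown above. The helper names `short_code_to_id` and `params_from_post_url` are made up for this illustration.

```python
from urllib.parse import urlencode, urlparse

BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def short_code_to_id(code: str) -> str:
    """Decode a short code such as 'O4uNhyk3F' into a numeric id string.

    Assumed convention (not from the article): split the code into
    4-character groups starting from the right, read each group as a
    base-62 number, zero-pad every group except the leading one to
    7 digits, and concatenate the results.
    """
    groups = []
    end = len(code)
    while end > 0:
        start = max(end - 4, 0)
        value = 0
        for ch in code[start:end]:
            value = value * 62 + BASE62.index(ch)
        # Only the leading group keeps its natural width.
        groups.append(str(value) if start == 0 else str(value).zfill(7))
        end = start
    return "".join(reversed(groups))


def params_from_post_url(post_url: str) -> dict:
    """Build the request params from a URL shaped like
    https://weibo.com/<uid>/<short code>."""
    uid, code = urlparse(post_url).path.strip("/").split("/")[:2]
    return {
        "is_reload": "1",
        "id": short_code_to_id(code),
        "is_show_bulletin": "2",
        "is_mix": "0",
        "count": "10",
        "uid": uid,
        "fetch_level": "0",
        "locale": "zh-CN",
    }


example_params = params_from_post_url(
    "https://weibo.com/1275238313/O4uNhyk3F#comment"
)
# Because this is a plain GET request, the params simply become the query string.
query_url = "https://weibo.com/ajax/statuses/buildComments?" + urlencode(example_params)
print(example_params["id"], example_params["uid"])
print(query_url)
```

If the derived `id` ever disagrees with what the browser's network panel shows, trust the value from the captured request.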

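This experiment mainly captures the following fields for storage: the comment text (text), the IP location (source), and the commenter's nickname (screen_name).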

2. Fetching the first page of comments

Fetch the first page of comments and save it to Excel.

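def one_data(url):
    # Request the first page of comments
    res = requests.get(url=url, params=params, headers=headers)
    result = res.json()  # parse the JSON body once and reuse it
    contents = result['data']
    text = [i.get('text') for i in contents]
    source = [i.get('source') for i in contents]
    screen_name = [i.get('user').get('screen_name') for i in contents]
    # Columns: 评论内容 = comment text, IP = source, 昵称 = nickname
    df_one = pd.DataFrame(
        {'评论内容': text,
         'IP': source,
         '昵称': screen_name
         }
    )
    # Writing .xlsx with pandas needs an Excel engine such as openpyxl
    df_one.to_excel('新浪.xlsx', index=False)
    # max_id is the cursor that the next request has to send
    return result['max_id']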
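To make the field extraction easier to follow without sending a request, here is a small offline sketch. The `sample_payload` below is hand-written and hypothetical; it only mimics the shape that the code above relies on (`data`, `text`, `source`, `user.screen_name`, `max_id`). The helper name `extract_rows` is also an invention for this example, and unlike the function above it tolerates a comment whose `user` is missing.

```python
def extract_rows(payload: dict) -> list:
    """Turn one decoded JSON response into a list of row dicts."""
    rows = []
    for comment in payload.get("data") or []:
        # Fall back to an empty dict so a missing user does not raise.
        user = comment.get("user") or {}
        rows.append(
            {
                "评论内容": comment.get("text"),   # comment text
                "IP": comment.get("source"),       # source / location string
                "昵称": user.get("screen_name"),   # nickname
            }
        )
    return rows


# Hypothetical, hand-written sample; real responses carry many more fields.
sample_payload = {
    "data": [
        {"text": "first comment", "source": "location A", "user": {"screen_name": "alice"}},
        {"text": "second comment", "source": "location B", "user": None},
    ],
    "max_id": 123456,
}

rows = extract_rows(sample_payload)
print(rows)
```

A list of row dicts like this can be passed straight to `pd.DataFrame(rows)` before calling `to_excel`.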

3. Fetching the "load more" content

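Clicking "load more" and locating the corresponding request shows that it is still a GET request; the only difference is that its params contain one extra entry, max_id.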
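Searching for max_id and comparing requests shows that each response contains a max_id, and that this value is sent in the params of the next request (the "load more" request). So after every request we take the max_id out of the returned JSON.
We can therefore write a loop that keeps sending requests and exits once the data field in the returned JSON is [].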

def all_data(url):
    # Fetch and save the first page, and keep its cursor for the next request
    max_id = one_data(url)
    while True:
        params['max_id'] = max_id
        res = requests.get(url=url, params=params, headers=headers)
        result = res.json()  # parse the JSON body once and reuse it
        contents = result['data']
        # Stop as soon as a page comes back empty, before touching the Excel file
        if not contents:
            break
        max_id = result['max_id']
        text = [i.get('text') for i in contents]
        source = [i.get('source') for i in contents]
        screen_name = [i.get('user').get('screen_name') for i in contents]
        df_all = pd.DataFrame(
            {'评论内容': text,
             'IP': source,
             '昵称': screen_name
             }
        )
        # Read what has been saved so far, append the new page, and write it back
        df = pd.read_excel('新浪.xlsx')
        # pd.concat is the public API; DataFrame._append is a private method
        append_df = pd.concat([df, df_all], ignore_index=True)
        append_df.to_excel('新浪.xlsx', index=False)
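The cursor logic can also be isolated from the network and from Excel, which makes it easy to try out. The following is only a sketch of the same idea, not the author's code: `iter_comments`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the fake pages stand in for real responses.

```python
def iter_comments(fetch_page, first_params):
    """Yield every comment dict, following max_id from page to page.

    fetch_page(params) must return the decoded JSON of one response.
    """
    params = dict(first_params)  # copy so the caller's dict is not mutated
    while True:
        payload = fetch_page(params)
        comments = payload.get("data") or []
        if not comments:
            break  # an empty data list marks the end, as described above
        yield from comments
        # The cursor from this response goes into the next request.
        params["max_id"] = payload.get("max_id")


# Hypothetical in-memory pages keyed by the max_id the request carries.
FAKE_PAGES = {
    None: {"data": [{"text": "a"}, {"text": "b"}], "max_id": 11},
    11: {"data": [{"text": "c"}], "max_id": 22},
    22: {"data": [], "max_id": 0},
}


def fake_fetch(params):
    return FAKE_PAGES[params.get("max_id")]


texts = [c["text"] for c in iter_comments(fake_fetch, {"count": "10"})]
print(texts)  # ['a', 'b', 'c']
```

With the real endpoint, `fetch_page` could be something like `lambda p: requests.get(weibo_url, params=p, headers=headers).json()`. Collecting all comments first and calling `to_excel` once would also avoid re-reading the spreadsheet on every page.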

4. Complete code

The complete code is as follows:

import pandas as pd
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Cookie': 'your cookie'
}

params = {
    'is_reload': '1',
    'id': '5010716878180259',
    'is_show_bulletin': '2',
    'is_mix': '0',
    'count': '10',
    'uid': '1275238313',
    'fetch_level': '0',
    'locale': 'zh-CN',
}
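# Fetch the first page of data and write it to Excel
def one_data(url):
    res = requests.get(url=url, params=params, headers=headers)
    result = res.json()  # parse the JSON body once and reuse it
    contents = result['data']
    text = [i.get('text') for i in contents]
    source = [i.get('source') for i in contents]
    screen_name = [i.get('user').get('screen_name') for i in contents]
    # Columns: 评论内容 = comment text, IP = source, 昵称 = nickname
    df_one = pd.DataFrame(
        {'评论内容': text,
         'IP': source,
         '昵称': screen_name
         }
    )
    # Writing .xlsx with pandas needs an Excel engine such as openpyxl
    df_one.to_excel('新浪.xlsx', index=False)
    # max_id is the cursor that the next request has to send
    return result['max_id']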


# Fetch the "load more" pages and append them to the Excel file
def all_data(url):
    max_id = one_data(url)
    while True:
        params['max_id'] = max_id
        res = requests.get(url=url, params=params, headers=headers)
        result = res.json()  # parse the JSON body once and reuse it
        contents = result['data']
        # Stop as soon as a page comes back empty, before touching the Excel file
        if not contents:
            break
        max_id = result['max_id']
        text = [i.get('text') for i in contents]
        source = [i.get('source') for i in contents]
        screen_name = [i.get('user').get('screen_name') for i in contents]
        df_all = pd.DataFrame(
            {'评论内容': text,
             'IP': source,
             '昵称': screen_name
             }
        )
        # Read what has been saved so far, append the new page, and write it back
        df = pd.read_excel('新浪.xlsx')
        # pd.concat is the public API; DataFrame._append is a private method
        append_df = pd.concat([df, df_all], ignore_index=True)
        append_df.to_excel('新浪.xlsx', index=False)


weibo_url = 'https://weibo.com/ajax/statuses/buildComments'
all_data(weibo_url)