pandas: Weibo Data Analysis

Dataset description: Use of this dataset in publications must be acknowledged by referencing the following publication:
King-wa Fu, CH Chan, Michael Chau. Assessing Censorship on Microblogs in China: Discriminatory
Keyword Analysis and Impact Evaluation of the 'Real Name Registration' Policy. IEEE Internet
Computing. 2013; 17(3): 42-50. http://doi.ieeecomputersociety.org/10.1109/MIC.2013.28

| Attribute | Description | Author's gloss (translated from Chinese; may be inaccurate, for reference only) |
|---|---|---|
| mid | Unique pseudo message ID | Unique pseudo message ID |
| retweeted_status_mid | Pseudo message ID of the original message (only available if the row of interest is a retweet) | Pseudo message ID of the original message (depends on retweet status) |
| uid | Pseudo user ID | Pseudo user ID |
| retweeted_uid | Pseudo user ID of the original poster (only available if the row of interest is a retweet) | Pseudo user ID of the original poster |
| source | The application name of the client program | Name of the client program |
| image | Has image? (1 = yes, 0 = no) | Whether the message has an image |
| text | Body of the message. Any address handle (@xxxx:) is replaced by either the pseudo user ID or ukn (unknown) | Message text |
| geo | GIS information. Please refer to the Sina Weibo API documentation: http://goo.gl/Um8SS | GIS information |
| created_at | Original posting time | Posting time |
| deleted_last_seen | The last time the message was seen before it went missing from the user timeline | Last time the message was seen |
| permission_denied | The 'permission denied' status is marked when the message was found missing from the timeline and the API return message was 'permission denied' | Flag set after the message disappears from the timeline |
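
As a rough illustration of how these fields fit together, the sketch below builds a tiny frame with a few of the columns listed above and derives two flags from them. All values are invented, and it assumes (following the descriptions, not checked against the real files) that retweeted_status_mid is empty for original posts and deleted_last_seen is empty for messages that never went missing. The names is_retweet, is_deleted and visible_for are hypothetical helper columns, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows; every value here is invented for illustration.
sample = pd.DataFrame({
    "mid": ["m1", "m2", "m3"],
    "retweeted_status_mid": [None, "m1", None],
    "uid": ["u1", "u2", "u3"],
    "created_at": ["2012-01-03 10:00:00", "2012-01-03 11:30:00", "2012-01-04 08:15:00"],
    "deleted_last_seen": [None, None, "2012-01-04 09:00:00"],
})

# Parse the two time columns; missing values become NaT.
for col in ("created_at", "deleted_last_seen"):
    sample[col] = pd.to_datetime(sample[col])

# A row is treated as a retweet when it points at an original message (assumption).
sample["is_retweet"] = sample["retweeted_status_mid"].notna()
# A row is treated as deleted when a last-seen time was recorded (assumption).
sample["is_deleted"] = sample["deleted_last_seen"].notna()
# Time between posting and the last sighting; NaT for messages never seen missing.
sample["visible_for"] = sample["deleted_last_seen"] - sample["created_at"]

print(sample[["mid", "is_retweet", "is_deleted", "visible_for"]])
```

On the real files the same two notna() checks would apply directly to the loaded frame, provided missing fields are read as NaN.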

The code below analyzes the data:

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 12 18:51:28 2019

@author: try
"""

import pandas as pd
import matplotlib.pyplot as plt
# import re


# Load the data
user_data = pd.read_csv(r'userdata.csv')
# on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines=False
week_data = pd.read_csv(r'week1.csv', encoding='utf-8', on_bad_lines='skip')
V_count = user_data['verified'].value_counts()  # number of verified vs. unverified users
plt.figure(1)
V_count.plot(kind='bar')  # bar chart comparing verified and unverified users

# Count posts per user and retweets received per original poster
uid_count = week_data['uid'].value_counts()
retweeted_uid_count = week_data['retweeted_uid'].value_counts()
# mid_data = mid_count.loc[mid_count > 10]

# Extract topics (the text between a pair of '#') with a regular expression.
# str.extract returns only the first match in each message.
data_text = week_data['text']
out_text = data_text.str.extract(r'#(.*?)#', expand=True)
out_text1 = out_text.dropna()  # drop messages without a topic
out_text1 = out_text1.reset_index(drop=True)  # rebuild the index
out_text1.columns = ['text']

# Count topics and keep those that appear more than 1000 times
out_text_count = out_text1['text'].value_counts()
out_text_count = out_text_count.loc[out_text_count > 1000]
plt.figure(2)
out_text_count.plot(kind='bar', width=0.8)

# Distribution of retweets received per original poster, in bins of 100
t_data = week_data['retweeted_uid'].value_counts()
bin_t = range(0, 3000, 100)
# sort_index orders the bars by interval rather than by count
count_t = pd.cut(t_data, bin_t).value_counts().sort_index()
plt.figure(3)
plt.title('Distribution of retweets received')
count_t.plot(kind='bar')

plt.show()  # needed when the script is run outside an interactive console
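
The script above uses str.extract, which keeps only the first topic in each message. If a message can carry several topics, str.extractall may be a better fit. The sketch below compares the two on a few invented messages; the texts and the variable names first_only and all_topics are made up for illustration, and the regular expression is the same one used in the script.

```python
import pandas as pd

# Invented messages for illustration; the second one carries two topics.
texts = pd.Series([
    "#topicA# hello",
    "#topicA# and also #topicB#",
    "no topic here",
])

# str.extract: at most one topic per message (the first match).
first_only = texts.str.extract(r"#(.*?)#", expand=False).dropna()

# str.extractall: one row per match, indexed by (message, match number).
# Column 0 holds the single capture group.
all_topics = texts.str.extractall(r"#(.*?)#")[0]

first_counts = first_only.value_counts()
all_counts = all_topics.value_counts()

print(first_counts)
print(all_counts)
```

With extractall, topicB is counted even though it is the second topic in its message, so topic frequencies computed this way can be higher than those from the script above.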

Hot topic extraction results (figure omitted):
