Dataset description: Use of this dataset in publications must be acknowledged by referencing the following publication:
King-wa Fu, CH Chan, Michael Chau. Assessing Censorship on Microblogs in China: Discriminatory
Keyword Analysis and Impact Evaluation of the 'Real Name Registration' Policy. IEEE Internet
Computing. 2013; 17(3): 42-50. http://doi.ieeecomputersociety.org/10.1109/MIC.2013.28
Attribute | English description | Chinese gloss, rendered in English (may be imprecise) |
mid | Unique pseudo message ID | Unique pseudo message ID |
retweeted_status_mid | Pseudo message ID of the original | Pseudo message ID of the original message |
uid | Pseudo user ID | Pseudo user ID |
retweeted_uid | Pseudo user ID of the original poster (Only | Pseudo user ID of the original poster |
source | The application name of the client | Name of the client application |
image | With image? (1= Yes, 0=No) | Whether the message has an image |
text | Body of the message. Any address handle | Message text |
geo | GIS information. Please refer to the Sina | GIS information |
created_at | Original posting time | Posting time |
deleted_last_seen | The last seen time before this message | Last displayed time |
permission_denied | 'permission denied' status is marked when | Flag set after the message disappears from the timeline |
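To make the field list above more concrete, here is a minimal sketch of loading data with this column layout into pandas. The two rows are invented purely for illustration (they are not taken from the real dataset), the timestamp format is an assumption, and treating a non-empty deleted_last_seen as "the message later disappeared" is also an assumption based on the truncated description above.

```python
import io
import pandas as pd

# Synthetic two-row sample that only mimics the column layout described above;
# every value here is invented and does not come from the real dataset.
sample_csv = io.StringIO(
    "mid,retweeted_status_mid,uid,retweeted_uid,source,image,text,geo,"
    "created_at,deleted_last_seen,permission_denied\n"
    "m1,,u1,,web,0,hello #topic#,,2012-01-03 10:00:00,,\n"
    "m2,m1,u2,u1,mobile,1,re: hello,,2012-01-03 11:30:00,2012-01-04 09:00:00,\n"
)

df = pd.read_csv(
    sample_csv,
    # Keep the pseudo IDs as strings so they are never parsed as numbers.
    dtype={"mid": str, "retweeted_status_mid": str, "uid": str, "retweeted_uid": str},
    parse_dates=["created_at", "deleted_last_seen"],
)

# A message is a retweet when it carries the ID of an original message.
df["is_retweet"] = df["retweeted_status_mid"].notna()
# Assumption: a non-empty deleted_last_seen means the message later disappeared.
df["was_deleted"] = df["deleted_last_seen"].notna()

print(df[["mid", "image", "is_retweet", "was_deleted"]])
```

Reading the ID columns as strings is a defensive choice; it avoids surprises if an ID happens to look numeric.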
The code below analyzes the data:
# -*- coding: utf-8 -*-
"""
Created on Sat Oct 12 18:51:28 2019
@author: try
"""
import pandas as pd
import matplotlib.pyplot as plt
#import re
# Read the data
user_data = pd.read_csv(r'userdata.csv')
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
week_data = pd.read_csv(r'week1.csv', encoding='utf-8', on_bad_lines='skip')
V_count = user_data['verified'].value_counts()  # count verified vs. unverified users
plt.figure(1)
V_count.plot(kind='bar')  # bar chart comparing verified and unverified user counts
# Count posts per user and how many times each user was retweeted
uid_count=week_data['uid'].value_counts()
retweeted_uid_count=week_data['retweeted_uid'].value_counts()
#mid_data=mid_count.loc[mid_count>10]
# Extract topics (text between a pair of '#') from the message text with a regular expression
data_text = week_data['text']
# str.extract returns only the first match in each message
out_text = data_text.str.extract(r'#(.*?)#', expand=True)
out_text1 = out_text.dropna()  # drop messages that contain no topic
out_text1 = out_text1.reset_index(drop=True)  # reset the index
out_text1.columns = ['text']  # name the extracted column
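# Count topics
out_text_count = out_text1['text'].value_counts()  # number of messages per topic
out_text_count = out_text_count.loc[out_text_count > 1000]  # keep topics with more than 1000 messages
plt.figure(2)
# Note: the topic labels are Chinese, so matplotlib needs a CJK-capable font to render them
out_text_count.plot(kind='bar', width=0.8)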
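# Bin the per-user retweeted counts into intervals
t_data = week_data['retweeted_uid'].value_counts()
Se_t = pd.Series(t_data)
bin_t = list(range(0, 3000, 100))  # interval edges 0, 100, ..., 2900
# sort_index() keeps the intervals in ascending order instead of ordering them by frequency
count_t = pd.cut(Se_t, bin_t).value_counts().sort_index()
plt.figure(3)
plt.title('Distribution of retweeted counts')
count_t.plot(kind='bar')
plt.show()  # needed when running as a plain script rather than in an IDE with inline plots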
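As the comment in the script notes, str.extract keeps only the first topic in each message. If every topic should be counted, one possible variant (a sketch, not the approach used for the results in this post) is str.findall followed by explode. The three messages below are invented stand-ins for week_data['text'].

```python
import pandas as pd

# Tiny synthetic stand-in for week_data['text']; the messages are invented.
text = pd.Series(["#a# first #b# second", "no topic here", "#a# again"])

# str.findall returns, per message, a list with every capture-group match.
all_topics = text.str.findall(r'#(.*?)#')

# explode() yields one row per topic; messages without topics become NaN and are dropped.
topic_counts = all_topics.explode().dropna().value_counts()
print(topic_counts)
```

With this variant a message that mentions two topics contributes to both counts, so the totals can be higher than with the first-match approach.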
Hot topic extraction results: