He Analyzed Fifty Pages of Weibo Data to See What Everyone Was Saying About This Year's Double Eleven

实验楼v

Original link: https://www.shiyanlou.com/louplus/dm

In my day, today was just Singles' Day; now it has become a nationwide shopping festival...

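To find out what people online are saying about Double Eleven, the author of this article searched Weibo for “双十一” (Double Eleven), collected the first 50 pages of results, and analyzed the hot topics and the factors that influence them. To keep collection simple, the search used Weibo's default “综合” (Comprehensive) tab.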

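(This project is by 灵汐, a student in cohort 9 of the 实验楼 (shiyanlou) course “楼 + 之数据分析与挖掘实战” (Lou+ Data Analysis and Mining in Practice).)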

Data Collection

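Enter “双十一” on the Weibo search page, right-click and choose “Inspect”, open the “Network” tab, and switch to the mobile view to find the API URL. Inspection shows that the last parameter of the URL is the page number, so 50 pages of Weibo data can be fetched by incrementing it. (Note: search results change over time; this report is based on the results returned at 10:30 PM on October 23.)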
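The URL pattern described above can be sketched as follows. The helper name `build_search_url` is hypothetical; the endpoint and parameter layout are copied from the request URL used in this article, and the sketch assumes the `containerid` value is simply the percent-encoded string `100103type=1&q=<keyword>&t=0`.

```python
from urllib.parse import quote

# Endpoint taken from the request URL observed in the browser's Network tab.
BASE_URL = "https://m.weibo.cn/api/container/getIndex"


def build_search_url(keyword, page):
    """Build the search API URL for one result page (hypothetical helper)."""
    # Assumption: containerid is this string, percent-encoded as a whole.
    containerid = "100103type=1&q={}&t=0".format(keyword)
    return "{}?containerid={}&page_type=searchall&page={}".format(
        BASE_URL, quote(containerid, safe=""), page
    )


if __name__ == "__main__":
    for page in (1, 2, 50):
        print(build_search_url("双十一", page))
```

Only the trailing `page` value changes between requests, which is why a simple loop over page numbers is enough to collect all 50 pages.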

import json
import os
import time

import pandas as pd
import requests
from tqdm.auto import tqdm

PAGE_SUM = 50
FILE_NAME = 'top'

if not os.path.exists(FILE_NAME):
    os.mkdir(FILE_NAME)

# API endpoint for the search results; it returns JSON.
# Default page: weibo_url = "https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E5%8F%8C%E5%8D%81%E4%B8%80&page_type=searchall"
# Download each page.
for page in tqdm(range(PAGE_SUM)):
    # Each page is addressed by appending page + 1 to the URL
    url = "https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E5%8F%8C%E5%8D%81%E4%B8%80%26t%3D0&page_type=searchall&page="+str(page+1)
    json_file_path = os.path.join(FILE_NAME, '{}.json'.format(page+1))
    if os.path.exists(json_file_path):  # already downloaded
        continue
    while True:  # keep retrying until the request succeeds
        try:
            response = requests.get(url, timeout=5)
        except requests.exceptions.Timeout:
            time.sleep(1)
            continue  # no response yet, so try again
        if response.status_code == 200:
            break
        time.sleep(1)  # pause before retrying a non-200 response
    with open(json_file_path, 'w', encoding='utf-8') as f:  # save to a local file
        # indent pretty-prints the JSON; ensure_ascii=False keeps Chinese text readable
        f.write(json.dumps(json.loads(response.content.decode('utf-8')),
                           indent=4, ensure_ascii=False))
weibos = []
for page in tqdm(range(PAGE_SUM)):
    json_file_path = os.path.join(FILE_NAME, '{}.json'.format(page+1))
    with open(json_file_path, encoding='utf-8') as f:
        weibos += json.load(f)['data']['cards']  # collect the cards from every page

with open('top.json', 'w', encoding='utf-8') as f:  # write the merged file
    f.write(json.dumps({'weibos': weibos}, indent=4, ensure_ascii=False))
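The download loop retries forever, which can hang if the API starts refusing requests. A minimal sketch of a bounded retry is shown below. The helper `fetch_with_retry` and its parameters are hypothetical and not part of the original project; the `get` argument is any callable that takes a URL and returns an object with a `status_code` attribute, so the retry logic can be exercised without network access.

```python
import time


def fetch_with_retry(get, url, max_attempts=5, delay=1.0):
    """Call get(url) until it returns HTTP 200 or the attempts run out.

    Hypothetical helper: `get` is injected so the logic can be tested offline.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            response = get(url)
            if response.status_code == 200:
                return response
            last_error = RuntimeError(
                "unexpected HTTP status {}".format(response.status_code)
            )
        except Exception as exc:  # e.g. a timeout raised by the HTTP client
            last_error = exc
        if attempt + 1 < max_attempts:
            time.sleep(delay)  # wait before the next attempt
    raise last_error
```

With requests, one possible call is `fetch_with_retry(lambda u: requests.get(u, timeout=5), url)`.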

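Inspection of the JSON files shows that an entry with card_type == 9 is a Weibo post. The following fields are extracted from those entries and stored in a CSV file.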

created_at = []  # posting time
blog_id = []  # post id
text = []  # post content
source = []  # client used to post
user_id = []  # user id
user_screen_name = []  # user name
user_statuses_count = []  # number of posts by the user
user_gender = []  # user gender
user_followers_count = []  # number of followers
user_follow_count = []  # number of accounts the user follows
reposts_count = []  # number of reposts
comments_count = []  # number of comments
attitudes_count = []  # number of likes
for blog in weibos:
    if blog['card_type'] == 9:  # only cards of type 9 are posts
        created_at.append(blog['mblog']['created_at'])
        blog_id.append(blog['mblog']['id'])
        text.append(blog['mblog']['text'])
        source.append(blog['mblog']['source'])
        user_id.append(blog['mblog']['user']['id'])
        user_screen_name.append(blog['mblog']['user']['screen_name'])
        user_statuses_count.append(blog['mblog']['user']['statuses_count'])
        user_gender.append(blog['mblog']['user']['gender'])
        user_followers_count.append(blog['mblog']['user']['followers_count'])
        user_follow_count.append(blog['mblog']['user']['follow_count'])
        reposts_count.append(blog['mblog']['reposts_count'])
        comments_count.append(blog['mblog']['comments_count'])
        attitudes_count.append(blog['mblog']['attitudes_count'])
df = pd.DataFrame({'created_at':created_at, 'blog_id': blog_id, 'text':text, 'source':source,
               'user_id': user_id, 'user_screen_name':user_screen_name, 'user_statuses_count': user_statuses_count, 'user_gender': user_gender,
               'user_followers_count': user_followers_count, 'user_follow_count':user_follow_count,
               'reposts_count': reposts_count, 'comments_count': comments_count, 'attitudes_count':attitudes_count})

# index=False avoids an extra unnamed index column when the file is read back
df.to_csv('top.csv', index=False)
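The same extraction can be sketched more compactly by building one dictionary per post. The helper `flatten_cards` and the sample card below are hypothetical illustrations; the field names are the ones used in the code above, and the sample values are made up.

```python
def flatten_cards(cards):
    """Return one flat dict per Weibo post (cards with card_type == 9)."""
    rows = []
    for card in cards:
        if card.get('card_type') != 9:  # skip non-post cards
            continue
        mblog = card['mblog']
        user = mblog['user']
        rows.append({
            'created_at': mblog['created_at'],
            'blog_id': mblog['id'],
            'text': mblog['text'],
            'source': mblog['source'],
            'user_id': user['id'],
            'user_screen_name': user['screen_name'],
            'user_statuses_count': user['statuses_count'],
            'user_gender': user['gender'],
            'user_followers_count': user['followers_count'],
            'user_follow_count': user['follow_count'],
            'reposts_count': mblog['reposts_count'],
            'comments_count': mblog['comments_count'],
            'attitudes_count': mblog['attitudes_count'],
        })
    return rows


# Made-up sample data with the same shape as the cards used in this article.
sample_cards = [
    {'card_type': 11},  # not a post, so it is skipped
    {'card_type': 9, 'mblog': {
        'created_at': '5分钟前', 'id': '1', 'text': '双十一预售', 'source': 'iPhone客户端',
        'reposts_count': 3, 'comments_count': 1, 'attitudes_count': 7,
        'user': {'id': 42, 'screen_name': 'demo', 'statuses_count': 10,
                 'gender': 'f', 'followers_count': 100, 'follow_count': 50},
    }},
]

rows = flatten_cards(sample_cards)
print(len(rows), rows[0]['user_gender'])
```

A list of dicts like this can be passed directly to `pd.DataFrame(rows)`, which keeps the column names and the extraction logic in one place.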

Looking at the data, the created_at, text, and source columns need tidying, so the next step is data cleaning.

df.head()

df = pd.read_csv('top.csv')

Data Cleaning

Cleaning the text column

This part cleans the text column, counts the frequency of each word, and draws a word cloud to show what users are talking about.

Load the stopword list:

!wget -nc "http://labfile.oss.aliyuncs.com/courses/1176/stopwords.txt"
def load_stopwords(file_path):
    # Load the stopword list, one word per line
    with open(file_path, 'r', encoding='utf-8') as f:
        stopwords = [line.strip('\n') for line in f.readlines()]
    return stopwords

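Extract the Chinese characters, segment them with jieba, and drop single-character tokens and stopwords. The remaining words from all posts are collected in text_all, and the per-post word lists in text_list are written back into df.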

import re
import jieba
stopwords = load_stopwords('stopwords.txt')

def text_clean(string):
    # Segment the Chinese content of one post into words
    result = []
    for p in str(string).split(' '):
        # Keep only the Chinese characters of each fragment
        p = ''.join(re.findall('[\u4e00-\u9fa5]', p))
        # Segment the fragment with jieba (accurate mode)
        for x in jieba.cut(p, cut_all=False):
            # Skip single-character tokens and stopwords
            if len(x) <= 1 or x in stopwords:
                continue
            result.append(x)
    return result

text_all = []   # every word from every post
text_list = []  # one word list per post
for raw_text in df['text']:
    words = text_clean(raw_text)  # clean each post only once
    text_all.extend(words)
    text_list.append(words)
df['text'] = text_list
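The two filtering steps around the jieba call can be looked at in isolation. The sketch below leaves out segmentation itself so that it runs without jieba; the helper names `keep_chinese` and `filter_tokens` and the sample inputs are hypothetical.

```python
import re

# Same character range as used in text_clean above.
CHINESE_CHARS = re.compile('[\u4e00-\u9fa5]')


def keep_chinese(text):
    """Drop everything except Chinese characters (HTML tags, digits, Latin text)."""
    return ''.join(CHINESE_CHARS.findall(text))


def filter_tokens(tokens, stopwords):
    """Drop single-character tokens and stopwords from a segmented word list."""
    stopwords = set(stopwords)  # set membership is faster than a list scan
    return [t for t in tokens if len(t) > 1 and t not in stopwords]


if __name__ == "__main__":
    print(keep_chinese('<a href="x">双十一</a> sale 预售!'))
    print(filter_tokens(['双十', '一', '真的', '预售'], ['真的']))
```

Converting the stopword list to a set is a small design choice that matters once every token of every post is checked against it.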

Load a Chinese font:

!wget -nc "http://labfile.oss.aliyuncs.com/courses/1176/fonts.zip"
!unzip -o fonts.zip

Count the frequency of each word in word_dict and draw the word cloud.

from wordcloud import WordCloud
from matplotlib import pyplot as plt
%matplotlib inline

font_path = 'fonts/SourceHanSerifK-Light.otf'
wc = WordCloud(font_path=font_path, background_color="white", max_words=1000,
               max_font_size=100, random_state=42, width=800, height=600, margin=2)
word_dict= {}
for word in text_all:
    if word not in word_dict:
        word_dict[word] = 1
    else:
        word_dict[word] += 1
wc.generate_from_frequencies(word_dict)


plt.figure(figsize=(8, 6))
plt.imshow(wc, interpolation="bilinear")  # display the image
plt.axis("off")

Output:

(-0.5, 799.5, 599.5, -0.5)

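“双十” is by far the most frequent token, because every post comes from a search for “双十一” (presumably the segmenter splits off the final “一”, which is then dropped as a single character). Beyond that, words such as “购物” (shopping), “购物车” (shopping cart), and “预售” (pre-sale) are frequent, which suggests people are busy preparing for Double Eleven. Mood words such as “允悲” (a Weibo emoticon), “开心” (happy), “嘻嘻” (hee-hee), “微笑” (smile), and “啊啊啊” (aaah) also appear, suggesting that people are preparing with enthusiasm. Some names show up too, such as “林彦俊” and “李佳琦”. 李佳琦 lives up to his reputation as the top livestream host; Double Eleven would not be complete without him.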

Pick the 20 most frequent words, to study how they relate to posting time.

word_dict = sorted(word_dict.items(), key = lambda x:x[1], reverse=True)
word_dict[:20]

Output:

[('双十', 551),
 ('真的', 70),
 ('允悲', 65),
 ('东西', 63),
 ('全文', 53),
 ('微博', 49),
 ('视频', 48),
 ('便宜', 35),
 ('预售', 34),
 ('超话', 30),
 ('活动', 28),
 ('购物', 27),
 ('喜欢', 27),
 ('嘻嘻', 26),
 ('攻略', 24),
 ('发现', 23),
 ('推荐', 22),
 ('啊啊啊', 22),
 ('哈哈哈', 21),
 ('感觉', 20)]
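The manual counting loop and the sort above can also be sketched with `collections.Counter` from the standard library, which does both in one step. The helper name `top_words` and the sample word list are hypothetical.

```python
from collections import Counter


def top_words(words, n):
    """Return the n most frequent words as (word, count) pairs, highest first."""
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = ['双十', '预售', '双十', '真的', '双十', '预售']
    print(top_words(sample, 2))
```

A `Counter` is also a dict of frequencies, so it can be passed directly to `wc.generate_from_frequencies`.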

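Tidy the posting time. In the created_at column, the Chinese units “分钟” (minutes), “小时” (hours), and “天” (days) roughly order posts from most to least recent. Values made up only of digits are dates, meaning the post is several days old.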

created_at_list = []
for created_at in df['created_at']:
    created_at = ''.join(re.findall('[\u4e00-\u9fa5]', created_at))
    if len(created_at) < 1:
        created_at_list.append('很久前')
    else:
        created_at_list.append(created_at)
df['created_at'] = created_at_list
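The bucketing rule above can be written as a small function and checked on a few representative values. The helper name `bucket_created_at` and the sample timestamps are hypothetical; the bucket labels are the same Chinese strings used in the code above.

```python
import re


def bucket_created_at(created_at):
    """Reduce a created_at string to its Chinese unit, e.g. '5分钟前' to '分钟前'.

    Strings with no Chinese characters (plain dates) fall into '很久前' (long ago).
    """
    chinese = ''.join(re.findall('[\u4e00-\u9fa5]', str(created_at)))
    return chinese if chinese else '很久前'


if __name__ == "__main__":
    for value in ['刚刚', '5分钟前', '2小时前', '昨天 21:05', '10-20']:
        print(value, '->', bucket_created_at(value))
```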

Normalize the client source column by keyword, collecting the results in source_list.

source_list = []
system = []
for source in df['source']:
    source = str(source)
    if 'nova' in source or 'HUAWEI' in source or '华为' in source or '荣耀' in source:
        source_list.append('HUAWEI')
        system.append('Android')
    elif 'iPhone' in source or 'iPad' in source:
        source_list.append('iPhone')
        system.append('IOS')
    elif 'OPPO' in source:
        source_list.append('OPPO')
        system.append('Android')
    elif 'vivo' in source:
        source_list.append('vivo')
        system.append('Android')
    elif 'OnePlus' in source:
        source_list.append('OnePlus')
        system.append('Android')
    elif 'Redmi' in source or '红米' in source or '小米' in source:
        source_list.append('XIAOMI')
        system.append('Android')
    elif '魅族' in source:
        source_list.append('MEIZU')
        system.append('Android')
    elif '联想' in source:
        source_list.append('Lenovo')
        system.append('Android')
    elif 'Samsung' in source or '三星' in source:
        source_list.append('Samsung')
        system.append('Android')
    elif '浏览器' in source or '微博' in source or '网页' in source:
        source_list.append('Web')
        system.append('Web')
    elif 'Android' in source:
        source_list.append('Not Known')
        system.append('Android')
    else:
        source_list.append('Not Known')
        system.append('Not Known')
df['source'] = source_list
df['system'] = system
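The long if/elif chain can alternatively be expressed as a rule table that is scanned in order. This is a sketch of the same keyword rules, not the author's code; the names `SOURCE_RULES` and `classify_source` are hypothetical, and the output labels are kept identical to those above.

```python
# Each rule: (keywords, brand label, system label). Order matters: first match wins.
SOURCE_RULES = [
    (('nova', 'HUAWEI', '华为', '荣耀'), 'HUAWEI', 'Android'),
    (('iPhone', 'iPad'), 'iPhone', 'IOS'),
    (('OPPO',), 'OPPO', 'Android'),
    (('vivo',), 'vivo', 'Android'),
    (('OnePlus',), 'OnePlus', 'Android'),
    (('Redmi', '红米', '小米'), 'XIAOMI', 'Android'),
    (('魅族',), 'MEIZU', 'Android'),
    (('联想',), 'Lenovo', 'Android'),
    (('Samsung', '三星'), 'Samsung', 'Android'),
    (('浏览器', '微博', '网页'), 'Web', 'Web'),
    (('Android',), 'Not Known', 'Android'),
]


def classify_source(source):
    """Map a raw client string to a (brand, system) pair."""
    source = str(source)  # missing values become the string 'nan'
    for keywords, brand, system in SOURCE_RULES:
        if any(keyword in source for keyword in keywords):
            return brand, system
    return 'Not Known', 'Not Known'


if __name__ == "__main__":
    for raw in ['iPhone 11 Pro Max', 'HUAWEI Mate 30 Pro', '微博 weibo.com']:
        print(raw, '->', classify_source(raw))
```

Adding a new brand then means adding one line to the table rather than another branch.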

Drop the columns that carry no useful information: the post id, user id, and user name.

df.drop(['blog_id', 'user_id', 'user_screen_name'], axis=1, inplace=True)

df.head()

Data Analysis

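The 20 most frequent words are defined as hot words. The number of hot words in each post is counted and stored in the hot_word column. The analysis then looks at how the hot-word count relates to posting time and to reposts, comments, and likes, and at the information about the users found by the search, including client, post count, gender, follower count, and following count.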

Take the 20 most frequent words (excluding the outlier “双十”) and count how many of them appear in each post's text.

# Top-20 words without the first entry ('双十'), which leaves 19 hot words
hot_word_list = [w for w, n in word_dict[1:20]]
hot_word_count_all = []
for text in df['text']:
    # Count how many words of this post are hot words
    hot_word_count = 0
    for word in text:
        if word in hot_word_list:
            hot_word_count += 1
    hot_word_count_all.append(hot_word_count)
df['hot_word'] = hot_word_count_all

df.head()

from collections import Counter
hot_word_c = Counter(df['hot_word'])
hot_word_count = []
# Posts with zero hot words are left out of the chart.
# Note: range() stops before max(), so the largest count is not included either.
for i in range(1, df['hot_word'].max()):
    hot_word_count.append(hot_word_c[i])
print(hot_word_count)
plt.pie(x=hot_word_count, labels=[str(i) for i in range(1, df['hot_word'].max())])
plt.show()

Output:

[128, 73, 48, 36, 18, 4, 0, 0, 0, 0]

Among the posts that contain hot words, most mention only one or two, which suggests that people usually focus on one or two topics.
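The per-post count and its distribution can be sketched without pandas as follows. The helper names and the sample posts are hypothetical. Unlike the `range(1, max)` loop above, this version includes the largest count and skips counts that never occur.

```python
from collections import Counter


def count_hot_words(tokens, hot_words):
    """Count how many tokens of one post are hot words (repeats are counted)."""
    hot = set(hot_words)
    return sum(1 for token in tokens if token in hot)


def hot_word_distribution(counts):
    """Return (hot-word count, number of posts) pairs for counts above zero."""
    tally = Counter(n for n in counts if n > 0)
    return [(n, tally[n]) for n in sorted(tally)]


# Made-up segmented posts for illustration.
posts = [['真的', '便宜', '东西'], ['天气'], ['预售', '预售'], ['真的']]
hot_words = ['真的', '便宜', '预售']

counts = [count_hot_words(post, hot_words) for post in posts]
print(counts)
print(hot_word_distribution(counts))
```

The pairs returned by `hot_word_distribution` can be unzipped into the `labels` and `x` arguments of `plt.pie`.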

Count hot-word occurrences by posting time.

import matplotlib.font_manager
# Total hot-word count per time bucket (sum only the hot_word column)
df_text = df.groupby(by='created_at')['hot_word'].sum()
# Order the buckets from oldest to newest:
# long ago, yesterday, hours ago, minutes ago, just now
df_text = df_text.reindex(index=['很久前', '昨天', '小时前', '分钟前', '刚刚'])
fig = plt.figure(figsize=(16, 9))
myfont = matplotlib.font_manager.FontProperties(fname="fonts/SourceHanSerifK-Light.otf")
plt.plot(df_text, label='hot_word')
plt.xticks(fontproperties=myfont)
plt.legend(prop=myfont)
plt.show()

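Because there are few “刚刚” (just now) posts, that point is close to zero. Apart from that, the more recent the time bucket, the more hot words it contains, which suggests that newer posts carry more heat.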
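The plotted value is a sum, so it depends on how many posts fall into each bucket as well as on how many hot words each post contains. The sketch below, which is not part of the original analysis, reports both the total and the mean per post for each bucket. The helper name `summarize_by_bucket` and the sample rows are hypothetical.

```python
from collections import defaultdict

# Buckets ordered from oldest to newest, using the labels from the cleaning step.
BUCKET_ORDER = ['很久前', '昨天', '小时前', '分钟前', '刚刚']


def summarize_by_bucket(rows, order=BUCKET_ORDER):
    """rows: iterable of (bucket, hot_word_count) pairs.

    Returns (bucket, total, mean per post) triples in the given order;
    the mean is None for a bucket with no posts.
    """
    totals = defaultdict(int)
    posts = defaultdict(int)
    for bucket, hot_count in rows:
        totals[bucket] += hot_count
        posts[bucket] += 1
    summary = []
    for bucket in order:
        mean = totals[bucket] / posts[bucket] if posts[bucket] else None
        summary.append((bucket, totals[bucket], mean))
    return summary


if __name__ == "__main__":
    sample_rows = [('小时前', 2), ('小时前', 4), ('刚刚', 1)]
    for line in summarize_by_bucket(sample_rows):
        print(line)
```

Comparing the mean column across buckets separates "more posts" from "more hot words per post".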

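As for how hot words relate to reposts, comments, and likes: time was limited, so only one plot is drawn here, and the code below plots repost count against the author's follower count.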

fig = plt.figure(figsize=(16, 9))
ax = fig.add_subplot(111)
ax.set_xscale("log")
ax.set_yscale("log")
df1 = df[['user_followers_count','reposts_count']]
df1 = df1.sort_values(by='user_followers_count',ascending= False)  
ax.plot(df1['user_followers_count'], df1['reposts_count'])
plt.xlabel('user_followers_count')
plt.ylabel('reposts_count')

Output:

Text(0, 0.5, 'reposts_count')

Although the plot is somewhat messy, it still shows that the more followers a user has, the more reposts their posts get.
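One way to check this impression numerically rather than by eye is a rank correlation between follower count and repost count. The sketch below is not part of the original analysis; it implements Spearman's rank correlation with the standard library only, and the function names and sample numbers are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    followers = [10, 100, 1000, 10000]  # made-up values
    reposts = [0, 3, 5, 40]
    print(spearman(followers, reposts))
```

A value near 1 would support the reading of the plot; applied to the real data it could be called as `spearman(list(df['user_followers_count']), list(df['reposts_count']))`.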

Hot words by client

# Total hot-word count per client (sum only the hot_word column)
df_text_source = df.groupby(by='source')['hot_word'].sum()
plt.pie(df_text_source, labels=df_text_source.index, autopct='%1.2f%%')
plt.show()

iPhone users appear to post hot content the most.

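Summary: This experiment collected the Weibo posts returned by a search for “双十一” under Weibo's default “综合” (Comprehensive) tab, taking the first 50 pages, in order to analyze the hot topics people discuss and the factors that influence them. Hot words were extracted and drawn as a word cloud. The relation between hot-word count and posting time was analyzed: the newer the post, the higher the heat. Most posts mention hot words only once or twice, which suggests that people usually focus on one or two topics. The more followers a user has, the more reposts their posts get. The clients that post the most hot content were also identified. Because time was limited, the experiment ends here.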

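Reflections: Acquiring the data is the hardest part; scraping often fails, and for some topics I could not find any way to get the data. This course leans toward data acquisition and analysis. In the data mining project exercises, however, there is relatively little practice on modeling, classification, and regression. Since the course title is Data Analysis and Mining in Practice, I hope more experiments and challenges can be added in that area. I also suggest adding more challenges, which could hint at material not covered in the lessons. Finally, thanks to the staff for their company and help!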

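“楼 + 数据分析与挖掘实战” (Lou+ Data Analysis and Mining in Practice) is a 实验楼 (shiyanlou) course designed around the requirements of junior data analysis and data mining engineer positions. It includes 35 experiments, 20 challenges, 5 comprehensive projects, and 1 capstone project, and gets you started in data analysis and mining in 6 weeks.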

Many other excellent student projects can be viewed here:

https://github.com/shiyanlou/louplus-dm/tree/master/Assignments

For details of the “楼 + 数据分析与挖掘实战” course, add the 实验楼 assistant on WeChat (sylmm003) to ask questions or request a discount.
