Over a million sales in one hour: Python shows just how popular Jay Chou's "Mojito" really is!

CDA Data Analyst

Article link: https://blog.csdn.net/yoggieCDA/article/details/106784415

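[Intro] Today we talk about Jay Chou's new song "Mojito". For the Python part, jump to Section 3. Show me the data: let the numbers speak.

At 0:00 on June 12, Jay Chou's latest single "Mojito" officially went live. For his fans, the day felt like a holiday: it had been half a year since his previous single, and his last album, 《周杰伦的床边故事》, came out four years earlier.

Named after a traditional Cuban cocktail, "Mojito" opens with an intro steeped in Cuban atmosphere and Latin rhythm, and the whole song conveys the romantic mood of falling in love. The striking rap section in particular had listeners exclaiming that their youth was back!

As soon as it went live, "Mojito" took the internet by storm: more than 3 million people pre-ordered it and sales passed 1 million within an hour, which directly caused QQ Music to crash.

Today we use data to take a thorough look at Jay Chou's new song "Mojito".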

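1. Douban data

We first collected and analyzed the Douban Music data for "Mojito". So far a little over 23,000 people have rated the song on Douban, and its score is 6.9.

Data source: Douban Music short reviews

https://music.douban.com/subject/35093585/comments/

Sample size: 498 reviews

This is some way below the scores of Jay Chou's earlier music, but compared with his two most recent releases, 《我是如此相信》 and 《说好不哭》, which score 6.3 and 5.9 on Douban, "Mojito" does reasonably well.

Looking at the rating breakdown, 3 stars is the most common rating at 39.02%, followed by 4 stars at 21.84% and 5 stars at 16.49%.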

We then group ratings of 1-2 stars as negative and ratings of 4-5 stars as positive.
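The grouping rule can be sketched in a few lines of pandas. This is only an illustration: the sample reviews, the column names `stars` and `comment`, and the "neutral" label for 3-star ratings are assumptions and do not come from the original dataset.

```python
import pandas as pd


def label_sentiment(stars):
    """Map a 1-5 star rating to a sentiment label."""
    if stars <= 2:
        return 'negative'
    if stars >= 4:
        return 'positive'
    # Assumption: the article does not name the 3-star group; call it neutral
    return 'neutral'


# Hypothetical sample of short reviews
reviews = pd.DataFrame({
    'stars': [1, 2, 3, 4, 5, 3],
    'comment': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Label each review, then compute the share of each group
reviews['sentiment'] = reviews['stars'].apply(label_sentiment)
share = reviews['sentiment'].value_counts(normalize=True)
print(share)
```

The comments in each group can then be fed to a word cloud separately.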

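Comparing the word clouds of the negative and positive reviews, we can see:

In the negative reviews, the most frequent words are "disappointed" (失望), "unpleasant to listen to" (难听), and "arrangement" (编曲). In the positive reviews, most people say the song "sounds good" (好听), has a "summer" (有夏天) feel, and that they "like" (喜欢) it. Interestingly, both positive and negative reviews mention that the exotic-sounding "Mojito" readily brings to mind Jay Chou's earlier song 《迷迭香》.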

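2. Weibo data

Next we analyzed the comments on the "Mojito" MV posted on Weibo by 周杰伦中文网.

Weibo post: the "Mojito" MV posted by 周杰伦中文网JayCn

https://weibo.com/1165631310/J6cxJ67HC?filter=hot&root_comment_id=0&type=comment

Dataset size (after deduplication):

Comment records: 9,976
Fan (user) records: 9,107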

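By gender, the commenters are overwhelmingly female, at 78.82%.

By age, the post-90s generation dominates, accounting for 74.91%.

So what are people saying in the Weibo comments?

Most people say the song "sounds good" (好听), that they "like" (喜欢) it, and that it feels "very summery" (很有夏天). They have it on "single-track repeat" (单曲循环) and find it especially "addictive" (上头). The classic Jay Chou-style rap is also the soul of the song: one listen and it is unmistakably his!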

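3. Scraping QQ Music comments with Python

Finally, we turn to the QQ Music data:

QQ Music comments.

https://y.qq.com/n/yqq/album/0009C3rp3Kfwg0.html

Sample size: 20245

Conclusions first

The real-time comment trend shows that the number of comments peaked at 0:00 on the 12th, the hour the song was released, and then gradually fell off.

Commenters again say the song "sounds good" (好听) and that their "youth is back" (青春回来了), and fans rushed to buy it. With over a million copies sold in the first hour and the QQ Music app crashing for a while, Jay Chou's influence is clearly not to be underestimated.

Now for the concrete steps

We used Python to collect QQ Music comments, Douban short reviews, and Weibo comments, and analyzed each of them. Here we show the QQ Music comment analysis, following the usual analysis workflow:

Data acquisition
Data processing
Data visualization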

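01 Data acquisition

First open QQ Music and search for Mojito. Scroll to the comment section, open Chrome's Inspect tool, switch to the Network tab, and click through the comment pages to capture the requests. It is easy to see that the comment data is returned as JSON, as shown in the figure below:

Switch to the Headers tab and find the request URL. After trimming and testing it, we get the URL for requesting comment data:

https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg?biztype=2&topid=12924001&cmd=8&pagenum=0&pagesize=25

The pagenum parameter is the page number; looping over it retrieves all the data. The code is as follows: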

# Import packages
import pandas as pd
import time
import requests
import json
from fake_useragent import UserAgent


def get_qq_comment(page_num):
    # Collect one DataFrame per page
    frames = []

    for page in range(page_num):
        # Print progress
        print('Fetching page {}'.format(page))

        # Build the URL (note the & between cmd=8 and pagenum)
        url = 'https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg?biztype=2&topid=12924001&cmd=8&pagenum={}&pagesize=25'.format(page)

        # Add headers
        headers = {
            'user-agent': UserAgent().random
        }

        # Send the request
        try:
            r = requests.get(url, headers=headers, timeout=10)
        except Exception as e:
            print(e)
            continue

        # Parse the JSON response
        json_data = json.loads(r.text)

        # Extract the comment list
        comment_list = json_data['comment']['commentlist']

        # Nickname
        nick_name = [item.get('nick') for item in comment_list]
        # Comment text
        content = [item.get('rootcommentcontent') for item in comment_list]
        # Comment time
        comment_time = [item.get('time') for item in comment_list]
        # Number of likes
        praise_num = [item.get('praisenum') for item in comment_list]

        # Store this page's data
        df = pd.DataFrame({
            'nick_name': nick_name,
            'content': content,
            'comment_time': comment_time,
            'praise_num': praise_num
        })

        # Keep this page (DataFrame.append was removed in pandas 2.0)
        frames.append(df)

        # Sleep for one second
        time.sleep(1)

    # Combine all pages into one DataFrame
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Run the function
df = get_qq_comment(page_num=912)

The program above retrieved 22,217 comments as of June 13. The dataset looks like this:

df.head()
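The field extraction inside get_qq_comment can also be tried offline, without calling the API. The sketch below pulls the same four keys (`nick`, `rootcommentcontent`, `time`, `praisenum`) out of one decoded page. The helper name `parse_comment_page` and the sample payload are made up for illustration, and the handling of a missing or null `commentlist` is an assumption about how the endpoint might respond past the last page.

```python
def parse_comment_page(json_data):
    """Turn one decoded response page into a list of row dicts."""
    # Assumption: treat a missing or null comment list as an empty page
    comment_list = (json_data.get('comment') or {}).get('commentlist') or []
    return [
        {
            'nick_name': item.get('nick'),
            'content': item.get('rootcommentcontent'),
            'comment_time': item.get('time'),
            'praise_num': item.get('praisenum'),
        }
        for item in comment_list
    ]


# Hypothetical payload shaped like the JSON seen in the Network tab
sample_page = {
    'comment': {
        'commentlist': [
            {'nick': 'user_a', 'rootcommentcontent': 'great song',
             'time': 1591891200, 'praisenum': 12},
            {'nick': 'user_b', 'rootcommentcontent': 'on repeat',
             'time': 1591891260, 'praisenum': 3},
        ]
    }
}

rows = parse_comment_page(sample_page)
empty = parse_comment_page({'comment': {'commentlist': None}})
print(rows[0])
print(empty)
```

Rows collected this way from every page can be passed to `pd.DataFrame(rows)` in one step.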

02 Loading and cleaning the data

Load the dataset and clean the collected data.

# Import the required packages
import re
import time

import jieba
import pandas as pd
import stylecloud
from pyecharts.charts import Pie, Bar, Map, Line, WordCloud, Page
from pyecharts import options as opts
from pyecharts.globals import SymbolType, WarningType
WarningType.ShowWarning = False

# Load the data
df = pd.read_excel('../data/QQ音乐评论数据6.13.xlsx')

# Check for duplicates and missing values
print(df.duplicated().sum())
print(df.isnull().sum())

# Convert a Unix timestamp (seconds) to a local-time string
def transform_time(time_second):
    time_array = time.localtime(time_second)
    otherStyleTime = time.strftime('%Y-%m-%d %H:%M:%S', time_array)
    return otherStyleTime

# Process the time column
df['comment_time'] = df['comment_time'].apply(transform_time)

# Initial cleanup of content: strip [em]...[/em] emoji tags
pattern = re.compile(r'\[em\](.*?)\[/em\]')
df['content'] = df.content.str.replace(pattern, '', regex=True)
df.head()
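To see what the two cleaning steps do, here is a small standalone sketch using only the standard library. The sample comment and the helper names `clean_content` and `format_timestamp` are made up for illustration. Unlike the `time.localtime` call above, it pins the timezone to UTC+8 so the result does not depend on the machine running it; that choice is an assumption, not something stated in the original.

```python
import re
from datetime import datetime, timedelta, timezone

# Same pattern as above: matches QQ Music emoji tags such as [em]e400846[/em]
EM_TAG = re.compile(r'\[em\](.*?)\[/em\]')

# Assumption: report comment times in China Standard Time (UTC+8)
CST = timezone(timedelta(hours=8))


def clean_content(text):
    """Remove [em]...[/em] emoji tags from a comment."""
    return EM_TAG.sub('', text)


def format_timestamp(seconds):
    """Format a Unix timestamp (seconds) as 'YYYY-MM-DD HH:MM:SS' in UTC+8."""
    return datetime.fromtimestamp(seconds, tz=CST).strftime('%Y-%m-%d %H:%M:%S')


cleaned = clean_content('Love it[em]e400846[/em]!')
stamp = format_timestamp(1591891200)
print(cleaned)
print(stamp)
```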

03 Data visualization

Trend of "Mojito" comments over time

# Number of comments per hour
comment_num = df.comment_time.str.split(':').str[0].value_counts().sort_index()
comment_num[:5] 
2020-06-12 00    12673
2020-06-12 01     1185
2020-06-12 02      364
2020-06-12 03      146
2020-06-12 04       80
Name: comment_time, dtype: int64

# Prepare the data
x_line1 = [i.replace('2020-','') for i in comment_num.index.to_list()] 
y_line1 = comment_num.values.tolist()

# Draw the line chart
line1 = Line(init_opts=opts.InitOpts(width='1350px', height='750px'))
line1.add_xaxis(x_line1)
line1.add_yaxis('', y_line1,
                markpoint_opts=opts.MarkPointOpts(data=[
                    opts.MarkPointItem(type_='max', name='Max'),
                    opts.MarkPointItem(type_='min', name='Min')
                ])) 
line1.set_global_opts(title_opts=opts.TitleOpts('Mojito comment count trend'), 
                      xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate='30')),
                      visualmap_opts=opts.VisualMapOpts(max_=12673)
                     ) 
line1.set_series_opts(label_opts=opts.LabelOpts(is_show=False), 
                      linestyle_opts=opts.LineStyleOpts(width=3))
line1.render()
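The hourly counts behind this chart come from splitting the time string on ':' and keeping the first piece. An alternative sketch is to parse the strings as datetimes and format them to the hour, which gives the same 'YYYY-MM-DD HH' keys. The four sample timestamps below are made up for illustration.

```python
import pandas as pd

# Hypothetical comment times in the same format as df['comment_time']
times = pd.Series([
    '2020-06-12 00:00:05',
    '2020-06-12 00:41:10',
    '2020-06-12 01:02:33',
    '2020-06-12 00:59:59',
])

# Parse to datetimes, truncate to the hour, then count and sort by hour
hourly = (
    pd.to_datetime(times)
    .dt.strftime('%Y-%m-%d %H')
    .value_counts()
    .sort_index()
)
print(hourly)
```

Parsing first also makes it easy to switch to another granularity, such as per day, by changing the format string.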

Word cloud of QQ Music comments

def get_cut_words(content_series):
    # Load the stop word list
    stop_words = [] 

    with open(r"stop_words.txt", 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            stop_words.append(line.strip())

    # Add custom words so jieba keeps them as single tokens
    my_words = ['周杰伦', '一首歌', '好好听', '方文山', '30多岁']    
    for i in my_words:
        jieba.add_word(i) 

    # Custom stop words
    my_stop_words = ['歌有', '真的', '这首', '一首', '一点', 
                    '反正', '一段', '一句', '首歌', '啊啊啊', 
                    '哈哈哈', '转发', '微博', '那段', '他会'
                    ]   
    stop_words.extend(my_stop_words)               

    # Segment the text into words
    word_num = jieba.lcut(content_series.str.cat(sep='。'), cut_all=False)

    # Keep words that are not stop words and have at least 2 characters
    word_num_selected = [i for i in word_num if i not in stop_words and len(i)>=2]

    return word_num_selected

text1 = get_cut_words(content_series=df.content)
text1[:5] 
['致敬', '久石', '人生', '旋转', '木马']

# Draw the word cloud
stylecloud.gen_stylecloud(text=' '.join(text1), 
                          max_words=1000,
                          collocations=False,
                          font_path=r'C:\Windows\Fonts\msyh.ttc',
                          icon_name='fas fa-music',
                          size=624,
                          output_name='./词云图/QQ音乐评论词云图.png')
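Before rendering a word cloud, it can help to look at the most frequent tokens directly. The sketch below applies the same filter as get_cut_words (drop stop words and tokens shorter than 2 characters) to an already segmented list and counts what remains. The token list and the helper name `filter_tokens` are made up for illustration, and jieba is left out so the example runs with the standard library alone.

```python
from collections import Counter


def filter_tokens(tokens, stop_words):
    """Keep tokens that are not stop words and have at least 2 characters."""
    stop = set(stop_words)  # set lookup is faster than scanning a list
    return [t for t in tokens if t not in stop and len(t) >= 2]


# Hypothetical output of a segmenter such as jieba.lcut
tokens = ['好听', '真的', '好听', '夏天', '的', '青春', '夏天', '好听']

kept = filter_tokens(tokens, stop_words=['真的'])
top = Counter(kept).most_common(2)
print(top)
```

The words that dominate this list are the ones the word cloud will draw largest.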

Produced by CDA Data Analyst