Analysis for INFO 212 Data Science Programming I Assignment 2

大数据第二次作业

Assignment 2: Pandas and matplotlib

题目描述:

介绍数据集的背景

League of Legends (LoL) is one of the most popular games in the world today. There are some datasets that aim to provide information for data scientists interested in analyzing the game and, hopefully, playing a bit better. smile This assignment will focus on the (LoL) League of Legends Ranked Games dataset, which gathers details from more than 50,000 ranked LoL games. This dataset is available at https://www.kaggle.com/datasnaek/league-of-legends.

此次数据集是针对S9的LOL比赛,确实令我这个LOL狂热玩家激动。
数据集内容

According to the website, This is a collection of over 50,000 ranked EUW games from the game League of Legends, as well as json files containing a way to convert between champion and summoner spell IDs and their names. For each game, there are fields for:

  • Game ID
  • Creation Time (in Epoch format)
  • Game Duration (in seconds)
  • Season ID
  • Winner (1 = team1, 2 = team2)
  • First Baron, dragon, tower, blood, inhibitor and Rift Herald (1 = team1, 2 = team2, 0 = none)
  • Champions and summoner spells for each team (Stored as Riot’s champion and summoner spell IDs)
  • The number of tower, inhibitor, Baron, dragon and Rift Herald kills each team has
  • The 5 bans of each team (Again, champion IDs are used)
题目要求

Obtain this dataset and familiarize yourself with it. Using this dataset, you are expected to answer some questions and create some plots:

  1. Which champion id is more often associated with a team winning?
  2. Which champion id is more often associated with a team losing?
  3. Which champion id being banned is more often associated with a team losing?
  4. Which two champion ids appear more often together in winning teams (where they are not banned)?
  5. Create a scatterplot showing the number of times each champion is part of a winning team.
  6. Furthermore, create a violin plot to show the distribution of wins for the various champions. This plot will also highlight whether there are outliers, i.e., champions that win or lose much more often than the median.
  7. Finally, you should plot a bar chart and a pie chart with the top 20 most often winning champions.
题目附加要求

Your charts should be readable, without overlapping names, and with tight borders. Ideally, they should be part of the same figure, appearing as subplots.

The deliverable for this assignment is a Jupyter Notebook with all the code that you wrote. The notebook should also use the functions you built so that the responses to the questions posed in the tasks are visible in the notebook. The notebook should also include the code to produce the plots so that all the responses can be seen in the same notebook.

这个数据集中的内容有很多都是用不到的

首先导下库

其中seaborn是用来画小提琴图的

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

设置一下文件路径

# set the path
path = './archive'
gamepath = './archive/games.csv'
champ1path = './archive/champion_info.json'
champ2path = './archive/champion_info_2.json'
spellpath = './archive/summoner_spell_info.json'

其中champ2path和spellpath(存放召唤师技能的细节)都没用到过

该csv文件是按照逗号正常分隔的,直接调用pandas的read_csv方法来获得DataFrame。

gamedata = pd.read_csv(gamepath)

试图通过科学的方法来减少复制粘贴,但是其实没必要…

# create table columns names
champ_team_list=['winner']
team_ban=[]
for i in range(1,3):
    for j in range(1,6):
        champ_team_list.append('t{}_champ{}id'.format(i,j))
        team_ban.append('t{}_ban{}'.format(i,j))
team1_champid_list=champ_team_list[1:6]
team2_champid_list=champ_team_list[6:]
team1_ban=team_ban[:5]
team2_ban=team_ban[5:]

就是为了获得

team1_champid_list
['t1_champ1id', 't1_champ2id', 't1_champ3id', 't1_champ4id', 't1_champ5id']
team1_ban
['t1_ban1', 't1_ban2', 't1_ban3', 't1_ban4', 't1_ban5']

这样的效果

直接把1-3问大一统地解决掉,核心方法是字典,字典的键是champion id,值初始化为0。它们仨的初始化其实直接浅拷贝就行,就是用.copy()方法,注意champion_banned_lose中有-1键,这个键并没有出现在champion id里面;-1代表着空ban:

%%time
champid_dict_won = {i:0 for i in pd.read_json(champ1path).index.values}
champid_dict_lose = {i:0 for i in pd.read_json(champ1path).index.values}
champid_banned_lose = {i:0 for i in pd.read_json(champ1path).index.values}
champid_banned_lose[-1] = 0

for i in range(len(gamedata)):
    if gamedata.loc[i, 'winner'] == 1:
        for id in gamedata.loc[i, team1_champid_list].T.values:
            champid_dict_won[id] += 1
        for id in gamedata.loc[i, team2_champid_list].T.values:
            champid_dict_lose[id] += 1
        for id in gamedata.loc[i, team1_ban].T.values:
            champid_banned_lose[id] += 1
    elif gamedata.loc[i, 'winner'] == 2:
        for id in gamedata.loc[i, team2_champid_list].T.values:
            champid_dict_won[id] += 1
        for id in gamedata.loc[i, team1_champid_list].T.values:
            champid_dict_lose[id] += 1
        for id in gamedata.loc[i, team2_ban].T.values:
            champid_banned_lose[id] += 1

其中,用转置T是为了消除掉列表中括号

%%time获得的值是Wall time: 3min 43s

# number of champions
len(champid_dict_won)
138

可以知道在S9的时候有138个英雄。

写一个从字典中,获取键对应值的最大值的键和值的函数:

# return the maximum value and corresponding key from a dictionary
def get_max_from_dict(adict):
    max_key = max(adict.keys(),key=(lambda x:adict[x]))
    max_value = adict[max(adict.keys(),key=(lambda x:adict[x]))]
    print('id:',max_key, 'value:', max_value)

那么可以直接对问题解答了:

  1. Which champion id is more often associated with a team winning?
get_max_from_dict(champid_dict_won)
id: 18 value: 6713

id为18的英雄是崔丝塔娜,小炮

  1. Which champion id is more often associated with a team losing?
get_max_from_dict(champid_dict_lose)
id: 412 value: 6859

id为412的英雄是锤石

  1. Which champion id being banned is more often associated with a team losing?
get_max_from_dict(champid_banned_lose)
id: 157 value: 16519

id为157的英雄是亚索,这个真是能笑死个人,快乐风男被ban了赢得概率就大了,这是不让队友玩的原因还是不让对面玩的原因?

对于问题4

Which two champion ids appear more often together in winning teams (where they are not banned)?

要找出来一对英雄,思来想去用的方法是笛卡尔乘积,但是要剔除掉重复的,所以按照n+(n-1)+···+1的总数来弄类笛卡尔乘积,首先先对数据排好序,然后再生成即可。

champid_list = sorted([i for i in pd.read_json(champ1path).index.values])
champid_tuple = []

for i in range(len(champid_list)):
    for j in range(len(champid_list[i+1:])):
        champid_tuple.append((champid_list[i], champid_list[i+1:][j]))

        
champid_win_both = {i:0 for i in champid_tuple}
%%time
for i in range(len(gamedata)):
    champid_tuple_demo=[]
    if gamedata.loc[i, 'winner'] == 1:
        data = gamedata.loc[i, team1_champid_list].T.values
    elif gamedata.loc[i, 'winner'] == 2:
        data = gamedata.loc[i, team2_champid_list].T.values
    sorted_data = sorted(data)
    for i in range(len(sorted_data)):
        for j in range(len(sorted_data[i+1:])):
            champid_tuple_demo.append((sorted_data[i], sorted_data[i+1:][j]))
    for i in champid_tuple_demo:
        if i in champid_tuple:
            champid_win_both[i] += 1

这个是在双核奔腾笔记本上运行的,Wall time是3min 56s。

# champion ids appear more often together in winning teams
get_max_from_dict(champid_win_both)
id: (497, 498) value: 1242

果然在情理之中啊,在一起,是与队伍胜利联系最大的一对英雄!因为在官方设定里面他俩就是一对情侣啊!回城的时候还能一起回城。

我用英文是这样描述的:

The happy thing is that these two ids are correspond to Rakan and Xayah who are lovers in the game.

对于第五题,画散点图,我想着x坐标用id来比较好,毕竟如果把138个英雄名写上去太挤了。

Create a scatterplot showing the number of times each champion is part of a winning team.

won_id_times = sorted(champid_dict_won.items(), key=lambda d:d[0], reverse=False)
scatter_x = [i[0] for i in won_id_times]
scatter_y = [i[1] for i in won_id_times]

# set the figure size
fig = plt.figure(figsize=(10,5))  
ax1 = fig.add_subplot(1,1,1)  
ax1.set_title('Champion id and Times on Winning Team')  

plt.xlabel('Champion id')  
plt.ylabel('Times')  

ax1.scatter(scatter_x, scatter_y, s=20)  
 
plt.show()  

获得的散点图如下:
散点图
对于第六题,是最令我头疼的,

Furthermore, create a violin plot to show the distribution of wins for the various champions. This plot will also highlight whether there are outliers, i.e., champions that win or lose much more often than the median.

小提琴图,第一次听说。仔细查找相关文档发现这东西和箱线图差不多,就是在seaborn上讲的很细。

比如说这张图:
violinplot
主要困难的地方是画出来离群点,因为我的这个数据中没有y值,所以用matplotlib里的方法,就算把y调成0也画不出来。

于是向外教发邮件,恭恭敬敬地内容如下:

Dear professor,
     I am trapped in question #6 “Furthermore, create a violin plot to show the distribution of wins for the various champions. This plot will also highlight whether there are outliers, i.e., champions that win or lose much more often than the median.” Although I have plotted the violin plot of wins for the various champions which is the graph as an attachment, I have trouble in highlighting the max value on the plot.
     The fatal question for me is that it is hard to plot a yellow circle as the highlight point onto the plot since this violin plot created by seaborn does not have y values.
     I would be appreciate it if you give me some helpful suggestion.
     Yours,
     sincerely student

其中的附件就是上图的那个小提琴图,但是外教没回我,直到作业提交完成的第二天早上我才收到姗姗来迟的邮件:

Dear XXX,
You can change the orientation of the violin plot by passing the orient optional named parameter with argument “v” (for “vertical”). Would that help you solve the problem?
I apologize for the long delay. This message got lost in my inbox.
Kind regards,
XXX

于是向大佬请教,stackoverflow上,csdn上都提问了,链接如下:StackOverflow询问情况(图片预览需要爬梯子才能加载)CSDN询问情况

发现最后的解决方案非常简单啊,用seaborn的画布不就行啦!

champid_list = sorted([i for i in pd.read_json(champ1path).index.values])

champid_won_times = np.array([i[1] for i in sorted(champid_dict_won.items(), key=lambda d:d[0], reverse=False)])
champid_lose_times = np.array([i[1] for i in sorted(champid_dict_lose.items(), key=lambda d:d[0], reverse=False)])

datadict = {'id':champid_list, 'won_times':champid_won_times, 'lose_times':champid_lose_times}

champ_won_lose = pd.DataFrame(data=datadict)
# violin plot with outliers
# outliers were highlighted by plotting red points

fig, ((ax1, ax2)) = plt.subplots(nrows=2, ncols=1, figsize=(10, 5), sharex=True)

data_won_violin = champ_won_lose['won_times']
data_lose_violin = champ_won_lose['lose_times']

q1_won, q3_won = np.percentile(champ_won_lose['won_times'], [25, 75])
whisker_low_won = q1_won - (q3_won - q1_won) * 1.5
whisker_high_won = q3_won + (q3_won - q1_won) * 1.5

q1_lose, q3_lose = np.percentile(champ_won_lose['lose_times'], [25, 75])
whisker_low_lose = q1_lose - (q3_lose - q1_lose) * 1.5
whisker_high_lose = q3_lose + (q3_lose - q1_lose) * 1.5

sns.violinplot(x=data_won_violin, color='CornflowerBlue', ax=ax1)
outliers_won = data_won_violin[(data_won_violin > whisker_high_won) | (data_won_violin < whisker_low_won)]
sns.scatterplot(x=outliers_won, y=0, marker='o', color='crimson', ax=ax1)

sns.violinplot(x=data_lose_violin, color='CornflowerBlue', ax=ax2)
outliers_lose = data_lose_violin[(data_lose_violin > whisker_high_lose) | (data_lose_violin < whisker_low_lose)]
sns.scatterplot(x=outliers_lose, y=0, marker='o', color='crimson', ax=ax2)

ax1.set_title('Won Violin Plot')
ax2.set_title('Lose Violin Plot')

plt.tight_layout()
plt.show()

效果如图:
小提琴图
最后一题,

Finally, you should plot a bar chart and a pie chart with the top 20 most often winning champions.

画柱状图和饼图

由于题目没明确是胜率还是上面一二题的那种关联问题,于是我做了两类图。

此外,我是用这样的英文描述我的困惑的:

Since the description of the title is not very accurate, two methods are used to draw the pictures. One is based on the victory of the champion and the team, and the other is based on the victory rate of the champion.

排序函数很重要。

id_name_map = {i['id']: i['name'] for i in pd.read_json(champ1path).data.values}

won_rate = (champ_won_lose['won_times']/(champ_won_lose['lose_times']+champ_won_lose['won_times'])).to_list()
won_rate_dict = {x:0 for x in champid_list}
for i in range(len(won_rate)):
    won_rate_dict[champid_list[i]] += won_rate[i]

rate_won_id_times_sorted_with_value_20 = sorted(won_rate_dict.items(), key=lambda d:d[1], reverse=True)[:20]
bar_x_rate = [id_name_map[i[0]] for i in rate_won_id_times_sorted_with_value_20]
bar_y_rate = [i[1] for i in rate_won_id_times_sorted_with_value_20]

won_id_times_sorted_with_value_20 = sorted(champid_dict_won.items(), key=lambda d:d[1], reverse=True)[:20]
bar_x = [id_name_map[i[0]] for i in won_id_times_sorted_with_value_20]
bar_y = [i[1] for i in won_id_times_sorted_with_value_20]

了解一下top 20的英雄id

# check the top20 won champion ids
won_id_times_sorted_with_value_20
[(18, 6713),
 (412, 6143),
 (67, 5498),
 (40, 4826),
 (141, 4807),
 (29, 4665),
 (64, 4217),
 (222, 4087),
 (157, 3948),
 (202, 3925),
 (236, 3915),
 (498, 3906),
 (99, 3560),
 (53, 3506),
 (497, 3433),
 (117, 3367),
 (24, 3347),
 (103, 3295),
 (61, 3286),
 (21, 3256)]

把英雄id以及它对应的英雄名搞成键值对抛进字典里:

id_name_map = {i['id']: i['name'] for i in pd.read_json(champ1path).data.values}

非胜率饼图:

# pie chart
fig = plt.figure(figsize=(10,10))  
ax1 = fig.add_subplot(1,1,1)  
ax1.set_title('Pie Chart for Top 20 Most Winning Champions') 
plt.pie(bar_y, labels=bar_x, autopct='%0.2f%%')
plt.show()

非胜率饼图
胜率饼图:

# pie chart (winning rate)
fig = plt.figure(figsize=(10,10))  
ax1 = fig.add_subplot(1,1,1)  
ax1.set_title('Pie Chart for Top 20 Most Winning Champions (Winning Rate)') 
plt.pie(bar_y_rate, labels=bar_x_rate, autopct='%0.2f%%')
plt.show()

胜率饼图
柱状图也一样简单:

非胜率柱状图:

# bar chart
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(1,1,1)  
ax1.set_title('Bar Chart for Top 20 Most Winning Champions') 
plt.bar(bar_x,bar_y)
plt.xticks(fontsize = 12, rotation=45)
plt.xlabel("Champion Name")
plt.ylabel("Times")
 
plt.show() 

非胜率柱状图
胜率柱状图:

# bar chart (winning rate)
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(1,1,1)  
ax1.set_title('Bar Chart for Top 20 Most Winning Champions (Winning Rate)') 
plt.bar(bar_x_rate,bar_y_rate)
plt.xticks(fontsize = 12, rotation=45)
plt.xlabel("Champion Name")
plt.ylabel("Times")
 
plt.show() 

胜率柱状图
終わり。

不懂,就钻;还是难受,那不如不懂就问。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值