[DS Practice | Coursera] Assignment 3 | Applied Plotting, Charting & Data Representation in Python

Mart_inn

Last modified: 2022-02-21 17:21:49

Categories: Data Science with Python, Coursera

First published: 2022-02-21 17:17:02

Article link: https://blog.csdn.net/Mart_inn/article/details/123048672

I. Problem Analysis

1.1 Problem Description

In this assignment you must choose one of the options presented below and submit a visual as well as your source code for peer grading. The details of how you solve the assignment are up to you, although your assignment must use matplotlib so that your peers can evaluate your work. The options differ in challenge level, but there are no grades associated with the challenge level you chose. However, your peers will be asked to ensure you at least met a minimum quality for a given technique in order to pass. Implement the technique fully (or exceed it!) and you should be able to earn full grades for the assignment.

Ferreira, N., Fisher, D., & Konig, A. C. (2014, April). Sample-oriented task-driven visualizations: allowing users to make better, more confident decisions.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 571-580). ACM. ([video](https://www.youtube.com/watch?v=BI7GAs-va-Q))

In this paper the authors describe the challenges users face when trying to make judgements about probabilistic data generated through samples. As an example, they look at a bar chart of four years of data (replicated below in Figure 1). Each year has a y-axis value, which is derived from a sample of a larger dataset. For instance, the first value might be the number of votes in a given district or riding for 1992, with the average being around 33,000. On top of this is plotted the 95% confidence interval for the mean (see the boxplot lectures for more information, and the yerr parameter of bar charts).
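As a rough illustration of the `yerr` idea mentioned above, the sketch below draws one bar with a 95% confidence interval for the mean. The sample is made up for the example (it is not the assignment data), and the interval uses the normal approximation.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical sample for a single year (not the assignment data)
rng = np.random.default_rng(0)
sample = rng.normal(33000, 150000, 3650)

mean = sample.mean()
sem = stats.sem(sample)          # standard error of the mean
z = stats.norm.ppf(0.975)        # two-sided 95% critical value, about 1.96
half_width = z * sem             # distance from the mean to either bound

fig, ax = plt.subplots()
# yerr draws a symmetric error bar of +/- half_width around the bar height
ax.bar(['1992'], [mean], yerr=[half_width], capsize=10)
ax.set_ylabel('Votes')
```

The error bar here is symmetric; `yerr` also accepts a 2 x N array when the lower and upper offsets differ.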

[Figure 1: bar chart of four years of sampled data with 95% confidence intervals]

A challenge that users face is that, for a given y-axis value (e.g. 42,000), it is difficult to know which x-axis values are most likely to be representative, because the confidence levels overlap and their distributions are different (the lengths of the confidence interval bars are unequal). One of the solutions the authors propose for this problem (Figure 2c) is to allow users to indicate the y-axis value of interest (e.g. 42,000) and then draw a horizontal line and color bars based on this value. So bars might be colored red if they are definitely above this value (given the confidence interval), blue if they are definitely below this value, or white if they contain this value.

[Figure 2: solutions proposed in the paper; panel (c) colors the bars relative to a chosen y-axis value]
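The three-color rule described above can be sketched in a few lines. The function name `three_color` and the interval bounds are made up for this illustration.

```python
def three_color(y, interval):
    """Return a color for a bar whose confidence interval is (low, high)."""
    low, high = interval
    if y < low:
        return 'red'    # the whole interval lies above y
    if y > high:
        return 'blue'   # the whole interval lies below y
    return 'white'      # the interval contains y

# Hypothetical confidence intervals for three bars
intervals = [(30000, 36000), (40000, 44000), (45000, 50000)]
colors = [three_color(42000, iv) for iv in intervals]
print(colors)  # ['blue', 'white', 'red']
```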

Easiest option: Implement the bar coloring as described above - a color scale with only three colors, (e.g. blue, white, and red). Assume the user provides the y axis value of interest as a parameter or variable.

Harder option: Implement the bar coloring as described in the paper, where the color of the bar is actually based on the amount of data covered (e.g. a gradient ranging from dark blue for the distribution being certainly below this y-axis, to white if the value is certainly contained, to dark red if the value is certainly not contained as the distribution is above the axis).

Even Harder option: Add interactivity to the above, which allows the user to click on the y axis to set the value of interest. The bar colors should change with respect to what value the user has selected.

Hardest option: Allow the user to interactively set a range of y values they are interested in, and recolor based on this (e.g. a y-axis band, see the paper for more details).

Note: The data given for this assignment is not the same as the data used in the article and as a result the visualizations may look a little different.

Use the following data for this assignment:

import pandas as pd
import numpy as np
from scipy import stats
%matplotlib notebook

np.random.seed(12345)

df = pd.DataFrame([np.random.normal(32000,200000,3650), 
                   np.random.normal(43000,100000,3650), 
                   np.random.normal(43500,140000,3650), 
                   np.random.normal(48000,70000,3650)], 
                  index=[1992,1993,1994,1995])

1.2 Problem Analysis

Analysis of the Even Harder option:

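The assignment provides four datasets, one for each year from 1992 to 1995. Given a particular value, the goal is to judge which year it most plausibly belongs to. By the central limit theorem, the sample mean of each year is approximately normally distributed (whether the raw data are normal can be checked by inspecting a kernel density curve or by running a Shapiro-Wilk or Kolmogorov-Smirnov test; this step is skipped here). Set $\alpha=0.05$, i.e. a confidence level of 95%, and compute the confidence interval of the mean for each year. The bars are drawn at the midpoint of each interval (which is also the sample mean), and the error bars span the interval itself, so each error bar shows the range in which that year's mean plausibly lies at the 95% confidence level.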
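The interval computation can be checked by hand. The sketch below, on a made-up sample, runs an optional Shapiro-Wilk test and confirms that `stats.norm.interval` returns the same bounds as the mean plus or minus the critical value times the standard error.

```python
import numpy as np
from scipy import stats

# Hypothetical sample (not the assignment data)
rng = np.random.default_rng(1)
sample = rng.normal(43000, 100000, 3650)

# Optional normality check on the raw sample
stat, p_value = stats.shapiro(sample)

alpha = 0.05
mean, sem = sample.mean(), stats.sem(sample)

# 95% confidence interval of the mean under the normal approximation
low, high = stats.norm.interval(1 - alpha, loc=mean, scale=sem)

# The same bounds computed by hand
z = stats.norm.ppf(1 - alpha / 2)
manual = (mean - z * sem, mean + z * sem)
print(low, high, manual)
```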

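The closer a value lies to the center of a confidence interval, the more plausibly it belongs to that year. The function therefore scores an observed value $Y$ with $2(\bar{X}-Y)/(X_{max}-X_{min})$, where $\bar{X}$ is the sample mean and $X_{min}$ and $X_{max}$ are the lower and upper bounds of the confidence interval. The score runs from 1 at the lower bound through 0 at the mean to -1 at the upper bound, and is clamped to 1 or -1 outside the interval. The code displays one minus the absolute value of the score as the "prob" of each year, which is a heuristic measure rather than a strict probability.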
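A small worked example may make the score easier to read. The function name `score` and the interval below are chosen only for this illustration.

```python
def score(y, interval):
    """Signed distance of y from the interval center, scaled to [-1, 1]."""
    low, high = interval
    if y < low:
        return 1.0       # y is below the whole interval
    if y > high:
        return -1.0      # y is above the whole interval
    center = (low + high) / 2
    return 2 * (center - y) / (high - low)

interval = (40000.0, 44000.0)   # hypothetical confidence interval
for y in (39000, 41000, 42000, 43000, 45000):
    s = score(y, interval)
    # 1 - |s| is the value shown as "prob" on the chart
    print(y, s, 1 - abs(s))
```

For this interval the displayed value is 1 at the center (42000), 0.5 halfway to either bound, and 0 at the bounds and beyond.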

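The colormap should be a diverging one: pale in the middle and fading to a saturated color at both ends, so that the middle corresponds to a probability of 1 and the two ends to 0. The scores, which range from -1 to 1, are first normalized with plt.Normalize(-1,1), and each bar is then colored by the score of the observed value for that year. Finally, the details of the figure are polished (for example, raising the data-ink ratio and removing chartjunk), which completes one frame. For more on colormaps, see
[DS with Python] Matplotlib Basics (3): the cm module, colormap color schemes, animation, and canvas interaction.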
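The normalization step can be tried on its own. The sketch below maps three example scores onto the 'coolwarm' colormap; the score values are illustrative.

```python
import matplotlib.pyplot as plt
from matplotlib import cm

cmap = plt.get_cmap('coolwarm')      # diverging colormap: blue, pale, red
norm = plt.Normalize(-1, 1)          # maps scores in [-1, 1] onto [0, 1]
sm = cm.ScalarMappable(cmap=cmap, norm=norm)

scores = [-1.0, 0.0, 1.0]            # example scores
rgba = sm.to_rgba(scores)            # one RGBA row per score

for s, color in zip(scores, rgba):
    print(s, float(norm(s)), color)
```

A score of -1 lands on the blue end, 1 on the red end, and 0 on the pale midpoint of the colormap.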

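Finally, an event handler is registered for mouse clicks. On each click it reads the current y coordinate from event.ydata, treats it as the observed value, and passes it to the plotting logic above to produce a new frame.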
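The event wiring can be sketched separately from the coloring logic. In the example below the bar heights, the names `on_click` and `clicked`, and the simulated click are all assumptions made for the illustration; in a notebook the handler would be triggered by real mouse clicks.

```python
from types import SimpleNamespace
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar([0, 1], [30000, 45000])       # hypothetical bar heights
ax.set_ylim(0, 60000)
line = ax.axhline(40000, color='gray', linestyle='--')
clicked = []                          # records every y value picked so far

def on_click(event):
    # Clicks outside the plotting area carry no data coordinates
    if event.inaxes is not ax or event.ydata is None:
        return
    clicked.append(event.ydata)
    line.set_ydata([event.ydata, event.ydata])   # move the reference line
    fig.canvas.draw_idle()                       # request a redraw

cid = fig.canvas.mpl_connect('button_press_event', on_click)

# Simulate a click without a GUI; only the attributes read above are supplied
on_click(SimpleNamespace(inaxes=ax, ydata=42000.0))
```

Checking `event.inaxes` also keeps clicks on other axes in the same figure, such as a colorbar, from being misread as y values.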

II. Code and Comments

2.1 Code and Comments

# Use the following data for this assignment:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy import stats
%matplotlib notebook

np.random.seed(12345)

# The four datasets, one row per year
df = pd.DataFrame([np.random.normal(32000,200000,3650), 
                   np.random.normal(43000,100000,3650), 
                   np.random.normal(43500,140000,3650), 
                   np.random.normal(48000,70000,3650)], 
                  index=[1992,1993,1994,1995])

# Compute the 95% confidence interval of the mean for each year
intervals = []
for idx in df.index:
    interval = stats.norm.interval(0.95, loc=np.mean(df.loc[idx]), scale=stats.sem(df.loc[idx]))
    intervals.append(interval)

# yerr for the error bars: the interval bounds minus the mean of the same year,
# arranged as a 2 x 4 array (row 0: lower offsets, row 1: upper offsets)
err = np.array([np.array(interval) - np.mean(df.loc[idx])
                for idx, interval in zip(df.index, intervals)]).T

# Year labels and the mean of each year
index = df.index.values
values = df.mean(axis=1).values

# Default position of the dashed line: the mean of the four bar heights
y = np.mean(values)

# Create a new figure
plt.figure()

# Pick a colormap; 'coolwarm' is used here, but any other diverging colormap (or a custom one) also works
cmap = plt.get_cmap('coolwarm')

# Score in [-1, 1]: 1 (red) when y lies entirely below the 95% confidence interval,
# -1 (blue) when y lies entirely above it, and a linear ramp in between
def calculate_probability(y, interval):
    if y < interval[0]:
        return 1
    elif y > interval[1]:
        return -1
    return 2*((interval[1]+interval[0])/2-y)/(interval[1]-interval[0])

# Evaluate the score for every confidence interval with a list comprehension
probs = [calculate_probability(y, interval) for interval in intervals]

# ScalarMappable maps scores in [-1, 1] onto the colormap
sm = cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(-1,1))
sm.set_array([])

# Draw the bars
bars = plt.bar(range(len(values)), values, color=sm.to_rgba(probs))

# Draw the error bars
plt.gca().errorbar(range(len(values)), values, yerr=abs(err), c='k', fmt=' ', capsize=15)

# Axis settings
plt.xticks(range(len(values)), index)
plt.ylabel('Values')
plt.xlabel('Year')
plt.ylim([0,60000])
plt.gca().set_title('Assignment3')

# Horizontal colorbar; ax is passed explicitly because sm is not attached to any axes
plt.colorbar(sm, ax=plt.gca(), orientation='horizontal')

# Hide the top and right spines and fade the other two to reduce chartjunk
for loc in ['top','right']:
    plt.gca().spines[loc].set_visible(False)
for loc in ['left','bottom']:
    plt.gca().spines[loc].set_alpha(0.3)

# Add the dashed line's y value to the y ticks
yticks = plt.gca().get_yticks()
new_yticks = np.append(yticks, y)
plt.gca().set_yticks(new_yticks)

# Draw the dashed line for the value of interest
h_line = plt.axhline(y, color='gray', linestyle='--', linewidth=1)

# Annotate the current y value and the score of each bar
text = plt.text(1.5, 58000, 'y={:5.0f}'.format(y), bbox={'fc':'w','ec':'k'}, ha='center')
text1, text2, text3, text4 = [
    plt.text(bar.get_x()+bar.get_width()/2, bar.get_height()+10000,
             'prob={:.2f}'.format(1-abs(prob)), bbox={'fc':'w','ec':'k'}, ha='center')
    for bar, prob in zip(bars, probs)]



# Click handler: recolor the bars for the clicked y value
def onclick(event):
    ax = h_line.axes
    # Ignore clicks outside the main axes (event.ydata is None there)
    if event.inaxes is not ax or event.ydata is None:
        return
    # Recompute the scores
    probs = [calculate_probability(event.ydata, interval) for interval in intervals]
    # Recolor the existing bars instead of drawing new ones on top
    for bar, color in zip(bars, sm.to_rgba(probs)):
        bar.set_facecolor(color)
    # Move the dashed line to the clicked value
    h_line.set_ydata([event.ydata, event.ydata])
    # Update the y ticks
    ax.set_yticks(np.append(yticks, event.ydata))
    # Update the annotations
    text.set_text('y={:5.0f}'.format(event.ydata))
    for label, prob in zip([text1, text2, text3, text4], probs):
        label.set_text('prob={:.2f}'.format(1-abs(prob)))
    # Redraw the figure
    event.canvas.draw_idle()

plt.gcf().canvas.mpl_connect('button_press_event', onclick)