【Hello Python World】Week 14: A statistical analysis case with Pandas & Seaborn

Article link: https://blog.csdn.net/u013159381/article/details/80637975

- Part 1
  - Analysis
  - Code
- Part 2
  - Analysis
  - Code

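In this installment we look at Python's Pandas and Seaborn modules, working through a very well-known example.
Anscombe's Quartet is a collection of four datasets, constructed by the statistician F.J. Anscombe in 1973, that share nearly identical means, variances, and correlation coefficients. At first glance they look like four highly similar datasets, but once we plot them we find that intuition has fooled us. The example shows how important it is to plot the data and get a visual feel for its characteristics before analyzing it. The two parts of this exercise have us run that experiment ourselves, so with the spoiler out of the way, let's walk through the analysis in Python.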

Part 1

For each of the four datasets…

Compute the mean and variance of both x and y
Compute the correlation coefficient between x and y
Compute the linear regression line: $y = \beta_0 + \beta_1 x + \epsilon$ (hint: use statsmodels and look at the Statsmodels notebook)

Analysis

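Pandas can read CSV files and operate on tables by row or by column. For the first two requirements,
mean() computes the mean, var() computes the variance, and corr() (the one I used) or corrwith() computes the correlation coefficient.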
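
As a minimal sketch of these calls, the snippet below applies mean(), var(), and corr() per group to a small made-up table. The values and the group labels A and B are illustrative assumptions, not the Anscombe data. The two groups are chosen so that they share means and variances yet have opposite correlations.

```python
import pandas as pd

# Toy data (an assumption for illustration): two groups with the same
# per-column mean and variance but opposite x/y relationships.
toy = pd.DataFrame({
    "dataset": ["A", "A", "A", "B", "B", "B"],
    "x": [1, 2, 3, 1, 2, 3],
    "y": [2, 4, 6, 6, 4, 2],
})

grouped = toy.groupby("dataset")[["x", "y"]]

means = grouped.mean()      # per-group mean of x and y
variances = grouped.var()   # per-group sample variance (ddof=1 by default)

# Series.corr gives a single Pearson coefficient for each group
corrs = grouped.apply(lambda g: g["x"].corr(g["y"]))

print(means)
print(variances)
print(corrs)
```

Calling corr() directly on the grouped object, as the post does, returns the full 2x2 correlation matrix per group; the lambda here just pulls out the single off-diagonal value.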

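For the last requirement we use the linear regression model in the statsmodels module to compute the coefficients.
To split the data, select rows with a boolean mask; for example, df[df.dataset == 'I'] takes the rows of the first dataset.
OLS is the ordinary least squares linear regression model provided by statsmodels; fit() estimates the model, and the params attribute of the result holds the fitted coefficients.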
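
To make these steps concrete, here is a rough sketch that selects one subset with a boolean mask and fits a line to it. It swaps in NumPy's np.linalg.lstsq for statsmodels so that the layout of the design matrix is visible; the toy data is an assumption for illustration. The column of ones is placed first, which is also where sm.add_constant puts it by default, so the first fitted coefficient is the intercept and the second is the slope.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): dataset "I" lies exactly on
# y = 3 + 0.5x, dataset "II" on y = 1 + 2x.
toy = pd.DataFrame({
    "dataset": ["I"] * 4 + ["II"] * 4,
    "x": [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
    "y": [3.0, 3.5, 4.0, 4.5, 1.0, 3.0, 5.0, 7.0],
})

# Boolean-mask selection, the same pattern as df[df.dataset == 'I']
subset = toy[toy.dataset == "I"]

# Design matrix with a leading column of ones for the intercept term
X = np.column_stack([np.ones(len(subset)), subset["x"].to_numpy()])
y = subset["y"].to_numpy()

# Ordinary least squares: minimize ||X @ coef - y||^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef  # order follows the columns of X

print("y = %fx + %f" % (slope, intercept))
```

Keeping the coefficient order in mind matters when formatting the fitted line, since swapping the two values silently produces a different equation.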

Code

import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv('anscombe.csv')

print("The means of the 4 datasets:")
print(df.groupby(['dataset']).mean())

print("\nThe variances of the 4 datasets:")
print(df.groupby(['dataset']).var())

print("\nThe correlation coefficients between x & y of the 4 datasets:")
print(df.groupby(['dataset']).corr())

print("\nFitted functions of 4 dataset:")
for i in ['I','II','III','IV']:
    ds = df[df.dataset == i]
    x = sm.add_constant(np.array(ds.x))
    y = np.array(ds.y)
    res = sm.OLS(y, x).fit().params
    # add_constant prepends the intercept column, so params is [intercept, slope]
    print("The fitted function of %s is: y = %fx + %f" % (i, res[1], res[0]))

The results are as follows:
The means of the 4 datasets:
       x         y
dataset
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909

The variances of the 4 datasets:
            x         y
dataset
I        11.0  4.127269
II       11.0  4.127629
III      11.0  4.122620
IV       11.0  4.123249

The correlation coefficients between x & y of the 4 datasets:
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000

Fitted functions of 4 dataset:
The fitted function of I is: y = 0.500091x + 3.000091
The fitted function of II is: y = 0.500000x + 3.000909
The fitted function of III is: y = 0.499727x + 3.002455
The fitted function of IV is: y = 0.499909x + 3.001727

As we can see, the four datasets are nearly identical from a statistical point of view.

Part 2

Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter

Analysis

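Installing seaborn already shows that it builds on libraries such as scipy and matplotlib: it wraps their functionality and offers a higher-level, more powerful plotting interface.
sns.FacetGrid splits a dataset into subsets for plotting, and its map method draws each subset into its own panel of one compact figure. The exercise asks for plt.scatter, which only needs to draw the points. I also tried a second approach, sns.lmplot, which fits a linear regression for each panel automatically and draws the fitted line on the plot, all in one call.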

Code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # apply seaborn's default theme
df = pd.read_csv('anscombe.csv')

# Pick one of the two methods
# Method 1
sns.lmplot(data=df, col='dataset', x="x", y="y", markers="*", col_wrap=2)

# Method 2: 337232 is the last six digits of my student ID; as a hex color it is a grass green
sns.FacetGrid(data=df, col='dataset', col_wrap=2).map(plt.scatter, 'x', 'y', color="#337232", edgecolor="white")
plt.show()  # nothing is displayed until show() is called!

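The results are shown below:
Method 1:

[Figure: sns.lmplot output for the four datasets]

Method 2:

[Figure: sns.FacetGrid scatter plots for the four datasets]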

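Comparing the four plots makes the differences between the datasets obvious. The lesson: visualize the data before analyzing it, rather than guessing at the properties of a dataset from its mean, variance, and correlation alone.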