Study Notes on Chapter 7 of 《Python数据分析基础》 (Foundations for Analysis with Python)

Bryant chen

Last modified 2023-11-25 21:54:28

First published 2023-11-25 21:20:34

Article link: https://blog.csdn.net/chenchenojbk/article/details/134620427

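Python Data Analysis, Chapter 7

Descriptive Statistics and Modeling

7.1 Datasets

7.1.1 Wine Quality

Dataset downloads:
Red wine
White wine
We combine the two datasets into one and add a new type variable as the first column to indicate which kind of wine each row describes.

7.1.2 Customer Churn

Dataset:

Churn
(The link is dead. If you need the data, take the ready-made copy from the source code in the Chapter 1 study notes.)

7.2 Wine Quality

7.2.1 Descriptive Statistics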

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols, glm

# Read the dataset into a pandas DataFrame
wine = pd.read_csv('winequality-both.csv', sep=',', header=0)
# Replace spaces in the column headings with underscores
wine.columns = wine.columns.str.replace(' ', '_')
print(wine.head())
# Show descriptive statistics for all variables
print(wine.describe())
# Find the unique values
print(sorted(wine.quality.unique()))
# Count how often each value occurs
print(wine.quality.value_counts())

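After the import statements, read_csv reads the dataset into a pandas DataFrame. The extra arguments say that the field separator is a comma and that the first row holds the column headings. Some headings contain spaces, so the next line replaces the spaces with underscores. The head function then shows the header row and the first five rows of data, which confirms that the data loaded correctly.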
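The loading steps described above can be tried without the real file. The following is a minimal sketch that reads a tiny CSV held in memory; the three rows and their values are invented for illustration and are not taken from winequality-both.csv.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for winequality-both.csv
csv_text = """type,fixed acidity,residual sugar,quality
red,7.4,1.9,5
red,7.8,2.6,5
white,7.0,20.7,6
"""

# sep=',' sets the field separator; header=0 takes the column names from the first row
demo = pd.read_csv(io.StringIO(csv_text), sep=',', header=0)

# Replace spaces in the column headings with underscores
demo.columns = demo.columns.str.replace(' ', '_')

print(list(demo.columns))
print(demo.head())
```

With underscores in place, columns can also be reached by attribute access, such as demo.fixed_acidity.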

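7.2.2 Grouping, Histograms, and t-Tests

The statistics computed above cover the whole dataset, so next we analyze the red and white wine data separately to see whether the statistics stay the same. (I could not find the seaborn version used in the book, so I did not get the book's code to run all the way through. Treat the code below as a reference only.)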

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols, glm

# Read the dataset into a pandas DataFrame
wine = pd.read_csv('winequality-both.csv', sep=',', header=0)
wine.columns = wine.columns.str.replace(' ', '_')

# Show descriptive statistics of quality by wine type
print(wine.groupby('type')[['quality']].describe().unstack('type'))
# Show specific quantiles of quality by wine type
print(wine.groupby('type')[['quality']].quantile([0.25, 0.75]).unstack('type'))
# Look at the distribution of quality by wine type
red_wine = wine.loc[wine['type']=='red', 'quality']
white_wine = wine.loc[wine['type']=='white', 'quality']
sns.set_style("dark")
# histplot draws on the current axes, so both histograms share one plot (needs seaborn 0.11 or later);
# stat="density" normalizes the bars so that they match the y-axis label
sns.histplot(red_wine, stat="density", color="red", label="Red wine")
sns.histplot(white_wine, stat="density", color="white", label="White wine")
plt.xlabel('Quality Score')
plt.ylabel('Density')
plt.title("Distribution of Quality by Wine Type")
plt.legend()
plt.show()
# Test whether mean quality differs between red and white wine
print(wine.groupby(['type'])[['quality']].agg(['std']))
tstat, pvalue, df = sm.stats.ttest_ind(red_wine, white_wine)
print('tstat: %.3f pvalue: %.4f' % (tstat, pvalue))
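To see what the t statistic measures, it can be recomputed by hand. The sketch below assumes the pooled-variance form of the two-sample test, which is what sm.stats.ttest_ind uses by default (usevar='pooled'). The helper name pooled_ttest and the two short score lists are made up for illustration.

```python
import numpy as np

def pooled_ttest(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    # Pooled variance: the two sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    # Standard error of the difference between the two means
    std_err = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    tstat = (a.mean() - b.mean()) / std_err
    dof = n1 + n2 - 2
    return tstat, dof

# Made-up quality scores, for illustration only
red_demo = [5, 5, 6, 5, 6, 7]
white_demo = [6, 6, 5, 7, 6, 8]
tstat, dof = pooled_ttest(red_demo, white_demo)
print('tstat: %.3f df: %d' % (tstat, dof))
```

A negative statistic here means the first group has the lower mean. Whether the difference is significant still depends on the p-value, which this sketch does not compute.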

7.2.3 Relationships and Correlations Between Pairs of Variables

# Column headings need underscores (as in 7.2.1) for residual_sugar and fixed_acidity
wine.columns = wine.columns.str.replace(' ', '_')

# Compute the correlation matrix for all numeric variables
# (numeric_only=True skips the string column type; the argument needs pandas 1.5 or later)
print(wine.corr(numeric_only=True))

# Take a small sample from the red and white wine data for plotting
def take_sample(data_frame, replace=False, n=200):
    return data_frame.loc[np.random.choice(data_frame.index, replace=replace, size=n)]

reds = wine.loc[wine['type']=='red', :]
whites = wine.loc[wine['type']=='white', :]
reds_sample = take_sample(reds)
whites_sample = take_sample(whites)
wine_sample = pd.concat([reds_sample, whites_sample])
# Flag the rows of the full dataset that were drawn into the sample
wine['in_sample'] = np.where(wine.index.isin(wine_sample.index), 1., 0.)
print(wine['in_sample'])
print(pd.crosstab(wine.in_sample, wine.type, margins=True))

sns.set_style("dark")
# PairGrid version: histograms on the diagonal, scatter plots off the diagonal
pg = sns.PairGrid(wine_sample, hue="type", hue_order=["red", "white"], palette=dict(red="red", white="white"), hue_kws={"marker": ["o", "s"]}, vars=['quality', 'alcohol', 'residual_sugar'])
pg = pg.map_diag(plt.hist)
pg = pg.map_offdiag(plt.scatter, edgecolor="black", s=10, alpha=0.25)

# pairplot version with regression lines and jittered points, drawn from the sample
g = sns.pairplot(wine_sample, kind='reg', plot_kws={"ci": None, "x_jitter": 0.25, "y_jitter": 0.25}, hue='type', diag_kind='hist', diag_kws={"bins": 10, "alpha": 1.0}, palette=dict(red="red", white="white"), markers=["o", "s"], vars=['quality', 'alcohol', 'residual_sugar'])
print(g)
plt.suptitle('Histograms and Scatter Plots of Quality, Alcohol, and Residual Sugar', fontsize=14, horizontalalignment='center', verticalalignment='top', x=0.5, y=0.999)
# The same plot for the full dataset and, without jitter, for the sample
wine_all_plot = sns.pairplot(wine, kind='reg', hue='type', palette=dict(red="red", white="white"), markers=["o", "s"], vars=['quality', 'alcohol', 'residual_sugar'])
wine_sample_plot = sns.pairplot(wine_sample, kind='reg', hue='type', palette=dict(red="red", white="white"), markers=["o", "s"], vars=['quality', 'alcohol', 'residual_sugar'])

# Compare the distribution of fixed acidity before and after a log transform
wine['ln_fixed_acidity'] = np.log(wine.loc[:, 'fixed_acidity'])
plt.figure()
sns.histplot(wine.loc[:, 'fixed_acidity'], kde=True)
plt.figure()
sns.histplot(wine.loc[:, 'ln_fixed_acidity'], kde=True)
plt.show()

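7.2.4 Linear Regression Using Least Squares Estimation

Correlation coefficients and pairwise plots help quantify and visualize the relationship between two variables, but they cannot measure the relationship between each independent variable and the dependent variable while the other independent variables are held constant. Linear regression addresses this. The book does not explain in much detail how to carry out linear regression with least squares estimation, so I will write a separate article later to study linear regression.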
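The difference between a pairwise correlation and a regression coefficient can be shown on a small synthetic example. The sketch below is not the book's code: it uses NumPy's np.linalg.lstsq as a stand-in for the statsmodels ols function imported earlier, and the data and coefficients (3, 2, -1) are made up so that the result can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two correlated predictors (synthetic data, not the wine data)
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
# The outcome is built from known coefficients so the fit can be checked
y = 3.0 + 2.0 * x1 - 1.0 * x2

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# Least squares: choose beta to minimize the sum of squared residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # intercept, coefficient of x1, coefficient of x2

# The pairwise correlation of x2 with y also carries the effect of x1
print(np.corrcoef(x2, y)[0, 1])
```

In this construction x2 has a negative coefficient once x1 is held fixed, yet its pairwise correlation with y is positive because x2 moves together with x1. That is the kind of effect a correlation matrix alone cannot separate.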

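7.2.5 Standardizing Independent Variables

Ordinary least squares regression estimates the unknown β parameters by minimizing the sum of squared residuals, where a residual is the difference between an observed value of the dependent variable and its fitted value. Because the size of each estimated coefficient depends on the unit of measurement of its independent variable, the model is easier to interpret once the independent variables are standardized, especially when their units differ greatly.
To standardize an independent variable, subtract its mean from each observation and then divide by its standard deviation.
pandas makes it easy to standardize variables in a DataFrame. You write the transformation formula as if for a single observation, and pandas broadcasts it across the rows and columns to standardize every variable.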
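The standardization formula described above can be sketched in one line of pandas. The column names follow the wine dataset, but the five rows of values are made up for illustration.

```python
import pandas as pd

# Made-up values on very different scales, for illustration only
demo = pd.DataFrame({
    'alcohol': [9.4, 9.8, 10.5, 11.2, 12.8],
    'total_sulfur_dioxide': [34.0, 67.0, 54.0, 60.0, 170.0],
})

# The formula is written once; mean() and std() return one value per column,
# and pandas aligns the arithmetic on the column names
demo_standardized = (demo - demo.mean()) / demo.std()

print(demo_standardized)
print(demo_standardized.mean().round(6))
print(demo_standardized.std())
```

After the transformation each column has mean 0 and standard deviation 1, so a coefficient fitted on these columns describes the effect of a one-standard-deviation change.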

7.3 Customer Churn

Next we analyze the customer churn dataset. We create a new numeric binary variable in the DataFrame churn and check the first few rows.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
churn = pd.read_csv('churn.csv', sep=',', header=0)
# Clean the headings: underscores for spaces, no apostrophes or question marks, lowercase
churn.columns = [heading.lower() for heading in churn.columns.str.replace(' ', '_').str.replace("'", "").str.strip('?')]
# Numeric binary indicator: 1.0 if the customer churned, 0.0 otherwise
churn['churn01'] = np.where(churn['churn'] == 'True.', 1., 0.)
print(churn.head())

Compute three statistics for selected columns in each group: count, mean, and standard deviation:

# Compute descriptive statistics for the grouped data
print(churn.groupby(['churn'])[['day_charge', 'eve_charge', 'night_charge', 'intl_charge', 'account_length', 'custserv_calls']].agg(['count', 'mean', 'std']))
# Compute different statistics for different variables
print(churn.groupby(['churn']).agg({'day_charge' : ['mean', 'std'],
                                    'eve_charge' : ['mean', 'std'],
                                    'night_charge' : ['mean', 'std'],
                                    'intl_charge' : ['mean', 'std'],
                                    'account_length' : ['count', 'min', 'max'],
                                    'custserv_calls' : ['count', 'min', 'max']}))

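The next block summarizes the customer service call data. It first splits the data into five groups by equal-width binning on a new variable, total_charges, and then computes five statistics for each group: count, minimum, mean, maximum, and standard deviation.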

# Create total_charges,
# split it into 5 groups, and compute statistics for each group
churn['total_charges'] = churn['day_charge'] + churn['eve_charge'] + churn['night_charge'] + churn['intl_charge']
factor_cut = pd.cut(churn.total_charges, 5, precision=2)
def get_stats(group):
    return {'min' : group.min(), 'max' : group.max(), 'count' : group.count(), 'mean' : group.mean(), 'std' : group.std()}
# observed=False keeps every bin in the result and states the grouping behavior explicitly
grouped = churn.custserv_calls.groupby(factor_cut, observed=False)
print(grouped.apply(get_stats).unstack())
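Equal-width binning is easier to follow on a handful of numbers. The sketch below applies the same pd.cut call to a short made-up series (the values are invented, not taken from churn.csv) and counts the rows that fall into each bin.

```python
import pandas as pd

# Made-up totals, for illustration only
totals = pd.Series([10.0, 12.0, 18.0, 25.0, 31.0, 44.0, 50.0, 58.0, 60.0])

# An integer second argument asks for that many bins of equal width
bins = pd.cut(totals, 5, precision=2)

# Every bin spans the same width, but the number of rows per bin can differ
print(bins.cat.categories)
print(totals.groupby(bins, observed=False).count())
```

The bins are closed on the right, so a value that sits exactly on an edge, such as 50.0 here, belongs to the bin that ends at that edge.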

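7.3.1 Logistic Regression

Logistic regression measures the relationship between the independent variables and a binary dependent variable by estimating probabilities with the inverse logit function (also called the logistic function). This function maps any continuous value to a value between 0 and 1, which is required because the predicted values are probabilities, and probabilities must lie between 0 and 1. Logistic regression therefore predicts the probability of a particular outcome.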
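The mapping from continuous values to probabilities can be written in a few lines. The following is a minimal sketch with NumPy; the function names inverse_logit and logit are chosen here for illustration.

```python
import numpy as np

def inverse_logit(z):
    """Map any real number to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability back to the log-odds scale."""
    return np.log(p / (1.0 - p))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = inverse_logit(z)
print(p)         # all values lie between 0 and 1, and z = 0 maps to 0.5
print(logit(p))  # recovers the original z values up to rounding error
```

In a fitted model, z is the linear combination of the coefficients and the independent variables, so inverse_logit(z) is the predicted probability of the outcome.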

Logistic regression estimates the unknown β parameters with an iterative algorithm that computes maximum likelihood estimates.
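Newton's method is one common choice of iterative algorithm for this estimation. The sketch below is an assumption-laden illustration, not the code that statsmodels runs internally: the function fit_logit_newton, the synthetic data, and the true coefficients (-0.5, 1.5) are all made up here.

```python
import numpy as np

def fit_logit_newton(X, y, n_iter=25, tol=1e-10):
    """Maximize the logistic log-likelihood with Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Predicted probabilities under the current coefficients
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        # Gradient of the log-likelihood with respect to beta
        gradient = X.T @ (y - p)
        # Negative Hessian of the log-likelihood: X' W X with W = diag(p * (1 - p))
        neg_hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(neg_hessian, gradient)
        beta = beta + step
        # Stop once the update is negligible
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data generated from known coefficients (not the churn data)
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # column of ones for the intercept
true_beta = np.array([-0.5, 1.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_hat = fit_logit_newton(X, y)
print(beta_hat)
```

Each iteration improves the coefficients using the slope and curvature of the log-likelihood. At the maximum the gradient is zero, which is why the loop stops when the step becomes tiny. The estimates land near the true coefficients but not exactly on them, because the data are a random sample.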

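The syntax for logistic regression differs slightly from that for linear regression. For logistic regression, you specify the dependent and independent variables separately rather than writing them in a single formula: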

dependent_variable = churn['churn01']
independent_variables = churn[['account_length', 'custserv_calls', 'total_charges']]
independent_variables_with_constant = sm.add_constant(independent_variables, prepend=True)
logit_model = sm.Logit(dependent_variable, independent_variables_with_constant).fit()
print(logit_model.summary())
print("\nQuantities you can extract from the result:\n%s" % dir(logit_model))
print("\nCoefficients:\n%s" % logit_model.params)
print("\nCoefficient Std Errors:\n%s" % logit_model.bse)

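The first line creates a variable, dependent_variable, and assigns it the series of values in the churn01 column.
Likewise, the second line selects the three columns used as independent variables and assigns them to independent_variables.
Next, the add_constant function from statsmodels adds a column of 1s to the input variables.
The following line fits the logistic model and assigns the fitted result to logit_model.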