Datawhale Team Learning, Round 28: Hands-On Data Analysis

yep吖

Tags: python

Article link: https://blog.csdn.net/weixin_44259058/article/details/119749433

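Preface

Round 28 of Datawhale team learning is here, and I was lucky enough to get a spot! I chose the Datawhale open-source learning project "Hands-On Data Analysis" (动手学数据分析). The study period runs from 8.15 to 8.26, and this post records my first steps in data analysis. Because the tutorial itself is quite detailed, these notes only serve as a supplement.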

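Environment Setup

My environment is Windows 10 with Anaconda3.
1. Environment setup (Anaconda Prompt)
(1) Create an environment named data-A: conda create --name data-A python=3.8.5, then enter y
(Error: Collecting package metadata (current_repodata.json): failed. Fix: open the .condarc file at C:\Users\1207\.condarc and delete the Tsinghua mirror entries.)
(2) Activate the environment: conda activate data-A
(3) Install jupyter, numpy, and pandas
conda install -c conda-forge jupyter, then enter y
pip install numpy -i https://pypi.douban.com/simple/
pip install pandas -i https://pypi.douban.com/simple/

2. Open the project (Anaconda Prompt)
First switch to the folder where the files were downloaded (drive D) 👉 D: 👉 cd D:\hands-on-data-analysis
Then activate the data-A environment 👉 conda activate data-A
Then start 👉 jupyter notebook

3. Shut down Jupyter
First close the Jupyter browser tab
Then stop Jupyter in Anaconda Prompt 👉 press Ctrl+C
Then leave the data-A environment 👉 conda deactivate
Close cmd 👉 exit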

Learning Roadmap

Task01: Data Loading and Exploratory Data Analysis (8.16, 8.17)

Task01 reference (CSDN)

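1. Loading the data and first look

Think about it:
1) The difference between pd.read_csv() and pd.read_table():
pd.read_csv() (the result is a 4-row, 12-column table; every comma-separated field becomes its own column)
pd.read_table() (the result is a 4-row, 1-column table; each whole line is stored as a single column with the commas still inside the string, because the default separator is a tab rather than a comma)
👉 Reading a csv file with pd.read_table(): pd.read_table('train.csv', sep=',')
Idea: change the separator used by pd.read_table() from its default "\t" to ",".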
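The separator difference can be checked on a small in-memory file. This is a minimal sketch: the three-row CSV text is made up for illustration and only stands in for train.csv.

```python
import io
import pandas as pd

# Made-up CSV text standing in for train.csv (not the real Titanic data).
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n3,1,3\n"

# read_csv splits on commas by default.
by_csv = pd.read_csv(io.StringIO(csv_text))

# read_table splits on tabs by default, so each line lands in one column.
by_table_default = pd.read_table(io.StringIO(csv_text))

# Passing sep=',' makes read_table parse the file like read_csv.
by_table_comma = pd.read_table(io.StringIO(csv_text), sep=",")

print(by_csv.shape)
print(by_table_default.shape)
print(by_table_comma.equals(by_csv))
```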

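2) The chunksize parameter controls the size of each piece when iterating over the data: it is the number of rows in each chunk.
chunker = pd.read_csv('train.csv', chunksize=400)
👉 Type of chunker (the chunk reader): print(type(chunker))
👉 Print each chunk with a for loop:
for piece in chunker:
    print(type(piece))
    print(len(piece))

Idea: with chunksize set, the file is loaded piece by piece. The text is split into blocks of chunksize rows, and read_csv returns a TextFileReader object instead of a DataFrame. Iterating over that object yields one DataFrame per chunk, so per-chunk statistics can be computed and then combined.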
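As a rough illustration of combining per-chunk statistics, the sketch below computes a mean without ever holding the whole file in memory. The ten-row Fare column is made up, and chunksize=4 is chosen only to force several chunks.

```python
import io
import pandas as pd

# Made-up data: one numeric column with the values 1..10.
csv_text = "Fare\n" + "\n".join(str(i) for i in range(1, 11)) + "\n"

total = 0
rows = 0
chunk_sizes = []
# Each piece is an ordinary DataFrame with at most 4 rows.
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(piece))
    total += piece["Fare"].sum()
    rows += len(piece)

# Combine the per-chunk sums into an overall mean.
mean_fare = total / rows
print(chunk_sizes, mean_fare)
```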

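Problems:
1) Loading data from an absolute path fails (to check the current path: import os, os.getcwd())
Input: df = pd.read_csv('D:\hands-on-data-analysis\第一单元项目集合\train.csv')
Cause: in an ordinary string a backslash starts an escape sequence, so the \t in \train.csv is read as a tab character.
Fix 1: df = pd.read_csv('D:/hands-on-data-analysis/第一单元项目集合/train.csv')
Fix 2: df = pd.read_csv('D:\\hands-on-data-analysis\\第一单元项目集合\\train.csv'), or a raw string: df = pd.read_csv(r'D:\hands-on-data-analysis\第一单元项目集合\train.csv')
2) Chinese text is garbled in the saved file
Input: df.to_csv('train_chinese.csv')
Fix: df.to_csv('train_chinese.csv', encoding='GBK')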
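For the path problem in 1), the effect of the backslash can be seen without touching the file system. The short path D:\train.csv below is hypothetical and used only to keep the example small.

```python
from pathlib import PureWindowsPath

# Hypothetical short path; '\t' inside a normal string is a tab character.
broken = 'D:\train.csv'
raw = r'D:\train.csv'        # raw string keeps the backslash
escaped = 'D:\\train.csv'    # a doubled backslash is one literal backslash
forward = 'D:/train.csv'     # forward slashes also work on Windows

print('\t' in broken)
print(raw == escaped)
# PureWindowsPath treats both separators as the same path.
print(PureWindowsPath(forward) == PureWindowsPath(raw))
```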

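2. pandas basics

Think about it
1) Ways to delete a column
test1 = pd.read_csv('test_1.csv')
Method 1: del 👉 del test1['a']
Method 2: pop 👉 test1.pop('a')
2) Resetting the index after filtering
midage = midage.reset_index(drop=True)
👉 Filtering, or dropping rows with missing values during cleaning, leaves gaps in the index of a DataFrame or Series; reset_index() renumbers it from 0.
👉 drop: by default (drop=False) the old index is kept as a new data column when the new index is created. Pass drop=True to discard the old index instead.
3) Comparing iloc and loc
loc selects by row and column labels; iloc selects by integer position. After reset_index(drop=True) the row labels coincide with the positions, so the two lines below return the same data as long as Pclass, Name, and Sex sit at column positions 2, 3, and 4.
👉 midage.loc[[100,105,108],["Pclass","Name","Sex"]]
👉 midage.iloc[[100,105,108],[2,3,4]]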
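To make the label/position distinction concrete, here is a small sketch on a made-up four-row frame whose index deliberately does not start at 0.

```python
import pandas as pd

# Made-up frame; the index labels 10..13 differ from the positions 0..3.
df = pd.DataFrame(
    {"Pclass": [3, 1, 3, 2], "Name": ["a", "b", "c", "d"], "Age": [22, 38, 26, 35]},
    index=[10, 11, 12, 13],
)

older = df[df["Age"] > 25]  # keeps the original labels 11, 12, 13

by_label = older.loc[[11, 13], ["Pclass", "Name"]]  # labels
by_position = older.iloc[[0, 2], [0, 1]]            # positions
print(by_label.equals(by_position))

# After reset_index(drop=True) the labels are 0, 1, 2, same as the positions.
reset = older.reset_index(drop=True)
same_after_reset = reset.loc[[0, 2], ["Pclass", "Name"]].equals(reset.iloc[[0, 2], [0, 1]])
print(same_after_reset)
```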

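3. Exploratory data analysis

Think about it
1) Summary of sorting methods
👉 Sort by the values of a column
(by: the column name or names to sort by; ascending: sort order, True (default) for ascending, False for descending)
frame.sort_values('c', ascending=True): sort by column c, ascending
frame.sort_values(['a','c'], ascending=False): sort by columns a and c, descending
👉 Sort by index
(axis: which index to sort, 0 (default) for the row index, 1 for the column index)
frame.sort_index(): ascending by row index
frame.sort_index(axis=1): ascending by column index
frame.sort_index(axis=1, ascending=False): descending by column index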
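2) The describe function gives basic summary statistics
count: number of non-missing values
mean: mean
std: standard deviation
min: minimum
25%: 25th percentile
50%: 50th percentile (median)
75%: 75th percentile
max: maximum
👉 Basic statistics of the fare column (票价) in the Titanic data set: text["票价"].describe()
Summary: 891 fare values; mean about 32.20; standard deviation about 49.69, so fares vary widely; 25% of passengers paid less than 7.91, 50% less than 14.45, and 75% less than 31.00; the maximum is about 512.33 and the minimum is 0.
Takeaway: only a small share of passengers paid for expensive cabins and premium service; most were in the middle or lower range.
👉 Basic statistics of the parents/children count column (父母子女个数):
text["父母子女个数"].describe()
Summary: 891 values; mean about 0.3816 (if every passenger had at least one parent or child aboard, the mean would be at least 1); standard deviation about 0.8061, so the spread is small; the 25th, 50th, and 75th percentiles are all 0; the maximum is 6.
Takeaway: most passengers on this voyage traveled without parents or children.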
Problems
1) Reading a file that contains Chinese text: text = pd.read_csv('train_chinese.csv')
Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 0: invalid start byte
Fix: text = pd.read_csv('train_chinese.csv', encoding='GBK')

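Task02: Data Cleaning and Feature Processing (8.18, 8.19)

Task02 reference (CSDN)

Think about it
1) Options for handling missing values: ① drop the samples with missing values ② impute plausible values
👉 First detect the missing values
Compare with None: df[df['Age']==None]=0
Compare with np.nan: df[df['Age']==np.nan]=0
Find missing values with isnull: df[df['Age'].isnull()]=0
Note: the two == comparisons never match anything, because the comparison with None is False for every element and NaN is not equal to itself; only isnull() (or isna()) finds the missing entries. Also, df[mask]=0 sets every column of the matching rows to 0, not just Age.
👉 Then drop or fill the missing values
Drop the rows or columns that contain missing values with dropna(): df.dropna()
Fill missing values with a chosen value (for example 0) with fillna(): df.fillna(0)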
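The behaviour of the three detection attempts can be verified on a tiny made-up column. This is only a sketch; the Age values are invented.

```python
import numpy as np
import pandas as pd

# Made-up Age column with two missing entries (None is stored as NaN).
df = pd.DataFrame({"Age": [22.0, np.nan, 30.0, None]})

eq_none = (df["Age"] == None).sum()   # elementwise comparison, never True
eq_nan = (df["Age"] == np.nan).sum()  # NaN is not equal to itself
is_null = df["Age"].isnull().sum()    # the reliable check

filled = df["Age"].fillna(0)  # replace missing values with 0
dropped = df.dropna()         # drop rows that contain a missing value

print(eq_none, eq_nan, is_null, len(dropped))
```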
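2) Inspecting and handling duplicates
👉 Inspecting duplicates
Flag duplicated values: df['Age'].duplicated()
Count duplicated rows: df.duplicated().sum()
Show duplicated rows: df[df.duplicated()]
👉 Handling duplicates
Drop fully duplicated rows, keeping the first occurrence: df.drop_duplicates().head()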
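3) Feature processing
👉 Discretizing a continuous (numeric) variable
Split Age into 5 equal-width bins:
df['AgeBand'] = pd.cut(df['Age'], 5,labels = ['1','2','3','4','5'])
Split Age into the five bands (0,5] (5,15] (15,30] (30,50] (50,80] (pd.cut intervals are right-closed by default; pass right=False to get [0,5) [5,15) and so on):
df['AgeBand'] = pd.cut(df['Age'],[0,5,15,30,50,80],labels = ['1','2','3','4','5'])
Split Age at the 10%, 30%, 50%, 70%, and 90% quantiles into five bands (this needs pd.qcut; pd.cut would treat 0.1, 0.3, ... as ages, and values above the 90% quantile become NaN):
df['AgeBand'] = pd.qcut(df['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = ['1','2','3','4','5'])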
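A small sketch of the difference between fixed edges (pd.cut) and quantile edges (pd.qcut), using ten made-up ages rather than the Titanic column.

```python
import pandas as pd

ages = pd.Series([1, 5, 10, 20, 30, 40, 50, 60, 70, 80])  # made-up ages
labels = ["1", "2", "3", "4", "5"]

# Fixed edges: intervals are right-closed by default, so age 5 falls in (0, 5].
by_edges = pd.cut(ages, [0, 5, 15, 30, 50, 80], labels=labels)

# Quantile edges: qcut takes fractions between 0 and 1.
# Ages above the 90% quantile fall outside the last bin and become NaN.
by_quantile = pd.qcut(ages, [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=labels)

print(by_edges.tolist())
print(by_quantile.isna().sum())
```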
👉 Converting categorical text variables to numeric
First check the category names and counts
value_counts: df['Sex'].value_counts()
unique: df['Sex'].unique()
nunique: df['Sex'].nunique()
Then convert to numbers
replace: df['Sex_num'] = df['Sex'].replace(['male','female'],[1,2])
map: df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})
LabelEncoder from sklearn.preprocessing:

from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()
    # manual alternative: map each unique value to an integer in order of appearance
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    # LabelEncoder version; this overwrites the column created by the manual mapping
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))

df.head()

👉 Converting categorical text to one-hot encoding

for feat in ["Age", "Embarked"]:
    x = pd.get_dummies(df[feat], prefix=feat)
    df = pd.concat([df, x], axis=1)

df.head()
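The three encodings above can be compared side by side on a made-up three-row frame (the column values are invented; only the calls mirror the notes).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration.
df = pd.DataFrame({"Sex": ["male", "female", "female"], "Embarked": ["S", "C", "S"]})

# Explicit mapping: the codes are chosen by hand.
df["Sex_num"] = df["Sex"].map({"male": 1, "female": 2})

# LabelEncoder: codes follow the sorted class names, so C -> 0 and S -> 1.
df["Embarked_code"] = LabelEncoder().fit_transform(df["Embarked"])

# One-hot: one indicator column per category.
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked")
df = pd.concat([df, dummies], axis=1)

print(df)
```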

Problems
1) ModuleNotFoundError: No module named 'sklearn'
Fix: pip install scikit-learn

Task03: Data Restructuring (8.20, 8.21)

Data restructuring 1

1. Merging data
👉 concat

list_up=[text_left_up,text_right_up]
result_up=pd.concat(list_up,axis=1)
list_down=[text_left_down,text_right_down]
result_down=pd.concat(list_down,axis=1)
list_all=[result_up,result_down]
result=pd.concat(list_all)

👉 join (DataFrame method) + append (DataFrame method)
Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions replace the append call with pd.concat([result_up, result_down]).

result_up=text_left_up.join(text_right_up)
result_down=text_left_down.join(text_right_down)
result=result_up.append(result_down)

👉 merge (pandas function) + append (DataFrame method)

result_up=pd.merge(text_left_up,text_right_up,left_index=True,right_index=True)
result_down=pd.merge(text_left_down,text_right_down,left_index=True,right_index=True)
result=result_up.append(result_down)
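The three routes should produce the same table. The sketch below checks this on four made-up quarter frames and uses pd.concat for the vertical step so it also runs on pandas 2.x; the frame names are shortened stand-ins for text_left_up and friends.

```python
import pandas as pd

# Made-up quarters of a table: left/right halves of the top and bottom rows.
left_up = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
right_up = pd.DataFrame({"B": [10, 20]}, index=[0, 1])
left_down = pd.DataFrame({"A": [3, 4]}, index=[2, 3])
right_down = pd.DataFrame({"B": [30, 40]}, index=[2, 3])

# Route 1: concat sideways (axis=1), then concat the halves vertically.
via_concat = pd.concat(
    [pd.concat([left_up, right_up], axis=1), pd.concat([left_down, right_down], axis=1)]
)

# Route 2: join on the index, then stack vertically with concat.
via_join = pd.concat([left_up.join(right_up), left_down.join(right_down)])

# Route 3: merge on the index, then stack vertically with concat.
via_merge = pd.concat([
    pd.merge(left_up, right_up, left_index=True, right_index=True),
    pd.merge(left_down, right_down, left_index=True, right_index=True),
])

print(via_concat)
```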

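Data restructuring 2

Think about it
1) The GroupBy mechanism: group by some key. For example, grouping a data set by column A splits the rows into one group per distinct value of A, applies a function to each group, and combines the results.
[Figure: GroupBy illustration, image not preserved]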

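Data aggregation and computation: applying it to the data

1. Average fare for men and women
df = text['Fare'].groupby(text['Sex'])
means = df.mean()
2. Number of survivors by sex
survived_sex=text['Survived'].groupby(text['Sex']).sum()
3. Number of survivors by cabin class
survived_num=text['Survived'].groupby(text['Pclass']).sum()
Conclusions
Women paid a higher average fare than men; more women than men survived;
Survivor counts by class rank as class 1 > class 3 > class 2.
4. Average fare for each age within each ticket class
text.groupby(['Pclass','Age'])['Fare'].mean()
5. Among the survivors, find the age that accounts for the largest share

# First count the survivors at each age (each distinct Age value, not an age band)
survived_age = text['Survived'].groupby(text['Age']).sum()
# Age(s) with the most survivors and the corresponding count
survived_age[survived_age.values==survived_age.max()]
# Total number of survivors
_sum = text['Survived'].sum()
# Largest share: most survivors at a single age / total survivors
# (a share of all survivors, not the survival rate within that age)
percent = survived_age.max()/_sum
# Print the results
print("sum of person:"+str(_sum))
print("largest share of survivors: "+str(percent))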
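The code above measures each age's share of all survivors. If the goal is the survival rate within an age band, one possible approach (not the tutorial's) is to bin Age with pd.cut and take the mean of Survived per band. The eight rows and the band edges below are made up for illustration.

```python
import pandas as pd

# Made-up passengers; Survived is 1 for survivors and 0 otherwise.
text = pd.DataFrame({
    "Age": [2, 4, 20, 25, 28, 40, 45, 60],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Hypothetical band edges; intervals are right-closed, e.g. (0, 15].
bands = pd.cut(text["Age"], [0, 15, 30, 50, 80])

# The mean of a 0/1 column is the survival rate within each band.
rate_by_band = text.groupby(bands, observed=True)["Survived"].mean()
best_band = rate_by_band.idxmax()

print(rate_by_band)
print(best_band)
```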

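Task04: Data Visualization (8.22, 8.23)

Task4 reference video

1. Distribution of survivors and deaths among men and women in the Titanic data set (bar chart)

import matplotlib.pyplot as plt
sex_survived=text.groupby(['Sex','Survived'])['Survived'].count().unstack()
died=sex_survived[0]
died.plot.bar()
plt.title('died')  # deaths by sex
sex_survived.plot.bar()  # survivors and deaths by sex, side by side
sex_survived.plot(kind='bar',stacked=True)  # same data as a stacked bar chart

2. Distribution of survivors and deaths by fare (line chart)

fare=text.groupby(['Fare','Survived'])['Survived'].count().unstack()
fare.plot()

3. Distribution of survivors and deaths by cabin class (bar chart)

pclass=text.groupby(['Pclass','Survived'])['Survived'].count().unstack()
pclass.plot.bar()

Conclusion 1
Women survived in larger numbers and at a higher rate; low fares go with more deaths and a higher death rate; first class had the highest chance of survival and third class the highest chance of death.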

4. Distribution of survivors and deaths by age (age split into five bins; histogram plus density curve)

# Normalized histograms
text.Age[text.Survived==0]  # ages of those who died
text.Age[text.Survived==0].hist(bins=5,alpha=0.5,density=1)  # alpha: transparency; density=1: normalize to a density
text.Age[text.Survived==1].hist(bins=5,alpha=0.5,density=1)  # ages of the survivors
# Density curves
text.Age[text.Survived==0].plot.density()
text.Age[text.Survived==1].plot.density()
plt.legend([0,1])
plt.xlabel('age')
plt.ylabel('density')
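5. Age distribution by cabin class (line chart)
👉 First store the cabin classes

unique_pclass=text.Pclass.unique()  # store the cabin classes
unique_pclass.sort()  # sort in place
unique_pclass

👉 Then draw one density curve per cabin class in a for loop (matplotlib.pyplot)

for i in unique_pclass:
    text.Age[text.Pclass==i].plot.density()  # one density curve per class
plt.xlabel('age')
plt.legend(unique_pclass)

👉 The same loop with seaborn

import seaborn as sns
for i in unique_pclass:
    sns.kdeplot(text.Age[text.Pclass==i],fill=True,linewidth=0)  # fill replaces the deprecated shade parameter (seaborn 0.11+)

Conclusion 2
The density curves show more clearly that among young children the survivor density is higher than the death density, for young adults it is the other way round, and from middle age onward the two are about equal.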
Problems
1) Input: pip install mathplotlib (I mistyped the name, sob ┭┮﹏┭┮)
Error: ERROR: Could not find a version that satisfies the requirement mathplotlib
Fix: pip install matplotlib, or pip install matplotlib -i https://pypi.douban.com/simple/
2) Input: import seaborn as sns
Error: ModuleNotFoundError: No module named 'seaborn'
Fix: pip install seaborn -i https://pypi.douban.com/simple/

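Task05: Modeling and Model Evaluation (8.24, 8.25, 8.26)

Task5 reference video

1. Building the model

1) Splitting the data set

👉 Function: train_test_split
(import: from sklearn.model_selection import train_test_split)
(split: X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42))
X: features (independent variables)
y: target (dependent variable)
test_size: defaults to 0.25
stratify: stratified sampling by the given labels, so the class proportions are preserved in both splits
random_state: random seed; passing any fixed integer makes the shuffle, and therefore the split, identical on every run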
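A minimal sketch of what stratify and random_state do, on a made-up data set of 16 samples with 25% positives (sizes chosen so the proportions divide evenly).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 16 samples, 2 features, 4 positives (25%).
X = np.arange(32).reshape(16, 2)
y = np.array([0] * 12 + [1] * 4)

# Default test_size=0.25 gives 4 test samples; stratify keeps 25% positives in each split.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# The same seed reproduces exactly the same split.
X_train2, X_test2, _, _ = train_test_split(X, y, stratify=y, random_state=42)

print(len(y_train), len(y_test), y_train.sum(), y_test.sum())
```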

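2) Creating the model

Option 1: a linear classification model, LogisticRegression
PS. Despite its name, logistic regression is a classification model, not a regression model; do not confuse it with LinearRegression. It is commonly used for binary classification.
Import: from sklearn.linear_model import LogisticRegression
👉 Training: fit(X, y, sample_weight=None)
X: features
y: target
sample_weight: per-sample weights (not used here because this data set is not severely imbalanced)
👉 Prediction: predict(X)
👉 Probability estimates: predict_proba(X), returns the probability of each label for every sample
👉 Mean accuracy: score(X, y, sample_weight=None), compares the labels from predict with the given labels and returns the mean accuracy

#1. Train the model; fit returns the fitted logistic regression model
lr = LogisticRegression().fit(X_train, y_train)
lr
👉LogisticRegression()
#2. Score on the training set
lr.score(X_train,y_train)
👉0.7994011976047904
#3. Format the results
print('训练集得分：{:.3f}'.format(lr.score(X_train,y_train)))
print('测试集得分：{:.3f}'.format(lr.score(X_test,y_test)))
👉Training score: 0.799
👉Test score: 0.771
#4. Set C=2000 (default 1.0). C is the inverse of the regularization strength: a smaller C means stronger regularization and a simpler model, so C=2000 nearly switches regularization off
lr1 = LogisticRegression(C=2000).fit(X_train, y_train)
print('训练集得分：{:.3f}'.format(lr1.score(X_train,y_train)))
print('测试集得分：{:.3f}'.format(lr1.score(X_test,y_test)))
👉Training score: 0.807
👉Test score: 0.785
#5. Set the class weight parameter class_weight='balanced'
lr2 = LogisticRegression(class_weight='balanced').fit(X_train, y_train)
print('训练集得分：{:.3f}'.format(lr2.score(X_train,y_train)))
print('测试集得分：{:.3f}'.format(lr2.score(X_test,y_test)))
👉Training score: 0.787
👉Test score: 0.771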

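Option 2: a tree-based classification model, RandomForestClassifier (decision trees, random forest)
PS. A random forest is an ensemble of decision trees, built to reduce the overfitting of a single tree and to be more robust to noise. Each tree makes its own classification errors; averaging the trees' predictions cancels out much of each tree's individual error and offsets their overfitting.
Import: from sklearn.ensemble import RandomForestClassifier
👉 Training: fit(X, y, sample_weight=None)
X: features
y: target
sample_weight: per-sample weights (not used here because this data set is not severely imbalanced)
👉 Prediction: predict(X)
👉 Probability estimates: predict_proba(X), returns the probability of each label for every sample
👉 Mean accuracy: score(X, y, sample_weight=None), compares the labels from predict with the given labels and returns the mean accuracy

#1. Train the model
rf = RandomForestClassifier().fit(X_train, y_train)
print('训练集得分：{:.3f}'.format(rf.score(X_train,y_train)))
print('测试集得分：{:.3f}'.format(rf.score(X_test,y_test)))
👉Training score: 1.000
👉Test score: 0.798
#2. Tune n_estimators, the number of trees in the forest
rf1 = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)
👉Training score: 1.000
👉Test score: 0.803
#3. Tune max_depth, the maximum depth of each tree
rf2 = RandomForestClassifier(max_depth=5).fit(X_train, y_train)
👉Training score: 0.871
👉Test score: 0.789
#4. Tune bootstrap, whether each tree is built from a sample drawn with replacement
rf3 = RandomForestClassifier(bootstrap=False).fit(X_train, y_train)
#5. Tune oob_score, whether to estimate generalization accuracy from the out-of-bag samples (a built-in alternative to cross-validation; requires bootstrap=True)
rf4 = RandomForestClassifier(oob_score=True).fit(X_train, y_train)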

3) Model predictions

predict(): lr.predict(X_train)
predict_proba(): lr.predict_proba(X_train)

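2. Model evaluation

Model evaluation tells us how well a model generalizes, that is, whether it is actually useful, and helps decide which model to use. How good a model is depends first on whether the data suit it, and then on the evaluation method and the parameter choices.
Task 1: how to split training and test sets more sensibly; approach: k-fold cross-validation
Task 2: which method to use to evaluate the model; approach: confusion matrix
Task 3: how to summarize the many confusion matrices that result (one per decision threshold); approach: ROC curve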

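1) Cross-validation

Cross-validation: cross_val_score
Cross-validation is introduced so the result does not depend on one lucky or unlucky split. It is a statistical method for estimating generalization performance that is more stable and thorough than a single train/test split. The data are split several times and several models are trained. The most common form is k-fold cross-validation, where k is chosen by the user, usually 5 or 10.
Import: from sklearn.model_selection import cross_val_score

#1. Evaluate a logistic regression model with 10-fold cross-validation
lr=LogisticRegression()
cross_val_score(lr,X_train,y_train,cv=10)
👉array([0.79104478, 0.71641791, 0.80597015, 0.73134328, 0.91044776,
         0.7761194 , 0.88059701, 0.76119403, 0.81818182, 0.77272727])
#2. Mean of the cross-validation scores
score=cross_val_score(lr,X_train,y_train,cv=10)
score.mean()
👉0.7964043419267299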
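For readers without the Titanic features at hand, the same call can be tried on synthetic data. This is only a sketch: make_classification generates an artificial data set, so the scores say nothing about the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (not the Titanic features).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cv=5 trains five models, each evaluated on a different held-out fold.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)

# Report the mean together with the spread across folds.
print(scores.mean(), scores.std())
```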

2) Confusion matrix

Confusion matrix: confusion_matrix, classification_report
Used to assess classification accuracy. Precision or recall alone cannot tell you whether a model is good; when the two are badly out of balance, use the F1 score, their harmonic mean, to combine them.
Import:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

#1. confusion_matrix(): y_train holds the true labels, y_pred the predicted labels
lr=LogisticRegression().fit(X_train,y_train)
y_pred=lr.predict(X_train)
confusion_matrix(y_train, y_pred, labels=[0,1])
👉array([[357,  55],
       [ 79, 177]], dtype=int64)
#2. Check the counts of the true labels
y_train.value_counts()
👉0    412
👉1    256
#3. Classification report
print(classification_report(y_train, y_pred))
              precision    recall  f1-score   support

           0       0.82      0.87      0.84       412
           1       0.76      0.69      0.73       256

    accuracy                           0.80       668
   macro avg       0.79      0.78      0.78       668
weighted avg       0.80      0.80      0.80       668
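The class-1 row of the report can be recomputed by hand from the confusion matrix printed above, which is a useful check on what precision, recall, and F1 mean. The matrix values are copied from the output above; everything else is plain arithmetic.

```python
import numpy as np

# Confusion matrix from the output above: rows are true labels, columns are predictions.
cm = np.array([[357, 55], [79, 177]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of those predicted 1, how many are truly 1
recall = tp / (tp + fn)     # of those truly 1, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / cm.sum()

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```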

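3) ROC curve

ROC curve: roc_curve, plot_roc_curve
A simple way to summarize all the information (the confusion matrices obtained at every decision threshold); the larger the AUC, the better.
① Import: from sklearn.metrics import roc_curve
👉 sklearn.metrics.roc_curve(y_true, y_score, *, pos_label=None, sample_weight=None, drop_intermediate=True)
y_true: true labels of the test set
y_score: predicted scores; here the values returned by the classifier's decision_function (for example logistic regression).
👉 roc_curve() returns three arrays: fpr, tpr, thresholds.
fpr: false positive rates, the x-axis of the ROC curve
(share of actual negatives wrongly predicted as positive: FP / (FP + TN))
tpr: true positive rates, the y-axis of the ROC curve
(share of actual positives correctly predicted as positive: TP / (TP + FN), i.e. recall)
thresholds: the score cut-offs separating positive from negative predictions. Lowering the threshold labels more samples as positive, which raises both tpr and fpr. For decision_function scores, a threshold of 0 corresponds to the default decision boundary used by predict().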
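The definitions of fpr and tpr are easiest to see on a four-sample toy case. The labels and scores below are made up; roc_auc_score is included only to show the area under the resulting curve.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and scores: two negatives and two positives.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold gives one (fpr, tpr) point on the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(fpr)
print(tpr)
print(auc)
```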

# 1. roc_curve(); lr.decision_function(X_test) returns the confidence scores of the logistic regression
import numpy as np
fpr, tpr, thresholds =roc_curve(y_test, lr.decision_function(X_test))
# 2. The previous step only returns values; nothing has been drawn yet
plt.plot(fpr,tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# 3. Mark the point close_zero where thresholds is closest to 0: take absolute values, then np.argmin
# returns the index of the value closest to 0, which gives the matching fpr and tpr
plt.plot(fpr,tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
close_zero=np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero],tpr[close_zero],'o')
plt.title('LR ROC')
② To compare tuning results, several ROC curves can be drawn on the same axes. plot_roc_curve() draws the curve directly, without computing it step by step. (plot_roc_curve was deprecated in scikit-learn 1.0 and removed in 1.2; newer versions provide RocCurveDisplay.from_estimator for the same purpose.)
Import: from sklearn.metrics import plot_roc_curve
👉 sklearn.metrics.plot_roc_curve(estimator, X, y, *, sample_weight=None, drop_intermediate=True, response_method='auto', name=None, ax=None, pos_label=None, **kwargs)
estimator: a fitted classifier
X: test set features
y: test set true labels
response_method: {'predict_proba', 'decision_function', 'auto'}
Logistic regression (lr) uses decision_function(); random forest (rf) uses predict_proba()

from sklearn.metrics import plot_roc_curve
lr = LogisticRegression().fit(X_train, y_train)
lr1 = LogisticRegression(C=2000).fit(X_train, y_train)
lr2 = LogisticRegression(class_weight='balanced').fit(X_train, y_train)
rf = RandomForestClassifier().fit(X_train, y_train)
rf1 = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)

lr_display=plot_roc_curve(lr, X_test, y_test, name='LR',response_method='decision_function')
plot_roc_curve(lr1,X_test, y_test, name='LR1',response_method='decision_function',ax=lr_display.ax_)
plot_roc_curve(lr2,X_test, y_test, name='LR2',response_method='decision_function',ax=lr_display.ax_)
plot_roc_curve(rf,X_test, y_test, name='RF',response_method='predict_proba',ax=lr_display.ax_)
plot_roc_curve(rf1,X_test, y_test, name='RF1',response_method='predict_proba',ax=lr_display.ax_)

