[30days eat tf2.0] Structured Data Modeling Workflow Example


嵌入世一根葱


Category: Deep Learning. Tags: deep learning, tensorflow

Article link: https://blog.csdn.net/xiangrikuihuazi/article/details/111467558


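This post walks through the workflow for modeling structured data with TensorFlow 2.0: data preprocessing, building the model with the Sequential API, training, evaluation, and saving the model. It works out the parameter counts of a three-layer fully connected network, trains it with the built-in fit method, and predicts both probabilities and classes.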


Reference: https://jackiexiao.github.io/eat_tensorflow2_in_30_days/chinese/1.%E5%BB%BA%E6%A8%A1%E6%B5%81%E7%A8%8B/1-2%2C%E5%9B%BE%E7%89%87%E6%95%B0%E6%8D%AE%E5%BB%BA%E6%A8%A1%E6%B5%81%E7%A8%8B%E8%8C%83%E4%BE%8B/#%E4%BA%8C%E5%AE%9A%E4%B9%89%E6%A8%A1%E5%9E%8B

1. Data preprocessing

import pandas as pd

def preprocessing(dfdata):
    dfresult = pd.DataFrame()

    # Pclass  yb: ticket class held by the passenger, three values (1, 2, 3) [converted to one-hot encoding]
    # yb: pd.get_dummies performs the one-hot encoding. pandas >= 2.0 returns True/False
    # unless a dtype is passed; the tables in these comments show the values as 1/0.
    dfPclass = pd.get_dummies(dfdata['Pclass'])
    # dfPclass:
    #     1  2  3   (the three values)
    # 0   1  0  0
    # 1   1  0  0
    # 2   0  1  0
    # 3   0  1  0
    # 4   0  0  1
    # 5   0  1  0
    dfPclass.columns = ['Pclass_' + str(x) for x in dfPclass.columns]
    dfresult = pd.concat([dfresult, dfPclass], axis=1)  # yb: join dfresult and dfPclass side by side (axis=1 adds columns)
    # dfresult:
    #     Pclass_1  Pclass_2  Pclass_3
    # 0   1         0         0
    # 1   1         0         0
    # 2   0         1         0
    # 3   0         1         0
    # 4   0         0         1
    # 5   0         1         0

    # Sex  yb: passenger sex [converted to boolean features]
    dfSex = pd.get_dummies(dfdata['Sex'])  # yb: one-hot encoding; columns come out in sorted order (female, male)
    # dfSex:
    #     female  male
    # 0   0       1
    # 1   1       0
    # 2   1       0
    # 3   0       1
    # 4   0       1
    # 5   1       0
    dfresult = pd.concat([dfresult, dfSex], axis=1)  # yb: join dfresult and dfSex side by side
    # dfresult:
    #     Pclass_1  Pclass_2  Pclass_3  female  male
    # 0   1         0         0         0       1
    # 1   1         0         0         1       0
    # 2   0         1         0         1       0
    # 3   0         1         0         0       1
    # 4   0         0         1         0       1
    # 5   0         1         0         1       0

    # Age  yb: passenger age (has missing values) [numeric feature, plus an "age is missing" helper feature]
    dfresult['Age'] = dfdata['Age'].fillna(0)  # yb: .fillna(0) fills missing values with 0; dfresult['Age'] adds a column
    # dfresult:
    #     Pclass_1  Pclass_2  Pclass_3  female  male  Age
    # 0   1         0         0         0       1     55
    # 1   1         0         0         1       0     49
    # 2   0         1         0         1       0     36
    # 3   0         1         0         0       1     19
    # 4   0         0         1         0       1     14
    # 5   0         1         0         1       0     0 (NaN)
    # yb: marks missing values in the 'Age' column with 1; pd.isna finds missing values, .astype('int32') converts the type
    dfresult['Age_null'] = pd.isna(dfdata['Age']).astype('int32')
    # dfresult:
    #     Pclass_1  Pclass_2  Pclass_3  female  male  Age      Age_null
    # 0   1         0         0         0       1     55       0
    # 1   1         0         0         1       0     49       0
    # 2   0         1         0         1       0     36       0
    # 3   0         1         0         0       1     19       0
    # 4   0         0         1         0       1     14       0
    # 5   0         1         0         1       0     0 (NaN)  1

    # SibSp, Parch, Fare  yb: number of siblings/spouses aboard (integer) [numeric feature],
    # number of parents/children aboard (integer) [numeric feature],
    # ticket price (float, roughly 0 to 500) [numeric feature]
    dfresult['SibSp'] = dfdata['SibSp']  # yb: add a column
    dfresult['Parch'] = dfdata['Parch']
    dfresult['Fare'] = dfdata['Fare']
    # dfresult:
    #     Pclass_1  Pclass_2  Pclass_3  female  male  Age      Age_null  SibSp  Parch  Fare
    # 0   1         0         0         0       1     55       0         0      0      30.5000
    # 1   1         0         0         1       0     49       0         1      0      76.7292
    # 2   0         1         0         1       0     36       0         0      0      13.0000
    # 3   0         1         0         0       1     19       0         0      0      39.6875
    # 4   0         0         1         0       1     14       0         4      1      16.0000
    # 5   0         1         0         1       0     0 (NaN)  1         0      0      7.2500

    # Cabin  yb: passenger cabin (has missing values) [add a "cabin is missing" helper feature]
    dfresult['Cabin_null'] = pd.isna(dfdata['Cabin']).astype('int32')
    # dfresult:
    #     Pclass_1  Pclass_2  Pclass_3  female  male  Age      Age_null  SibSp  Parch  Fare     Cabin_null
    # 0   1         0         0         0       1     55       0         0      0      30.5000  0
    # 1   1         0         0         1       0     49       0         1      0      76.7292  0
    # 2   0         1         0         1       0     36       0         0      0      13.0000  1
    # 3   0         1         0         0       1     19       0         0      0      39.6875  1
    # 4   0         0         1         0       1     14       0         4      1      16.0000  1
    # 5   0         1         0         1       0     0 (NaN)  1         0      0      7.2500   1

    # Embarked  yb: port of embarkation: S, C, Q (has missing values) [one-hot encoded into four columns: C, Q, S, nan]
    dfEmbarked = pd.get_dummies(dfdata['Embarked'], dummy_na=True)  # yb: dummy_na=True adds a column for NaN values
    dfEmbarked.columns = ['Embarked_' + str(x) for x in dfEmbarked.columns]
    # dfEmbarked:
    #     Embarked_C  Embarked_Q  Embarked_S  Embarked_nan
    # 0   0           0           1           0
    # 1   1           0           0           0
    # 2   0           0           1           0
    # 3   0           0           1           0
    # 4   0           0           1           0
    # 5   0           0           1           0
    dfresult = pd.concat([dfresult, dfEmbarked], axis=1)
    # dfresult:
    #     Pclass_1  Pclass_2  Pclass_3  female  male  Age      Age_null  SibSp  Parch  Fare     Cabin_null  Embarked_C  Embarked_Q  Embarked_S  Embarked_nan
    # 0   1         0         0         0       1     55       0         0      0      30.5000  0           0           0           1           0
    # 1   1         0         0         1       0     49       0         1      0      76.7292  0           1           0           0           0
    # 2   0         1         0         1       0     36       0         0      0      13.0000  1           0           0           1           0
    # 3   0         1         0         0       1     19       0         0      0      39.6875  1           0           0           1           0
    # 4   0         0         1         0       1     14       0         4      1      16.0000  1           0           0           1           0
    # 5   0         1         0         1       0     0 (NaN)  1         0      0      7.2500   1           0           0           1           0

    return dfresult

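# dftrain_raw and dftest_raw are the raw training and test DataFrames, loaded beforehand (the loading step is not shown in this post)
x_train = preprocessing(dftrain_raw)
y_train = dftrain_raw['Survived'].values

x_test = preprocessing(dftest_raw)
y_test = dftest_raw['Survived'].values

print("x_train.shape =", x_train.shape)
print("x_test.shape =", x_test.shape)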

x_train.shape = (712, 15)  # yb: training data has 712 rows and 15 columns
x_test.shape = (179, 15)  # yb: test data has 179 rows and 15 columns
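The two tricks used most in the preprocessing above are a "missing" indicator column and one-hot encoding with an extra NaN column. The snippet below is a minimal sketch of both on a hypothetical three-row toy DataFrame (the `toy` data and the `features` name are made up for illustration, not taken from the Titanic files). It passes `dtype="int32"` to `pd.get_dummies` so the result is 0/1 regardless of pandas version, which is an assumption beyond the original code.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the preprocessing steps.
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Embarked": ["S", None, "C"],
})

features = pd.DataFrame()

# Numeric feature: fill missing ages with 0 and record where they were missing.
features["Age"] = toy["Age"].fillna(0)
features["Age_null"] = toy["Age"].isna().astype("int32")

# Categorical feature: one-hot encode, with an extra column for missing values.
embarked = pd.get_dummies(toy["Embarked"], dummy_na=True, dtype="int32")
embarked.columns = ["Embarked_" + str(c) for c in embarked.columns]

# axis=1 appends the new columns next to the existing ones.
features = pd.concat([features, embarked], axis=1)

print(features)
```

The second row has both `Age_null` and `Embarked_nan` set to 1, so the model can tell a real age of 0 or an unknown port apart from observed values.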

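2. Defining the model
The Keras API offers three ways to build a model: stack layers in order with Sequential, build arbitrary architectures with the functional API, or subclass the Model base class to build a custom model.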

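import tensorflow as tf
from tensorflow.keras import models, layers

# yb: defined in tensorflow/python/keras/backend.py. Destroys the current TF graph and creates a new one, which helps avoid clutter from old models/layers.
tf.keras.backend.clear_session()

model = models.Sequential()  # yb: create a new network
model.add(layers.Dense(20, activation='relu', input_shape=(15,)))  # yb: add a fully connected Dense layer with 20 neurons and relu activation; input_shape declares the input dimension (15 features)
model.add(layers.Dense(10, activation='relu'))  # yb: add a fully connected Dense layer with 10 neurons and relu activation
model.add(layers.Dense(1, activation='sigmoid'))  # yb: add a fully connected Dense layer with 1 neuron and sigmoid activation

model.summary()  # yb: for a Keras model, prints the network structure and the parameter counts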

Model: "sequential"

Layer (type)         Output Shape     Param #
dense (Dense)        (None, 20)       320
dense_1 (Dense)      (None, 10)       210
dense_2 (Dense)      (None, 1)        11

Total params: 541
Trainable params: 541
Non-trainable params: 0
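To make the output shapes in the summary concrete, here is a rough NumPy sketch of the forward pass through three dense layers of sizes 15 -> 20 -> 10 -> 1. The weights are random and the helper names (`dense`, `relu`, `sigmoid`) are hypothetical; this only illustrates the arithmetic and is not how Keras implements `Dense` internally.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, w, b, activation):
    # One fully connected layer: activation(x @ W + b).
    return activation(x @ w + b)

sizes = [15, 20, 10, 1]  # input features, then units per layer
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = rng.normal(size=(4, 15))  # a made-up batch of 4 samples
h = x
for i, (w, b) in enumerate(zip(weights, biases)):
    is_last = i == len(weights) - 1
    h = dense(h, w, b, sigmoid if is_last else relu)

print(h.shape)  # one probability per sample
```

The leading `None` in the summary's output shapes is the batch dimension, which is 4 in this sketch.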

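Param # is the number of trainable parameters (weights plus biases) in each layer, computed as:
parameters = (input dimension + 1) * number of neurons
The +1 is there because every neuron also has a bias.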

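Layer 1: 15 inputs, 20 output neurons
w_{0,0} x_0 + w_{0,1} x_1 + w_{0,2} x_2 + w_{0,3} x_3 + … + w_{0,14} x_{14} + b_0 = y_0
w_{1,0} x_0 + w_{1,1} x_1 + w_{1,2} x_2 + w_{1,3} x_3 + … + w_{1,14} x_{14} + b_1 = y_1
w_{2,0} x_0 + w_{2,1} x_1 + w_{2,2} x_2 + w_{2,3} x_3 + … + w_{2,14} x_{14} + b_2 = y_2
…
w_{19,0} x_0 + w_{19,1} x_1 + w_{19,2} x_2 + w_{19,3} x_3 + … + w_{19,14} x_{14} + b_{19} = y_{19}
So: parameters = (15+1)*20 = 320
Layer 2: 20 inputs, 10 output neurons
So: parameters = (20+1)*10 = 210
Layer 3: 10 inputs, 1 output neuron
So: parameters = (10+1)*1 = 11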
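The per-layer arithmetic above can be checked with a few lines of plain Python. The function name `dense_param_count` is hypothetical; it just encodes the (input dimension + 1) * neurons rule stated in the text.

```python
def dense_param_count(n_inputs, n_units):
    # Each unit has one weight per input plus one bias.
    return (n_inputs + 1) * n_units

layer_sizes = [15, 20, 10, 1]  # input features, then units per Dense layer
counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [320, 210, 11]
print(sum(counts))  # 541
```

The total matches the "Total params: 541" line in the model summary.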

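3. Training the model
There are usually three ways to train a model: the built-in fit method, the built-in train_on_batch method, and a custom training loop. Here we use the built-in fit method, which is the most common and the simplest.
# This is a binary classification problem, so use the binary cross-entropy loss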

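model.compile(optimizer='adam',  # yb: compile configures training: the optimizer, the loss function, and the evaluation metrics; the optimizer here is adam
        loss='binary_crossentropy',  # yb: the loss function is binary cross-entropy
        metrics=['AUC'])  # yb: evaluation metric is AUC (area under the ROC curve), not accuracy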
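As a rough illustration of what the binary cross-entropy loss measures, the sketch below computes it by hand for a made-up batch of four labels and predicted probabilities. The function name, the toy numbers, and the clipping constant `eps` are illustrative assumptions, not values taken from the post or from Keras.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over (label, probability) pairs."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]          # hypothetical ground truth
probs = [0.9, 0.2, 0.6, 0.7]   # hypothetical predicted probabilities

loss = binary_crossentropy(labels, probs)
print(round(loss, 4))
```

A confident correct prediction (label 1, probability 0.9) contributes little to the loss, while a confident wrong one (label 0, probability 0.7) contributes the most.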

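history = model.fit(x_train, y_train,  # run training; x_train is the training data, y_train the training labels
                batch_size=64,  # each batch holds 64 samples
                epochs=30,  # number of passes over the training data
                validation_split=0.2  # hold out part of the training data for validation
               )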
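To see what these settings mean for the 712 training rows, the sketch below works out the split sizes and the number of batches per epoch. It assumes Keras rounds the training count down and takes the validation rows from the end of the data before shuffling; the function name `split_counts` is hypothetical.

```python
import math

def split_counts(n_samples, validation_split, batch_size):
    # Training rows: the first (1 - validation_split) fraction, rounded down.
    n_train = int(math.floor(n_samples * (1.0 - validation_split)))
    n_val = n_samples - n_train
    # The last batch of an epoch may be smaller than batch_size.
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, steps_per_epoch

n_train, n_val, steps = split_counts(712, 0.2, 64)
print(n_train, n_val, steps)  # 569 143 9
```

So each of the 30 epochs runs 9 weight updates on 569 samples and then evaluates on the 143 held-out samples.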

4. Evaluating the model

5. Using the model
# Predict probabilities

model.predict(x_test[0:10])
#model(tf.constant(x_test[0:10].values,dtype = tf.float32)) # equivalent form

array([[0.26501188],
       [0.40970832],
       [0.44285864],
       [0.78408605],
       [0.47650957],
       [0.43849158],
       [0.27426785],
       [0.5962582 ],
       [0.59476686],
       [0.17882936]], dtype=float32)

# Predict classes
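One simple way to turn the predicted probabilities into class labels is to threshold them. The sketch below applies a 0.5 threshold (an assumed, conventional choice, not stated in the post) to the ten probabilities printed above; with the NumPy array returned by model.predict, the equivalent would be `(probs > 0.5).astype("int32")`.

```python
# Probabilities copied from the model.predict output shown above.
probs = [0.26501188, 0.40970832, 0.44285864, 0.78408605, 0.47650957,
         0.43849158, 0.27426785, 0.5962582, 0.59476686, 0.17882936]

threshold = 0.5  # assumed decision threshold

# Class 1 (survived) if the probability exceeds the threshold, else class 0.
classes = [int(p > threshold) for p in probs]

print(classes)  # [0, 0, 0, 1, 0, 0, 0, 1, 1, 0]
```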