TensorFlow2利用泰坦尼克号获救CSV数据集完成数据预处理

本案例使用TensorFlow2加载CSV数据到tf.data.Dataset中,使用经典的数据集:泰坦尼克乘客数据。

1. 导入所需的库

import tensorflow as tf
import numpy as np
import pandas as pd

import functools

for i in [tf,np,pd]:
    print(i.__name__,": ",i.__version__,sep="")

输出:

tensorflow: 2.2.0
numpy: 1.17.4
pandas: 0.25.3

2. 下载并导入数据

2.1 下载数据到本地

trainDataUrl = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
testDataUrl = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

trainFilePath = tf.keras.utils.get_file("trainTitanic.csv",trainDataUrl)
testFilePath = tf.keras.utils.get_file("testTitanic.csv",testDataUrl)

输出:

Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 1s 29us/step
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv
16384/13049 [=====================================] - 0s 28us/step

Windows系统中下载的文件保存在:系统盘:\users\用户名.keras\datasets目录下

2.2 加载数据

labelColumn = "survived"  # 指定数据标签的列名 
labels = [0,1]

def getDataset(filePath, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(filePath,
                                                   batch_size=5,
                                                   label_name=labelColumn,
                                                   na_value="?",
                                                   num_epochs=1,
                                                   ignore_errors=True,
                                                   **kwargs)
    return dataset

rawTrainData = getDataset(trainFilePath)
rawTestData = getDataset(testFilePath)
def showBatch(dataset):
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}:{}".format(key,value.numpy()))
        print("{:20s}:{}".format("label",label.numpy()))
            
showBatch(rawTrainData)

输出:

sex                 :[b'male' b'female' b'male' b'male' b'male']
age                 :[50. 30. 28. 31. 27.]
n_siblings_spouses  :[0 0 0 1 0]
parch               :[0 0 0 0 0]
fare                :[ 13.     106.425    8.4583  52.       8.6625]
class               :[b'Second' b'First' b'Third' b'First' b'Third']
deck                :[b'unknown' b'unknown' b'unknown' b'B' b'unknown']
embark_town         :[b'Southampton' b'Cherbourg' b'Queenstown' b'Southampton' b'Southampton']
alone               :[b'y' b'y' b'y' b'n' b'y']
label               :[0 1 0 0 1]

3. 数据预处理

通过CSV文件导入的数据每列的数据类型可能不一样,这就需要将数据喂给模型前进行数据预处理。可以使用sklearn等工具进行前处理,再将数据传给TensorFlow。也可以使用TensorFlow内置的tf.feature_column工具,使用该工具的优点是如果训练的模型需要保存或分享给他人,则数据预处理的部分也会被保存。

3.1 特征选择——连续性数据

selectColumns = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare'] # 选择其中的几列进行分析
defaults = [0, 0.0, 0.0, 0.0, 0.0]
tempDataset = getDataset(trainFilePath,
                        select_columns=selectColumns,
                        column_defaults=defaults)

showBatch(tempDataset)

输出:

age                 :[25.  43.  18.  55.5 47. ]
n_siblings_spouses  :[1. 0. 0. 0. 1.]
parch               :[2. 0. 0. 0. 1.]
fare                :[151.55     8.05     7.7958   8.05    52.5542]
label               :[0 0 0 0 1]
example_batch, labels_batch = next(iter(tempDataset))

# 将所有列打包到一起
def pack(features, label):
    return tf.stack(list(features.values()),axis=1),label

packed_dataset = tempDataset.map(pack)

for fetures, labels in packed_dataset.take(1):
    print(fetures.numpy(),labels.numpy(),sep="\n\n")

输出:

[[28.      1.      0.     14.4542]
 [28.      0.      0.      7.2292]
 [52.      1.      0.     78.2667]
 [48.      1.      2.     65.    ]
 [30.5     0.      0.      8.05  ]]

[0 0 1 1 0]
# 定义一个通用的预处理类:选择部分特征并打包到单列中
class PackNumericFeatures(object):
    def __init__(self, names):
        self.names = names
    def __call__(self, features, labels):
        numeric_features = [features.pop(name) for name in self.names]
        numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
        numeric_features = tf.stack(numeric_features, axis=1)
        features["numeric"] = numeric_features
        
        return features, labels

numericFeatures = ["age","n_siblings_spouses","parch","fare"]

packed_train_data = rawTrainData.map(PackNumericFeatures(numericFeatures))
packed_test_data = rawTestData.map(PackNumericFeatures(numericFeatures))

showBatch(packed_train_data)

输出:

sex                 :[b'male' b'male' b'male' b'male' b'male']
class               :[b'Third' b'Third' b'Third' b'Third' b'Second']
deck                :[b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town         :[b'Cherbourg' b'Southampton' b'Queenstown' b'Cherbourg' b'Southampton']
alone               :[b'n' b'y' b'n' b'y' b'y']
numeric             :[[15.      1.      1.      7.2292]
 [25.      0.      0.      7.05  ]
 [ 7.      4.      1.     29.125 ]
 [28.      0.      0.      7.8958]
 [28.      0.      0.     13.    ]]
label               :[0 0 0 0 1]
example_batch, labels_batch = next(iter(packed_train_data))

3.2 连续性数据归一化

连接性数据通常需要做归一化操作。

desc = pd.read_csv(trainFilePath)[numericFeatures].describe()
desc

输出:

mean = np.array(desc.T["mean"])
std = np.array(desc.T["std"])

print(mean, type(mean))
print(std, type(std))

输出:

[29.63130781  0.54545455  0.37958533 34.38539856] <class 'numpy.ndarray'>
[12.51181763  1.1510896   0.79299921 54.5977305 ] <class 'numpy.ndarray'>
def normalization(data, mean, std):
    return (data - mean)/std

normalizer = functools.partial(normalization,mean=mean, std=std)

numeric_column = tf.feature_column.numeric_column("numeric",normalizer_fn=normalizer,shape=[len(numericFeatures)])
numeric_columns = [numeric_column]
numeric_column

输出:

NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalization at 0x000000002E16C2F0>, mean=array([29.63130781,  0.54545455,  0.37958533, 34.38539856]), std=array([12.51181763,  1.1510896 ,  0.79299921, 54.5977305 ])))
In [49]:
example_batch["numeric"]

输出:

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[28.  ,  1.  ,  0.  , 24.15],
       [51.  ,  0.  ,  0.  ,  8.05],
       [ 6.  ,  0.  ,  1.  , 33.  ],
       [26.  ,  0.  ,  0.  , 10.5 ],
       [16.  ,  0.  ,  0.  , 26.  ]], dtype=float32)>
numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()

输出:

array([[-0.13038135,  0.39488277, -0.4786705 , -0.18746932],
       [ 1.7078807 , -0.47385937, -0.4786705 , -0.4823534 ],
       [-1.888719  , -0.47385937,  0.7823648 , -0.02537466],
       [-0.2902302 , -0.47385937, -0.4786705 , -0.4374797 ],
       [-1.0894746 , -0.47385937, -0.4786705 , -0.15358512]],
      dtype=float32)

3.3 特征选择——离散性数据

CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southhampton', 'Queenstown'],
    'alone' : ['y', 'n']
}

categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))

categorical_columns

输出:

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southhampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[0])

输出:

[0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]

3.4 合并连续数据和离散数据层

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numeric_columns)

print(preprocessing_layer(example_batch).numpy()[0])

输出:

[ 0.          1.          0.          0.          1.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          1.
 -0.13038135  0.39488277 -0.4786705  -0.18746932  1.          0.        ]

4. 构建模型

model = tf.keras.Sequential([
    preprocessing_layer,
    tf.keras.layers.Dense(128,activation="relu"),
    tf.keras.layers.Dense(128,activation="relu"),
    tf.keras.layers.Dense(1)
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
             optimizer="adam",
             metrics=["accuracy"])

5. 训练及评估模型

train_data = packed_train_data.shuffle(500)
test_data = packed_test_data

model.fit(train_data, epochs=20)

输出:

Epoch 1/20
126/126 [==============================] - 0s 3ms/step - loss: 0.4894 - accuracy: 0.7544
Epoch 2/20
126/126 [==============================] - 0s 984us/step - loss: 0.4147 - accuracy: 0.8230
Epoch 3/20
126/126 [==============================] - 0s 921us/step - loss: 0.4002 - accuracy: 0.8309
Epoch 4/20
126/126 [==============================] - 0s 849us/step - loss: 0.3894 - accuracy: 0.8325
Epoch 5/20
126/126 [==============================] - 0s 841us/step - loss: 0.3812 - accuracy: 0.8485
Epoch 6/20
126/126 [==============================] - 0s 833us/step - loss: 0.3729 - accuracy: 0.8341
Epoch 7/20
126/126 [==============================] - 0s 849us/step - loss: 0.3716 - accuracy: 0.8421
Epoch 8/20
126/126 [==============================] - 0s 984us/step - loss: 0.3647 - accuracy: 0.8453
Epoch 9/20
126/126 [==============================] - 0s 770us/step - loss: 0.3472 - accuracy: 0.8501
Epoch 10/20
126/126 [==============================] - 0s 794us/step - loss: 0.3470 - accuracy: 0.8533
Epoch 11/20
126/126 [==============================] - 0s 841us/step - loss: 0.3449 - accuracy: 0.8421
Epoch 12/20
126/126 [==============================] - 0s 873us/step - loss: 0.3360 - accuracy: 0.8485
Epoch 13/20
126/126 [==============================] - 0s 857us/step - loss: 0.3313 - accuracy: 0.8565
Epoch 14/20
126/126 [==============================] - 0s 857us/step - loss: 0.3293 - accuracy: 0.8533
Epoch 15/20
126/126 [==============================] - 0s 873us/step - loss: 0.3236 - accuracy: 0.8644
Epoch 16/20
126/126 [==============================] - 0s 897us/step - loss: 0.3336 - accuracy: 0.8581
Epoch 17/20
126/126 [==============================] - 0s 770us/step - loss: 0.3185 - accuracy: 0.8565
Epoch 18/20
126/126 [==============================] - 0s 778us/step - loss: 0.3118 - accuracy: 0.8596
Epoch 19/20
126/126 [==============================] - 0s 794us/step - loss: 0.3130 - accuracy: 0.8581
Epoch 20/20
126/126 [==============================] - 0s 929us/step - loss: 0.3099 - accuracy: 0.8644
Out[58]:
<tensorflow.python.keras.callbacks.History at 0x302be5c0>
test_loss, test_accuracy = model.evaluate(test_data)

print("\nTest Loss: {}, Test Accuracy: {}".format(test_loss,test_accuracy))

输出:

53/53 [==============================] - 0s 906us/step - loss: 0.4588 - accuracy: 0.8561

Test Loss: 0.45877906680107117, Test Accuracy: 0.8560606241226196
In [60]:
predictions = model.predict(test_data)

for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
    prediction = tf.sigmoid(prediction).numpy()
    print("Predicted survied: {:.2%}".format(prediction[0]),
         "| Actual outcome: ",("survived" if bool(survived) else "died"))

输出:

Predicted survied: 39.23% | Actual outcome:  survived
Predicted survied: 99.98% | Actual outcome:  survived
Predicted survied: 86.77% | Actual outcome:  died
Predicted survied: 79.44% | Actual outcome:  survived
Predicted survied: 30.34% | Actual outcome:  died

 

 

 

 

 

 

 

 

 

 

  • 0
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值