This example uses TensorFlow 2 to load CSV data into a tf.data.Dataset, using a classic dataset: the Titanic passenger data.
1. Import the required libraries
import tensorflow as tf
import numpy as np
import pandas as pd
import functools
for i in [tf, np, pd]:
    print(i.__name__, ": ", i.__version__, sep="")
Output:
tensorflow: 2.2.0
numpy: 1.17.4
pandas: 0.25.3
2. Download and import the data
2.1 Download the data to the local machine
trainDataUrl = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
testDataUrl = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"
trainFilePath = tf.keras.utils.get_file("trainTitanic.csv",trainDataUrl)
testFilePath = tf.keras.utils.get_file("testTitanic.csv",testDataUrl)
Output:
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 1s 29us/step
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv
16384/13049 [=====================================] - 0s 28us/step
On Windows, the downloaded files are saved under <system drive>:\Users\<username>\.keras\datasets.
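As a minimal sketch, the cache location can be computed instead of looked up by hand. This assumes the default behaviour of tf.keras.utils.get_file, which caches under a .keras/datasets folder in the user's home directory when no cache_dir argument is given. The helper name default_keras_cache is hypothetical.

```python
import os

def default_keras_cache(filename):
    """Return the path where get_file is expected to cache `filename` by default."""
    # Assumption: no custom cache_dir was passed to get_file.
    return os.path.join(os.path.expanduser("~"), ".keras", "datasets", filename)

print(default_keras_cache("trainTitanic.csv"))
```

The returned path should match the value of trainFilePath on a default installation.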
2.2 Load the data
labelColumn = "survived"  # name of the label column
labels = [0, 1]
def getDataset(filePath, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        filePath,
        batch_size=5,
        label_name=labelColumn,
        na_value="?",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset
rawTrainData = getDataset(trainFilePath)
rawTestData = getDataset(testFilePath)
def showBatch(dataset):
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}:{}".format(key, value.numpy()))
        print("{:20s}:{}".format("label", label.numpy()))
showBatch(rawTrainData)
Output:
sex :[b'male' b'female' b'male' b'male' b'male']
age :[50. 30. 28. 31. 27.]
n_siblings_spouses :[0 0 0 1 0]
parch :[0 0 0 0 0]
fare :[ 13. 106.425 8.4583 52. 8.6625]
class :[b'Second' b'First' b'Third' b'First' b'Third']
deck :[b'unknown' b'unknown' b'unknown' b'B' b'unknown']
embark_town :[b'Southampton' b'Cherbourg' b'Queenstown' b'Southampton' b'Southampton']
alone :[b'y' b'y' b'y' b'n' b'y']
label :[0 1 0 0 1]
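The structure printed above (a dict of per-column arrays plus a separate label array) can be mimicked in plain Python. This is only a sketch of the batching idea, not how make_csv_dataset is implemented. The sample rows and the helper name to_batches are made up for illustration.

```python
def to_batches(rows, label_name, batch_size):
    """Group row dicts into (features, labels) batches stored column-wise."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        labels = [row[label_name] for row in chunk]
        features = {
            key: [row[key] for row in chunk]
            for key in chunk[0] if key != label_name
        }
        yield features, labels

# Hypothetical rows, not taken from the real file.
rows = [
    {"survived": 0, "sex": "male", "age": 22.0},
    {"survived": 1, "sex": "female", "age": 38.0},
    {"survived": 1, "sex": "female", "age": 26.0},
]
batches = list(to_batches(rows, label_name="survived", batch_size=2))
for features, labels in batches:
    print(features, labels)
```

The last batch is smaller than batch_size when the row count is not a multiple of it.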
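3. Data preprocessing
The columns imported from a CSV file may have different data types, so the data must be preprocessed before it is fed to the model. You can preprocess with tools such as sklearn and then pass the data to TensorFlow, or you can use TensorFlow's built-in tf.feature_column tools. The advantage of the latter is that when the trained model is saved or shared with others, the preprocessing step is saved along with it.
3.1 Feature selection: continuous data
selectColumns = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']  # select a few columns to analyze
defaults = [0, 0.0, 0.0, 0.0, 0.0]
tempDataset = getDataset(trainFilePath,
                         select_columns=selectColumns,
                         column_defaults=defaults)
showBatch(tempDataset)
Output: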
age :[25. 43. 18. 55.5 47. ]
n_siblings_spouses :[1. 0. 0. 0. 1.]
parch :[2. 0. 0. 0. 1.]
fare :[151.55 8.05 7.7958 8.05 52.5542]
label :[0 0 0 0 1]
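The effect of select_columns, column_defaults and na_value can be roughly emulated with the standard csv module. This is a sketch under the assumption that an empty or NA field falls back to its column default and that each default also fixes the column's type. The two-row sample and the helper name load_selected are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample; "?" marks a missing age.
SAMPLE_CSV = "survived,sex,age,fare\n0,male,22.0,7.25\n1,female,?,71.2833\n"

def load_selected(text, columns, defaults, na_value="?"):
    """Keep only `columns`, casting each value to the type of its default."""
    records = []
    for raw in csv.DictReader(io.StringIO(text)):
        record = {}
        for name, default in zip(columns, defaults):
            value = raw[name]
            # Empty or NA fields fall back to the default value.
            record[name] = default if value in ("", na_value) else type(default)(value)
        records.append(record)
    return records

selected = load_selected(SAMPLE_CSV, ["survived", "age", "fare"], [0, 0.0, 0.0])
print(selected)
```

The sex column is dropped because it is not listed, and the missing age becomes 0.0.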
example_batch, labels_batch = next(iter(tempDataset))
# Pack all columns together
def pack(features, label):
    return tf.stack(list(features.values()), axis=1), label
packed_dataset = tempDataset.map(pack)
for features, labels in packed_dataset.take(1):
    print(features.numpy(), labels.numpy(), sep="\n\n")
Output:
[[28. 1. 0. 14.4542]
[28. 0. 0. 7.2292]
[52. 1. 0. 78.2667]
[48. 1. 2. 65. ]
[30.5 0. 0. 8.05 ]]
[0 0 1 1 0]
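Stacking along axis 1 turns the per-column arrays into one row per passenger. The same reshaping can be sketched in plain Python with zip. The helper name stack_columns is hypothetical, and the values are the first two rows of the output above.

```python
def stack_columns(features):
    """Turn {name: [values...]} into one row per example, in dict order."""
    return [list(row) for row in zip(*features.values())]

features = {
    "age": [28.0, 28.0],
    "n_siblings_spouses": [1.0, 0.0],
    "parch": [0.0, 0.0],
    "fare": [14.4542, 7.2292],
}
packed = stack_columns(features)
print(packed)
```

Each inner list holds age, n_siblings_spouses, parch and fare for one passenger, in that order.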
# Define a general-purpose preprocessing class: select some features and pack them into a single column
class PackNumericFeatures(object):
    def __init__(self, names):
        self.names = names
    def __call__(self, features, labels):
        numeric_features = [features.pop(name) for name in self.names]
        numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
        numeric_features = tf.stack(numeric_features, axis=1)
        features["numeric"] = numeric_features
        return features, labels
numericFeatures = ["age", "n_siblings_spouses", "parch", "fare"]
packed_train_data = rawTrainData.map(PackNumericFeatures(numericFeatures))
packed_test_data = rawTestData.map(PackNumericFeatures(numericFeatures))
showBatch(packed_train_data)
Output:
sex :[b'male' b'male' b'male' b'male' b'male']
class :[b'Third' b'Third' b'Third' b'Third' b'Second']
deck :[b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town :[b'Cherbourg' b'Southampton' b'Queenstown' b'Cherbourg' b'Southampton']
alone :[b'n' b'y' b'n' b'y' b'y']
numeric :[[15. 1. 1. 7.2292]
[25. 0. 0. 7.05 ]
[ 7. 4. 1. 29.125 ]
[28. 0. 0. 7.8958]
[28. 0. 0. 13. ]]
label :[0 0 0 0 1]
example_batch, labels_batch = next(iter(packed_train_data))
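3.2 Normalizing continuous data
Continuous data usually needs to be normalized.
desc = pd.read_csv(trainFilePath)[numericFeatures].describe()
desc
Output: [table of summary statistics (count, mean, std, min, quartiles, max) for age, n_siblings_spouses, parch and fare]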
mean = np.array(desc.T["mean"])
std = np.array(desc.T["std"])
print(mean, type(mean))
print(std, type(std))
Output:
[29.63130781 0.54545455 0.37958533 34.38539856] <class 'numpy.ndarray'>
[12.51181763 1.1510896 0.79299921 54.5977305 ] <class 'numpy.ndarray'>
def normalization(data, mean, std):
    return (data - mean) / std
normalizer = functools.partial(normalization, mean=mean, std=std)
numeric_column = tf.feature_column.numeric_column("numeric", normalizer_fn=normalizer, shape=[len(numericFeatures)])
numeric_columns = [numeric_column]
numeric_column
Output:
NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalization at 0x000000002E16C2F0>, mean=array([29.63130781, 0.54545455, 0.37958533, 34.38539856]), std=array([12.51181763, 1.1510896 , 0.79299921, 54.5977305 ])))
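functools.partial is what lets a three-argument function serve as a one-argument normalizer_fn: it binds mean and std ahead of time. A minimal sketch with made-up scalar values (10.0 and 2.0 are hypothetical and stand in for the arrays used above):

```python
import functools

def normalization(data, mean, std):
    return (data - mean) / std

# Bind mean and std now; the result only needs `data`.
normalizer = functools.partial(normalization, mean=10.0, std=2.0)
print(normalizer(14.0))  # 2.0
```

The bound arguments are visible in normalizer.keywords, which is also why they appear in the NumericColumn representation above.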
example_batch["numeric"]
Output:
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[28. , 1. , 0. , 24.15],
[51. , 0. , 0. , 8.05],
[ 6. , 0. , 1. , 33. ],
[26. , 0. , 0. , 10.5 ],
[16. , 0. , 0. , 26. ]], dtype=float32)>
numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()
Output:
array([[-0.13038135, 0.39488277, -0.4786705 , -0.18746932],
[ 1.7078807 , -0.47385937, -0.4786705 , -0.4823534 ],
[-1.888719 , -0.47385937, 0.7823648 , -0.02537466],
[-0.2902302 , -0.47385937, -0.4786705 , -0.4374797 ],
[-1.0894746 , -0.47385937, -0.4786705 , -0.15358512]],
dtype=float32)
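The first normalized row can be checked by hand with the printed mean and std values. The sketch below applies (x - mean) / std to the first row of example_batch["numeric"]. The helper name normalize_row is hypothetical.

```python
# Mean and std as printed earlier for age, n_siblings_spouses, parch, fare.
MEAN = [29.63130781, 0.54545455, 0.37958533, 34.38539856]
STD = [12.51181763, 1.1510896, 0.79299921, 54.5977305]

def normalize_row(row, mean=MEAN, std=STD):
    """Apply z-score normalization element by element."""
    return [(x - m) / s for x, m, s in zip(row, mean, std)]

first_row = [28.0, 1.0, 0.0, 24.15]
z = normalize_row(first_row)
print([round(v, 4) for v in z])  # [-0.1304, 0.3949, -0.4787, -0.1875]
```

These values agree with the first row returned by numeric_layer, up to float32 rounding.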
3.3 Feature selection: categorical data
CATEGORIES = {
    'sex': ['male', 'female'],
    'class': ['First', 'Second', 'Third'],
    'deck': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town': ['Cherbourg', 'Southampton', 'Queenstown'],  # spelling must match the data exactly
    'alone': ['y', 'n']
}
categorical_columns = []
for feature, vocab in CATEGORIES.items():
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
    categorical_columns.append(tf.feature_column.indicator_column(cat_col))
categorical_columns
Output:
[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[0])
Output:
[0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]
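The 20-element vector can be reproduced with a plain-Python one-hot encoder. This is a sketch under two assumptions: values missing from a vocabulary (such as a deck of 'unknown') encode as all zeros, as the default_value=-1 and num_oov_buckets=0 settings suggest, and the layer concatenates the columns in alphabetical order of their keys. The sample row is hypothetical, chosen to be consistent with the printed vector.

```python
CATEGORIES = {
    "sex": ["male", "female"],
    "class": ["First", "Second", "Third"],
    "deck": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "embark_town": ["Cherbourg", "Southampton", "Queenstown"],
    "alone": ["y", "n"],
}

def one_hot(value, vocab):
    """1.0 at the index of `value`; all zeros if `value` is not in `vocab`."""
    return [1.0 if value == item else 0.0 for item in vocab]

def encode(row):
    """Concatenate the one-hot vectors with columns ordered by key name."""
    vector = []
    for key in sorted(CATEGORIES):
        vector.extend(one_hot(row[key], CATEGORIES[key]))
    return vector

# Hypothetical passenger row.
row = {"sex": "male", "class": "Third", "deck": "unknown",
       "embark_town": "Queenstown", "alone": "n"}
encoded = encode(row)
print(encoded)
```

Reading the result left to right gives alone (2 slots), class (3), deck (10), embark_town (3) and sex (2).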
3.4 Combine the continuous and categorical layers
preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns + numeric_columns)
print(preprocessing_layer(example_batch).numpy()[0])
Output:
[ 0. 1. 0. 0. 1. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 1.
-0.13038135 0.39488277 -0.4786705 -0.18746932 1. 0. ]
4. Build the model
model = tf.keras.Sequential([
    preprocessing_layer,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1)
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer="adam",
              metrics=["accuracy"])
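Because the last Dense layer has no activation, the model outputs raw logits, and from_logits=True tells the loss to apply the sigmoid itself. The relationship can be sketched in plain Python. This is a textbook formulation for illustration, not the numerically stable version a library would use, and the function names are hypothetical.

```python
import math

def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def bce_from_logit(label, logit):
    """Binary cross-entropy for one example, computed from a raw logit."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

print(sigmoid(0.0))            # 0.5
print(bce_from_logit(1, 2.0))  # small loss: confident and correct
print(bce_from_logit(1, -2.0)) # larger loss: confident and wrong
```

A logit of 0 corresponds to a 50% predicted survival probability, which is why predictions need a sigmoid before being read as percentages.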
5. Train and evaluate the model
train_data = packed_train_data.shuffle(500)
test_data = packed_test_data
model.fit(train_data, epochs=20)
Output:
Epoch 1/20
126/126 [==============================] - 0s 3ms/step - loss: 0.4894 - accuracy: 0.7544
Epoch 2/20
126/126 [==============================] - 0s 984us/step - loss: 0.4147 - accuracy: 0.8230
Epoch 3/20
126/126 [==============================] - 0s 921us/step - loss: 0.4002 - accuracy: 0.8309
Epoch 4/20
126/126 [==============================] - 0s 849us/step - loss: 0.3894 - accuracy: 0.8325
Epoch 5/20
126/126 [==============================] - 0s 841us/step - loss: 0.3812 - accuracy: 0.8485
Epoch 6/20
126/126 [==============================] - 0s 833us/step - loss: 0.3729 - accuracy: 0.8341
Epoch 7/20
126/126 [==============================] - 0s 849us/step - loss: 0.3716 - accuracy: 0.8421
Epoch 8/20
126/126 [==============================] - 0s 984us/step - loss: 0.3647 - accuracy: 0.8453
Epoch 9/20
126/126 [==============================] - 0s 770us/step - loss: 0.3472 - accuracy: 0.8501
Epoch 10/20
126/126 [==============================] - 0s 794us/step - loss: 0.3470 - accuracy: 0.8533
Epoch 11/20
126/126 [==============================] - 0s 841us/step - loss: 0.3449 - accuracy: 0.8421
Epoch 12/20
126/126 [==============================] - 0s 873us/step - loss: 0.3360 - accuracy: 0.8485
Epoch 13/20
126/126 [==============================] - 0s 857us/step - loss: 0.3313 - accuracy: 0.8565
Epoch 14/20
126/126 [==============================] - 0s 857us/step - loss: 0.3293 - accuracy: 0.8533
Epoch 15/20
126/126 [==============================] - 0s 873us/step - loss: 0.3236 - accuracy: 0.8644
Epoch 16/20
126/126 [==============================] - 0s 897us/step - loss: 0.3336 - accuracy: 0.8581
Epoch 17/20
126/126 [==============================] - 0s 770us/step - loss: 0.3185 - accuracy: 0.8565
Epoch 18/20
126/126 [==============================] - 0s 778us/step - loss: 0.3118 - accuracy: 0.8596
Epoch 19/20
126/126 [==============================] - 0s 794us/step - loss: 0.3130 - accuracy: 0.8581
Epoch 20/20
126/126 [==============================] - 0s 929us/step - loss: 0.3099 - accuracy: 0.8644
<tensorflow.python.keras.callbacks.History at 0x302be5c0>
test_loss, test_accuracy = model.evaluate(test_data)
print("\nTest Loss: {}, Test Accuracy: {}".format(test_loss,test_accuracy))
Output:
53/53 [==============================] - 0s 906us/step - loss: 0.4588 - accuracy: 0.8561
Test Loss: 0.45877906680107117, Test Accuracy: 0.8560606241226196
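The step counts in the logs follow from the batch size of 5. The arithmetic below assumes 627 rows in train.csv and 264 rows in eval.csv. These counts are not stated in this article, but they are consistent with the 126 and 53 steps shown and with the reported test accuracy.

```python
import math

BATCH_SIZE = 5
TRAIN_ROWS = 627  # assumption: number of rows in train.csv
TEST_ROWS = 264   # assumption: number of rows in eval.csv

train_steps = math.ceil(TRAIN_ROWS / BATCH_SIZE)
test_steps = math.ceil(TEST_ROWS / BATCH_SIZE)
# Number of correct test predictions implied by the reported accuracy.
correct = round(0.8560606241226196 * TEST_ROWS)
print(train_steps, test_steps, correct)  # 126 53 226
```

Under these assumptions, a test accuracy of about 0.856 means 226 of 264 passengers were classified correctly.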
# make_csv_dataset shuffles by default, so predict on a single batch to keep
# each prediction aligned with its label
for features, labels in test_data.take(1):
    predictions = model.predict(features)
    for prediction, survived in zip(predictions, labels):
        prediction = tf.sigmoid(prediction).numpy()
        print("Predicted survived: {:.2%}".format(prediction[0]),
              "| Actual outcome: ", ("survived" if bool(survived) else "died"))
Output:
Predicted survived: 39.23% | Actual outcome: survived
Predicted survived: 99.98% | Actual outcome: survived
Predicted survived: 86.77% | Actual outcome: died
Predicted survived: 79.44% | Actual outcome: survived
Predicted survived: 30.34% | Actual outcome: died