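CSV is one of the most common file formats in data processing. How do you turn a CSV file into a TensorFlow dataset?
The CSV file used in this post has already been converted to numeric values and contains no string data. Each row has 318 columns: 317 features followed by one label.
Without further ado, the details are in the code comments.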
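To try the approaches in this post without the original file, a stand-in CSV with the same layout can be generated first. The sketch below is not part of the original workflow: `write_sample_csv`, the file name `sample.csv`, the row count, and the 0/1 label are all hypothetical. Only the layout (318 numeric columns, label last, no header row) follows the post.

```python
import csv
import random


def write_sample_csv(path, n_rows=100, n_features=317, seed=0):
    """Write n_rows of n_features random floats plus a 0/1 label, with no header row."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        for _ in range(n_rows):
            # Fixed-point formatting keeps every field a plain decimal number
            features = [f"{rng.random():.6f}" for _ in range(n_features)]
            label = f"{float(rng.randint(0, 1)):.1f}"
            writer.writerow(features + [label])
    return path


if __name__ == "__main__":
    write_sample_csv("sample.csv")
```

Pointing the later snippets at `sample.csv` instead of `../datasets/result1.csv` should be enough to run them end to end.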
1 Converting to a training set without a test split
1.1 tf.data.TextLineDataset
import tensorflow as tf


def parse_csv(line):
    """
    Parse one line of text into floats and return the features and the label.
    :param line: a scalar string tensor holding one CSV row
    :return: features with shape (317,), label with shape ()
    """
    # One float default per column: 317 features plus 1 label
    example_defaults = [[0.]] * 318
    parsed_line = tf.io.decode_csv(line, example_defaults)
    features = tf.reshape(parsed_line[:-1], shape=(317,))
    labels = tf.reshape(parsed_line[-1], shape=())
    return features, labels


dataset = tf.data.TextLineDataset('../datasets/result1.csv')
# dataset = dataset.skip(1)  # skip the header row, if the file has one
dataset = dataset.map(parse_csv)  # parse each row
dataset = dataset.shuffle(buffer_size=1000)  # randomize
dataset = dataset.batch(32)

features, label = next(iter(dataset))
print("example features:", features.shape)
print("example label:", label.shape)
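What `tf.io.decode_csv` returns is easier to see on a tiny input. The following is a minimal sketch, assuming 4 columns (3 features and a label) instead of 318, and using `tf.data.Dataset.from_tensor_slices` over hard-coded strings as a stand-in for `TextLineDataset`. The function name `parse_small` and the sample rows are made up for illustration.

```python
import tensorflow as tf

N_COLUMNS = 4  # assumption: 3 features + 1 label, instead of 318


def parse_small(line):
    # decode_csv returns a list with one scalar tensor per column
    columns = tf.io.decode_csv(line, [[0.0]] * N_COLUMNS)
    features = tf.stack(columns[:-1])  # shape (3,)
    label = columns[-1]                # shape ()
    return features, label


# Hard-coded lines stand in for the lines TextLineDataset would read from a file
lines = tf.data.Dataset.from_tensor_slices(["1.0,2.0,3.0,0.0", "4.0,5.0,6.0,1.0"])
small_ds = lines.map(parse_small).batch(2)

features, label = next(iter(small_ds))
print(features.shape, label.shape)
```

Using `tf.stack` on the column list makes the list-to-tensor step explicit; the `tf.reshape` call in the post's version performs the same conversion implicitly.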
1.2 tf.data.experimental.CsvDataset
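import tensorflow as tf


def parse_float(row):
    """
    Split one stacked row of floats into features and label.
    :param row: a float tensor of shape (318,)
    :return: features with shape (317,), label with shape ()
    """
    features = tf.reshape(row[:-1], shape=(317,))
    labels = tf.reshape(row[-1], shape=())
    return features, labels


def f(*items):
    """
    A parameter prefixed with an asterisk (*) collects all remaining
    positional arguments into a tuple.
    """
    print(items)
    return tf.stack(items)


# One float default per column: 317 features plus 1 label
record_default = [[0.0]] * 318
ds = tf.data.experimental.CsvDataset('../datasets/result1.csv', record_default)
# Each row arrives as 318 separate scalars; stack them into one tensor
ds = ds.map(lambda *items: tf.stack(items))
# ds = ds.map(f)  # same effect
dataset = ds.map(parse_float).batch(32)

for feature, label in dataset.take(1):
    print(feature)
    print(label)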
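The `*items` signature matters because `CsvDataset` yields each row as a tuple with one element per column, and `map` passes the elements of a tuple as separate positional arguments. The behaviour of `*items` itself can be checked in plain Python. The sketch below uses ordinary floats and two hypothetical helpers, `collect` and `split_row`, in place of tensors and `tf.stack`.

```python
def collect(*items):
    """Gather all positional arguments into one tuple, as *items does."""
    return items


def split_row(values):
    """Split a row into (features, label), with the label in the last position."""
    return list(values[:-1]), values[-1]


row = (0.1, 0.2, 0.3, 1.0)  # made-up row: 3 features and a label

# Calling with the row unpacked mimics how map passes a tuple element
packed = collect(*row)
features, label = split_row(packed)
print(packed)
print(features, label)
```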
2 Splitting into training, test, and validation sets
2.1 Splitting the dataset with pandas
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split


def parse_dict(row):
    """
    Take the values out of a row dictionary and turn them into features and label.
    :param row: one row as a dictionary
    :return: feature, labels
    """
    dict_values = row.values()
    data_tensor = tf.stack(list(dict_values))
    feature = tf.reshape(data_tensor[:-1], shape=(317,))
    labels = tf.reshape(data_tensor[-1], shape=())
    return feature, labels


def load_data():
    """
    Load the dataset file and return tf.data datasets.
    File location: ../datasets/result1.csv
    :return: train_ds, test_ds, val_ds
    """
    # The data is buffered in CPU memory and cannot be moved to the GPU,
    # so run everything on the CPU
    with tf.device('/cpu:0'):
        # Read the CSV with pandas; this returns a DataFrame
        # result1.csv has no header row, so column names are supplied here
        names = ['c' + str(i) for i in range(318)]
        dataframe = pd.read_csv('../datasets/result1.csv', names=names, dtype=float)
        # If the file does have a header row, let pandas consume it with header=0;
        # header=None would read the header as data, and dtype=float would then fail
        # dataframe = pd.read_csv('../datasets/result1.csv', header=0, dtype=float)
        # Split into training, test, and validation sets
        train, test = train_test_split(dataframe, test_size=0.2)
        train, val = train_test_split(train, test_size=0.2)
        # Convert to tf.data datasets
        train_ds = tf.data.Dataset.from_tensor_slices(dict(train)).map(parse_dict)
        val_ds = tf.data.Dataset.from_tensor_slices(dict(val)).map(parse_dict)
        test_ds = tf.data.Dataset.from_tensor_slices(dict(test)).map(parse_dict)
    return train_ds, test_ds, val_ds
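Because `train_test_split` is called twice with `test_size=0.2`, the validation set is 20% of the remaining 80%, not 20% of the whole file. The arithmetic can be sketched as follows; `split_fractions` is a hypothetical helper, and the exact row counts in practice also depend on how scikit-learn rounds.

```python
def split_fractions(test_size=0.2, val_size=0.2):
    """Return overall (train, val, test) fractions for two successive splits."""
    test = test_size
    remaining = 1.0 - test_size   # what the first split leaves for training
    val = remaining * val_size    # the second split takes val_size of the remainder
    train = remaining - val
    return train, val, test


train, val, test = split_fractions()
print(f"train={train:.2f} val={val:.2f} test={test:.2f}")
```

With the values used in `load_data`, this works out to roughly 64% training, 16% validation, and 20% test.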
There are certainly other ways to do this, but mastering one is enough. Feel free to share your ideas in the comments.