Reading CSV files (using tf.io.decode_csv())
Reading and parsing CSV files can be split into two steps:
- read the location where the CSV files are stored to build a dataset of filenames;
- read and parse the CSV files from that dataset, then merge the parsed records into one combined dataset.
A quick introduction to the functions used:

1. tf.data.Dataset.list_files(): lists the files matching a pattern.

Signature:

list_files(
    file_pattern,
    shuffle=None,
    seed=None
)

- file_pattern: a glob pattern, given as a string or a list of strings; e.g. ../*.py matches the .py files in the parent directory;
- shuffle=None: whether the listed files should be shuffled;
- seed=None: the random seed used for shuffling.
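As a minimal sketch of list_files (the temp directory and file names here are hypothetical, created just for the example):

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical setup: create three empty CSV files in a temp directory
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(tmp_dir, "part_%02d.csv" % i), "w").close()

# list_files matches the glob pattern; shuffle=False gives a deterministic order
file_ds = tf.data.Dataset.list_files(os.path.join(tmp_dir, "*.csv"), shuffle=False)
file_names = [f.numpy().decode() for f in file_ds]
print(file_names)  # the three matching file paths
```

With the default shuffle=None the filenames come back in a random order each run, which is usually what you want for training data.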
2. tf.data.TextLineDataset(): reads lines from one or more text files to form a Dataset.

Signature:

tf.data.TextLineDataset(
    filenames,
    compression_type=None,
    buffer_size=None,
    num_parallel_reads=None
)

- filenames: a tf.string tensor or a tf.data.Dataset containing one or more filenames;
- compression_type=None: optional; "ZLIB" or "GZIP";
- buffer_size=None: the number of bytes to buffer when reading;
- num_parallel_reads=None: the number of files to read in parallel.
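A small self-contained sketch of TextLineDataset (the sample file and its contents are made up for illustration):

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical file: one header line plus two data rows
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as f:
    f.write("x,y\n1,2\n3,4\n")

# TextLineDataset yields one string tensor per line; skip(1) drops the header
line_ds = tf.data.TextLineDataset(path).skip(1)
lines = [line.numpy().decode() for line in line_ds]
print(lines)  # ['1,2', '3,4']
```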
3. tf.io.decode_csv(): converts CSV records to tensors; each record maps to one set of tensors.

Signature:

tf.io.decode_csv(
    records,
    record_defaults,
    field_delim=',',
    use_quote_delim=True,
    na_value='',
    select_cols=None,
    name=None
)

- records: a string Tensor; each string is one record/row of the CSV, and all records should have the same format;
- record_defaults: a list of Tensor objects with specific types (float32, float64, int32, int64, or string), one per column of the input record, each holding either a scalar default value for that column, or an empty vector if the column is required;
- field_delim=',': optional string, defaults to ","; the char delimiter separating fields in a record;
- use_quote_delim=True: optional bool, defaults to True; if False, double quotes are treated as regular characters inside string fields;
- na_value='': an additional string to recognize as NA/NaN;
- select_cols=None: an optional sorted list of column indices; if specified, only this subset of columns is parsed and returned;
- name=None: a name for the operation.

Returns:
A list of Tensor objects with the same types as record_defaults. Each tensor has the same shape as records.
Code example:
# Inspect the files
# train_filenames, valid_filenames, and test_filenames hold the saved CSV file paths
import pprint
print("train filenames:")
pprint.pprint(train_filenames)
print("valid filenames:")
pprint.pprint(valid_filenames)
print("test filenames:")
pprint.pprint(test_filenames)
train filenames:
['customize_generate_csv\train_00.csv',
'customize_generate_csv\train_01.csv',
'customize_generate_csv\train_02.csv',
'customize_generate_csv\train_03.csv',
'customize_generate_csv\train_04.csv',
'customize_generate_csv\train_05.csv',
'customize_generate_csv\train_06.csv',
'customize_generate_csv\train_07.csv',
'customize_generate_csv\train_08.csv',
'customize_generate_csv\train_09.csv',
'customize_generate_csv\train_10.csv',
'customize_generate_csv\train_11.csv',
'customize_generate_csv\train_12.csv',
'customize_generate_csv\train_13.csv',
'customize_generate_csv\train_14.csv',
'customize_generate_csv\train_15.csv',
'customize_generate_csv\train_16.csv',
'customize_generate_csv\train_17.csv',
'customize_generate_csv\train_18.csv',
'customize_generate_csv\train_19.csv']
valid filenames:
['customize_generate_csv\valid_00.csv',
'customize_generate_csv\valid_01.csv',
'customize_generate_csv\valid_02.csv',
'customize_generate_csv\valid_03.csv',
'customize_generate_csv\valid_04.csv',
'customize_generate_csv\valid_05.csv',
'customize_generate_csv\valid_06.csv',
'customize_generate_csv\valid_07.csv',
'customize_generate_csv\valid_08.csv',
'customize_generate_csv\valid_09.csv']
test filenames:
['customize_generate_csv\test_00.csv',
'customize_generate_csv\test_01.csv',
'customize_generate_csv\test_02.csv',
'customize_generate_csv\test_03.csv',
'customize_generate_csv\test_04.csv',
'customize_generate_csv\test_05.csv',
'customize_generate_csv\test_06.csv',
'customize_generate_csv\test_07.csv',
'customize_generate_csv\test_08.csv',
'customize_generate_csv\test_09.csv']
# tf.data.Dataset.list_files(): lists all files in a directory that match the pattern
filename_dataset = tf.data.Dataset.list_files(train_filenames)
for filename in filename_dataset:
print(filename)
tf.Tensor(b'customize_generate_csv\train_09.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_12.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_17.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_19.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_08.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_07.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_03.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_16.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_15.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_10.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_11.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_01.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_18.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_00.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_06.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_02.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_14.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_04.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_13.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_05.csv', shape=(), dtype=string)
# tf.data.Dataset.interleave(): reads data from each file and merges the results into one dataset
# tf.data.TextLineDataset(): reads lines from one or more text files to form a dataset
n_readers = 5
dataset = filename_dataset.interleave(
    # skip(1) skips the header line of each file
    lambda filename: tf.data.TextLineDataset(filename).skip(1),
    # read 5 files in parallel; by default one record is taken from each in turn
    cycle_length = n_readers
)
for line in dataset.take(5):
    print(line.numpy())
b'-0.46357383731798407,-0.9969472983623009,-0.360665182362259,-0.03758824275346155,-0.7513782282717916,-0.11044346277054949,-1.3324374269537262,1.2692798012625406,0.946'
b'-0.27877631514723744,0.26961493546674115,-0.44563685950976084,-0.0952067699106492,1.2317425845839447,-0.027998641115256996,-0.7266744085540406,0.7624447479420658,2.141'
b'-0.819775446057801,-0.12618576260483452,-0.25010557828795343,0.05971663748529316,-1.1013928489500828,-0.08518203260477861,1.085954931118859,-0.8226570364621759,0.875'
b'0.6216133377416374,0.34877507508105626,0.09784787148671302,-0.15320100586458107,-0.1957854000052381,-0.04840063829783664,0.7970525684974694,-1.2102367831190115,3.116'
b'-0.016009864295304353,0.34877507508105626,-0.14516200231557114,-0.16849220911426202,0.6929859026284989,0.03414743870549869,-0.8524867277601302,0.8469172568288058,2.118'
# tf.io.decode_csv(records, record_defaults): Convert CSV records to tensors. Each column maps to one tensor.
sample_str = "1, 2, 3, 4, 5"
record_defaults_1 = [tf.constant(0, dtype=tf.int32)] * 5
parsed_fields = tf.io.decode_csv(sample_str, record_defaults_1)
print(parsed_fields)
[<tf.Tensor: id=247, shape=(), dtype=int32, numpy=1>, <tf.Tensor: id=248, shape=(), dtype=int32, numpy=2>, <tf.Tensor: id=249, shape=(), dtype=int32, numpy=3>, <tf.Tensor: id=250, shape=(), dtype=int32, numpy=4>, <tf.Tensor: id=251, shape=(), dtype=int32, numpy=5>]
import numpy as np

record_defaults_2 = [
    tf.constant(0, dtype=tf.int32),
    0,
    np.nan,
    "hello",
    tf.constant([])
]
parsed_fields_2 = tf.io.decode_csv(sample_str, record_defaults_2)
print(parsed_fields_2)
[<tf.Tensor: id=258, shape=(), dtype=int32, numpy=1>, <tf.Tensor: id=259, shape=(), dtype=int32, numpy=2>, <tf.Tensor: id=260, shape=(), dtype=float32, numpy=3.0>, <tf.Tensor: id=261, shape=(), dtype=string, numpy=b' 4'>, <tf.Tensor: id=262, shape=(), dtype=float32, numpy=5.0>]
# decode_csv raises an error if: 1. a field in records is empty and its
# record_defaults entry has no default value; 2. the number of fields in
# records does not match record_defaults.
try:
    parsed_fields = tf.io.decode_csv(',,,,', record_defaults_2)
except tf.errors.InvalidArgumentError as ex:
    print(ex)
Field 4 is required but missing in record 0! [Op:DecodeCSV]
try:
    parsed_fields = tf.io.decode_csv('1, 2, 3, 4, 5, 6, 7', record_defaults_2)
except tf.errors.InvalidArgumentError as ex:
    print(ex)
Expect 5 fields but have 7 in record 0 [Op:DecodeCSV]
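Putting the pieces together, a line from the interleaved dataset can be parsed with decode_csv. This is a sketch under the assumption that each record holds 8 feature columns followed by 1 label column, matching the sample rows printed above; the parse_csv_line helper and the column count are illustrative, not part of the original code.

```python
import tensorflow as tf

# Hypothetical helper: parse one CSV line into a (features, label) pair.
# Assumes 8 feature columns followed by 1 label column (9 fields total).
def parse_csv_line(line, n_fields=9):
    # One scalar default per column; 0.0 makes every field float32
    defaults = [tf.constant(0.0)] * n_fields
    fields = tf.io.decode_csv(line, record_defaults=defaults)
    x = tf.stack(fields[:-1])  # features: shape (8,)
    y = tf.stack(fields[-1:])  # label: shape (1,)
    return x, y

x, y = parse_csv_line("1,2,3,4,5,6,7,8,9")
print(x.numpy())  # [1. 2. 3. 4. 5. 6. 7. 8.]
print(y.numpy())  # [9.]
```

In the full pipeline this helper would be applied with dataset.map(parse_csv_line), turning the interleaved line dataset into a dataset of (features, label) pairs ready for training.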