TensorFlow 2.X - Reading CSV Files (using tf.io.decode_csv())

Reading and parsing CSV files can be broken into two steps:

  1. Build a dataset of filenames from the locations where the CSV files are stored;
  2. Read and parse the CSV files from that filename dataset, then merge the parsed data into one complete dataset.

An introduction to the functions used:
1. tf.data.Dataset.list_files(): lists the files that match a pattern (a short sketch follows the parameter list).

Parameters:

list_files(
  file_pattern, shuffle=None, seed=None
)

  • file_pattern: the matching pattern, a string or list of strings, e.g. ../*.py matches the .py files in the parent directory;
  • shuffle=None: whether to shuffle the listed filenames (by default they are shuffled);
  • seed=None: the random seed used for shuffling.
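
A minimal sketch of list_files with a glob pattern (the directory name customize_generate_csv is borrowed from the example section below):

import tensorflow as tf

# By default the filename dataset is reshuffled on each iteration;
# pass a seed to make the order reproducible.
filename_dataset = tf.data.Dataset.list_files(
    "customize_generate_csv/train_*.csv", seed=42)
for filename in filename_dataset.take(3):
    print(filename)  # each element is a scalar tf.string tensor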

2. tf.data.TextLineDataset(): reads lines from one or more text files to form a Dataset (a short sketch follows the parameter list).

Parameters:

tf.data.TextLineDataset(
  filenames,
  compression_type=None,
  buffer_size=None,
  num_parallel_reads=None
)

  • filenames: a tf.string tensor or a tf.data.Dataset containing one or more filenames;
  • compression_type=None: the compression format, either ZLIB or GZIP (None means no compression);
  • buffer_size=None: the number of bytes to buffer while reading;
  • num_parallel_reads=None: the number of files to read in parallel.
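
A minimal sketch of TextLineDataset on its own, assuming one of the CSV files from the example section below exists:

import tensorflow as tf

# Each element is one raw line of the file as a scalar string tensor;
# the header line is included unless it is skipped explicitly.
dataset = tf.data.TextLineDataset(["customize_generate_csv/train_00.csv"])
for line in dataset.take(3):
    print(line.numpy())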

3. tf.io.decode_csv(): converts CSV records to tensors; each column maps to one tensor.

Parameters:

tf.io.decode_csv(
  records,
  record_defaults,
  field_delim=',',
  use_quote_delim=True,
  na_value='',
  select_cols=None,
  name=None
)

  • records: a Tensor of type string. Each string is one record/row in the CSV, and all records should have the same format;
  • record_defaults: a list of Tensor objects with specific types. Acceptable types are float32, float64, int32, int64, and string. One tensor per column of the input record, holding either a scalar default value for that column, or an empty vector if the column is required;
  • field_delim=',': an optional string, "," by default. The char delimiter used to separate fields in a record;
  • use_quote_delim=True: an optional bool, True by default. If False, double quotes are treated as regular characters inside string fields;
  • na_value='': an additional string to recognize as NA/NaN;
  • select_cols=None: an optional sorted list of column indices. If specified, only this subset of columns is parsed and returned (see the sketch after the return value);
  • name=None: a name for the operation.

Returns:

A list of Tensor objects, with the same types as record_defaults. Each tensor has the same shape as records.
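
The na_value and select_cols parameters are not exercised in the examples below, so here is a minimal sketch of both (the record strings are made up for illustration):

import tensorflow as tf

# na_value: a field equal to "NA" is treated like an empty field
# and falls back to the column's default value.
defaults = [tf.constant(0.0)] * 3
print(tf.io.decode_csv("1,NA,3", defaults, na_value="NA"))
# -> [1.0, 0.0, 3.0]

# select_cols: parse only columns 0 and 2; record_defaults then
# needs one entry per selected column.
print(tf.io.decode_csv("1,2,3", [tf.constant(0.0)] * 2, select_cols=[0, 2]))
# -> [1.0, 3.0]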

Code examples:

# Inspect the files
# train_filenames, valid_filenames, and test_filenames hold the paths of the saved CSV files
import pprint
print("train filenames: ")
pprint.pprint(train_filenames)
print("valid filenames:")
pprint.pprint(valid_filenames)
print("test filenames: ")
pprint.pprint(test_filenames)

train filenames:
['customize_generate_csv\train_00.csv',
'customize_generate_csv\train_01.csv',
'customize_generate_csv\train_02.csv',
'customize_generate_csv\train_03.csv',
'customize_generate_csv\train_04.csv',
'customize_generate_csv\train_05.csv',
'customize_generate_csv\train_06.csv',
'customize_generate_csv\train_07.csv',
'customize_generate_csv\train_08.csv',
'customize_generate_csv\train_09.csv',
'customize_generate_csv\train_10.csv',
'customize_generate_csv\train_11.csv',
'customize_generate_csv\train_12.csv',
'customize_generate_csv\train_13.csv',
'customize_generate_csv\train_14.csv',
'customize_generate_csv\train_15.csv',
'customize_generate_csv\train_16.csv',
'customize_generate_csv\train_17.csv',
'customize_generate_csv\train_18.csv',
'customize_generate_csv\train_19.csv']
valid filenames:
['customize_generate_csv\valid_00.csv',
'customize_generate_csv\valid_01.csv',
'customize_generate_csv\valid_02.csv',
'customize_generate_csv\valid_03.csv',
'customize_generate_csv\valid_04.csv',
'customize_generate_csv\valid_05.csv',
'customize_generate_csv\valid_06.csv',
'customize_generate_csv\valid_07.csv',
'customize_generate_csv\valid_08.csv',
'customize_generate_csv\valid_09.csv']
test filenames:
['customize_generate_csv\test_00.csv',
'customize_generate_csv\test_01.csv',
'customize_generate_csv\test_02.csv',
'customize_generate_csv\test_03.csv',
'customize_generate_csv\test_04.csv',
'customize_generate_csv\test_05.csv',
'customize_generate_csv\test_06.csv',
'customize_generate_csv\test_07.csv',
'customize_generate_csv\test_08.csv',
'customize_generate_csv\test_09.csv']

import tensorflow as tf

# tf.data.Dataset.list_files(): list all matching files in the directory
filename_dataset = tf.data.Dataset.list_files(train_filenames)
for filename in filename_dataset:
    print(filename)

tf.Tensor(b'customize_generate_csv\train_09.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_12.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_17.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_19.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_08.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_07.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_03.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_16.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_15.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_10.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_11.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_01.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_18.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_00.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_06.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_02.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_14.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_04.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_13.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_05.csv', shape=(), dtype=string)

# tf.data.Dataset.interleave(): read from several datasets in turn and merge the results into one dataset
# tf.data.TextLineDataset(): read lines from one or more text files to form a dataset

n_readers = 5
dataset = filename_dataset.interleave(
    # skip(1) skips the header row of each file
    lambda filename: tf.data.TextLineDataset(filename).skip(1),
    # read 5 files concurrently; by default one element is taken
    # from each file in turn (block_length=1)
    cycle_length=n_readers
)

for line in dataset.take(5):
    print(line.numpy())

b'-0.46357383731798407,-0.9969472983623009,-0.360665182362259,-0.03758824275346155,-0.7513782282717916,-0.11044346277054949,-1.3324374269537262,1.2692798012625406,0.946'
b'-0.27877631514723744,0.26961493546674115,-0.44563685950976084,-0.0952067699106492,1.2317425845839447,-0.027998641115256996,-0.7266744085540406,0.7624447479420658,2.141'
b'-0.819775446057801,-0.12618576260483452,-0.25010557828795343,0.05971663748529316,-1.1013928489500828,-0.08518203260477861,1.085954931118859,-0.8226570364621759,0.875'
b'0.6216133377416374,0.34877507508105626,0.09784787148671302,-0.15320100586458107,-0.1957854000052381,-0.04840063829783664,0.7970525684974694,-1.2102367831190115,3.116'
b'-0.016009864295304353,0.34877507508105626,-0.14516200231557114,-0.16849220911426202,0.6929859026284989,0.03414743870549869,-0.8524867277601302,0.8469172568288058,2.118'

# tf.io.decode_csv(records, record_defaults): convert CSV records to tensors; each column maps to one tensor

sample_str = "1, 2, 3, 4, 5"
record_defaults_1 = [tf.constant(0, dtype=tf.int32)] * 5
parsed_fields = tf.io.decode_csv(sample_str, record_defaults_1)
print(parsed_fields)

[<tf.Tensor: id=247, shape=(), dtype=int32, numpy=1>, <tf.Tensor: id=248, shape=(), dtype=int32, numpy=2>, <tf.Tensor: id=249, shape=(), dtype=int32, numpy=3>, <tf.Tensor: id=250, shape=(), dtype=int32, numpy=4>, <tf.Tensor: id=251, shape=(), dtype=int32, numpy=5>]

import numpy as np

record_defaults_2 = [
    tf.constant(0, dtype=tf.int32),
    0,
    np.nan,
    "hello",
    tf.constant([])
]
parsed_fields_2 = tf.io.decode_csv(sample_str, record_defaults_2)
print(parsed_fields_2)

[<tf.Tensor: id=258, shape=(), dtype=int32, numpy=1>, <tf.Tensor: id=259, shape=(), dtype=int32, numpy=2>, <tf.Tensor: id=260, shape=(), dtype=float32, numpy=3.0>, <tf.Tensor: id=261, shape=(), dtype=string, numpy=b' 4'>, <tf.Tensor: id=262, shape=(), dtype=float32, numpy=5.0>]
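
Note the fourth field above: only the comma acts as the delimiter, so the space after it is preserved and the string column comes back verbatim as b' 4', while the numeric columns tolerate the leading whitespace when parsing.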

# decode_csv raises an error if: 1. a field in records is empty and the corresponding
# record_defaults entry provides no default value; 2. the number of fields in records does not match record_defaults.
try:
    parsed_fields = tf.io.decode_csv(',,,,', record_defaults_2)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

Field 4 is required but missing in record 0! [Op:DecodeCSV]

try:
    parsed_fields = tf.io.decode_csv('1, 2, 3, 4, 5, 6, 7', record_defaults_2)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

Expect 5 fields but have 7 in record 0 [Op:DecodeCSV]
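
Putting the pieces together: a minimal sketch of the full pipeline described at the top, assuming each row holds 8 float features followed by 1 float label, as in the sample rows above (the helper names parse_csv_line and csv_reader_dataset are illustrative, not part of the TensorFlow API):

import numpy as np
import tensorflow as tf

def parse_csv_line(line, n_fields=9):
    # One scalar default per column; nan makes missing values obvious.
    defaults = [tf.constant(np.nan)] * n_fields
    fields = tf.io.decode_csv(line, record_defaults=defaults)
    x = tf.stack(fields[:-1])  # first 8 columns -> feature vector
    y = tf.stack(fields[-1:])  # last column -> label
    return x, y

def csv_reader_dataset(filenames, n_readers=5, batch_size=32,
                       shuffle_buffer_size=10000):
    dataset = tf.data.Dataset.list_files(filenames)  # step 1: filename dataset
    dataset = dataset.interleave(                    # step 2: read and merge
        lambda filename: tf.data.TextLineDataset(filename).skip(1),
        cycle_length=n_readers)
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(parse_csv_line)            # parse each line
    return dataset.batch(batch_size).prefetch(1)

train_set = csv_reader_dataset(train_filenames)
for x_batch, y_batch in train_set.take(1):
    print(x_batch.shape, y_batch.shape)  # (32, 8) (32, 1)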
