Reading CSV files (using tf.io.decode_csv())
Reading and parsing CSV files can be split into two steps:
- read the location where the CSV files are stored to build a dataset of filenames;
- read and parse the CSV files from that dataset, then merge the parsed records into one combined dataset.
A quick introduction to the functions used:

1. tf.data.Dataset.list_files(): lists the files matching a pattern.

Signature:

list_files(
    file_pattern,
    shuffle=None,
    seed=None
)

- file_pattern: a glob pattern, given as a string or a list of strings; e.g. ../*.py matches the .py files in the parent directory;
- shuffle=None: whether the listed files should be shuffled;
- seed=None: the random seed used for shuffling.
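As a minimal sketch of list_files (the temp directory and file names here are hypothetical, created just for the example):

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical setup: create three empty CSV files in a temp directory
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(tmp_dir, "part_%02d.csv" % i), "w").close()

# list_files matches the glob pattern; shuffle=False gives a deterministic order
file_ds = tf.data.Dataset.list_files(os.path.join(tmp_dir, "*.csv"), shuffle=False)
file_names = [f.numpy().decode() for f in file_ds]
print(file_names)  # the three matching file paths
```

With the default shuffle=None the filenames come back in a random order each run, which is usually what you want for training data.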
2. tf.data.TextLineDataset(): reads lines from one or more text files to form a Dataset.

Signature:

tf.data.TextLineDataset(
    filenames,
    compression_type=None,
    buffer_size=None,
    num_parallel_reads=None
)

- filenames: a tf.string tensor or a tf.data.Dataset containing one or more filenames;
- compression_type=None: optional; "ZLIB" or "GZIP";
- buffer_size=None: the number of bytes to buffer when reading;
- num_parallel_reads=None: the number of files to read in parallel.
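A small self-contained sketch of TextLineDataset (the sample file and its contents are made up for illustration):

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical file: one header line plus two data rows
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as f:
    f.write("x,y\n1,2\n3,4\n")

# TextLineDataset yields one string tensor per line; skip(1) drops the header
line_ds = tf.data.TextLineDataset(path).skip(1)
lines = [line.numpy().decode() for line in line_ds]
print(lines)  # ['1,2', '3,4']
```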
3. tf.io.decode_csv(): converts CSV records to tensors; each record maps to one set of tensors.

Signature:

tf.io.decode_csv(
    records,
    record_defaults,
    field_delim=',',
    use_quote_delim=True,
    na_value='',
    select_cols=None,
    name=None
)

- records: a string Tensor; each string is one record/row of the CSV, and all records should have the same format;
- record_defaults: a list of Tensor objects with specific types (float32, float64, int32, int64, or string), one per column of the input record, each holding either a scalar default value for that column, or an empty vector if the column is required;
- field_delim=',': optional string, defaults to ","; the char delimiter separating fields in a record;
- use_quote_delim=True: optional bool, defaults to True; if False, double quotes are treated as regular characters inside string fields;
- na_value='': an additional string to recognize as NA/NaN;
- select_cols=None: an optional sorted list of column indices; if specified, only this subset of columns is parsed and returned;
- name=None: a name for the operation.

Returns:
A list of Tensor objects with the same types as record_defaults. Each tensor has the same shape as records.
Code example:
# Inspect the files
# train_filenames, valid_filenames, and test_filenames hold the saved CSV file paths
import pprint
print("train filenames:")
pprint.pprint(train_filenames)
print("valid filenames:")
pprint.pprint(valid_filenames)
print("test filenames:")
pprint.pprint(test_filenames)
train filenames:
['customize_generate_csv\train_00.csv',
'customize_generate_csv\train_01.csv',
'customize_generate_csv\train_02.csv',
'customize_generate_csv\train_03.csv',
'customize_generate_csv\train_04.csv',
'customize_generate_csv\train_05.csv',
'customize_generate_csv\train_06.csv',
'customize_generate_csv\train_07.csv',
'customize_generate_csv\train_08.csv',
'customize_generate_csv\train_09.csv',
'customize_generate_csv\train_10.csv',
'customize_generate_csv\train_11.csv',
'customize_generate_csv\train_12.csv',
'customize_generate_csv\train_13.csv',
'customize_generate_csv\train_14.csv',
'customize_generate_csv\train_15.csv',
'customize_generate_csv\train_16.csv',
'customize_generate_csv\train_17.csv',
'customize_generate_csv\train_18.csv',
'customize_generate_csv\train_19.csv']
valid filenames:
['customize_generate_csv\valid_00.csv',
'customize_generate_csv\valid_01.csv',
'customize_generate_csv\valid_02.csv',
'customize_generate_csv\valid_03.csv',
'customize_generate_csv\valid_04.csv',
'customize_generate_csv\valid_05.csv',
'customize_generate_csv\valid_06.csv',
'customize_generate_csv\valid_07.csv',
'customize_generate_csv\valid_08.csv',
'customize_generate_csv\valid_09.csv']
test filenames:
['customize_generate_csv\test_00.csv',
'customize_generate_csv\test_01.csv',
'customize_generate_csv\test_02.csv',
'customize_generate_csv\test_03.csv',
'customize_generate_csv\test_04.csv',
'customize_generate_csv\test_05.csv',
'customize_generate_csv\test_06.csv',
'customize_generate_csv\test_07.csv',
'customize_generate_csv\test_08.csv',
'customize_generate_csv\test_09.csv']
# tf.data.Dataset.list_files(): lists all files in a directory that match the pattern
filename_dataset = tf.data.Dataset.list_files(train_filenames)
for filename in filename_dataset:
print(filename)
tf.Tensor(b'customize_generate_csv\train_09.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_12.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_17.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_19.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_08.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_07.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_03.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_16.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_15.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_10.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_11.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_01.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_18.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_00.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_06.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_02.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_14.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_04.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_13.csv', shape=(), dtype=string)
tf.Tensor(b'customize_generate_csv\train_05.csv', shape=(), dtype=string)
# tf.data.Dataset.interleave(): reads data from each file and merges the results into one dataset
# tf.data.TextLineDataset(): reads lines from one or more text files to form a dataset
n_readers = 5
dataset = filename_dataset.interleave(
    # skip(1) skips the header line of each file
    lambda filename: tf.data.TextLineDataset(filename).skip(1),
    # read 5 files in parallel; by default one record is taken from each in turn
    cycle_length = n_readers
)
for line in dataset.take(5):
    print(line.numpy())
b'-0.46357383731798407,-0.9969472983623009,-0.360665182362259,-0.03758824275346155,-0.7513782282717916,-0.11044346277054949,-1.3324374269537262,1.2692798012625406,0.946'
b'-0.27877631514723744,0.26961493546674115,-0.44563685950976084,-0.0952067699106492,1.2317425845839447,-0.027998641115256996,-0.7266744085540406,0.7624447479420658,2.141'
b'-0.819775446057801,-0.12618576260483452,-0.25010557828795343,0.05971663748529316,-1.1013928489500828,-0.08518203260477861,1.085954931118859,-0.8226570364621759,0.875'
b'0.6216133377416374,0.34877507508105626,0.09784787148671302,-0.15320100586458107,-0.1957854000052381,-0.04840063829783664,0.7970525684974694,-1.2102367831190115,3.116'
b'-0.016009864295304353,0.34877507508105626,-0.14516200231557114,-0.16849220911426202,0.6929859026284989,0.03414743870549869,-0.8524867277601302,0.8469172568288058,2.118'
# tf.io.decode_csv(records, record_defaults): Convert CSV records to tensors. Each column maps to one tensor.
sample_str = "1, 2, 3, 4, 5"
record_defaults_1 = [tf.constant(0, dtype=tf.int32)] * 5
parsed_fields = tf.io.decode_csv(sample_str, record_defaults_1)
print(parsed_fields)
[<tf.Tensor: id=247, shape=(), dtype=int32, numpy=1>, <tf.Tensor: id=248, shape=(), dtype=int32, numpy=2>, <tf.Tensor: id=249, shape=(), dtype=int32, numpy=3>, <tf.Tensor: id=250, shape=(), dtype=int32, numpy=4>, <tf.Tensor: id=251, shape=(), dtype=int32, numpy=5>]
import numpy as np

record_defaults_2 = [
    tf.constant(0, dtype=tf.int32),
    0,
    np.nan,
    "hello",
    tf.constant([])
]
parsed_fields_2 = tf.io.decode_csv(sample_str, record_defaults_2)
print(parsed_fields_2)
[<tf.Tensor: id=258, shape=(), dtype=int32, numpy=1>, <tf.Tensor: id=259, shape=(), dtype=int32, numpy=2>, <tf.Tensor: id=260, shape=(), dtype=float32, numpy=3.0>, <tf.Tensor: id=261, shape=(), dtype=string, numpy=b' 4'>, <tf.Tensor: id=262, shape=(), dtype=float32, numpy=5.0>]
# decode_csv raises an error if: 1. a field in records is empty and its
# record_defaults entry has no default value; 2. the number of fields in
# records does not match record_defaults.
try:
    parsed_fields = tf.io.decode_csv(',,,,', record_defaults_2)
except tf.errors.InvalidArgumentError as ex:
    print(ex)
Field 4 is required but missing in record 0! [Op:DecodeCSV]
try:
    parsed_fields = tf.io.decode_csv('1, 2, 3, 4, 5, 6, 7', record_defaults_2)
except tf.errors.InvalidArgumentError as ex:
    print(ex)
Expect 5 fields but have 7 in record 0 [Op:DecodeCSV]
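Putting the pieces together, a line from the interleaved dataset can be parsed with decode_csv. This is a sketch under the assumption that each record holds 8 feature columns followed by 1 label column, matching the sample rows printed above; the parse_csv_line helper and the column count are illustrative, not part of the original code.

```python
import tensorflow as tf

# Hypothetical helper: parse one CSV line into a (features, label) pair.
# Assumes 8 feature columns followed by 1 label column (9 fields total).
def parse_csv_line(line, n_fields=9):
    # One scalar default per column; 0.0 makes every field float32
    defaults = [tf.constant(0.0)] * n_fields
    fields = tf.io.decode_csv(line, record_defaults=defaults)
    x = tf.stack(fields[:-1])  # features: shape (8,)
    y = tf.stack(fields[-1:])  # label: shape (1,)
    return x, y

x, y = parse_csv_line("1,2,3,4,5,6,7,8,9")
print(x.numpy())  # [1. 2. 3. 4. 5. 6. 7. 8.]
print(y.numpy())  # [9.]
```

In the full pipeline this helper would be applied with dataset.map(parse_csv_line), turning the interleaved line dataset into a dataset of (features, label) pairs ready for training.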