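Learning NumPy's genfromtxt function
Python is the leading language for artificial intelligence research, and NumPy, its scientific computing library, is an indispensable dependency. This post introduces NumPy's genfromtxt function.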
Installation
Anaconda ships with NumPy installed by default. If you use the system Python on Ubuntu, you may need to install NumPy yourself:
sudo apt install python3-pip
pip3 install numpy
Function walkthrough
In PyCharm or at a command prompt, enter:
import numpy as np
help(np.genfromtxt)
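This prints the function's official documentation and examples. To keep the post readable, that output is placed at the end. The function's usage is explained below.
Function prototype: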
genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0,
skip_footer=0,deletechars=None, replace_space='_', autostrip=False, case_sensitive=True,
defaultfmt='f%i',unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None,
encoding='bytes')
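Parameter fname
The file to read, usually given as a path. It is the only parameter that has no default.
Parameter dtype
The data type of the returned array. The default is float; with dtype=None, the type of each column is inferred from its contents. A common approach is to read everything as str and convert to other types afterwards.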
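To make the effect of dtype concrete, here is a small sketch. The sample text and variable names are made up for illustration; it reads the same data once with the default float type and once with dtype=None.

```python
from io import StringIO
import numpy as np

text = "1,1.5,abc\n2,2.5,de"

# Default dtype (float): text that cannot be parsed as a number becomes nan.
as_float = np.genfromtxt(StringIO(text), delimiter=",")
print(as_float)

# dtype=None: each column's type is inferred separately,
# which yields a structured array with default field names f0, f1, f2.
inferred = np.genfromtxt(StringIO(text), delimiter=",",
                         dtype=None, encoding="utf-8")
print(inferred.dtype.names)
print(inferred["f2"])
```

With dtype=None the text column survives as strings, whereas the float read keeps the array two-dimensional at the cost of losing that column.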
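Parameter comments
The comment marker, '#' by default. Everything after this marker on the same line is treated as a comment and is not read. (If the optional parameter names=True is set, the first line is still read as the field names even when it starts with the comment marker.)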
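A minimal sketch of comment handling, using an invented three-line input: a line that is entirely a comment is skipped, and a trailing comment is cut off before the values are parsed.

```python
from io import StringIO
import numpy as np

text = """# measurements taken on day 1
1, 2, 3  # trailing note
4, 5, 6
"""

# The first line is entirely a comment and is skipped;
# on the second line only the part before '#' is parsed.
arr = np.genfromtxt(StringIO(text), delimiter=",", comments="#")
print(arr)
```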
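Parameter delimiter
The separator: the character or string that splits each line into columns, most often ','. By default, any run of whitespace acts as the separator. An integer or a sequence of integers can also be given, specifying the width of each fixed-width field, for example: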
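from io import StringIO
import numpy as np
# The spaces inside the string are significant: fields are cut at widths 4, 3 and 2.
data = "123456789\n   4  7 9\n  4567 9"
data_str = np.genfromtxt(StringIO(data), delimiter=(4, 3, 2))
print(data_str)
// Output:
[[1234., 567., 89.],
[ 4., 7., 9.],
[ 45., 67., 9.]]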
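Parameter skiprows
Removed in NumPy 1.10; use skip_header instead.
Parameter skip_header
An int: the number of lines to skip at the beginning of the file. The file header is usually not part of the data to be read.
Parameter skip_footer
The counterpart of skip_header: an int giving the number of lines to skip at the end of the file.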
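The following sketch, adapted from the example in the NumPy user guide, shows skip_header and skip_footer together on ten one-number lines.

```python
from io import StringIO
import numpy as np

# Ten lines containing the numbers 0..9, one per line.
data = "\n".join(str(i) for i in range(10))

# Drop the first 3 lines and the last 5 lines; only 3 and 4 remain.
arr = np.genfromtxt(StringIO(data), skip_header=3, skip_footer=5)
print(arr)
```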
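Parameter converters
Conversion functions that turn the data of a column into a value. A converter can also supply a default value for missing data by falling back to that default whenever the field is empty.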
from io import StringIO
import numpy as np

def conv_to_int(x):
    # Parse as float first so that "1.2" is accepted, then truncate to int.
    return int(float(x))

data = "1,1.2,123\n 2,2.2,78"
a = np.genfromtxt(StringIO(data), delimiter=",",
                  comments='#', converters={1: conv_to_int})
print(a)
print(type(a[0][1]))
// Output:
[(1., 1, 123.) (2., 2, 78.)]
<class 'numpy.int32'>
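converters={1: conv_to_int}
The dictionary key is the index of the column to convert and the value is the conversion function. The function can also be written inline, as in the official example converters = {3: lambda s: float(s or 0)}.
Note: a converter determines the type of its column. In the example above the other columns keep the default float type while column 1 becomes int, so the result is a structured array with mixed field types (the exact integer type, numpy.int32 here, depends on the platform and NumPy version). The author's suggestion when using converters is to set dtype to str and convert afterwards, rather than setting dtype to float, which conflicts with the converted int.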
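As a sketch of the "default value for missing data" use of converters, adapted from the NumPy user guide: the converter strips the field and falls back to -999 when nothing is left. The sentinel -999 is an arbitrary choice.

```python
from io import StringIO
import numpy as np

data = "1, , 3\n 4, 5, 6"

# Strip the field; an empty result falls back to the sentinel -999.
def fill_empty(field):
    return float(field.strip() or -999)

arr = np.genfromtxt(StringIO(data), delimiter=",", converters={1: fill_empty})
print(arr)
```

Because this converter returns float, the same type as the other columns, the result stays an ordinary 2-D float array.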
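Parameter missing
Removed in NumPy 1.10; use missing_values instead.
Parameter missing_values
By default an empty field is treated as missing. More elaborate strings, such as 'N/A' or '???', can also be declared as markers for missing data.
Parameter filling_values
The value(s) used to fill in when data are missing.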
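The two parameters are usually used together. The sketch below follows the example in the NumPy user guide: each column gets its own missing-data marker and its own fill value, addressed either by column index or by field name.

```python
from io import StringIO
import numpy as np

data = "N/A, 2, 3\n4, ,???"

arr = np.genfromtxt(
    StringIO(data),
    delimiter=",",
    dtype=int,
    names="a,b,c",
    # Per-column markers for missing data (keys: column index or field name).
    missing_values={0: "N/A", "b": " ", 2: "???"},
    # Per-column replacements used where a marker was found.
    filling_values={0: 0, "b": 0, 2: -999},
)
print(arr)
```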
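Parameter usecols
Which columns to read, counting from 0. By default all columns are read; a selection can also be given, for example:
usecols = (1, 4, 5)
// reads the 2nd, 5th and 6th columns.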
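A short runnable sketch of usecols, taken from the NumPy user guide; negative indices count from the last column.

```python
from io import StringIO
import numpy as np

data = "1 2 3\n4 5 6"

# Keep only the first and the last column.
arr = np.genfromtxt(StringIO(data), usecols=(0, -1))
print(arr)
```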
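Parameter names
Type: {None, True, str, sequence}.
True: the field names are read from the first line that remains after skip_header lines have been skipped.
A sequence, or a single string of comma-separated names: the names are used to define the fields of a structured dtype. The first line of the following sample file is such a list of names: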
user_id,item_id,behavior_type,user_geohash,item_category,time
99512554,37320317,3,94gn6nd,9232,2014-11-26 20
9909811,266982489,1,,3475,2014-12-02 23
98692568,27121464,1,94h63np,5201,2014-11-19 13
96089426,114407102,1,949g5i3,836,2014-11-26 07
90795949,402391768,1,94h6dlp,3046,2014-12-09 21
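Field types can be set with dtype, and names overrides any field names the dtype already defines. names defaults to None; if names is None and the dtype defines no field names, NumPy generates default names of the form "f%i", and the pattern can be changed with defaultfmt.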
from io import StringIO
import numpy as np
data = StringIO("1 2 3\n 4 5 6")
ndtype = [('a', int), ('b', float), ('c', int)]
names = ["A", "B", "C"]  # these replace the names 'a', 'b', 'c' from ndtype
np.genfromtxt(data, names=names, dtype=ndtype)
Output:
array([(1, 2.0, 3), (4, 5.0, 6)],dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')])
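For names=True, a sketch using the first rows of the sample file shown earlier; the choice of dtype=None and encoding="utf-8" is an assumption made here so that text columns come back as ordinary strings.

```python
from io import StringIO
import numpy as np

csv_text = """user_id,item_id,behavior_type,user_geohash,item_category,time
99512554,37320317,3,94gn6nd,9232,2014-11-26 20
9909811,266982489,1,,3475,2014-12-02 23
"""

# names=True takes the field names from the header line;
# dtype=None infers a type for each column.
table = np.genfromtxt(StringIO(csv_text), delimiter=",", names=True,
                      dtype=None, encoding="utf-8")
print(table.dtype.names)
print(table["user_id"])
```

Columns can then be addressed by name, for example table["item_category"].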
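Parameter excludelist
A list of names to exclude; it is appended to the default list ['return', 'file', 'print']. I am still not entirely clear on how it is meant to be used.
Another blog post offers an explanation, quoted here; since I cannot fully vouch for it, corrections are welcome:
excludelist: a list of names to exclude, such as return, file, print... If an input name appears in this list, an underscore ('_') is appended to it.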
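The help text at the end of this post says that excluded names get an underscore appended. A minimal sketch of that behaviour, with an invented field name "value" added to the exclusion list:

```python
from io import StringIO
import numpy as np

# "return" and "file" are on the default exclusion list;
# "value" is added to it here purely for illustration.
arr = np.genfromtxt(StringIO("1 2 3\n4 5 6"),
                    names="return,file,value",
                    excludelist=["value"])
print(arr.dtype.names)
```

So "excluding" a name does not drop the column; it only renames the field so that it does not clash with a reserved word.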
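Parameter deletechars
A string of all the invalid characters that must be removed from the names. The default set is ~!@#$%^&*()-=+~\|]}[{';: /?.>,<
Parameter defaultfmt
The format used to build default field names, such as "f%i" or "f_%02i"; see the names parameter above.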
Parameter autostrip
A bool: whether to automatically strip whitespace from the values.
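The effect of autostrip is easiest to see on string data. This sketch follows the NumPy user guide example; the encoding="utf-8" argument is added here as an assumption so that the values are read as ordinary strings.

```python
from io import StringIO
import numpy as np

data = "1, abc , 2\n 3, xxx, 4"

# Without autostrip, the spaces around each field are kept.
raw = np.genfromtxt(StringIO(data), delimiter=",", dtype="U5",
                    encoding="utf-8")

# With autostrip, leading and trailing whitespace is removed from every field.
stripped = np.genfromtxt(StringIO(data), delimiter=",", dtype="U5",
                         autostrip=True, encoding="utf-8")
print(raw)
print(stripped)
```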
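Parameter replace_space
The character(s) used to replace spaces in the names; '_' by default.
Parameter case_sensitive
Whether field names are case sensitive: kept as they are (case_sensitive=True), converted to upper case (case_sensitive=False or case_sensitive='upper'), or converted to lower case (case_sensitive='lower').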
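The name-cleaning parameters (deletechars, replace_space and case_sensitive) can be observed through the resulting dtype.names. The field names below are invented for illustration, and only the default deletechars and replace_space settings are exercised.

```python
from io import StringIO
import numpy as np

# Defaults: spaces become '_' and characters such as '(', ')' and '$' are removed.
cleaned = np.genfromtxt(StringIO("1 2\n3 4"), names="user id,price($)")
print(cleaned.dtype.names)

# case_sensitive='lower' lower-cases the names; False upper-cases them.
lower = np.genfromtxt(StringIO("1 2\n3 4"), names="Alpha,beta",
                      case_sensitive="lower")
upper = np.genfromtxt(StringIO("1 2\n3 4"), names="Alpha,beta",
                      case_sensitive=False)
print(lower.dtype.names, upper.dtype.names)
```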
Parameter unpack
A bool. When True, the returned array is transposed so that its columns can be unpacked, as in x, y, z = np.genfromtxt(...).
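A minimal sketch of unpack=True on a made-up 2x3 input: each variable receives one column.

```python
from io import StringIO
import numpy as np

# Two rows, three columns; unpack=True hands back one array per column.
x, y, z = np.genfromtxt(StringIO("1 2 3\n4 5 6"), unpack=True)
print(x, y, z)
```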
Parameter usemask
When True, a masked array is returned.
When False, a regular array is returned.
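A sketch of what the masked array looks like when a field is empty. The input is invented; the mask marks exactly the missing entry.

```python
from io import StringIO
import numpy as np

# The second field of the first row is empty, so it is treated as missing.
masked = np.genfromtxt(StringIO("1,,3\n4,5,6"), delimiter=",", usemask=True)
print(type(masked).__name__)
print(masked.mask)
```

Statistics computed on the masked array, such as masked.mean(), then ignore the masked entry.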
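Parameter loose
When True, invalid values do not raise errors.
Parameter invalid_raise
When True, an exception is raised if a line has an inconsistent number of columns.
When False, a warning is emitted and the offending lines are skipped.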
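The difference between the two invalid_raise settings can be sketched with an input whose middle line is one column short. The warning is silenced here only to keep the demonstration output clean.

```python
import warnings
from io import StringIO
import numpy as np

text = "1 2 3\n4 5\n6 7 8"  # the middle line has only two columns

# Default behaviour (invalid_raise=True): the inconsistent line raises ValueError.
try:
    np.genfromtxt(StringIO(text))
    raised = False
except ValueError:
    raised = True

# invalid_raise=False: the offending line is skipped and a warning is emitted.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    kept = np.genfromtxt(StringIO(text), invalid_raise=False)
print(raised)
print(kept)
```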
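Parameter max_rows
An int: the maximum number of rows to read. It must not be used together with skip_footer and, if given, must be at least 1. By default the whole file is read.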
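A small sketch of max_rows on ten one-number lines, including a check that combining it with skip_footer is rejected.

```python
from io import StringIO
import numpy as np

lines = "\n".join(str(i) for i in range(10))

# Read only the first three rows.
head = np.genfromtxt(StringIO(lines), max_rows=3)
print(head)

# max_rows and skip_footer cannot be combined; NumPy raises ValueError.
try:
    np.genfromtxt(StringIO(lines), max_rows=3, skip_footer=2)
    combined_ok = True
except ValueError:
    combined_ok = False
print(combined_ok)
```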
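Parameter encoding
The encoding used to decode the input file. It does not apply when fname is an already open file object. The special value 'bytes' (the default shown in the prototype above) enables backward-compatible behaviour that returns byte arrays where possible; set an explicit encoding to receive unicode strings.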
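To illustrate encoding with a file path, the sketch below writes a small Latin-1 file into a temporary directory and reads it back. The file name and its contents are invented for the example.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cities.csv")  # hypothetical file name
    # Write the sample in Latin-1 so that decoding with the right codec matters.
    with open(path, "w", encoding="latin-1") as fh:
        fh.write("1,Zürich\n2,Málaga\n")

    # Passing a path lets genfromtxt open and decode the file itself.
    arr = np.genfromtxt(path, delimiter=",", dtype=None, encoding="latin-1")

print(arr["f1"])
```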
Help output:
Help on function genfromtxt in module numpy.lib.npyio:
genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0,
deletechars=None, replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i',
unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None, encoding='bytes')
Load data from a text file, with missing values handled as specified.
Each line past the first `skip_header` lines is split at the `delimiter`
character, and characters following the `comments` character are discarded.
Parameters
----------
fname : file, str, pathlib.Path, list of str, generator
File, filename, list, or generator to read. If the filename
extension is `.gz` or `.bz2`, the file is first decompressed. Note
that generators must return byte strings in Python 3k. The strings
in a list or produced by a generator are treated as lines.
dtype : dtype, optional
Data type of the resulting array.
If None, the dtypes will be determined by the contents of each
column, individually.
comments : str, optional
The character used to indicate the start of a comment.
All the characters occurring on a line after a comment are discarded
delimiter : str, int, or sequence, optional
The string used to separate values. By default, any consecutive
whitespaces act as delimiter. An integer or sequence of integers
can also be provided as width(s) of each field.
skiprows : int, optional
`skiprows` was removed in numpy 1.10. Please use `skip_header` instead.
skip_header : int, optional
The number of lines to skip at the beginning of the file.
skip_footer : int, optional
The number of lines to skip at the end of the file.
converters : variable, optional
The set of functions that convert the data of a column to a value.
The converters can also be used to provide a default value
for missing data: ``converters = {3: lambda s: float(s or 0)}``.
missing : variable, optional
`missing` was removed in numpy 1.10. Please use `missing_values`
instead.
missing_values : variable, optional
The set of strings corresponding to missing data.
filling_values : variable, optional
The set of values to be used as default when the data are missing.
usecols : sequence, optional
Which columns to read, with 0 being the first. For example,
``usecols = (1, 4, 5)`` will extract the 2nd, 5th and 6th columns.
names : {None, True, str, sequence}, optional
If `names` is True, the field names are read from the first line after
the first `skip_header` lines. This line can optionally be proceeded
by a comment delimeter. If `names` is a sequence or a single-string of
comma-separated names, the names will be used to define the field names
in a structured dtype. If `names` is None, the names of the dtype
fields will be used, if any.
excludelist : sequence, optional
A list of names to exclude. This list is appended to the default list
['return','file','print']. Excluded names are appended an underscore:
for example, `file` would become `file_`.
deletechars : str, optional
A string combining invalid characters that must be deleted from the
names.
defaultfmt : str, optional
A format used to define default field names, such as "f%i" or "f_%02i".
autostrip : bool, optional
Whether to automatically strip white spaces from the variables.
replace_space : char, optional
Character(s) used in replacement of white spaces in the variables
names. By default, use a '_'.
case_sensitive : {True, False, 'upper', 'lower'}, optional
If True, field names are case sensitive.
If False or 'upper', field names are converted to upper case.
If 'lower', field names are converted to lower case.
unpack : bool, optional
If True, the returned array is transposed, so that arguments may be
unpacked using ``x, y, z = loadtxt(...)``
usemask : bool, optional
If True, return a masked array.
If False, return a regular array.
loose : bool, optional
If True, do not raise errors for invalid values.
invalid_raise : bool, optional
If True, an exception is raised if an inconsistency is detected in the
number of columns.
If False, a warning is emitted and the offending lines are skipped.
max_rows : int, optional
The maximum number of rows to read. Must not be used with skip_footer
at the same time. If given, the value must be at least 1. Default is
to read the entire file.
.. versionadded:: 1.10.0
encoding : str, optional
Encoding used to decode the inputfile. Does not apply when `fname` is
a file object. The special value 'bytes' enables backward compatibility
workarounds that ensure that you receive byte arrays when possible
and passes latin1 encoded strings to converters. Override this value to
receive unicode arrays and pass strings as input to converters. If set
to None the system default is used. The default value is 'bytes'.
.. versionadded:: 1.14.0
Returns
-------
out : ndarray
Data read from the text file. If `usemask` is True, this is a
masked array.
See Also
--------
numpy.loadtxt : equivalent function when no data is missing.
Notes
-----
* When spaces are used as delimiters, or when no delimiter has been given
as input, there should not be any missing data between two fields.
* When the variables are named (either by a flexible dtype or with `names`,
there must not be any header in the file (else a ValueError
exception is raised).
* Individual values are not stripped of spaces by default.
When using a custom converter, make sure the function does remove spaces.
References
----------
.. [1] NumPy User Guide, section `I/O with NumPy
<http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html>`_.
Examples
---------
>>> from io import StringIO
>>> import numpy as np
Comma delimited file with mixed dtype
>>> s = StringIO("1,1.3,abcde")
>>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
... ('mystring','S5')], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
Using dtype = None
>>> s.seek(0) # needed for StringIO example only
>>> data = np.genfromtxt(s, dtype=None,
... names = ['myint','myfloat','mystring'], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
Specifying dtype and names
>>> s.seek(0)
>>> data = np.genfromtxt(s, dtype="i8,f8,S5",
... names=['myint','myfloat','mystring'], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
An example with fixed-width columns
>>> s = StringIO("11.3abcde")
>>> data = np.genfromtxt(s, dtype=None, names=['intvar','fltvar','strvar'],
... delimiter=[1,3,5])
>>> data
array((1, 1.3, 'abcde'),
dtype=[('intvar', '<i8'), ('fltvar', '<f8'), ('strvar', '|S5')])