Python NumPy loadtxt: specifying the encoding with NumPy loadtxt / savetxt

Using the NumPy loadtxt and savetxt functions fails whenever non-ASCII characters are involved. These functions are primarily meant for numeric data, but alphanumeric headers/footers are also supported.

Both loadtxt and savetxt seem to apply the latin-1 encoding, which I find very much at odds with the rest of Python 3, which is thoroughly Unicode-aware and always seems to use utf-8 as the default encoding.

Given that NumPy hasn't moved to utf-8 as the default encoding, can I at least change the encoding away from latin-1, whether via some built-in function/attribute or a known hack, and whether just for loadtxt/savetxt or for NumPy in its entirety?

That this is not possible with Python 2 is forgivable, but it really should not be a problem when using Python 3. I've run into the problem with every combination of Python 3.x and many recent versions of NumPy that I have tried.

Example code

Consider the file data.txt with the content

# This is π
3.14159265359

Trying to load this with

import numpy as np

pi = np.loadtxt('data.txt')
print(pi)

fails with a UnicodeEncodeError exception, stating that the latin-1 codec can't encode the character '\u03c0' (the π character).

This is frustrating because π is only present in a comment/header line, so there is no reason for loadtxt to even attempt to encode this character.

I can successfully read in the file by explicitly skipping the first row, using pi = np.loadtxt('data.txt', skiprows=1), but it is inconvenient to have to know the exact number of header lines.
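One way to avoid counting header lines is to decode the file yourself and hand only the data lines to loadtxt. The following is a minimal sketch, assuming the file is UTF-8, comment lines start with '#', and the data lines contain only plain ASCII numbers. The helper name load_skipping_comments is hypothetical and not part of NumPy.

```python
import numpy as np

# Recreate the example file from the question (UTF-8 encoded).
with open('data.txt', 'w', encoding='utf-8') as f:
    f.write('# This is π\n3.14159265359\n')


def load_skipping_comments(path, comment='#', encoding='utf-8'):
    """Hypothetical helper: drop comment lines before NumPy ever sees them."""
    with open(path, encoding=encoding) as f:
        # Keep only lines that do not start with the comment marker.
        data_lines = [line for line in f
                      if not line.lstrip().startswith(comment)]
    # loadtxt also accepts a list of strings instead of a filename.
    return np.loadtxt(data_lines)


pi = load_skipping_comments('data.txt')
print(pi)
```

Because the non-ASCII comment never reaches loadtxt, this should not depend on which encoding NumPy uses internally.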

The same exception is thrown if I try to write a unicode character using savetxt:

np.savetxt('data.txt', [3.14159265359], header='This is π')

To accomplish this task successfully, I first have to write the header by some other means, and then save the data to a file object opened with the 'a+b' mode, e.g.

with open('data.txt', 'w') as f:
    f.write('# This is π\n')

with open('data.txt', 'a+b') as f:
    np.savetxt(f, [3.14159265359])

which, needless to say, is both ugly and inconvenient.
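If the two-step write is needed anyway, it can at least be wrapped in one place. Here is a minimal sketch, assuming the header should be stored as UTF-8 and prefixed with '# ' the way savetxt does by default. The function name savetxt_with_utf8_header is hypothetical.

```python
import numpy as np


def savetxt_with_utf8_header(path, data, header, comments='# '):
    """Hypothetical helper: write a UTF-8 header, then the numeric data."""
    # A single binary-mode handle is used for both the header and the data.
    with open(path, 'wb') as f:
        for line in header.splitlines():
            # Encode the header ourselves so NumPy never has to.
            f.write((comments + line + '\n').encode('utf-8'))
        # The numeric output is plain ASCII, so no encoding issue arises here.
        np.savetxt(f, data)


savetxt_with_utf8_header('data.txt', [3.14159265359], 'This is π')
```

Opening the file once in binary mode avoids the reopen in 'a+b' mode, at the cost of encoding the header by hand.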

Solution

I settled on the solution by hpaulj, which I thought would be nice to spell out fully. Near the top of my program I now do

import numpy as np

asbytes = lambda s: s if isinstance(s, bytes) else str(s).encode('utf-8')
asstr = lambda s: s.decode('utf-8') if isinstance(s, bytes) else str(s)

np.compat.py3k.asbytes = asbytes
np.compat.py3k.asstr = asstr
np.compat.py3k.asunicode = asstr

np.lib.npyio.asbytes = asbytes
np.lib.npyio.asstr = asstr
np.lib.npyio.asunicode = asstr

after which np.loadtxt and np.savetxt handle Unicode correctly.

Note that for newer versions of NumPy (I can confirm 1.14.3, but probably somewhat older versions as well) this trick is not needed, as it seems that Unicode is now handled properly by default.
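On those newer versions, loadtxt and savetxt also accept an encoding keyword (added in NumPy 1.14), so the encoding can be stated explicitly instead of relying on the default. Here is a short sketch, assuming NumPy 1.14 or later. The file name data_utf8.txt is only an illustration.

```python
import numpy as np

# savetxt prepends the default comment marker '# ' to the header itself.
np.savetxt('data_utf8.txt', [3.14159265359],
           header='This is π', encoding='utf-8')

# Read the file back, decoding it explicitly as UTF-8.
pi = np.loadtxt('data_utf8.txt', encoding='utf-8')
print(pi)
```

Passing encoding explicitly keeps the result independent of the platform's locale settings.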

Answer

At least for savetxt, the encoding is handled in the following two functions:

Signature: np.lib.npyio.asbytes(s)
Source:
def asbytes(s):
    if isinstance(s, bytes):
        return s
    return str(s).encode('latin1')
File: /usr/local/lib/python3.5/dist-packages/numpy/compat/py3k.py
Type: function

Signature: np.lib.npyio.asstr(s)
Source:
def asstr(s):
    if isinstance(s, bytes):
        return s.decode('latin1')
    return str(s)
File: /usr/local/lib/python3.5/dist-packages/numpy/compat/py3k.py
Type: function

The header is written to the file (opened in 'wb' mode) with

header = header.replace('\n', '\n' + comments)

fh.write(asbytes(comments + header + newline))
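The failure can be reproduced without NumPy at all. The sketch below uses two stand-in functions (asbytes_latin1 and asbytes_utf8 are illustrative names that mirror the asbytes source shown above with different codecs) to show why the header cannot be encoded with latin1 but can be with utf-8.

```python
def asbytes_latin1(s):
    """Stand-in mirroring the latin1-based asbytes shown above."""
    if isinstance(s, bytes):
        return s
    return str(s).encode('latin1')


def asbytes_utf8(s):
    """Same logic, but encoding with utf-8 instead."""
    if isinstance(s, bytes):
        return s
    return str(s).encode('utf-8')


header_line = '# This is π\n'

try:
    asbytes_latin1(header_line)
    latin1_failed = False
except UnicodeEncodeError as exc:
    # π (U+03C0) lies outside the 256 code points latin1 can represent.
    latin1_failed = True
    print(exc)

# utf-8 can represent every Unicode code point, so this succeeds.
encoded = asbytes_utf8(header_line)
print(encoded)
```

This is why swapping the codec inside asbytes/asstr is enough to make the header go through.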

The related question "Write numpy unicode array to a text file" has some of my previous explorations. There I focused on characters in the data, not the header.
