Python - Reading a text file in a strange UTF-16 format

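Reading a UTF-16-LE encoded text file in Python ran into a problem: the original code produced garbled characters. The fix is to decode the data with `decode('utf-16-le')`, or to specify UTF-16-LE as the encoding when opening the file with `io.open` or the `codecs` module. The strings can then be converted to floats correctly.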

I'm trying to read a text file into python, but it seems to use some very strange encoding. I try the usual:

file = open('data.txt','r')
lines = file.readlines()
for line in lines[0:1]:
    print line,
    print line.split()

Output:

0.0200197 1.97691e-005

['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']

Printing the line works fine, but after I try to split the line so that I can convert it into a float, it looks crazy. Of course, when I try to convert those strings to floats, this produces an error. Any idea about how I can convert these back into numbers?

I put the sample datafile here if you would like to try to load it:

https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt

I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they also do not want to deal with this crazy file.

Solution

I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.

In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
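To see that byte layout for yourself, here is a small sketch. It is written for Python 3 (the snippets in the question are Python 2), and the sample string is just the first line of output from the question.

```python
# Python 3 sketch: how ASCII text looks once encoded as UTF-16-LE.
text = "0.0200197 1.97691e-005"

ascii_bytes = text.encode("ascii")
utf16_bytes = text.encode("utf-16-le")

# Every character becomes two bytes: the ASCII byte, then a NUL byte.
even_bytes = utf16_bytes[0::2]  # the original ASCII bytes
odd_bytes = utf16_bytes[1::2]   # the interleaved NUL bytes

print(len(ascii_bytes), len(utf16_bytes))
print(utf16_bytes[:8])
```

The even-indexed bytes are the familiar ASCII text and the odd-indexed bytes are all `\x00`, which is exactly the pattern in the split output above.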

To fix this, just decode the data:

print line.decode('utf-16-le').split()
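One caveat with decoding line by line: if the file was split into lines as raw bytes, each line ends right after the `\n` byte and the `\x00` that belongs to it lands at the start of the next line. Decoding the whole content first and splitting afterwards avoids that. Below is a rough Python 3 sketch; the sample contents and the temporary file path are made up for illustration and only mimic the format shown in the question.

```python
import os
import tempfile

# Made-up sample in the same format as the question's data file.
sample = "0.0200197 1.97691e-005\r\n0.0400394 2.5e-005\r\n"
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "wb") as handle:
    handle.write(sample.encode("utf-16-le"))

# Splitting the raw bytes on b"\n" cuts a two-byte unit in half,
# so the first raw line has an odd number of bytes.
with open(path, "rb") as handle:
    raw_lines = handle.readlines()
first_raw_line_length = len(raw_lines[0])

# Safer: decode everything, then split into lines and fields.
with open(path, "rb") as handle:
    text = handle.read().decode("utf-16-le")

values = [
    [float(token) for token in line.split()]
    for line in text.splitlines()
    if line.strip()
]
print(values)
```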

Or do the same thing at the file level with the io or codecs module:

import io

file = io.open('data.txt', 'r', encoding='utf-16-le')
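Putting it together, something along these lines should get you from the file to floats. This is only a sketch for Python 3: `read_float_rows` is a hypothetical helper name, and the sample file written at the bottom is invented to match the format in the question, since I'm assuming whitespace-separated numbers on every line.

```python
import io
import os
import tempfile


def read_float_rows(path, encoding="utf-16-le"):
    """Return one list of floats per non-empty line of the file."""
    rows = []
    with io.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            # 'utf-16-le' does not strip a byte order mark, so drop one
            # if the file happens to start with it.
            fields = line.lstrip("\ufeff").split()
            if fields:
                rows.append([float(field) for field in fields])
    return rows


# Demo on an invented sample file (not the real data.txt).
sample = "0.0200197 1.97691e-005\r\n0.0400394 2.5e-005\r\n"
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with io.open(path, "wb") as handle:
    handle.write(sample.encode("utf-16-le"))

rows = read_float_rows(path)
print(rows[0])
```

Because the decoding happens when the file is opened, each `line` is already proper text and `split()` plus `float()` behave the way you expected in the first place.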

* This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably didn't care about these details.
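If you do care, the footnote can be checked with a short Python 3 sketch. The two characters are arbitrary examples I picked: one inside the BMP and one outside it.

```python
import struct

bmp_char = "\u00e9"          # e with acute accent, inside the BMP
non_bmp_char = "\U0001F600"  # an emoji, outside the BMP

bmp_encoded = bmp_char.encode("utf-16-le")
non_bmp_encoded = non_bmp_char.encode("utf-16-le")

# A non-BMP character is stored as two 16-bit code units:
# a high surrogate followed by a low surrogate.
high, low = struct.unpack("<2H", non_bmp_encoded)

print(len(bmp_encoded), len(non_bmp_encoded))
print(hex(high), hex(low))
```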
