Why does Python write a wrongly encoded line every other line?

I am trying to dump the contents of a table column from SQL Server 2K into text files, which I then want to process with Python and output as new text files.

My problem is that I can't get Python to use the correct encoding: while the input files look fine in my text editor, the output files have broken characters, apparently once every two lines.

My python code can be reduced to:

input = open('input', 'r')
string = input.read()

# Do stuff

output = open('output', 'w+')
output.write(string)

Printing this string in the Windows shell gives me the expected characters, though each one is separated from the next by one space too many.

But when I open the output file, every other line is broken (though the "added" whitespace has disappeared).

Some context: To dump the column to files, I'm using this script: spWriteStringTofile which I believe is using the default server encoding.

After some research, it appears that this encoding is SQL_Latin1_General_CP1_CI_AS. I tried adding # -*- coding: latin_1 -*- at the beginning of the script, I tried converting the collation inside SQL Server to Latin1_General_CI_AS, and I tried string.decode('latin_1').encode('utf8'), but none of it changed a thing (except the last attempt, which output only broken characters).

What can I try?

EDIT 2: I tried the newFile.write(line.decode('utf-16-be').encode('utf-16-le')) solution, which throws an error at the first line of my file. From the Python GUI:

(Pdb) print line

ÿþ

(Pdb) print repr(line)

'\xff\xfe\n'

(Pdb) line.decode('utf-16-be').encode('utf-16-le')

*** UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 2: truncated data

Only a newline appears in Sublime Text 2 for this first line...

When I bypass it (try: ... except: pass, quick & dirty), a newline is added between correct and incorrect lines, but the broken characters are still there.
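That '\xff\xfe\n' first line is the whole problem in miniature. A minimal sketch (Python 3 bytes syntax; in the question's Python 2, str is already a byte string) of how iterating a UTF-16-LE file in text mode strands the newline's high byte:

```python
import codecs

# Rebuild the situation from the pdb session: a little-endian UTF-16
# file (BOM b'\xff\xfe') in which each '\n' is stored as b'\n\x00'.
# The content is illustrative, mimicking the question's faulty line.
data = codecs.BOM_UTF16_LE + '\n\tas SoldDebitor,\r\n'.encode('utf-16-le')

# Python 2's text-mode line iteration splits on the single byte b'\n'
# and keeps it; emulate that here.
chunks = data.split(b'\n')
line1 = chunks[0] + b'\n'  # b'\xff\xfe\n': the BOM plus a bare newline
line2 = chunks[1] + b'\n'  # starts with the stranded b'\x00' high byte

# line1 is 3 bytes, an odd length, so any UTF-16 decode of it raises
# UnicodeDecodeError ("truncated data"), exactly the EDIT 2 error.
# line2, shifted by that stray zero byte, now decodes as BIG-endian:
print(repr(line2))
print(line2.decode('utf-16-be'))
```

This reproduces both symptoms at once: the undecodable '\xff\xfe\n' first line and the '\x00\t\x00...' byte pattern seen in the pdb output below.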

EDIT: I went through the document line by line

newFile = open('newfile', 'a+')

with open('input') as fp:
    for line in fp:
        import pdb
        pdb.set_trace()
        newFile.write(line)

In the pdb, on a faulty line:

(Pdb) print line

a s S o l d D e b i t o r , #

(Pdb) print repr(line)

'\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'

However, for some reason I couldn't copy/paste the printed line's value: I can copy the individual alphabetic characters, but not when I select the "whitespace" that is between them...

Input:

r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda

then isnull(Debit,0) else isnull(DevDebit,0) end)

else 0 end)

- Sum(case when DataInr >= @BeginDate and DataInr <= @Data2

then (case when @Deviza='' or @Deviza=@sMoneda

then isnull(Credit,0) else isnull(DevCredit,0) end)

else 0 end)

else 0 end

as SoldDebitor,

Output:

r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda

then isnull(Debit,0) else isnull(DevDebit,0) end)

਍ऀ                       攀氀猀攀   攀渀搀⤀ ഀഀ

- Sum(case when DataInr >= @BeginDate and DataInr <= @Data2

then (case when @Deviza='' or @Deviza=@sMoneda

then isnull(Credit,0) else isnull(DevCredit,0) end)

਍ऀ                       攀氀猀攀   攀渀搀⤀ ഀഀ

else 0 end

਍ऀ                 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀഀ

Solution

Your corrupted data is UTF-16, using big-endian byte order:

>>> line = '\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'

>>> line.decode('utf-16-be')

u'\t as SoldDebitor,\r\n'

but whatever is reading your file again is interpreting the UTF-16 data in little-endian byte order instead:

>>> print data.decode('utf-16-le')

ऀ                 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀ਀

That's most likely because you didn't include a BOM at the start of the file, or you mangled the input data.
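A tiny sketch (illustrative string only) of why the wrong byte order yields those CJK-looking characters: ASCII 'a' is b'\x00a' in UTF-16-BE, and those same two bytes read little-endian are U+6100, the 愀 that opens the garbled lines:

```python
# Encode a line big-endian, then mis-decode it little-endian, as the
# reader of the output file effectively does.
be_bytes = '\tas SoldDebitor,\r\n'.encode('utf-16-be')
garbled = be_bytes.decode('utf-16-le')
print(garbled)  # the same 愀猀 匀漀氀搀... garbage as in the output file
```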

You really should not be reading UTF-16 data in text mode without decoding it: newlines encoded as two bytes are almost guaranteed to be mangled, leading to off-by-one byte-order errors, which can make every other line (or almost every other line) come out garbled.

Use io.open() to read unicode data instead:

import io

with io.open('input', 'r', encoding='utf16') as infh:
    string = infh.read()

# Do stuff

with io.open('output', 'w+', encoding='utf16') as outfh:
    outfh.write(string)

because it appears your input file already has a UTF-16 BOM.

This does mean the rest of your code needs to be adjusted to handle Unicode strings instead of byte strings as well.
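A round-trip sketch of that advice (the temporary file paths are placeholders, not the asker's real files): the utf16 codec strips the BOM on read and writes one back on output, and newline='' stops Python from translating the \r\n pairs:

```python
import codecs
import io
import os
import tempfile

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'input')
dst = os.path.join(tmpdir, 'output')

# Create a UTF-16-LE input file with a BOM, like the SQL Server dump.
with open(src, 'wb') as f:
    f.write(codecs.BOM_UTF16_LE + 'as SoldDebitor,\r\n'.encode('utf-16-le'))

# Decode on read, re-encode on write; newline='' disables newline
# translation so the \r\n line endings pass through untouched.
with io.open(src, 'r', encoding='utf16', newline='') as infh:
    text = infh.read()
with io.open(dst, 'w', encoding='utf16', newline='') as outfh:
    outfh.write(text)

with open(dst, 'rb') as f:
    out_bytes = f.read()
```

The output file again starts with a BOM and decodes back to the original text, so a downstream reader sees the correct byte order.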
