I am trying to dump the contents of a table column from SQL Server 2000 into text files, which I then want to process with Python and write out as new text files.
My problem is that I can't get Python to use the correct encoding: while the input files look fine in my text editor, the output files have broken characters, apparently on every other line.
My python code can be reduced to:
input = open('input', 'r')
string = input.read()
# Do stuff
output = open('output', 'w+')
output.write(string)
Printing this string in the Windows shell gives me the expected characters, though separated from each other by one extra space.
But when I open the output file, every other line is broken (though the "added" whitespace has disappeared).
Some context: to dump the column to files, I'm using this script: spWriteStringTofile, which I believe uses the server's default encoding.
After some research, it appears that this collation is SQL_Latin1_General_CP1_CI_AS. I tried adding # -*- coding: latin_1 -*- at the beginning of the script, I tried converting the collation inside SQL Server to Latin1_General_CI_AS, and I tried string.decode('latin_1').encode('utf8'), but nothing changed (except the last attempt, which output only broken characters).
What can I try?
EDIT 2: I tried the newFile.write(line.decode('utf-16-be').encode('utf-16-le')) solution, which throws an error at the first line of my file. From the pdb session:
(Pdb) print line
ÿþ
(Pdb) print repr(line)
'\xff\xfe\n'
(Pdb) line.decode('utf-16-be').encode('utf-16-le')
*** UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 2: truncated data
Only a newline appears in Sublime Text 2 for this first line...
When I bypass it (try: ... except: pass, quick & dirty), a newline is added between correct and incorrect lines, but the broken characters are still there.
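For what it's worth, the bytes pdb printed for that first line can be reproduced directly; this minimal sketch (byte literals taken from the pdb output above) shows that they are the UTF-16 little-endian byte order mark plus a stray linefeed, which is exactly why the big-endian decode reports truncated data:

```python
line = b'\xff\xfe\n'  # the repr pdb showed for the first "line"

# b'\xff\xfe' is the UTF-16 little-endian BOM; the endianness-agnostic
# 'utf-16' codec consumes it and yields an empty string:
assert b'\xff\xfe'.decode('utf-16') == u''

# Decoding the whole line as big-endian fails because the trailing
# 0x0a byte is only half of a two-byte code unit:
try:
    line.decode('utf-16-be')
except UnicodeDecodeError as e:
    assert 'truncated data' in str(e)
```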
EDIT: I went through the document line by line:
newFile = open('newfile', 'a+')
with open('input') as fp:
    for line in fp:
        import pdb
        pdb.set_trace()
        newFile.write(line)
In the pdb, on a faulty line:
(Pdb) print line
a s S o l d D e b i t o r , #
(Pdb) print repr(line)
'\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'
However, for some reason I couldn't copy/paste the print line value: I can copy the individual alphabetic characters, but not when I select the "whitespace" that is between them...
Input:
r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Debit,0) else isnull(DevDebit,0) end)
else 0 end)
- Sum(case when DataInr >= @BeginDate and DataInr <= @Data2
then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Credit,0) else isnull(DevCredit,0) end)
else 0 end)
else 0 end
as SoldDebitor,
Output:
r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Debit,0) else isnull(DevDebit,0) end)
ऀ 攀氀猀攀 攀渀搀⤀ ഀഀ
- Sum(case when DataInr >= @BeginDate and DataInr <= @Data2
then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Credit,0) else isnull(DevCredit,0) end)
ऀ 攀氀猀攀 攀渀搀⤀ ഀഀ
else 0 end
ऀ 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀഀ
Solution
Your corrupted data is UTF-16, using big-endian byte order:
>>> line = '\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'
>>> line.decode('utf-16-be')
u'\t as SoldDebitor,\r\n'
but whatever is reading your file again is interpreting the data as UTF-16 in little-endian byte order instead:
>>> print data.decode('utf-16-le')
ऀ 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀ
That's most likely because you didn't include a BOM at the start of the file, or because the input data was mangled.
You really should not read UTF-16 data in text mode without decoding it: newlines encoded in two bytes are almost guaranteed to be mangled, leading to off-by-one byte-order errors, which can in turn mangle every other line (or almost every other line).
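To illustrate the off-by-one effect, here is a minimal sketch with made-up two-character lines (not your actual data). Splitting UTF-16-LE bytes on a single 0x0a byte cuts the two-byte newline in half, and everything after the cut only decodes sensibly as big-endian:

```python
raw = u'ab\ncd\n'.encode('utf-16-le')
# raw == b'a\x00b\x00\n\x00c\x00d\x00\n\x00'

# Line-oriented byte reading effectively splits after each 0x0a byte,
# cutting the two-byte newline b'\n\x00' in half:
first, rest = raw.split(b'\n', 1)

assert first == b'a\x00b\x00'            # still valid little-endian
assert first.decode('utf-16-le') == u'ab'

# Everything after the split is shifted by one byte, so it now
# decodes as *big*-endian UTF-16 -- the mangled every-other-line effect:
assert rest[:-1] == b'\x00c\x00d\x00\n'
assert rest[:-1].decode('utf-16-be') == u'cd\n'
```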
Use io.open() to read unicode data instead:
import io

with io.open('input', 'r', encoding='utf16') as infh:
    string = infh.read()

# Do stuff

with io.open('output', 'w+', encoding='utf16') as outfh:
    outfh.write(string)
because it appears your input file already has a UTF-16 BOM.
This does mean the rest of your code needs to be adjusted to handle Unicode strings instead of byte strings as well.
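A minimal end-to-end sketch of that adjustment (the file names follow the question; the replace() call in the middle is just a hypothetical stand-in for the real processing):

```python
import io

# Create a sample UTF-16 input file (a stand-in for the SQL Server dump):
with io.open('input', 'w', encoding='utf-16') as f:
    f.write(u'as SoldDebitor,')

# Read it back as decoded text; the codec consumes the BOM:
with io.open('input', 'r', encoding='utf-16') as infh:
    text = infh.read()
assert text == u'as SoldDebitor,'

# "Do stuff" now operates on a unicode string, not raw bytes:
text = text.replace(u'Debitor', u'Creditor')

# Write the result back out as UTF-16, BOM included:
with io.open('output', 'w', encoding='utf-16') as outfh:
    outfh.write(text)
```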