I am trying to dump the contents of a table column from SQL Server 2000 into text files, which I then want to process with Python and write out as new text files.
My problem is that I can't get Python to use the correct encoding: while the input files look fine in my text editor, the output files have broken characters, apparently on every other line.
My python code can be reduced to:
input = open('input', 'r')
string = input.read()
# Do stuff
output = open('output', 'w+')
output.write(string)
Printing this string in the Windows shell gives me the expected characters, though separated from each other by one extra space.
But when I open the output file, every other line is broken (though the "added" whitespace has disappeared).
Some context: to dump the column to files, I'm using this script: spWriteStringTofile, which I believe uses the server's default encoding.
After some research, it appears that this collation is SQL_Latin1_General_CP1_CI_AS. I tried adding # -*- coding: latin_1 -*- at the beginning of the script, I tried converting the collation inside SQL Server to Latin1_General_CI_AS, and I tried string.decode('latin_1').encode('utf8'), but nothing changed (except the last attempt, which output only broken characters).
What can I try?
EDIT 2: I tried the newFile.write(line.decode('utf-16-be').encode('utf-16-le')) solution, which throws an error at the first line of my file. From the pdb session:
(Pdb) print line
ÿþ
(Pdb) print repr(line)
'\xff\xfe\n'
(Pdb) line.decode('utf-16-be').encode('utf-16-le')
*** UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 2: truncated data
Only a newline appears in Sublime Text 2 for this first line...
When I bypass it (try: ... except: pass, quick & dirty), a newline is added between correct and incorrect lines, but the broken characters are still there.
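For what it's worth, the bytes pdb printed for that first line can be reproduced directly; this minimal sketch (byte literals taken from the pdb output above) shows that they are the UTF-16 little-endian byte order mark plus a stray linefeed, which is exactly why the big-endian decode reports truncated data:

```python
line = b'\xff\xfe\n'  # the repr pdb showed for the first "line"

# b'\xff\xfe' is the UTF-16 little-endian BOM; the endianness-agnostic
# 'utf-16' codec consumes it and yields an empty string:
assert b'\xff\xfe'.decode('utf-16') == u''

# Decoding the whole line as big-endian fails because the trailing
# 0x0a byte is only half of a two-byte code unit:
try:
    line.decode('utf-16-be')
except UnicodeDecodeError as e:
    assert 'truncated data' in str(e)
```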
EDIT: I went through the document line by line:
newFile = open('newfile', 'a+')
with open('input') as fp:
    for line in fp:
        import pdb
        pdb.set_trace()
        newFile.write(line)
In the pdb, on a faulty line:
(Pdb) print line
a s S o l d D e b i t o r , #
(Pdb) print repr(line)
'\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'
However, for some reason I couldn't copy/paste the print line value: I can copy the individual alphabetic characters, but not when I select the "whitespace" that is between them...
Input:
r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Debit,0) else isnull(DevDebit,0) end)
else 0 end)
- Sum(case when DataInr >= @BeginDate and DataInr <= @Data2
then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Credit,0) else isnull(DevCredit,0) end)
else 0 end)
else 0 end
as SoldDebitor,
Output:
r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Debit,0) else isnull(DevDebit,0) end)
ऀ 攀氀猀攀 攀渀搀⤀ ഀഀ
- Sum(case when DataInr >= @BeginDate and DataInr <= @Data2
then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Credit,0) else isnull(DevCredit,0) end)
ऀ 攀氀猀攀 攀渀搀⤀ ഀഀ
else 0 end
ऀ 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀഀ
Solution
Your corrupted data is UTF-16, using big-endian byte order:
>>> line = '\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'
>>> line.decode('utf-16-be')
u'\t as SoldDebitor,\r\n'
but whatever is reading your file again is interpreting the data as UTF-16 in little-endian byte order instead:
>>> print data.decode('utf-16-le')
ऀ 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀ
That's most likely because you didn't include a BOM at the start of the file, or because the input data was mangled.
You really should not read UTF-16 data in text mode without decoding it: newlines encoded in two bytes are almost guaranteed to be mangled, leading to off-by-one byte-order errors, which can in turn mangle every other line (or almost every other line).
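To illustrate the off-by-one effect, here is a minimal sketch with made-up two-character lines (not your actual data). Splitting UTF-16-LE bytes on a single 0x0a byte cuts the two-byte newline in half, and everything after the cut only decodes sensibly as big-endian:

```python
raw = u'ab\ncd\n'.encode('utf-16-le')
# raw == b'a\x00b\x00\n\x00c\x00d\x00\n\x00'

# Line-oriented byte reading effectively splits after each 0x0a byte,
# cutting the two-byte newline b'\n\x00' in half:
first, rest = raw.split(b'\n', 1)

assert first == b'a\x00b\x00'            # still valid little-endian
assert first.decode('utf-16-le') == u'ab'

# Everything after the split is shifted by one byte, so it now
# decodes as *big*-endian UTF-16 -- the mangled every-other-line effect:
assert rest[:-1] == b'\x00c\x00d\x00\n'
assert rest[:-1].decode('utf-16-be') == u'cd\n'
```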
Use io.open() to read unicode data instead:
import io

with io.open('input', 'r', encoding='utf16') as infh:
    string = infh.read()

# Do stuff

with io.open('output', 'w+', encoding='utf16') as outfh:
    outfh.write(string)
because it appears your input file already has a UTF-16 BOM.
This does mean the rest of your code needs to be adjusted to handle Unicode strings instead of byte strings as well.
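A minimal end-to-end sketch of that adjustment (the file names follow the question; the replace() call in the middle is just a hypothetical stand-in for the real processing):

```python
import io

# Create a sample UTF-16 input file (a stand-in for the SQL Server dump):
with io.open('input', 'w', encoding='utf-16') as f:
    f.write(u'as SoldDebitor,')

# Read it back as decoded text; the codec consumes the BOM:
with io.open('input', 'r', encoding='utf-16') as infh:
    text = infh.read()
assert text == u'as SoldDebitor,'

# "Do stuff" now operates on a unicode string, not raw bytes:
text = text.replace(u'Debitor', u'Creditor')

# Write the result back out as UTF-16, BOM included:
with io.open('output', 'w', encoding='utf-16') as outfh:
    outfh.write(text)
```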