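The setup is as follows:
In the site-packages directory under Python's lib directory, create a new file named sitecustomize.py, for example:
C:\Python27\lib\site-packages\sitecustomize.py
Enter the following content, then save and close the file.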
#! encoding: utf-8
# sitecustomize.py
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('iso-8859-1')  # try each in turn: ascii (the default), UTF-8, gb2312
Restart the Python IDE after each change.
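Before comparing results, it may help to see which encodings are actually in play. The sketch below is my own illustration (the variable names are not from the original setup) and is written to run on both Python 2.7 and Python 3. It prints three separate settings: the default encoding that sitecustomize.py changes, the encoding of the console that print writes to, and the encoding preferred by the operating system locale. As far as I understand it, the default encoding only governs implicit conversions between byte strings and unicode strings, so the other two can matter more for what appears on screen.

```python
import locale
import sys

# The value that sys.setdefaultencoding() changes on Python 2.
default_encoding = sys.getdefaultencoding()

# The encoding print uses for the console; may be None when output is redirected.
stdout_encoding = getattr(sys.stdout, 'encoding', None)

# The encoding preferred by the operating system locale
# (for example cp936 on a Chinese-locale Windows).
preferred_encoding = locale.getpreferredencoding()

print('default   : %s' % default_encoding)
print('stdout    : %s' % stdout_encoding)
print('preferred : %s' % preferred_encoding)
```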
The results are as follows:
1. iso-8859-1
>>> import sys
>>> sys.getdefaultencoding()
'iso-8859-1'
>>> s=u'我是中国人'
>>> s
u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
ÎÒÊÇÖÐ¹úÈË
>>> s='我是中国人'
>>> s
'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
我是中国人
>>>
2. ascii (the default encoding; there is no need to create sitecustomize.py)
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> s=u'我是中国人'
>>> s
u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
ÎÒÊÇÖÐ¹úÈË
>>> s='我是中国人'
>>> s
'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
我是中国人
>>>
3. UTF-8
>>> import sys
>>> sys.getdefaultencoding()
'UTF-8'
>>> s=u'我是中国人'
>>> s
u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
ÎÒÊÇÖÐ¹úÈË
>>> s='我是中国人'
>>> s
'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
我是中国人
>>>
4. gb2312
>>> import sys
>>> sys.getdefaultencoding()
'gb2312'
>>> s=u'我是中国人'
>>> s
u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
ÎÒÊÇÖÐ¹úÈË
>>> s='我是中国人'
>>> s
'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
我是中国人
>>>
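Notice that all four produce exactly the same output, which I found frustrating. So what is this setting actually for?
According to the book, once the default encoding is set, it can be used like this:
>>> s=u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb' # corresponds exactly to '我是中国人'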
>>> s
u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
ÎÒÊÇÖÐ¹úÈË
>>>
But the result is the same under all four encodings, which left me even more confused.
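One way to see where the choice of encoding would make a difference is to decode the same ten bytes explicitly with each of the four codecs. This is only a sketch, not what the interpreter does internally: the names `raw` and `results` are my own, and the explicit `decode` call stands in for the implicit conversion that the default encoding controls on Python 2. A literal typed at the prompt does not seem to go through that conversion, which would explain why the four sessions above look identical.

```python
# The GB2312 bytes of the five-character test string used above.
raw = b'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

results = {}
for name in ('ascii', 'utf-8', 'iso-8859-1', 'gb2312'):
    try:
        results[name] = raw.decode(name)
    except UnicodeDecodeError as exc:
        # ascii and utf-8 cannot interpret these bytes at all.
        results[name] = 'error: %s' % exc.reason

for name in sorted(results):
    # Print with escapes so the output is safe on any console encoding.
    shown = results[name].encode('unicode_escape').decode('ascii')
    print('%-10s %s' % (name, shown))
```

Only gb2312 yields the intended Chinese text. iso-8859-1 maps every byte to one character and so produces the garbled form seen earlier, while ascii and utf-8 reject the bytes outright.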
Just now I tried reading an XML file that contains Chinese text and got an error. The XML is:
<?xml version="1.0" encoding="gb2312"?>
<preface>
<title>我是中国人</title>
</preface>
The Python IDE session:
>>> from xml.dom import minidom
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
xmldoc=minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 3, column 10
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
>>>
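I had no idea what the problem was. Encoding names are generally case-insensitive, but on the off chance I changed the encoding in the XML declaration to GB2312 and read the file again: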
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
>>>
Huh, why? The exception is gone.
So be it. For the moment I concluded that encoding names in XML, HTML and elsewhere are best written in uppercase.
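For what it is worth, Python's own codec registry treats encoding names case-insensitively, and the XML 1.0 specification also says processors should match encoding names without regard to case. The small check below uses `codecs.lookup` to compare the lowercase and uppercase spellings; the `pairs` list and the `same_codec` dictionary are my own illustration. It says nothing about how the underlying expat parser handles the name in the declaration, so the uppercase advice is best treated as unproven.

```python
import codecs

# codecs.lookup() normalizes the name before searching the codec registry.
pairs = [('gb2312', 'GB2312'), ('utf-8', 'UTF-8'), ('iso-8859-1', 'ISO-8859-1')]

same_codec = {}
for lower, upper in pairs:
    same_codec[lower] = codecs.lookup(lower).name == codecs.lookup(upper).name

print(same_codec)
```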
But then another problem appeared when I tried to print what had been read:
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
>>> title=xmldoc.getElementsByTagName('title')[0].firstChild.data
>>> title
u'\u6445\u646e\ufae0$\u6563\u726f\u876e\u5f73\u1a50\u01fe'
>>> print title
摅摮$散牯蝮彳ᩐǾ
[Note] This... this is garbled again!
What about converting the text to another encoding before printing?
>>> title.encode('GB2312')
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
converttitle=title.encode('GB2312')
UnicodeEncodeError: 'gb2312' codec can't encode character u'\u646e' in position 1: illegal multibyte sequence
>>> title.encode('UTF-8')
'\xe6\x91\x85\xe6\x91\xae\xef\xab\xa0$\xe6\x95\xa3\xe7\x89\xaf\xe8\x9d\xae\xe5\xbd\xb3\xe1\xa9\x90\xc7\xbe'
>>> print title.encode('UTF-8')
摅摮$散牯蝮彳ᩐǾ
>>> a='我是中国人'
>>> a
'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> converttitle=title.encode('gb2312')
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
converttitle=title.encode('gb2312')
UnicodeEncodeError: 'gb2312' codec can't encode character u'\u646e' in position 1: illegal multibyte sequence
>>> converttitle=title.encode('ascii')
Traceback (most recent call last):
File "<pyshell#13>", line 1, in <module>
converttitle=title.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> converttitle=title.encode('iso-8859-1')
Traceback (most recent call last):
File "<pyshell#14>", line 1, in <module>
converttitle=title.encode('iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
>>>
What now? Only utf-8 encodes without an error, and the output is still garbled. This needs further investigation.
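Re-encoding can only undo a wrong decoding when the wrong codec was lossless, which is true of ISO-8859-1 because it maps every byte to exactly one code point. The sketch below shows that round trip on the test string from the first experiments; `mojibake`, `original_bytes` and `repaired` are illustrative names of my own. The `title` value above does not appear to be text of that kind, which would be consistent with none of the `encode` calls recovering it.

```python
# Text that was GB2312 bytes wrongly decoded as ISO-8859-1.
mojibake = u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

# Step 1: go back to the original bytes using the same wrong codec.
original_bytes = mojibake.encode('iso-8859-1')

# Step 2: decode those bytes with the codec they were really written in.
repaired = original_bytes.decode('gb2312')

# Print with escapes so the output is safe on any console encoding.
print(repaired.encode('unicode_escape').decode('ascii'))
```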
Another experiment: with the XML still declared as GB2312, I set the user configuration file sitecustomize.py to iso-8859-1, ascii, utf-8 and GB2312 in turn, and read the XML each time.
>>> import sys
>>> sys.getdefaultencoding()
'GB2312'
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc = minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: unknown encoding: line 1, column 30
>>>
>>> import sys
>>> sys.getdefaultencoding()
'UTF-8'
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc = minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 3, column 10
>>>
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc = minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 3, column 10
>>>
>>> import sys
>>> sys.getdefaultencoding()
'iso-8859-1'
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc = minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: unknown encoding: line 1, column 30
>>>
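[Note] Notice that UTF-8 and ASCII raise ExpatError: not well-formed (invalid token): line 3, column 10. Wasn't that the error from the earlier uppercase/lowercase problem? And why do GB2312 and iso-8859-1 raise ExpatError: unknown encoding: line 1, column 30? Let me delete the configured encoding and try again: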
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> from xml.dom import minidom
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc=minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: unknown encoding: line 1, column 30
>>>
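Wrong, wrong: this overturns the earlier uppercase/lowercase explanation. And now it is a mess, because both gb2312 and GB2312 give "unknown encoding".
Try again: keep the configuration file but make it do nothing, leaving only import sys and commenting out the line below it. The IDE result: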
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> from xml.dom import minidom
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc=minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 3, column 10
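Isn't this the same error as in the earlier uppercase/lowercase case? So changing the case was useless after all. Then how on earth do I read the XML content? Was the one successful read earlier just a coincidence?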
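One workaround that avoids depending on the parser's support for GB2312 is to decode the bytes in Python first and hand UTF-8 to minidom. This is a sketch under assumptions: `parse_xml_bytes` is a hypothetical helper of my own, not part of `xml.dom`, the caller has to know the real encoding of the bytes, and the declaration is assumed to use double quotes as in the file above.

```python
import re
from xml.dom import minidom


def parse_xml_bytes(raw, source_encoding):
    """Decode raw XML bytes ourselves, then pass UTF-8 bytes to the parser."""
    text = raw.decode(source_encoding)
    # Rewrite the declared encoding so it matches the bytes we pass on.
    text = re.sub(r'encoding="[^"]*"', 'encoding="utf-8"', text, count=1)
    return minidom.parseString(text.encode('utf-8'))


# In-memory stand-in for a GB2312 file with the same content as the sample.
sample = (u'<?xml version="1.0" encoding="gb2312"?>\n'
          u'<preface>\n'
          u'<title>\u6211\u662f\u4e2d\u56fd\u4eba</title>\n'
          u'</preface>\n').encode('gb2312')

doc = parse_xml_bytes(sample, 'gb2312')
title = doc.getElementsByTagName('title')[0].firstChild.data
print(title.encode('unicode_escape').decode('ascii'))
```

For a real file, the bytes would come from opening it in binary mode and reading the whole content.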
I was thoroughly confused, but I then copied an XML file from QQ, SSOConfig.xml, and modified it as follows:
<?xml version="1.0" encoding="utf-8" ?>
<i18n>
<StringBundle>
地区信息,目前只需要一个, SSOPlatform不需要地区信息
</StringBundle>
</i18n>
>>> xmldoc=minidom.parse('./mytest/SSOConfig.xml')
>>>
Finally it works. Then I tried copying this content into russiansample.xml:
<?xml version="1.0" encoding="utf-8" ?>
<i18n>
<StringBundle>
地区信息,目前只需要一个, SSOPlatform不需要地区信息
</StringBundle>
<preface>
<title>
我是中国人
</title>
</preface>
</i18n>
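This XML is fine, right? I thought so, yet I got the same exception. Identical file contents, and merely a different file name causes an error? I refused to believe it, and at last I found the problem. The key piece of information is the encoding of the file on disk! I struggled with this for a long time, so try it yourself: open russiansample.xml, choose Save As, change the encoding from the default ANSI to UTF-8, then save and replace the file.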
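The same Save As step can also be done in code. The sketch below is an illustration only: `resave_as_utf8` is a hypothetical helper, and it assumes that "ANSI" on a Chinese-locale Windows means GBK (cp936), which should be checked for the machine in question. It works on a throwaway temporary file rather than the real russiansample.xml.

```python
import io
import os
import tempfile


def resave_as_utf8(path, source_encoding):
    """Rewrite the file at `path` in UTF-8; the caller supplies its current encoding."""
    # newline='' keeps the original line endings untouched in both directions.
    with io.open(path, 'r', encoding=source_encoding, newline='') as handle:
        text = handle.read()
    with io.open(path, 'w', encoding='utf-8', newline='') as handle:
        handle.write(text)


# Create a temporary file saved as GBK, standing in for an "ANSI" save (assumption).
fd, demo_path = tempfile.mkstemp(suffix='.xml')
os.close(fd)
content = (u'<?xml version="1.0" encoding="utf-8" ?>\n'
           u'<title>\u6211\u662f\u4e2d\u56fd\u4eba</title>\n')
with io.open(demo_path, 'wb') as handle:
    handle.write(content.encode('gbk'))

resave_as_utf8(demo_path, 'gbk')

with io.open(demo_path, 'rb') as handle:
    saved_bytes = handle.read()
os.remove(demo_path)

print(saved_bytes == content.encode('utf-8'))
```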
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
>>> xmldoc.getElementsByTagName('title')
[<DOM Element: title at 0x21f3328>]
>>> title = xmldoc.getElementsByTagName('title')[0].firstChild.data
>>> title
u'\n\t\t\u5730\u533a\u4fe1\u606f\uff0c\u76ee\u524d\u53ea\u9700\u8981\u4e00\u4e2a, SSOPlatform\u4e0d\u9700\u8981\u5730\u533a\u4fe1\u606f\n\t'
>>> print title
地区信息,目前只需要一个, SSOPlatform不需要地区信息
>>> c=title.encode('gb2312')
>>> c
'\n\t\t\xb5\xd8\xc7\xf8\xd0\xc5\xcf\xa2\xa3\xac\xc4\xbf\xc7\xb0\xd6\xbb\xd0\xe8\xd2\xaa\xd2\xbb\xb8\xf6, SSOPlatform\xb2\xbb\xd0\xe8\xd2\xaa\xb5\xd8\xc7\xf8\xd0\xc5\xcf\xa2\n\t'
>>> print c
地区信息,目前只需要一个, SSOPlatform不需要地区信息
>>>
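Success at last, and no transcoding is needed before printing, so I will stop experimenting here. One last remark: the encoding of the file itself matters a great deal, especially on Windows!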
Tested personally: when the encoding in the XML declaration is UTF-8, the file itself must be saved as UTF-8. Then the XML can be read directly from the IDE.
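A quick way to catch this kind of mismatch before parsing is to compare the declared encoding with the actual bytes. The following is a rough sketch with hypothetical helpers of my own (`declared_encoding`, `bytes_match_declaration`); it only tests whether the bytes decode cleanly under the declared name. It again assumes that an "ANSI" save on a Chinese-locale Windows produces GBK bytes.

```python
import re

# Matches the encoding name inside an XML declaration.
DECLARATION = re.compile(br'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(raw):
    """Return the encoding named in the XML declaration, or None."""
    match = DECLARATION.search(raw)
    return match.group(1).decode('ascii') if match else None


def bytes_match_declaration(raw):
    """True if the bytes decode cleanly under the declared encoding (UTF-8 if none)."""
    name = declared_encoding(raw) or 'utf-8'
    try:
        raw.decode(name)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


text = (u'<?xml version="1.0" encoding="utf-8" ?>\n'
        u'<title>\u6211\u662f\u4e2d\u56fd\u4eba</title>\n')
saved_as_utf8 = text.encode('utf-8')
saved_as_ansi = text.encode('gbk')  # assumed result of an "ANSI" save

print(bytes_match_declaration(saved_as_utf8))  # True
print(bytes_match_declaration(saved_as_ansi))  # False
```

A clean decode is necessary but not sufficient: a single-byte codec such as iso-8859-1 accepts any byte sequence, so this check cannot prove that a declaration is right, only that it is wrong.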
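Also: QQ seems to encode and save all of its XML files as utf-8, while Baidu seems to use gb2312 for most of them.
Since that one time when changing the case let me read the gb2312 XML, I have never managed to reproduce it. I do not even get garbled output any more, only exceptions. Prefer utf-8 wherever possible; in my experience it avoids most XML encoding problems.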
9/15/2013 17:57:47
Original work. Please include a link to this article when reposting. Thank you!