Fixing garbled Chinese output from Python programs in the Windows terminal

Latest recommended article published 2022-02-25 08:15:10

可以要的

Article link: https://blog.csdn.net/ugfffj/article/details/84137302

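The problem

I recently moved a Python project over to Windows, and the Chinese text came out garbled, even though everything had run fine on Linux.

[Figure: garbled output from a Python program in the Windows terminal]

Heh. So much for any love I had left for Windows...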

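The cause

When a Python program prints garbage in the Windows terminal (cmd), the cause is string encoding.

Python source file encoding

Python (2.x) assumes a script file is ASCII by default. When the file contains characters outside the ASCII range, an "encoding declaration" is needed to tell the interpreter how to read it. If a module's .py file contains Chinese characters (strictly speaking, any non-ASCII character), the declaration must appear on the first or second line: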

# -*- coding=utf-8 -*-

or

#coding=utf-8

Other encodings such as gbk or gb2312 work too. Without a declaration, Python raises an exception like this:

SyntaxError: Non-ASCII character '\xe4' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

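In fact Python only looks for the #, the word coding followed by : or =, and the encoding name; the other characters are decoration. Python also supports many encodings, most of them with several aliases, and the names are case-insensitive; UTF-8, for example, can be written as u8. See http://docs.python.org/library/codecs.html#standard-encodings.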
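As a rough illustration of how little of the declaration line matters, the sketch below applies the pattern quoted in PEP 263 to a few candidate lines and resolves a couple of aliases. It is written for Python 3 so it can be run today; the helper names declared_encoding and canonical_name are made up for this example and are not part of Python.

```python
import codecs
import re

# Pattern quoted in PEP 263 for spotting an encoding declaration on line 1 or 2.
DECLARATION = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(line):
    """Return the encoding named in a comment line, or None if there is none."""
    match = DECLARATION.match(line)
    return match.group(1) if match else None


def canonical_name(alias):
    """Resolve an alias such as 'u8' to the name of the codec it refers to."""
    return codecs.lookup(alias).name


print(declared_encoding("# -*- coding=utf-8 -*-"))    # decorated form
print(declared_encoding("#coding=utf-8"))             # bare form
print(declared_encoding("# -*- conding: utf-8 -*-"))  # misspelt, so not recognised
print(canonical_name("u8"), canonical_name("UTF-8"))
```

The misspelt "conding" line is the kind of typo that silently leaves a file without any declaration at all.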

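Note also that the declared encoding must match the encoding the file was actually saved in; otherwise the source will very likely fail to parse correctly. Modern IDEs usually handle this for you and re-save the file in the new encoding when you change the declaration, but anyone using a plain text editor needs to be careful :)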
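One way to sanity-check that match is to try decoding the file's raw bytes with the declared encoding. The sketch below does this on in-memory bytes under Python 3; saved_as is a hypothetical helper name, and the sample source line is invented for the example.

```python
def saved_as(raw_bytes, declared):
    """Return True if raw_bytes can be decoded with the declared encoding."""
    try:
        raw_bytes.decode(declared)
    except UnicodeDecodeError:
        return False
    return True


source_text = "s = '\u4e2d\u6587'\n"   # a source line containing the characters 中文

print(saved_as(source_text.encode("utf-8"), "utf-8"))   # declaration matches the bytes
print(saved_as(source_text.encode("gbk"), "utf-8"))     # file was really saved as GBK
```

A successful decode is only a necessary condition: some byte sequences decode without error under the wrong codec and simply produce the wrong characters.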

To see the system default encoding:

import sys
print sys.getdefaultencoding()

To set the default encoding:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

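As for why sys has to be reloaded: during startup Python (in site.py) deletes sys.setdefaultencoding from the sys module, so the module must be reloaded with reload(sys) before sys.setdefaultencoding('utf-8') can be called.
That covers the basics. Next, let's look at how strings are encoded.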
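Everything above describes Python 2. For readers who try the same calls on Python 3, the small check below shows what to expect there: the default is already UTF-8 and there is no setdefaultencoding to restore. The function name default_encoding_report is invented for this sketch.

```python
import sys


def default_encoding_report():
    """Describe the interpreter's default encoding and whether it can be changed."""
    return {
        "default": sys.getdefaultencoding(),
        "has_setdefaultencoding": hasattr(sys, "setdefaultencoding"),
    }


print(default_encoding_report())
```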

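str and unicode

Both str and unicode are subclasses of basestring. Strictly speaking, a str is a byte string: the sequence of bytes produced by encoding a unicode object.
A str is a sequence whose "characters represent (at least) 8-bit bytes"; each unit of a unicode object is one Unicode character.
So len(u'中国') is 2, and len('ab') is also 2.

The documentation for str contains this sentence: "The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file." In other words, content read from a file or from the network is held as a str. To convert a str to a particular encoding, first decode it to unicode, then encode the unicode object to the target encoding, such as utf-8 or gb2312.

Calling len() on the UTF-8 encoded str '汉' gives 3, because the UTF-8 encoding of '汉' is '\xE6\xB1\x89'.
unicode is the true string type; you get one by decoding a byte string (str) with the correct character encoding, and len(u'汉') == 1.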
For example:

#coding=utf-8
import sys

if __name__ == "__main__":
    print sys.getdefaultencoding()    # default encoding is ascii
    s1 = '中文'
    print type(s1)      # <type 'str'>, holding UTF-8 bytes
    print len(s1)       # 6
    print s1            # garbled on a GBK (cp936) console
    s2 = u'中文'
    print type(s2)      # <type 'unicode'>
    print len(s2)       # 2
    print s2            # may raise UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
                        # s2 is unicode and the file is utf-8, but Python's built-in default encoding is ascii,
                        # which is what gets used when the output stream has no encoding of its own
16

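Now for encode() and decode(), two instance methods of basestring. Once the difference between str and unicode is clear, the two are no longer easy to confuse.

encode and decode

Reference: "Python字符编码详解" (a detailed guide to Python character encodings)

First, many encodings exist. For Chinese alone there are gb2312, GBK and gb18030 (the one covering the most Chinese characters).
The encoding most widely used internationally today is utf-8. (Search for Chinese mojibake on Baidu and the usual answer is to add a utf-8 coding declaration at the top of the .py file.)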

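# -*- coding: utf-8 -*-
xxx.decode()  decodes xxx into unicode, using the encoding given in the parentheses
xxx.encode()  encodes the unicode object xxx, using the encoding given in the parentheses

(So if xxx is not unicode, Python first decodes it with the default encoding and only then performs the encode step above.)
Below are a few conversion examples.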

Converting unicode to gb2312, utf-8, etc.

#coding=utf-8
# unicode -> gb2312, utf-8, etc.
if __name__ == "__main__":
    s = u'中国'
    s_gb = s.encode('gb2312')
    print s_gb

    # utf-8 or GBK -> unicode: use unicode(s, encoding) or s.decode(encoding)
    s = u'中国'
    # s is unicode; encode it to utf-8 first
    s_utf8 = s.encode('UTF-8')
    assert(s_utf8.decode('utf-8') == s)
    print s_utf8.decode('utf-8')

Converting an ordinary str to unicode

#coding=utf-8
import sys

# ordinary str -> unicode
if __name__ == '__main__':
    # commenting out the next few lines makes the final s.encode('gb2312') raise
    print sys.getdefaultencoding()
    reload(sys)
    sys.setdefaultencoding("utf-8")
    print sys.getdefaultencoding()

    s = '中国'
    su = u'中国'
    print s         # garbled
    print su        # not garbled

    # s is a str holding utf-8 bytes, because this .py file (# -*- coding=UTF-8 -*-) is saved as utf-8,
    # and sys.setdefaultencoding("utf-8") sets the default encoding to utf-8
    s_unicode = s.decode('UTF-8')
    assert(s_unicode == su)
    print s_unicode

    # to convert s to gb2312, decode to unicode first, then encode as gb2312
    print s.decode('utf-8').encode('gb2312')

    # what happens if s.encode('gb2312') is called directly?
    print s.encode('gb2312')

For comparison:

#coding=utf-8
import sys

if __name__ == '__main__':
    s = '中国'
    print s
    # what happens if s.encode('gb2312') is called directly?
    print s.encode('gb2312')

This raises an exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

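Python automatically decodes s to unicode first and then encodes the result as gb2312. Because the decoding step is implicit and we never said which encoding to use, Python falls back on the one reported by sys.getdefaultencoding(). In most cases that is ASCII, and if s is not ASCII the decode fails.
In the case above, the default encoding is ascii, while s is encoded the same way as the file, in utf8, hence the error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)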
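The hidden decode step can be made visible with a small simulation. The sketch below runs on Python 3 and only mimics what Python 2's str.encode() does; py2_style_encode is a made-up name, and the byte string stands in for the '中国' literal of a UTF-8 source file.

```python
def py2_style_encode(byte_string, target, default_encoding="ascii"):
    """Mimic Python 2's str.encode(): decode with the default codec, then encode."""
    return byte_string.decode(default_encoding).encode(target)


utf8_bytes = "\u4e2d\u56fd".encode("utf-8")   # the bytes of '中国' as stored in a UTF-8 file

try:
    py2_style_encode(utf8_bytes, "gb2312")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
print(error_message)

# Naming the real encoding of the bytes makes the same conversion succeed.
converted = py2_style_encode(utf8_bytes, "gb2312", default_encoding="utf-8")
print(converted)
```

Both fixes discussed in this post amount to supplying the right value for that implicit decode, either at the call site or globally.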
There are two ways to fix this.
The first is to state the encoding of s explicitly:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('gb2312')

The second is to change the default encoding to match the file's encoding:

#coding=utf-8
import sys

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding("utf-8")
    s = '中国'
    print s
    # s.encode('gb2312') now works, because the implicit decode uses utf-8
    print s.encode('gb2312')

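Python's internal encoding

The first thing to be clear about is that Python's internal representation of text is unicode. Encoding conversions therefore normally use unicode as the intermediate form: decode the string from its current encoding to unicode, then encode from unicode to the other encoding.

In some IDEs, printed strings always come out garbled or even raise errors. That is because the IDE's output console cannot display the string's encoding; the program itself is not at fault.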

#coding=utf-8
import sys

if __name__ == "__main__":
    # The file's encoding must match the one passed to s.decode('utf8'), or decoding raises.
    # s.decode("gbk", "ignore") or s.decode("gbk", "replace") avoids the exception.
    print sys.getdefaultencoding()    # default encoding is ascii
    s = '中文'                  # the file is UTF-8, so the next line raises nothing
    s.decode('utf8')            # decodes the utf-8 bytes into unicode (result discarded; s is unchanged)
    print s

    # UnicodeDecodeError: 'gbk' codec can't decode bytes in position 2-3: illegal multibyte sequence
    #s.decode('GBK')             # tries to decode the utf-8 bytes of s as GBK
    #print s

    # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
    #s.encode('GBK')             # the implicit ascii decode fails before the GBK encode
    #print s

    s.decode('gbk', "ignore")   # decode as gbk, dropping invalid byte sequences
    print s
    s.decode('gbk', 'replace')  # replace invalid sequences, which makes the bad spots easy to see
    print s

The file's encoding must match the encoding passed to s.decode('utf8'); otherwise a decoding exception is raised.
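To see what the "ignore" and "replace" handlers actually return, here is a small Python 3 sketch that decodes a deliberately damaged byte string three ways. The helper decode_report and the stray 0xFF byte are inventions for this example.

```python
damaged = "\u4e2d\u6587".encode("utf-8") + b"\xff"   # valid UTF-8 for 中文 plus one stray byte


def decode_report(data, encoding="utf-8"):
    """Decode data three ways and return the results keyed by error handler."""
    report = {}
    try:
        report["strict"] = data.decode(encoding)
    except UnicodeDecodeError as exc:
        report["strict"] = "error: " + exc.reason
    report["ignore"] = data.decode(encoding, "ignore")     # bad bytes are dropped
    report["replace"] = data.decode(encoding, "replace")   # bad bytes become U+FFFD
    return report


for handler, result in decode_report(damaged).items():
    print(handler, ascii(result))
```

"replace" is usually the better choice while debugging, since the U+FFFD markers show exactly where the bytes and the codec disagree.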

File encoding and print

Create a file Test.txt, saved as ANSI, with this content:

abc中文

Read it with the following Python code:

#coding=gbk
print open("Test.txt").read()

Output:

abc中文

Now save the file as UTF-8 instead.
Result:

abc涓枃

Clearly the content needs to be decoded:

# coding=gbk
import codecs
print open("Test.txt").read().decode("utf-8")

Result:

abc中文

I edited Test.txt above with EditPlus. When I edited it with the Notepad that ships with Windows and saved it as UTF-8,
running the script failed:

Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    print open("Test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

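It turns out that some programs, Notepad among them, insert three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the start of a file when saving it as UTF-8.
We therefore have to strip them ourselves when reading. Python's codecs module defines a constant for them: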

# coding=gbk
import codecs

data = open("Test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")

Result:

abc中文
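An alternative to slicing the bytes by hand is the utf-8-sig codec, which skips a leading BOM when there is one. The sketch below uses it on Python 3; it writes its own temporary file instead of relying on Test.txt, and read_utf8_text is a hypothetical helper name.

```python
import codecs
import os
import tempfile


def read_utf8_text(path):
    """Read a UTF-8 file as text, dropping a leading BOM if one is present."""
    with open(path, "r", encoding="utf-8-sig") as handle:
        return handle.read()


# Build a throwaway file the way Notepad would: BOM first, then the UTF-8 bytes.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(codecs.BOM_UTF8 + "abc\u4e2d\u6587".encode("utf-8"))

content = read_utf8_text(sample_path)
os.remove(sample_path)
print(ascii(content))   # escaped form, safe to print on any console
```

The same codec also reads files that have no BOM, so it can be used regardless of which editor saved the file.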

Summary

Ways to print Chinese in the Windows terminal

Printing Chinese by stating the encoding of s explicitly

#coding=utf-8
import sys

if __name__ == "__main__":
    #reload(sys)
    #sys.setdefaultencoding("utf-8")

    # the next two lines are equivalent
    print "中文".decode("utf-8")
    print u"中文"
    assert("中文".decode("utf-8") == u"中文")

    # the next two lines are equivalent
    print "中文".decode("utf-8").encode("GBK")    # decode the utf-8 bytes to unicode, then encode as GBK
    print u"中文".encode("GBK")                   # u"XXX" is already unicode, so encode it straight to GBK

    # same as above
    print "中文".decode("utf-8").encode('gb2312')
    print u"中文".encode('gb2312')
    print "中文".decode("utf-8").encode('cp936')
    print u"中文".encode('cp936')
    print u'中文'.encode('utf-8').decode('utf-8')

    # calling encode('gb2312') directly on a str raises UnicodeDecodeError
    print "中国".encode('gb2312')

Or

Printing Chinese by changing the default encoding to the file's encoding

The output methods shown above work in this mode as well.

#coding=utf-8
import sys

if __name__ == "__main__":
    # for when the default string encoding is not utf-8
    print sys.getdefaultencoding()
    reload(sys)
    sys.setdefaultencoding("utf-8")
    print sys.getdefaultencoding()

    # s.encode('gb2312') can now be called directly
    print "中国".encode('gb2312')
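Both approaches come down to handing the terminal bytes in the encoding it expects. As a hedged Python 3 sketch of that idea, the helper below (encode_for_console, a name invented here) encodes text for whatever encoding the output stream reports, falling back to UTF-8 when none is reported, and substitutes "?" for characters the target cannot represent instead of raising.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode text for a terminal, replacing characters the terminal cannot show."""
    target = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(target, errors="replace")


print(encode_for_console("\u4e2d\u6587", "gbk"))     # bytes a cp936 (GBK) console expects
print(encode_for_console("\u4e2d\u6587", "ascii"))   # unencodable characters become '?'
```

Passing the encoding explicitly, as in the two calls above, keeps the result independent of the machine the script happens to run on.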