Python functions for getting system-related encodings

Latest recommended article published 2024-07-27 22:58:17

剑西楼

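How can you avoid errors like UnicodeEncodeError: 'ascii' codec can't…?

1. Declare the source encoding at the top of the .py file, for example: # coding: utf8

2. Save the file in the same encoding that the header declares.

3. When calling decode or encode, always make sure you know the original encoding of the text being converted.

For example, web pages usually declare their encoding (<meta http-equiv="content-type" content="text/html; charset=gb2312">). When you fetch such a page and convert its HTML, decode it with the declared encoding: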

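    import urllib2

    # url is the address of the page to fetch
    html = urllib2.urlopen(url).read()   # urlopen() returns a response object; read() gives the raw bytes
    html = html.decode('gb2312')         # decode using the charset the page declares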

If you follow the three rules above, you should avoid most encoding conversion errors.
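The third rule can be sketched in code. The example below uses Python 3 syntax (the article itself targets Python 2) and works on sample bytes rather than a real download. The helper name decode_html, the regular expression, and the utf-8 fallback are assumptions made for illustration, not part of any library.

```python
import re

# Hypothetical pattern: finds the charset named in a <meta> declaration.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def decode_html(raw, fallback="utf-8"):
    """Decode raw HTML bytes using the charset the page declares."""
    match = _CHARSET_RE.search(raw[:2048])          # declarations sit near the top
    encoding = match.group(1).decode("ascii") if match else fallback
    return raw.decode(encoding, errors="replace")   # never raise on stray bytes


# Sample bytes standing in for a downloaded page (no network access needed).
page = (
    '<meta http-equiv="content-type" content="text/html; charset=gb2312">'
    "<p>中文</p>"
).encode("gb2312")
text = decode_html(page)
```

A page that declares a charset Python does not know would still raise LookupError here; a fuller version would catch that and fall back.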

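The recommended practice in Python is to keep every string inside your code as unicode. The flow looks like this: input (convert to unicode) --> Python code --> output (convert to the target encoding)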
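A minimal sketch of that flow, written in Python 3 syntax where bytes plays the role of the Python 2 byte string and str plays the role of unicode. The function name transcode and the strip() step are placeholders for whatever processing your own code performs.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode at the input boundary, work on text, encode at the output boundary."""
    text = data.decode(source_encoding)    # bytes -> text as early as possible
    text = text.strip()                    # all processing happens on text
    return text.encode(target_encoding)    # text -> bytes as late as possible


gbk_bytes = " 中文 ".encode("gbk")          # pretend this arrived from a GBK source
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
```

Because the processing step only ever sees text, it does not need to know which encodings are used at either end.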

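sys.getdefaultencoding() returns the system default encoding (in Python 2 usually ascii), which Python uses for implicit conversions between byte strings and Unicode strings. Python 2 also assumes by default that source files are ASCII, which is why a declaration such as # coding: utf-8 is needed at the top of any .py file that contains non-ASCII characters.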

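Python functions for getting encoding-related system settings

Default encoding (usually ascii in Python 2): sys.getdefaultencoding()
Current system locale and its encoding: locale.getdefaultlocale()
Locale changed temporarily in code (via locale.setlocale(locale.LC_ALL, "zh_CN.UTF-8")): locale.getlocale()
File system encoding: sys.getfilesystemencoding()
Terminal input encoding: sys.stdin.encoding
Terminal output encoding: sys.stdout.encoding
Default encoding of the source code: the declaration at the top of the file, # -*- coding: utf-8 -*-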
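The functions listed above can be collected into one small report. This is a sketch in Python 3 syntax, so the values differ from the Python 2 ones the article describes (the default encoding is utf-8 rather than ascii). The function name encoding_report is made up for this example, and locale.getpreferredencoding(False) is swapped in for locale.getdefaultlocale(), which recent Python 3 releases deprecate.

```python
import locale
import sys


def encoding_report():
    """Collect the encoding-related settings listed above into a dict."""
    return {
        "default": sys.getdefaultencoding(),
        "locale": locale.getlocale(),                     # (language code, encoding)
        "preferred": locale.getpreferredencoding(False),  # stand-in for getdefaultlocale()
        "filesystem": sys.getfilesystemencoding(),
        # stdin/stdout may be replaced or missing in some environments
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for name, value in sorted(report.items()):
    print("%-10s %s" % (name, value))
```

Running it once in a terminal and once with output redirected to a file is a quick way to see which values depend on the environment.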

Source: http://justpy.com/archives/144

(2)

http://www.cnblogs.com/itrust/archive/2010/05/14/1735185.html

Strings

Python 2 has two kinds of strings:

    byteString = "hello world! (in my default locale)"
    unicodeString = u"hello Unicode world!"

Converting between them

    s = "hello normal string"
    u = unicode(s, "utf-8")
    backToBytes = u.encode("utf-8")
    backToUtf8 = backToBytes.decode('utf-8')  # same result as the unicode(s, "utf-8") call above

How to tell them apart

    if isinstance(s, str):         # False for Unicode strings
        pass
    if isinstance(s, unicode):     # True for Unicode strings
        pass
    if isinstance(s, basestring):  # True for both kinds of strings
        pass
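For comparison, here is a sketch of the same ideas in Python 3, where the two kinds are named bytes and str, and unicode and basestring no longer exist. The helper name describe is invented for this example.

```python
text = "hello Unicode world!"        # Python 3 str   ~ Python 2 unicode
data = text.encode("utf-8")          # Python 3 bytes ~ Python 2 str
round_trip = data.decode("utf-8")    # back to text again


def describe(value):
    """Return 'text', 'bytes', or 'other' depending on the string kind."""
    if isinstance(value, str):
        return "text"
    if isinstance(value, (bytes, bytearray)):
        return "bytes"
    return "other"
```

Unlike Python 2, Python 3 never converts between the two kinds implicitly, so mixing them raises a TypeError instead of silently using a default codec.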
        
 
  

An experiment

    # -*- coding: utf-8 -*-
    import sys

    print 'default encoding: ', sys.getdefaultencoding()
    print 'file system encoding: ', sys.getfilesystemencoding()
    print 'stdout encoding: ', sys.stdout.encoding
    print u'u"中文" is unicode: ', isinstance(u'中文', unicode)
    print u'"中文" is unicode: ', isinstance('中文', unicode)

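Look at the output and note the following facts:

Python's default encoding is ASCII. Python uses this default whenever it converts strings implicitly. Two examples:

1. a = "abc" + u"bcd": Python converts the byte string with "abc".decode(sys.getdefaultencoding()) and then concatenates the two Unicode strings.

2. print unicode('中文'): this statement fails with "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 …" because Python tries to decode the byte string with the default encoding, and the string is not ASCII. You therefore need to specify the encoding explicitly. If your source file is saved as utf-8, write: print unicode('中文', 'utf-8')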
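The failure in the second example can be reproduced explicitly. The sketch below uses Python 3 syntax and calls decode("ascii") by hand to stand in for the implicit conversion Python 2 performs. The variable names are illustrative only.

```python
raw = "中文".encode("utf-8")          # the bytes a UTF-8 source file stores for the literal

try:
    raw.decode("ascii")               # what Python 2 does implicitly with its default codec
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    first_bad_byte = raw[exc.start]   # 0xe4, the byte named in the error message

decoded = raw.decode("utf-8")         # naming the real encoding succeeds
```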

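On Windows, getfilesystemencoding returns mbcs (multi-byte character set). Windows' mbcs, also called ANSI, maps to a different encoding on each language version of Windows. On Chinese Windows it is one of the GB family of encodings.

On Chinese Windows the console encoding is cp936, and Python converts automatically when you print to the console. This leads to an interesting problem. Try this simple example, test.py:

    # -*- coding: utf-8 -*-
    s = u'中文'
    print s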

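Run both python test.py and python test.py > 1.txt from the console.

You will find that the second command fails. When printing to the console, Python automatically encodes the output to sys.stdout.encoding. When the output is redirected to a file, sys.stdout.encoding is None and Python does not convert the string for the write call, so it falls back to the default ASCII codec. The PrintFails page describes this problem in more detail.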
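One common way around this is to encode explicitly instead of relying on the stream's encoding. The sketch below uses Python 3 syntax, with an in-memory buffer standing in for the redirected file. The helper name make_text_writer is an assumption for this example.

```python
import io


def make_text_writer(byte_stream, encoding="utf-8"):
    """Wrap a binary stream so that text written to it is encoded explicitly."""
    return io.TextIOWrapper(byte_stream, encoding=encoding, newline="\n")


buffer = io.BytesIO()                 # stands in for the redirected output file
writer = make_text_writer(buffer)
writer.write("中文\n")
writer.flush()                        # push the encoded bytes into the buffer
written = buffer.getvalue()
```

In Python 2 a similar effect can be had by wrapping sys.stdout with codecs.getwriter("utf-8"), or by calling s.encode("utf-8") before printing.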

UTF-8 encoding

Reading a UTF-8 file

    import codecs

    fileObj = codecs.open("someFile", "r", "utf-8")
    u = fileObj.read()  # Returns a Unicode string from the UTF-8 bytes in the file

Writing the BOM yourself

    import codecs

    out = file("someFile", "w")
    out.write(codecs.BOM_UTF8)
    out.write(unicodeString.encode("utf-8"))
    out.close()
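The same steps can be sketched in Python 3 syntax and checked by reading the raw bytes back. A temporary file is used here in place of "someFile", and the content string is illustrative.

```python
import codecs
import os
import tempfile

unicode_string = "hello Unicode world!"   # illustrative content

# Write to a throwaway temporary file rather than a real "someFile".
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write(codecs.BOM_UTF8)                  # the three bytes EF BB BF
        out.write(unicode_string.encode("utf-8"))
    with open(path, "rb") as check:
        raw = check.read()                          # raw bytes, BOM included
finally:
    os.remove(path)
```

In Python 3, opening a file in text mode with encoding="utf-8-sig" writes the BOM for you, which may be simpler when you control both ends.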

Stripping the BOM yourself

For UTF-16, Python decodes the BOM to an empty string. For UTF-8, however, the BOM is decoded to a character, as shown here:

    >>> codecs.BOM_UTF16.decode("utf16")
    u''
    >>> codecs.BOM_UTF8.decode("utf8")
    u'\ufeff'
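The same comparison in Python 3 syntax, with one addition: the utf-8-sig codec, which is not mentioned in the article, drops a leading BOM while decoding. The variable names are illustrative.

```python
import codecs

utf16_bom = codecs.BOM_UTF16.decode("utf-16")        # BOM consumed as a byte-order hint
utf8_bom = codecs.BOM_UTF8.decode("utf-8")           # BOM kept as the character U+FEFF
utf8_sig_bom = codecs.BOM_UTF8.decode("utf-8-sig")   # BOM stripped by the -sig variant
```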

The UTF-16 codec consumes the BOM as a byte-order marker, while the plain UTF-8 codec returns it as the character U+FEFF. When reading a UTF-8 file you therefore need to strip the BOM yourself:

    import codecs

    if s.startswith(codecs.BOM_UTF8):
        # The byte string s begins with the BOM: do something,
        # for example decode the string as UTF-8.
        u = s.decode("utf-8")
    if u[0] == unicode(codecs.BOM_UTF8, "utf8"):
        # The unicode string begins with the BOM: do something,
        # for example remove the character.
        u = u[1:]
    # Strip the BOM from the beginning of the Unicode string, if it exists
    u = u.lstrip(unicode(codecs.BOM_UTF8, "utf8"))

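Encoding of source files

PEP 263 explains clearly how Python handles the encoding of source files. The key points are summarized here:

By default, Python 2 assumes that a source file is ASCII encoded.

You can add an encoding declaration on the first or second line of the file to tell Python which encoding the file uses, for example:

# -*- coding: utf-8 -*- # Check your editor and make sure the file is actually saved in this encoding

On platforms such as Windows, a BOM (the first three bytes of the file, \xef\xbb\xbf) is used to mark a file as utf-8 encoded. In that case:

If the file has no encoding declaration, Python treats it as utf8.
If the file has an encoding declaration that is not utf-8, Python reports an error.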
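These rules can be observed with the standard tokenize module. The sketch below runs under Python 3, whose tokenizer follows the same declaration and BOM rules but defaults to utf-8 rather than ASCII. The wrapper name source_encoding and the "error" return value are assumptions made for this example.

```python
import codecs
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python 3's tokenizer detects, or 'error' on a conflict."""
    try:
        encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    except SyntaxError:
        return "error"    # for example, a BOM combined with a non-utf-8 declaration
    return encoding


declared = source_encoding(b"# -*- coding: gbk -*-\nprint('x')\n")
bom_only = source_encoding(codecs.BOM_UTF8 + b"print('x')\n")
conflict = source_encoding(codecs.BOM_UTF8 + b"# -*- coding: gbk -*-\nprint('x')\n")
```

Here declared comes from the coding line, bom_only is reported as the BOM-aware utf-8 variant, and conflict shows the error case from the last rule.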

============== Additionally, about the BOM ================

(3)

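Some programs, such as Notepad, insert three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the start of a file when saving it as UTF-8.
When reading such a file we therefore need to remove these bytes ourselves. The codecs module in Python defines a constant for them:

    # coding=gbk
    import codecs

    data = open("Test.txt").read()
    if data[:3] == codecs.BOM_UTF8:
        data = data[3:]
    print data.decode("utf-8")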
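The check above can be wrapped in a small reusable helper. This is a sketch in Python 3 syntax that accepts either raw bytes or already decoded text. The function name strip_utf8_bom is invented for this example.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM; accepts bytes or text."""
    if isinstance(data, bytes):
        if data.startswith(codecs.BOM_UTF8):
            return data[len(codecs.BOM_UTF8):]
        return data
    if data.startswith("\ufeff"):
        return data[1:]      # drop exactly one BOM character
    return data


raw = codecs.BOM_UTF8 + "中文".encode("utf-8")   # bytes as Notepad would save them
clean_text = strip_utf8_bom(raw).decode("utf-8")
```

Slicing removes only a single leading BOM, whereas lstrip would also remove any further U+FEFF characters that happen to follow it.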
        
 
  
