Python 3.0 Encoding Changes

Text Vs. Data Instead Of Unicode Vs. 8-bit

Everything you thought you knew about binary data and Unicode has changed.

Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. All text is Unicode; however, encoded Unicode is represented as binary data. The type used to hold text is str, and the type used to hold data is bytes. The biggest difference from the 2.x situation is that any attempt to mix text and data in Python 3.0 raises TypeError, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get UnicodeDecodeError if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years.
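
The strictness described above can be seen in a minimal sketch. The variable names (text, data, mixing_error) are illustrative only and do not come from the original text.

```python
text = "caf\u00e9"            # str: Unicode text
data = text.encode("utf-8")   # bytes: the same text in encoded form

mixing_error = None
try:
    text + data               # mixing str and bytes is never allowed
except TypeError as exc:
    mixing_error = exc        # raised regardless of the byte values involved

print(type(mixing_error).__name__)
```

The error does not depend on whether the bytes happen to be ASCII, which is the point of the change.
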
As a consequence of this change in philosophy, pretty much all code that uses Unicode, encodings or binary data most likely has to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text. To be prepared in Python 2.x, start using unicode for all unencoded text, and str for binary or encoded data only. Then the 2to3 tool will do most of the work for you.
You can no longer use u"..." literals for Unicode text. However, you must use b"..." literals for binary data.
As the str and bytes types cannot be mixed, you must always explicitly convert between them. Use str.encode() to go from str to bytes, and bytes.decode() to go from bytes to str. You can also use bytes(s, encoding=...) and str(b, encoding=...), respectively.
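
As a rough illustration of the four conversion spellings just mentioned, assuming UTF-8 as the encoding (the choice of encoding and the variable names are assumptions made for this example):

```python
text = "\u20ac 10"                         # euro sign followed by " 10"

data = text.encode("utf-8")                # str -> bytes
round_trip = data.decode("utf-8")          # bytes -> str

same_data = bytes(text, encoding="utf-8")  # constructor form of encode()
same_text = str(data, encoding="utf-8")    # constructor form of decode()

print(data, round_trip == text)
```

Both pairs are equivalent; the method forms tend to read more naturally in a chain of calls.
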
Like str, the bytes type is immutable. There is a separate mutable type to hold buffered binary data, bytearray. Nearly all APIs that accept bytes also accept bytearray. The mutable API is based on collections.MutableSequence.
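
A short sketch of the difference between the immutable and mutable types; the names frozen, buf and result are made up for the example.

```python
frozen = b"spam"
buf = bytearray(frozen)          # mutable copy of the same bytes

buf[0] = ord("S")                # item assignment takes an integer 0..255
buf.extend(b" and eggs")         # sequence-style mutation

bytes_is_immutable = False
try:
    frozen[0] = ord("S")         # bytes does not support item assignment
except TypeError:
    bytes_is_immutable = True

result = bytes(buf)              # freeze back into an immutable object
print(result)
```
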
All backslashes in raw string literals are interpreted literally. This means that '\U' and '\u' escapes in raw strings are not treated specially. For example, r'\u20ac' is a string of 6 characters in Python 3.0, whereas in 2.6, ur'\u20ac' was the single “euro” character. (Of course, this change only affects raw string literals; the euro character is '\u20ac' in Python 3.0.)
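
The raw-string behavior can be checked directly. This is only a sketch, and raw and cooked are illustrative names:

```python
raw = r'\u20ac'      # backslash, 'u', '2', '0', 'a', 'c'
cooked = '\u20ac'    # the single euro character

print(len(raw), len(cooked))
```
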
The builtin basestring abstract type was removed. Use str instead. The str and bytes types don’t have enough functionality in common to warrant a shared base class. The 2to3 tool replaces every occurrence of basestring with str.
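
In practice, a type check that used basestring in 2.x becomes a check against str. The helper below is hypothetical and only illustrates that substitution:

```python
def is_text(obj):
    """Hypothetical helper: True only for text (str), never for bytes."""
    # In Python 2.x this would have been isinstance(obj, basestring).
    return isinstance(obj, str)

print(is_text("abc"), is_text(b"abc"))
```

Code that needs to accept both kinds of object has to check for bytes explicitly, since there is no shared base class.
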
Files opened as text files (still the default mode for open()) always use an encoding to map between strings (in memory) and bytes (on disk). Binary files (opened with a b in the mode argument) always use bytes in memory. This means that if a file is opened using an incorrect mode or encoding, I/O will likely fail loudly, instead of silently producing incorrect data. It also means that even Unix users will have to specify the correct mode (text or binary) when opening a file. There is a platform-dependent default encoding, which on Unixy platforms can be set with the LANG environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default. Any application reading or writing more than pure ASCII text should probably have a way to override the encoding. There is no longer any need for using the encoding-aware streams in the codecs module.
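
A minimal sketch of text versus binary mode, passing the encoding explicitly rather than relying on the platform default. The scratch file path and the variable names are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Text mode: str in memory, encoded to bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9")

with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()           # decoded back to str

# Binary mode: the raw bytes, with no decoding at all.
with open(path, "rb") as f:
    as_bytes = f.read()

# The wrong encoding fails loudly instead of returning garbage.
wrong_encoding_failed = False
try:
    with open(path, "r", encoding="ascii") as f:
        f.read()
except UnicodeDecodeError:
    wrong_encoding_failed = True

print(as_text, as_bytes, wrong_encoding_failed)
```
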
Filenames are passed to and returned from APIs as (Unicode) strings. This can present platform-specific problems because on some platforms filenames are arbitrary byte strings. (On the other hand, on Windows filenames are natively stored as Unicode.) As a work-around, most APIs (e.g. open() and many functions in the os module) that take filenames accept bytes objects as well as strings, and a few APIs have a way to ask for a bytes return value. Thus, os.listdir() returns a list of bytes instances if the argument is a bytes instance, and os.getcwdb() returns the current working directory as a bytes instance. Note that when os.listdir() returns a list of strings, filenames that cannot be decoded properly are omitted rather than raising UnicodeError.
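
The str-in, str-out and bytes-in, bytes-out behavior might be sketched as follows. Note that os.fsencode() is used here for convenience; it was added to the os module after Python 3.0, so treat it as an assumption of this example rather than part of the text above. The scratch directory and file name are likewise made up.

```python
import os
import tempfile

scratch = tempfile.mkdtemp()                       # hypothetical scratch directory
open(os.path.join(scratch, "note.txt"), "w").close()

names_as_str = os.listdir(scratch)                 # str argument -> list of str
names_as_bytes = os.listdir(os.fsencode(scratch))  # bytes argument -> list of bytes

cwd_bytes = os.getcwdb()                           # bytes variant of os.getcwd()

print(names_as_str, names_as_bytes)
```
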
Some system APIs like os.environ and sys.argv can also present problems when the bytes made available by the system are not interpretable using the default encoding. Setting the LANG variable and rerunning the program is probably the best approach.
PEP 3138: The repr() of a string no longer escapes non-ASCII characters. It still escapes control characters and code points with non-printable status in the Unicode standard, however.
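
A small sketch of the repr() change. The ascii() builtin, which PEP 3138 also introduced, is shown for comparison because it keeps the older escaping style; the variable names are illustrative.

```python
word = "caf\u00e9"

shown = repr(word)        # printable non-ASCII characters are kept as-is
escaped = ascii(word)     # ascii() still escapes everything non-ASCII
control = repr("a\tb")    # control characters are still escaped

print(shown, escaped, control)
```
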
PEP 3120: The default source encoding is now UTF-8.
PEP 3131: Non-ASCII letters are now allowed in identifiers. (However, the standard library remains ASCII-only with the exception of contributor names in comments.)
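
For illustration only, a few identifiers containing non-ASCII letters; the names and values are invented for this sketch.

```python
π = 3.14159               # Greek letter as a variable name
größe = 2                 # German for "size"
fläche = π * größe ** 2   # German for "area"

print(fläche)
```

Whether such names are a good idea in shared code is a separate style question; the language merely permits them.
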
The StringIO and cStringIO modules are gone. Instead, import the io module and use io.StringIO or io.BytesIO for text and data respectively.
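
A brief sketch of the two in-memory stream types, assuming UTF-8 for the encoded form; the buffer names are illustrative.

```python
import io

text_buf = io.StringIO()                         # in-memory text stream
text_buf.write("caf\u00e9")
text_value = text_buf.getvalue()                 # str

data_buf = io.BytesIO()                          # in-memory binary stream
data_buf.write("caf\u00e9".encode("utf-8"))
data_value = data_buf.getvalue()                 # bytes

# Each stream accepts only its own type.
stringio_rejects_bytes = False
try:
    io.StringIO().write(b"bytes")
except TypeError:
    stringio_rejects_bytes = True

print(text_value, data_value)
```
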
See also the Unicode HOWTO, which was updated for Python 3.0.