Reference:
1. An explanation of Unicode and ASCII:
https://nedbatchelder.com/text/unipain.html
2. How to avoid encoding problems: copy the following two lines to the top of your program:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
http://stackoverflow.com/questions/728891/correct-way-to-define-python-source-code-encoding
Problem 1: UnicodeDecodeError
Python 2 tries to be helpful when working with unicode and byte strings. If you try to perform a string operation that combines a unicode string with a byte string, Python 2 will automatically decode the byte string to produce a second unicode string, then will complete the operation with the two unicode strings.
For example, suppose we concatenate the unicode string u"Hello " with the byte string "world". The result is the unicode string u"Hello world". On our behalf, Python 2 decodes the byte string "world" using the ASCII codec. The encoding used for these implicit decodings is the value of sys.getdefaultencoding().
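A minimal Python 2 sketch of this implicit decoding (the variable name is just for illustration):

    import sys

    print sys.getdefaultencoding()        # 'ascii' on a stock Python 2
    greeting = u"Hello " + "world"        # the byte string is implicitly decoded as ASCII
    print type(greeting), repr(greeting)  # <type 'unicode'> u'Hello world'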
The implicit encoding is ASCII because it’s the only safe guess: ASCII is so widely accepted, and is a subset of so many encodings, that it’s unlikely to produce false positives.
Of course, these implicit decodings are not immune to decoding errors. If you try to combine a byte string with a unicode string and the byte string can’t be decoded as ASCII, then the operation will raise a UnicodeDecodeError.
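A small Python 2 example of the failure, using the UTF-8 bytes for "café" (the variable names are hypothetical):

    name = "caf\xc3\xa9"            # UTF-8 bytes for "café"
    try:
        message = u"Hello " + name  # the implicit ASCII decode fails on '\xc3'
    except UnicodeDecodeError as err:
        print err                   # 'ascii' codec can't decode byte 0xc3 in position 3 ...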
This is the source of those painful UnicodeErrors. Your code inadvertently mixes unicode strings and byte strings, and as long as the data is all ASCII, the implicit conversions silently succeed. Once a non-ASCII character finds its way into your program, an implicit decode will fail, causing a UnicodeDecodeError.
Several other problems can arise as well (see the references above).
Pain relief
Solution 1:
As Fact of Life #1 in the first reference explains, the data coming into and going out of your program must be bytes. But you don’t need to deal with bytes inside your program. The best strategy is to decode incoming bytes as soon as possible, producing unicode. You use unicode throughout your program, and then, when outputting data, encode it to bytes as late as possible.
This creates a Unicode sandwich: bytes on the outside, Unicode on the inside.
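A minimal Python 2 sketch of the sandwich, assuming UTF-8 text files named input.txt and output.txt (both names are hypothetical):

    import io

    with io.open("input.txt", encoding="utf-8") as f:   # decode at the edge
        text = f.read()                                 # unicode from here on

    text = text.upper()                                 # all processing is on unicode

    with io.open("output.txt", "w", encoding="utf-8") as f:
        f.write(text)                                   # encode at the edge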
Keep in mind that a library you’re using may do some of these conversions for you. The library may present you with Unicode input, or accept Unicode for output, taking care of the edge conversions to and from bytes itself. For example, Django provides Unicode, as does the json module.
Solution 2:
The second rule is that you have to know what kind of data you are dealing with. At any point in your program, you need to know whether you have a byte string or a unicode string. This shouldn’t be a matter of guessing; it should be by design.
In addition, if you have a byte string, you should know what encoding it is in if you ever intend to deal with it as text.
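For instance, if you know a byte string holds UTF-8, decode it explicitly before treating it as text (a minimal sketch):

    raw = "caf\xc3\xa9"          # bytes known (by design) to be UTF-8
    text = raw.decode("utf-8")   # now a unicode string, safe to treat as text
    print repr(text)             # u'caf\xe9'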
When debugging your code, you can’t simply print a value to see what it is. You need to look at the type, and you may need to look at the repr of the value in order to get to the bottom of what data you have.
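A quick Python 2 illustration of why print alone is not enough:

    s1 = "caf\xc3\xa9"           # byte string (UTF-8 encoded)
    s2 = u"caf\xe9"              # unicode string
    print s1, s2                 # on a UTF-8 terminal both display as "café"
    print type(s1), repr(s1)     # <type 'str'> 'caf\xc3\xa9'
    print type(s2), repr(s2)     # <type 'unicode'> u'caf\xe9'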
Conclusion
Unicode sandwich: keep all text in your program as Unicode, and convert as close to the edges as possible.
Know what your strings are: you should be able to explain which of your strings are Unicode, which are bytes, and, for your byte strings, what encoding they use. (Encoding or decoding with the wrong codec produces mojibake; see the sketch after this list.)
Test your Unicode support. Use exotic strings throughout your test suites to be sure you’re covering all the cases.
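A small Python 2 sketch of the last two points: one line of mojibake, and a few exotic strings worth round-tripping in a test suite (the sample list is illustrative):

    # Mojibake: UTF-8 bytes decoded with the wrong codec.
    print u"caf\xe9".encode("utf-8").decode("latin-1")   # "cafÃ©" instead of "café"

    # Exotic strings for a test suite; each should survive a UTF-8 round trip.
    samples = [
        u"plain ascii",
        u"caf\xe9",               # accented Latin
        u"\u65e5\u672c\u8a9e",    # Japanese
        u"na\xefve \u2603",       # accented Latin plus a snowman
    ]
    for s in samples:
        assert s.encode("utf-8").decode("utf-8") == s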