Converting Unicode Strings to 8-bit Strings (Converting Unicode to UTF-8)

A Unicode string holds characters from the Unicode character set.

If you want an 8-bit string, you need to decide what encoding you want to use. Common encodings are US-ASCII (which is the default if you convert from Unicode to 8-bit strings in Python), ISO-8859-1 (aka Latin-1), and UTF-8 (a variable-width encoding that can represent all Unicode strings).
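
To make the trade-off between these encodings concrete, here is a minimal sketch written for Python 3 (an assumption; the article's own snippets are Python 2), where str is already Unicode and encode() returns bytes. The helper name encoded_sizes is hypothetical. It reports how many bytes each encoding needs, or None when the encoding cannot represent the text.

```python
def encoded_sizes(text, encodings=("ascii", "iso-8859-1", "utf-8")):
    """Map each encoding name to the encoded byte length, or None on failure."""
    sizes = {}
    for name in encodings:
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            # this encoding has no representation for some character
            sizes[name] = None
    return sizes


print(encoded_sizes("café"))  # {'ascii': None, 'iso-8859-1': 4, 'utf-8': 5}
print(encoded_sizes("€"))     # {'ascii': None, 'iso-8859-1': None, 'utf-8': 3}
```

Only UTF-8 succeeds for both samples, at the cost of using more than one byte for non-ASCII characters.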

 

For example, if you want Latin-1 strings, you can use one of:

    s = u.encode("iso-8859-1") # fail if some character cannot be converted
    s = u.encode("iso-8859-1", "replace") # instead of failing, replace with ?
    s = u.encode("iso-8859-1", "ignore") # instead of failing, leave it out

If you want an ASCII string, replace "iso-8859-1" above with "ascii" or "us-ascii".
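
The effect of the three error handlers is easiest to see side by side. The following is a small sketch for Python 3 (an assumption, since the snippets above are Python 2); the helper name encode_all_ways and the sample string are made up for illustration.

```python
def encode_all_ways(text, encoding="iso-8859-1"):
    """Encode text with each error handler; store the exception if one is raised."""
    results = {}
    for handler in ("strict", "replace", "ignore"):
        try:
            results[handler] = text.encode(encoding, handler)
        except UnicodeEncodeError as exc:
            results[handler] = exc
    return results


# The euro sign is not part of Latin-1, so it triggers the error handling.
results = encode_all_ways("Fröding €5")
print(type(results["strict"]).__name__)  # UnicodeEncodeError
print(results["replace"])                # b'Fr\xf6ding ?5'
print(results["ignore"])                 # b'Fr\xf6ding 5'
```

Passing encoding="ascii" instead would also affect the "ö", since it falls outside the ASCII range.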

If you want to output the data to a web browser or an XML file, you can use:

    import cgi
    s = cgi.escape(u).encode("ascii", "xmlcharrefreplace")

The cgi.escape function converts reserved characters (< > and &) to character entities (&lt;, &gt; and &amp;), and the xmlcharrefreplace flag tells the encoder to use character references (&#nn;) for any character that cannot be encoded in the given encoding. The browser (or XML parser) at the other end will convert things back to Unicode.
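
For readers on Python 3, where cgi.escape is no longer available, the same two-step pattern can be sketched with html.escape from the standard library. Treating it as a stand-in is an assumption of this example, and the function name to_ascii_markup is hypothetical. Passing quote=False mirrors the default behaviour described above, where quotes are left alone.

```python
import html


def to_ascii_markup(text):
    """Escape <, > and &, then turn non-ASCII characters into &#nn; references."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace")


print(to_ascii_markup("Räggler & <b>"))  # b'R&#228;ggler &amp; &lt;b&gt;'
```

The result is pure ASCII, so it survives any transport, and the receiving parser restores the original characters from the references.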

Note that cgi.escape doesn’t escape quotes by default. To use the value in an attribute, you need to pass in an extra flag to escape, and put the result in double quotes:

 
    s = 'attr="%s"' % cgi.escape(u,1).encode("ascii", "xmlcharrefreplace")
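
A rough Python 3 counterpart for attribute values, assuming html.escape with quote=True plays the role of the extra flag above. The helper name make_attribute is hypothetical, and the attribute name is assumed to be plain ASCII and trusted.

```python
import html


def make_attribute(name, value):
    """Build name="value" with quotes escaped and non-ASCII as &#nn; references."""
    escaped = html.escape(value, quote=True)  # also escapes " and '
    return ('%s="%s"' % (name, escaped)).encode("ascii", "xmlcharrefreplace")


print(make_attribute("title", 'Gustaf "Frö" & co'))
# b'title="Gustaf &quot;Fr&#246;&quot; &amp; co"'
```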

The unaccent.py script shows how to strip accents from Latin characters:

Example: Use a dynamically populated translation dictionary to remove accents from a string.

    # -*- coding: utf-8 -*-
    # (Python 2 needs a source encoding declaration for the non-ASCII text below)
    import unicodedata, sys

    CHAR_REPLACEMENT = {
        # latin-1 characters that don't have a unicode decomposition
        0xc6: u"AE", # LATIN CAPITAL LETTER AE
        0xd0: u"D", # LATIN CAPITAL LETTER ETH
        0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
        0xde: u"Th", # LATIN CAPITAL LETTER THORN
        0xdf: u"ss", # LATIN SMALL LETTER SHARP S
        0xe6: u"ae", # LATIN SMALL LETTER AE
        0xf0: u"d", # LATIN SMALL LETTER ETH
        0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
        0xfe: u"th", # LATIN SMALL LETTER THORN
    }

    ##
    # Translation dictionary. Translation entries are added to this
    # dictionary as needed.

    class unaccented_map(dict):

        ##
        # Maps a unicode character code (the key) to a replacement code
        # (either a character code or a unicode string).

        def mapchar(self, key):
            ch = self.get(key)
            if ch is not None:
                return ch
            de = unicodedata.decomposition(unichr(key))
            if de:
                try:
                    ch = int(de.split(None, 1)[0], 16)
                except (IndexError, ValueError):
                    ch = key
            else:
                ch = CHAR_REPLACEMENT.get(key, key)
            self[key] = ch
            return ch

        if sys.version >= "2.5":
            # use __missing__ where available
            __missing__ = mapchar
        else:
            # otherwise, use standard __getitem__ hook (this is slower,
            # since it's called for each character)
            __getitem__ = mapchar


    if __name__ == "__main__":

        text = u"""

    "Jo, når'n da ha gått ett stôck te, så kommer'n te e å,
    å i åa ä e ö."
    "Vasa", sa'n.
    "Å i åa ä e ö", sa ja.
    "Men va i all ti ä dä ni säjer, a, o?", sa'n.
    "D'ä e å, vett ja", skrek ja, för ja ble rasen, "å i åa
    ä e ö, hörer han lite, d'ä e å, å i åa ä e ö."
    "A, o, ö", sa'n å dämmä geck'en.
    Jo, den va nôe te dum den.

    (taken from the short story "Dumt fôlk" in Gustaf Fröding's
    "Räggler å paschaser på våra mål tå en bonne" (1895).

    """

        print text.translate(unaccented_map())

        # note that non-letters are passed through as is; you can use
        # encode("ascii", "ignore") to get rid of them. alternatively,
        # you can tweak the translation dictionary to return None for
        # characters >= "\x80".

        map = unaccented_map()

        print repr(u"12\xbd inch".translate(map))
        print repr(u"12\xbd inch".translate(map).encode("ascii", "ignore"))
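
The closing comment in the listing mentions tweaking the translation dictionary to return None for characters at or above "\x80". One possible reading of that suggestion is sketched below for Python 3 (an assumption; the script itself targets Python 2). The class name StrictUnaccentedMap is hypothetical, and the replacement table is abbreviated to a few entries from the listing.

```python
import unicodedata

# A few entries from the CHAR_REPLACEMENT table in the listing.
CHAR_REPLACEMENT = {0xc6: "AE", 0xdf: "ss", 0xe6: "ae", 0xf8: "oe"}


class StrictUnaccentedMap(dict):
    """Translation table that strips accents and drops unmappable non-ASCII."""

    def __missing__(self, key):
        de = unicodedata.decomposition(chr(key))
        try:
            # the first field of a canonical decomposition is the base character
            ch = int(de.split(None, 1)[0], 16)
        except (IndexError, ValueError):
            # no decomposition, or a compatibility tag such as "<fraction>"
            ch = CHAR_REPLACEMENT.get(key, key)
        if isinstance(ch, int) and ch >= 0x80:
            ch = None  # str.translate deletes characters that map to None
        self[key] = ch  # cache the answer so each character is resolved once
        return ch


table = StrictUnaccentedMap()
print("12\xbd inch".translate(table))  # 12 inch
print("Fröding, Straße".translate(table))  # Froding, Strasse
```

With this variant the separate encode("ascii", "ignore") step is not needed, because the table itself never lets a non-ASCII code point through.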

Comment:

1. I'm not sure whether "eth" should be converted to "d" or "dh", or "capital O with stroke" to "OE" or "Oe", but you as a Scandinavian surely know better.

2. Please don't confine the translation to Latin-1. I especially miss "l with stroke", which is very frequent in Polish. Here is a fragment of my program that performs the same task, with additional non-decomposable characters that you may want to add:

    # non-decomposable characters from Latin-1 and Latin Extended A
    charmap = {
        u'\N{Latin capital letter AE}': 'AE',
        u'\N{Latin small letter ae}': 'ae',
        u'\N{Latin capital letter Eth}': 'Dh',
        u'\N{Latin small letter eth}': 'dh',
        u'\N{Latin capital letter O with stroke}': 'Oe',
        u'\N{Latin small letter o with stroke}': 'oe',
        u'\N{Latin capital letter Thorn}': 'Th',
        u'\N{Latin small letter thorn}': 'th',
        u'\N{Latin small letter sharp s}': 'ss',
        u'\N{Latin capital letter D with stroke}': 'Dj',
        u'\N{Latin small letter d with stroke}': 'dj',
        u'\N{Latin capital letter H with stroke}': 'H',
        u'\N{Latin small letter h with stroke}': 'h',
        u'\N{Latin small letter dotless i}': 'i',
        u'\N{Latin small letter kra}': 'q',
        u'\N{Latin capital letter L with stroke}': 'L',
        u'\N{Latin small letter l with stroke}': 'l',
        u'\N{Latin capital letter Eng}': 'Ng',
        u'\N{Latin small letter eng}': 'ng',
        u'\N{Latin capital ligature OE}': 'Oe',
        u'\N{Latin small ligature oe}': 'oe',
        u'\N{Latin capital letter T with stroke}': 'Th',
        u'\N{Latin small letter t with stroke}': 'th',
    }
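
A character-keyed dictionary like this cannot be passed to translate directly, because translate looks characters up by code point. As a rough illustration, in Python 3 (an assumption; the fragment above is Python 2) str.maketrans can convert the keys to ordinals. Only three entries are repeated here, and the name table is hypothetical.

```python
# Three entries from the commenter's charmap, written as Python 3 str literals.
charmap = {
    "\N{LATIN CAPITAL LETTER L WITH STROKE}": "L",
    "\N{LATIN SMALL LETTER L WITH STROKE}": "l",
    "\N{LATIN SMALL LETTER ENG}": "ng",
}

# maketrans turns single-character keys into the ordinals translate expects.
table = str.maketrans(charmap)

print("Jabłko".translate(table))  # Jablko
```

Characters that do have a Unicode decomposition, such as "ó", are untouched by this table and would still need the decomposition-based step.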