Converting Unicode Strings to 8-bit Strings (Converting Unicode to UTF-8)

chenbuaa

Tags: character translation dictionary string encoding browser

Article link: https://blog.csdn.net/chenbuaa/article/details/2449630

A Unicode string holds characters from the Unicode character set.

If you want an 8-bit string, you need to decide which encoding to use. Common encodings are US-ASCII (the default when you convert from Unicode to 8-bit strings in Python 2), ISO-8859-1 (aka Latin-1), and UTF-8 (a variable-width encoding that can represent every Unicode string).
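To make the difference between these encodings concrete, here is a minimal sketch. It assumes Python 3, where `str` is the Unicode type and `bytes` is the 8-bit type, and the sample string is made up for illustration.

```python
# Python 3 sketch: str is Unicode, bytes is the 8-bit string type.
text = "naïve café"  # hypothetical sample with two non-ASCII characters

latin1 = text.encode("iso-8859-1")  # Latin-1 uses exactly one byte per character
utf8 = text.encode("utf-8")         # UTF-8 uses two bytes each for "ï" and "é"

print(len(text), len(latin1), len(utf8))

# US-ASCII cannot represent "ï" or "é", so a strict encode raises an error.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

The byte counts differ only for the characters outside ASCII, which is what "variable-width" means in practice.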

For example, if you want Latin-1 strings, you can use one of:

    s = u.encode("iso-8859-1") # fail if some character cannot be converted
    s = u.encode("iso-8859-1", "replace") # instead of failing, replace with ?
    s = u.encode("iso-8859-1", "ignore") # instead of failing, leave it out

If you want an ASCII string, replace "iso-8859-1" above with "ascii" or "us-ascii".
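The effect of the three error handlers is easiest to see side by side. The following is a small sketch, assuming Python 3 (so the result is a `bytes` object) and a made-up sample string that contains one character Latin-1 can represent and one it cannot.

```python
u = "Gustaf Fröding €5"  # "ö" exists in Latin-1, "€" does not

# Default ("strict") handling: fail on the first unencodable character.
try:
    strict = u.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    strict = None
    print("strict failed on:", u[exc.start:exc.end])

replaced = u.encode("iso-8859-1", "replace")  # unencodable characters become "?"
ignored = u.encode("iso-8859-1", "ignore")    # unencodable characters are dropped
ascii_replaced = u.encode("ascii", "replace")  # in ASCII, "ö" is unencodable too

print(replaced, ignored, ascii_replaced)
```

Note that "replace" and "ignore" both lose information silently, so they are best reserved for output where an approximate result is acceptable.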

If you want to output the data to a web browser or an XML file, you can use:

    import cgi
    s = cgi.escape(u).encode("ascii", "xmlcharrefreplace")

The cgi.escape function converts reserved characters (<, > and &) to character entities (&lt;, &gt; and &amp;), and the xmlcharrefreplace flag tells the encoder to use numeric character references (&#nn;) for any character that cannot be encoded in the given encoding. The browser (or XML parser) at the other end will convert things back to Unicode.
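As a rough illustration of what ends up on the wire, here is a sketch that assumes Python 3, where `html.escape` plays the role of `cgi.escape` (which was removed in Python 3.8). Passing `quote=False` mirrors the old default of leaving quotes alone; the sample string is invented.

```python
import html

u = "1 < 2 & café ≥ 0"  # hypothetical sample text

# Escape markup characters first, then let the encoder turn everything
# outside ASCII into numeric character references.
s = html.escape(u, quote=False).encode("ascii", "xmlcharrefreplace")
print(s)

# A browser or XML parser reverses both steps; html.unescape does the
# same thing here for demonstration.
round_trip = html.unescape(s.decode("ascii"))
print(round_trip == u)
```

The result is pure ASCII, so it survives any transport that can carry 8-bit strings, whatever encoding the page declares.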

Note that cgi.escape doesn’t escape quotes by default. To use the value in an attribute, you need to pass a true value as the second argument to cgi.escape, and put the result in double quotes:

    s = 'attr="%s"' % cgi.escape(u,1).encode("ascii", "xmlcharrefreplace")
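For comparison, an attribute value can be built the same way in Python 3. This is only a sketch: it assumes `html.escape` as the stand-in for `cgi.escape`, and the attribute name and sample text are placeholders. Unlike `cgi.escape`, `html.escape` escapes quotes by default; `quote=True` is spelled out here to make that visible.

```python
import html

u = 'Räggler "å" paschaser'  # hypothetical value containing double quotes

# quote=True also escapes " and ', so the value cannot end the attribute early.
value = html.escape(u, quote=True).encode("ascii", "xmlcharrefreplace")

# The result of encode() is bytes, so the surrounding markup is bytes as well.
s = b'attr="' + value + b'"'
print(s)
```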

The unaccent.py script shows how to strip accents from Latin characters:

Example: Use a dynamically populated translation dictionary to remove accents from a string.

    # -*- coding: utf-8 -*-
    # Python 2 script (uses unichr and the print statement).

    import unicodedata, sys

    CHAR_REPLACEMENT = {
        # latin-1 characters that don't have a unicode decomposition
        0xc6: u"AE", # LATIN CAPITAL LETTER AE
        0xd0: u"D",  # LATIN CAPITAL LETTER ETH
        0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
        0xde: u"Th", # LATIN CAPITAL LETTER THORN
        0xdf: u"ss", # LATIN SMALL LETTER SHARP S
        0xe6: u"ae", # LATIN SMALL LETTER AE
        0xf0: u"d",  # LATIN SMALL LETTER ETH
        0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
        0xfe: u"th", # LATIN SMALL LETTER THORN
        }

    ##
    # Translation dictionary.  Translation entries are added to this
    # dictionary as needed.

    class unaccented_map(dict):

        ##
        # Maps a unicode character code (the key) to a replacement code
        # (either a character code or a unicode string).

        def mapchar(self, key):
            ch = self.get(key)
            if ch is not None:
                return ch
            de = unicodedata.decomposition(unichr(key))
            if de:
                try:
                    # the first field is the base character; a leading
                    # tag such as <compat> raises ValueError instead
                    ch = int(de.split(None, 1)[0], 16)
                except (IndexError, ValueError):
                    ch = key
            else:
                ch = CHAR_REPLACEMENT.get(key, key)
            self[key] = ch
            return ch

        # compare version tuples; comparing version strings would
        # sort "2.10" before "2.5"
        if sys.version_info >= (2, 5):
            # use __missing__ where available
            __missing__ = mapchar
        else:
            # otherwise, use standard __getitem__ hook (this is slower,
            # since it's called for each character)
            __getitem__ = mapchar


    if __name__ == "__main__":

        text = u"""

        "Jo, når'n da ha gått ett stôck te, så kommer'n te e å,
        å i åa ä e ö."
        "Vasa", sa'n.
        "Å i åa ä e ö", sa ja.
        "Men va i all ti ä dä ni säjer, a, o?", sa'n.
        "D'ä e å, vett ja", skrek ja, för ja ble rasen, "å i åa
        ä e ö, hörer han lite, d'ä e å, å i åa ä e ö."
        "A, o, ö", sa'n å dämmä geck'en.
        Jo, den va nôe te dum den.

        (taken from the short story "Dumt fôlk" in Gustaf Fröding's
        "Räggler å paschaser på våra mål tå en bonne" (1895).

        """

        print text.translate(unaccented_map())

        # note that non-letters are passed through as is; you can use
        # encode("ascii", "ignore") to get rid of them.  alternatively,
        # you can tweak the translation dictionary to return None for
        # characters >= "\x80".

        table = unaccented_map()

        print repr(u"12\xbd inch".translate(table))
        print repr(u"12\xbd inch".translate(table).encode("ascii", "ignore"))
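For readers on Python 3, the same lazily populated table can be sketched as follows. This is an adaptation under stated assumptions, not the original script: `UnaccentedMap` and `FALLBACK` are names chosen for this sketch, the fallback table is abbreviated, `chr` stands in for `unichr`, and `__missing__` is used unconditionally since every Python 3 version supports it.

```python
import unicodedata

# Abbreviated stand-in for CHAR_REPLACEMENT (an assumption of this sketch).
FALLBACK = {0xC6: "AE", 0xDF: "ss", 0xE6: "ae", 0xF8: "oe"}


class UnaccentedMap(dict):
    """Translation table that fills itself in the first time a code is seen."""

    def __missing__(self, key):
        de = unicodedata.decomposition(chr(key))
        if de:
            try:
                # First field is the base character, e.g. "006F 0308" -> 0x6F.
                ch = int(de.split(None, 1)[0], 16)
            except (IndexError, ValueError):
                # Tagged decompositions such as "<fraction> ..." are left alone.
                ch = key
        else:
            ch = FALLBACK.get(key, key)
        self[key] = ch  # cache, so __missing__ runs once per character
        return ch


table = UnaccentedMap()
result = "Fröding på våra mål".translate(table)
print(result)

# Non-letters such as the fraction sign pass through unchanged until encoded.
half = "12\xbd inch".translate(table)
print(repr(half), half.encode("ascii", "ignore"))
```

Because `str.translate` looks each character code up in the mapping, the dictionary only ever holds entries for characters that actually occurred in the input.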

Comment:

1. I'm not sure whether "eth" should be converted to "d" or "dh", and "capital O with stroke" to "OE" or "Oe", but you as a Scandinavian surely know better.
2. Please don't confine the translation to Latin-1 only. I especially miss the "l with stroke", which is very frequent in Polish. Here is a fragment of my program that performs the same task, with additional non-decomposable characters that you may want to add:

    # non-decomposable characters from Latin-1 and Latin Extended-A
    charmap = {
        u'\N{Latin capital letter AE}': 'AE',
        u'\N{Latin small letter ae}': 'ae',
        u'\N{Latin capital letter Eth}': 'Dh',
        u'\N{Latin small letter eth}': 'dh',
        u'\N{Latin capital letter O with stroke}': 'Oe',
        u'\N{Latin small letter o with stroke}': 'oe',
        u'\N{Latin capital letter Thorn}': 'Th',
        u'\N{Latin small letter thorn}': 'th',
        u'\N{Latin small letter sharp s}': 'ss',
        u'\N{Latin capital letter D with stroke}': 'Dj',
        u'\N{Latin small letter d with stroke}': 'dj',
        u'\N{Latin capital letter H with stroke}': 'H',
        u'\N{Latin small letter h with stroke}': 'h',
        u'\N{Latin small letter dotless i}': 'i',
        u'\N{Latin small letter kra}': 'q',
        u'\N{Latin capital letter L with stroke}': 'L',
        u'\N{Latin small letter l with stroke}': 'l',
        u'\N{Latin capital letter Eng}': 'Ng',
        u'\N{Latin small letter eng}': 'ng',
        u'\N{Latin capital ligature OE}': 'Oe',
        u'\N{Latin small ligature oe}': 'oe',
        u'\N{Latin capital letter T with stroke}': 'Th',
        u'\N{Latin small letter t with stroke}': 'th',
    }
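A character-keyed dictionary like this cannot be passed to translate directly, because translate looks up character codes, not characters. A possible way to use it is sketched below, assuming Python 3 and only a few of the entries. `strip_accents` is a hypothetical helper, and it removes the remaining accents with NFD normalization, which is a different technique from the decomposition lookup in unaccent.py.

```python
import unicodedata

# A few entries from the comment's table, keyed by character.
charmap = {
    "\N{LATIN CAPITAL LETTER L WITH STROKE}": "L",
    "\N{LATIN SMALL LETTER L WITH STROKE}": "l",
    "\N{LATIN SMALL LETTER ENG}": "ng",
    "\N{LATIN SMALL LIGATURE OE}": "oe",
}

# str.maketrans converts single-character keys to the ordinals translate needs.
table = str.maketrans(charmap)


def strip_accents(text):
    """Replace non-decomposable letters, then drop combining accents."""
    text = text.translate(table)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(strip_accents("Łódź"))
```

The explicit table handles letters such as "Ł" that have no decomposition, while normalization takes care of letters such as "ó" and "ź" that do.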
