Matching Unicode symbols and emoji in Java: correctly extracting emoji from a Unicode string

I am working in Python 2 and I have a string containing emojis as well as other unicode characters. I need to convert it to a list where each entry in the list is a single character/emoji.

x = u'😘😘xyz😊😊'

char_list = [c for c in x]

The desired output is:

['😘', '😘', 'x', 'y', 'z', '😊', '😊']

The actual output is:

[u'\ud83d', u'\ude18', u'\ud83d', u'\ude18', u'x', u'y', u'z', u'\ud83d', u'\ude0a', u'\ud83d', u'\ude0a']

How can I achieve the desired output?

Solution

First of all, in Python 2 you need to use Unicode strings (u'<...>') for Unicode characters to be treated as Unicode characters. You also need to declare the correct source encoding if you want to write the characters themselves in source code rather than their \UXXXXXXXX escapes.

Now, as explained in "Python: getting correct string length when it contains surrogate pairs" and "Python returns length of 2 for single Unicode character string", in Python 2 "narrow" builds (with sys.maxunicode == 65535), characters outside the Basic Multilingual Plane (code points above U+FFFF) are represented as surrogate pairs, and this is not transparent to string functions. This was only fixed in Python 3.3 (PEP 393).
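To see which kind of build you are running, a minimal check along these lines may help. The function name is_narrow_build is only an illustrative choice, not something from the linked answers.

```python
import sys

def is_narrow_build():
    # sys.maxunicode is 0xFFFF on narrow builds and 0x10FFFF on wide builds
    # (and on every Python 3.3+ interpreter).
    return sys.maxunicode == 0xFFFF

print(is_narrow_build())
# On a narrow build a single emoji is stored as two code units, so len() is 2.
print(len(u'\U0001F618'))
```

On a narrow build the second line prints 2; on a wide build or on Python 3.3+ it prints 1.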

The simplest resolution (short of migrating to 3.3+) is to compile a Python "wide" build from source, as outlined on the 3rd link. In a wide build every Unicode character takes 4 bytes (so large strings can become a memory hog), but if you routinely need to handle characters outside the BMP, this is probably an acceptable price.

The solution for a "narrow" build is to write a custom set of string functions (len, slice; maybe as a subclass of unicode) that detect surrogate pairs and treat each pair as a single character. I couldn't readily find an existing one (which is strange), but it's not too hard to write:

As per the Wikipedia article on UTF-16 (section "U+10000 to U+10FFFF"):

the 1st code unit (high surrogate) is in the range 0xD800..0xDBFF

the 2nd code unit (low surrogate) is in the range 0xDC00..0xDFFF

these ranges are reserved and thus cannot occur as regular characters
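As a side note, the ranges above also determine how a pair maps back to a single code point. The sketch below shows the standard UTF-16 arithmetic; the helper name combine_surrogates is an assumption made for illustration.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high and low surrogate code unit into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair: %#x, %#x" % (high, low))
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The pair from the question, u'\ud83d' + u'\ude18', is U+1F618.
print(hex(combine_surrogates(0xD83D, 0xDE18)))  # 0x1f618
```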

So here's the code to detect a surrogate pair:

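def is_surrogate(s, i):
    """Return True if s[i] and s[i+1] form a surrogate pair."""
    if 0xD800 <= ord(s[i]) <= 0xDBFF:
        try:
            l = s[i+1]
        except IndexError:
            # A high surrogate at the very end of the string has no partner.
            return False
        if 0xDC00 <= ord(l) <= 0xDFFF:
            return True
        else:
            # A high surrogate must be followed by a low surrogate.
            raise ValueError("Illegal UTF-16 sequence: %r" % s[i:i+2])
    else:
        return False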
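A length function can be built on the same test. The following is a self-contained sketch rather than part of the original answer: char_len is a hypothetical name, and unlike is_surrogate it counts an unpaired surrogate as one character instead of raising.

```python
def char_len(s):
    """Count characters in s, treating each surrogate pair as one character."""
    count = 0
    i = 0
    n = len(s)
    while i < n:
        if (0xD800 <= ord(s[i]) <= 0xDBFF and i + 1 < n
                and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF):
            i += 2  # skip both halves of the pair
        else:
            i += 1
        count += 1
    return count

# Two code units for the emoji plus three letters: four characters.
print(char_len(u'\ud83d\ude18xyz'))  # 4
```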

And a function that returns a simple slice:

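def slice(s, start, end):
    """Return s[start:end] with indices counted in characters, not code units."""
    l = len(s)
    i = 0
    # Each pair before the start index shifts both bounds by one code unit.
    while i < start and i < l:
        if is_surrogate(s, i):
            start += 1
            end += 1
            i += 1
        i += 1
    # Each pair inside the slice shifts only the end bound.
    while i < end and i < l:
        if is_surrogate(s, i):
            end += 1
            i += 1
        i += 1
    return s[start:end]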
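For the list the question asks for, the same pair detection can drive a simple splitter. This is a minimal, self-contained sketch; split_chars is a hypothetical name, and an unpaired surrogate is passed through as its own item rather than rejected.

```python
def split_chars(s):
    """Split s into a list of characters, keeping surrogate pairs together."""
    chars = []
    i = 0
    n = len(s)
    while i < n:
        if (0xD800 <= ord(s[i]) <= 0xDBFF and i + 1 < n
                and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF):
            chars.append(s[i:i + 2])  # high + low surrogate as one item
            i += 2
        else:
            chars.append(s[i])
            i += 1
    return chars

x = u'\ud83d\ude18\ud83d\ude18xyz\ud83d\ude0a\ud83d\ude0a'
print(len(split_chars(x)))  # 7
```

On a narrow build, split_chars applied to the string from the question yields seven items, one per emoji or letter.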

Here, the price you pay is performance, as these functions are much slower than built-ins:

>>> import timeit
>>> ux=u"a"*5000+u"\U00100000"*30000+u"b"*50000
>>> timeit.timeit('slice(ux,10000,100000)','from __main__ import slice,ux',number=1000)
46.44128203392029 # seconds for 1000 calls, i.e. about 46 msec per call
>>> timeit.timeit('ux[10000:100000]','from __main__ import slice,ux',number=1000000)
8.814016103744507 # seconds for 1000000 calls, i.e. about 8.8 usec per call
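If the pure-Python loop is too slow, one option is to let the re module do the scanning, since its matching loop runs in C. The sketch below was not benchmarked against the numbers above, and split_chars_re is a hypothetical name.

```python
import re

# Try a surrogate pair first; otherwise take any single code unit.
# re.DOTALL lets "." match newlines as well.
_CHAR_RE = re.compile(u'[\ud800-\udbff][\udc00-\udfff]|.', re.DOTALL)

def split_chars_re(s):
    """Split s into characters, keeping surrogate pairs together."""
    return _CHAR_RE.findall(s)

print(len(split_chars_re(u'\ud83d\ude18xyz')))  # 4
```

On a wide build or Python 3.3+ the first alternative never matches a real emoji, and "." simply returns each character whole.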
