The "and" symbol in Python: how to extract emojis and flags from a string in Python?

Latest recommended article published on 2023-01-29 18:13:32

weixin_39973271

Article tags: the "and" symbol in Python

import emoji

def emoji_lis(string):
    _entities = []
    for pos, c in enumerate(string):
        if c in emoji.UNICODE_EMOJI:
            print("Matched!!", c, c.encode('ascii', "backslashreplace"))
            _entities.append({
                "location": pos,
                "emoji": c
            })
    return _entities

emoji_lis("👧🏿 مدیحہ🇵🇰 así, se 😌 ds 💕👭")

Matched!! 👧 \U0001f467
Matched!! 🏿 \U0001f3ff
Matched!! 😌 \U0001f60c
Matched!! 💕 \U0001f495
Matched!! 👭 \U0001f46d

My code works for all the other emojis, but how can I detect country flags such as 🇵🇰?

Solution

Here is an article about how Unicode encodes country flags. They are represented as sequences of two regional indicator symbols (code points ranging from U+1F1E6 to U+1F1FF), although not every possible combination of two symbols corresponds to a country (and therefore a flag). You could either assume that no "bad" combinations will occur, or maintain (or import) a set with the (currently) 270 valid pairs of symbols.
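As a rough illustration of the pairing described above, the sketch below maps between a flag and its two-letter code using only the code point range given in the answer. The helper names (`is_regional_indicator`, `flag_to_country_code`, `country_code_to_flag`) are made up for this example, and the functions do not check that a pair belongs to a real country.

```python
# Regional indicator symbols A..Z occupy U+1F1E6..U+1F1FF.
RI_FIRST = 0x1F1E6
RI_LAST = 0x1F1FF


def is_regional_indicator(ch):
    """Return True if ch is a regional indicator symbol."""
    return RI_FIRST <= ord(ch) <= RI_LAST


def flag_to_country_code(flag):
    """Map a pair of regional indicators to its two-letter code.

    Hypothetical helper: it does not verify that the code is assigned.
    """
    if len(flag) != 2 or not all(is_regional_indicator(ch) for ch in flag):
        raise ValueError("expected exactly two regional indicator symbols")
    return "".join(chr(ord(ch) - RI_FIRST + ord("A")) for ch in flag)


def country_code_to_flag(code):
    """Build the regional indicator pair for a two-letter ASCII code."""
    code = code.upper()
    if len(code) != 2 or not all("A" <= ch <= "Z" for ch in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(ord(ch) - ord("A") + RI_FIRST) for ch in code)


# The flag from the question is U+1F1F5 followed by U+1F1F0.
print(flag_to_country_code("\U0001F1F5\U0001F1F0"))  # PK
```

If you maintain a set of valid codes, checking `flag_to_country_code(pair) in valid_codes` would be one way to reject the "bad" combinations.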

Then there are regional flags. These are represented as a black flag code point (U+1F3F4) followed by a sequence of tag characters (code points in the range U+E0020 to U+E007E) spelling the region identifier (for example, for the flag of Wales that would be "gbwls"), plus a "cancel tag" code point (U+E007F).
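A minimal sketch of reading such a tag sequence, assuming each tag character is the matching ASCII character shifted by 0xE0000. The function names are invented for this example, and no check is made that the identifier names a real region.

```python
BLACK_FLAG = "\U0001F3F4"
CANCEL_TAG = "\U000E007F"
TAG_OFFSET = 0xE0000  # tag character = ASCII character + 0xE0000


def make_subdivision_flag(region_id):
    """Build black flag + tag characters + cancel tag for an ASCII identifier."""
    tags = "".join(chr(ord(ch) + TAG_OFFSET) for ch in region_id)
    return BLACK_FLAG + tags + CANCEL_TAG


def read_subdivision_flag(text, start=0):
    """Return (region_id, end_index) if a tag-sequence flag starts at `start`.

    Returns None when there is no black flag, no tag characters,
    or no closing cancel tag.
    """
    if not text.startswith(BLACK_FLAG, start):
        return None
    pos = start + 1
    letters = []
    # Collect tag characters and shift them back to ASCII.
    while pos < len(text) and 0xE0020 <= ord(text[pos]) <= 0xE007E:
        letters.append(chr(ord(text[pos]) - TAG_OFFSET))
        pos += 1
    if letters and pos < len(text) and text[pos] == CANCEL_TAG:
        return "".join(letters), pos + 1
    return None


wales = make_subdivision_flag("gbwls")
print(read_subdivision_flag(wales))  # ('gbwls', 7)
```

A plain black flag with no tags returns None here, so it can still be handled as an ordinary single emoji.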

And, besides all that, you of course also have regular emojis that look like flags. The aforementioned black flag (U+1F3F4) is one of them, but there is also the triangular flag (U+1F6A9), etc. Most of these you should already be able to detect, since they are just like other emojis.

However, we are not quite done here. You have the issue of composite emojis, which affects some flags but also many other emojis. In your example, you can see that the dark-skinned girl in the input string is matched as a "base" girl emoji followed by a brown patch. This is because that emoji is made up of two code points, girl (U+1F467) and dark skin tone (U+1F3FF). In many other cases, you would need the two code points plus a zero-width joiner (U+200D) in between, to specify that you want them merged. And sometimes you also need to throw in a variation selector (typically 16, U+FE0F) to indicate that you want things to be rendered as emojis. You can read more about this in this article.

In the case of flags, you have for example the rainbow flag (U+1F3F3, U+FE0F, U+200D, U+1F308), which reads "white flag, variation selector 16 (to use the white flag emoji, not text), zero-width joiner, rainbow"; or the pirate flag (U+1F3F4, U+200D, U+2620, U+FE0F), which reads "black flag, zero-width joiner, skull and crossbones, variation selector 16 (to use the skull and crossbones emoji, not text)".
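To see how a composite emoji breaks down, you can list its code points with the standard `unicodedata` module. This is only an inspection aid; `describe` is a name chosen for this sketch, and the names printed depend on the Unicode data shipped with your Python version.

```python
import unicodedata


def describe(text):
    """Return a (code point, Unicode name) pair for every code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Rainbow flag: white flag, variation selector 16, zero-width joiner, rainbow.
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
for code_point, name in describe(rainbow_flag):
    print(code_point, name)

# The girl from the question is two code points: base emoji plus skin tone.
print(len("\U0001F467\U0001F3FF"))  # 2
```

Iterating with `enumerate(string)`, as in the question, visits each of these code points separately, which is why the pieces are matched one by one.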

Now, there are different ways to deal with all this, but your current approach iterates one code point at a time, so it cannot detect emojis made up of several code points. One option is to keep a big set of all interesting sequences (flags, some composite emojis, etc.) and look for them in the input. Another is to check whether the current character is a regional indicator symbol and, if so, try to read the next code point to form a flag (and settle for individual simple emojis for the rest). I do not know for sure which solution is best for your case (in terms of the complexity/benefit trade-off), but you should be aware of the nuances of emoji encoding and the pitfalls you may find.
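The two options above could be combined roughly as follows. This is a sketch under stated assumptions, not a complete emoji parser: `extract_entities`, `known_sequences` and `single_emojis` are hypothetical names, and you would fill the two collections from whatever emoji data you trust (for example the keys of the dictionary used in the question). Any two adjacent regional indicators are treated as a flag without validation.

```python
def is_regional_indicator(ch):
    """Return True for code points U+1F1E6..U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def extract_entities(text, known_sequences=(), single_emojis=()):
    """Scan text left to right, preferring the longest match at each position."""
    # Longest sequences first, so a composite wins over its first code point.
    sequences = sorted({s for s in known_sequences if s}, key=len, reverse=True)
    singles = set(single_emojis)
    entities = []
    pos = 0
    while pos < len(text):
        # 1. Try the caller-supplied multi-code-point sequences.
        match = next((s for s in sequences if text.startswith(s, pos)), None)
        # 2. Otherwise, pair up two adjacent regional indicators as a flag.
        if (match is None and pos + 1 < len(text)
                and is_regional_indicator(text[pos])
                and is_regional_indicator(text[pos + 1])):
            match = text[pos:pos + 2]
        # 3. Otherwise, settle for a simple single-code-point emoji.
        if match is None and text[pos] in singles:
            match = text[pos]
        if match is None:
            pos += 1
            continue
        entities.append({"location": pos, "emoji": match})
        pos += len(match)
    return entities


sample = "\U0001F467\U0001F3FF hi \U0001F1F5\U0001F1F0 \U0001F60C"
found = extract_entities(
    sample,
    known_sequences={"\U0001F467\U0001F3FF"},
    single_emojis={"\U0001F467", "\U0001F3FF", "\U0001F60C"},
)
for entity in found:
    print(entity["location"], entity["emoji"])
```

As in the question, "location" is an index in code points, so a matched sequence spans `len(entity["emoji"])` positions starting there.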
