How to extract emojis and flags from a string in Python?

import emoji

def emoji_lis(string):
    _entities = []
    for pos, c in enumerate(string):
        if c in emoji.UNICODE_EMOJI:
            print("Matched!!", c, c.encode('ascii', "backslashreplace"))
            _entities.append({
                "location": pos,
                "emoji": c
            })
    return _entities

emoji_lis("👧🏿 مدیحہ🇵🇰 así, se 😌 ds 💕👭")

Matched!! 👧 \U0001f467
Matched!! 🏿 \U0001f3ff
Matched!! 😌 \U0001f60c
Matched!! 💕 \U0001f495
Matched!! 👭 \U0001f46d

My code works for all other emojis, but how can I detect country flags such as 🇵🇰?

Solution

Unicode represents country flags as sequences of two regional indicator symbols (code points ranging from U+1F1E6 to U+1F1FF), although not every possible combination of two symbols corresponds to a country (and therefore to a flag). You could either assume that no "bad" combinations will occur, or maintain (or import) a set of the (currently) 258 valid pairs of symbols.
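As a minimal sketch of the "assume no bad combinations" approach, the following scans a string for pairs of adjacent regional indicator symbols. The function names are hypothetical, and no check is made that a pair corresponds to a real country; for that you would compare the decoded letters against a set of valid region codes.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
REGIONAL_INDICATOR_Z = 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER Z


def is_regional_indicator(ch):
    """True if ch is one of the 26 regional indicator symbols."""
    return REGIONAL_INDICATOR_A <= ord(ch) <= REGIONAL_INDICATOR_Z


def flag_to_region_code(flag):
    """Map the symbols of a flag to plain ASCII letters, e.g. 'PK'."""
    return "".join(
        chr(ord(ch) - REGIONAL_INDICATOR_A + ord("A")) for ch in flag
    )


def find_country_flags(text):
    """Return (position, flag, region_code) for each regional indicator pair.

    Pairs are formed greedily from left to right; validity of the
    resulting region code is NOT checked (assumption: input is well formed).
    """
    found = []
    i = 0
    while i < len(text):
        if (i + 1 < len(text)
                and is_regional_indicator(text[i])
                and is_regional_indicator(text[i + 1])):
            flag = text[i:i + 2]
            found.append((i, flag, flag_to_region_code(flag)))
            i += 2  # skip both halves of the flag
        else:
            i += 1
    return found
```

Calling `find_country_flags` on the string from the question should report one pair whose decoded region code is "PK". A lone regional indicator with no partner is simply skipped.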

Then there are regional (subdivision) flags. These are represented as a black flag code point (U+1F3F4) followed by a sequence of tag characters (code points in the range U+E0020 to U+E007E) spelling the region identifier (for example, "gbwls" for the flag of Wales), plus a "cancel tag" code point (U+E007F).
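The tag characters mirror ASCII at an offset of 0xE0000, so such a sequence can be built and decoded with simple arithmetic. This is a rough sketch with hypothetical helper names; it does not verify that the decoded identifier is a region that actually has a flag.

```python
BLACK_FLAG = "\U0001F3F4"
CANCEL_TAG = "\U000E007F"
TAG_OFFSET = 0xE0000  # tag characters are ASCII code points plus this offset


def encode_subdivision_flag(region_id):
    """Build the tag sequence for an identifier such as 'gbwls'."""
    tags = "".join(chr(TAG_OFFSET + ord(c)) for c in region_id)
    return BLACK_FLAG + tags + CANCEL_TAG


def decode_subdivision_flag(text, start=0):
    """Return (region_id, end_index) if a tag-sequence flag starts at
    `start`, otherwise None. `end_index` points just past the cancel tag."""
    if start >= len(text) or text[start] != BLACK_FLAG:
        return None
    letters = []
    i = start + 1
    # Collect tag characters (U+E0020..U+E007E) and map them back to ASCII.
    while i < len(text) and 0xE0020 <= ord(text[i]) <= 0xE007E:
        letters.append(chr(ord(text[i]) - TAG_OFFSET))
        i += 1
    # A well-formed sequence has at least one tag and ends with the cancel tag.
    if letters and i < len(text) and text[i] == CANCEL_TAG:
        return "".join(letters), i + 1
    return None
```

A plain black flag with no tags after it is not treated as a subdivision flag, so it can still be reported as an ordinary emoji by the rest of the scanner.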

Besides all that, there are of course regular emojis that look like flags. The aforementioned black flag (U+1F3F4) is one of them, and there is also the triangular flag (U+1F6A9), among others. You should already be able to detect most of these, since they behave like any other single-code-point emoji.

However, that is not the whole story. Composite emojis are an issue that affects some flags but also many other emojis. In your example, the dark-skinned girl in the input string is matched as a "base" girl emoji followed by a separate brown patch. This is because that emoji is made up of two code points, girl (U+1F467) and dark skin tone (U+1F3FF). In many other cases you need the two code points plus a zero-width joiner (U+200D) in between to indicate that they should be merged. Sometimes you also need a variation selector (typically variation selector 16, U+FE0F) to request emoji rather than text presentation. Among flags, for example, the rainbow flag (U+1F3F3, U+FE0F, U+200D, U+1F308) reads as "white flag, variation selector 16 (to use the white flag emoji, not text), zero-width joiner, rainbow", and the pirate flag (U+1F3F4, U+200D, U+2620, U+FE0F) reads as "black flag, zero-width joiner, skull and crossbones, variation selector 16 (to use the skull and crossbones emoji, not text)".
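To see how a given emoji is composed, it can help to list its code points with their Unicode names. The sketch below uses only the standard `unicodedata` module; the function name is made up for illustration, and the names returned depend on the Unicode version bundled with your Python.

```python
import unicodedata


def describe_code_points(text):
    """Return a list of ('U+XXXX', name) tuples, one per code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Rainbow flag: white flag, VS16, zero-width joiner, rainbow.
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
for code_point, name in describe_code_points(rainbow_flag):
    print(code_point, name)
```

Running the same function on the first emoji of the question's string should show the two separate code points that the original loop matched one at a time.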

There are different ways to deal with all this, but your current approach iterates one code point at a time, so it cannot detect multi-code-point emojis. One option is to keep a big set of all the sequences you care about (flags, some composite emojis, etc.) and look for them in the input. Another is to check whether the current character is a regional indicator symbol and, if so, read the next code point to form a flag (and settle for individual simple emojis for the rest). I cannot say for sure which solution is best for your case in terms of the complexity/benefit trade-off, but you should be aware of the nuances of emoji encoding and the pitfalls you may run into.
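The "big set of sequences" option could look roughly like the sketch below, which builds a regular expression that tries longer sequences first so that a composite emoji is not split into its parts. The `KNOWN_SEQUENCES` set here is a tiny hypothetical sample; in practice you would fill it from a complete emoji data source.

```python
import re

# Hypothetical sample; a real set would hold every sequence of interest.
KNOWN_SEQUENCES = {
    "\U0001F467",                        # girl
    "\U0001F467\U0001F3FF",              # girl, dark skin tone
    "\U0001F1F5\U0001F1F0",              # flag: Pakistan
    "\U0001F3F3\uFE0F",                  # white flag
    "\U0001F3F3\uFE0F\u200D\U0001F308",  # rainbow flag
}


def build_matcher(sequences):
    """Compile an alternation that prefers the longest sequence."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    # Alternatives are tried in order, so sort longest first.
    ordered = sorted(sequences, key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in ordered))


def find_sequences(text, matcher):
    """Return dicts in the same shape as the question's emoji_lis()."""
    return [
        {"location": m.start(), "emoji": m.group()}
        for m in matcher.finditer(text)
    ]


matcher = build_matcher(KNOWN_SEQUENCES)
```

With this approach the skin-toned girl and the flag each come back as a single entry. The trade-off is that anything missing from the set goes undetected, so the set has to be kept up to date.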
