Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings?

Latest recommended article published on 2024-06-28 13:48:44

asdfgh0077

Tags: swift string unicode emoji

Original link: https://oldbug.net/q/2x19z/Why-are-emoji-characters-like-treated-so-strangely-in-Swift-strings

Translated from: Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings?

The character 👩‍👩‍👧‍👦 (family with two women, one girl, and one boy) is encoded as such:

U+1F469 WOMAN,
U+200D ZWJ,
U+1F469 WOMAN,
U+200D ZWJ,
U+1F467 GIRL,
U+200D ZWJ,
U+1F466 BOY
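
The scalar sequence listed above can be checked directly in Swift. This is a minimal sketch, assuming Swift 5 or later; the names `family` and `codePoints` are illustrative and not part of the original question.

```swift
// Build the family emoji from its seven Unicode scalars.
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Format every scalar as "U+XXXX" so the encoding becomes visible.
let codePoints = family.unicodeScalars.map { scalar in
    "U+" + String(scalar.value, radix: 16, uppercase: true)
}

print(codePoints)
// ["U+1F469", "U+200D", "U+1F469", "U+200D", "U+1F467", "U+200D", "U+1F466"]
```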

So it's very interestingly encoded; the perfect target for a unit test. However, Swift doesn't seem to know how to treat it. Here's what I mean:

"👩‍👩‍👧‍👦".contains("👩‍👩‍👧‍👦") // true
"👩‍👩‍👧‍👦".contains("👩") // false
"👩‍👩‍👧‍👦".contains("\u{200D}") // false
"👩‍👩‍👧‍👦".contains("👧") // false
"👩‍👩‍👧‍👦".contains("👦") // true

So, Swift says it contains itself (good) and a boy (good!). But it then says it does not contain a woman, girl, or zero-width joiner. What's happening here? Why does Swift know it contains a boy but not a woman or girl? I could understand if it treated it as a single character and only recognized it containing itself, but the fact that it got one subcomponent and no others baffles me.

This does not change if I use something like "👩".characters.first!.

Even more confounding is this:

let manual = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
Array(manual.characters) // ["👩‍", "👩‍", "👧‍", "👦"]

Even though I placed the ZWJs in there, they aren't reflected in the character array. What followed was a little telling:

manual.contains("👩") // false
manual.contains("👧") // false
manual.contains("👦") // true

So I get the same behavior with the character array... which is supremely annoying, since I know what the array looks like.

This also does not change if I use something like "👩".characters.first!.

Reply #1

Reference: https://stackoom.com/question/2x19z/为什么在Swift字符串中像-这样的表情符号字符被如此奇怪地对待

Reply #2

It seems that Swift treats a ZWJ as part of an extended grapheme cluster together with the character immediately preceding it. We can see this when mapping the array of characters to their unicodeScalars:

Array(manual.characters).map { $0.description.unicodeScalars }

This prints the following from LLDB:

▿ 4 elements
  ▿ 0 : StringUnicodeScalarView("👩‍")
    - 0 : "\u{0001F469}"
    - 1 : "\u{200D}"
  ▿ 1 : StringUnicodeScalarView("👩‍")
    - 0 : "\u{0001F469}"
    - 1 : "\u{200D}"
  ▿ 2 : StringUnicodeScalarView("👧‍")
    - 0 : "\u{0001F467}"
    - 1 : "\u{200D}"
  ▿ 3 : StringUnicodeScalarView("👦")
    - 0 : "\u{0001F466}"

Additionally, .contains groups extended grapheme clusters into a single character. For instance, taking the hangul characters ᄒ, ᅡ, and ᆫ (which combine to make the Korean word for "one": 한):

"\u{1112}\u{1161}\u{11AB}".contains("\u{1112}") // false

This could not find ᄒ because the three code points are grouped into one cluster, which acts as one character. Similarly, \u{1F469}\u{200D} (WOMAN ZWJ) is one cluster, which acts as one character.
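
The same grouping can be observed with the standard library alone, without going through Foundation. The sketch below assumes Swift 5 or later; `jamo` and `hieut` are illustrative names.

```swift
// Three conjoining jamo that together form the syllable 한.
let jamo = "\u{1112}\u{1161}\u{11AB}"
let hieut: Unicode.Scalar = "\u{1112}"   // the leading consonant ᄒ

// At the Character level the three scalars form a single grapheme cluster.
print(jamo.count)                          // 1
print(jamo.contains(Character(hieut)))     // false

// At the scalar level the individual code points are visible again.
print(jamo.unicodeScalars.count)           // 3
print(jamo.unicodeScalars.contains(hieut)) // true

// String equality uses canonical equivalence, so the jamo sequence
// compares equal to the precomposed syllable U+D55C.
print(jamo == "\u{D55C}")                  // true
```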

Reply #3

This has to do with how the String type works in Swift, and how the contains(_:) method works.

The '👩‍👩‍👧‍👦' is what's known as an emoji sequence, which is rendered as one visible character in a string. The sequence is made up of Character objects, and at the same time it is made up of UnicodeScalar objects.

If you check the character count of the string, you'll see that it is made up of four characters, while if you check the Unicode scalar count, it will show you a different result:

print("👩‍👩‍👧‍👦".characters.count)     // 4
print("👩‍👩‍👧‍👦".unicodeScalars.count) // 7

Now, if you iterate over the characters and print them, you'll see what look like normal characters, but in fact the first three characters contain both an emoji and a zero-width joiner in their UnicodeScalarView:

for char in "👩‍👩‍👧‍👦".characters {
    print(char)

    let scalars = String(char).unicodeScalars.map({ String($0.value, radix: 16) })
    print(scalars)
}

// 👩‍
// ["1f469", "200d"]
// 👩‍
// ["1f469", "200d"]
// 👧‍
// ["1f467", "200d"]
// 👦
// ["1f466"]

As you can see, only the last character does not contain a zero-width joiner, so when using the contains(_:) method, it works as you'd expect. Since you aren't comparing against emoji containing zero-width joiners, the method won't find a match for any but the last character.

To expand on this, if you create a String which is composed of an emoji character ending with a zero-width joiner, and pass it to the contains(_:) method, it will also evaluate to false. This has to do with contains(_:) being exactly the same as range(of:) != nil, which tries to find an exact match to the given argument. Since characters ending with a zero-width joiner form an incomplete sequence, the method tries to find a match for the argument while combining characters ending with a zero-width joiner into a complete sequence. This means that the method won't ever find a match if:

the argument ends with a zero-width joiner, and
the string to parse doesn't contain an incomplete sequence (i.e. one ending with a zero-width joiner and not followed by a compatible character).

To demonstrate:

let s = "\u{1f469}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}" // 👩‍👩‍👧‍👦

s.range(of: "\u{1f469}\u{200d}") != nil                            // false
s.range(of: "\u{1f469}\u{200d}\u{1f469}") != nil                   // false

However, since the comparison only looks ahead, you can find several other complete sequences within the string by working backwards:

s.range(of: "\u{1f466}") != nil                                    // true
s.range(of: "\u{1f467}\u{200d}\u{1f466}") != nil                   // true
s.range(of: "\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}") != nil  // true

// Same as the above:
s.contains("\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}")          // true

The easiest solution would be to provide a specific compare option to the range(of:options:range:locale:) method. The option String.CompareOptions.literal performs the comparison on an exact character-by-character equivalence. As a side note, what's meant by character here is not the Swift Character, but the UTF-16 representation of both the instance and the comparison string. However, since String doesn't allow malformed UTF-16, this is essentially equivalent to comparing the Unicode scalar representation.

Here I've overloaded the Foundation method, so if you need the original one, rename this one or something:

extension String {
    func contains(_ string: String) -> Bool {
        return self.range(of: string, options: String.CompareOptions.literal) != nil
    }
}

Now the method works as it "should" with each character, even with incomplete sequences:

s.contains("👩")          // true
s.contains("👩\u{200d}")  // true
s.contains("\u{200d}")    // true

Reply #4

The first problem is you're bridging to Foundation with contains (Swift's String is not a Collection), so this is NSString behavior, which I don't believe handles composed emoji as powerfully as Swift. That said, I believe Swift is implementing Unicode 8 right now, which also needed revision around this situation in Unicode 10 (so this may all change when they implement Unicode 10; I haven't dug into whether it will or not).

To simplify things, let's get rid of Foundation, and use Swift, which provides views that are more explicit. We'll start with characters:

"👩‍👩‍👧‍👦".characters.forEach { print($0) }
👩‍
👩‍
👧‍
👦

OK. That's what we expected. But it's a lie. Let's see what those characters really are.

"👩‍👩‍👧‍👦".characters.forEach { print(String($0).unicodeScalars.map{$0}) }
["\u{0001F469}", "\u{200D}"]
["\u{0001F469}", "\u{200D}"]
["\u{0001F467}", "\u{200D}"]
["\u{0001F466}"]

Ah… So it's ["👩ZWJ", "👩ZWJ", "👧ZWJ", "👦"]. That makes everything a bit more clear. 👩 is not a member of this list (it's "👩ZWJ"), but 👦 is a member.

The problem is that Character is a "grapheme cluster," which composes things together (like attaching the ZWJ). What you're really searching for is a Unicode scalar. And that works exactly as you're expecting:

"👩‍👩‍👧‍👦".unicodeScalars.contains("👩") // true
"👩‍👩‍👧‍👦".unicodeScalars.contains("\u{200D}") // true
"👩‍👩‍👧‍👦".unicodeScalars.contains("👧") // true
"👩‍👩‍👧‍👦".unicodeScalars.contains("👦") // true

And of course we can also look for the actual character that is in there:

"👩‍👩‍👧‍👦".characters.contains("👩\u{200D}") // true
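
Searching for a run of several scalars, rather than a single one, can be sketched the same way. The helper below is hypothetical (`containsScalars(of:)` is not a standard library or Foundation method) and assumes Swift 5 or later; it does a plain subsequence search over the two scalar views.

```swift
extension String {
    /// Hypothetical helper: returns true when the Unicode scalars of `other`
    /// appear contiguously, in order, among this string's Unicode scalars.
    func containsScalars(of other: String) -> Bool {
        let haystack = Array(unicodeScalars)
        let needle = Array(other.unicodeScalars)
        guard !needle.isEmpty else { return true }
        guard needle.count <= haystack.count else { return false }
        // Naive scan: try every possible starting offset.
        for start in 0...(haystack.count - needle.count) {
            if haystack[start..<(start + needle.count)].elementsEqual(needle) {
                return true
            }
        }
        return false
    }
}

let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

print(family.containsScalars(of: "\u{1F469}"))                   // true (woman)
print(family.containsScalars(of: "\u{200D}"))                    // true (ZWJ)
print(family.containsScalars(of: "\u{1F469}\u{200D}\u{1F467}"))  // true (woman, ZWJ, girl)
print(family.containsScalars(of: "\u{1F466}\u{1F467}"))          // false (boy then girl never occurs)
```

Working on scalars sidesteps grapheme clustering entirely, so the result should not depend on which Unicode segmentation rules the toolchain implements.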

(This heavily duplicates Ben Leggiero's points. I posted this before noticing he'd answered. Leaving in case it is clearer to anyone.)

Reply #5

The other answers discuss what Swift does, but don't go into much detail about why.

Do you expect “Å” to equal “Å”? I expect you would.

One of these is a letter with a combiner, the other is a single composed character. You can add many different combiners to a base character, and a human would still consider it to be a single character. To deal with this sort of discrepancy, the concept of a grapheme was created to represent what a human would consider a character, regardless of the code points used.
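
In Swift terms, the "Å" comparison can be sketched as follows. This assumes Swift 5 or later; `precomposed` and `decomposed` are illustrative names for the two spellings.

```swift
let precomposed = "\u{00C5}"   // Å as a single code point
let decomposed = "A\u{030A}"   // A followed by COMBINING RING ABOVE

// String equality is based on canonical equivalence, so both compare equal.
print(precomposed == decomposed)             // true

// Each spelling is one Character (one grapheme cluster)...
print(precomposed.count)                     // 1
print(decomposed.count)                      // 1

// ...but the underlying scalar sequences differ.
print(precomposed.unicodeScalars.count)      // 1
print(decomposed.unicodeScalars.count)       // 2
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)) // false
```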

Text messaging services have been combining characters into graphical emoji for years: :) → 🙂. So various emoji were added to Unicode.
These services also started combining emoji together into composite emoji.
There is of course no reasonable way to encode all possible combinations into individual code points, so the Unicode Consortium decided to expand the concept of graphemes to encompass these composite characters.

What this boils down to is that "👩‍👩‍👧‍👦" should be considered a single "grapheme cluster" if you are trying to work with it at the grapheme level, as Swift does by default.

If you want to check if it contains "👦" as a part of that, then you should go down to a lower level.

I don't know Swift syntax, so here is some Perl 6, which has a similar level of support for Unicode.
(Perl 6 supports Unicode version 9, so there may be discrepancies.)

say "\c[family: woman woman girl boy]" eq "👩‍👩‍👧‍👦"; # True

# .contains is a Str method only, in Perl 6
say "👩‍👩‍👧‍👦".contains("👩‍👩‍👧‍👦");    # True
say "👩‍👩‍👧‍👦".contains("👦");        # False
say "👩‍👩‍👧‍👦".contains("\x[200D]");  # False

# comb with no arguments splits a Str into graphemes
my @graphemes = "👩‍👩‍👧‍👦".comb;
say @graphemes.elems;                # 1

Let's go down a level:

# look at it as a list of NFC codepoints
my @components := "👩‍👩‍👧‍👦".NFC;
say @components.elems;                     # 7

say @components.grep("👦".ord).Bool;       # True
say @components.grep("\x[200D]".ord).Bool; # True
say @components.grep(0x200D).Bool;         # True

Going down to this level can make some things harder though.

my @match = "👩‍👩‍👧‍👦".ords;
my $l = @match.elems;
say @components.rotor( $l => 1-$l ).grep(@match).Bool; # True

I assume that .contains in Swift makes that easier, but that doesn't mean there aren't other things which become more difficult.

Working at this level makes it much easier to accidentally split a string in the middle of a composite character, for example.
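
A Swift sketch of that hazard, assuming Swift 5 on a recent toolchain where the whole family sequence is a single Character; `family`, `firstCharacter`, and `partial` are illustrative names.

```swift
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Cutting at the Character level cannot split the cluster:
// the first Character is the entire family emoji.
let firstCharacter = String(family.prefix(1))
print(firstCharacter == family)       // true

// Cutting at the scalar level can. Keeping only the first three scalars
// (woman, ZWJ, woman) yields a different, truncated sequence.
var partial = ""
for scalar in family.unicodeScalars.prefix(3) {
    partial.unicodeScalars.append(scalar)
}
print(partial.unicodeScalars.count)   // 3
print(partial == family)              // false
```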

What you are inadvertently asking is why this higher-level representation does not work like a lower-level representation would. The answer is, of course, that it's not supposed to.

If you are asking yourself “why does this have to be so complicated”, the answer is of course “humans”.

Reply #6

Swift 4.0 update

String received lots of revisions in the Swift 4 update, as documented in SE-0163. Two emoji are used for this demo, representing two different structures. Both are built from a sequence of emoji.

👍🏽 is the combination of two emoji, 👍 and 🏽

👩‍👩‍👧‍👦 is the combination of four emoji, connected by zero-width joiners. The format is 👩‍joiner👩‍joiner👧‍joiner👦

1. Counts

In Swift 4.0, an emoji is counted as a grapheme cluster. Every single emoji is counted as 1. The count property is also available directly on String, so you can call it like this.

"👍🏽".count  // 1. Not available on swift 3
"👩‍👩‍👧‍👦".count  // 1. Not available on swift 3

The character array of a string is also counted in grapheme clusters in Swift 4.0, so both of the following lines print 1. These two emoji are examples of emoji sequences, where several emoji are combined together with or without a zero-width joiner \u{200d} between them. In Swift 3.0, the character array of such a string separates out each emoji and results in an array with multiple elements (emoji). The joiner is ignored in this process. However, in Swift 4.0, the character array sees all the emoji as one piece, so the count of any emoji will always be 1.

"👍🏽".characters.count  // 1. In swift 3, this prints 2
"👩‍👩‍👧‍👦".characters.count  // 1. In swift 3, this prints 4

unicodeScalars remains unchanged in Swift 4. It exposes the individual Unicode scalars in the given string.

"👍🏽".unicodeScalars.count  // 2. Combination of two emoji
"👩‍👩‍👧‍👦".unicodeScalars.count  // 7. Combination of four emoji with joiner between them
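
For comparison, the counts across all four string views can be sketched as below. This assumes Swift 5 on a recent toolchain; `thumbsUp` and `family` are illustrative names, and the utf16 and utf8 views are an addition not covered in the answer above.

```swift
let thumbsUp = "\u{1F44D}\u{1F3FD}"   // 👍 plus a skin tone modifier
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

for (label, text) in [("thumbs up", thumbsUp), ("family", family)] {
    print(label,
          "characters:", text.count,
          "scalars:", text.unicodeScalars.count,
          "utf16:", text.utf16.count,
          "utf8:", text.utf8.count)
}
// thumbs up characters: 1 scalars: 2 utf16: 4 utf8: 8
// family characters: 1 scalars: 7 utf16: 11 utf8: 25
```

Each emoji scalar here lies outside the Basic Multilingual Plane, so it takes two UTF-16 code units and four UTF-8 bytes, while each zero-width joiner takes one UTF-16 code unit and three UTF-8 bytes.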

2. Contains

In Swift 4.0, the contains method ignores the zero-width joiner in emoji. So it returns true for any of the four emoji components of "👩‍👩‍👧‍👦", and returns false if you check for the joiner. However, in Swift 3.0, the joiner is not ignored and is combined with the emoji in front of it. So when you check whether "👩‍👩‍👧‍👦" contains any of the first three component emoji, the result will be false.

"👍🏽".contains("👍")       // true
"👍🏽".contains("🏽")        // true
"👩‍👩‍👧‍👦".contains("👩‍👩‍👧‍👦")       // true
"👩‍👩‍👧‍👦".contains("👩")       // true. In swift 3, this prints false
"👩‍👩‍👧‍👦".contains("\u{200D}") // false
"👩‍👩‍👧‍👦".contains("👧")       // true. In swift 3, this prints false
"👩‍👩‍👧‍👦".contains("👦")       // true
