Handling emoji in Java: removing ✅, ✈, ♛ and other such emojis/images/signs from a Java string

I have some strings with all kinds of different emojis/images/signs in them.

Not all the strings are in English; some are in other languages, including ones written in non-Latin scripts. For example:

▓ railway??

→ Cats and dogs

I'm on 🔥

Apples ⚛

✅ Vi sign

♛ I'm the king ♛

Corée ♦ du Nord ☁ (French)

gjør at både ◄╗ (Norwegian)

Star me ★

Star ⭐ once more

早上好 ♛ (Chinese)

Καλημέρα ✂ (Greek)

another ✓ sign ✓

добрай раніцы ✪ (Belarus)

◄ शुभ प्रभात ◄ (Hindi)

✪ ✰ ❈ ❧ Let's get together ★. We shall meet at 12/10/2018 10:00 AM at Tony's.❉

...and many more of these.

I would like to get rid of all these signs/images and to keep only the letters (and punctuation) in the different languages.

I tried to clean the signs using the EmojiParser library:

String withoutEmojis = EmojiParser.removeAllEmojis(input);

The problem is that EmojiParser fails to remove most of these signs. So far, ♦ is the only one I have seen it remove.

Other signs such as ✪ ❉ ★ ✰ ❈ ❧ ✂ ❋ ⓡ ✿ ♛ 🔥 are not removed.

Is there a way to remove all these signs from the input strings and keep only the letters and punctuation in the different languages?

Solution

Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don't need to worry about every new emoji being added.

String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";

String emotionless = aString.replaceAll(characterFilter,"");
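If the filter runs over many strings, it may be worth compiling the pattern once instead of calling replaceAll with a pattern string each time. The following is a minimal sketch of that idea, not part of the original answer; the class name SymbolStripper and the method name strip are hypothetical, and the pattern is the same whitelist as above.

```java
import java.util.regex.Pattern;

public class SymbolStripper {
    // Same whitelist as in the answer, compiled once and reused.
    // Anything that is NOT a letter, mark, number, punctuation,
    // separator, format character, surrogate or whitespace is matched.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Hypothetical helper: returns the input with every matched character removed.
    public static String strip(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // A few of the sample strings from the question.
        System.out.println(strip("♛ I'm the king ♛"));
        System.out.println(strip("Καλημέρα ✂ (Greek)"));
        System.out.println(strip("I'm on 🔥"));
    }
}
```

Only the symbols themselves are deleted, so the spaces that surrounded them stay in the result; trim or collapse them afterwards if that matters.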

So:

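[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s] is a character class that matches letters (\\p{L}), combining marks (\\p{M}), numbers (\\p{N}), punctuation (\\p{P}), separators such as spaces (\\p{Z}), invisible format characters (\\p{Cf}), surrogate code units (\\p{Cs}), and whitespace including tabs and newlines (\\s). The backslashes are doubled because the pattern is written as a Java string literal. \\p{L} covers letters from every script: Latin, Cyrillic, Greek, Devanagari, Chinese characters, and so on. Note that Java matches a valid surrogate pair such as 🔥 as a single code point with its own category (a symbol), so \\p{Cs} does not protect it and it is still removed.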

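The ^ at the start of the character class negates it, so the pattern matches every character that is not in the whitelist, and replaceAll deletes those characters.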
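To see why a given character survives or disappears, it can help to look up its Unicode general category directly. The sketch below is an illustration rather than part of the original answer: the class name CategoryCheck and the method isSymbol are hypothetical, while Character.getType and the category constants are standard Java. It only checks the four symbol categories (Sm, Sc, Sk, So), none of which appears in the whitelist.

```java
public class CategoryCheck {
    // Hypothetical helper: true if the code point belongs to one of the
    // symbol categories, which the whitelist does not include.
    public static boolean isSymbol(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.MATH_SYMBOL:      // Sm, e.g. '+'
            case Character.CURRENCY_SYMBOL:  // Sc, e.g. '$'
            case Character.MODIFIER_SYMBOL:  // Sk, e.g. '^'
            case Character.OTHER_SYMBOL:     // So, e.g. U+265B (black chess queen)
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // Code points are used so that characters above U+FFFF work too.
        int[] samples = {'A', '!', '$', '+', 0x265B, 0x1F525};
        for (int cp : samples) {
            String shown = new String(Character.toChars(cp));
            System.out.println(String.format("U+%04X %s -> symbol: %b",
                    cp, shown, isSymbol(cp)));
        }
    }
}
```

Under this check '$' and '+' count as symbols too, so the whitelist also strips currency and math signs. Adding \\p{Sc} or \\p{Sm} to the character class would keep them, at the cost of also keeping arrows such as →, which are math symbols.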

Example:

String str = "hello world _# 皆さん、こんにちは！　私はジョンと申します。🔥";

System.out.print(str.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]",""));

// Output:

// "hello world _# 皆さん、こんにちは！　私はジョンと申します。"

For more detail, see the Javadoc for java.util.regex.Pattern, which lists the Unicode categories and other character classes that Java regexes support.