How to remove invalid Unicode characters from a string in Java

I am using the CoreNLP Neural Network Dependency Parser to parse some social media content. Unfortunately, the file contains characters which are, according to fileformat.info, not valid Unicode characters or are Unicode replacement characters, for example U+D83D or U+FFFD. If those characters are in the file, CoreNLP responds with error messages like this one:

Nov 15, 2015 5:15:38 PM edu.stanford.nlp.process.PTBLexer next

WARNING: Untokenizable: ? (U+D83D, decimal: 55357)

Based on this answer, I tried document.replaceAll("\\p{C}", ""); to just remove those characters. document here is just the document as a string. But that didn't help.
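For illustration, here is a minimal diagnostic one could run to see which code points survive that replacement (the sample string and the printing loop are made up for this example, not part of my actual setup):

String sample = "test \uD83D\uDE00 \uFFFD end";  // assumed sample: an emoji surrogate pair plus a replacement character
String cleaned = sample.replaceAll("\\p{C}", "");
// print the code points that remain after the replacement
cleaned.codePoints()
       .forEach(cp -> System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));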

How can I remove those characters out of the string before passing it to coreNLP?

UPDATE (Nov 16th):

For the sake of completeness I should mention that I asked this question only in order to avoid the huge amount of error messages by preprocessing the file. CoreNLP just ignores characters it can't handle, so that is not the problem.

Solution

In a way, both answers provided by Mukesh Kumar and GsusRecovery are helpful, but not fully correct.

document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");

seems to replace all invalid characters. But CoreNLP apparently fails on even more characters than that pattern covers. I identified them manually by running the parser on my whole corpus, which led to this:

document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");

So right now I am running two replaceAll() calls before handing the document to the parser. The complete code snippet is:

// remove invalid unicode characters
String tmpDoc1 = document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");

// remove other unicode characters coreNLP can't handle
String tmpDoc2 = tmpDoc1.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");

DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(tmpDoc2));

for (List<HasWord> sentence : tokenizer) {
    List<TaggedWord> tagged = tagger.tagSentence(sentence);
    GrammaticalStructure gs = parser.predict(tagged);
    System.err.println(gs);
}
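If this preprocessing runs over many documents, one possible refinement (my own sketch, not part of the snippet above; the constant names and the helper name are made up) is to precompile both regexes once instead of letting replaceAll() recompile them on every call:

import java.util.regex.Pattern;

// fields and helper live in the same class
private static final Pattern INVALID_UNICODE =
        Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]");
private static final Pattern UNSUPPORTED_BY_CORENLP =
        Pattern.compile("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]");

// hypothetical helper: same two-step cleanup as above, with precompiled patterns
static String cleanForCoreNLP(String document) {
    String withoutInvalid = INVALID_UNICODE.matcher(document).replaceAll("");
    return UNSUPPORTED_BY_CORENLP.matcher(withoutInvalid).replaceAll("");
}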

This is not necessarily a complete list of unsupported characters, though, which is why I opened an issue on GitHub.

Please note that CoreNLP automatically removes those unsupported characters. The only reason I want to preprocess my corpus is to avoid all those error messages.

UPDATE (Nov 27th):

Christopher Manning just answered the GitHub issue I opened. There are several ways to handle those characters using the class edu.stanford.nlp.process.TokenizerFactory. Take this code example to tokenize a document:

DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
TokenizerFactory<? extends HasWord> factory = null;
factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneDelete");
tokenizer.setTokenizerFactory(factory);

for (List<HasWord> sentence : tokenizer) {
    // do something with the sentence
}

You can replace noneDelete in line 4 with other options. I am citing Manning:

"(...) the complete set of six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep."

That means that, to keep the characters without getting all those error messages, the best way is to use the option noneKeep. This approach is far more elegant than any attempt to remove those characters.
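For completeness, this is how the full pipeline from above could look with noneKeep (a sketch only, reusing the tagger and parser objects from the earlier snippet):

DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneKeep");  // no warnings, keep the characters as single-character tokens
tokenizer.setTokenizerFactory(factory);

for (List<HasWord> sentence : tokenizer) {
    List<TaggedWord> tagged = tagger.tagSentence(sentence);
    GrammaticalStructure gs = parser.predict(tagged);
    System.err.println(gs);
}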
