I have run into this error from Nutch/Solr several times: Invalid UTF-8 character.
It turns out that Nutch 1.3 has a bug around "Strip UTF-8 non-character codepoints", which was fixed in 1.4.
public static String stripNonCharCodepoints(String input) {
    StringBuilder retval = new StringBuilder();
    char ch;
    for (int i = 0; i < input.length(); i++) {
        ch = input.charAt(i);
        // Strip all non-characters http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
        // and non-printable control characters except tabulator, new line and carriage return
        if (ch % 0x10000 != 0xffff && // 0xffff - 0x10ffff range step 0x10000
            ch % 0x10000 != 0xfffe && // 0xfffe - 0x10fffe range
            (ch <= 0xfdd0 || ch >= 0xfdef) && // 0xfdd0 - 0xfdef
            (ch > 0x1F || ch == 0x9 || ch == 0xa || ch == 0xd)) {
            // Keep every character that passed all of the checks above
            retval.append(ch);
        }
    }
    return retval.toString();
}
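
The method above walks the string one `char` at a time, so it only ever sees 16-bit UTF-16 code units. For a `char`, `ch % 0x10000` is always `ch` itself, which means the noncharacters in the supplementary planes (U+1FFFE, U+1FFFF, ... U+10FFFF) are never matched, and the `<=` / `>=` bounds let U+FDD0 and U+FDEF through. Below is a minimal sketch of a code-point-based variant, assuming the same filtering goal. The class name `NonCharStripper` and its helper methods are hypothetical and are not part of Nutch.

```java
public class NonCharStripper {

    // Hypothetical helper (not a Nutch API): true for the Unicode noncharacters.
    static boolean isNonCharacter(int cp) {
        return (cp & 0xFFFE) == 0xFFFE          // U+nFFFE and U+nFFFF in every plane
            || (cp >= 0xFDD0 && cp <= 0xFDEF);  // U+FDD0..U+FDEF, both ends included
    }

    // True for C0 control characters other than tab, line feed and carriage return.
    static boolean isStrippedControl(int cp) {
        return cp <= 0x1F && cp != 0x9 && cp != 0xA && cp != 0xD;
    }

    public static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Read a whole code point, so surrogate pairs are handled as one unit
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!isNonCharacter(cp) && !isStrippedControl(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+FFFF and U+0000 are dropped, the tab is kept
        String dirty = "ab\uFFFF\u0000c\td";
        System.out.println(strip(dirty).equals("abc\td")); // prints true
    }
}
```

This sketch leaves unpaired surrogates untouched. If those can appear in the crawled content, they would need a separate check.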