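HtmlCleaner is an open-source HTML parser written in Java. It is quite powerful and easy to use. I won't cover how to use it here; for that, see the official site (http://htmlcleaner.sourceforge.net/javause.php).
This post describes a bug in HtmlCleaner.
Symptom:
When fetching page content with HtmlCleaner, you can leave the encoding unset if you don't know the page's encoding. The code looks like this: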
HtmlCleaner cleaner = new HtmlCleaner();
URL url = new URL("http://www.qq.com/");
TagNode node = cleaner.clean(url);
HtmlCleaner then detects the page encoding automatically, but its detection misses one case: a page that declares its encoding in the following form
<meta charset="UTF-8" />
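In this case HtmlCleaner cannot determine the page encoding and falls back to the system encoding. If the system encoding differs from the page encoding, the text comes out garbled.
Solution: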
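To make the gap concrete, the sketch below runs two regular expressions against both kinds of meta tag. The class name MetaCharsetDemo and both patterns are assumptions made for illustration, not HtmlCleaner's actual source; the first pattern only approximates a check for the older http-equiv form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the patterns are illustrative, not HtmlCleaner's own.
final class MetaCharsetDemo {
    // Older form: <meta http-equiv="Content-Type" content="text/html; charset=...">
    static final Pattern HTTP_EQUIV = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*[\"']content-type[\"']\\s+content\\s*=\\s*[\"']"
            + "text/html\\s*;\\s*charset=([a-z\\d\\-]+)",
        Pattern.CASE_INSENSITIVE);

    // HTML5 short form: <meta charset="...">
    static final Pattern HTML5 = Pattern.compile(
        "<meta\\s+charset\\s*=\\s*[\"']?([a-z\\d\\-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the captured charset name, or null when the pattern does not match.
    static String find(Pattern pattern, String html) {
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String oldForm = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">";
        String newForm = "<meta charset=\"UTF-8\" />";
        System.out.println(find(HTTP_EQUIV, oldForm)); // gb2312
        System.out.println(find(HTTP_EQUIV, newForm)); // null: the short form is missed
        System.out.println(find(HTML5, newForm));      // UTF-8
    }
}
```

A detector that only knows the http-equiv form returns nothing for the short form, which is how the fallback to the system encoding comes about.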
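import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.htmlcleaner.HtmlCleaner;

// Resolves the page charset: HTTP header first, then the two <meta> forms, then HtmlCleaner's default.
public static String getCharset(URL url) throws Exception {
    URLConnection urlConnection = url.openConnection();
    // 1. charset parameter of the Content-Type response header
    String charset = getCharsetFromContentTypeString(urlConnection.getHeaderField("Content-Type"));
    if (charset == null) {
        // 2. <meta http-equiv="Content-Type" content="text/html; charset=...">
        charset = getCharsetFromContent(url);
    }
    if (charset == null) {
        // 3. <meta charset="...">, the form HtmlCleaner misses
        charset = getCharsetFromMeta(url);
    }
    if (charset == null) {
        // 4. nothing declared: fall back to HtmlCleaner's default
        charset = HtmlCleaner.DEFAULT_CHARSET;
    }
    return charset;
}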
// Extracts the charset from a Content-Type value such as "text/html; charset=UTF-8".
public static String getCharsetFromContentTypeString(String contentType) {
    if (contentType != null) {
        // "+" instead of "*": an empty name would make Charset.isSupported throw
        String pattern = "charset=([a-z\\d\\-]+)";
        Matcher matcher = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(contentType);
        if (matcher.find()) {
            String charset = matcher.group(1);
            if (Charset.isSupported(charset)) {
                return charset;
            }
        }
    }
    return null;
}
// Looks for <meta http-equiv="Content-Type" content="text/html; charset=..."> near the start of the page.
public static String getCharsetFromContent(URL url) throws IOException {
    // try-with-resources closes the stream, which the original version leaked
    try (InputStream stream = url.openStream()) {
        byte[] chunk = new byte[2048];
        int bytesRead = stream.read(chunk);
        if (bytesRead > 0) {
            // Decode only the bytes actually read; ISO-8859-1 keeps the ASCII markup intact
            String startContent = new String(chunk, 0, bytesRead, "ISO-8859-1");
            String pattern = "<meta\\s*http-equiv=[\"']content-type[\"']\\s*content\\s*=\\s*[\"']text/html\\s*;\\s*charset=([a-z\\d\\-]+)[\"'>]";
            Matcher matcher = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(startContent);
            if (matcher.find()) {
                String charset = matcher.group(1);
                if (Charset.isSupported(charset)) {
                    return charset;
                }
            }
        }
    }
    return null;
}
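// Looks for the HTML5 short form <meta charset="..."> near the start of the page.
public static String getCharsetFromMeta(URL url) throws Exception {
    // try-with-resources closes the stream, which the original version leaked
    try (InputStream stream = url.openStream()) {
        byte[] chunk = new byte[2048];
        int bytesRead = stream.read(chunk);
        if (bytesRead > 0) {
            // Decode only the bytes actually read; ISO-8859-1 keeps the ASCII markup intact
            String startContent = new String(chunk, 0, bytesRead, "ISO-8859-1");
            // The quote comes after "charset=", not before it, so that <meta charset="UTF-8" /> matches
            String pattern = "<meta\\s+charset\\s*=\\s*[\"']?([a-z\\d\\-]+)";
            Matcher matcher = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(startContent);
            if (matcher.find()) {
                String charset = matcher.group(1);
                if (Charset.isSupported(charset)) {
                    return charset;
                }
            }
        }
    }
    return null;
}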
Note: getCharsetFromContentTypeString and getCharsetFromContent are methods provided in the HtmlCleaner package.
Use getCharset to obtain the encoding, then pass it in as the page encoding when cleaning with HtmlCleaner:
HtmlCleaner cleaner = new HtmlCleaner();
URL url = new URL("http://www.qq.com/");
TagNode node = cleaner.clean(url, getCharset(url));
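The approach above opens the URL several times: once for the header, once per meta check, and once more for the actual clean. As a rough sketch of an alternative, the detection logic can be written to work on a header string plus the first bytes of the page, so the caller fetches only once. CharsetSniffer and its method names are hypothetical helpers, not part of HtmlCleaner, and the single meta pattern is an assumption meant to cover both meta forms.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlCleaner.
final class CharsetSniffer {
    // charset parameter in a Content-Type header value
    private static final Pattern HEADER = Pattern.compile(
        "charset\\s*=\\s*[\"']?([\\w\\-]+)", Pattern.CASE_INSENSITIVE);
    // charset inside any <meta ...> tag; covers the http-equiv form and the HTML5 short form
    private static final Pattern META = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([\\w\\-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first supported charset name the pattern finds in text, or null.
    static String firstSupported(Pattern pattern, String text) {
        if (text == null) {
            return null;
        }
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            String name = matcher.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name (for example, one starting with '-'); keep looking.
            }
        }
        return null;
    }

    // Header first, then the leading bytes of the page, then the caller's fallback.
    static String detect(String contentType, byte[] head, int length, String fallback) {
        String charset = firstSupported(HEADER, contentType);
        if (charset == null) {
            // ISO-8859-1 maps every byte to one char, so ASCII markup is preserved.
            String start = new String(head, 0, length, StandardCharsets.ISO_8859_1);
            charset = firstSupported(META, start);
        }
        return charset != null ? charset : fallback;
    }
}
```

The caller would read the response once, pass the Content-Type header and the first couple of kilobytes to detect, and then hand the resulting name to HtmlCleaner in the same way getCharset(url) is used above.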