Fetching the raw HTML of a page (not the HTML shown in the browser's developer console after the page has loaded its data):
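// Requires: java.io.BufferedReader, java.io.InputStreamReader,
//           java.net.HttpURLConnection, java.net.URL
public static String getHTML(String pageURL) {
    StringBuilder pageHTML = new StringBuilder();
    HttpURLConnection connection = null;
    try {
        URL url = new URL(pageURL);
        connection = (HttpURLConnection) url.openConnection();
        connection.setRequestProperty("User-Agent", "MSIE 7.0");
        // PAGE_ENCODE_TYPE is a charset-name constant defined elsewhere in this class.
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), PAGE_ENCODE_TYPE))) {
            String line;
            while ((line = br.readLine()) != null) {
                pageHTML.append(line).append("\r\n");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Release the connection on both the success and the failure path.
        if (connection != null) {
            connection.disconnect();
        }
    }
    return pageHTML.toString();
}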
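getHTML above decodes every response with the fixed PAGE_ENCODE_TYPE. If the pages you fetch use different encodings, one option is to read the charset from the Content-Type response header (available through connection.getContentType()) and fall back to a default. The following is only a minimal sketch, assuming the header has the usual "text/html; charset=..." form; the class name CharsetSniffer and the method fromContentType are hypothetical and not part of the original code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original code.
class CharsetSniffer {
    // Returns the charset named in a Content-Type header value, or the
    // fallback if no charset parameter is present or it is not supported.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix (8 characters).
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

Inside getHTML, the result could be passed to the reader in place of the constant, for example new InputStreamReader(connection.getInputStream(), CharsetSniffer.fromContentType(connection.getContentType(), StandardCharsets.UTF_8)). Pages that declare their encoding only in a meta tag would still need separate handling.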
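For Jsoup to parse the content correctly, the escaped quotes and slashes (\" and \/) must first be converted back to plain characters:
public static String getHtmlContent(String htmlContent) {
    // Replace the literal sequences \" and \/ with " and /.
    return htmlContent.replace("\\\"", "\"").replace("\\/", "/");
}
Then parse the converted string as a body fragment:
Document document = Jsoup.parseBodyFragment(getHtmlContent(htmlContent));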
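To show what the conversion above does, here is a small sketch that uses only the JDK. It assumes the fetched content is an HTML fragment embedded in a JSON-style response, where quotes and slashes arrive escaped as \" and \/. The sample string and the class name UnescapeDemo are made up for illustration:

```java
// Hypothetical demo class; only getHtmlContent comes from the text above.
class UnescapeDemo {
    // Turns the two-character sequences \" and \/ back into " and /.
    static String getHtmlContent(String htmlContent) {
        return htmlContent.replace("\\\"", "\"").replace("\\/", "/");
    }

    public static void main(String[] args) {
        // Made-up sample: an HTML fragment as it might appear inside a JSON string.
        String escaped =
                "<div class=\\\"item\\\"><a href=\\\"http:\\/\\/example.com\\/a\\\">link<\\/a><\\/div>";
        System.out.println(escaped);                 // still contains \" and \/
        System.out.println(getHtmlContent(escaped)); // plain HTML, ready for Jsoup
    }
}
```

This handles only these two escape sequences. If the response really is JSON, other escapes such as \n or \uXXXX may also appear, and decoding the string with a JSON parser before passing it to Jsoup is likely to be more robust.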
Note: when Jsoup fetches a URL itself, what it returns is also the raw HTML, because Jsoup does not execute JavaScript:
Document document = Jsoup.connect(url)
        .timeout(10000)
        .ignoreContentType(true)
        .userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36")
        .get();