JSoup official site: http://jsoup.org
Apache HttpComponents official site: http://hc.apache.org/index.html
1. Fetching the HTML content
Here we use the HttpClient library to request the remote HTML for a given URL.
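import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// Returns the response body as a string, or null if the request fails
// or the server does not answer with 200 OK.
public static String getHTMLFromURL(String url) {
    String html = null;
    HttpClient httpClient = new DefaultHttpClient();
    HttpGet httpGet = new HttpGet(url);
    try {
        HttpResponse httpResponse = httpClient.execute(httpGet);
        int statusCode = httpResponse.getStatusLine().getStatusCode();
        if (statusCode == HttpStatus.SC_OK) {
            HttpEntity entity = httpResponse.getEntity();
            if (entity != null) {
                // Fall back to UTF-8 when the response declares no charset.
                html = EntityUtils.toString(entity, "UTF-8");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Release the connections held by this client.
        httpClient.getConnectionManager().shutdown();
    }
    return html;
}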
2. Parsing the HTML
Example: print the title of the Baidu home page.
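> Parse the HTML to obtain a Document object
Document doc = Jsoup.parse(html);
> Select elements with a CSS (jQuery-like) selector
Elements elements = doc.select("title");
> Print the text content of each selected element
for (Element element : elements) { System.out.println(element.text()); }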
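The three steps above can be tried without any network access. The following is a minimal sketch, assuming the jsoup library is on the classpath; the class name `SelectorSketch`, the helper methods, and the sample markup are hypothetical and exist only for illustration. It parses an inline HTML string, reads the title, and uses the attribute selector `a[href]` to collect link text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class SelectorSketch {

    // Hypothetical sample markup, used instead of a live page so the sketch runs offline.
    static final String SAMPLE_HTML =
            "<html><head><title>Sample page</title></head>"
          + "<body><a href=\"/a\">First</a><a href=\"/b\">Second</a><a>No href</a></body></html>";

    // Returns the text of the first <title> element, or an empty string if there is none.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        Elements titles = doc.select("title");
        return titles.isEmpty() ? "" : titles.first().text();
    }

    // Returns the text of every <a> element that carries an href attribute.
    static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(titleOf(SAMPLE_HTML));   // prints "Sample page"
        System.out.println(linkTexts(SAMPLE_HTML)); // prints "[First, Second]"
    }
}
```

Checking `isEmpty()` before calling `first()` avoids a NullPointerException when the selector matches nothing, which can happen if the fetched page turns out to be an error page or a redirect stub.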
// Fetch the page, then print the text of every <title> element.
String html = WebCrawler.getHTMLFromURL("http://www.baidu.com");
if (html != null) {
    Document doc = Jsoup.parse(html);
    Elements elements = doc.select("title");
    for (Element element : elements) {
        System.out.println(element.text());
    }
}
Output: