Original article: https://blog.csdn.net/liu857279611/article/details/71244224?utm_source=blogxgwz8
Use case: this article shows how to use XPath to extract a specific piece of data from a web page (the physical location of an IP address).
Problem solved: scraping a specified value from a web page.
1. First, create a Maven project and declare the dependencies
<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>
    <dependency>
        <groupId>javax.xml</groupId>
        <artifactId>jaxp-api</artifactId>
        <version>1.4.2</version>
    </dependency>
    <dependency>
        <groupId>net.sourceforge.htmlcleaner</groupId>
        <artifactId>htmlcleaner</artifactId>
        <version>2.9</version>
    </dependency>
</dependencies>
2. Write the scraping code
import java.io.IOException;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.jsoup.Jsoup;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class Test {
    public static void main(String[] args)
            throws IOException, ParserConfigurationException, XPathExpressionException {
        // Page that reports the physical location of an IP, and the XPath of the target node.
        String url = "http://ip.chinaz.com/?IP=111.142.55.73";
        String exp = "//*[@id=\"leftinfo\"]/div[3]/div[2]/p[2]/span[4]";

        // Fetch the page with Jsoup. If the request fails, the IOException simply
        // propagates; catching it and continuing with a null html string (as the
        // original code did) would cause a NullPointerException in HtmlCleaner.
        String html = Jsoup.connect(url).get().body().html();

        // Clean the (possibly malformed) HTML and convert it into a W3C DOM
        // so the standard javax.xml.xpath API can query it.
        HtmlCleaner hc = new HtmlCleaner();
        TagNode tn = hc.clean(html);
        Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);

        // Evaluate the XPath expression against the DOM and print every match.
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodeList = (NodeList) xPath.evaluate(exp, dom, XPathConstants.NODESET);
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node node = nodeList.item(i);
            System.out.println(node.getNodeValue() == null ? node.getTextContent() : node.getNodeValue());
        }
    }
}
3. Output
福建省泉州市 中移铁通 (Quanzhou, Fujian Province; China Mobile Tietong)
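The XPath-evaluation step above can be tried offline with only the JDK, no network and no HtmlCleaner. The sketch below runs the same `javax.xml.xpath` calls against a small hypothetical snippet (the `leftinfo` div and its contents are invented for illustration; the real page's structure is what the long expression in the article targets):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical, already well-formed markup standing in for the cleaned page DOM.
        String xml = "<div id=\"leftinfo\"><p><span>IP</span>"
                + "<span>Location: Quanzhou</span></p></div>";
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // Select the second <span> under the element whose id attribute is "leftinfo".
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(
                "//*[@id=\"leftinfo\"]/p/span[2]", dom, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent()); // prints "Location: Quanzhou"
        }
    }
}
```

Because HtmlCleaner's `DomSerializer` produces an ordinary `org.w3c.dom.Document`, everything from `XPathFactory` onward is identical to the article's code; only the source of the DOM differs.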