Using XPath contains() on HTML in Java

佯良

Article link: https://blog.csdn.net/weixin_29765215/article/details/114057994


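This article shows how to use HTMLCleaner to clean up malformed HTML, convert the result to a DOM Document, and then query it with the standard XPath library. The key point is to avoid operating on strings and instead run XPath queries through Java's DOM interfaces and JAXP.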


Regarding this:

I could use HTML Cleaner to clean to XML, serialize it back to a string, and use that with another XPath library, but I can’t find a good java XPath evaluator that works on a string.

That is exactly what I would do, except that you don't need to operate on a string (see below).

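Many HTML parsers try to do too much. HTMLCleaner, for example, does not correctly or completely implement the XPath 1.0 specification (contains(), for instance, is an XPath 1.0 function). The good news is that you don't need it to. All you need from HTMLCleaner is its ability to parse malformed input. Once that is done, it is better to process the resulting (now well-formed) document through the standard XML interfaces.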
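As a quick check that contains() is available in the JDK's own XPath engine (javax.xml.xpath), independent of any HTML parser, the following minimal sketch evaluates two expressions against an empty document. The class name ContainsDemo and the helper evalBoolean are hypothetical names chosen for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class ContainsDemo {
    // Hypothetical helper: evaluates an XPath 1.0 expression to a boolean
    // using only the XPath engine that ships with the JDK.
    static boolean evalBoolean(String expression) {
        try {
            // An empty DOM document is enough as evaluation context,
            // because the expressions below do not select any nodes.
            Document empty = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Boolean) xpath.evaluate(expression, empty, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evalBoolean("contains('1234 foo 5678', 'foo')")); // true
        System.out.println(evalBoolean("contains('1234 foo 5678', 'bar')")); // false
    }
}
```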

First, convert the document into a standard org.w3c.dom.Document, like this:

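// Requires HtmlCleaner on the classpath (package org.htmlcleaner).
// `html` is a placeholder for the malformed HTML input string;
// the argument passed in the original is missing from the source.
TagNode tagNode = new HtmlCleaner().clean(html);
org.w3c.dom.Document doc = new DomSerializer(
        new CleanerProperties()).createDOM(tagNode);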

Then query it through the standard JAXP interfaces:

// Classes from javax.xml.xpath: XPath, XPathFactory, XPathConstants.
XPath xpath = XPathFactory.newInstance().newXPath();
String str = (String) xpath.evaluate("//div//td[contains(@id, 'foo')]/text()",
        doc, XPathConstants.STRING);
System.out.println(str);

Output:

Hello
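
The JAXP half of this approach can be tried without HTMLCleaner. The sketch below is only an illustration under stated assumptions: the markup in SAMPLE is hypothetical and already well-formed, so the JDK's own DocumentBuilder stands in for HTMLCleaner, and the class name XPathOnDom and the helper matchingCells are made up for this example. It also requests XPathConstants.NODESET instead of XPathConstants.STRING, since a STRING result gives the string value of the first matching node only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathOnDom {
    // Hypothetical sample input. It is well-formed on purpose so that the
    // JDK parser can build the DOM; real-world HTML would go through HTMLCleaner.
    static final String SAMPLE =
            "<div><table><tr>"
            + "<td id='1234 foo 5678'>Hello</td>"
            + "<td id='bar'>Skipped</td>"
            + "<td id='foo-2'>World</td>"
            + "</tr></table></div>";

    // Returns the text of every td whose id attribute contains "foo".
    static List<String> matchingCells(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // NODESET yields all matches in document order.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//div//td[contains(@id, 'foo')]", doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matchingCells(SAMPLE)); // prints [Hello, World]
    }
}
```

Because the query runs against an org.w3c.dom.Document, the same matchingCells logic would apply unchanged to a document produced by DomSerializer.createDOM.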