Converting an HTML Page to a DOM in Java

Original article: https://blog.csdn.net/kydkong/article/details/50640302

Required libraries:

nekohtml.jar

http://pan.baidu.com/s/1sk1PhNZ

Dependencies of nekohtml

http://pan.baidu.com/s/1eRyhN7W
http://pan.baidu.com/s/1jHcuZTO
http://pan.baidu.com/s/1mhhGFES
http://pan.baidu.com/s/1kTURxSN

Code example:

package service;

import java.io.IOException;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class HTML2DomTest {

    public static void main(String[] args) throws SAXException, IOException {
        // Create a parser
        DOMParser parser = new DOMParser();
        // Parse the HTML file (the path is resolved against the working directory)
        parser.parse("html/test2.html");
        // Get the DOM tree built by the parser
        Document document = parser.getDocument();

        // Look up all <a> elements with getElementsByTagName
        NodeList nodeList = document.getElementsByTagName("a");
        for (int i = 0; i < nodeList.getLength(); i++) {
            Element e = (Element) nodeList.item(i);
            // Print the link target, a tab, then the link text
            System.out.print(e.getAttribute("href") + "\t");
            System.out.println(e.getTextContent());
        }
    }
}

Test input:

html/test2.html

    <html>  
    <head><title>test2</title></head>  
    <body>  
      
    <h1>Page Title</h1>  
      
    <!-- Table -->  
    <table>  
    <tr>  
      <td>a1</td>  <td>a2</td>  <td>a3</td>  
    </tr>  
    <tr>  
      <td>b1</td>  <td>b2</td>  <td>b3</td>  
    </tr>  
    <tr>  
      <td>c1</td>  <td>c2</td>  <td>c3</td>  
    </tr>  
    </table>  
      
    <!-- Link -->  
    <a href="http://www.aaa.com/">aaa</a>  
    <a href="http://www.bbb.com/">bbb</a>  
    <a href="http://www.ccc.com/">ccc</a>  
      
    </body>  
    </html>

Output:

http://www.aaa.com/	aaa
http://www.bbb.com/	bbb
http://www.ccc.com/	ccc
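
The same org.w3c.dom calls can also read the table in test2.html. Below is a minimal sketch, not part of the original example. The class name TableCellsSketch and its helpers are made up for illustration. So that it runs without the nekohtml jars, it parses a small well-formed snippet with the JDK's built-in XML parser as a stand-in; with nekohtml you would pass the Document from parser.getDocument() to readTable instead. Tag names are compared ignoring case, on the assumption that the HTML parser may report element names in a different case than the source file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
public class TableCellsSketch {

    // Collect every descendant element of 'root' whose tag name matches 'name', ignoring case.
    public static void collect(Node root, String name, List<Node> out) {
        for (Node c = root.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                if (c.getNodeName().equalsIgnoreCase(name)) {
                    out.add(c);
                }
                collect(c, name, out);
            }
        }
    }

    // Return the trimmed text of each <td>, grouped by the <tr> that contains it.
    public static List<List<String>> readTable(Document doc) {
        List<Node> trs = new ArrayList<>();
        collect(doc, "tr", trs);
        List<List<String>> rows = new ArrayList<>();
        for (Node tr : trs) {
            List<Node> tds = new ArrayList<>();
            collect(tr, "td", tds);
            List<String> row = new ArrayList<>();
            for (Node td : tds) {
                row.add(td.getTextContent().trim());
            }
            rows.add(row);
        }
        return rows;
    }

    // Stand-in parser: the JDK XML parser, which only accepts well-formed markup.
    public static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXml("<html><body><table>"
                + "<tr><td>a1</td><td>a2</td><td>a3</td></tr>"
                + "<tr><td>b1</td><td>b2</td><td>b3</td></tr>"
                + "</table></body></html>");
        // Print one table row per line, cells separated by tabs
        for (List<String> row : readTable(doc)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

The walk is recursive, so it still finds the rows if the parser inserts extra wrapper elements between the table and its rows.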

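Additional note:

I ran into one problem when using this: for one div tag in the DOM that held a large amount of text, I could not get its content.

My guess is that nekohtml's automatic completion of the HTML (closing and balancing tags) scrambled the structure.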
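
One way to check a guess like this is to print the tree the parser actually built and see where the text of the div ended up. The following is only a sketch under assumptions: the class DomDumpSketch and its methods are hypothetical names, and the demo in main uses the JDK's built-in XML parser on a well-formed snippet so that it runs without the nekohtml jars. To inspect the real page, pass the Document returned by parser.getDocument() to dump instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical debugging helper, for illustration only.
public class DomDumpSketch {

    // Return an indented outline: one line per element and per non-blank text node.
    public static String dump(Node root) {
        StringBuilder out = new StringBuilder();
        dump(root, 0, out);
        return out.toString();
    }

    private static void dump(Node node, int depth, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.ELEMENT_NODE) {
            indent(depth, out).append('<').append(node.getNodeName()).append(">\n");
        } else if (type == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                // Show the text length plus a short preview, so large blocks stay readable
                String preview = text.length() > 30 ? text.substring(0, 30) + "..." : text;
                indent(depth, out).append("text[").append(text.length()).append("]: ")
                        .append(preview).append('\n');
            }
        }
        // Only elements add a level of indentation; the document node itself does not
        int childDepth = (type == Node.ELEMENT_NODE) ? depth + 1 : depth;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            dump(c, childDepth, out);
        }
    }

    private static StringBuilder indent(int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        return out;
    }

    // Stand-in parser: the JDK XML parser, which only accepts well-formed markup.
    public static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(dump(parseXml("<div><p>hello</p>tail</div>")));
    }
}
```

If the outline shows the long text sitting next to the div rather than inside it, that would support the idea that the parser closed the div earlier than the source HTML intended.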