1 Parsing HTML with HtmlCleaner. 2 Parsing XML with DOM. 3 Parsing JSON with JSON.simple

1. Parsing HTML with HtmlCleaner

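    When programming, the data source is sometimes an HTML page, so the HTML has to be parsed to extract the data. Fortunately, the Java community has several libraries for parsing HTML. Having tried and compared them, I personally find HtmlCleaner easier to use than HtmlParser; HtmlCleaner's XPath support is especially handy. It may also be that I am simply not familiar enough with HtmlParser.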

HtmlCleaner download: htmlcleaner2_1.jar; source download: htmlcleaner2_1-all.zip

    Write an HTML file for testing: html-clean-demo.html.

<html xml:lang="zh-CN" dir="ltr">    
<head>    
    <meta http-equiv="Content-Type" content="text/html; charset=GBK"/>    
    <meta http-equiv="Content-Language" content="zh-CN"/>    
    <title>html clean demo</title>    
</head>    
<body>    
<div class="d_1">    
    <ul>    
        <li>bar</li>    
        <li>foo</li>    
        <li>gzz</li>    
    </ul>    
</div>    
<div>    
    <ul>    
        <li><a name="my_href" href="1.html">text-1</a></li>    
        <li><a name="my_href" href="2.html">text-2</a></li>    
        <li><a name="my_href" href="3.html">text-3</a></li>    
        <li><a name="my_href" href="4.html">text-4</a></li>    
    </ul>    
</div>    
</body>    
</html> 

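    Simulated requirement: extract the title, the links with name="my_href", and the content of every li under the div with class="d_1". The code below does this with HtmlCleaner, in HtmlCleanerDemo.java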


package com.chenlb;  
  
import java.io.File;  
import org.htmlcleaner.HtmlCleaner;  
import org.htmlcleaner.TagNode;
  
public class HtmlCleanerDemo {  
    public static void main(String[] args) throws Exception {  
        HtmlCleaner cleaner = new HtmlCleaner();  
  
        TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");  
        // Select by tag name.  
        Object[] ns = node.getElementsByName("title", true);    // the title  
  
        if(ns.length > 0) {  
            System.out.println("title="+((TagNode)ns[0]).getText());  
        }  
        System.out.println("ul/li:");  
        // Select by XPath  
        ns = node.evaluateXPath("//div[@class='d_1']//li");  
        for(Object on : ns) {  
            TagNode n = (TagNode) on;  
            System.out.println("\ttext="+n.getText());  
        }  
        System.out.println("a:");  
        // Select by attribute value  
        ns = node.getElementsByAttValue("name", "my_href", true, true);  
        for(Object on : ns) {  
            TagNode n = (TagNode) on;  
            System.out.println("\thref="+n.getAttributeByName("href")+", text="+n.getText());  
        }  
    }  
}

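The argument to cleaner.clean() can be a file, a URL, or a string of HTML content. In my opinion the most commonly used methods are evaluateXPath, getElementsByAttValue, and getElementsByName. Also worth noting: HtmlCleaner copes well with malformed HTML.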

    
    The children of a TagNode are generally of three types: ContentNode, TagNode, and CommentNode. The following function visits all three kinds of child node.

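import java.util.List;
import org.htmlcleaner.CommentNode;
import org.htmlcleaner.ContentNode;
import org.htmlcleaner.TagNode;

public void visitTagNodeChildren(TagNode node){
	// List<?> compiles whatever element type getAllChildren() declares.
	List<?> itemList = node.getAllChildren();
		
	for(Object item : itemList){
		if(item instanceof TagNode){
			// Nested element: getText() gathers the text of its whole subtree.
			TagNode tagNode = (TagNode) item;
			String nodeText = tagNode.getText().toString();
		} else if (item instanceof ContentNode){
			// Plain text directly inside the node.
			ContentNode contentNode = (ContentNode) item;
			String contentText = contentNode.getContent();
		} else if (item instanceof CommentNode){
			// An HTML comment.
			CommentNode commentNode = (CommentNode) item;
			String commentText = commentNode.getContent();
		}
	}
}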


2. Parsing XML with DOM


Here is the input XML file we need to parse:


<?xml version="1.0"?>  
<class>  
   <student rollno="393">  
      <firstname>dinkar</firstname>  
      <lastname>kad</lastname>  
      <nickname>dinkar</nickname>  
      <marks>85</marks>  
   </student>  
   <student rollno="493">  
      <firstname>Vaneet</firstname>  
      <lastname>Gupta</lastname>  
      <nickname>vinni</nickname>  
      <marks>95</marks>  
   </student>  
   <student rollno="593">  
      <firstname>jasvir</firstname>  
      <lastname>singn</lastname>  
      <nickname>jazz</nickname>  
      <marks>90</marks>  
   </student>  
</class>

DomParserDemo.java

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;

public class DomParserDemo {
   public static void main(String[] args){

      try {	
         File inputFile = new File("input.txt");
         DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
         DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
         Document doc = dBuilder.parse(inputFile);
         doc.getDocumentElement().normalize();
         System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
         NodeList nList = doc.getElementsByTagName("student");
         System.out.println("----------------------------");
         for (int temp = 0; temp < nList.getLength(); temp++) {
            Node nNode = nList.item(temp);
            System.out.println("\nCurrent Element :" + nNode.getNodeName());
            if (nNode.getNodeType() == Node.ELEMENT_NODE) {
               Element eElement = (Element) nNode;
               System.out.println("Student roll no : " + eElement.getAttribute("rollno"));
               System.out.println("First Name : " + eElement.getElementsByTagName("firstname").item(0).getTextContent());
               System.out.println("Last Name : " + eElement.getElementsByTagName("lastname").item(0).getTextContent());
               System.out.println("Nick Name : " + eElement.getElementsByTagName("nickname").item(0).getTextContent());
               System.out.println("Marks : " + eElement.getElementsByTagName("marks").item(0).getTextContent());
            }
         }
      } catch (Exception e) {
         e.printStackTrace();
      }
   }
}
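
As a side note that is not part of the tutorial listing: the demo calls normalize() on the root element without saying why. normalize() merges adjacent text nodes (and drops empty ones) so that text content comes back in one piece. A minimal sketch of the effect, assuming a hand-built document rather than the student file; the class name NormalizeSketch and the element name note are made up for illustration:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NormalizeSketch {
    // Builds <note> with two adjacent text nodes and returns
    // { child count before normalize(), child count after normalize() }.
    static int[] childCountsAroundNormalize() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
        Element note = doc.createElement("note"); // hypothetical element name
        doc.appendChild(note);
        note.appendChild(doc.createTextNode("Hello, "));
        note.appendChild(doc.createTextNode("world"));

        int before = note.getChildNodes().getLength(); // two separate Text nodes
        note.normalize();                              // merge adjacent Text nodes
        int after = note.getChildNodes().getLength();  // a single Text node
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] counts = childCountsAroundNormalize();
        System.out.println("before=" + counts[0] + ", after=" + counts[1]);
    }
}
```

For a document that was just parsed from a file the call is often a no-op, but it is cheap and makes later getTextContent() or child-node loops more predictable.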



Source: https://www.tutorialspoint.com/java_xml/java_dom_parse_document.htm
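
Section 1 relied on XPath to pick nodes out of HTML; the same idea works on a DOM Document through the JDK's javax.xml.xpath package. The following is only a sketch, not taken from the tutorial: it inlines a trimmed copy of the student XML so it runs without input.txt, and the class name DomXPathSketch and method firstNamesWithMarksAtLeast are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class DomXPathSketch {
    // Trimmed copy of the student file, inlined so no input file is needed.
    static final String XML =
          "<class>"
        + "<student rollno=\"393\"><firstname>dinkar</firstname><marks>85</marks></student>"
        + "<student rollno=\"493\"><firstname>Vaneet</firstname><marks>95</marks></student>"
        + "<student rollno=\"593\"><firstname>jasvir</firstname><marks>90</marks></student>"
        + "</class>";

    // Returns the first names of students whose marks are at least minMarks.
    static List<String> firstNamesWithMarksAtLeast(int minMarks) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The predicate compares the numeric value of <marks> with minMarks.
            NodeList hits = (NodeList) xpath.evaluate(
                    "/class/student[marks >= " + minMarks + "]/firstname",
                    doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                names.add(hits.item(i).getTextContent());
            }
            return names;
        } catch (Exception e) {
            // Not expected for the hard-coded input; wrapped to keep the sketch short.
            throw new IllegalStateException("Parsing or XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstNamesWithMarksAtLeast(90)); // prints [Vaneet, jasvir]
    }
}
```

Compared with the nested getElementsByTagName calls in the demo, a single XPath expression states the selection in one place, at the cost of compiling and evaluating the expression at run time.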

3. Parsing JSON with JSON.simple


Mapping between JSON and Java entities

JSON.simple maps entities from the left side to the right side while decoding or parsing, and maps entities from the right to the left while encoding.

JSON          Java
string        java.lang.String
number        java.lang.Number
true|false    java.lang.Boolean
null          null
array         java.util.List
object        java.util.Map

On decoding, the default concrete class of java.util.List is org.json.simple.JSONArray and the default concrete class of java.util.Map is org.json.simple.JSONObject.

Encoding JSON in Java

The following simple example encodes a JSON object using JSONObject, which is a subclass of java.util.HashMap, so no key ordering is guaranteed. If you need strict ordering of elements, call JSONValue.toJSONString(map) with an ordered map implementation such as java.util.LinkedHashMap.

import org.json.simple.JSONObject;

class JsonEncodeDemo {
   public static void main(String[] args){
      JSONObject obj = new JSONObject();

      obj.put("name", "foo");
      obj.put("num", Integer.valueOf(100));
      obj.put("balance", Double.valueOf(1000.21));
      obj.put("is_vip", Boolean.TRUE);

      System.out.print(obj);
   }
}
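
The ordering remark above comes down to which Map implementation holds the entries. The sketch below deliberately does not call JSON.simple, so it runs with the JDK alone: toJson is a hypothetical, very limited stand-in for JSONValue.toJSONString(map), and the class name OrderedJsonSketch is invented. It only shows that a LinkedHashMap hands its entries to an encoder in insertion order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class OrderedJsonSketch {
    // Hypothetical stand-in for JSONValue.toJSONString(map). It handles only
    // String, Number and Boolean values and escapes only quotes and backslashes.
    static String toJson(Map<String, Object> map) {
        StringJoiner out = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Object> e : map.entrySet()) {
            Object v = e.getValue();
            String value = (v instanceof String) ? quote((String) v) : String.valueOf(v);
            out.add(quote(e.getKey()) + ":" + value);
        }
        return out.toString();
    }

    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Same four fields as the encoding example, inserted in a fixed order.
    static Map<String, Object> sample() {
        Map<String, Object> m = new LinkedHashMap<>(); // keeps insertion order
        m.put("name", "foo");
        m.put("num", 100);
        m.put("balance", 1000.21);
        m.put("is_vip", true);
        return m;
    }

    public static void main(String[] args) {
        // prints {"name":"foo","num":100,"balance":1000.21,"is_vip":true}
        System.out.println(toJson(sample()));
    }
}
```

With a plain HashMap (and therefore with JSONObject) the iteration order depends on the keys' hash codes, so the fields may come out in a different order than they were put in.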


The next example streams the same JSON object to a java.io.Writer using JSONObject.writeJSONString:

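import java.io.IOException;
import java.io.StringWriter;
import org.json.simple.JSONObject;

class JsonEncodeDemo {
   // writeJSONString declares IOException, so main has to declare it too.
   public static void main(String[] args) throws IOException {
      JSONObject obj = new JSONObject();

      obj.put("name","foo");
      obj.put("num",Integer.valueOf(100));
      obj.put("balance",Double.valueOf(1000.21));
      obj.put("is_vip",Boolean.TRUE);

      // Stream the JSON text into a Writer instead of building it via toString().
      StringWriter out = new StringWriter();
      obj.writeJSONString(out);
      
      String jsonText = out.toString();
      System.out.print(jsonText);
   }
}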


Decoding JSON in Java

The following example makes use of JSONObject and JSONArray where JSONObject is a java.util.Map and JSONArray is a java.util.List, so you can access them with standard operations of Map or List.

import org.json.simple.JSONObject;
import org.json.simple.JSONArray;
import org.json.simple.parser.ParseException;
import org.json.simple.parser.JSONParser;

class JsonDecodeDemo {
   public static void main(String[] args){
      JSONParser parser = new JSONParser();
      String s = "[0,{\"1\":{\"2\":{\"3\":{\"4\":[5,{\"6\":7}]}}}}]";
		
      try{
         Object obj = parser.parse(s);
         JSONArray array = (JSONArray)obj;
			
         System.out.println("The 2nd element of array");
         System.out.println(array.get(1));
         System.out.println();

         JSONObject obj2 = (JSONObject)array.get(1);
         System.out.println("Field \"1\"");
         System.out.println(obj2.get("1"));    

         s = "{}";
         obj = parser.parse(s);
         System.out.println(obj);

         s = "[5,]";
         obj = parser.parse(s);
         System.out.println(obj);

         s = "[5,,2]";
         obj = parser.parse(s);
         System.out.println(obj);
      }catch(ParseException pe){
         System.out.println("position: " + pe.getPosition());
         System.out.println(pe);
      }
   }
}


Source: https://www.tutorialspoint.com/json/json_java_example.htm
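
Because a decoded JSONObject is a java.util.Map and a decoded JSONArray is a java.util.List, code written against those two interfaces can walk a parsed result without naming any JSON.simple class. The sketch below is an assumption-laden illustration, not part of the tutorial: it builds by hand the same nested shape as the string [0,{"1":{"2":{"3":{"4":[5,{"6":7}]}}}}] (using Long for numbers) instead of calling the parser, so it runs with the JDK alone. The names JsonPathWalkSketch, walk, and sample are invented.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class JsonPathWalkSketch {
    // Follows a path made of String keys (applied to a Map) and
    // Integer indexes (applied to a List) and returns what it finds.
    static Object walk(Object root, Object... path) {
        Object cur = root;
        for (Object step : path) {
            if (step instanceof Integer && cur instanceof List) {
                cur = ((List<?>) cur).get((Integer) step);
            } else if (step instanceof String && cur instanceof Map) {
                cur = ((Map<?, ?>) cur).get(step);
            } else {
                throw new IllegalArgumentException("Cannot apply step " + step + " to " + cur);
            }
        }
        return cur;
    }

    // Hand-built equivalent of [0,{"1":{"2":{"3":{"4":[5,{"6":7}]}}}}].
    static Object sample() {
        Map<String, Object> six = Collections.singletonMap("6", 7L);
        List<Object> four = Arrays.asList(5L, six);
        Map<String, Object> three = Collections.singletonMap("4", four);
        Map<String, Object> two = Collections.singletonMap("3", three);
        Map<String, Object> one = Collections.singletonMap("2", two);
        Map<String, Object> top = Collections.singletonMap("1", one);
        return Arrays.asList(0L, top);
    }

    public static void main(String[] args) {
        // Index 1 of the array, then keys "1".."4", then index 1, then key "6".
        System.out.println(walk(sample(), 1, "1", "2", "3", "4", 1, "6")); // prints 7
    }
}
```

The same walk method should accept the object returned by parser.parse(s) in the decoding demo, since it only relies on the Map and List interfaces.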
