1. HtmlCleaner解析HTML
编程的时候,有时数据源从html来。那就要对html分析提取数据。好在java社区里有好有相关库来解析html,经使用比较:个人觉得 HtmlCleaner 比 HtmlParser 好用。HtmlCleaner 的 xpath特好用。也可能我对HtmlParser不熟悉。
HtmlCleaner 下载地址:htmlcleaner2_1.jar;
写一个测试用的html文件:html-clean-demo.html。
<xml:lang="zh-CN" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=GBK"/>
<meta http-equiv="Content-Language" content="zh-CN"/>
<title>html clean demo</title>
<head>
<body>
<div class="d_1">
<ul>
<li>bar</li>
<li>foo</li>
<li>gzz</li>
</ul>
</div>
<div>
<ul>
<li><a name="my_href" href="1.html">text-1</a></li>
<li><a name="my_href" href="2.html">text-2</a></li>
<li><a name="my_href" href="3.html">text-3</a></li>
<li><a name="my_href" href="4.html">text-4</a></li>
</ul>
</div>
</body>
</html>
模拟需求:取出title,name="my_href" 的链接,div的class="d_1"下的所有li内容。下面用HtmlCleaner写代码,HtmlCleanerDemo.java
package com.chenlb;
import java.io.File;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
public class HtmlCleanerDemo {
public static void main(String[] args) throws Exception {
HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");
//按tag取.
Object[] ns = node.getElementsByName("title", true); //标题
if(ns.length > 0) {
System.out.println("title="+((TagNode)ns[0]).getText());
}
System.out.println("ul/li:");
//按xpath取
ns = node.evaluateXPath("//div[@class='d_1']//li");
for(Object on : ns) {
TagNode n = (TagNode) on;
System.out.println("\ttext="+n.getText());
}
System.out.println("a:");
//按属性值取
ns = node.getElementsByAttValue("name", "my_href", true, true);
for(Object on : ns) {
TagNode n = (TagNode) on;
System.out.println("\thref="+n.getAttributeByName("href")+", text="+n.getText());
}
}
}
cleaner.clean()中的参数,可以是文件,可以是url,可以是字符串内容。个人认为:比较常用的应该是evaluateXPath、getElementsByAttValue、getElementsByName方法了。另外说明下,HtmlCleaner 对不规范的html兼容性比较好。
public void visitTagNodeChildren(TagNode node){
List<Object> itemList = node.getAllChildren();
for(Object item : itemList){
if(item instanceof TagNode){
TagNode tagNode = (TagNode) item;
String nodeText = tagNode.getText().toString();
} else if (item instanceof ContentNode){
ContentNode contentNode = (ContentNode) item;
String contentText = contentNode.getContent();
} else if (item instanceof CommentNode){
CommentNode commentNode = (CommentNode) item;
String commentText = commentNode.getContent();
}
}
}
2. DOM解析XML
Here is the input XML file we need to parse:
<?xml version="1.0"?>
<class>
<student rollno="393">
<firstname>dinkar</firstname>
<lastname>kad</lastname>
<nickname>dinkar</nickname>
<marks>85</marks>
</student>
<student rollno="493">
<firstname>Vaneet</firstname>
<lastname>Gupta</lastname>
<nickname>vinni</nickname>
<marks>95</marks>
</student>
<student rollno="593">
<firstname>jasvir</firstname>
<lastname>singn</lastname>
<nickname>jazz</nickname>
<marks>90</marks>
</student>
</class>
DomParserDemo.java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
public class DomParserDemo {
public static void main(String[] args){
try {
File inputFile = new File("input.txt");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(inputFile);
doc.getDocumentElement().normalize();
System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
NodeList nList = doc.getElementsByTagName("student");
System.out.println("----------------------------");
for (int temp = 0; temp < nList.getLength(); temp++) {
Node nNode = nList.item(temp);
System.out.println("\nCurrent Element :" + nNode.getNodeName());
if (nNode.getNodeType() == Node.ELEMENT_NODE) {
Element eElement = (Element) nNode;
System.out.println("Student roll no : " + eElement.getAttribute("rollno"));
System.out.println("First Name : " + eElement.getElementsByTagName("firstname").item(0).getTextContent());
System.out.println("Last Name : " + eElement.getElementsByTagName("lastname").item(0).getTextContent());
System.out.println("Nick Name : " + eElement.getElementsByTagName("nickname").item(0).getTextContent());
System.out.println("Marks : " + eElement.getElementsByTagName("marks").item(0).getTextContent());
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
Source: https://www.tutorialspoint.com/java_xml/java_dom_parse_document.htm
3. JSON.simple解析JSON
Mapping between JSON and Java entities
JSON.simple maps entities from the left side to the right side while decoding or parsing, and maps entities from the right to the left while encoding.
JSON | Java |
---|---|
string | java.lang.String |
number | java.lang.Number |
true|false | java.lang.Boolean |
null | null |
array | java.util.List |
object | java.util.Map |
On decoding, the default concrete class of java.util.List is org.json.simple.JSONArray and the default concrete class of java.util.Map is org.json.simple.JSONObject.
Encoding JSON in Java
Following is a simple example to encode a JSON object using Java JSONObject which is a subclass of java.util.HashMap. No ordering is provided. If you need the strict ordering of elements, use JSONValue.toJSONString ( map ) method with ordered map implementation such as java.util.LinkedHashMap.
import org.json.simple.JSONObject;
class JsonEncodeDemo {
public static void main(String[] args){
JSONObject obj = new JSONObject();
obj.put("name", "foo");
obj.put("num", new Integer(100));
obj.put("balance", new Double(1000.21));
obj.put("is_vip", new Boolean(true));
System.out.print(obj);
}
}
Following is another example that shows a JSON object streaming using Java JSONObject −
import org.json.simple.JSONObject;
class JsonEncodeDemo {
public static void main(String[] args){
JSONObject obj = new JSONObject();
obj.put("name","foo");
obj.put("num",new Integer(100));
obj.put("balance",new Double(1000.21));
obj.put("is_vip",new Boolean(true));
StringWriter out = new StringWriter();
obj.writeJSONString(out);
String jsonText = out.toString();
System.out.print(jsonText);
}
}
Decoding JSON in Java
The following example makes use of JSONObject and JSONArray where JSONObject is a java.util.Map and JSONArray is a java.util.List, so you can access them with standard operations of Map or List.
import org.json.simple.JSONObject;
import org.json.simple.JSONArray;
import org.json.simple.parser.ParseException;
import org.json.simple.parser.JSONParser;
class JsonDecodeDemo {
public static void main(String[] args){
JSONParser parser = new JSONParser();
String s = "[0,{\"1\":{\"2\":{\"3\":{\"4\":[5,{\"6\":7}]}}}}]";
try{
Object obj = parser.parse(s);
JSONArray array = (JSONArray)obj;
System.out.println("The 2nd element of array");
System.out.println(array.get(1));
System.out.println();
JSONObject obj2 = (JSONObject)array.get(1);
System.out.println("Field \"1\"");
System.out.println(obj2.get("1"));
s = "{}";
obj = parser.parse(s);
System.out.println(obj);
s = "[5,]";
obj = parser.parse(s);
System.out.println(obj);
s = "[5,,2]";
obj = parser.parse(s);
System.out.println(obj);
}catch(ParseException pe){
System.out.println("position: " + pe.getPosition());
System.out.println(pe);
}
}
}
Source: https://www.tutorialspoint.com/json/json_java_example.htm