Parsing HTML Documents: Today's Work

scu2scut

Category: XML  Tags: html, document, xhtml, import, string, xml

Article link: https://blog.csdn.net/scu2scut/article/details/631676

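Recently I decided to build a train timetable lookup system. At first I was stuck for lack of data: for things like mobile number attribution and company information, many websites treat the data as something to sell. I eventually managed to get hold of the timetable data, but it was all organized as web pages, a few thousand of them, used both to display and to store it. Pages like that can't be used directly for development, so I chose to parse them into XML files.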

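The first idea that came to mind was to process the pages with an XML parsing tool, since HTML looks like a kind of XML (not strictly accurate: only XHTML is real XML). That idea was quickly rejected: HTML syntax is not as strict as XML's, and a page still renders fine even when its tags don't fully match. With that route dead, I considered parsing the HTML files directly. Thinking it over, something like recursive descent seemed the best fit, but it would not be easy to write. Later, on a forum, I came across an open-source tool called JTidy, which converts non-conforming HTML files into XHTML. XHTML fully conforms to XML syntax, so tools built for XML can process XHTML as well. The approach finally became clear: clean up the HTML with JTidy, then use JDOM, a Java tool dedicated to processing XML, to turn the resulting XHTML into XML in the format I need. The code depends entirely on the structure of the documents being processed and is not portable.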
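
To see why a plain XML parser is a dead end for raw HTML, here is a minimal sketch that uses only the JDK's built-in parser (not JTidy or JDOM). The class name `WellFormedCheck` and both sample strings are made up for illustration: the first leaves `<br>` and `<p>` unclosed the way browsers tolerate, and the second is the kind of XHTML a cleanup tool such as JTidy aims to produce.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the markup parses as well-formed XML, false if the parser rejects it.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e); // parser setup or I/O problem
        }
    }

    public static void main(String[] args) {
        // Hypothetical samples: browser-tolerated HTML versus its XHTML equivalent
        String sloppyHtml = "<html><body><p>one<br><p>two</body></html>";
        String xhtml = "<html><body><p>one<br/></p><p>two</p></body></html>";
        System.out.println("sloppy HTML well-formed? " + isWellFormedXml(sloppyHtml));
        System.out.println("XHTML well-formed?       " + isWellFormedXml(xhtml));
    }
}
```

The parser gives up on the first snippet as soon as a closing tag does not match the innermost open element, which is why the tidy-up step has to come first.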

The source code follows:

package edu.cs.scu;

import java.io.*;
import java.util.*;
import org.w3c.tidy.Configuration;
import org.w3c.tidy.Tidy;
import org.jdom.*;
import org.jdom.input.*;
import org.jdom.output.*;

public class HTMLParser {
// Step 1: use JTidy to convert a loosely written HTML file into XHTML.
public void Html2Xml(String inputFileName, String outputFileName){
  Tidy tidy = new Tidy();
//  tidy.setCharEncoding(Configuration.ISO2022);
  tidy.setUpperCaseTags(true);
  tidy.setXHTML(true);

  // try-with-resources closes both streams even if parsing fails
  try (FileInputStream fis = new FileInputStream(inputFileName);
       FileOutputStream fos = new FileOutputStream(outputFileName)){
   tidy.parse(fis, fos);
  }catch(IOException e){
   e.printStackTrace();
  }
}

// Step 2: read the XHTML with JDOM and pull the timetable rows out by position.
// The indices below match the layout of these particular pages only.
public void parseXML(String inputFileName, int p) throws Exception{
  SAXBuilder sb = new SAXBuilder();
  // build(File) avoids an unclosed FileInputStream in this method
  Document doc = sb.build(new File(inputFileName));

  Element root = doc.getRootElement();
  List subList = root.getChildren();
  Element headNode = (Element)subList.get(0);

  List headList = headNode.getChildren();
  Element titleNode = (Element)headList.get(1);
  String title = titleNode.getText();
  Element bodyNode = (Element)subList.get(1);

  // the timetable is the fifth child of the body in these pages
  List tableList = bodyNode.getChildren();
  Element trainScheduleTableNode = (Element)tableList.get(4);

  List trList = trainScheduleTableNode.getChildren();
  Element trainElement = new Element("train");
  Document outputDocument = new Document(trainElement);
  trainElement.addContent(new Element("title").addContent(title));
  System.out.println("trList"+trList.size());
  // row 0 is skipped (presumably the header row)
  for (int i = 1; i < trList.size(); i++){
   Element itemNode = (Element)trList.get(i);
   item(itemNode, trainElement, outputDocument);
  }
  writeToFile(outputDocument, p);
}

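// Copies one table row into an item element. Cells are read by position; where the
// text sits inside a child element of the cell, the code goes one level deeper.
public void item(Element itemNode, Element outputRootElement, Document outputDocument){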
  String trainID, startStation, startTime, currentStation, timeToCurrentStation, startTimeFromCurrentStation,
   endStation, endTime, totalKilometers;
  List tdList = itemNode.getChildren();
  trainID = (((Element)((Element)tdList.get(1)).getChildren().get(0))).getText();
  startStation = (((Element)((Element)tdList.get(2)).getChildren().get(0))).getText();
  startTime = ((Element)tdList.get(4)).getText();
  currentStation = (((Element)((Element)tdList.get(5)).getChildren().get(0))).getText();
  timeToCurrentStation = ((Element)tdList.get(6)).getText();
  startTimeFromCurrentStation = ((Element)tdList.get(7)).getText();
  endStation = (((Element)((Element)tdList.get(8)).getChildren().get(0))).getText();
  endTime = ((Element)tdList.get(10)).getText();
  totalKilometers = ((Element)tdList.get(11)).getText();

  Element item = new Element("item");
  item.addContent(new Element("trainID").addContent(trainID));
  item.addContent(new Element("startStation").addContent(startStation));
  item.addContent(new Element("startTime").addContent(startTime));
  item.addContent(new Element("currentStation").addContent(currentStation));
  item.addContent(new Element("timeToCurrentStation").addContent(timeToCurrentStation));
  item.addContent(new Element("startTimeFromCurrentStation").addContent(startTimeFromCurrentStation));
  item.addContent(new Element("endStation").addContent(endStation));
  item.addContent(new Element("endTime").addContent(endTime));
  item.addContent(new Element("totalKilometers").addContent(totalKilometers));

  outputRootElement.addContent(item);
}

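// Step 3: write the result document to G:/train/, one file per input page.
public void writeToFile(Document document, int p){
  Format format = Format.getPrettyFormat();
  // this encoding caused display problems in the browser, as noted after the listing
  format.setEncoding("iso_8859_1");
  XMLOutputter outputter = new XMLOutputter();
  outputter.setFormat(format);
  // try-with-resources makes sure the output file is closed
  try (FileOutputStream out = new FileOutputStream("G:/train/Train"+p+".xml")){
   outputter.output(document, out);
  }catch (IOException e){
   e.printStackTrace();
  }
}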

// Converts pages 1.html through 100.html: tidy each into XHTML, then extract the timetable.
public static void main(String[] args)throws Exception {
  for (int i = 1; i <= 100; i++){
   String inputFile = "D:/Resources/Others/train schedule/"+i+".html";
   String outputXhtmlFile = "G:/tempxml/"+i+".xml";
   HTMLParser parser = new HTMLParser();

   parser.Html2Xml(inputFile, outputXhtmlFile);
   parser.parseXML(outputXhtmlFile, i);
  }
}
}
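
Since the post does not include a sample page, here is a small stand-in for the positional extraction done in `item`, written against the JDK's built-in DOM API rather than JDOM so that it runs without extra jars. The table, its three columns, its values, and the names `RowExtractSketch` and `extractRows` are all hypothetical; the real pages have more columns and a different layout.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class RowExtractSketch {
    // Returns the trimmed cell texts of every tr after the first (assumed to be a header row).
    static List<List<String>> extractRows(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList trs = doc.getElementsByTagName("tr");
            List<List<String>> rows = new ArrayList<>();
            for (int i = 1; i < trs.getLength(); i++) {
                NodeList tds = ((Element) trs.item(i)).getElementsByTagName("td");
                List<String> cells = new ArrayList<>();
                for (int j = 0; j < tds.getLength(); j++) {
                    // getTextContent also collects text nested inside child elements such as a link
                    cells.add(tds.item(j).getTextContent().trim());
                }
                rows.add(cells);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical miniature timetable: one header row and one data row
        String xhtml = "<table>"
                + "<tr><td>Train</td><td>From</td><td>Departs</td></tr>"
                + "<tr><td><a href=\"#\">T8</a></td><td>Chengdu</td><td>09:30</td></tr>"
                + "</table>";
        System.out.println(extractRows(xhtml));
    }
}
```

Like the listing above, this only works as long as the column order never changes, which is the price of reading cells by index.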

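A few problems I ran into:
1. After obtaining an Element, calling getChild(name) never returned the matching child node, whereas get(index) on the List of children did.
2. Chinese text display: the intermediate XHTML file shows up as garbled text. For the final XML file, if the code explicitly sets the encoding to "iso_8859_1", the file displays correctly in Notepad, but the browser does not support that encoding.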
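
On the first problem, the post does not diagnose the cause, so treat this as a likely explanation rather than a confirmed one: XHTML output normally declares `xmlns="http://www.w3.org/1999/xhtml"` on the root element, and JDOM's `getChild(String)` only matches children that belong to no namespace, while the `getChild(String, Namespace)` overload takes the namespace into account. Index-based access ignores names entirely, which would explain why it worked. Tag-name case is worth checking too, since XML names are case-sensitive. The sketch below shows the same effect with the JDK's namespace-aware DOM API standing in for JDOM; the class name `NamespaceLookupSketch` and the sample document are made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class NamespaceLookupSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Counts title elements that belong to the given namespace (null means "no namespace").
    static int countTitles(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagNameNS(namespaceUri, "title").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical XHTML document with the usual default namespace on the root
        String xhtml = "<html xmlns=\"" + XHTML_NS + "\">"
                + "<head><title>Timetable</title></head><body/></html>";
        System.out.println("lookup without namespace: " + countTitles(xhtml, null));
        System.out.println("lookup with XHTML namespace: " + countTitles(xhtml, XHTML_NS));
    }
}
```

If the namespace is indeed the cause, passing the XHTML namespace to the two-argument `getChild` in the JDOM code would let the lookups go by name instead of by index.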

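On the second problem, one plausible reading (an assumption, since the post does not say how the source pages were encoded) is that the Chinese bytes were carried through the pipeline as ISO-8859-1 characters. ISO-8859-1 maps every byte to exactly one character, so the original bytes survive a decode and re-encode unchanged; an editor that guesses the real encoding then shows the text correctly, while a program that trusts the declared "iso_8859_1" shows garbage, because ISO-8859-1 itself contains no Chinese characters. The sketch below demonstrates both points with plain JDK classes. `EncodingSketch` and the sample string are illustrative, and UTF-8 stands in for whatever encoding the pages really used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingSketch {
    // True if ISO-8859-1 can represent every character of the text directly.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    // Decodes bytes as ISO-8859-1 and encodes them again; the result equals the input bytes.
    static byte[] latin1RoundTrip(byte[] original) {
        String asLatin1 = new String(original, StandardCharsets.ISO_8859_1);
        return asLatin1.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Beijing" in Chinese, written as escapes so this source file stays ASCII
        String station = "\u5317\u4eac";
        byte[] original = station.getBytes(StandardCharsets.UTF_8);
        byte[] roundTripped = latin1RoundTrip(original);

        System.out.println("fits ISO-8859-1: " + fitsLatin1(station));
        System.out.println("bytes unchanged by round trip: " + Arrays.equals(original, roundTripped));
        System.out.println("text recovered with the real encoding: "
                + station.equals(new String(roundTripped, StandardCharsets.UTF_8)));
    }
}
```

Under that reading, a more robust setup would tell JTidy the pages' real encoding and write the final XML in an encoding that can hold Chinese text, such as UTF-8.
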
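In summary: I used to undervalue open-source tools, or rather I was too lazy to try them. This time I relied entirely on JTidy and JDOM, two powerful tools; without them I really would not have known where to start. From now on I should browse sites like SourceForge more often, to discover more and learn more.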
