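After half a year of learning Java, this is the first small program I have written entirely on my own.
I wanted to collect specific text from web pages. I expected it to be easy, but I still fumbled with it for quite a while. The main problem was that I did not understand web page source well enough: when reading the HTMLPARSER API, I had no idea what a node, a TextNode, a NodeTag, or a TextTag was, so I had to catch up again. I read some JavaScript, the basics of HTML and CSS, and some HTMLPARSER programs written by other people. Fortunately, in the end I got a first working version.
Some short reference articles I found helpful:
Fetching foreign-exchange rates from other websites in real time with HTML PARSER: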
http://www.qqgb.com/Program/Java/JavaBlog/Program_145777.html
Using the htmlparser package to extract hyperlinks and link titles from a web page:
http://topic.csdn.net/u/20081225/09/e849af53-9686-4d17-ad39-0c6470978b68.html
And then there is the HTMLPARSER documentation itself:
http://htmlparser.sourceforge.net/
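This small program collects buyer reviews from Jingdong (360buy). The pages are downloaded to local disk with WGET and then processed. wget fetches review pages such as http://club.360buy.com/repay/165862_cb55052b-ff73-4206-b32a-ddc6241956d8_1.html, and the collected content falls into four blocks: 1. member name, member type, and address; 2. review topic, review time, and the rating given to the product; 3. the reviewer's review content and the purchase date. Note that review content comes in two forms: reviews before April 15, 2009 contain only a summary, while later ones have three parts (pros, cons, summary), so the two cases have to be handled separately; 4. for each reply to the review, the replier's member name, member type, reply time, reply content, and floor number.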
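The two review layouts mentioned in item 3 could in principle be told apart by date. The following is only a minimal sketch under assumptions: the class ReviewLayoutSketch and the method isSummaryOnly are hypothetical names, the time string is assumed to look like the "2009-04-04 13:20" value in the sample output, and treating April 15 itself as the newer layout is a guess. The program below does not work this way; it inspects the markup instead.

```java
import java.time.LocalDate;

class ReviewLayoutSketch {
    // Cut-over date taken from the text; how the day itself is treated is an assumption.
    static final LocalDate CUTOVER = LocalDate.of(2009, 4, 15);

    // Returns true when a review with this time stamp ("yyyy-MM-dd HH:mm")
    // is expected to contain only a summary.
    static boolean isSummaryOnly(String reviewTime) {
        // Keep only the date part, e.g. "2009-04-04".
        String datePart = reviewTime.trim().substring(0, 10);
        return LocalDate.parse(datePart).isBefore(CUTOVER);
    }

    public static void main(String[] args) {
        System.out.println(isSummaryOnly("2009-04-04 13:20"));
        System.out.println(isSummaryOnly("2009-05-01 08:30"));
    }
}
```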
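The split into these four blocks comes from analysing the page source. Each block sits under its own parent node in the source, which makes collection somewhat more efficient (although it is probably still slower than a regular expression that reads out all the results in a single pass).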
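For comparison, the regular-expression approach mentioned above might look like the sketch below, which pulls the review topic out of the raw HTML in one pass. The class name "topic" is the one the program matches on; the exact markup, the class RegexTopicSketch, and the method extractTopic are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTopicSketch {
    // Matches <span class="topic">...</span>; the surrounding markup is assumed.
    private static final Pattern TOPIC = Pattern.compile(
            "<span\\s+class=\"topic\"[^>]*>(.*?)</span>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the trimmed text of the first matching span, or "" if none is found.
    static String extractTopic(String html) {
        Matcher m = TOPIC.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<div class=\"re_topic\"><span class=\"topic\"> Very nice </span></div>";
        System.out.println(extractTopic(html));
    }
}
```

A pattern like this is tied to one exact spelling of the tag, which is the usual trade-off against walking the parsed node tree.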
The program code is as follows:
package donkey.tq.parsetest;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.CompositeTag;
import org.htmlparser.tags.Div;
import org.htmlparser.tags.Span;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
public class DataHtml {
public static void main(String args[]) {
// String file = "H:\\website\\REPAY\\";
// String file = "H:\\website\\club.360buy.com\\repay\\";
String file = "D:\\HTML\\";
// String file = "J:\\HTML\\";
File files = new File(file);
String[] filename = files.list();
Date time;
SimpleDateFormat sdf = new SimpleDateFormat("HH:mm:ss");
String startTime;
String endTime;
try {
FileWriter erroutput = new FileWriter("err_info_haha.txt", true);
FileWriter output = new FileWriter("result_haha.txt", true);
time = new Date(System.currentTimeMillis());
startTime = sdf.format(time);
erroutput.write("This is the err info of the test~~~~~\r\n");
output.write("开始时间:" + startTime + "\r\n\r\n");
erroutput.write("开始时间:" + startTime + "\r\n\r\n");
for (int i = 0; i < filename.length; i++) {
try {
// Three Parser objects are created here because building the nSpan
// NodeList interfered with the Parser object used before it; I have not worked out why yet
Parser parser0 = new Parser(file + filename[i]);
Parser parser1 = new Parser(file + filename[i]);
Parser parser2 = new Parser(file + filename[i]);
String filenametoprint = filename[i];
int lengthToPrint;// The number of the result to print
PrototypicalNodeFactory p = new PrototypicalNodeFactory();
p.registerTag(new Li());
parser0.setNodeFactory(p);
// Create the three tag-name filters
TagNameFilter lifilter = new TagNameFilter("li");
TagNameFilter divfilter = new TagNameFilter("div");
TagNameFilter spanfilter = new TagNameFilter("span");
// Build the filtered node lists
NodeList nLi = parser0.extractAllNodesThatMatch(lifilter);
NodeList nSpan = parser1
.extractAllNodesThatMatch(spanfilter);
NodeList nDiv = parser2.extractAllNodesThatMatch(divfilter);
// The first item is the class of a div tag; the last two are classes of
// span tags: re_topic is the div class, float_Right is the reply time,
// topic is the topic of the reply
String[] ClassArr = new String[] { "re_topic",
"float_Right", "topic" };
String[] result0 = getResult0("PR_list_l", nLi, erroutput,
filenametoprint);// member type, member name, member address
String[] result1 = getResult1(ClassArr, nDiv, nSpan,
erroutput, filenametoprint);// review topic, time, rating
String[] result2 = getResult2("re_con", nDiv, erroutput,
filenametoprint);// the review itself
String[] result3 = getResult3("re2", nDiv, erroutput,
filenametoprint);// replies to the reviewer: reply time, name, member type, reply content
String[] result = getResult(result0, result1, result2,
result3);
if (result3[1].substring(8).trim().equals("")&& result3[0].substring(3).trim().equals("")) {
lengthToPrint = result.length - result3.length;
} else {
lengthToPrint = result.length;
}
for (int j = 0; j < lengthToPrint; j++) {
output.write(result[j] + "\r\n");
}
output
.write("-------------------------------------------------------------------------------------------------------------------------------------------------\r\n");
} catch (ParserException e) {
erroutput.write("you have a parser wrong! This is name:"
+ filename[i] + "\r\n");
erroutput
.write("-------------------------------------------------------------------------------------------------------------------------------------------------\r\n");
}
}
time = new Date(System.currentTimeMillis());
endTime = sdf.format(time);
output.write("\r\n结束时间:" + endTime + "\r\n");
erroutput.write("\r\n结束时间:" + endTime + "\r\n");
output.close();
erroutput.close();
} catch (IOException e) {
System.err.println("you have a file wrong! HEAD IO");
}
}
// Merge the four String arrays into a single array for output
public static String[] getResult(String[] result0, String[] result1,
String[] result2, String[] result3) {
int result0Size = result0.length;
int result1Size = result1.length;
int result2Size = result2.length;
int result3Size = result3.length;
int resultSize = result0Size + result1Size + result2Size + result3Size;
String[] result = new String[resultSize];
for (int i = 0; i < resultSize; i++) {
if (i < result0Size) {
result[i] = result0[i];
continue;
} else if (i < result0.length + result1Size) {
result[i] = result1[i - result0Size];
continue;
} else if (i < result0.length + result1Size + result2Size) {
result[i] = result2[i - result0Size - result1Size];
continue;
} else {
result[i] = result3[i - result0Size - result1Size - result2Size];
}
}
return result;
}
// Member name, member type, member address
public static String[] getResult0(String liClass, NodeList nLi,
FileWriter fw, String filename) {
StringBuffer re_name = new StringBuffer("");
StringBuffer re_type = new StringBuffer("");
StringBuffer re_address = new StringBuffer("");
boolean ok1 = false;
boolean ok2 = false;
boolean ok3 = false;
try {
for (int i = 1; i < nLi.size(); i++) {
Li li = (Li) nLi.elementAt(i);
String classV = li.getAttribute("class");
// System.out.println(classV);
if (classV != null && liClass.equals(classV)) {
re_name = new StringBuffer(li.getAttribute("name"));// member name
ok1 = true;
re_type = new StringBuffer(li.getLastChild()
.getPreviousSibling().toPlainTextString());// member type
ok2 = true;
// Compare string contents with equals(); != only compares references
if (!"()".equals(li.getLastChild().toPlainTextString().trim()
.substring(0, 2))) {
re_address = new StringBuffer(li.getLastChild()
.toPlainTextString().substring(1, 3));// address
ok3 = true;
}
}
}
} catch (Exception e) {
try {
if (ok1 && ok2 && ok3) {
// fw.write("ok\r\n");
// fw.write("-------------------------------------------------------------------------------------------------------------------------------------------------\r\n");
} else {
if (!ok1) {
fw.write("会员名出错\r\n");
}
if (!ok2) {
fw.write("会员类型出错\r\n");
}
if (!ok3) {
fw.write("地址为空\r\n");
}
fw
.write("Result1:A wrong occour in [" + filename
+ "]\r\n");
fw
.write("-------------------------------------------------------------------------------------------------------------------------------------------------\r\n");
}
} catch (IOException eio) {
System.err.println("0-EX这都能错?");
}
}
String[] member_relative = { "会员名:" + re_name.toString(),
"会员类型:" + re_type.toString(), "地址:" + re_address.toString() };
return member_relative;
}
public static String[] getResult1(String[] ClassArray, NodeList nDiv,
NodeList nSpan, FileWriter fw, String filename) {
StringBuffer re_topic = new StringBuffer("");
StringBuffer re_time = new StringBuffer("");
StringBuffer re_grate = new StringBuffer("");
boolean ok1 = false;
boolean ok2 = false;
boolean ok3 = false;
try {
for (int i = 0; i < nSpan.size(); i++) {
Span span = (Span) nSpan.elementAt(i);
String classV = span.getAttribute("class");
if (classV != null && ClassArray[2].equals(classV)) {
re_topic = new StringBuffer(span.toPlainTextString().trim());// review topic
ok1 = true;
}
if (classV != null
&& ClassArray[1].equals(classV)
&& span.getFirstChild().toPlainTextString().trim()
.startsWith("2")) {
re_time = new StringBuffer(span.getFirstChild()
.toPlainTextString().trim());// time of the buyer's review
ok2 = true;
}
}
for (int i = 1; i < nDiv.size(); i++) {
Div div = (Div) nDiv.elementAt(i);
String classV = div.getAttribute("class");
if (classV != null && ClassArray[0].equals(classV)) {
if (null != div.getFirstChild().getNextSibling()
.getNextSibling().toHtml()) {
re_grate = new StringBuffer(div.getFirstChild()
.getNextSibling().getNextSibling().toHtml()
.substring(26, 27));// rating
ok3 = true;
}
}
}
} catch (Exception e) {
try {
if (ok1 && ok2 && ok3) {
// fw.write("ok\r\n");
// fw.write("-------------------------------------------------------------------------------------------------------------------------------------------------\r\n");
} else {
if (!ok1) {
fw.write("主题出错\r\n");
}
if (!ok2) {
fw.write("评论时间出错\r\n");
}
if (!ok3) {
fw.write("评论等级出错\r\n");
}
fw
.write("Result1:A wrong occour in [" + filename
+ "]\r\n");
fw
.write("-------------------------------------------------------------------------------------------------------------------------------------------------\r\n");
}
} catch (IOException eio) {
System.err.println("1-EX这都能错?");
}
}
String[] topic_relative = { "评论主题:" + re_topic.toString(),
"回复时间:" + re_time.toString(), "评价级别:" + re_grate.toString() };
return topic_relative;
}
// Main content of the review
public static String[] getResult2(String divClass, NodeList nDiv,
FileWriter fw, String filename) {
StringBuffer re_benefits = new StringBuffer("");
StringBuffer re_Shortcomings = new StringBuffer("");
StringBuffer re_summary = new StringBuffer("");
StringBuffer re_date = new StringBuffer("");
boolean ok1 = false;
boolean ok2 = false;
boolean ok3 = false;
boolean ok4 = false;
try {
for (int i = 1; i < nDiv.size(); i++) {
Div div = (Div) nDiv.elementAt(i);
String classV = div.getAttribute("class");
if (classV != null && divClass.equals(classV)) {
if (div.toHtml().indexOf("优点:") != -1
&& div.toHtml().substring(
div.toHtml().indexOf("优点:") - 4,
div.toHtml().indexOf("优点:")).equals("<dt>")) {
re_benefits = new StringBuffer(div.getFirstChild()
.getLastChild().toPlainTextString());// pros
ok1 = true;
re_Shortcomings = new StringBuffer(div.getFirstChild()
.getNextSibling().getLastChild()
.toPlainTextString());// cons
ok2 = true;
re_summary = new StringBuffer(div.getFirstChild()
.getNextSibling().getNextSibling()
.getLastChild().toPlainTextString());// summary
ok3 = true;
re_date = new StringBuffer(div.getLastChild()
.getLastChild().toPlainTextString().replaceAll(
" ", "").substring(5));// purchase date
ok4 = true;
} else if (div.getFirstChild().getNextSibling()
.getFirstChild().getNextSibling()
.toPlainTextString().trim().equals("总结:")) {
re_summary = new StringBuffer(div.getFirstChild()
.getNextSibling().getLastChild()
.getPreviousSibling().toPlainTextString()
.trim());
ok3 = true;
re_date = new StringBuffer(div.getLastChild()
.getLastChild().toPlainTextString().replaceAll(
" ", "").substring(5));// purchase date
ok4 = true;
break;
} else if (div.getFirstChild().getFirstChild()
.toPlainTextString().trim().equals("总结:")) {
re_summary = new StringBuffer(div.getFirstChild()
.getLastChild().toPlainTextString());
ok3 = true;
re_date = new StringBuffer(div.getLastChild()
.getLastChild().toPlainTextString().replaceAll(
" ", "").substring(5));
ok4 = true;
break;
} else if (div.getFirstChild().getLastChild()
.getPreviousSibling().toPlainTextString().equals(
"总结:")) {
re_summary = new StringBuffer(div.getFirstChild()
.getLastChild().getPreviousSibling()
.toPlainTextString());
ok3 = true;
re_date = new StringBuffer(div.getLastChild()
.getLastChild().toPlainTextString().replaceAll(
" ", "").substring(5));// purchase date
ok4 = true;
break;
}
}
}
} catch (Exception e) {
try {
if (ok1 && ok2 && ok3 && ok4) {
// fw.write("ok\r\n");
// fw.write("-------------------------------------------------------------------------------------------------------------------------------------------------\r\n");
} else {
if (!ok1) {
fw.write("优点出错\r\n");
}
if (!ok2) {
fw.write("不足出错\r\n");
}
if (!ok3) {
fw.write("总结出错\r\n");
}
if (!ok4) {
fw.write("购买日期出错\r\n");
}
fw
.write("Result2:A wrong occour in [" + filename
+ "]\r\n");
fw
.write("-------------------------------------------------------------------------------------------------------------------------------------------------\r\n");
}
} catch (IOException eio) {
System.err.println("2-EX这都能错?");
}
}
String[] topic_relativenew = { "优点:" + re_benefits.toString(),
"不足:" + re_Shortcomings.toString(),
"总结:" + re_summary.toString(), "购买日期:" + re_date.toString() };
String[] topic_relativeold = { "总结:" + re_summary.toString(),
"购买日期:" + re_date.toString() };
if (!re_benefits.toString().trim().equals("")
|| !re_Shortcomings.toString().trim().equals("")) {
return topic_relativenew;
} else {
return topic_relativeold;
}
}
// Information about the repliers
public static String[] getResult3(String divClass, NodeList nDiv,
FileWriter fw, String filename) {
StringBuffer re_reply_floor = new StringBuffer("");
StringBuffer re_reply_type = new StringBuffer("");
StringBuffer re_reply_time = new StringBuffer("");
StringBuffer re_reply_name = new StringBuffer("");
StringBuffer re_reply_detail = new StringBuffer("");
boolean ok1 = false;
boolean ok2 = false;
boolean ok3 = false;
boolean ok4 = false;
boolean ok5 = false;
try {
for (int i = 1; i < nDiv.size(); i++) {
Div div = (Div) nDiv.elementAt(i);
String classV = div.getAttribute("class");
if (classV != null
&& divClass.equals(classV)
&& null != div.getFirstChild().getNextSibling()
.toPlainTextString()) {
re_reply_floor = new StringBuffer(div.getFirstChild()
.getNextSibling().toPlainTextString());// floor number
ok1 = true;
re_reply_type = new StringBuffer(div.getLastChild()
.getFirstChild().getFirstChild().getFirstChild()
.toPlainTextString());// replier's member type
ok2 = true;
re_reply_time = new StringBuffer(div.getLastChild()
.getFirstChild().getFirstChild().getLastChild()
.toPlainTextString().replaceAll(" ", ""));// reply time
ok3 = true;
re_reply_name = new StringBuffer(div.getLastChild()
.getFirstChild().getLastChild()
.getPreviousSibling().toPlainTextString());// replier's name
ok4 = true;
re_reply_detail = new StringBuffer(div.getLastChild()
.getFirstChild().getNextSibling().getLastChild()
.toPlainTextString());// reply content
ok5 = true;
}
}
} catch (Exception e) {
try {
if (ok1 && ok2 && ok3 && ok4 && ok5) {
// fw.write("ok\r\n");
// fw.write("-------------------------------------------------------------------------------------------------------------------------------------------------\r\n");
} else {
if (!ok1) {
fw.write("楼数出错\r\n");
}
if (!ok2) {
fw.write("回复者会员类型出错\r\n");
}
if (!ok3) {
fw.write("回复时间\r\n");
}
if (!ok4) {
fw.write("回复者姓名\r\n");
}
if (!ok5) {
fw.write("回复内容出错\r\n");
}
fw
.write("Result3:A wrong occour in [" + filename
+ "]\r\n");
fw
.write("-------------------------------------------------------------------------------------------------------------------------------------------------\r\n");
}
} catch (IOException eio) {
System.err.println("3-EX这都能错?");
}
}
String[] reply_relative = { "楼数:" + re_reply_floor.toString(),
"回复者会员类型:" + re_reply_type.toString(),
"回复时间:" + re_reply_time.toString(),
"回复者姓名:" + re_reply_name.toString(),
"回复内容:" + re_reply_detail.toString() };
return reply_relative;
}
// Li tag class that is registered with the node factory
static class Li extends CompositeTag {
private static final long serialVersionUID = 1L;
private static final String mIds[] = { "li" };
private static final String mEndTagEnders[] = { "li" };
public Li() {
}
public String[] getIds() {
return mIds;
}
public String[] getEndTagEnders() {
return mEndTagEnders;
}
}
}
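The merging done by getResult can also be written with System.arraycopy. This is a sketch of an alternative, not the code used above; the class MergeSketch and the method concat are hypothetical names.

```java
import java.util.Arrays;

class MergeSketch {
    // Concatenates any number of String arrays in the order given.
    static String[] concat(String[]... parts) {
        int total = 0;
        for (String[] part : parts) {
            total += part.length;
        }
        String[] merged = new String[total];
        int offset = 0;
        for (String[] part : parts) {
            // Copy the whole part into the next free slot of the result.
            System.arraycopy(part, 0, merged, offset, part.length);
            offset += part.length;
        }
        return merged;
    }

    public static void main(String[] args) {
        String[] merged = concat(new String[] { "a", "b" }, new String[] { "c" });
        System.out.println(Arrays.toString(merged));
    }
}
```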
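The DataHtml program writes its results to result_haha.txt and saves any error information it generates to err_info_haha.txt, so that the problematic files can be checked afterwards. The errors that occurred were mainly collection failures on pages where the member address is empty, plus some other errors.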
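The program is mainly meant to produce information that can be written to a database, so everything that was originally printed directly is now returned as String arrays. After that, it is just a matter of designing the corresponding database tables and writing the data in order.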
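Before writing to a database, each returned array could be turned into label/value pairs. A minimal sketch, assuming every entry has the form label, full-width colon, value, as in the arrays the program returns; FieldMapSketch and toFields are hypothetical names, and no actual database code is shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FieldMapSketch {
    // Full-width colon (U+FF1A) used between label and value in the program's output.
    private static final char SEPARATOR = '\uFF1A';

    // Splits each entry at its first separator; entries without one are skipped.
    static Map<String, String> toFields(String[] lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int cut = line.indexOf(SEPARATOR);
            if (cut < 0) {
                continue;
            }
            fields.put(line.substring(0, cut), line.substring(cut + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] block = { "name\uFF1Ajkl052132", "rating\uFF1A5" };
        System.out.println(toFields(block));
    }
}
```

Because the same label can appear in more than one block (回复时间 is used for both the review time and the reply time), it is probably safer to convert each of the four arrays separately rather than the merged one.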
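As future improvements, simple page content could be collected with regular expressions, and multithreading could be used to improve efficiency. Ideally the program would write to the database while it collects, which is more targeted. Since this article is only an experiment, it simply processes a large number of local pages directly.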
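The multithreading idea could be sketched with a fixed thread pool from java.util.concurrent. The class ParallelSketch, the method processAll, and the worker function are assumptions: the worker stands in for whatever parses one file, and writing the results to a shared file or database would still need its own synchronisation, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelSketch {
    // Runs worker on every file name using a pool of the given size and
    // returns the results in the same order as the input list.
    static <T> List<T> processAll(List<String> fileNames,
            Function<String, T> worker, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> pending = new ArrayList<>();
            for (String name : fileNames) {
                Callable<T> task = () -> worker.apply(name);
                pending.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : pending) {
                results.add(future.get()); // waits for that task to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // The lambda is a placeholder for real per-file parsing.
        List<String> names = Arrays.asList("a.html", "b.html");
        System.out.println(processAll(names, name -> name.length(), 2));
    }
}
```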
Format of the collected output:
开始时间:14:20:22
会员名:jkl052132
会员类型:金牌会员
地址:上海
评论主题:很喜欢!
回复时间:2009-04-04 13:20
评价级别:5
总结:外观不错,手感一流,一打开包装就闻到一股皮革的味道。初次使用,放了500+RMB,八张卡,然后就显的鼓鼓的,这一点不是太好。还有如果仔细看内部边缘剪裁的话,可以发现有一些地方有些茅草,不过一影响整体性能,而且不仔细看的人很难发现。其他一切OK!总的来说我很喜欢,价格也相对实惠。PS:我下单的时候是119-20=99(用了一张优惠券)。
购买日期:2009-04-02
----------------------------------------------------------------------------------------------
结束时间:14:20:22
Error messages produced during collection:
This is the err info of the test~~~~~
开始时间:18:56:44
地址出错
Result1:A wrong occour in [106885_adc00d50-f4be-4bdc-aac6-a6f29a462ce7_1.html]
------------------------------------------------------------
地址出错
Result1:A wrong occour in [109148_0347b01e-e2a8-4e91-81b2-f94e1612a91d_1.html]
------------------------------------------------------------
地址出错
Result1:A wrong occour in [144961_3d882126-f310-4007-a719-1ea0bb192130_1.html]
-------------------------------------------------------------
地址出错
Result1:A wrong occour in [149611_2f18c17a-61e3-4800-8384-8d2b75145baa_1.html]
-------------------------------------------------------------
结束时间:19:02:05