Extracting Web Page Information with HtmlParser [Original]

1.1 Overview

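Development work often calls for pulling content out of existing web pages. One workable approach is to fetch the page first, then parse its content and lay it out again as needed.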

1.2 Resources

   1)  JDK 1.5.06

   2)  HTMLParser 2.0

Download: http://sourceforge.net/project/showfiles.php?group_id=24399&package_id=47712

2 Fetching and Parsing Web Page Content

2.1 Using HTTPLook

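    Before imitating IE, we need to know three things about each request: its headers, its method, and the content it sends. We also need the headers and body of each response. Only then can we send a complete and correct request. A small tool called HTTPLook handles this step by monitoring requests and responses. HTTPLook is freely available online; I downloaded it from http://www.crsky.com/soft/3786.html. After downloading, double-click the installer and click Next through every step. The main window of the program is shown below.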

      

  Before you start browsing, click the green arrow shown in the screenshot above to begin capturing.
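The header information that HTTPLook displays can also be checked from the Java side once a connection exists. The sketch below is an illustration and not part of the original program: `HeaderDump` and `format` are hypothetical names, and the sample values are made up. It assumes a map of the kind returned by `HttpURLConnection.getHeaderFields()`, in which the status line is stored under a `null` key.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeaderDump {

    /** Formats a header map one header per line, status line first if present. */
    static String format(Map<String, List<String>> headers) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            for (String value : e.getValue()) {
                if (e.getKey() == null) {
                    sb.append(value); // the status line has no header name
                } else {
                    sb.append(e.getKey()).append(": ").append(value);
                }
                sb.append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a real response.
        Map<String, List<String>> sample = new LinkedHashMap<String, List<String>>();
        sample.put(null, Arrays.asList("HTTP/1.1 200 OK"));
        sample.put("Set-Cookie", Arrays.asList("JSESSIONID=abc123; Path=/"));
        System.out.print(format(sample));
    }
}
```

With a live connection, the call would be `HeaderDump.format(httpConn.getHeaderFields())`, which makes it easy to compare what Java receives with what HTTPLook captured from the browser.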

2.2 Fetching the Page Content

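We can use the HttpURLConnection and URL classes from the java.net package to build and send the request and to fetch the page. The main steps are:

1) Set the request body to send when the POST method is used

2) Set the request URL

3) Open the connection

4) Get the cookie. This step is optional; it is needed only when the request captured by HTTPLook contains a cookie.

5) Set the request headers

6) Send the request

7) Read the HTML page from the response

8) Close the connection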

 

The steps are implemented as follows.

First, define two global variables:

HttpURLConnection httpConn = null;

URL url =  null;

The methods used are listed below so that they can be called and reused. The code is as follows:

 

 

/**
 * Main method: runs the page download and parsing logic.
 *
 * @param idCard
 *            ID card number
 * @param passwordStr
 *            personal password
 *
 * @return the parsed result as a string
 */
public String downLoadPages(String idCard, String passwordStr) {
    String result = "";
    try {
        String s = "";      // request body sent with the POST; set as required
        String urlStr = ""; // request URL; set as required

        openHttpURLConnection(urlStr, "GET");     // open the connection
        String head = getHeadValue("Set-Cookie"); // get the cookie
        String temp1 = getHtmlStr("GBK");         // read the response HTML page
        close();                                  // close the connection

        // Second request
        urlStr = "http://www.njgjj.com/logonbyidcard.do;" + head; // request URL
        openHttpURLConnection(urlStr, "POST"); // open the connection
        setHeadValue("Cookie", head);          // put the cookie into the request headers
        // Content-Length counts bytes, and sendRequest() writes s.getBytes()
        setHeadValue("Content-Length", String.valueOf(s.getBytes().length));
        setHeadValue("Content-type", "application/x-www-form-urlencoded");
        sendRequest(s);              // send the request
        result = getHtmlStr("GBK");  // read the response HTML page into the variable declared above
        close();                     // close the connection

        System.out.println(result);
        System.out.println("------------------------------------------------------");

        /** Parse the page */
        result = parseHtml(result.trim().replaceAll("TH", "td"));
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (httpConn != null) {
                httpConn.disconnect();
                httpConn = null;
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    return result;
}

 

   Note: the main method lays out the business logic; the helper methods it calls are listed next.
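The request body `s` is left empty in the listing and has to be filled in from what HTTPLook captured. As a hedged sketch of how such a body could be assembled for `application/x-www-form-urlencoded`: the field names `idcard` and `password` below are hypothetical placeholders, not the site's real form fields, and `FormBody` is not part of the original code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FormBody {

    /** Joins the fields as name=value pairs separated by '&', URL-encoding both sides. */
    static String encode(Map<String, String> fields, String charset)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), charset))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), charset));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical field names and values, for illustration only.
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("idcard", "000000000000000000");
        fields.put("password", "p@ss word");

        // Pass the charset the target form expects; UTF-8 is used here as an assumption.
        String body = encode(fields, "UTF-8");
        System.out.println(body);
        System.out.println("Content-Length: " + body.getBytes("UTF-8").length);
    }
}
```

The resulting string would take the place of `s` in `downLoadPages`, with `idCard` and `passwordStr` supplying the values.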

 

/**
 * Closes the connection.
 */
public void close() {
    try {
        if (httpConn != null) {
            httpConn.disconnect();
            httpConn = null;
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

      

/**
 * Returns the value of a response header.
 * On a non-200 response, the status code is returned as a string instead.
 */
public String getHeadValue(String key) {
    String returnValue = "";
    try {
        int code = httpConn.getResponseCode();
        if (code != HttpURLConnection.HTTP_OK) {
            System.out.println("error   code   " + String.valueOf(code));
            returnValue = String.valueOf(code);
        } else {
            // Uncomment to list every header name in the response:
            //Map map = httpConn.getHeaderFields();
            //Set set = map.keySet();
            //Iterator it = set.iterator();
            //while(it.hasNext()){
            //     System.out.println("next ---"+it.next());
            //}
            String value = httpConn.getHeaderField(key);
            if (value != null) { // getHeaderField returns null when the header is absent
                returnValue = value;
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return returnValue;
}
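A `Set-Cookie` value usually carries attributes such as `Path` after the cookie itself, while a `Cookie` request header is expected to carry only the `name=value` pair. The helper below is a minimal sketch of trimming the value before it is passed to `setHeadValue("Cookie", ...)`. `CookieUtil` and `toCookieHeader` are hypothetical names, and the session id in the example is made up.

```java
final class CookieUtil {

    /** Keeps only the leading name=value pair of a Set-Cookie header value. */
    static String toCookieHeader(String setCookie) {
        if (setCookie == null) {
            return "";
        }
        int semi = setCookie.indexOf(';');
        String pair = (semi >= 0) ? setCookie.substring(0, semi) : setCookie;
        return pair.trim();
    }

    public static void main(String[] args) {
        // Made-up header value for illustration.
        System.out.println(toCookieHeader("JSESSIONID=abc123; Path=/; HttpOnly"));
    }
}
```

This sketch handles a single cookie only; a response that sets several cookies sends several `Set-Cookie` headers, which would need to be collected and joined with `; `.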

 

 

/**
 * Adds a parameter to the request headers.
 */
public void setHeadValue(String key, String value) {
    try {
        httpConn.setRequestProperty(key, value);
    } catch (Exception e) {
        e.printStackTrace();
        try {
            if (httpConn != null) {
                httpConn.disconnect();
                httpConn = null;
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

 

/**
 * Sends the request body.
 */
public void sendRequest(String sendStr) {
    OutputStream outputstream = null;
    try {
        outputstream = httpConn.getOutputStream();
        outputstream.write(sendStr.getBytes());
        outputstream.flush();
    } catch (Exception e) {
        e.printStackTrace();
        if (httpConn != null) {
            httpConn.disconnect();
            httpConn = null;
        }
    } finally {
        // Close the stream whether or not the write succeeded
        try {
            if (outputstream != null) {
                outputstream.close();
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

      

/**
 * Opens the connection and sets the default request parameters.
 */
public void openHttpURLConnection(String inUrlStr, String method) {
    try {
        url = new URL(inUrlStr);
        httpConn = (HttpURLConnection) url.openConnection();
        HttpURLConnection.setFollowRedirects(true);
        httpConn.setUseCaches(false);
        httpConn.setDoOutput("POST".equalsIgnoreCase(method)); // only a POST sends a body
        httpConn.setDoInput(true);
        httpConn.setRequestMethod(method);
        // Header values copied from an IE 6 request
        httpConn.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)");
        httpConn.setRequestProperty("Accept",
                "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*");
        httpConn.setRequestProperty("Connection", "Keep-Alive");
    } catch (Exception e) {
        e.printStackTrace();
        try {
            if (httpConn != null) {
                httpConn.disconnect();
                httpConn = null;
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

      

      

                    

/**
 * Reads the HTML page from the response body.
 * On a non-200 response, the status code is returned as a string instead.
 */
public String getHtmlStr(String encode) {
    InputStream in = null;
    String returnHtmlStr = "";
    try {
        int code = httpConn.getResponseCode();
        if (code != HttpURLConnection.HTTP_OK) {
            System.out.println("error   code   " + String.valueOf(code));
            returnHtmlStr = String.valueOf(code);
        } else {
            in = httpConn.getInputStream();

            // Read all bytes first and decode once, so that no byte is skipped
            // and multi-byte characters are never split across chunks.
            java.io.ByteArrayOutputStream buf = new java.io.ByteArrayOutputStream();
            byte[] b = new byte[4096];
            int n;
            while ((n = in.read(b)) != -1) {
                buf.write(b, 0, n);
            }
            returnHtmlStr = new String(buf.toByteArray(), encode);
        }
    } catch (Exception e) {
        e.printStackTrace();
        if (httpConn != null) {
            httpConn.disconnect();
            httpConn = null;
        }
    } finally {
        // Close the stream whether or not the read succeeded
        try {
            if (in != null) {
                in.close();
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
    return returnHtmlStr;
}
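`getHtmlStr` is called with a hard-coded "GBK". Where the server states the encoding in its `Content-Type` header, the charset could be read from there instead. A minimal sketch, assuming a header of the form `text/html; charset=GBK`; `CharsetSniffer` and `fromContentType` are hypothetical names, and the fallback argument covers responses that omit the parameter.

```java
import java.util.Locale;

final class CharsetSniffer {

    /** Returns the charset parameter of a Content-Type value, or the fallback if none is given. */
    static String fromContentType(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters follow it
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String v = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) {
                    v = v.substring(1, v.length() - 1);
                }
                return v.length() == 0 ? fallback : v;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=GBK", "ISO-8859-1"));
        System.out.println(fromContentType("text/html", "GBK"));
    }
}
```

In the listing, the call could then read `getHtmlStr(CharsetSniffer.fromContentType(httpConn.getContentType(), "GBK"))`. Pages that declare their encoding only in a `<meta>` tag are not covered by this sketch.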

 

2.3 Parsing the Page

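Once the page has been fetched successfully, the next step is to parse it. This is where the HTMLParser package comes in. For its download address, see the Resources section (1.2).

Suppose we need the contents of one span and one table from the HTML page. Only the parsing of span and table elements is shown here, as a starting point. For everything else, consult the HTMLParser documentation, which can be downloaded from the same location as the package.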
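Before parsing, `downLoadPages` rewrites header cells to ordinary cells with `replaceAll("TH","td")`, which also changes any other uppercase "TH" in the page text. A hedged alternative, shown only as a sketch: restrict the replacement to tag names with a case-insensitive regular expression. `HeaderCells` and `thToTd` are hypothetical names, not part of the original code.

```java
final class HeaderCells {

    /**
     * Rewrites opening and closing th tags to td, in any letter case.
     * The word boundary keeps longer tag names such as thead untouched.
     */
    static String thToTd(String html) {
        return html.replaceAll("(?i)<(/?)th\\b", "<$1td");
    }

    public static void main(String[] args) {
        String html = "<table><tr><TH class=\"h\">NAME</TH><td>THE VALUE</td></tr></table>";
        System.out.println(thToTd(html));
    }
}
```

The cell text `THE VALUE` is left alone because the pattern matches only when `th` directly follows `<` or `</`.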

The code is as follows:

/**

        * Looks up the value at the given row and column of the given table
