Problem Preparation for HTML to TXT


If you have worked on website design, you must be familiar with HTML. When writing an HTML file, we put control commands between angle brackets, as in <> and </>, to control the appearance of the page. Sometimes we want only the content of an HTML file, without the control commands. One way to get it is to open the file in a browser and copy the content you want. That can see you through for a while, but when you have many such HTML files to process, it becomes tedious again. Who wants to open, copy, close, open, copy, close...? Is there a program that can get you out of this situation? I do not know, but we can write one!

But writing the whole program needs more consideration, because the format of HTML files is much more complex than we might imagine. Before we develop a solution for all kinds of HTML files, it is better to start from the simplest case.

Now we simplify the problem to the following description: given a file, can we extract the characters that are not between a left angle bracket and a right angle bracket? If you are as unfamiliar with HTML as I am, you will understand why I simplify it like this and why I call the problem “Preparation for HTML to TXT”.

Input

The input is a file that may contain any characters. Some of them lie between a left angle bracket and a right angle bracket.

Output

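The output is a file that contains only the characters that are not enclosed in angle brackets. A right angle bracket closes the nearest left angle bracket before it that is still open; as the samples show, a bracket without a partner counts as an ordinary character and is kept.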

Sample Input

File1                  // content of File1 <<>He<>l<>l<>o,<>

File2                  // content of File2 <>Good Morning<..!>!

File3                  // content of File3 <<N><HTML><A></A>>>

Sample Output

 
 
  
  

<Hello,                // result for File1

Good Morning!          // result for File2

>                      // result for File3
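The third sample shows why a single "inside a tag" flag is not enough. The sketch below is only an illustration and is not part of the original solution; the class name NaiveStrip is made up for it. It forgets how many '<' are open, so it gives the right answer for File2 but a different one for File3.

```java
// Hypothetical helper for illustration only; not the solution of this article.
final class NaiveStrip {
    /** Drops the text between a '<' and the next '>' using only one flag. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder();
        boolean inside = false;
        for (char c : input.toCharArray()) {
            if (c == '<') {
                inside = true;          // a nested '<' is not remembered
            } else if (c == '>') {
                if (inside) {
                    inside = false;     // closes the tag, whatever its depth
                } else {
                    out.append(c);      // a '>' with no open '<' is kept
                }
            } else if (!inside) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<>Good Morning<..!>!"));  // prints "Good Morning!"
        System.out.println(strip("<<N><HTML><A></A>>>"));   // prints ">>", the sample expects ">"
    }
}
```

In File3 the outermost '<' is closed by the first of the two final '>' characters, so only one '>' should remain. Getting that right requires remembering every '<' that is still open.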

 

 

Solution

 

 

/* define variables */
int currentchar;   /* int, so that it can also hold EOF */
Stack charStack, idxOfLeftAngleStack;

int idx,     /* idx of the newest '<' encountered during process */
length,      /* length of the charStack */
idxStackLen; /* length of the idxOfLeftAngleStack */

/* do initialization */
length = 0;
idx = idxOfLeftAngleStack[0] = -1;
idxStackLen = 1;

while ((currentchar = getOneCharFromTheFile()) != EOF) {
  if (currentchar == '<') {
    charStack.push(currentchar);
    /* record the position of '<' in the charStack and
       increase the length of the charStack. */
    idx = length++;
    idxOfLeftAngleStack[idxStackLen++] = idx;
  }
  else if (currentchar == '>') {
    /* if there is no '<' in the charStack, idx will be -1.
       In that case, the currentchar can be written to the file directly. */
    if (idx != -1) {
      /* pop the characters from the newest '<' up to the top */
      charStack.pop(from idx, to length);
      /* new length of the charStack */
      length = idx;
      /* remove the last value from the idxOfLeftAngleStack; nothing is
         erased, we only decrease the top of the stack. */
      idxStackLen--;
      /* get the index of the previous '<' from the idxOfLeftAngleStack */
      idx = idxOfLeftAngleStack[idxStackLen - 1];
    }
    else { /* no '<' in the charStack */
      writeToFile(currentchar);
    }
  }
  else if (length == 0) {
    /* not behind any open '<': write the character to the file directly */
    writeToFile(currentchar);
  }
  else {
    /* behind an open '<': keep the character in the charStack.
       Only the length grows; idx still points at the newest '<'. */
    charStack.push(currentchar);
    length++;
  }
}

/* whatever is left in the charStack follows a '<' that was never closed,
   so it belongs to the output as well */
writeToFile(charStack);
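The same idea can be tried out on plain strings before any file handling is involved. The following is a minimal sketch, not the program of this article: the class name BracketStripper is made up, java.util.ArrayDeque stands in for the index stack, and one StringBuilder holds both the finished output and the pending characters.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical string-based sketch of the stack idea; names are illustrative.
final class BracketStripper {
    static String strip(String input) {
        StringBuilder kept = new StringBuilder();
        // positions in 'kept' of every '<' that is still open, newest on top
        Deque<Integer> open = new ArrayDeque<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                open.push(kept.length());
                kept.append(c);
            } else if (c == '>' && !open.isEmpty()) {
                // discard the newest '<' and everything that came after it
                kept.setLength(open.pop());
            } else {
                kept.append(c);   // ordinary character, or '>' with no open '<'
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<>Good Morning<..!>!"));  // prints "Good Morning!"
        System.out.println(strip("<<N><HTML><A></A>>>"));   // prints ">"
    }
}
```

Keeping everything in one buffer is simpler but holds the whole result in memory. The file version writes characters out as soon as no '<' is open.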

 

Java source code:

 

import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;

public class Html2Txt2 {
    private final File html;

    public Html2Txt2(File html) {
        this.html = html;
    }

    /* Reads the HTML file and writes the result to <name>.txt
       in the current directory. */
    public void html2txt() throws IOException {
        int currentchar;

        // characters read since the oldest '<' that is still open
        StringBuilder sb = new StringBuilder();
        int length = 0;

        // positions in sb of the open '<'; slot 0 holds the sentinel -1
        int[] idxStack = new int[20];
        int idx = idxStack[0] = -1;
        int lenOfIdxStack = 1;

        // try-with-resources closes both files, even after an exception
        try (FileReader fr = new FileReader(html);
             FileWriter fw = new FileWriter(new File(html.getName() + ".txt"))) {
            while ((currentchar = fr.read()) != -1) {
                if (currentchar == '<') {
                    sb.append((char) currentchar);
                    idx = length++;
                    // grow the stack so that many open '<' cannot overflow it
                    if (lenOfIdxStack == idxStack.length) {
                        idxStack = Arrays.copyOf(idxStack, 2 * idxStack.length);
                    }
                    idxStack[lenOfIdxStack++] = idx;
                } else if (currentchar == '>') {
                    if (idx != -1) {
                        // delete everything from the newest '<' onwards
                        sb.delete(idx, length);
                        length = idx;
                        // remove last idx of '<'
                        --lenOfIdxStack;
                        idx = idxStack[lenOfIdxStack - 1];
                    } else {
                        fw.write(currentchar);
                    }
                } else if (length == 0) {
                    fw.write(currentchar);
                } else {
                    sb.append((char) currentchar);
                    length++;
                }
            }
            // if there are still chars in the buffer,
            // write them all to the file
            if (sb.length() > 0) {
                fw.write(sb.toString());
            }
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("Usage: java Html2Txt2 filename.html...");
            System.exit(1);
        }
        for (String name : args) {
            try {
                new Html2Txt2(new File(name)).html2txt();
            } catch (IOException e) {
                // report the failure and continue with the next file
                e.printStackTrace();
            }
        }
    }
}
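To check the loop against the samples without creating files, the same logic can be written against java.io.Reader and java.io.Writer. This is a sketch under that assumption, not a replacement for the class above; StreamStripper and stripString are names made up for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical in-memory variant for testing; names are illustrative.
final class StreamStripper {
    static void strip(Reader in, Writer out) throws IOException {
        StringBuilder pending = new StringBuilder();  // text since the oldest open '<'
        Deque<Integer> open = new ArrayDeque<>();     // positions of open '<' in pending
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                open.push(pending.length());
                pending.append((char) c);
            } else if (c == '>' && !open.isEmpty()) {
                pending.setLength(open.pop());        // drop the newest '<' and what follows
            } else if (open.isEmpty()) {
                out.write(c);                         // nothing is open: emit at once
            } else {
                pending.append((char) c);
            }
        }
        out.write(pending.toString());                // text after a '<' that never closed
    }

    /** Convenience wrapper that runs strip on a string. */
    static String stripString(String input) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(input), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripString("<>Good Morning<..!>!"));  // prints "Good Morning!"
        System.out.println(stripString("<<N><HTML><A></A>>>"));   // prints ">"
    }
}
```

Passing a FileReader and a FileWriter to strip gives the file behaviour, while StringReader and StringWriter keep a test entirely in memory.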
