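The previous step produced the HTML page, so the next step is to strip the HTML tags from the file.
Original file
We need to keep the body text of the file and remove the HTML tags:
With sed this is easy: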

    sed -e 's/<[^>]*>//g;s/&nbsp;/ /g;s/&lt;/</g;s/&gt;/>/g;s/&quot;/"/g' test1 | sed -e "s/&#160;/ /g;s/&#60;/</g;s/&#62;/>/g;s/&#39;/'/g;s/&amp;/\&/g;s/&#38;/\&/g"

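Breaking down the command:
1. `s/<[^>]*>//g` removes the HTML tags. The `[^>]` matters: `<.*>` goes wrong because `.*` is greedy and runs from the first `<` on the line to the last `>`. For a string such as `<p><span style="font-size:18px">Given a 2d grid map of <code>'1'</code>s (land) and`, the pattern `<.*>` matches all of `<p><span style="font-size:18px">Given a 2d grid map of <code>'1'</code>`, so the text between the tags is deleted along with them.
2. The remaining substitutions convert HTML character entities back to plain characters.
Note that in the replacement part of a sed `s` command, `&` stands for the matched text, so a literal ampersand has to be escaped with a backslash, for example `s/&amp;/\&/g`.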
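
To make both points concrete, here is a minimal sketch that runs the greedy and the negated-class patterns on the sample line from above, and then shows what happens when `&` is left unescaped in the replacement. The variable names (`line`, `greedy`, `careful`, `unescaped`, `escaped`) are only illustrative.

```shell
#!/bin/sh
# Sample line from the explanation above.
line="<p><span style=\"font-size:18px\">Given a 2d grid map of <code>'1'</code>s (land) and"

# Greedy: .* runs from the first '<' to the last '>' and takes the text with it.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')

# Negated class: each match stops at the first '>' after its '<'.
careful=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')

echo "greedy:  $greedy"     # greedy:  s (land) and
echo "careful: $careful"    # careful: Given a 2d grid map of '1's (land) and

# Unescaped & in the replacement re-inserts the match, so nothing changes.
unescaped=$(printf '%s\n' 'a &amp; b' | sed -e 's/&amp;/&/g')

# Escaped \& is a literal ampersand.
escaped=$(printf '%s\n' 'a &amp; b' | sed -e 's/&amp;/\&/g')

echo "unescaped: $unescaped"    # unescaped: a &amp; b
echo "escaped:   $escaped"      # escaped:   a & b
```

Since sed works line by line, neither pattern catches a tag whose `<` and `>` sit on different lines.
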
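Result | Description | Entity name | Entity number |
---|---|---|---|
(space) | Non-breaking space | `&nbsp;` | `&#160;` |
< | Less-than sign | `&lt;` | `&#60;` |
> | Greater-than sign | `&gt;` | `&#62;` |
& | Ampersand | `&amp;` | `&#38;` |
" | Quotation mark | `&quot;` | `&#34;` |
' | Apostrophe | `&apos;` (not supported by IE) | `&#39;` |
¢ | Cent | `&cent;` | `&#162;` |
£ | Pound | `&pound;` | `&#163;` |
¥ | Yen | `&yen;` | `&#165;` |
€ | Euro | `&euro;` | `&#8364;` |
§ | Section sign | `&sect;` | `&#167;` |
© | Copyright | `&copy;` | `&#169;` |
® | Registered trademark | `&reg;` | `&#174;` |
™ | Trademark | `&trade;` | `&#8482;` |
× | Multiplication sign | `&times;` | `&#215;` |
÷ | Division sign | `&divide;` | `&#247;` |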
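
The table can be turned into further sed rules in the same way. The following is only a sketch: `decode_entities` is a hypothetical helper name, it covers just part of the table, and it assumes the input is UTF-8 text arriving on standard input. The ampersand rules are placed last so that text such as `&amp;lt;` ends up as `&lt;` rather than `<`.

```shell
#!/bin/sh
# decode_entities: hypothetical helper, reads stdin and writes decoded text to stdout.
# Only a subset of the table is handled; add more rules as needed.
decode_entities() {
    sed -e 's/&nbsp;/ /g;   s/&#160;/ /g' \
        -e 's/&lt;/</g;     s/&#60;/</g' \
        -e 's/&gt;/>/g;     s/&#62;/>/g' \
        -e 's/&quot;/"/g;   s/&#34;/"/g' \
        -e "s/&apos;/'/g;   s/&#39;/'/g" \
        -e 's/&copy;/©/g;   s/&#169;/©/g' \
        -e 's/&euro;/€/g;   s/&#8364;/€/g' \
        -e 's/&times;/×/g;  s/&#215;/×/g' \
        -e 's/&amp;/\&/g;   s/&#38;/\&/g'   # ampersand last, escaped as \& in the replacement
}

# Usage: pipe any text through the helper.
printf '%s\n' '&copy; 2020 A&amp;B: 5 &times; 3 &lt; 20, price &#8364;9' | decode_entities
# prints: © 2020 A&B: 5 × 3 < 20, price €9
```

A chain of substitutions like this is not a full entity decoder: arbitrary numeric references such as `&#x41;` are not handled, and input that mixes `&amp;` with `&#38;` can still be decoded one step too far.
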
Final result:
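The same cleanup written in Java:

    import java.io.File;
    import java.io.PrintWriter;
    import java.util.Scanner;

    public class Replace {
        // Entities and their replacements, index-aligned. The ampersand forms come
        // last so that text like "&amp;lt;" decodes to "&lt;" and not to "<".
        private static final String[] HTML = {"&nbsp;", "&lt;", "&gt;", "&quot;", "&#160;", "&#60;", "&#62;", "&#39;", "&amp;", "&#38;"};
        private static final String[] RE = {" ", "<", ">", "\"", " ", "<", ">", "'", "&", "&"};

        public static void main(String[] args) throws Exception {
            File inFile = new File("test1");   // source file whose tags are to be removed
            System.out.println(inFile.length());
            File outFile = new File("test2");  // file that receives the cleaned text
            // try-with-resources closes both files even if an exception is thrown
            try (Scanner in = new Scanner(inFile); PrintWriter out = new PrintWriter(outFile)) {
                while (in.hasNextLine()) {
                    out.println(replace(in.nextLine()));
                }
            }
            System.out.println("Read over!!!");
        }

        public static String replace(String str) {
            // Remove tags; this runs line by line, so a tag split across lines is not matched
            str = str.replaceAll("<[^>]*>", "");
            // Decode entities with literal (non-regex) replacement
            for (int i = 0; i < HTML.length; i++) {
                str = str.replace(HTML[i], RE[i]);
            }
            return str;
        }
    }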