Stripping tags and entity symbols out of HTML

havedream_one

Category: 博文提取小项目 (blog post extraction mini-project)

Article link: https://blog.csdn.net/havedream_one/article/details/44977775

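The previous step already fetched the HTML page, so the next step is to strip the HTML tags from that file.

Original file

We need to keep the body text in the file and remove the HTML tags:

This is easy with sed: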

sed -e 's/<[^>]*>//g;s/&#160;/ /g;s/&#60;/</g;s/&#62;/>/g;s/&#34;/"/g' test1 | sed -e "s/&nbsp;/ /g;s/&lt;/</g;s/&gt;/>/g;s/&#39;/'/g;s/&#38;/\&/g;s/&amp;/\&/g"

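How the command works:

1. s/<[^>]*>//g removes the HTML tags. The [^>] is essential: <.*> is greedy and gives the wrong result. Take a line such as <p><span style="font-size:18px">Given a 2d grid map of <code>'1'</code>s (land) and. Here <.*> matches everything from the first < to the last >, that is <p><span style="font-size:18px">Given a 2d grid map of <code>'1'</code>, so the body text is deleted along with the tags.

2. The remaining expressions convert HTML character entities back into plain characters. The ampersand expressions run last, so that text such as &amp;lt; decodes to &lt; and not to <.

Note that in the replacement part of a sed s command, & stands for the matched text, so a literal & must be escaped with a backslash, e.g. s/&amp;/\&/g;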
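The difference between the greedy pattern and the negated character class in point 1 can be checked with a small Java sketch. The class and method names (GreedyTagDemo, stripGreedy, stripTags) are made up for this illustration and are not part of the original project; Java's regex engine treats these two patterns the same way sed does on a single line.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: compares the two tag patterns discussed above.
class GreedyTagDemo {
    // Greedy: ".*" extends to the LAST '>' on the line.
    static final Pattern GREEDY = Pattern.compile("<.*>");
    // Negated class: each match stops at the FIRST '>' after the '<'.
    static final Pattern NEGATED = Pattern.compile("<[^>]*>");

    static String stripGreedy(String line) {
        return GREEDY.matcher(line).replaceAll("");
    }

    static String stripTags(String line) {
        return NEGATED.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        String line = "<p><span style=\"font-size:18px\">Given a 2d grid map of "
                + "<code>'1'</code>s (land) and";
        // The greedy pattern swallows the text between the first and last tag.
        System.out.println("[" + stripGreedy(line) + "]");
        // The negated class removes only the tags themselves.
        System.out.println("[" + stripTags(line) + "]");
    }
}
```

With the greedy pattern only the tail after the last tag survives, while the negated class keeps the whole sentence.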

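Displayed	Description	Entity name	Entity number
 	space	&nbsp;	&#160;
<	less-than sign	&lt;	&#60;
>	greater-than sign	&gt;	&#62;
&	ampersand	&amp;	&#38;
"	quotation mark	&quot;	&#34;
'	apostrophe	&apos; (not supported by IE)	&#39;
¢	cent	&cent;	&#162;
£	pound	&pound;	&#163;
¥	yen	&yen;	&#165;
€	euro	&euro;	&#8364;
§	section sign	&sect;	&#167;
©	copyright	&copy;	&#169;
®	registered trademark	&reg;	&#174;
™	trademark	&trade;	&#8482;
×	multiplication sign	&times;	&#215;
÷	division sign	&divide;	&#247;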

Convert the entities that appear in the document according to the table above.

Final result:

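import java.io.File;
import java.io.PrintWriter;
import java.util.Scanner;

public class Replace {
        public static void main(String[] args) throws Exception {
                File inFile = new File("test1");   // source file whose tags should be removed
                System.out.println(inFile.length());
                File outFile = new File("test2");  // file that receives the cleaned text
                // try-with-resources closes both files even if an exception is thrown
                try (Scanner in = new Scanner(inFile);
                     PrintWriter out = new PrintWriter(outFile)) {
                        // hasNextLine() pairs with nextLine(); hasNext() would skip trailing blank lines
                        while (in.hasNextLine()) {
                                out.println(replace(in.nextLine()));
                        }
                }
                System.out.println("Read over!!!");
        }

        public static String replace(String str) {
                // Entities and their replacements. The ampersand forms come last so that
                // text such as "&amp;lt;" decodes to "&lt;" rather than "<".
                String[] html = {"&#160;", "&#60;", "&#62;", "&#34;", "&nbsp;", "&lt;", "&gt;", "&#39;", "&#38;", "&amp;"};
                String[] re = {" ", "<", ">", "\"", " ", "<", ">", "'", "&", "&"};
                // remove the tags
                str = str.replaceAll("<[^>]*>", "");
                // String.replace matches literally, so the entities need no regex escaping
                for (int i = 0; i < html.length; i++) {
                        str = str.replace(html[i], re[i]);
                }
                return str;
        }
}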
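
The Replace class above applies one replacement after another, so a few nested inputs (for example &#38;amp;) can still be decoded twice. As a possible variation, the sketch below decodes every entity in a single pass and also covers the rest of the entity table. It is only an illustration under stated assumptions: the names EntityDecoder, NAMED and decode are invented here, it needs Java 9 or later (Map.ofEntries, appendReplacement with a StringBuilder), and unlike the sed command it turns &nbsp; into the real non-breaking space U+00A0 rather than an ordinary space.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: single-pass decoder for the entities in the table.
class EntityDecoder {
    // Named entities taken from the table; anything else is left untouched.
    static final Map<String, String> NAMED = Map.ofEntries(
            Map.entry("nbsp", "\u00A0"), Map.entry("lt", "<"),
            Map.entry("gt", ">"), Map.entry("amp", "&"),
            Map.entry("quot", "\""), Map.entry("apos", "'"),
            Map.entry("cent", "\u00A2"), Map.entry("pound", "\u00A3"),
            Map.entry("yen", "\u00A5"), Map.entry("euro", "\u20AC"),
            Map.entry("sect", "\u00A7"), Map.entry("copy", "\u00A9"),
            Map.entry("reg", "\u00AE"), Map.entry("trade", "\u2122"),
            Map.entry("times", "\u00D7"), Map.entry("divide", "\u00F7"));

    // Group 1: decimal entity number, group 2: entity name.
    static final Pattern ENTITY =
            Pattern.compile("&(?:#(\\d{1,7})|([A-Za-z]+));");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = m.group(); // default: keep the text as it is
            if (m.group(1) != null) {
                int codePoint = Integer.parseInt(m.group(1));
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } else if (NAMED.containsKey(m.group(2))) {
                replacement = NAMED.get(m.group(2));
            }
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&lt;p&gt; 1 &times; 2 &amp; &quot;x&quot;"));
        // Decoded output is never rescanned, so this prints &lt; and not <.
        System.out.println(decode("&amp;lt;"));
    }
}
```

Because the output of one replacement is never scanned again, each entity in the input is decoded exactly once. Calling decode on the result of the tag-stripping step would be one way to combine it with the code above.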
