[Java] Removing HTML tags or element attributes (regular expressions / jsoup)

fukaiit

Category: Java. Tags: regular expressions, jsoup, HTML parsing

Article link: https://blog.csdn.net/fukaiit/article/details/84262471

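1. Using regular expressions

(1) Removing HTML tags with regular expressions

Use case:
Suppose a news article was written in a rich-text editor, and the list page needs its first 200 characters as a summary. The HTML tags have to be stripped first so that the truncation operates on the actual text.
Implementation: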

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Removes HTML tags and returns the remaining text.
 */
public static String removeHtmlTag(String htmlStr) {
    // Regex for <script> blocks, tolerating stray whitespace (simpler alternative: <script[^>]*?>[\\s\\S]*?<\\/script>)
    String regEx_script = "<[\\s]*?script[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?script[\\s]*?>";
    // Regex for <style> blocks, tolerating stray whitespace (simpler alternative: <style[^>]*?>[\\s\\S]*?<\\/style>)
    String regEx_style = "<[\\s]*?style[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?style[\\s]*?>";
    // Regex for any remaining HTML tag
    String regEx_html = "<[^>]+>";
    // Regex for named character entities such as &nbsp;
    // Note: entities are deleted, not decoded, so &lt; and &amp; are dropped as well
    String regEx_special = "\\&[a-zA-Z]{1,10};";

    // 1. Remove script blocks together with their content
    Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
    Matcher m_script = p_script.matcher(htmlStr);
    htmlStr = m_script.replaceAll("");
    // 2. Remove style blocks together with their content
    Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
    Matcher m_style = p_style.matcher(htmlStr);
    htmlStr = m_style.replaceAll("");
    // 3. Remove all other HTML tags, keeping the text between them
    Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
    Matcher m_html = p_html.matcher(htmlStr);
    htmlStr = m_html.replaceAll("");
    // 4. Remove named character entities
    Pattern p_special = Pattern.compile(regEx_special, Pattern.CASE_INSENSITIVE);
    Matcher m_special = p_special.matcher(htmlStr);
    htmlStr = m_special.replaceAll("");

    return htmlStr;
}
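
The summary use case described above can be sketched end to end as follows. This is only a minimal illustration, not part of the original code: the class `SummaryDemo` and the methods `stripTags` and `summarize` are hypothetical names, and the four patterns are condensed equivalents of the ones used in `removeHtmlTag`, compiled once as constants.

```java
import java.util.regex.Pattern;

class SummaryDemo {
    // Condensed versions of the four filters, compiled once (assumed equivalent for typical input).
    private static final Pattern SCRIPT = Pattern.compile(
            "<\\s*script[^>]*>[\\s\\S]*?<\\s*/\\s*script\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<\\s*style[^>]*>[\\s\\S]*?<\\s*/\\s*style\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern ENTITY = Pattern.compile("&[a-zA-Z]{1,10};");

    // Strips script blocks, style blocks, remaining tags and named entities, in that order.
    static String stripTags(String html) {
        String s = SCRIPT.matcher(html).replaceAll("");
        s = STYLE.matcher(s).replaceAll("");
        s = TAG.matcher(s).replaceAll("");
        return ENTITY.matcher(s).replaceAll("");
    }

    // Hypothetical helper: plain-text summary of at most maxChars characters.
    static String summarize(String html, int maxChars) {
        String text = stripTags(html).trim();
        if (text.length() <= maxChars) {
            return text;
        }
        int end = maxChars;
        // Do not cut a surrogate pair (e.g. an emoji) in half.
        if (end > 0 && Character.isHighSurrogate(text.charAt(end - 1))) {
            end--;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String html = "<style>p{color:red}</style>"
                + "<p style=\"color:red\">&nbsp;Breaking news: <b>markets</b> rally</p>"
                + "<script>alert(1)</script>";
        System.out.println(summarize(html, 200)); // prints "Breaking news: markets rally"
        System.out.println(summarize(html, 8));   // prints "Breaking"
    }
}
```

Because named entities are deleted rather than decoded, an `&nbsp;` between two words would join them together; if that matters, replacing entities with a space before trimming is one option.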

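(2) Removing element attributes with regular expressions

Use case:
A website's legacy data contains many news articles written in a rich-text editor, full of inline styles. The new site defines its styles centrally, so the data migration needs to strip these inline styles while keeping the tags themselves.
Implementation: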

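import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches an opening tag; group 1 is the tag name.
// Closing tags such as </p> do not match and are left unchanged.
private static final String regEx_tag = "<(\\w[^>\\s]*)[\\s\\S]*?>";
// Compiled once instead of on every call
private static final Pattern p_tag = Pattern.compile(regEx_tag, Pattern.CASE_INSENSITIVE);

public static String removeEleProp(String htmlStr) {
    Matcher m = p_tag.matcher(htmlStr);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        String tagWithProp = m.group(0);
        String tag = m.group(1);
        if ("img".equalsIgnoreCase(tag)) {
            // Keep the attributes of img tags; this could be refined to keep only required ones such as src
            // quoteReplacement stops '$' and '\' in attribute values from being read as group references
            m.appendReplacement(sb, Matcher.quoteReplacement(tagWithProp));
        } else if ("a".equalsIgnoreCase(tag)) {
            // Keep the attributes of a tags; this could be refined to keep only required ones such as href
            m.appendReplacement(sb, Matcher.quoteReplacement(tagWithProp));
        } else {
            // Every other tag is rewritten without attributes
            m.appendReplacement(sb, Matcher.quoteReplacement("<" + tag + ">"));
        }
    }
    m.appendTail(sb);
    return sb.toString();
}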
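
The comments in the code mention that `img` and `a` tags could be processed further so that only necessary attributes such as `src` and `href` remain. One possible way to do that is sketched below. The class `AttrWhitelistDemo` and the helpers `attrPattern` and `keepOnly` are hypothetical names introduced for illustration, and the sketch assumes attribute values do not contain a literal `>`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttrWhitelistDemo {
    // Opening tag: group 1 = tag name, group 2 = everything up to the closing '>'.
    private static final Pattern TAG = Pattern.compile("<([a-zA-Z][\\w:-]*)([^>]*)>");
    private static final Pattern SRC = attrPattern("src");
    private static final Pattern HREF = attrPattern("href");

    // Matches name=value with a double-quoted, single-quoted or unquoted value.
    // The lookbehind prevents "src" from matching inside "data-src".
    private static Pattern attrPattern(String name) {
        return Pattern.compile(
                "(?<![\\w-])" + Pattern.quote(name) + "\\s*=\\s*(?:\"[^\"]*\"|'[^']*'|[^\\s>]+)",
                Pattern.CASE_INSENSITIVE);
    }

    // Rebuilds the tag with only the matched attribute, or with none if it is absent.
    private static String keepOnly(String tag, String attrs, Pattern attr) {
        Matcher m = attr.matcher(attrs);
        return m.find() ? "<" + tag + " " + m.group() + ">" : "<" + tag + ">";
    }

    static String removeAttrs(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String tag = m.group(1);
            String replacement;
            if ("img".equalsIgnoreCase(tag)) {
                replacement = keepOnly(tag, m.group(2), SRC);
            } else if ("a".equalsIgnoreCase(tag)) {
                replacement = keepOnly(tag, m.group(2), HREF);
            } else {
                replacement = "<" + tag + ">";
            }
            // Quote the replacement so '$' in a URL is not treated as a group reference.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p style=\"color:red\">Hi <img style=\"w\" src=\"a.png\" alt=\"x\"> "
                + "<a class=\"l\" href=\"/cost?$1\">link</a></p>";
        System.out.println(removeAttrs(html));
        // prints: <p>Hi <img src="a.png"> <a href="/cost?$1">link</a></p>
    }
}
```

Regular expressions only approximate HTML parsing, so input with unusual quoting or with `>` inside attribute values may not be handled as intended.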