去掉word冗余格式 java正则表达式

   word转换html时,会留下很多格式,有些格式并不是我们所需要的,然而这些格式比真正的文章内容还要多,严重影响页面的加载速度,因此就需要找个一个好的解决方案把这些多余的格式个去掉。网上有很多去除word冗余格式的js版的正则表达式,这里只提供java版的正则表达式。
	public static String clearWordFormat(String content) {
		//把<P></P>转换成</div></div>保留样式
		//content = content.replaceAll("(<P)([^>]*>.*?)(<\\/P>)", "<div$2</div>");
		//把<P></P>转换成</div></div>并删除样式
		content = content.replaceAll("(<P)([^>]*)(>.*?)(<\\/P>)", "<p$3</p>");
		//删除不需要的标签
		content = content.replaceAll("<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>", "");
		//删除不需要的属性
		content = content.replaceAll("<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)=(?:'[^']*'|\"\"[^\"\"]*\"\"|[^>]+)([^>]*)>", "<$1$2>");
		//删除<STYLE TYPE="text/css"></STYLE>及之间的内容
		int styleBegin = content.indexOf("<STYLE");
		int styleEnd = content.indexOf("</STYLE>") + 8;
		String style = content.substring(styleBegin, styleEnd);
		content = content.replace(style, "");
		return content;
	}

去除不需要的标签

<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>


  • match an open tag character <
  • and optionally match a close tag sequence </  (because we also want to remove the closing tags)
  • match any of the list of unwanted tags: font,span,xml,del,ins
  • a pattern is given to match any of the namespace tags, anything beginning with o,v,w,x,p, followed by a : followed by another word
  • match any attributes as far as the closing tag character >
  • the replace string for this regex is "", which will completely remove the instances of any matching tags.
  • note that we are not removing anything between the tags, just the tags themselves
去除不需要的属性

<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)=(?:'[^']*'|\"\"[^\"\"]*\"\"|[^>]+)([^>]*)>


  • match an open tag character <
  • capture any text before the unwanted attribute (This is $1 in the replace expression)
  • match (but don't capture) any of the unwanted attributes: class, lang, style, size, face, o:p, v:shape etc.
  • there should always be an = character after the attribute name
  • match the value of the attribute by identifying the delimiters. these can be single quotes, or double quotes, or no quotes at all.
  • for single quotes, the pattern is: ' followed by anything but a ' followed by a '
  • similarly for double quotes. 
  • for a non-delimited attribute value, i specify the pattern as anything except the closing tag character >
  • lastly, capture whatever comes after the unwanted attribute in ([^>]*)
  • the replacement string <$1$2> reconstructs the tag without the unwanted attribute found in the middle.
  • note: this only removes one occurence of an unwanted attribute, this is why i run the same regex twice.  For example, take the html fragment: <p class="MSO Normal" style="Margin-TOP:3em"> 
    the regex will only remove one of these attributes.  Running the regex twice will remove the second one.  I can't think of any reasonable cases where it would need to be run more than that. 

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值