Removing Redundant Word Formatting with Java Regular Expressions

Latest recommended article published on 2024-10-31 10:27:34

iteye_3411

Tags: regular expressions, Java, XML, CSS, HTML

Article link: https://blog.csdn.net/iteye_3411/article/details/81996076

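When a Word document is converted to HTML, a lot of formatting markup is left behind. Much of it is not needed, yet it often outweighs the actual article content and noticeably slows page loading, so a good way to strip the redundant formatting is needed. There are many JavaScript regular expressions online for removing redundant Word formatting; this post provides only a Java version.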

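	public static String clearWordFormat(String content) {
		// Convert <P>...</P> to <div>...</div>, keeping the attributes (disabled)
		//content = content.replaceAll("(<P)([^>]*>.*?)(<\\/P>)", "<div$2</div>");
		// Convert <P ...>...</P> to <p>...</p>, dropping the attributes
		content = content.replaceAll("(<P)([^>]*)(>.*?)(<\\/P>)", "<p$3</p>");
		// Remove unwanted tags (the text between them is kept)
		content = content.replaceAll("<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>", "");
		// Remove unwanted attributes. Each pass strips one attribute per tag,
		// so repeat until nothing changes. The double-quoted branch is \"[^\"]*\"
		// (the doubled quotes in the original only matched values wrapped in "" "").
		String attrRegex = "<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)=(?:'[^']*'|\"[^\"]*\"|[^>]+)([^>]*)>";
		String previous;
		do {
			previous = content;
			content = content.replaceAll(attrRegex, "<$1$2>");
		} while (!content.equals(previous));
		// Remove <STYLE TYPE="text/css">...</STYLE> and everything in between, if present
		int styleBegin = content.indexOf("<STYLE");
		int styleEnd = styleBegin < 0 ? -1 : content.indexOf("</STYLE>", styleBegin);
		if (styleEnd >= 0) {
			content = content.substring(0, styleBegin) + content.substring(styleEnd + 8);
		}
		return content;
	}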
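
The STYLE step above looks for an uppercase <STYLE ...> block by index. As an alternative sketch (not the author's method), the same step could be done with a regex that ignores case, spans line breaks, and removes every style block it finds. The class name StyleBlockDemo and the method removeStyleBlocks are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original method.
final class StyleBlockDemo {
    // (?i) ignores case, (?s) lets . match line breaks inside the style block.
    private static final Pattern STYLE_BLOCK =
            Pattern.compile("(?is)<style\\b[^>]*>.*?</style\\s*>");

    // Removes every <style>...</style> block; input without one is returned unchanged.
    static String removeStyleBlocks(String html) {
        return STYLE_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<style type=\"text/css\">\np { margin: 0 }\n</style><p>Hi</p><STYLE>b{}</STYLE>";
        System.out.println(removeStyleBlocks(html)); // <p>Hi</p>
    }
}
```

Because the pattern is lazy (.*?), each block ends at its own closing tag, so text between two separate style blocks is preserved.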

Removing unwanted tags

<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>

Match the opening tag character <
Optionally match a / so that closing tags such as </span> are removed as well
Match any of the unwanted tag names: font, span, xml, del, ins, meta
Match any namespaced tag: a prefix of o, v, w, x, or p, followed by a : and one or more word characters (for example o:p or v:shape)
Match any attributes up to the closing tag character >
The replacement string for this regex is "", which removes every matching tag completely.
Note that only the tags themselves are removed; the text between them is kept.
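
The steps above can be tried on a small Word-style fragment. This is a minimal sketch: the class name UnwantedTagDemo, the method stripTags, and the sample HTML are made up for illustration; only the pattern comes from the article.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself is taken from the article.
final class UnwantedTagDemo {
    // Same pattern as above, compiled once so it can be reused.
    private static final Pattern UNWANTED_TAGS = Pattern.compile(
            "<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>");

    // Deletes the matching opening and closing tags, leaving their inner text in place.
    static String stripTags(String html) {
        return UNWANTED_TAGS.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p><span lang=EN-US>Hello <o:p></o:p></span><font face=\"Arial\">world</font></p>";
        System.out.println(stripTags(html)); // <p>Hello world</p>
    }
}
```

The plain <p> tag survives because p only counts as unwanted when it is a namespace prefix followed by a colon.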

Removing unwanted attributes

<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)=(?:'[^']*'|\"[^\"]*\"|[^>]+)([^>]*)>

Match the opening tag character <
Capture any text before the unwanted attribute (this is $1 in the replacement expression)
Match (but do not capture) any of the unwanted attribute names: class, lang, style, size, face, and namespaced ones such as o:p or v:shape
The attribute name must be followed by an = character
Match the attribute value by its delimiters: single quotes, double quotes, or no quotes at all
For single quotes, the pattern is ' followed by anything but a ' followed by a '
The double-quoted pattern works the same way.
For an undelimited value, the pattern is anything except the closing tag character >; being greedy, it also consumes any attributes that follow the unquoted value
Lastly, capture whatever comes after the unwanted attribute in ([^>]*) (this is $2)
The replacement string <$1$2> reconstructs the tag without the unwanted attribute in the middle.
Note: each pass removes only one occurrence of an unwanted attribute per tag, which is why I run the same regex twice. For example, take the HTML fragment: <p class="MSO Normal" style="Margin-TOP:3em">
A single pass removes only one of these attributes; running the regex a second time removes the other. I can't think of many cases that need more than that, but a tag with three or more unwanted attributes would require further passes.
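
The one-attribute-per-pass behaviour can be observed directly. The sketch below assumes the double-quoted branch is meant to be \"[^\"]*\" (a single pair of quotes); the class name UnwantedAttributeDemo and the methods stripOnce and stripAll are hypothetical. stripAll simply repeats the pass until the text stops changing, rather than hard-coding two runs.

```java
// Hypothetical demo class; the pattern is the article's, with single double quotes assumed.
final class UnwantedAttributeDemo {
    private static final String ATTR_REGEX =
            "<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)="
            + "(?:'[^']*'|\"[^\"]*\"|[^>]+)([^>]*)>";

    // One pass: removes at most one unwanted attribute from each tag.
    static String stripOnce(String html) {
        return html.replaceAll(ATTR_REGEX, "<$1$2>");
    }

    // Repeats the pass until a run makes no further change.
    static String stripAll(String html) {
        String previous;
        do {
            previous = html;
            html = stripOnce(html);
        } while (!html.equals(previous));
        return html;
    }

    public static void main(String[] args) {
        String html = "<p class=\"MSO Normal\" style=\"Margin-TOP:3em\">text</p>";
        System.out.println(stripOnce(html)); // <p class="MSO Normal" >text</p>
        System.out.println(stripAll(html));  // <p  >text</p>
    }
}
```

The first pass drops style rather than class because the leading ([^>]*) is greedy and settles on the last unwanted attribute in the tag. The leftover spaces inside the tag are harmless in HTML.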