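When a Word document is converted to HTML, a lot of formatting markup is left behind. Much of it is not needed, yet it often outweighs the actual article content and seriously slows down page loading, so a good way to strip the redundant formatting is needed. There are many JavaScript regular expressions online for removing Word's redundant formatting; this post provides only a Java version.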
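public static String clearWordFormat(String content) {
    // Convert <P></P> to <div></div>, keeping the attributes (disabled)
    //content = content.replaceAll("(<P)([^>]*>.*?)(<\\/P>)", "<div$2</div>");
    // Convert <P ...></P> to <p></p>, dropping the attributes.
    // Note: matches uppercase <P> only, and '.' does not cross line breaks.
    content = content.replaceAll("(<P)([^>]*)(>.*?)(<\\/P>)", "<p$3</p>");
    // Remove unwanted tags (the text between them is kept)
    content = content.replaceAll("<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>", "");
    // Remove unwanted attributes. Each pass strips at most one attribute per tag, so run it twice.
    String attrRegex = "<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)=(?:'[^']*'|\"[^\"]*\"|[^>]+)([^>]*)>";
    content = content.replaceAll(attrRegex, "<$1$2>");
    content = content.replaceAll(attrRegex, "<$1$2>");
    // Remove <STYLE TYPE="text/css"></STYLE> and everything in between
    int styleBegin = content.indexOf("<STYLE");
    int styleEnd = styleBegin < 0 ? -1 : content.indexOf("</STYLE>", styleBegin);
    if (styleEnd >= 0) {
        // Only cut when both the opening and closing tags were found
        content = content.substring(0, styleBegin) + content.substring(styleEnd + "</STYLE>".length());
    }
    return content;
}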
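The indexOf-based STYLE removal above only sees an uppercase <STYLE> tag and only the first block. As a hedged alternative, here is a minimal sketch that does the same job with a case-insensitive regex. The class and method names (StyleBlockRemover, removeStyleBlocks) are made up for illustration and are not part of the original code.

```java
import java.util.regex.Pattern;

// Hypothetical helper, shown only as an illustration.
class StyleBlockRemover {
    // (?i) ignores case, (?s) lets '.' match line breaks,
    // and the lazy '.*?' stops at the first closing tag.
    private static final Pattern STYLE_BLOCK =
            Pattern.compile("(?is)<style\\b[^>]*>.*?</style\\s*>");

    // Removes every <style>...</style> block; input without one is returned unchanged.
    static String removeStyleBlocks(String html) {
        return STYLE_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<STYLE TYPE=\"text/css\">\np { margin: 0 }\n</STYLE>"
                + "<p>Hello</p><style>b{}</style>";
        System.out.println(removeStyleBlocks(html)); // prints <p>Hello</p>
    }
}
```

Because the pattern simply does not match when there is no style block, no index checks are needed.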
Removing unwanted tags
<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>
- match an open tag character <
- and optionally match a close tag sequence </ (because we also want to remove the closing tags)
- match any of the list of unwanted tags: font, span, xml, del, ins, meta
- a pattern is given to match any of the namespace tags, anything beginning with o,v,w,x,p, followed by a : followed by another word
- match any attributes as far as the closing tag character >
- the replace string for this regex is "", which will completely remove the instances of any matching tags.
- note that we are not removing anything between the tags, just the tags themselves
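To see the steps above in action, here is a small sketch that applies the same pattern to a made-up fragment. The class name TagStripDemo and the sample HTML are only for illustration.

```java
// Hypothetical demo class; the pattern is the one listed above.
class TagStripDemo {
    static final String UNWANTED_TAGS =
            "<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>";

    // Deletes the matching opening and closing tags, leaving the text between them.
    static String stripTags(String html) {
        return html.replaceAll(UNWANTED_TAGS, "");
    }

    public static void main(String[] args) {
        String html = "<p><span lang=\"EN-US\">Hello <o:p></o:p>"
                + "<font face=\"Arial\">world</font></span></p>";
        System.out.println(stripTags(html)); // prints <p>Hello world</p>
    }
}
```

The plain <p> tag survives because the namespace alternative requires a colon after the prefix letter.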
Removing unwanted attributes
<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)=(?:'[^']*'|\"[^\"]*\"|[^>]+)([^>]*)>
- match an open tag character <
- capture any text before the unwanted attribute (This is $1 in the replace expression)
- match (but don't capture) any of the unwanted attributes: class, lang, style, size, face, o:p, v:shape etc.
- there should always be an = character after the attribute name
- match the value of the attribute by identifying the delimiters. these can be single quotes, or double quotes, or no quotes at all.
- for single quotes, the pattern is: ' followed by anything but a ' followed by a '
- similarly for double quotes.
- for a non-delimited attribute value, I specify the pattern as anything except the closing tag character >
- lastly, capture whatever comes after the unwanted attribute in ([^>]*)
- the replacement string <$1$2> reconstructs the tag without the unwanted attribute found in the middle.
- note: this only removes one occurrence of an unwanted attribute per tag, which is why I run the same regex twice. For example, take the HTML fragment: <p class="MSO Normal" style="Margin-TOP:3em">
A single pass removes only one of these attributes; running the regex a second time removes the other. I can't think of any reasonable case where it would need to be run more than that.
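If a tag does happen to carry more than two unwanted attributes, one option is to repeat the replacement until the text stops changing. The sketch below assumes the double-quoted branch is written as \"[^\"]*\" in the Java string literal. The class and method names (AttributeStripDemo, stripAttributes) are invented for this example.

```java
// Hypothetical demo class, shown only as an illustration.
class AttributeStripDemo {
    // Assumption: the double-quoted value branch is \"[^\"]*\" (one quote on each side).
    static final String UNWANTED_ATTR =
            "<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)"
            + "=(?:'[^']*'|\"[^\"]*\"|[^>]+)([^>]*)>";

    // Each pass removes at most one attribute per tag, so repeat until nothing changes.
    // Every successful replacement shortens the string, so the loop terminates.
    static String stripAttributes(String html) {
        String previous;
        do {
            previous = html;
            html = html.replaceAll(UNWANTED_ATTR, "<$1$2>");
        } while (!html.equals(previous));
        return html;
    }

    public static void main(String[] args) {
        String html = "<p class=\"MsoNormal\" style=\"margin:0\" lang=\"EN-US\">text</p>";
        // All three attributes are gone; the spaces that separated them remain in the tag.
        System.out.println(stripAttributes(html));
    }
}
```

Attributes that are not on the list, such as href, are left alone as long as the unwanted values are quoted.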