1. Regular expressions:
// Match <style> tags together with their content
Pattern regExStyle = Pattern.compile("<style[^>]*?>[\\s\\S]*?<\\/style>");
// Match <script> tags together with their content
Pattern regExScript = Pattern.compile("<script[^>]*?>[\\s\\S]*?<\\/script>");
// Match any HTML tag
Pattern regExHtml = Pattern.compile("<[^>]+>");
// Match HTML entities such as &nbsp;
Pattern regExSpecial = Pattern.compile("&[a-z]+;");
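To make it easier to see what each expression removes, here is a minimal sketch that applies the four patterns one at a time to tiny inputs. The class name PatternDemo, the helper remove, and the sample strings are made up for illustration; only the expressions themselves come from the text above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: applies each of the four patterns to a small input.
final class PatternDemo {
    static final Pattern STYLE = Pattern.compile("<style[^>]*?>[\\s\\S]*?<\\/style>");
    static final Pattern SCRIPT = Pattern.compile("<script[^>]*?>[\\s\\S]*?<\\/script>");
    static final Pattern HTML = Pattern.compile("<[^>]+>");
    static final Pattern SPECIAL = Pattern.compile("&[a-z]+;");

    // Removes every match of the given pattern from the input.
    static String remove(Pattern p, String input) {
        return p.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(remove(STYLE, "a<style>p{color:red}</style>b")); // prints "ab"
        System.out.println(remove(SCRIPT, "a<script>alert(1)</script>b"));  // prints "ab"
        System.out.println(remove(HTML, "<p class=\"x\">hi</p>"));          // prints "hi"
        System.out.println(remove(SPECIAL, "a&nbsp;b&amp;c"));              // prints "abc"
    }
}
```

Note that the patterns are case-sensitive as written, so an upper-case tag such as <STYLE> is not matched by regExStyle, and numeric entities such as &#160; are not matched by regExSpecial.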
2. Implementation
// Match <style> tags together with their content
Pattern regExStyle = Pattern.compile("<style[^>]*?>[\\s\\S]*?<\\/style>");
// Match <script> tags together with their content
Pattern regExScript = Pattern.compile("<script[^>]*?>[\\s\\S]*?<\\/script>");
// Match any HTML tag
Pattern regExHtml = Pattern.compile("<[^>]+>");
// Match HTML entities such as &nbsp;
Pattern regExSpecial = Pattern.compile("&[a-z]+;");
String text="<p><!-- [if !mso]>\n" +
"<style>\n" +
"v\\:* {behavior:url(#default#VML);}\n" +
"o\\:* {behavior:url(#default#VML);}\n" +
"w\\:* {behavior:url(#default#VML);}\n" +
".shape {behavior:url(#default#VML);}\n" +
"</style>\n" +
"<p class=\"MsoToc1\" style=\"tab-stops: 42.0pt right 414.8pt;\"><span lang=\"EN-US\" style=\"color: #0060ff;\"><a style=\"color: #0060ff;\" href=\"#_Toc145942199\"><span lang=\"EN-US\" style=\"font-family: 仿宋;\"><span lang=\"EN-US\">一、</span></span><span style=\"line-height: 120%; font-weight: normal; text-decoration: none;\"><span style=\"mso-tab-count: 1;\"> </span></span><span lang=\"EN-US\" style=\"font-family: 仿宋;\"><span lang=\"EN-US\">国家政策</span></span></a></span></p>\n" +
"<p class=\"MsoToc3\" style=\"tab-stops: right 414.8pt;\"><span lang=\"EN-US\" style=\"color: #0060ff;\"><a style=\"color: #0060ff;\" href=\"#_Toc145942200\"><span lang=\"EN-US\" style=\"font-family: 仿宋;\"><span lang=\"EN-US\">七部门:发布汽车稳增长工作方案,力争</span></span>2023<span lang=\"EN-US\" style=\"font-family: 仿宋;\"><span lang=\"EN-US\">年汽车销量实现</span></span>900<span lang=\"EN-US\" style=\"font-family: 仿宋;\"><span lang=\"EN-US\">万辆</span></span></a></span></p>\n" +
"<p class=\"MsoToc3\" style=\"tab-stops: right 414.8pt;\"><span lang=\"EN-US\" style=\"color: #0060ff;\"><a style=\"color: #0060ff;\" href=\"#_Toc145942201\"><span lang=\"EN-US\" style=\"font-family: 仿宋;\"><span lang=\"EN-US\">工信部:从四个方面推动新能源汽车产业高质量发展</span></span></a></span></p>\n" +
"<p class=\"MsoToc3\" style=\"tab-stops: right 414.8pt;\"><span lang=\"EN-US\" style=\"color: #0060ff;\"><a style=\"color: #0060ff;\" href=\"#_Toc145942202\"><span lang=\"EN-US\" style=\"font-family: 仿宋;\"><span lang=\"EN-US\">六部门:集体释放汽车产业政策新</span></span><span style=\"font-family: 仿宋;\">“</span><span lang=\"EN-US\" style=\"font-family: 仿宋;\"><span lang=\"EN-US\">信号</span></span><span style=\"font-family: 仿宋;\">”</span><span lang=\"EN-US\" style=\"font-family: 仿宋;\"><span lang=\"EN-US\">,多措并举推动汽车产业政策加快落地</span></span></a></span></p>\n";
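// Remove <script> blocks first, then <style> blocks, while their tags are still present
String script = regExScript.matcher(text).replaceAll("");
String style = regExStyle.matcher(script).replaceAll("");
// Remove the remaining HTML tags
String html = regExHtml.matcher(style).replaceAll("");
// Remove HTML entities such as &nbsp;
String res = regExSpecial.matcher(html).replaceAll("");
// res is the desired result
log.info(res);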
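The steps above could be wrapped in a reusable helper. The following is only a sketch under some assumptions: the class name HtmlTextExtractor and the method stripHtml are hypothetical, the sample input is a shortened stand-in for the text above, and System.out is used in place of the logger so the example runs on its own.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips style/script blocks, tags and entities from an HTML string.
final class HtmlTextExtractor {
    private static final Pattern STYLE = Pattern.compile("<style[^>]*?>[\\s\\S]*?<\\/style>");
    private static final Pattern SCRIPT = Pattern.compile("<script[^>]*?>[\\s\\S]*?<\\/script>");
    private static final Pattern HTML = Pattern.compile("<[^>]+>");
    private static final Pattern SPECIAL = Pattern.compile("&[a-z]+;");

    static String stripHtml(String text) {
        // Script and style blocks go first, while their tags can still be matched.
        String noScript = SCRIPT.matcher(text).replaceAll("");
        String noStyle = STYLE.matcher(noScript).replaceAll("");
        // Then every remaining tag, and finally entities such as &nbsp;.
        String noTags = HTML.matcher(noStyle).replaceAll("");
        return SPECIAL.matcher(noTags).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "<style>p{color:red}</style>"
                + "<p class=\"MsoToc1\"><span>National policy</span>&nbsp;2023</p>";
        System.out.println(stripHtml(sample)); // prints "National policy2023"
    }
}
```

Compiling the patterns once as static constants avoids recompiling them on every call, which may matter if the helper is called for many documents.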
3. Result after matching and replacing:
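4. Note: the HTML tag replacement must run after the style and script replacements. Otherwise the <style> and <script> tags are stripped first, the style and script patterns no longer match anything, and the CSS and JavaScript inside those tags is left in the text.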
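The effect of the ordering can be checked with a small experiment. This is a sketch only: the class name OrderDemo, the two method names, and the sample input are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo: compares the two possible orders of the replacements.
final class OrderDemo {
    static final Pattern STYLE = Pattern.compile("<style[^>]*?>[\\s\\S]*?<\\/style>");
    static final Pattern HTML = Pattern.compile("<[^>]+>");

    // Style block first, then the remaining tags.
    static String styleThenTags(String s) {
        String noStyle = STYLE.matcher(s).replaceAll("");
        return HTML.matcher(noStyle).replaceAll("");
    }

    // Tags first: the <style> tags disappear, so STYLE finds nothing afterwards.
    static String tagsThenStyle(String s) {
        String noTags = HTML.matcher(s).replaceAll("");
        return STYLE.matcher(noTags).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "<style>p{color:red}</style><p>text</p>";
        System.out.println(styleThenTags(input)); // prints "text"
        System.out.println(tagsThenStyle(input)); // prints "p{color:red}text"
    }
}
```

With the tags removed first, the CSS rule survives in the output, which is the problem the note describes.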