StringIndexOutOfBoundsException when JTidy parses scripts

Latest recommended article published 2023-12-28 11:42:59

jaylong35

Categories: java, htmlparser, my study record

Article link: https://blog.csdn.net/jaylong35/article/details/8533185

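Problem description:

I have recently been working on structured information extraction from web pages, using JTidy and XSLT. On some pages that contain a lot of script, JTidy failed to clean the markup and threw the exception named in the title.

It turned out that the failure happens while JTidy parses script content: some scripts contain non-standard content, so the parser cannot determine where the script ends, and that leads to the exception.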

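Solution:

At first I wanted to fix this by modifying the JTidy source, but that proved impractical: changing the source could introduce other problems, and reading through it would take a long time.

So in the end I chose preprocessing: delete the scripts before the page reaches JTidy.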
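
Before the htmlparser-based code, the shape of the preprocessing step can be sketched with nothing but the standard library. This is only an illustration and not the method used in this post: the class name `ScriptStripper` and the regular expression are assumptions made for the example, and a regular expression is a blunt tool for HTML (a script that itself contains the text `</script>` inside a string would still be cut short).

```java
import java.util.regex.Pattern;

final class ScriptStripper {
    // Matches a whole <script> or <style> element, case-insensitively and across lines.
    // The lazy ".*?" runs to the first matching close tag, so a "</span>" inside
    // the script body does not end the match early.
    private static final Pattern BLOCK =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    /** Returns html with every matched script/style element removed. */
    static String strip(String html) {
        return BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>a</p><script>var s='<span></span>';</script><p>b</p>";
        // The cleaned text would then be handed to JTidy.
        System.out.println(strip(page));
    }
}
```

The point of the design is the ordering: the cleaning tool only ever sees markup from which the troublesome script content has already been removed.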

Code:

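	import org.htmlparser.Node;
	import org.htmlparser.Parser;
	import org.htmlparser.tags.ScriptTag;
	import org.htmlparser.tags.StyleTag;
	import org.htmlparser.util.NodeList;
	import org.htmlparser.util.ParserException;

	public static String getFilterBody(String strBody) {
		// Parse the page with htmlparser
		Parser parser = Parser.createParser(strBody, "utf-8");
		String reValue = strBody;
		try {
			NodeList list = parser.parse(null);
			visitNodeList(list);
			reValue = list.toHtml();
		} catch (ParserException e1) {
			// Parsing failed: fall back to returning the input unchanged
		}
		return reValue;
	}

	// Recursively remove <script> and <style> nodes
	private static void visitNodeList(NodeList list) {
		// Iterate backwards so that removing an element does not skip its successor
		// (the forward loop skipped the node that followed each removed one)
		for (int i = list.size() - 1; i >= 0; i--) {
			Node node = list.elementAt(i);

			// Further tag types to delete can be added to this condition
			if (node instanceof ScriptTag || node instanceof StyleTag) {
				list.remove(i);
				continue;
			}
			NodeList children = node.getChildren();
			if (children != null && children.size() > 0)
				visitNodeList(children);
		}
	}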

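However, deleting the scripts ran into the same kind of problem: htmlparser also got confused while parsing script content and treated tags inside the script as ordinary tags.

For example, given '<span></span>' inside a <script> element, the '</' is taken as the end of the script, so the script node is truncated and only part of the script is removed.

I eventually found a fix online.

Setting the following two flags changes how htmlparser handles script content: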

		org.htmlparser.scanners.ScriptScanner.STRICT = false;
		org.htmlparser.lexer.Lexer.STRICT_REMARKS = false;

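Setting either one was enough for my pages, although the two flags cover different cases: ScriptScanner.STRICT controls how script content is scanned, and Lexer.STRICT_REMARKS controls how comments are terminated. The official Javadoc for each flag follows.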

org.htmlparser.scanners.ScriptScanner.STRICT = false;

/**
     * Strict parsing of CDATA flag.
     * If this flag is set true, the parsing of script is performed without
     * regard to quotes. This means that erroneous script such as:
     * <pre>
     * document.write("</script>");
     * </pre>
     * will be parsed in strict accordance with appendix
     * <a href="http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data">
     * B.3.2 Specifying non-HTML data</a> of the
     * <a href="http://www.w3.org/TR/html4/">HTML 4.01 Specification</a> and
     * hence will be split into two or more nodes. Correct javascript would
     * escape the ETAGO:
     * <pre>
     * document.write("<\/script>");
     * </pre>
     * If true, CDATA parsing will stop at the first ETAGO ("</") no matter
     * whether it is quoted or not. If false, balanced quotes (either single or
     * double) will shield an ETAGO. Because of the possibility of quotes within
     * single or multiline comments, these are also parsed. In most cases,
     * users prefer non-strict handling since there is so much broken script
     * out in the wild.
     */
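
The difference between the two modes described in that Javadoc can be illustrated with a small stand-alone sketch. It is not htmlparser's implementation: the class `EtagoScan` and its two methods are hypothetical names invented here, and the sketch ignores the comment handling the Javadoc mentions. It only shows how "stop at the first `</`" differs from "let balanced quotes shield a `</`".

```java
final class EtagoScan {
    /** Strict rule: the script body ends at the first "</", quoted or not. */
    static int strictEnd(String s) {
        int i = s.indexOf("</");
        return i < 0 ? s.length() : i;
    }

    /** Lenient rule: a "</" inside balanced single or double quotes is skipped. */
    static int lenientEnd(String s) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;            // skip the escaped character
                } else if (c == quote) {
                    quote = 0;      // closing quote reached
                }
            } else if (c == '"' || c == '\'') {
                quote = c;          // opening quote
            } else if (s.startsWith("</", i)) {
                return i;           // unquoted ETAGO ends the script body
            }
        }
        return s.length();
    }

    public static void main(String[] args) {
        String body = "var s = '<span></span>'; run();</script><p>x</p>";
        System.out.println(body.substring(0, strictEnd(body)));
        System.out.println(body.substring(0, lenientEnd(body)));
    }
}
```

Under the strict rule the script body is cut off inside the quoted string, which is the truncation described earlier; under the lenient rule the body runs up to the real closing tag.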

org.htmlparser.lexer.Lexer.STRICT_REMARKS = false;

  /**
     * Process remarks strictly flag.
     * If <code>true</code>, remarks are not terminated by ---&gt;
     * or --!&gt;, i.e. more than two dashes. If <code>false</code>,
     * a more lax (and closer to typical browser handling) remark parsing
     * is used.
     * Default <code>true</code>.
     */

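By default htmlparser parses strictly according to the HTML standard, so non-standard markup can cause errors.

With the two flags set to false, parsing is no longer strict and copes with most of the malformed script and comments found on real pages.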