Is there a validating HTML parser in Java?

I need to parse HTML 4 in Java.

Ideally I'd like an implementation that is SAX compatible.

I'm aware that there are numerous HTML parsers for Java; however, they all seem to perform 'tidying'. In other words, they correct badly formed HTML. I don't want this.

My requirements are:

No tidying.

If the input document is invalid HTML, parsing should fail.

The document should be validatable against the HTML DTDs.

The parser can produce SAX2 events.

Is there a library that meets these requirements?

Solution

I think the Jericho HTML Parser can deliver at least one of your core requirements ('If the input document is invalid HTML parsing should fail.') in that it will at least tell you if there are mismatched tags or other poisonous HTML flaws, and you can choose to fail based on this information.

Try typing invalid HTML into the Jericho formatting demo, and note the 'Parser Log' at the bottom of the page.

So yes, this is doing tag tidying, but it is at least telling you about it - you can grab this information by setting a net.htmlparser.jericho.Logger (e.g. a WriterLogger or something more specific of your own creation) on your source, and then proceeding depending on what errors are logged out. This is a small example:

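// Source is net.htmlparser.jericho.Source. NullWriter is assumed here to be
// org.apache.commons.io.output.NullWriter (Apache Commons IO).
// The opening <a> tag is never closed, which is the flaw to be reported.
Source source = new Source("<a>I forgot to close my link!");

source.setLogger(myListeningLogger);

// Write the formatted output to a writer that discards it; only the log matters
source.getSourceFormatter().writeTo(new NullWriter());

// myListeningLogger has now had all the HTML flaws written to it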

In the example above, your logger's info() method is called with the string 'StartTag at (r1,c1,p0) missing required end tag', which is relatively easy to parse. You can always decide to reject any HTML that logs a message above debug level. In fact, Jericho logs almost all errors at 'info' level, with a couple at 'warn' level, so you might be tempted to create a small fork with the severities adjusted to match what you care about.
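As a rough sketch of the "reject on any logged flaw" idea, the helper below collects messages and pulls the row, column and position out of them. It uses only the JDK and does not touch Jericho itself. The class name FlawLog and its methods are hypothetical, and the "(rN,cN,pN)" pattern is assumed from the single sample message above; other messages may be shaped differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jericho: call record() from your own
// logger's info()/warn()/error() methods, then check hasFlaws() afterwards.
class FlawLog {
    // Location format assumed from the sample "StartTag at (r1,c1,p0) ..." message.
    private static final Pattern LOCATION =
            Pattern.compile("\\(r(\\d+),c(\\d+),p(\\d+)\\)");

    private final List<String> messages = new ArrayList<>();

    // Store one logged message.
    void record(String message) {
        messages.add(message);
    }

    // True if anything at all was recorded, i.e. the input should be rejected.
    boolean hasFlaws() {
        return !messages.isEmpty();
    }

    // Read-only view of everything recorded so far.
    List<String> messages() {
        return Collections.unmodifiableList(messages);
    }

    // Returns {row, column, position}, or null if the message has no location.
    static int[] locationOf(String message) {
        Matcher m = LOCATION.matcher(message);
        if (!m.find()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))
        };
    }

    public static void main(String[] args) {
        FlawLog log = new FlawLog();
        log.record("StartTag at (r1,c1,p0) missing required end tag");
        if (log.hasFlaws()) {
            for (String msg : log.messages()) {
                int[] loc = locationOf(msg);
                String where = (loc == null) ? "unknown location"
                        : "row " + loc[0] + ", column " + loc[1];
                System.out.println("Rejecting input: " + msg + " [" + where + "]");
            }
        }
    }
}
```

Keeping the collection logic separate from the logger means the same check works whichever severity a given message is logged at.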

Jericho is available on Maven Central, which is always a good sign.

Good luck!
