Introduction:
HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy to use JavaBeans. It is a fast, robust and well tested package.
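In short, it is a library for parsing HTML, with two main purposes: (1) transformation (converting a page into simpler HTML) and (2) extraction (pulling resources out of web pages).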
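To use the library, add either htmllexer.jar or htmlparser.jar to your project. The former is the lower-level, lightweight lexer; the latter is built on top of it and adds more functionality.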
- When to use htmllexer: If your application requires only modest structural knowledge of the page, and is primarily concerned with individual, isolated nodes, you should consider using the lightweight lexer.
- When to use htmlparser: If your application requires knowledge of the nested structure of the page, for example processing tables, you will probably want to use the full parser.
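
To make the difference between the two cases concrete, here is a minimal JDK-only sketch. It does not use the HTML Parser API: the class `LinearVsNestedSketch` and its methods are hypothetical names made up for illustration, and the regex tokenizer is a toy that only handles simple, well-formed markup. It shows that collecting link targets needs only a flat pass over isolated nodes (the lexer-style case), while grouping table cells by row requires tracking which element contains which (the parser-style case).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; it does not use or mirror the HTML Parser API.
final class LinearVsNestedSketch {

    // A token is either a whole tag ("<td>") or a run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^<]+");
    // Captures the href value of an opening <a ...> tag (double-quoted values only).
    private static final Pattern HREF =
            Pattern.compile("(?i)^<a\\s[^>]*href=\"([^\"]*)\"");

    // Linear pass: split the page into a flat sequence of tokens.
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String t = m.group().trim();
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Lexer-style task: each link is handled in isolation, no nesting needed.
    static List<String> linkTargets(List<String> tokens) {
        List<String> links = new ArrayList<>();
        for (String t : tokens) {
            Matcher m = HREF.matcher(t);
            if (m.find()) {
                links.add(m.group(1));
            }
        }
        return links;
    }

    // Parser-style task: cell text must be grouped under the row that contains it,
    // so the code has to remember which <tr> and <td> are currently open.
    static List<List<String>> tableRows(List<String> tokens) {
        List<List<String>> rows = new ArrayList<>();
        List<String> currentRow = null;
        boolean inCell = false;
        for (String t : tokens) {
            String lower = t.toLowerCase();
            if (lower.startsWith("<tr")) {
                currentRow = new ArrayList<>();
            } else if (lower.startsWith("</tr")) {
                if (currentRow != null) {
                    rows.add(currentRow);
                }
                currentRow = null;
            } else if (lower.startsWith("<td")) {
                inCell = true;
            } else if (lower.startsWith("</td")) {
                inCell = false;
            } else if (inCell && currentRow != null && !t.startsWith("<")) {
                currentRow.add(t);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs\">Docs</a></p>"
                + "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>";
        List<String> tokens = tokenize(html);
        System.out.println(linkTargets(tokens)); // prints [/docs]
        System.out.println(tableRows(tokens));   // prints [[a, b], [c]]
    }
}
```

The link extraction never looks at neighbouring tokens, which is why a lightweight lexer is enough for that kind of job. The table extraction already needs extra state for open rows and cells even in this toy form, and that bookkeeping is what the full parser is meant to take over.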