Modifying HTML with htmlparser: HTML Parser

Latest recommended article published on 2021-06-03 03:14:37

吃瓜少年藤井水

Tags: modifying HTML with htmlparser

HTML Parser

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy-to-use JavaBeans. It is a fast, robust and well-tested package.

Welcome to the homepage of HTMLParser, a super-fast real-time parser for real-world HTML. What has attracted most developers to HTMLParser is its simplicity of design, its speed, and its ability to handle streaming real-world HTML.

The two fundamental use-cases handled by the parser are extraction and transformation (the synthesis use-case, where HTML pages are created from scratch, is better handled by other tools closer to the source of data). While prior versions concentrated on data extraction from web pages, Version 1.4 of the HTMLParser has substantial improvements in the area of transforming web pages, with simplified tag creation and editing, and verbatim toHtml() method output.

In general, to use the HTMLParser you will need to be able to write code in the Java programming language. Although some example programs are provided that may be useful as they stand, it's more than likely you will need (or want) to create your own programs or modify the ones provided to match your intended application.

To use the library, you will need to add either htmllexer.jar or htmlparser.jar to your classpath when compiling and running. The htmllexer.jar provides low-level access to generic string, remark and tag nodes on the page in a linear, flat, sequential manner. The htmlparser.jar, which includes the classes found in htmllexer.jar, provides access to a page as a sequence of nested, differentiated tags containing string, remark and other tag nodes. So where the output from calls to the lexer nextNode() method might be:

    <html>
    <head>
    <title>
    "Welcome"
    </title>
    </head>
    <body>
    etc...

The output from the parser NodeIterator would nest the tags as children of the <html>, <head> and other nodes (here represented by indentation):

    <html>
        <head>
            <title>
                "Welcome"
            </title>
        </head>
        <body>
            etc...
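
To make the "linear, flat, sequential" idea concrete, here is a minimal sketch in plain Java. It is a toy and not the HTMLParser lexer: the class name FlatLexerSketch and the method tokenize are hypothetical, it uses only the standard library, and it ignores remarks, attributes containing '>' and other real-world cases. It only shows what a flat stream of tag and string nodes looks like.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only: NOT the HTMLParser lexer, just the "flat stream of nodes" idea. */
class FlatLexerSketch {

    /** Splits markup into tag nodes ("<...>") and string nodes (text between tags), in document order. */
    static List<String> tokenize(String html) {
        List<String> nodes = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            if (html.charAt(pos) == '<') {
                int close = html.indexOf('>', pos);
                // An unterminated tag is kept as-is rather than rejected.
                int end = (close < 0) ? html.length() : close + 1;
                nodes.add(html.substring(pos, end));
                pos = end;
            } else {
                int next = html.indexOf('<', pos);
                int end = (next < 0) ? html.length() : next;
                String text = html.substring(pos, end).trim();
                // Whitespace-only runs between tags are dropped in this sketch.
                if (!text.isEmpty()) {
                    nodes.add(text);
                }
                pos = end;
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Welcome</title></head><body>Hi</body></html>";
        // Prints one node per line, with no indication of nesting.
        for (String node : tokenize(page)) {
            System.out.println(node);
        }
    }
}
```

Every node comes out at the same level, so any knowledge of which tag contains which has to be rebuilt by the caller.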

The parser attempts to balance opening tags with ending tags to present the structure of the page, while the lexer simply spits out nodes. If your application requires only modest structural knowledge of the page, and is primarily concerned with individual, isolated nodes, you should consider using the lightweight lexer. But if your application requires knowledge of the nested structure of the page, for example processing tables, you will probably want to use the full parser.
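
The balancing step can be sketched in plain Java as a stack of open tags. This is only an illustration of the general idea, not HTMLParser's actual algorithm: NestingSketch, Node, nest and render are hypothetical names, the input is assumed to be an already flat list of attribute-free tokens, and stray end tags are simply ignored. It assumes Java 11 or later (for String.repeat and List.of).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy illustration only: NOT HTMLParser's algorithm, just the idea of balancing tags into a tree. */
class NestingSketch {

    /** A node is either a tag with children or a text leaf (children stays empty). */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    /** Builds a tree from a flat node list; tokens look like "<p>", "</p>" or plain text. */
    static Node nest(List<String> flatNodes) {
        Node root = new Node("#document");
        Deque<Node> open = new ArrayDeque<>();
        open.push(root);
        for (String token : flatNodes) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                // Close only when the end tag matches the innermost open tag;
                // a stray end tag is ignored in this sketch.
                if (open.size() > 1 && open.peek().label.equals(name)) {
                    open.pop();
                }
            } else if (token.startsWith("<")) {
                Node tag = new Node(token.substring(1, token.length() - 1));
                open.peek().children.add(tag);
                open.push(tag);
            } else {
                // Text becomes a leaf under the innermost open tag.
                open.peek().children.add(new Node(token));
            }
        }
        return root;
    }

    /** Renders the tree with one level of indentation per nesting depth. */
    static String render(Node node, int depth) {
        StringBuilder out = new StringBuilder();
        out.append("  ".repeat(depth)).append(node.label).append('\n');
        for (Node child : node.children) {
            out.append(render(child, depth + 1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> flat = List.of(
                "<table>", "<tr>", "<td>", "A", "</td>", "<td>", "B", "</td>", "</tr>", "</table>");
        System.out.print(render(nest(flat), 0));
    }
}
```

With the tree in hand, a question such as "which cells belong to this row" becomes a walk over one node's children, which is why table processing is a natural fit for the full parser rather than the lexer.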
