
readabilityBUNDLE

Main content extraction from HTML, written in Java. It extracts the article text without the surrounding clutter.

Extracting the main article content from HTML pages remains a challenging open problem. Many open-source algorithms and implementations are available. The aim of this project is to bring together some of the best content extraction algorithms implemented in Java.

My focus is mainly on tuning parameters and customizing / modifying these algorithms' features according to my requirements.

readabilityBUNDLE performs on par with the other algorithms, plus the extras listed below.

What's extra in readabilityBUNDLE

- Preserves the HTML tags in the extracted content.
- Keeps all possible images in the content instead of picking a single best image.
- Keeps all available videos.
- Better extraction of li, ul and ol tags.
- Normalizes the extracted content.
- Incorporates the three most popular extraction algorithms; you can choose one based on your requirements.
- Provides a way to append the next pages' extracted content and create a consolidated output.
- Adds many cleaner / formatter measures.
- Makes some core changes to the algorithms.
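To illustrate the "content normalization" item above, here is a minimal, hypothetical sketch of the kind of cleanup meant: collapsing runs of whitespace (including non-breaking spaces) and dropping empty lines left over after tag stripping. This is not the project's actual implementation.

```java
// Hypothetical content-normalization sketch, not readabilityBUNDLE's code:
// collapse whitespace runs, trim each line, and drop lines that end up empty.
public class NormalizeSketch {
    public static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\n")) {
            // \u00A0 is the non-breaking space often left behind by HTML.
            String collapsed = line.replaceAll("[\\s\\u00A0]+", " ").trim();
            if (!collapsed.isEmpty()) {
                if (out.length() > 0) out.append('\n');
                out.append(collapsed);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Hello   world \n\n  second\u00A0line  "));
    }
}
```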

The main challenge I faced was extracting the main content while keeping all the images, videos, HTML tags and some related div tags, which most algorithms use as content / non-content indicators and therefore strip.

readabilityBUNDLE borrows much code and many concepts from Project Goose, Snacktory and Java-Readability. My intention was just to fine-tune / modify the algorithms to work with my requirements.

Some HTML pages work very well with a particular algorithm and some do not. This is the main reason I put all the available algorithms under one roof: you can choose the algorithm that best suits you.

Author citations appear in each Java file itself.

Dependency Projects

Usage

You need to specify which extraction algorithm to use. The three extraction algorithms are ReadabilitySnack, ReadabilityCore and ReadabilityGoose; the default is ReadabilitySnack.
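A string-keyed selection with a default, as described above, could be sketched like this. The dispatch shape is an assumption about how ContentExtractor might work internally, and the extractor bodies are placeholders, not the bundled algorithms.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of string-keyed algorithm selection with a default.
// The per-algorithm functions here are stubs, not the real extractors.
public class ExtractorDispatch {
    static final Map<String, Function<String, String>> ALGORITHMS = Map.of(
        "ReadabilitySnack", html -> "snack:" + html.length(),
        "ReadabilityCore",  html -> "core:"  + html.length(),
        "ReadabilityGoose", html -> "goose:" + html.length()
    );

    public static String extract(String html, String algorithm) {
        // Unknown names fall back to the default, ReadabilitySnack.
        return ALGORITHMS
            .getOrDefault(algorithm, ALGORITHMS.get("ReadabilitySnack"))
            .apply(html);
    }
}
```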

Without next-page finding

Sample Usage

```java
Article article = new Article();
ContentExtractor ce = new ContentExtractor();
HtmlFetcher htmlFetcher = new HtmlFetcher();

// Fetch the page HTML.
String html = htmlFetcher.getHtml("http://blogmaverick.com/2012/11/19/what-i-really-think-about-facebook/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Counterparties+%28Counterparties%29", 0);

// Extract with the chosen algorithm.
article = ce.extractContent(html, "ReadabilitySnack");

System.out.println("Content : " + article.getCleanedArticleText());
```

With next-page HTML sources

If you also need to extract and append content from next pages:

You can use [NextPageFinder](https://github.com/srijiths/NextPageFinder) to find all the next-page links.

Fetch the HTML of each next page over the network as a List of Strings.

Pass it to the content extractor like:

```java
article = ce.extractContent(firstPageHtml, extractionAlgorithm, nextPagesHtmlSources);
```

Build

Using Maven: `mvn clean package`

License
