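You may adapt an existing open-source project, such as Boilerpipe or html2article, or write your own.
Requirements:
1. General-purpose main-text extraction that works across websites (not tuned to a single site), with accuracy above 90%
2. Input: HTML; output: the article's paragraphs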
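As a rough illustration of the "HTML in, paragraphs out" interface in requirement 2, here is a minimal sketch that uses only the Python standard library. It is not the algorithm of any project listed in this post: it splits the page into text blocks, then drops blocks that are short or consist mostly of link text. The function name extract_paragraphs and the thresholds min_chars and max_link_density are hypothetical and untuned; a heuristic this simple would not be expected to reach 90% accuracy without further work.

```python
from html.parser import HTMLParser

# Tags whose contents are never article text.
SKIP_TAGS = {"script", "style", "noscript", "template"}
# Tags treated as block boundaries (an assumed, non-exhaustive list).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "ul", "ol", "table", "br",
              "h1", "h2", "h3", "h4", "h5", "h6", "article", "section",
              "blockquote", "pre", "header", "footer", "nav", "aside"}


class _BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []        # list of (text, link_chars)
        self._parts = []        # text pieces of the block being built
        self._link_chars = 0    # characters of the current block inside <a>
        self._skip_depth = 0    # > 0 while inside a SKIP_TAGS element
        self._link_depth = 0    # > 0 while inside an <a> element

    def flush(self):
        """Close the current block and store it if it contains any text."""
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._parts.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())


def extract_paragraphs(html, min_chars=40, max_link_density=0.3):
    """Return text blocks that look like article paragraphs.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    collector = _BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()  # store any trailing text after the last block tag
    paragraphs = []
    for text, link_chars in collector.blocks:
        if len(text) < min_chars:
            continue  # too short: likely a menu item, caption, or footer
        if link_chars / len(text) > max_link_density:
            continue  # mostly link text: likely navigation
        paragraphs.append(text)
    return paragraphs
```

Calling extract_paragraphs(html_string) returns a list of strings, one per retained block. Measuring accuracy against the 90% target would still require a labeled set of pages from many different sites.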
References:
cx-extractor: https://code.google.com/archive/p/cx-extractor/
Boilerpipe: http://code.google.com/p/boilerpipe/
Html2Article:
http://www.cnblogs.com/jasondan/p/3497757.html
https://github.com/stanzhai/Html2Article
Python version: https://github.com/zhuyf8899/Html2Article
python-goose: https://github.com/grangier/python-goose
Readability (Python version): https://github.com/timbertson/python-readability
newspaper: https://github.com/codelucas/newspaper
arex: https://github.com/ahkimkoo/arex