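goose3 is mainly used to extract the key information from news stories and articles.
Goose will try to extract the following information:
Main text of the article
Article images
YouTube / Vimeo videos embedded in the article
Meta description
Tags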
Install with pip:
pip install goose3
Usage:
>>> from goose3 import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of A