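Python-goose: a Python library for article extraction
Python-goose is a Python rewrite of Goose, an article extraction tool originally written in Java. Given any news article or article-like web page, it aims to extract not only the main body of the article but also all of its metadata and images. Chinese-language pages are supported.
Python-goose can extract the following:
The main text of the article
The main image of the article
Any YouTube/Vimeo videos embedded in the article
Meta description
Meta tags
Python-goose is licensed under Apache 2.0.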
https://github.com/grangier/python-goose
Installation
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install
A simple example
>>> from goose import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) – Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
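The session above assumes that every field is populated. Since top_image defaults to None (see the attribute list further down), a small guard can be useful for pages without a main image. The following is a minimal sketch, not part of python-goose: preview_article, FakeArticle and FakeImage are hypothetical names, and the two stand-in classes exist only so the helper can be tried without a network connection.

```python
class FakeImage(object):
    """Hypothetical stand-in for the image object held in article.top_image."""

    def __init__(self, src):
        self.src = src


class FakeArticle(object):
    """Hypothetical stand-in exposing only the attributes used in the session above."""

    def __init__(self, title=u"", meta_description=u"", cleaned_text=u"", top_image=None):
        self.title = title
        self.meta_description = meta_description
        self.cleaned_text = cleaned_text
        self.top_image = top_image


def preview_article(article, length=150):
    """Collect the fields shown above without assuming a top image was found."""
    image = getattr(article, "top_image", None)
    return {
        "title": article.title,
        "meta_description": article.meta_description,
        # Slicing is safe even when the text is shorter than `length`.
        "preview": article.cleaned_text[:length],
        # top_image stays None when no image was detected, so guard before reading .src.
        "top_image_src": image.src if image is not None else None,
    }
```

With a real extraction the call would presumably look like preview_article(g.extract(url=url)), since the helper only reads attributes that the session above already uses.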
Some commonly used attributes
self.title = u""
# stores the lovely, pure text from the article,
# stripped of html, formatting, etc...
# just raw text with paragraphs separated by newlines.
# This is probably what you want to use.
self.cleaned_text = u""
# meta description field in HTML source
self.meta_description = u""
# meta lang field in HTML source
self.meta_lang = u""
# meta favicon field in HTML source
self.meta_favicon = u""
# meta keywords field in the HTML source
self.meta_keywords = u""
# The canonical link of this article if found in the meta data
self.canonical_link = u""
# holds the domain of this article we're parsing
self.domain = u""
# holds the top Element we think
# is a candidate for the main body of the article
self.top_node = None
# holds the top Image object that
# we think represents this article
self.top_image = None
# holds a set of tags that may have
# been in the article, these are not meta keywords
self.tags = []
# holds a dict of all opengraph data found
self.opengraph = {}
# holds twitter embeds
self.tweets = []
# holds a list of any movies
# we found on the page like youtube, vimeo
self.movies = []
# holds links found in the main article
self.links = []
# hold author names
self.authors = []
# stores the final URL that we're going to try
# and fetch content against, this would be expanded if any
self.final_url = u""
# stores the MD5 hash of the url
# to use for various identification tasks
self.link_hash = ""
# stores the RAW HTML
# straight from the network connection
self.raw_html = u""
# the lxml Document object
self.doc = None
# this is the original lxml document that contains
# a pure object from the original HTML without any cleaning
# options done on it
self.raw_doc = None
# Sometimes useful to try and know when
# the publish date of an article was
self.publish_date = None
# A property bucket for consumers of goose to store custom data extractions.
self.additional_data = {}
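Many of the attributes above are plain strings or lists, which makes them easy to store. The sketch below shows one possible way to copy them into a JSON-friendly dict. It is an illustration only: article_to_record, TEXT_FIELDS, LIST_FIELDS and StubArticle are hypothetical names, and fields that hold objects (top_node, top_image, doc, raw_doc, movies) are deliberately left out because they would need their own conversion.

```python
import json

# Attributes that default to a string in the list above.
TEXT_FIELDS = (
    "title", "cleaned_text", "meta_description", "meta_lang",
    "meta_keywords", "canonical_link", "domain", "final_url", "link_hash",
)
# Attributes that default to a list in the list above.
LIST_FIELDS = ("tags", "authors")


def article_to_record(article):
    """Copy the simple attributes of an article-like object into a dict."""
    record = {}
    for name in TEXT_FIELDS:
        record[name] = getattr(article, name, u"")
    for name in LIST_FIELDS:
        # list() also copes with the value being a set or tuple.
        record[name] = list(getattr(article, name, []))
    return record


class StubArticle(object):
    """Hypothetical stand-in so the sketch runs without fetching a page."""
    title = u"Example title"
    cleaned_text = u"First paragraph.\nSecond paragraph."
    tags = ["b", "a"]


print(json.dumps(article_to_record(StubArticle()), sort_keys=True))
```

Attributes missing from the object fall back to an empty string or empty list, so every record has the same keys regardless of what was found on the page.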