Python-goose: A Python Library for Article Extraction

yong472727322

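The python-goose project is a Python rewrite of Goose, an article extraction tool originally written in Java. Given any news article or article-like web page, python-goose aims to extract not only the main body of the article but also all of its metadata, images, and related information. Chinese web pages are supported.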
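Python-goose can extract the following information:

Main body text of the article
Main image of the article
Any YouTube/Vimeo videos embedded in the article
Meta description
Meta tags

Python-goose is licensed under the Apache License 2.0.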

https://github.com/grangier/python-goose
Installation
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install
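
After running the commands above, a quick way to confirm the installation is to check that the `goose` module (the name used in the import later in this post) can be imported. The following is a minimal sketch; the helper name `is_importable` is made up for illustration and is not part of python-goose.

```python
import importlib


def is_importable(module_name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "goose" is the module name used by `from goose import Goose`.
    print("goose importable: %s" % is_importable("goose"))
```

If this prints `False`, the install step most likely ran against a different Python interpreter than the one being used here.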

A simple example

>>> from goose import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
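
The introduction mentions support for Chinese web pages. Based on the usage documented in the python-goose README, this is enabled by passing a Chinese stop-words class in the configuration dict. The sketch below wraps that usage in a function; the function name `extract_chinese_article` is made up for illustration, and the imports are done lazily only so the snippet can be loaded on a machine without goose installed.

```python
def extract_chinese_article(url):
    """Extract an article from a Chinese-language page with python-goose.

    Assumes python-goose and its dependencies are installed.
    """
    # Lazy imports: the module can be loaded even if goose is missing;
    # an ImportError is raised only when the function is actually called.
    from goose import Goose
    from goose.text import StopWordsChinese

    # Chinese text needs a dedicated stop-words analyser,
    # supplied through the 'stopwords_class' config key.
    g = Goose({'stopwords_class': StopWordsChinese})
    return g.extract(url=url)
```

The returned object exposes the same attributes as in the English example, for instance `article.cleaned_text[:150]`.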

Some commonly used attributes of the article object
    # title of the article
    self.title = u""

    # stores the lovely, pure text from the article,
    # stripped of html, formatting, etc...
    # just raw text with paragraphs separated by newlines.
    # This is probably what you want to use.
    self.cleaned_text = u""

    # meta description field in HTML source
    self.meta_description = u""

    # meta lang field in HTML source
    self.meta_lang = u""

    # meta favicon field in HTML source
    self.meta_favicon = u""

    # meta keywords field in the HTML source
    self.meta_keywords = u""

    # The canonical link of this article if found in the meta data
    self.canonical_link = u""

    # holds the domain of this article we're parsing
    self.domain = u""

    # holds the top Element we think
    # is a candidate for the main body of the article
    self.top_node = None

    # holds the top Image object that
    # we think represents this article
    self.top_image = None

    # holds a set of tags that may have
    # been in the article, these are not meta keywords
    self.tags = []

    # holds a dict of all opengraph data found
    self.opengraph = {}

    # holds twitter embeds
    self.tweets = []

    # holds a list of any movies
    # we found on the page like youtube, vimeo
    self.movies = []

    # holds links found in the main article
    self.links = []

    # hold author names
    self.authors = []

    # stores the final URL that we're going to try
    # and fetch content against, this would be expanded if any
    self.final_url = u""

    # stores the MD5 hash of the url
    # to use for various identification tasks
    self.link_hash = ""

    # stores the RAW HTML
    # straight from the network connection
    self.raw_html = u""

    # the lxml Document object
    self.doc = None

    # this is the original lxml document that contains
    # a pure object from the original HTML without any cleaning
    # options done on it
    self.raw_doc = None

    # Sometimes useful to try and know when
    # the publish date of an article was
    self.publish_date = None

    # A property bucket for consumers of goose to store custom data extractions.
    self.additional_data = {}
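
A few of the attributes listed above call for some care when used: `cleaned_text` separates paragraphs with newlines, `top_image` defaults to `None` (so `article.top_image.src` fails when no image was found), and `additional_data` is a plain dict for your own values. The helpers below are a minimal sketch of working with them. All function names, the `word_count` key, and the `FakeArticle` stand-in are hypothetical and not part of python-goose; the stand-in only mimics the three attributes used here.

```python
def split_paragraphs(cleaned_text):
    """Split newline-separated text into non-empty, stripped paragraphs."""
    return [p.strip() for p in cleaned_text.split("\n") if p.strip()]


def top_image_src(article):
    """Return article.top_image.src, or None when no top image was found."""
    image = getattr(article, "top_image", None)
    return getattr(image, "src", None)


def annotate_word_count(article):
    """Store a whitespace-based word count in article.additional_data."""
    count = len(article.cleaned_text.split())
    article.additional_data["word_count"] = count
    return count


class FakeArticle(object):
    """Hypothetical stand-in with the same attribute names as above."""

    def __init__(self):
        self.cleaned_text = u"First paragraph.\n\nSecond paragraph here."
        self.top_image = None
        self.additional_data = {}


if __name__ == "__main__":
    article = FakeArticle()
    print(split_paragraphs(article.cleaned_text))
    print(top_image_src(article))
    print(annotate_word_count(article))
```

Note that a whitespace-based word count is only meaningful for languages that separate words with spaces; it would not be a useful measure for Chinese text.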

Reposted from: http://www.lxway.com/4414895162.htm