Python-goose: A Python Library for Article Extraction

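The python-goose project is a Python rewrite of Goose, an article extraction tool originally written in Java. Given any news article or article-like web page, python-goose aims to extract not only the main body of the article but also its metadata and images. Chinese web pages are supported.
python-goose can extract the following information:

Main body text of the article
Main image of the article
Any YouTube/Vimeo videos embedded in the article
Meta description
Meta tags

python-goose is licensed under Apache 2.0.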

https://github.com/grangier/python-goose
Installation
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install

A simple example

>>> from goose import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
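
The fields read in the example above can be gathered into a small summary. The following is a minimal sketch and not part of python-goose itself: `summarize` and `extract_summary` are hypothetical helper names, and the `chinese` switch assumes the `StopWordsChinese` class and the `stopwords_class` configuration key described in the python-goose README. Because `top_image` may be `None` when no image is found, the sketch checks for that before reading `.src`.

```python
def summarize(article, preview_chars=150):
    """Collect a few fields from an extracted article into a plain dict.

    `article` is any object exposing the attributes used in the example
    above (title, meta_description, cleaned_text, top_image).
    """
    top_image = getattr(article, "top_image", None)
    return {
        "title": article.title,
        "meta_description": article.meta_description,
        "preview": article.cleaned_text[:preview_chars],
        # top_image stays None when no suitable image was found
        "top_image_src": top_image.src if top_image is not None else None,
    }


def extract_summary(url, chinese=False):
    """Hypothetical wrapper: fetch `url` with Goose and summarize the result."""
    # Imported lazily so that summarize() can be used without goose installed.
    from goose import Goose

    if chinese:
        # Assumption: the Chinese stop-word class shown in the python-goose README.
        from goose.text import StopWordsChinese
        g = Goose({"stopwords_class": StopWordsChinese})
    else:
        g = Goose()
    return summarize(g.extract(url=url))
```

Calling `extract_summary(url)` requires network access and an installed python-goose, while `summarize` alone works on any object that carries the same attributes.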

Commonly used attributes
    # title of the article
    self.title = u""

    # stores the lovely, pure text from the article,
    # stripped of html, formatting, etc...
    # just raw text with paragraphs separated by newlines.
    # This is probably what you want to use.
    self.cleaned_text = u""

    # meta description field in HTML source
    self.meta_description = u""

    # meta lang field in HTML source
    self.meta_lang = u""

    # meta favicon field in HTML source
    self.meta_favicon = u""

    # meta keywords field in the HTML source
    self.meta_keywords = u""

    # The canonical link of this article if found in the meta data
    self.canonical_link = u""

    # holds the domain of this article we're parsing
    self.domain = u""

    # holds the top Element we think
    # is a candidate for the main body of the article
    self.top_node = None

    # holds the top Image object that
    # we think represents this article
    self.top_image = None

    # holds a set of tags that may have
    # been in the article, these are not meta keywords
    self.tags = []

    # holds a dict of all opengraph data found
    self.opengraph = {}

    # holds twitter embeds
    self.tweets = []

    # holds a list of any movies
    # we found on the page like youtube, vimeo
    self.movies = []

    # holds links found in the main article
    self.links = []

    # hold author names
    self.authors = []

    # stores the final URL that we're going to try
    # and fetch content against, this would be expanded if any
    self.final_url = u""

    # stores the MD5 hash of the url
    # to use for various identification tasks
    self.link_hash = ""

    # stores the RAW HTML
    # straight from the network connection
    self.raw_html = u""

    # the lxml Document object
    self.doc = None

    # this is the original parsed document (a JSoup document in the Java version) that contains
    # a pure object from the original HTML without any cleaning
    # options done on it
    self.raw_doc = None

    # Sometimes useful to try and know when
    # the publish date of an article was
    self.publish_date = None

    # A property bucket for consumers of goose to store custom data extractions.
    self.additional_data = {}
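
To illustrate the `link_hash` field, an MD5 hex digest of a URL can be produced with the standard library. This is only a sketch of the general idea: `url_md5` is a hypothetical helper, and python-goose's own computation of `link_hash` may differ, for example by appending extra data to the digest.

```python
import hashlib


def url_md5(url):
    """Return the MD5 hex digest of a URL string (hypothetical helper)."""
    # hashlib operates on bytes, so the text is encoded first
    return hashlib.md5(url.encode("utf-8")).hexdigest()


digest = url_md5("https://github.com/grangier/python-goose")
# An MD5 hex digest is always 32 hexadecimal characters long
print(len(digest))
```

A fixed-length digest like this is convenient as an identifier, for instance as a cache key or file name for a fetched page.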

Reposted from: http://www.lxway.com/4414895162.htm
