calibre recipes API documentation

Mr0cheng

Tags: calibre, RSS

Article link: https://blog.csdn.net/Mr0cheng/article/details/73927556

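This document describes the API for calibre news recipes, including the basic news recipe class, its methods, and its class variables. The class `calibre.web.feeds.news.BasicNewsRecipe` provides the functionality for downloading and pre-processing RSS feed content, and it supports customization and extension. Key methods include `download()`, which downloads and pre-processes all articles; `extract_readable_article()`, which extracts the main body of an article; and `get_article_url()`, which returns the URL of an article's content. Class variables such as `articles_are_obfuscated` control whether hard-to-scrape articles get special handling, and `auto_cleanup` determines whether downloaded HTML is cleaned up automatically. The recommended reading covers the English documentation for the calibre recipes API, the source code, and Arc90's readability algorithm.

Summary generated automatically by CSDN.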

class calibre.web.feeds.news.BasicNewsRecipe(options, log, progress_reporter)

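Base class that contains the logic needed by all recipes. By overriding progressively more of the functionality in this class, you can make progressively more customized and powerful recipes.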

Methods

abort_article(msg=None)

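Call this method inside any of the preprocess methods to abort the download of the current article. It is useful for skipping articles with unsuitable content, such as video-only articles.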
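
As a minimal sketch of the pattern described above, the recipe below aborts video-only articles from inside `preprocess_raw_html()`. The class name `SkipVideoArticles`, the "has a `<video>` tag but no `<p>` tag" heuristic, and the stand-in base class are assumptions made for illustration. The stand-in exists only so the sketch can run outside calibre, and it signals the abort by raising an exception, which may differ from what calibre does internally.

```python
try:
    # Inside calibre, the real base class is importable.
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    class ArticleAborted(Exception):
        """Hypothetical stand-in for however calibre signals an abort."""

    class BasicNewsRecipe:
        """Minimal stand-in so this sketch can run outside calibre."""

        def abort_article(self, msg=None):
            raise ArticleAborted(msg or "Article download aborted")


class SkipVideoArticles(BasicNewsRecipe):
    title = "Example feed (hypothetical)"

    def preprocess_raw_html(self, raw_html, url):
        # Assumed heuristic: a page with a <video> tag and no <p> tag
        # is treated as a video-only article and skipped.
        lowered = raw_html.lower()
        if "<video" in lowered and "<p" not in lowered:
            self.abort_article("Skipping video-only article: %s" % url)
        return raw_html
```

A real recipe would usually need a site-specific test instead of this tag check.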

abort_recipe_processing(msg)

Causes the recipe download system to abort the download of this recipe and show a simple feedback message to the user.

add_toc_thumbnail(article, src)

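Call this method from populate_article_metadata with the src attribute of an <img> tag in the current article. The linked image is used as the thumbnail that represents the article in the table of contents. Currently only the Kindle displays these thumbnails.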

adeify_images(soup)

This method works around problems with images in EPUB files when they are viewed in Adobe Digital Editions. Call it from within postprocess_html().

canonicalize_internal_url(url, is_link=True)

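Return a set of canonical representations of url. The default implementation uses only the server hostname and the path of the URL, ignoring query parameters, fragments, and so on. See urlparse.urlparse() (urllib.parse.urlparse() in Python 3).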

is_link
True: the URL comes from an internal link in an HTML file.
False: the URL is the one used to download an article.
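
The default behaviour described above can be sketched as a standalone function. This is an illustration and not calibre's actual implementation: the name `canonical_forms`, the use of `netloc` for the server part, and the stripping of a trailing slash are all assumptions.

```python
from urllib.parse import urlparse


def canonical_forms(url):
    """Return a frozenset of (server, path) pairs identifying the page."""
    parts = urlparse(url)
    # The query string and fragment are deliberately dropped, so URLs that
    # differ only in those parts map to the same canonical form.
    # Stripping the trailing slash is an assumption of this sketch.
    return frozenset({(parts.netloc, parts.path.rstrip("/"))})


same_page = (
    canonical_forms("https://example.com/news/story?ref=rss#top")
    == canonical_forms("https://example.com/news/story/")
)
```

Here `same_page` is true because both URLs share the same server and path.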

cleanup()

Called after all the work is done. Use it for any cleanup, such as logging out of a site that required a login.

clone_browser(br)

Used to support multi-threaded downloads.

Clone the browser br. Cloned browsers are used for multi-threaded downloads, since mechanize is not thread safe. The default cloning routines should capture most browser customization, but if you do something exotic in your recipe, you should override this method in your recipe and clone manually.

Cloned browser instances use the same, thread-safe CookieJar by default, unless you have customized cookie handling.

default_cover(cover_file)

Create a default cover for recipes that do not have one.

download()

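Download and pre-process all articles from the feeds in this recipe. This method should be called only once on a particular recipe instance; calling it more than once leads to undefined behavior. Returns: the path to index.html.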

extract_readable_article(html, url)

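Extract the main article content from html and return it as a 2-tuple (article_html, extracted_title). Based on the readability algorithm written by Arc90. See the recommended reading for details.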

get_article_url(article)

Override in a subclass to customize extraction of the URL that points to the content for each article. Return the article URL. It is called with article, an object representing a parsed article from a feed. See feedparser. By default it looks for the original link (for feeds syndicated via a service like feedburner or pheedo) and if found, returns that or else returns article.link.
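
One common reason to override `get_article_url()` is to clean up the link before it is fetched. The helper below is a sketch of that idea: it removes `utm_*` tracking parameters from a URL. The function name `strip_tracking_params` and the choice of which parameters count as tracking are assumptions. The comment shows how it might be called from an override.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking_params(url):
    """Return url without query parameters whose names start with 'utm_'."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith("utm_")
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Possible use inside a recipe subclass (not executed here):
#
#     def get_article_url(self, article):
#         url = BasicNewsRecipe.get_article_url(self, article)
#         return strip_tracking_params(url) if url else url

cleaned = strip_tracking_params("https://example.com/a?utm_source=rss&id=3")
```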

get_browser(*args, **kwargs)

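Return a browser instance used to fetch documents from the web. By default it returns a browser instance that supports cookies, ignores robots.txt, handles refreshes, and has a Mozilla Firefox user agent.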

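If your recipe requires logging in first, override this method in your subclass. For example, the following code is used in the New York Times recipe to log in for full access.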

def get_browser(self):
    # Start from the default browser provided by the base class.
    br = BasicNewsRecipe.get_browser(self)
    # Log in only when both credentials were supplied.
    if self.username is not None and self.password is not None:
        br.open('https://www.nytimes.com/auth/login')
        br.select_form(name='login')
        br['USERID'] = self.username
        br['PASSWORD'] = self.password
        br.submit()
    return br

get_cover_url()

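Return the URL of a cover image, or None. By default it returns the value of the member variable cover_url, which is normally None. If you want your recipe to download a cover for the e-book, override this method in your subclass, or set the cover_url member variable before this method is called.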
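
Some sites publish a cover image at a predictable, date-based address, which makes an override of `get_cover_url()` straightforward. The sketch below builds such a URL. The URL pattern, the host `example.com`, and the helper name `cover_url_for` are hypothetical; a real recipe would substitute the pattern its source actually uses.

```python
import datetime

# Hypothetical URL pattern; the date is formatted as YYYY-MM-DD.
COVER_PATTERN = "https://example.com/covers/{:%Y-%m-%d}.jpg"


def cover_url_for(day):
    """Return the assumed cover URL for the given date."""
    return COVER_PATTERN.format(day)


# Possible use inside a recipe subclass (not executed here):
#
#     def get_cover_url(self):
#         return cover_url_for(datetime.date.today())

example_cover = cover_url_for(datetime.date(2024, 6, 18))
```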

get_extra_css()

By default returns self.extra_css. Override this method if you want to generate the extra_css programmatically.

get_feeds()

Return a list of RSS feeds to fetch for this profile. Each element of the list must be a 2-element tuple of the form (title, url). If title is None or an empty string, the title from the feed is used. This method is useful if your recipe needs to do some processing to figure out the list of feeds to download. If so, override in your subclass.
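
The return shape described above, a list of `(title, url)` tuples, can be sketched as follows. The base URL, the section slugs, and the helper name `build_feeds` are hypothetical. A title of `None` means the title from the feed itself is used.

```python
BASE_URL = "https://example.com/rss"  # hypothetical site


def build_feeds(sections):
    """Turn (title, slug) pairs into the (title, url) tuples get_feeds() returns."""
    return [(title, "%s/%s.xml" % (BASE_URL, slug)) for title, slug in sections]


# Possible use inside a recipe subclass (not executed here):
#
#     def get_feeds(self):
#         return build_feeds([("World", "world"), (None, "tech")])

feeds = build_feeds([("World", "world"), (None, "tech")])
```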

get_masthead_title()

Override in subclass to use something other than the recipe title

get_masthead_url()

Return a URL to the masthead image for this issue or None. By default it returns the value of the member self.masthead_url which is normally None. If you want your recipe to download a masthead for the e-book override this method in your subclass, or set the member variable self.masthead_url before this method is called. Masthead images are used in Kindle MOBI files.

get_obfuscated_article(url)

If you set articles_are_obfuscated this method is called with every article URL. It should return the path to a file on the filesystem that contains the article HTML. That file is processed by the recursive HTML fetching engine, so it can contain links to