Web Page Main Content Extraction: The Line-Block Distribution Algorithm & Readability

Background

When crawling search engines such as Baidu, Sogou, and Bing, the main text of the detail pages comes from so many different sources that no simple generic rule can match it all; dedicated extraction algorithms are needed.


This article introduces two such algorithms: the line-block distribution algorithm and Readability.

Line-Block Distribution Algorithm

Algorithm Workflow

(Algorithm flowchart omitted.)

Algorithm Rationale

  • Each line of HTML expresses a complete unit of meaning;
  • The main-content markup sits physically close together;
  • A line of main-content markup consists mostly of text;
  • A line of main-content markup contains a comparatively large amount of text outside HTML tags;
  • Hyperlinks account for only a small fraction of a main-content line's length.

Algorithm Characteristics

  • Works whether or not the HTML is well-formed;
  • No DOM tree needs to be built, and the algorithm is independent of HTML tags;
  • Only the line-block distribution function is needed to extract the main content (see the sketch after this list);
  • Only a single pass over the tag-stripped text is required, so processing is efficient;
  • Link clusters and advertising blocks are easy to remove;
  • Good extensibility: generic extraction is statistical, supplemented by per-site rules, combining statistics with rules.
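
Before the full implementation below, here is a minimal sketch of the line-block distribution function on its own (my illustration, not code from the original post): sum the non-whitespace characters over a sliding window of consecutive lines, and the main content shows up as a plateau of large values.

import re


def block_lengths(lines, block_width=3):
    """Sum of non-whitespace characters over each sliding window of
    block_width consecutive lines; dense article text produces a run
    of large values in this distribution."""
    stripped = [re.sub(r'\s+', '', line) for line in lines]
    return [sum(len(stripped[j]) for j in range(i, i + block_width))
            for i in range(len(stripped) - block_width + 1)]


toy = ['home | about | login', '',
       'A long paragraph of real article text, dense with characters.',
       'Another dense sentence that belongs to the body of the article.',
       '', 'Copyright 2020']
print(block_lengths(toy))  # peaks on the windows covering the two article lines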

Sample Code

# -*- encoding: utf-8 -*-
import re
import requests
from collections import Counter
from bs4 import BeautifulSoup


def get_html(url):
    """Fetch the HTML of a page."""
    try:
        obj = requests.get(url)
        if obj.status_code == 200:
            # Decode the raw bytes explicitly to avoid mojibake in
            # Chinese text (assumes a UTF-8 page).
            return str(obj.content, 'utf-8')
        return None
    except (requests.RequestException, UnicodeDecodeError):
        return None


def filter_tags(html_str, flag):
    """Strip various kinds of tags from an HTML string.
    :param html_str: HTML string
    :param flag: whether to remove all remaining tags
    :return: the filtered HTML string
    """
    html_str = re.sub(r'(?is)<!DOCTYPE.*?>', '', html_str)
    html_str = re.sub(r'(?is)<!--.*?-->', '', html_str)  # remove HTML comments
    html_str = re.sub(r'(?is)<script.*?>.*?</script>', '', html_str)  # remove JavaScript
    html_str = re.sub(r'(?is)<style.*?>.*?</style>', '', html_str)  # remove CSS
    html_str = re.sub(r'(?is)<a\s.*?>.*?</a>', '', html_str)  # remove <a> elements
    html_str = re.sub(r'(?is)<li[^nk].*?>.*?</li>', '', html_str)  # remove <li> but not <link>
    if flag:
        html_str = re.sub(r'(?is)<.*?>', '', html_str)  # remove all remaining tags
    return html_str


def extract_text_by_block(html_str):
    """Extract the main content based on text-block density.
    :param html_str: page source
    :return: main content text
    """
    html = filter_tags(html_str, True)
    lines = html.split('\n')
    block_width = 3
    threshold = 86
    index_distribution = []
    # Character count of each block of block_width consecutive lines.
    for i in range(0, len(lines) - block_width):
        word_num = 0
        for j in range(i, i + block_width):
            line = re.sub(r'\s+', '', lines[j])
            word_num += len(line)
        index_distribution.append(word_num)
    start_index = -1
    end_index = -1
    bool_start = False
    bool_end = False
    article_content = []
    for i in range(0, len(index_distribution) - block_width):
        # A block above the threshold, followed by non-empty blocks,
        # marks the start of a content region.
        if index_distribution[i] > threshold and bool_start is False:
            if index_distribution[i + 1] != 0 or index_distribution[i + 2] != 0 or index_distribution[i + 3] != 0:
                bool_start = True
                start_index = i
                continue
        if bool_start is True:
            # An empty block at or right after i marks the end of the region.
            if index_distribution[i] == 0 or index_distribution[i + 1] == 0:
                end_index = i
                bool_end = True
        tmp = []
        if bool_end is True:
            for index in range(start_index, end_index + 1):
                line = lines[index]
                if len(line.strip()) < 5:
                    continue
                tmp.append(line.strip() + '\n')
            tmp_str = ''.join(tmp)
            # Skip copyright / footer regions.
            if 'Copyright' in tmp_str or '版权所有' in tmp_str:
                continue
            article_content.append(tmp_str)
            bool_start = False
            bool_end = False
    return ''.join(article_content)


def extract_text_by_tag(html_str, article):
    """Locate the block-density result within the full page and return
    the text of its parent tag, to improve extraction accuracy.
    :param html_str: page HTML
    :param article: main content extracted by text-block density
    :return: main content text
    """
    lines = filter_tags(html_str, False)
    soup = BeautifulSoup(lines, 'lxml')
    p_list = soup.find_all('p')
    p_in_article = []
    for p in p_list:
        if p.text.strip() in article:
            p_in_article.append(p.parent)
    # The parent holding the most matching <p> tags is taken as the
    # main content container.
    most_common = Counter(p_in_article).most_common(1)[0]
    article_soup = BeautifulSoup(str(most_common[0]), 'xml')
    return remove_space(article_soup.text)


def remove_space(text):
    """Remove tabs, carriage returns, line feeds and form feeds."""
    return re.sub(r'[\t\r\n\f]', '', text)


def extract(url):
    """Extract the main content.
    :param url: page URL
    :return: main content text
    """
    html_str = get_html(url)
    if html_str is None:
        return None
    article_temp = extract_text_by_block(html_str)
    try:
        article = extract_text_by_tag(html_str, article_temp)
    except Exception:
        # Fall back to the block-density result if tag matching fails.
        article = article_temp
    return article


if __name__ == '__main__':
    url = 'http://www.eeo.com.cn/2020/0215/376405.shtml'
    text = extract(url)
    print(text)

Readability

About Readability

Readability is a distinctive "read-it-later" bookmarking service. Beyond letting you save articles you come across, its signature feature is automatically and intelligently stripping a page of its unimportant elements and reformatting it, presenting only the clean main content for a better reading experience. In addition to plugins for the mainstream browsers, it offers iOS/Android/Kindle apps, so saved articles can be synced to your phone and read comfortably anywhere.

How It Works

Readability walks the DOM and reassembles the page's content by adding or subtracting weight based on tags and common text patterns. Let's take a quick look at how the algorithm is implemented. First, it defines a set of regular expressions:

regexps: {
        unlikelyCandidates:    /combx|comment|community|disqus|extra|foot|header|menu|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup|tweet|twitter/i,
        okMaybeItsACandidate:  /and|article|body|column|main|shadow/i,
        positive:              /article|body|content|entry|hentry|main|page|pagination|post|text|blog|story/i,
        negative:              /combx|comment|com-|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget/i,
        extraneous:            /print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single/i,
        divToPElements:        /<(a|blockquote|dl|div|img|ol|p|pre|table|ul)/i,
        replaceBrs:            /(<br[^>]*>[ \n\r\t]*){2,}/gi,
        replaceFonts:          /<(\/?)font[^>]*>/gi,
        trim:                  /^\s+|\s+$/g,
        normalize:             /\s{2,}/g,
        killBreaks:            /(<br\s*\/?>(\s|&nbsp;?)*){1,}/g,
        videos:                /http:\/\/(www\.)?(youtube|vimeo)\.com/i,
        skipFootnoteLink:      /^\s*(\[?[a-z0-9]{1,2}\]?|^|edit|citation needed)\s*$/i,
        nextLink:              /(next|weiter|continue|>([^\|]|$)|»([^\|]|$))/i, // Match: next, continue, >, >>, » but not >|, »| as those usually mean last.
        prevLink:              /(prev|earl|old|new|<|«)/i
},
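
As an aside, replaceBrs and killBreaks handle <br>-based layout before any scoring: runs of two or more <br> tags become paragraph boundaries. A rough Python equivalent of that preprocessing step (my own sketch, not code from Readability):

import re

REPLACE_BRS = re.compile(r'(<br[^>]*>[ \n\r\t]*){2,}', re.IGNORECASE)


def replace_brs(html):
    """Turn runs of two or more <br> tags into a paragraph break,
    mirroring Readability's replaceBrs regex."""
    return REPLACE_BRS.sub('</p><p>', html)


print(replace_brs('line one<br><br>line two'))  # line one</p><p>line two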

As you can see, both tags and text fall into groups that raise or lower the score. The content analysis itself is implemented in the grabArticle function. It starts by iterating over all nodes:

for(var nodeIndex = 0; (node = allElements[nodeIndex]); nodeIndex+=1)

It then removes elements that are unlikely to be content:

if (stripUnlikelyCandidates) 
{
    var unlikelyMatchString = node.className + node.id;
    if (
        (
            unlikelyMatchString.search(readability.regexps.unlikelyCandidates) !== -1 &&
            unlikelyMatchString.search(readability.regexps.okMaybeItsACandidate) === -1 &&
            node.tagName !== "BODY"
        )
    )
    {
        dbg("Removing unlikely candidate - " + unlikelyMatchString);
        node.parentNode.removeChild(node);
        nodeIndex-=1;
        continue;
    }               
}
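
The next step, not shown in this excerpt, uses the divToPElements regex above to convert any DIV that contains no other block-level elements into a P, so plain text wrapped in <div>s can be scored as paragraphs. A rough BeautifulSoup sketch of the idea (my own illustration):

import re
from bs4 import BeautifulSoup

DIV_TO_P = re.compile(r'<(a|blockquote|dl|div|img|ol|p|pre|table|ul)', re.I)


def divs_to_paragraphs(soup):
    """Rename <div>s with no block-level children to <p>, mirroring
    Readability's divToPElements check."""
    for div in soup.find_all('div'):
        if not DIV_TO_P.search(div.decode_contents()):
            div.name = 'p'


soup = BeautifulSoup('<div>just text</div><div><p>a block</p></div>', 'html.parser')
divs_to_paragraphs(soup)
print(soup)  # <p>just text</p><div><p>a block</p></div>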

After replacing DIVs with P tags, it iterates over the target nodes and scores them:

var candidates = [];
for (var pt=0; pt < nodesToScore.length; pt+=1) {
    var parentNode      = nodesToScore[pt].parentNode;
    var grandParentNode = parentNode ? parentNode.parentNode : null;
    var innerText       = readability.getInnerText(nodesToScore[pt]);

    if(!parentNode || typeof(parentNode.tagName) === 'undefined') {
        continue;
    }

    /* If this paragraph is less than 25 characters, don't even count it. */
    if(innerText.length < 25) {
        continue; }

    /* Initialize readability data for the parent. */
    if(typeof parentNode.readability === 'undefined') {
        readability.initializeNode(parentNode);
        candidates.push(parentNode);
    }

    /* Initialize readability data for the grandparent. */
    if(grandParentNode && typeof(grandParentNode.readability) === 'undefined' && typeof(grandParentNode.tagName) !== 'undefined') {
        readability.initializeNode(grandParentNode);
        candidates.push(grandParentNode);
    }

    var contentScore = 0;

    /* Add a point for the paragraph itself as a base. */
    contentScore+=1;

    /* Add points for any commas within this paragraph */
    contentScore += innerText.split(',').length;
    
    /* For every 100 characters in this paragraph, add another point. Up to 3 points. */
    contentScore += Math.min(Math.floor(innerText.length / 100), 3);
    
    /* Add the score to the parent. The grandparent gets half. */
    parentNode.readability.contentScore += contentScore;

    if(grandParentNode) {
        grandParentNode.readability.contentScore += contentScore/2;             
    }
}
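
As a worked example of this scoring: a 250-character paragraph containing two commas earns 1 base point, 3 points from the comma split (two commas yield three segments), and 2 points for two full runs of 100 characters, for a total of 6; its parent gains 6 and its grandparent 3. A minimal Python mirror of the formula (my own sketch):

def paragraph_score(text):
    """Readability's per-paragraph score: 1 base point, 1 point per
    comma-separated segment, plus 1 point per 100 characters (capped
    at 3)."""
    return 1 + len(text.split(',')) + min(len(text) // 100, 3)


sample = 'x' * 120 + ',' + 'y' * 80 + ',' + 'z' * 48  # 250 chars, 2 commas
print(paragraph_score(sample))  # 1 + 3 + 2 = 6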

Finally, the content is stitched back together according to the scores:

var articleContent        = document.createElement("DIV");
if (isPaging) {
    articleContent.id     = "readability-content";
}
var siblingScoreThreshold = Math.max(10, topCandidate.readability.contentScore * 0.2);
var siblingNodes          = topCandidate.parentNode.childNodes;


for(var s=0, sl=siblingNodes.length; s < sl; s+=1) {
    var siblingNode = siblingNodes[s];
    var append      = false;

    /**
     * Fix for odd IE7 Crash where siblingNode does not exist even though this should be a live nodeList.
     * Example of error visible here: http://www.esquire.com/features/honesty0707
    **/
    if(!siblingNode) {
        continue;
    }

    dbg("Looking at sibling node: " + siblingNode + " (" + siblingNode.className + ":" + siblingNode.id + ")" + ((typeof siblingNode.readability !== 'undefined') ? (" with score " + siblingNode.readability.contentScore) : ''));
    dbg("Sibling has score " + (siblingNode.readability ? siblingNode.readability.contentScore : 'Unknown'));

    if(siblingNode === topCandidate)
    {
        append = true;
    }

    var contentBonus = 0;
    /* Give a bonus if sibling nodes and top candidates have the example same classname */
    if(siblingNode.className === topCandidate.className && topCandidate.className !== "") {
        contentBonus += topCandidate.readability.contentScore * 0.2;
    }

    if(typeof siblingNode.readability !== 'undefined' && (siblingNode.readability.contentScore+contentBonus) >= siblingScoreThreshold)
    {
        append = true;
    }
    
    if(siblingNode.nodeName === "P") {
        var linkDensity = readability.getLinkDensity(siblingNode);
        var nodeContent = readability.getInnerText(siblingNode);
        var nodeLength  = nodeContent.length;
        
        if(nodeLength > 80 && linkDensity < 0.25)
        {
            append = true;
        }
        else if(nodeLength < 80 && linkDensity === 0 && nodeContent.search(/\.( |$)/) !== -1)
        {
            append = true;
        }
    }

    if(append) {
        dbg("Appending node: " + siblingNode);

        var nodeToAppend = null;
        if(siblingNode.nodeName !== "DIV" && siblingNode.nodeName !== "P") {
            /* We have a node that isn't a common block level element, like a form or td tag. Turn it into a div so it doesn't get filtered out later by accident. */
            
            dbg("Altering siblingNode of " + siblingNode.nodeName + ' to div.');
            nodeToAppend = document.createElement("DIV");
            try {
                nodeToAppend.id = siblingNode.id;
                nodeToAppend.innerHTML = siblingNode.innerHTML;
            }
            catch(er) {
                dbg("Could not alter siblingNode to div, probably an IE restriction, reverting back to original.");
                nodeToAppend = siblingNode;
                s-=1;
                sl-=1;
            }
        } else {
            nodeToAppend = siblingNode;
            s-=1;
            sl-=1;
        }
        
        /* To ensure a node does not interfere with readability styles, remove its classnames */
        nodeToAppend.className = "";

        /* Append sibling and subtract from our list because it removes the node when you append to another node */
        articleContent.appendChild(nodeToAppend);
    }
}
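
The sibling checks above depend on getLinkDensity, which this excerpt does not define: it is the amount of text inside links divided by the node's total text length (the Java port below includes an implementation). A standalone Python sketch, assuming BeautifulSoup as in the earlier example:

from bs4 import BeautifulSoup


def get_link_density(node):
    """Text length inside <a> descendants divided by total text length;
    near 0 suggests real content, near 1 suggests navigation."""
    text_length = len(node.get_text())
    if text_length == 0:
        return 0.0
    link_length = sum(len(a.get_text()) for a in node.find_all('a'))
    return link_length / text_length


soup = BeautifulSoup('<div>Some prose with <a href="#">one link</a>.</div>', 'html.parser')
print(round(get_link_density(soup.div), 2))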

Sample Code

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Readability {

    private static final String CONTENT_SCORE = "readabilityContentScore";

    private final Document mDocument;
    private String mBodyCache;

    public Readability(String html) {
        super();
        mDocument = Jsoup.parse(html);
    }

    public Readability(String html, String baseUri) {
        super();
        mDocument = Jsoup.parse(html, baseUri);
    }

    public Readability(File in, String charsetName, String baseUri)
            throws IOException {
        super();
        mDocument = Jsoup.parse(in, charsetName, baseUri);
    }

    public Readability(URL url, int timeoutMillis) throws IOException {
        super();
        mDocument = Jsoup.parse(url, timeoutMillis);
    }

    public Readability(Document doc) {
        super();
        mDocument = doc;
    }

    // @formatter:off
    /**
     * Runs readability.
     * 
     * Workflow: 
     * 1. Prep the document by removing script tags, css, etc. 
     * 2. Build readability's DOM tree. 
     * 3. Grab the article content from the current dom tree. 
     * 4. Replace the current DOM tree with the new one. 
     * 5. Read peacefully.
     * 
     * @param preserveUnlikelyCandidates
     */
    // @formatter:on
    private void init(boolean preserveUnlikelyCandidates) {
        if (mDocument.body() != null && mBodyCache == null) {
            mBodyCache = mDocument.body().html();
        }

        prepDocument();

        /* Build readability's DOM tree */
        Element overlay = mDocument.createElement("div");
        Element innerDiv = mDocument.createElement("div");
        Element articleTitle = getArticleTitle();
        Element articleContent = grabArticle(preserveUnlikelyCandidates);

        /**
         * If we attempted to strip unlikely candidates on the first run
         * through, and we ended up with no content, that may mean we stripped
         * out the actual content so we couldn't parse it. So re-run init while
         * preserving unlikely candidates to have a better shot at getting our
         * content out properly.
         */
        if (isEmpty(getInnerText(articleContent, false))) {
            if (!preserveUnlikelyCandidates) {
                mDocument.body().html(mBodyCache);
                init(true);
                return;
            } else {
                articleContent
                        .html("<p>Sorry, readability was unable to parse this page for content.</p>");
            }
        }

        /* Glue the structure of our document together. */
        innerDiv.appendChild(articleTitle);
        innerDiv.appendChild(articleContent);
        overlay.appendChild(innerDiv);

        /* Clear the old HTML, insert the new content. */
        mDocument.body().html("");
        mDocument.body().prependChild(overlay);
    }

    /**
     * Runs readability.
     */
    public final void init() {
        init(false);
    }

    /**
     * Get the combined inner HTML of all matched elements.
     * 
     * @return
     */
    public final String html() {
        return mDocument.html();
    }

    /**
     * Get the combined outer HTML of all matched elements.
     * 
     * @return
     */
    public final String outerHtml() {
        return mDocument.outerHtml();
    }

    /**
     * Get the article title as an H1. Currently just uses document.title, we
     * might want to be smarter in the future.
     * 
     * @return
     */
    protected Element getArticleTitle() {
        Element articleTitle = mDocument.createElement("h1");
        articleTitle.html(mDocument.title());
        return articleTitle;
    }

    /**
     * Prepare the HTML document for readability to scrape it. This includes
     * things like stripping javascript, CSS, and handling terrible markup.
     */
    protected void prepDocument() {
        /**
         * In some cases a body element can't be found (if the HTML is totally
         * hosed for example) so we create a new body node and append it to the
         * document.
         */
        if (mDocument.body() == null) {
            mDocument.appendElement("body");
        }

        /* Remove all scripts */
        Elements elementsToRemove = mDocument.getElementsByTag("script");
        for (Element script : elementsToRemove) {
            script.remove();
        }

        /* Remove all stylesheets */
        elementsToRemove = getElementsByTag(mDocument.head(), "link");
        for (Element styleSheet : elementsToRemove) {
            if ("stylesheet".equalsIgnoreCase(styleSheet.attr("rel"))) {
                styleSheet.remove();
            }
        }

        /* Remove all style tags in head */
        elementsToRemove = mDocument.getElementsByTag("style");
        for (Element styleTag : elementsToRemove) {
            styleTag.remove();
        }

        /* Turn all double br's into p's */
        /*
         * TODO: this is pretty costly as far as processing goes. Maybe optimize
         * later.
         */
        mDocument.body().html(
                mDocument.body().html()
                        .replaceAll(Patterns.REGEX_REPLACE_BRS, "</p><p>")
                        .replaceAll(Patterns.REGEX_REPLACE_FONTS, "<$1span>"));
    }

    /**
     * Prepare the article node for display. Clean out any inline styles,
     * iframes, forms, strip extraneous &lt;p&gt; tags, etc.
     * 
     * @param articleContent
     */
    private void prepArticle(Element articleContent) {
        cleanStyles(articleContent);
        killBreaks(articleContent);

        /* Clean out junk from the article content */
        clean(articleContent, "form");
        clean(articleContent, "object");
        clean(articleContent, "h1");
        /**
         * If there is only one h2, they are probably using it as a header and
         * not a subheader, so remove it since we already have a header.
         */
        if (getElementsByTag(articleContent, "h2").size() == 1) {
            clean(articleContent, "h2");
        }
        clean(articleContent, "iframe");

        cleanHeaders(articleContent);

        /*
         * Do these last as the previous stuff may have removed junk that will
         * affect these
         */
        cleanConditionally(articleContent, "table");
        cleanConditionally(articleContent, "ul");
        cleanConditionally(articleContent, "div");

        /* Remove extra paragraphs */
        Elements articleParagraphs = getElementsByTag(articleContent, "p");
        for (Element articleParagraph : articleParagraphs) {
            int imgCount = getElementsByTag(articleParagraph, "img").size();
            int embedCount = getElementsByTag(articleParagraph, "embed").size();
            int objectCount = getElementsByTag(articleParagraph, "object")
                    .size();

            if (imgCount == 0 && embedCount == 0 && objectCount == 0
                    && isEmpty(getInnerText(articleParagraph, false))) {
                articleParagraph.remove();
            }
        }

        try {
            articleContent.html(articleContent.html().replaceAll(
                    "(?i)<br[^>]*>\\s*<p", "<p"));
        } catch (Exception e) {
            dbg("Cleaning innerHTML of breaks failed. This is an IE strict-block-elements bug. Ignoring.",
                    e);
        }
    }

    /**
     * Initialize a node with the readability object. Also checks the
     * className/id for special names to add to its score.
     * 
     * @param node
     */
    private static void initializeNode(Element node) {
        node.attr(CONTENT_SCORE, Integer.toString(0));

        String tagName = node.tagName();
        if ("div".equalsIgnoreCase(tagName)) {
            incrementContentScore(node, 5);
        } else if ("pre".equalsIgnoreCase(tagName)
                || "td".equalsIgnoreCase(tagName)
                || "blockquote".equalsIgnoreCase(tagName)) {
            incrementContentScore(node, 3);
        } else if ("address".equalsIgnoreCase(tagName)
                || "ol".equalsIgnoreCase(tagName)
                || "ul".equalsIgnoreCase(tagName)
                || "dl".equalsIgnoreCase(tagName)
                || "dd".equalsIgnoreCase(tagName)
                || "dt".equalsIgnoreCase(tagName)
                || "li".equalsIgnoreCase(tagName)
                || "form".equalsIgnoreCase(tagName)) {
            incrementContentScore(node, -3);
        } else if ("h1".equalsIgnoreCase(tagName)
                || "h2".equalsIgnoreCase(tagName)
                || "h3".equalsIgnoreCase(tagName)
                || "h4".equalsIgnoreCase(tagName)
                || "h5".equalsIgnoreCase(tagName)
                || "h6".equalsIgnoreCase(tagName)
                || "th".equalsIgnoreCase(tagName)) {
            incrementContentScore(node, -5);
        }

        incrementContentScore(node, getClassWeight(node));
    }

    /**
     * Using a variety of metrics (content score, classname, element types),
     * find the content that is most likely to be the stuff a user wants to read.
     * Then return it wrapped up in a div.
     * 
     * @param preserveUnlikelyCandidates
     * @return
     */
    protected Element grabArticle(boolean preserveUnlikelyCandidates) {
        /**
         * First, node prepping. Trash nodes that look cruddy (like ones with
         * the class name "comment", etc), and turn divs into P tags where they
         * have been used inappropriately (as in, where they contain no other
         * block level elements.)
         * 
         * Note: Assignment from index for performance. See
         * http://www.peachpit.com/articles/article.aspx?p=31567&seqNum=5 TODO:
         * Shouldn't this be a reverse traversal?
         **/
        for (Element node : mDocument.getAllElements()) {
            /* Remove unlikely candidates */
            if (!preserveUnlikelyCandidates) {
                String unlikelyMatchString = node.className() + node.id();
                Matcher unlikelyCandidatesMatcher = Patterns.get(
                        Patterns.RegEx.UNLIKELY_CANDIDATES).matcher(
                        unlikelyMatchString);
                Matcher maybeCandidateMatcher = Patterns.get(
                        Patterns.RegEx.OK_MAYBE_ITS_A_CANDIDATE).matcher(
                        unlikelyMatchString);
                if (unlikelyCandidatesMatcher.find()
                        && !maybeCandidateMatcher.find()
                        && !"body".equalsIgnoreCase(node.tagName())) {
                    node.remove();
                    dbg("Removing unlikely candidate - " + unlikelyMatchString);
                    continue;
                }
            }

            /*
             * Turn all divs that don't have children block level elements into
             * p's
             */
            if ("div".equalsIgnoreCase(node.tagName())) {
                Matcher matcher = Patterns
                        .get(Patterns.RegEx.DIV_TO_P_ELEMENTS).matcher(
                                node.html());
                if (!matcher.find()) {
                    dbg("Alternating div to p: " + node);
                    try {
                        node.tagName("p");
                    } catch (Exception e) {
                        dbg("Could not alter div to p, probably an IE restriction, reverting back to div.",
                                e);
                    }
                }
            }
        }

        /**
         * Loop through all paragraphs, and assign a score to them based on how
         * content-y they look. Then add their score to their parent node.
         * 
         * A score is determined by things like number of commas, class names,
         * etc. Maybe eventually link density.
         **/
        Elements allParagraphs = mDocument.getElementsByTag("p");
        ArrayList<Element> candidates = new ArrayList<Element>();

        for (Element node : allParagraphs) {
            Element parentNode = node.parent();
            Element grandParentNode = parentNode.parent();
            String innerText = getInnerText(node, true);

            /*
             * If this paragraph is less than 25 characters, don't even count
             * it.
             */
            if (innerText.length() < 25) {
                continue;
            }

            /* Initialize readability data for the parent. */
            if (!parentNode.hasAttr(CONTENT_SCORE)) {
                initializeNode(parentNode);
                candidates.add(parentNode);
            }

            /* Initialize readability data for the grandparent. */
            if (!grandParentNode.hasAttr(CONTENT_SCORE)) {
                initializeNode(grandParentNode);
                candidates.add(grandParentNode);
            }

            int contentScore = 0;

            /* Add a point for the paragraph itself as a base. */
            contentScore++;

            /* Add points for any commas within this paragraph */
            contentScore += innerText.split(",").length;

            /*
             * For every 100 characters in this paragraph, add another point. Up
             * to 3 points.
             */
            contentScore += Math.min(Math.floor(innerText.length() / 100), 3);

            /* Add the score to the parent. The grandparent gets half. */
            incrementContentScore(parentNode, contentScore);
            incrementContentScore(grandParentNode, contentScore / 2);
        }

        /**
         * After we've calculated scores, loop through all of the possible
         * candidate nodes we found and find the one with the highest score.
         */
        Element topCandidate = null;
        for (Element candidate : candidates) {
            /**
             * Scale the final candidates score based on link density. Good
             * content should have a relatively small link density (5% or less)
             * and be mostly unaffected by this operation.
             */
            scaleContentScore(candidate, 1 - getLinkDensity(candidate));

            dbg("Candidate: (" + candidate.className() + ":" + candidate.id()
                    + ") with score " + getContentScore(candidate));

            if (topCandidate == null
                    || getContentScore(candidate) > getContentScore(topCandidate)) {
                topCandidate = candidate;
            }
        }

        /**
         * If we still have no top candidate, just use the body as a last
         * resort. We also have to copy the body node so it is something we can
         * modify.
         */
        if (topCandidate == null
                || "body".equalsIgnoreCase(topCandidate.tagName())) {
            topCandidate = mDocument.createElement("div");
            topCandidate.html(mDocument.body().html());
            mDocument.body().html("");
            mDocument.body().appendChild(topCandidate);
            initializeNode(topCandidate);
        }

        /**
         * Now that we have the top candidate, look through its siblings for
         * content that might also be related. Things like preambles, content
         * split by ads that we removed, etc.
         */
        Element articleContent = mDocument.createElement("div");
        articleContent.attr("id", "readability-content");
        int siblingScoreThreshold = Math.max(10,
                (int) (getContentScore(topCandidate) * 0.2f));
        Elements siblingNodes = topCandidate.parent().children();
        for (Element siblingNode : siblingNodes) {
            boolean append = false;

            dbg("Looking at sibling node: (" + siblingNode.className() + ":"
                    + siblingNode.id() + ")" + " with score "
                    + getContentScore(siblingNode));

            if (siblingNode == topCandidate) {
                append = true;
            }

            if (getContentScore(siblingNode) >= siblingScoreThreshold) {
                append = true;
            }

            if ("p".equalsIgnoreCase(siblingNode.tagName())) {
                float linkDensity = getLinkDensity(siblingNode);
                String nodeContent = getInnerText(siblingNode, true);
                int nodeLength = nodeContent.length();

                if (nodeLength > 80 && linkDensity < 0.25f) {
                    append = true;
                } else if (nodeLength < 80 && linkDensity == 0.0f
                        && nodeContent.matches(".*\\.( |$).*")) {
                    append = true;
                }
            }

            if (append) {
                dbg("Appending node: " + siblingNode);

                /*
                 * Append sibling and subtract from our list because it removes
                 * the node when you append to another node
                 */
                articleContent.appendChild(siblingNode);
                continue;
            }
        }

        /**
         * So we have all of the content that we need. Now we clean it up for
         * presentation.
         */
        prepArticle(articleContent);

        return articleContent;
    }

    /**
     * Get the inner text of a node - cross browser compatibly. This also strips
     * out any excess whitespace to be found.
     * 
     * @param e
     * @param normalizeSpaces
     * @return
     */
    private static String getInnerText(Element e, boolean normalizeSpaces) {
        String textContent = e.text().trim();

        if (normalizeSpaces) {
            textContent = textContent.replaceAll(Patterns.REGEX_NORMALIZE, " ");
        }

        return textContent;
    }

    /**
     * Get the number of times a string s appears in the node e.
     * 
     * @param e
     * @param s
     * @return
     */
    private static int getCharCount(Element e, String s) {
        if (s == null || s.length() == 0) {
            s = ",";
        }
        return getInnerText(e, true).split(s).length;
    }

    /**
     * Remove the style attribute on every e and under.
     * 
     * @param e
     */
    private static void cleanStyles(Element e) {
        if (e == null) {
            return;
        }

        Element cur = e.children().first();

        // Remove any root styles, if we're able.
        if (!"readability-styled".equals(e.className())) {
            e.removeAttr("style");
        }

        // Go until there are no more child nodes
        while (cur != null) {
            // Remove style attributes
            if (!"readability-styled".equals(cur.className())) {
                cur.removeAttr("style");
            }
            cleanStyles(cur);
            cur = cur.nextElementSibling();
        }
    }

    /**
     * Get the density of links as a percentage of the content. This is the
     * amount of text that is inside a link divided by the total text in the
     * node.
     * 
     * @param e
     * @return
     */
    private static float getLinkDensity(Element e) {
        Elements links = getElementsByTag(e, "a");
        int textLength = getInnerText(e, true).length();
        float linkLength = 0.0F;
        for (Element link : links) {
            linkLength += getInnerText(link, true).length();
        }
        return linkLength / textLength;
    }

    /**
     * Get an elements class/id weight. Uses regular expressions to tell if this
     * element looks good or bad.
     * 
     * @param e
     * @return
     */
    private static int getClassWeight(Element e) {
        int weight = 0;

        /* Look for a special classname */
        String className = e.className();
        if (!isEmpty(className)) {
            Matcher negativeMatcher = Patterns.get(Patterns.RegEx.NEGATIVE)
                    .matcher(className);
            Matcher positiveMatcher = Patterns.get(Patterns.RegEx.POSITIVE)
                    .matcher(className);
            if (negativeMatcher.find()) {
                weight -= 25;
            }
            if (positiveMatcher.find()) {
                weight += 25;
            }
        }

        /* Look for a special ID */
        String id = e.id();
        if (!isEmpty(id)) {
            Matcher negativeMatcher = Patterns.get(Patterns.RegEx.NEGATIVE)
                    .matcher(id);
            Matcher positiveMatcher = Patterns.get(Patterns.RegEx.POSITIVE)
                    .matcher(id);
            if (negativeMatcher.find()) {
                weight -= 25;
            }
            if (positiveMatcher.find()) {
                weight += 25;
            }
        }

        return weight;
    }

    /**
     * Remove extraneous break tags from a node.
     * 
     * @param e
     */
    private static void killBreaks(Element e) {
        e.html(e.html().replaceAll(Patterns.REGEX_KILL_BREAKS, "<br />"));
    }

    /**
     * Clean a node of all elements of type "tag". (Unless it's a youtube/vimeo
     * video. People love movies.)
     * 
     * @param e
     * @param tag
     */
    private static void clean(Element e, String tag) {
        Elements targetList = getElementsByTag(e, tag);
        boolean isEmbed = "object".equalsIgnoreCase(tag)
                       || "embed".equalsIgnoreCase(tag)
                       || "iframe".equalsIgnoreCase(tag);

        for (Element target : targetList) {
            Matcher matcher = Patterns.get(Patterns.RegEx.VIDEO).matcher(
                    target.outerHtml());
            if (isEmbed && matcher.find()) {
                continue;
            }
            target.remove();
        }
    }

    /**
     * Clean an element of all tags of type "tag" if they look fishy. "Fishy" is
     * an algorithm based on content length, classnames, link density, number of
     * images & embeds, etc.
     * 
     * @param e
     * @param tag
     */
    private void cleanConditionally(Element e, String tag) {
        Elements tagsList = getElementsByTag(e, tag);

        /**
         * Gather counts for other typical elements embedded within. Traverse
         * backwards so we can remove nodes at the same time without affecting
         * the traversal.
         * 
         * TODO: Consider taking into account original contentScore here.
         */
        for (Element node : tagsList) {
            int weight = getClassWeight(node);

            dbg("Cleaning Conditionally (" + node.className() + ":" + node.id()
                    + ")" + getContentScore(node));

            if (weight < 0) {
                node.remove();
            } else if (getCharCount(node, ",") < 10) {
                /**
                 * If there are not very many commas, and the number of
                 * non-paragraph elements is more than paragraphs or other
                 * ominous signs, remove the element.
                 */
                int p = getElementsByTag(node, "p").size();
                int img = getElementsByTag(node, "img").size();
                int li = getElementsByTag(node, "li").size() - 100;
                int input = getElementsByTag(node, "input").size();

                int embedCount = 0;
                Elements embeds = getElementsByTag(node, "embed");
                for (Element embed : embeds) {
                    if (!Patterns.get(Patterns.RegEx.VIDEO)
                            .matcher(embed.absUrl("src")).find()) {
                        embedCount++;
                    }
                }

                float linkDensity = getLinkDensity(node);
                int contentLength = getInnerText(node, true).length();
                boolean toRemove = false;

                if (img > p) {
                    toRemove = true;
                } else if (li > p && !"ul".equalsIgnoreCase(tag)
                        && !"ol".equalsIgnoreCase(tag)) {
                    toRemove = true;
                } else if (input > Math.floor(p / 3)) {
                    toRemove = true;
                } else if (contentLength < 25 && (img == 0 || img > 2)) {
                    toRemove = true;
                } else if (weight < 25 && linkDensity > 0.2f) {
                    toRemove = true;
                } else if (weight > 25 && linkDensity > 0.5f) {
                    toRemove = true;
                } else if ((embedCount == 1 && contentLength < 75)
                        || embedCount > 1) {
                    toRemove = true;
                }

                if (toRemove) {
                    node.remove();
                }
            }
        }
    }

    /**
     * Clean out spurious headers from an Element. Checks things like classnames
     * and link density.
     * 
     * @param e
     */
    private static void cleanHeaders(Element e) {
        for (int headerIndex = 1; headerIndex < 7; headerIndex++) {
            Elements headers = getElementsByTag(e, "h" + headerIndex);
            for (Element header : headers) {
                if (getClassWeight(header) < 0
                        || getLinkDensity(header) > 0.33f) {
                    header.remove();
                }
            }
        }
    }

    /**
     * Print debug logs
     * 
     * @param msg
     */
    protected void dbg(String msg) {
        dbg(msg, null);
    }

    /**
     * Print debug logs with stack trace
     * 
     * @param msg
     * @param t
     */
    protected void dbg(String msg, Throwable t) {
        System.out.println(msg + (t != null ? ("\n" + t.getMessage()) : "")
                + (t != null ? ("\n" + t.getStackTrace()) : ""));
    }

    private static class Patterns {
        private static Pattern sUnlikelyCandidatesRe;
        private static Pattern sOkMaybeItsACandidateRe;
        private static Pattern sPositiveRe;
        private static Pattern sNegativeRe;
        private static Pattern sDivToPElementsRe;
        private static Pattern sVideoRe;
        private static final String REGEX_REPLACE_BRS = "(?i)(<br[^>]*>[ \n\r\t]*){2,}";
        private static final String REGEX_REPLACE_FONTS = "(?i)<(\\/?)font[^>]*>";
        /* Java has String.trim() */
        // private static final String REGEX_TRIM = "^\\s+|\\s+$";
        private static final String REGEX_NORMALIZE = "\\s{2,}";
        private static final String REGEX_KILL_BREAKS = "(<br\\s*\\/?>(\\s|&nbsp;?)*){1,}";

        public enum RegEx {
            UNLIKELY_CANDIDATES, OK_MAYBE_ITS_A_CANDIDATE, POSITIVE, NEGATIVE, DIV_TO_P_ELEMENTS, VIDEO;
        }

        public static Pattern get(RegEx re) {
            switch (re) {
            case UNLIKELY_CANDIDATES: {
                if (sUnlikelyCandidatesRe == null) {
                    sUnlikelyCandidatesRe = Pattern
                            .compile(
                                    "combx|comment|disqus|foot|header|menu|meta|nav|rss|shoutbox|sidebar|sponsor",
                                    Pattern.CASE_INSENSITIVE);
                }
                return sUnlikelyCandidatesRe;
            }
            case OK_MAYBE_ITS_A_CANDIDATE: {
                if (sOkMaybeItsACandidateRe == null) {
                    sOkMaybeItsACandidateRe = Pattern.compile(
                            "and|article|body|column|main",
                            Pattern.CASE_INSENSITIVE);
                }
                return sOkMaybeItsACandidateRe;
            }
            case POSITIVE: {
                if (sPositiveRe == null) {
                    sPositiveRe = Pattern
                            .compile(
                                    "article|body|content|entry|hentry|page|pagination|post|text",
                                    Pattern.CASE_INSENSITIVE);
                }
                return sPositiveRe;
            }
            case NEGATIVE: {
                if (sNegativeRe == null) {
                    sNegativeRe = Pattern
                            .compile(
                                    "combx|comment|contact|foot|footer|footnote|link|media|meta|promo|related|scroll|shoutbox|sponsor|tags|widget",
                                    Pattern.CASE_INSENSITIVE);
                }
                return sNegativeRe;
            }
            case DIV_TO_P_ELEMENTS: {
                if (sDivToPElementsRe == null) {
                    sDivToPElementsRe = Pattern.compile(
                            "<(a|blockquote|dl|div|img|ol|p|pre|table|ul)",
                            Pattern.CASE_INSENSITIVE);
                }
                return sDivToPElementsRe;
            }
            case VIDEO: {
                if (sVideoRe == null) {
                    sVideoRe = Pattern.compile(
                            "http:\\/\\/(www\\.)?(youtube|vimeo)\\.com",
                            Pattern.CASE_INSENSITIVE);
                }
                return sVideoRe;
            }
            }
            return null;
        }
    }

    /**
     * Reads the content score.
     * 
     * @param node
     * @return
     */
    private static int getContentScore(Element node) {
        try {
            return Integer.parseInt(node.attr(CONTENT_SCORE));
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    /**
     * Increase or decrease the content score for an Element by an
     * increment/decrement.
     * 
     * @param node
     * @param increment
     * @return
     */
    private static Element incrementContentScore(Element node, int increment) {
        int contentScore = getContentScore(node);
        contentScore += increment;
        node.attr(CONTENT_SCORE, Integer.toString(contentScore));
        return node;
    }

    /**
     * Scales the content score for an Element with a factor of scale.
     * 
     * @param node
     * @param scale
     * @return
     */
    private static Element scaleContentScore(Element node, float scale) {
        int contentScore = getContentScore(node);
        contentScore *= scale;
        node.attr(CONTENT_SCORE, Integer.toString(contentScore));
        return node;
    }

    /**
     * Jsoup's Element.getElementsByTag(Element e) includes e itself, which is
     * different from W3C standards. This utility function is exclusive of the
     * Element e.
     * 
     * @param e
     * @param tag
     * @return
     */
    private static Elements getElementsByTag(Element e, String tag) {
        Elements es = e.getElementsByTag(tag);
        es.remove(e);
        return es;
    }

    /**
     * Helper utility to determine whether a given String is empty.
     * 
     * @param s
     * @return
     */
    private static boolean isEmpty(String s) {
        return s == null || s.length() == 0;
    }

}

Readability is instantiated through the constructors it provides:

Readability readability = new Readability(html); // String
Readability readability = new Readability(url, timeoutMillis); // URL

Content extraction is kicked off by calling:

readability.init();

The output is clean, readable content in HTML form, which can be retrieved with:

String cleanHtml = readability.outerHtml();