Behind-the-Scenes Secrets of Jsoup V: Optimization Tips and Tricks

dnc8371

Original link: https://www.javacodegeeks.com/2019/02/secrets-jsoup-tricks-optimization.html

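We have got things done right; now it is time to do them faster. We will keep Donald Knuth's warning in mind: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."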

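According to Jonathan Hedley, he uses the YourKit Java Profiler to measure memory usage and find performance hot spots. The statistics from such tools are crucial to successful optimization: they keep you from spending time on speculation and useless tuning that does not improve performance but does make the code unnecessarily complex and hard to maintain. Jonathan also mentions this in the "Colophon".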

We will list some of the tips and tricks used in Jsoup. They are in no particular order for now and will be reorganized in the future.

1. Padding for indentation

// memoised padding up to 21, from "", " ", "  " to "                   "
static final String[] padding = {......};

public static String padding(int width) {
    if (width < 0)
        throw new IllegalArgumentException("width must be > 0");

    if (width < padding.length)
        return padding[width];
    char[] out = new char[width];
    for (int i = 0; i < width; i++)
        out[i] = ' ';
    return String.valueOf(out);
}

protected void indent(Appendable accum, int depth, Document.OutputSettings out) throws IOException {
    accum.append('\n').append(StringUtil.padding(depth * out.indentAmount()));
}

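Pretty clever, right? It keeps a cache of padding strings of different lengths, which covers about 80% of cases. I assume that figure is based on the author's experience and statistics.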
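The snippet above elides the contents of the padding table. As a rough sketch of the same memoisation idea, the table can be filled once in a static initializer. The class name `PaddingCache`, the helper `spaces`, and the cache size of 21 entries are assumptions made for this illustration, not jsoup's actual code.

```java
import java.util.Arrays;

class PaddingCache {
    // Hypothetical cache size for this sketch; jsoup's own table may differ.
    private static final int CACHED = 21;
    private static final String[] PADDING = new String[CACHED];

    static {
        // Build "", " ", "  ", ... once, so that common widths never allocate again.
        for (int i = 0; i < CACHED; i++) {
            PADDING[i] = spaces(i);
        }
    }

    private static String spaces(int width) {
        char[] out = new char[width];
        Arrays.fill(out, ' ');
        return String.valueOf(out);
    }

    static String padding(int width) {
        if (width < 0) {
            throw new IllegalArgumentException("width must be >= 0");
        }
        // Cached widths return a shared instance; larger widths are built on demand.
        return width < CACHED ? PADDING[width] : spaces(width);
    }
}
```

Calling `padding(4)` twice returns the same `String` instance, while `padding(30)` falls outside the cache and allocates a new string on each call.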

2. Has class or not?

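Element#hasClass is marked as performance sensitive. Suppose we want to check whether <div class="logged-in env-production intent-mouse"> has the class production. Splitting the class attribute on whitespace into an array and then looping over it would work, but it would be inefficient during a deep traversal. Jsoup first introduces an early exit here, comparing the attribute length with the length of the target class name to avoid unnecessary scanning and searching. It then walks the attribute with an index that detects whitespace and calls regionMatches on each token. Honestly, this was the first time I came across the method String#regionMatches.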

public boolean hasClass(String className) {
    final String classAttr = attributes().getIgnoreCase("class");
    final int len = classAttr.length();
    final int wantLen = className.length();

    if (len == 0 || len < wantLen) {
        return false;
    }

    // if both lengths are equal, only need compare the className with the attribute
    if (len == wantLen) {
        return className.equalsIgnoreCase(classAttr);
    }

    // otherwise, scan for whitespace and compare regions (with no string or arraylist allocations)
    boolean inClass = false;
    int start = 0;
    for (int i = 0; i < len; i++) {
        if (Character.isWhitespace(classAttr.charAt(i))) {
            if (inClass) {
                // white space ends a class name, compare it with the requested one, ignore case
                if (i - start == wantLen && classAttr.regionMatches(true, start, className, 0, wantLen)) {
                    return true;
                }
                inClass = false;
            }
        } else {
            if (!inClass) {
                // we're in a class name : keep the start of the substring
                inClass = true;
                start = i;
            }
        }
    }

    // check the last entry
    if (inClass && len - start == wantLen) {
        return classAttr.regionMatches(true, start, className, 0, wantLen);
    }

    return false;
}
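For readers who, like the author, have not used String#regionMatches before, here is a small sketch of how it compares a slice of the attribute against the wanted name without allocating a substring. The class `RegionMatchesDemo` and the helper `tokenEquals` are hypothetical names introduced only for this example.

```java
class RegionMatchesDemo {
    /**
     * Returns true if the token of classAttr spanning [start, end) equals wanted,
     * ignoring case. No substring or array is allocated.
     */
    static boolean tokenEquals(String classAttr, int start, int end, String wanted) {
        return end - start == wanted.length()
            && classAttr.regionMatches(true, start, wanted, 0, wanted.length());
    }

    public static void main(String[] args) {
        String classAttr = "logged-in env-production intent-mouse";
        int start = classAttr.indexOf("env-production");
        int end = start + "env-production".length();

        // The whole token matches, regardless of case.
        System.out.println(tokenEquals(classAttr, start, end, "ENV-PRODUCTION"));
        // The characters of "production" do appear inside the token...
        System.out.println(classAttr.regionMatches(true, start + 4, "production", 0, 10));
        // ...but the token is longer than the wanted name, so the length check rejects it.
        System.out.println(tokenEquals(classAttr, start, end, "production"));
    }
}
```

The length comparison matters as much as the region comparison: for the element in the example, "production" is only part of the token "env-production", so a class check for "production" should come back false.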

3. Tag name in or not?

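As we analyzed in the previous articles, HtmlTreeBuilderState validates the correctness of nesting by checking whether a tag name belongs to a certain collection. We can compare the implementations before and after 1.7.3 to see the difference.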

// 1.7.2
} else if (StringUtil.in(name, "base", "basefont", "bgsound", "command", "link", "meta", "noframes", "script", "style", "title")) {
    return tb.process(t, InHead);
}

// 1.7.3
static final String[] InBodyStartToHead = new String[]{"base", "basefont", "bgsound", "command", "link", "meta", "noframes", "script", "style", "title"};
...
} else if (StringUtil.inSorted(name, Constants.InBodyStartToHead)) {
    return tb.process(t, InHead);
}

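According to the author's comment: "A little harder to read here, but causes less GC than dynamic varargs. Was contributing around 10% of parse GC load. Must make sure these are sorted, as used in findSorted." Simply using static final constant arrays avoids allocating a varargs array on every call, and keeping them sorted lets the lookup use binary search, improving it from O(n) to O(log(n)). The cost-benefit ratio here is very good.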
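The source does not show the body of the sorted lookup itself. A minimal sketch of such a lookup, assuming it is built on `Arrays.binarySearch`, could look like this; the class name `SortedLookup` and the constant name are illustrative.

```java
import java.util.Arrays;

class SortedLookup {
    // Must stay in natural String order; binarySearch results are undefined otherwise.
    static final String[] IN_BODY_START_TO_HEAD = {
        "base", "basefont", "bgsound", "command", "link",
        "meta", "noframes", "script", "style", "title"
    };

    /** Returns true if needle is present in the sorted array, in O(log n) comparisons. */
    static boolean inSorted(String needle, String[] sortedHaystack) {
        return Arrays.binarySearch(sortedHaystack, needle) >= 0;
    }
}
```

Because the array is a shared constant, a call such as `inSorted("meta", IN_BODY_START_TO_HEAD)` allocates nothing, whereas a varargs call has to build a fresh array each time it runs.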

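However, "MUST update HtmlTreebuilderStateTest if more arrays are added" is not a good way to keep things in sync, IMHO. Rather than copying and pasting, I would use reflection to retrieve those constants. You can find my proposal in Pull Request #1157: "Simplify state sorting status unit test - avoid duplicated code in HtmlTreeBuilderStateTest.java".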
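The pull request itself is not reproduced here. The following is only a sketch of what a reflection-based check might look like; the holder classes `Constants` and `BrokenConstants` and the method `unsortedArrays` are stand-ins invented for this example and do not come from jsoup.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedConstantsCheck {
    // Stand-in for the real constants holder; contents are illustrative only.
    static final class Constants {
        static final String[] InBodyStartToHead = {"base", "basefont", "bgsound", "command",
            "link", "meta", "noframes", "script", "style", "title"};
    }

    // Deliberately unsorted, to show what the check reports.
    static final class BrokenConstants {
        static final String[] Unsorted = {"p", "div"};
    }

    /** Returns the names of all static String[] fields of holder that are not sorted. */
    static List<String> unsortedArrays(Class<?> holder) {
        List<String> bad = new ArrayList<>();
        for (Field field : holder.getDeclaredFields()) {
            if (!Modifier.isStatic(field.getModifiers()) || field.getType() != String[].class) {
                continue;
            }
            try {
                field.setAccessible(true);
                String[] actual = (String[]) field.get(null);
                String[] sorted = actual.clone();
                Arrays.sort(sorted);
                if (!Arrays.equals(actual, sorted)) {
                    bad.add(field.getName());
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("cannot read " + field.getName(), e);
            }
        }
        return bad;
    }
}
```

With this approach, an array added to the holder later is picked up automatically, so the test does not need a hand-maintained copy of every constant.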

4. Flyweight pattern

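Do you know the trick behind Integer.valueOf(i)? It maintains an IntegerCache covering -128 to 127, or a higher upper bound if configured (java.lang.Integer.IntegerCache.high). As a result, == and equals can give different answers depending on whether the value falls inside or outside that range (a classic Java interview question?). This is in fact an example of the flyweight pattern. For Jsoup, applying this pattern also reduces the number of objects created and improves performance.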
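The effect of the Integer cache is easy to observe. In the sketch below, the class `IntegerCacheDemo` and the helper `sameInstance` are names made up for this example; the result for values outside -128 to 127 depends on JVM settings, so it is printed without a fixed expectation.

```java
class IntegerCacheDemo {
    /** Compares two boxed values by reference, on purpose, to expose the cache. */
    static boolean sameInstance(int value) {
        Integer a = Integer.valueOf(value);
        Integer b = Integer.valueOf(value);
        return a == b;
    }

    public static void main(String[] args) {
        // Values from -128 to 127 are always served from the cache.
        System.out.println(sameInstance(127));
        // Outside that range the answer depends on the configured cache upper bound.
        System.out.println(sameInstance(1000));
        // equals compares the numeric value, so it is true either way.
        System.out.println(Integer.valueOf(1000).equals(Integer.valueOf(1000)));
    }
}
```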

/**
 * Caches short strings, as a flywheel pattern, to reduce GC load. Just for this doc, to prevent leaks.
 * <p />
 * Simplistic, and on hash collisions just falls back to creating a new string, vs a full HashMap with Entry list.
 * That saves both having to create objects as hash keys, and running through the entry list, at the expense of
 * some more duplicates.
 */
private static String cacheString(final char[] charBuf, final String[] stringCache, final int start, final int count) {
    // limit (no cache):
    if (count > maxStringCacheLen)
        return new String(charBuf, start, count);
    if (count < 1)
        return "";

    // calculate hash:
    int hash = 0;
    int offset = start;
    for (int i = 0; i < count; i++) {
        hash = 31 * hash + charBuf[offset++];
    }

    // get from cache
    final int index = hash & stringCache.length - 1;
    String cached = stringCache[index];

    if (cached == null) { // miss, add
        cached = new String(charBuf, start, count);
        stringCache[index] = cached;
    } else { // hashcode hit, check equality
        if (rangeEquals(charBuf, start, count, cached)) { // hit
            return cached;
        } else { // hashcode conflict
            cached = new String(charBuf, start, count);
            stringCache[index] = cached; // update the cache, as recently used strings are more likely to show up again
        }
    }
    return cached;
}

There is another scenario where the same idea is used, this time to minimize garbage from newly created StringBuilder instances.

private static final Stack<StringBuilder> builders = new Stack<>();

/**
 * Maintains cached StringBuilders in a flyweight pattern, to minimize new StringBuilder GCs. The StringBuilder is
 * prevented from growing too large.
 * <p>
 * Care must be taken to release the builder once its work has been completed, with {@see #releaseBuilder}
*/
public static StringBuilder borrowBuilder() {
    synchronized (builders) {
        return builders.empty() ?
            new StringBuilder(MaxCachedBuilderSize) :
            builders.pop();
    }
}
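The snippet above shows only the borrowing half; the release method is mentioned in its Javadoc but not shown. Below is a sketch of how a borrow/release pair might fit together. It uses an `ArrayDeque` instead of `Stack`, and the limits `MAX_CACHED_BUILDER_SIZE` and `MAX_IDLE_BUILDERS` are assumed values for illustration, so treat it as one possible design rather than jsoup's implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BuilderPool {
    // Hypothetical limits for this sketch, not jsoup's actual values.
    private static final int MAX_CACHED_BUILDER_SIZE = 8 * 1024;
    private static final int MAX_IDLE_BUILDERS = 8;
    private static final Deque<StringBuilder> BUILDERS = new ArrayDeque<>();

    /** Hands out a pooled builder if one is idle, otherwise creates a new one. */
    static StringBuilder borrowBuilder() {
        synchronized (BUILDERS) {
            StringBuilder sb = BUILDERS.pollFirst();
            return sb != null ? sb : new StringBuilder(MAX_CACHED_BUILDER_SIZE);
        }
    }

    /** Returns the built string and gives the builder back to the pool. */
    static String releaseBuilder(StringBuilder sb) {
        String result = sb.toString();
        if (sb.length() > MAX_CACHED_BUILDER_SIZE) {
            return result; // grew too large: drop it rather than keep the memory pinned
        }
        sb.setLength(0); // clear the contents but keep the allocated capacity
        synchronized (BUILDERS) {
            if (BUILDERS.size() < MAX_IDLE_BUILDERS) {
                BUILDERS.addFirst(sb);
            }
        }
        return result;
    }
}
```

A caller borrows a builder, appends to it, and ends with `releaseBuilder(sb)` to obtain the string. Forgetting the release is harmless for correctness, but the pool then never gets to reuse that builder.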

In fact, CharacterReader and StringUtil are well worth studying closely, as they contain many useful tips and tricks that may inspire you.

5. Other improvements

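Use RandomAccessFile to read files, which made file reads about 2x faster. See #248 for more details
Node hierarchy refactoring. See #911 for more details
"Improvements largely from re-ordering the HtmlTreeBuilder methods based on analysis of various websites": I list this one here because it is very practical. A deeper understanding and observation of how the code actually runs will also give you some insights
Call list.toArray(0) rather than list.toArray(list.size()), that is, pass a zero-length array instead of a presized one. This is already used in some open source projects such as h2database, so I also proposed it in another Pull Request, #1158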
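The last item is written in shorthand; in real code the argument is an array, not a number. A small sketch of the two forms follows, with `ToArrayDemo` and both method names invented for this example.

```java
import java.util.List;

class ToArrayDemo {
    /** Zero-length form: the list allocates a correctly sized array itself. */
    static String[] zeroLength(List<String> list) {
        return list.toArray(new String[0]);
    }

    /** Presized form: the caller allocates the array up front. */
    static String[] presized(List<String> list) {
        return list.toArray(new String[list.size()]);
    }
}
```

Both methods return arrays with identical contents. Whether the zero-length form is actually faster depends on the JVM, so it is worth measuring with a profiler or benchmark before changing code for this reason alone.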

6. The unknowns

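Optimization never ends. There are still many tips and tricks that I have not discovered yet. If you find more inspiring ideas in Jsoup, I would appreciate it if you shared them with me. You can find my contact information in the left sidebar of this website, or simply email me at ny83427 at gmail.com.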

-To Be Continued-

Translated from: https://www.javacodegeeks.com/2019/02/secrets-jsoup-tricks-optimization.html
