jsoup Source Code Analysis

"SOL"

Category: jsoup. Tags: jsoup, HTML parsing

Article link: https://blog.csdn.net/weixin_40206723/article/details/90602379


What is jsoup?

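jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.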

Some of the entities in jsoup

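jsoup works on HTML, and HTML is a hierarchy of nested tags (that is, a tree), so jsoup uses a tree as its main data structure, which matches the way tags nest. The nodes (roughly, one per tag) are traversed depth-first.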

The core types in jsoup are Node, NodeTraversor, Parser, and Connection.

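Node defines the operations on a tree node and the attribute values stored in it; it is the base class of every node type.
In practice you usually work with Node subclasses such as Element and TextNode. Element wraps further information about the node; for example, it adds a Tag member that records the tag name and the kind of tag (block or inline).
Node itself holds the following fields: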

 Node parentNode;
 List<Node> childNodes;
 Attributes attributes;

NodeTraversor defines how Nodes are traversed. The traversal is depth-first and is implemented by the traverse() method below.

public void traverse(Node root) {
        Node node = root;
        int depth = 0;
        
        while (node != null) {
            visitor.head(node, depth);
            if (node.childNodeSize() > 0) {
                node = node.childNode(0);
                depth++;
            } else {
                while (node.nextSibling() == null && depth > 0) {
                    visitor.tail(node, depth);
                    node = node.parentNode();
                    depth--;
                }
                visitor.tail(node, depth);
                if (node == root)
                    break;
                node = node.nextSibling();
            }
        }
    }
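To see the order in which head() and tail() fire, here is a minimal, self-contained sketch of the same loop. ToyNode, Visitor, and trace are hypothetical stand-ins invented for this illustration; they are not jsoup classes, and the real Node tracks its sibling position differently.

```java
import java.util.ArrayList;
import java.util.List;

class TraversalSketch {
    /** Toy stand-in for a tree node: a name, a parent pointer, and ordered children. */
    static final class ToyNode {
        final String name;
        ToyNode parent;
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String name) { this.name = name; }

        /** Appends a child and returns this node so calls can be chained. */
        ToyNode add(ToyNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        /** Returns the next sibling, or null for a root or a last child. */
        ToyNode nextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
        }
    }

    /** head() is called when a node is first reached, tail() once its subtree is done. */
    interface Visitor {
        void head(ToyNode node, int depth);
        void tail(ToyNode node, int depth);
    }

    /** Iterative depth-first traversal with the same control flow as the excerpt above. */
    static void traverse(ToyNode root, Visitor visitor) {
        ToyNode node = root;
        int depth = 0;
        while (node != null) {
            visitor.head(node, depth);
            if (!node.children.isEmpty()) {
                node = node.children.get(0);   // descend to the first child
                depth++;
            } else {
                // Climb while the current node is the last child of its parent.
                while (node.nextSibling() == null && depth > 0) {
                    visitor.tail(node, depth);
                    node = node.parent;
                    depth--;
                }
                visitor.tail(node, depth);
                if (node == root) break;
                node = node.nextSibling();
            }
        }
    }

    /** Records every head/tail event as a string such as "<p>@2". */
    static List<String> trace(ToyNode root) {
        List<String> events = new ArrayList<>();
        traverse(root, new Visitor() {
            public void head(ToyNode n, int d) { events.add("<" + n.name + ">@" + d); }
            public void tail(ToyNode n, int d) { events.add("</" + n.name + ">@" + d); }
        });
        return events;
    }

    public static void main(String[] args) {
        ToyNode html = new ToyNode("html")
                .add(new ToyNode("head"))
                .add(new ToyNode("body").add(new ToyNode("p")));
        System.out.println(trace(html));
    }
}
```

For the tree html(head, body(p)), the events come out in document order: each node's head fires before its children are visited and its tail fires after them, which is what lets a visitor emit matching opening and closing tags without recursion.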

Parser defines how HTML is parsed and is the core of jsoup. Its fields are:

public class Parser {
    private static final int DEFAULT_MAX_ERRORS = 0; 
    private TreeBuilder treeBuilder;
    private int maxErrors = DEFAULT_MAX_ERRORS;
    private ParseErrorList errors;
}

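Parser delegates the actual parsing to a TreeBuilder.
TreeBuilder is an abstract class with two implementations, HtmlTreeBuilder and XmlTreeBuilder, one for each of the two document formats.
HtmlTreeBuilder calls the process() method of HtmlTreeBuilderState to build the parsed Document. That process() method defines how each section of markup (head, body, table, and so on) is parsed.
Document is simply a subclass of Element, so you can then use the select method to locate a particular tag node and read its content.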
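The idea of one process() method per parsing state can be sketched with a small enum state machine. This is only an illustration under simplifying assumptions: the state names, the string tokens, and the Builder class are hypothetical, and the real HtmlTreeBuilderState handles far more states and token types.

```java
import java.util.ArrayList;
import java.util.List;

class StateMachineSketch {
    /** Toy stand-in for a tree builder: holds the current state and a log of actions. */
    static final class Builder {
        State state = State.BEFORE_HEAD;
        final List<String> log = new ArrayList<>();

        /** Hands the token to whichever state is current. */
        boolean process(String token) {
            return state.process(token, this);
        }
    }

    /** Each constant decides what a token means while the builder is in that state. */
    enum State {
        BEFORE_HEAD {
            boolean process(String token, Builder b) {
                if (token.equals("<head>")) {
                    b.log.add("open head");
                    b.state = IN_HEAD;
                    return true;
                }
                // Anything else implies a head element: open it, then reprocess the token.
                b.log.add("open head (implied)");
                b.state = IN_HEAD;
                return b.process(token);
            }
        },
        IN_HEAD {
            boolean process(String token, Builder b) {
                if (token.equals("<title>")) {
                    b.log.add("title in head");
                    return true;
                }
                if (token.equals("</head>")) {
                    b.log.add("close head");
                    b.state = AFTER_HEAD;
                    return true;
                }
                // Content that cannot live in head closes it implicitly.
                b.log.add("close head (implied)");
                b.state = AFTER_HEAD;
                return b.process(token);
            }
        },
        AFTER_HEAD {
            boolean process(String token, Builder b) {
                if (token.equals("<body>")) {
                    b.log.add("open body");
                    b.state = IN_BODY;
                    return true;
                }
                b.log.add("open body (implied)");
                b.state = IN_BODY;
                return b.process(token);
            }
        },
        IN_BODY {
            boolean process(String token, Builder b) {
                b.log.add("body content: " + token);
                return true;
            }
        };

        abstract boolean process(String token, Builder b);
    }

    /** Feeds the tokens through a fresh builder and returns the action log. */
    static List<String> run(String... tokens) {
        Builder b = new Builder();
        for (String t : tokens) b.process(t);
        return b.log;
    }

    public static void main(String[] args) {
        System.out.println(run("<title>", "<p>"));
    }
}
```

Because a state can switch the builder to another state and hand the same token back, markup with missing head or body tags still ends up in a well-formed tree, which is the behavior that makes a state-per-section design attractive for tolerant HTML parsing.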

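Connection is the interface that defines an HTTP connection, including HTTP operations such as the get and post methods. Its implementation class, HttpConnection, adds request headers and other parameters on top of that.
A request is sent and its response received through the execute method: it reads the protocol name (http or https) from the URL, opens the connection, sends the request, checks the status, reads the response, and closes the connection (closing is not strictly necessary when keep-alive connections are used), which completes the request.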

The core code of execute is shown below:

static Response execute(Connection.Request req, Response previousResponse) throws IOException {
            // validate that the request is not null
            Validate.notNull(req, "Request must not be null");
            // validate the protocol
            String protocol = req.url().getProtocol();
            if (!protocol.equals("http") && !protocol.equals("https"))
                throw new MalformedURLException("Only http & https protocols supported");
            final boolean methodHasBody = req.method().hasBody();
            final boolean hasRequestBody = req.requestBody() != null;
            // validate the request body
            if (!methodHasBody)
                Validate.isFalse(hasRequestBody, "Cannot set a request body for HTTP method " + req.method());

            // open the connection (mimeBoundary is set up in code omitted from this excerpt)
            HttpURLConnection conn = createConnection(req);
            Response res;
            try {
                conn.connect();
                if (conn.getDoOutput())
                    // send the request
                    writePost(req, conn.getOutputStream(), mimeBoundary);

                int status = conn.getResponseCode();
                res = new Response(previousResponse);
                res.setupFromConnection(conn, previousResponse);
                res.req = req;

                if ((status < 200 || status >= 400) && !req.ignoreHttpErrors())
                        throw new HttpStatusException("HTTP error fetching URL", status, req.url().toString());

                // validate the Content-Type
                String contentType = res.contentType();
                if (contentType != null
                        && !req.ignoreContentType()
                        && !contentType.startsWith("text/")
                        && !xmlContentTypeRxp.matcher(contentType).matches()
                        )
                    throw new UnsupportedMimeTypeException("Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml",
                            contentType, req.url().toString());
                            
                if (conn.getContentLength() != 0 && req.method() != HEAD) { // -1 means unknown, chunked. sun throws an IO exception on 500 response with no content when trying to read body
                    InputStream bodyStream = null;
                    try {
                        bodyStream = conn.getErrorStream() != null ? conn.getErrorStream() : conn.getInputStream();
                        if (res.hasHeaderWithValue(CONTENT_ENCODING, "gzip"))
                            bodyStream = new GZIPInputStream(bodyStream);
						
                        // read the response data
                        res.byteData = DataUtil.readToByteBuffer(bodyStream, req.maxBodySize());
                    } finally {
                        if (bodyStream != null) bodyStream.close();
                    }
                } else {
                    res.byteData = DataUtil.emptyByteBuffer();
                }
            } finally {
                // per Java's documentation, this is not necessary, and precludes keepalives. However in practise,
                // connection errors will not be released quickly enough and can cause a too many open files error.
                conn.disconnect();
            }

            res.executed = true;
            return res;
        }
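The three checks that execute performs (protocol, status code, Content-Type) can be pulled out as small predicates to make the accepted ranges explicit. This is a sketch, not jsoup code: the class and method names are hypothetical, the regular expression is an assumed approximation of xmlContentTypeRxp, and java.net.URI is used here in place of the URL object the request carries.

```java
import java.net.URI;
import java.util.regex.Pattern;

class RequestChecksSketch {
    // Assumption: approximates the XML content-type pattern; the real one may differ.
    private static final Pattern XML_TYPE =
            Pattern.compile("(application|text)/\\w*\\+?xml.*");

    /** Only http and https pass, mirroring the protocol check in the excerpt. */
    static boolean isSupportedProtocol(String url) {
        String scheme = URI.create(url).getScheme();
        return "http".equals(scheme) || "https".equals(scheme);
    }

    /** Statuses outside [200, 400) are errors unless the caller chooses to ignore them. */
    static boolean isErrorStatus(int status) {
        return status < 200 || status >= 400;
    }

    /** text/* and XML-like types pass; a missing header skips the check, as in the excerpt. */
    static boolean isParsableContentType(String contentType) {
        if (contentType == null) return true;
        return contentType.startsWith("text/") || XML_TYPE.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSupportedProtocol("https://example.com/"));
        System.out.println(isErrorStatus(404));
        System.out.println(isParsableContentType("application/xhtml+xml"));
    }
}
```

Note that redirects (3xx) fall inside the accepted status range, so only statuses below 200 or from 400 upward raise HttpStatusException when ignoreHttpErrors is off.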
