Android WebKit: Analyzing How Incoming Network Data Streams Are Decoded

猿氏悟语

Article link: https://blog.csdn.net/chaoy1116/article/details/12649279

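As analyzed earlier, when data starts arriving we create the root of the DOM tree (Document -> HTMLDocument), the root of the render tree (RenderView), and the parser (DocumentParser).

But how is an actual web page parsed?
What happens to a network data stream on its way to HTMLTokenizer?
That is what the analysis below covers.

As before, let's start with a stack trace: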

#0  HTMLDocumentParser (this=0x2a2e8798, document=0x2a283a18, reportErrors=false) at external/webkit/Source/WebCore/html/parser/HTMLDocumentParser.cpp:77
#1  0x48f22a3a in WebCore::HTMLDocumentParser::create (this=0x2a283a18) at external/webkit/Source/WebCore/html/parser/HTMLDocumentParser.h:61
#2  WebCore::HTMLDocument::createParser (this=0x2a283a18) at external/webkit/Source/WebCore/html/HTMLDocument.cpp:281
#3  0x48eeb7ec in WebCore::Document::implicitOpen (this=0x2a283a18) at external/webkit/Source/WebCore/dom/Document.cpp:1987
#4  0x48f389e8 in WebCore::DocumentWriter::begin (this=0x2a3b16fc, url=<value optimized out>, dispatch=<value optimized out>, origin=0x0) at external/webkit/Source/WebCore/loader/DocumentWriter.cpp:149
#5  0x48d47f82 in WebCore::FrameLoader::receivedFirstData (this=0x2a146f20) at external/webkit/Source/WebCore/loader/FrameLoader.cpp:609
#6  0x48f3869c in WebCore::DocumentWriter::setEncoding (this=0x2a3b16fc, name=..., userChosen=<value optimized out>) at external/webkit/Source/WebCore/loader/DocumentWriter.cpp:243
#7  0x48f36c30 in WebCore::DocumentLoader::commitData (this=0x2a3b16a8,

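This stack should look familiar by now.
DocumentLoader::commitData -> DocumentWriter::setEncoding -> FrameLoader::receivedFirstData -> DocumentWriter::begin -> Document::implicitOpen -> HTMLDocument::createParser -> HTMLDocumentParser::create -> the HTMLDocumentParser constructor.

From the previous article we also know that once the parser has been created, data is passed to the parser module through parser->appendBytes(this, str, len, flush).
The following stack shows how that data reaches the parser: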

#0  HTMLTokenizer (this=0x2a4527b8, usePreHTML5ParserQuirks=false) at external/webkit/Source/WebCore/html/parser/HTMLTokenizer.cpp:106
#1  0x48f27f28 in WebCore::HTMLTokenizer::create (this=0x2a463b08, document=<value optimized out>) at external/webkit/Source/WebCore/html/parser/HTMLTokenizer.h:123
#2  HTMLPreloadScanner (this=0x2a463b08, document=<value optimized out>) at external/webkit/Source/WebCore/html/parser/HTMLPreloadScanner.cpp:155
#3  0x48f26b0e in WebCore::HTMLDocumentParser::pumpTokenizer (this=0x2a2e8798, mode=WebCore::HTMLDocumentParser::AllowYield) at external/webkit/Source/WebCore/html/parser/HTMLDocumentParser.cpp:293
#4  0x48f26d52 in WebCore::HTMLDocumentParser::append (this=0x2a2e8798, source=...) at external/webkit/Source/WebCore/html/parser/HTMLDocumentParser.cpp:367
#5  WebCore::HTMLDocumentParser::append (this=0x2a2e8798, source=...) at external/webkit/Source/WebCore/html/parser/HTMLDocumentParser.cpp:337
#6  0x48fbe2c4 in WebCore::DecodedDataDocumentParser::appendBytes (this=0x2a2e8798, writer=<value optimized out>, 
    data=0x2a413ef0 "b/gp/?tab=wm\"><div class=\"gbzi gbsi\" style=\"background-position:-32px -50px\"></div><span class=gbzn>Gmail</span></a><a onclick=gbar.logger.il(1,{t:25}); id=gb_25 class=\"gbza\" href=\"https://drive.googl"..., length=12398, shouldFlush=false) at external/webkit/Source/WebCore/dom/DecodedDataDocumentParser.cpp:54
#7  0x48f384ea in WebCore::DocumentWriter::addData (this=0x2a3b16fc, 
    str=0x2a413ef0 "b/gp/?tab=wm\"><div class=\"gbzi gbsi\" style=\"background-position:-32px -50px\"></div><span class=gbzn>Gmail</span></a><a onclick=gbar.logger.il(1,{t:25}); id=gb_25 class=\"gbza\" href=\"https://drive.googl"..., len=12398, flush=<value optimized out>) at external/webkit/Source/WebCore/loader/DocumentWriter.cpp:207

The call flow is:
WebCore::DocumentWriter::addData -> WebCore::DecodedDataDocumentParser::appendBytes -> WebCore::HTMLDocumentParser::append -> WebCore::HTMLDocumentParser::append -> WebCore::HTMLDocumentParser::pumpTokenizer -> HTMLPreloadScanner -> WebCore::HTMLTokenizer::create -> HTMLTokenizer

void DocumentWriter::addData(const char* str, int len, bool flush)
{
    if (len == -1)
        len = strlen(str);


    DocumentParser* parser = m_frame->document()->parser();
    if (parser)
        parser->appendBytes(this, str, len, flush);
}

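When a parser has already been created, the data is passed into it.
The call goes through DocumentParser's appendBytes, but appendBytes is only declared as a virtual function in DocumentParser; it is not implemented there.
The actual implementation lives in the DecodedDataDocumentParser class.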
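
The dispatch described above can be pictured with a small standalone sketch. The class names ParserBase, DecodingParser and HtmlParser are hypothetical stand-ins (they are not WebKit classes), and the "decoding" step simply copies bytes; the point is only to show a byte-level virtual entry point in the base class, implemented in a middle class that forwards decoded text to append() in the concrete parser.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for DocumentParser: only declares the byte-level entry point.
class ParserBase {
public:
    virtual ~ParserBase() = default;
    virtual void appendBytes(const char* data, std::size_t length) = 0;
};

// Hypothetical stand-in for DecodedDataDocumentParser: implements appendBytes
// and forwards the resulting text to append(), which subclasses provide.
class DecodingParser : public ParserBase {
public:
    void appendBytes(const char* data, std::size_t length) override {
        assert(data != nullptr || length == 0);
        if (length == 0)
            return;
        // A real implementation would run a text decoder here; this sketch copies the bytes.
        append(std::string(data, length));
    }

protected:
    virtual void append(const std::string& decoded) = 0;
};

// Hypothetical stand-in for HTMLDocumentParser: stores the decoded text.
class HtmlParser : public DecodingParser {
public:
    const std::string& input() const { return m_input; }

protected:
    void append(const std::string& decoded) override { m_input += decoded; }

private:
    std::string m_input;
};
```

A caller that only holds a ParserBase pointer still ends up in HtmlParser::append, which mirrors the addData -> appendBytes -> append chain seen in the stack.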

void DecodedDataDocumentParser::appendBytes(DocumentWriter* writer , const char* data, int length, bool shouldFlush)
{
    if (!length && !shouldFlush)
        return;


    TextResourceDecoder* decoder = writer->createDecoderIfNeeded();
    String decoded = decoder->decode(data, length);
    if (shouldFlush)
        decoded += decoder->flush();
    if (decoded.isEmpty())
        return;


    writer->reportDataReceived();


    append(decoded);
}

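Here a decoder is obtained first; it is created inside DocumentWriter.
The first time a page is opened, m_decoder is necessarily 0, so createDecoderIfNeeded will create the decoder.
The decoder is set up according to the type of content, for example whether it is text or an image. We only care about HTML pages here, so only the HTML case is examined.
Once the decoder exists, the following call is made:
String decoded = decoder->decode(data, length);
decoder is a TextResourceDecoder object, so let's look at the implementation in TextResourceDecoder.cpp.
decode first calls checkForHeadCharset. What does this function do?
It checks whether the HTML head contains encoding information; when an HTML page declares its encoding, the declaration normally sits inside the <head> element.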
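
To make the idea of charset detection concrete, here is a deliberately simplified sketch. The function sniffCharset is hypothetical and is not how TextResourceDecoder is implemented; it only searches the raw bytes for the first "charset=" and returns the token that follows, lowercased. A real implementation has to handle many more cases (BOMs, the XML declaration, malformed markup).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Simplified, hypothetical sketch: returns the lowercased value following the
// first "charset=" in the given bytes, or an empty string if none is found.
std::string sniffCharset(const std::string& bytes) {
    std::string lower;
    lower.reserve(bytes.size());
    for (unsigned char c : bytes)
        lower.push_back(static_cast<char>(std::tolower(c)));

    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos)
        return std::string();
    pos += key.size();

    // Skip an optional opening quote.
    if (pos < lower.size() && (lower[pos] == '"' || lower[pos] == '\''))
        ++pos;

    // The value ends at a quote, separator, tag end, or whitespace.
    std::size_t end = pos;
    while (end < lower.size()) {
        const unsigned char c = static_cast<unsigned char>(lower[end]);
        if (c == '"' || c == '\'' || c == ';' || c == '>' || c == '/' || std::isspace(c))
            break;
        ++end;
    }
    assert(end >= pos);
    return lower.substr(pos, end - pos);
}
```

Fed the meta line from the example page in this article, the sketch would report utf-8, which is the value that ends up being passed on as the encoding.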

What does the data being decoded actually look like? Here is an example:

data=0x2a5e52f8 "<!DOCTYPE HTML>\n<html>\n \n <head>\n <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n <meta http-equiv=\"Cache-Control\" content=\"no-cache\" />\n <"..., length=1303, shouldFlush=false) at external/webkit/Source/WebCore/dom/DecodedDataDocumentParser.cpp:46

In this example we can see charset=utf-8, so the detected encoding is handed to the TextResourceDecoder through TextResourceDecoder::setEncoding.
How is this encoding set? Again, a stack trace first:

#0  WebCore::TextResourceDecoder::checkForMetaCharset (this=0x2a9c26b0, 
    data=0x2aa36d68 "<!DOCTYPE html PUBLIC \"-//WAPFORUM//DTD XHTML Mobile 1.0//EN\" \"http://www.wapforum.org/DTD/xhtml-mobile10.dtd\">\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n<head>\r\n<meta http-equiv=\"Content-Type\" co"..., length=2293) at external/webkit/Source/WebCore/loader/TextResourceDecoder.cpp:579
#1  0x48d532ac in WebCore::TextResourceDecoder::checkForHeadCharset (this=0x2a9c26b0, 
    data=0x2aa36d68 "<!DOCTYPE html PUBLIC \"-//WAPFORUM//DTD XHTML Mobile 1.0//EN\" \"http://www.wapforum.org/DTD/xhtml-mobile10.dtd\">\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n<head>\r\n<meta http-equiv=\"Content-Type\" co"..., len=2293, movedDataToBuffer=<value optimized out>) at external/webkit/Source/WebCore/loader/TextResourceDecoder.cpp:575
#2  0x48d535b2 in WebCore::TextResourceDecoder::decode (this=0x2a9c26b0, 
    data=0x2aa36d68 "<!DOCTYPE html PUBLIC \"-//WAPFORUM//DTD XHTML Mobile 1.0//EN\" \"http://www.wapforum.org/DTD/xhtml-mobile10.dtd\">\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n<head>\r\n<meta http-equiv=\"Content-Type\" co"..., len=2293) at external/webkit/Source/WebCore/loader/TextResourceDecoder.cpp:638
#3  0x48fbe25c in WebCore::DecodedDataDocumentParser::appendBytes (this=0x2a41dc78, writer=0x2a46dc54, 
    data=0x2aa36d68 "<!DOCTYPE html PUBLIC \"-//WAPFORUM//DTD XHTML Mobile 1.0//EN\" \"http://www.wapforum.org/DTD/xhtml-mobile10.dtd\">\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n<head>\r\n<meta http-equiv=\"Content-Type\" co"..., length=2293, shouldFlush=false) at external/webkit/Source/WebCore/dom/DecodedDataDocumentParser.cpp:46
#4  0x48f384ea in WebCore::DocumentWriter::addData (this=0x2a46dc54, 
    str=0x2aa36d68 "<!DOCTYPE html PUBLIC \"-//WAPFORUM//DTD XHTML Mobile 1.0//EN\" \"http://www.wapforum.org/DTD/xhtml-mobile10.dtd\">\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n<head>\r\n<meta http-equiv=\"Content-Type\" co"..., len=2293, flush=<value optimized out>) at external/webkit/Source/WebCore/loader/DocumentWriter.cpp:207

The call chain is: WebCore::DocumentWriter::addData -> WebCore::DecodedDataDocumentParser::appendBytes -> WebCore::TextResourceDecoder::decode -> WebCore::TextResourceDecoder::checkForHeadCharset -> WebCore::TextResourceDecoder::checkForMetaCharset
In TextResourceDecoder::setEncoding in TextResourceDecoder.cpp we can see the encoding being stored: m_encoding = encoding;
Inspecting m_encoding in gdb gives:
(gdb) p m_encoding
$22 = {m_name = 0x491851ca "UTF-8", m_backslashAsCurrencySymbol = 92}
That completes the setting of the encoding. From here the decoding itself is handled by the UTF-8 codec (TextCodecUTF8.cpp).

Back in DecodedDataDocumentParser::appendBytes,
decoded now holds: $16 = {m_impl = {m_ptr = 0x2a41f8c8}}

At this point shouldFlush is false and decoded is not empty, so neither of the two if branches is taken.
After decoding, data has become the string decoded.
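
One reason a decoder exposes both decode() and flush() is that network chunks can end in the middle of a multi-byte character. The sketch below illustrates that general idea for UTF-8 only. StreamingUtf8Sketch is a hypothetical helper, not a WebKit class, and it does not convert to UTF-16 as a real decoder would; it merely holds back an incomplete trailing sequence until the next chunk arrives.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical streaming helper: releases only whole UTF-8 sequences and
// keeps an incomplete trailing sequence buffered for the next chunk.
class StreamingUtf8Sketch {
public:
    std::string decode(const std::string& chunk) {
        m_pending += chunk;
        const std::size_t complete = completePrefixLength(m_pending);
        assert(complete <= m_pending.size());
        std::string out = m_pending.substr(0, complete);
        m_pending.erase(0, complete);
        return out;
    }

    // End of stream: hand back whatever is still buffered. A real decoder
    // would typically emit a replacement character for a truncated sequence.
    std::string flush() {
        std::string out;
        out.swap(m_pending);
        return out;
    }

private:
    // Number of bytes in the sequence introduced by this lead byte.
    static std::size_t sequenceLength(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 1; // invalid lead byte: pass it through
    }

    // Length of the longest prefix that does not end inside a sequence.
    static std::size_t completePrefixLength(const std::string& s) {
        const std::size_t n = s.size();
        // Look back over at most four bytes for the start of the last sequence.
        for (std::size_t back = 1; back <= 4 && back <= n; ++back) {
            const std::size_t i = n - back;
            const unsigned char c = static_cast<unsigned char>(s[i]);
            if ((c & 0xC0) != 0x80) // ASCII or lead byte
                return (back >= sequenceLength(c)) ? n : i;
        }
        return n; // only continuation bytes seen: malformed, pass through
    }

    std::string m_pending;
};
```

With shouldFlush set to false, as in the trace above, only decode() runs; flush() matters for the final chunk of the stream.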

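What does writer->reportDataReceived() do? As the name suggests, it reports that new data has been received.
Reported to whom? It calls m_frame->document()->recalcStyle(Node::Force); so it invokes the document's recalcStyle with Force as the argument.
In Document::recalcStyle(StyleChange change), we focus on the Force case: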

    if (change == Force) {
        // style selector may set this again during recalc
        m_hasNodesWithPlaceholderStyle = false;


        RefPtr<RenderStyle> documentStyle = CSSStyleSelector::styleForDocument(this);
        StyleChange ch = diff(documentStyle.get(), renderer()->style());
        if (renderer() && ch != NoChange)
            renderer()->setStyle(documentStyle.release());
    }

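The first time a page is loaded, this branch is entered. Here the document-level style is determined through CSSStyleSelector based on the current page.
What does the CSSStyleSelector class do?

CSSStyleSelector is an important class. When it is constructed, it loads a set of default styles. These styles are compiled into the code in UserAgentStyleSheets.h and UserAgentStyleSheetsData.cpp; the data is stored as integers, so its meaning is not easy to read directly.

If the current style differs from the style of the loaded page, the renderer's style is set again.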
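
The "compute, compare, update only on change" pattern in the Force branch can be sketched in isolation. Every name below (StyleSketch, RendererSketch, diffSketch, applyDocumentStyle) is hypothetical, and the style is reduced to two fields. Unlike the quoted snippet, this sketch checks the renderer pointer before reading its style.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

enum StyleChangeSketch { NoChangeSketch, ChangedSketch };

// Hypothetical, heavily reduced style record.
struct StyleSketch {
    std::string fontFamily;
    int fontSize;
};

// Compares two styles field by field; two null pointers count as equal.
StyleChangeSketch diffSketch(const StyleSketch* a, const StyleSketch* b) {
    if (!a || !b)
        return a == b ? NoChangeSketch : ChangedSketch;
    const bool same = a->fontFamily == b->fontFamily && a->fontSize == b->fontSize;
    return same ? NoChangeSketch : ChangedSketch;
}

// Hypothetical renderer that counts how often its style is replaced.
struct RendererSketch {
    std::shared_ptr<StyleSketch> style;
    int setStyleCalls = 0;

    void setStyle(std::shared_ptr<StyleSketch> newStyle) {
        assert(newStyle != nullptr);
        style = std::move(newStyle);
        ++setStyleCalls;
    }
};

// Replaces the renderer's style only when the document style differs.
// Returns true if the style was actually replaced.
bool applyDocumentStyle(RendererSketch* renderer, std::shared_ptr<StyleSketch> documentStyle) {
    if (!renderer || !documentStyle)
        return false;
    if (diffSketch(documentStyle.get(), renderer->style.get()) == NoChangeSketch)
        return false;
    renderer->setStyle(std::move(documentStyle));
    return true;
}
```

Skipping setStyle when nothing changed avoids redundant work each time a new chunk of data is reported.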

Back in DecodedDataDocumentParser.cpp, DecodedDataDocumentParser::appendBytes next calls append(decoded);
When the DocumentParser was created for an HTML page, the concrete subclass was HTMLDocumentParser, which provides the implementation of append.
So in this example the processing happens in void HTMLDocumentParser::append(const SegmentedString& source).

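HTMLDocumentParser has a member HTMLInputStream m_input; here the String passed in as the argument is appended to HTMLDocumentParser::m_input, so HTMLDocumentParser now holds the decoded string.
At this point decoding is complete: we are back in HTMLDocumentParser, and the decoded input has been stored.
The actual operation is: m_input.appendToEnd(source);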
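
A rough mental model of an input stream that accepts chunks at the end while being consumed from the front is sketched below. SegmentedInputSketch and drain are hypothetical names; the real HTMLInputStream and SegmentedString are more elaborate, but the queue-of-segments idea is similar.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for the parser's input stream: decoded chunks are
// queued as separate segments and consumed one character at a time.
class SegmentedInputSketch {
public:
    void appendToEnd(const std::string& segment) {
        if (!segment.empty())
            m_segments.push_back(segment);
    }

    bool isEmpty() const { return m_segments.empty(); }

    char current() const {
        assert(!isEmpty());
        return m_segments.front()[m_offset];
    }

    void advance() {
        assert(!isEmpty());
        // Drop a segment once its last character has been consumed.
        if (++m_offset == m_segments.front().size()) {
            m_segments.pop_front();
            m_offset = 0;
        }
    }

private:
    std::deque<std::string> m_segments;
    std::size_t m_offset = 0;
};

// Consumes everything currently queued and returns it as one string.
std::string drain(SegmentedInputSketch& input) {
    std::string out;
    while (!input.isEmpty()) {
        out.push_back(input.current());
        input.advance();
    }
    return out;
}
```

Because chunks are queued rather than concatenated, new data can arrive while earlier segments are still being consumed, without copying what is already buffered.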

Inside HTMLDocumentParser::append, pumpTokenizerIfPossible(AllowYield) deserves extra attention.
Its implementation is very simple; it is just a thin wrapper around pumpTokenizer(mode);

void HTMLDocumentParser::pumpTokenizerIfPossible(SynchronousMode mode)
{
    if (isStopped() || m_treeBuilder->isPaused())
        return;


    // Once a resume is scheduled, HTMLParserScheduler controls when we next pump.
    if (isScheduledForResume()) {
        ASSERT(mode == AllowYield);
        return;
    }


    pumpTokenizer(mode);
}

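Before tokenizing, it checks whether the DocumentParser has stopped, whether the HTMLTreeBuilder is paused, and whether a resume has already been scheduled (in which case HTMLParserScheduler decides when to pump next).
If none of these apply, pumpTokenizer is called.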

The prototype of pumpTokenizer(mode) is:
void HTMLDocumentParser::pumpTokenizer(SynchronousMode mode)

It first declares an object: PumpSession session(m_pumpSessionNestingLevel);
It then enters a while loop whose condition depends on needsYield, so we need to see where that value is changed.

void checkForYieldBeforeToken(PumpSession& session)
{
    if (session.processedTokens > m_parserChunkSize) {
        // currentTime() can be expensive.  By delaying, we avoided calling
        // currentTime() when constructing non-yielding PumpSessions.
        if (!session.startTime)
            session.startTime = currentTime();


        session.processedTokens = 0;
        double elapsedTime = currentTime() - session.startTime;
        if (elapsedTime > m_parserTimeLimit)
            session.needsYield = true;
    }
    ++session.processedTokens;
}

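If session.startTime is 0, it is initialized to the current system time.
elapsedTime is then the current time minus session.startTime, and if it exceeds the parser time limit, needsYield is set to true.
This is a timeout check: once the limit is exceeded, the pump loop yields instead of continuing with the current pass.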
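
The yield check is easier to experiment with when the clock is injectable. The sketch below restates the same logic outside WebKit. PumpSessionSketch and YieldCheckerSketch are hypothetical names, and the clock is passed in as a std::function, which is an assumption made here for testability (the quoted code calls currentTime() directly).

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical, reduced version of the per-pump bookkeeping.
struct PumpSessionSketch {
    int processedTokens = 0;
    double startTime = 0;
    bool needsYield = false;
};

class YieldCheckerSketch {
public:
    YieldCheckerSketch(int chunkSize, double timeLimit, std::function<double()> clock)
        : m_chunkSize(chunkSize), m_timeLimit(timeLimit), m_clock(std::move(clock)) {
        assert(static_cast<bool>(m_clock));
    }

    // Same shape as the quoted function: the clock is consulted only once
    // per chunk of tokens, and a yield is requested when the budget is exceeded.
    void checkForYieldBeforeToken(PumpSessionSketch& session) const {
        if (session.processedTokens > m_chunkSize) {
            if (!session.startTime)
                session.startTime = m_clock();

            session.processedTokens = 0;
            const double elapsedTime = m_clock() - session.startTime;
            if (elapsedTime > m_timeLimit)
                session.needsYield = true;
        }
        ++session.processedTokens;
    }

private:
    int m_chunkSize;
    double m_timeLimit;
    std::function<double()> m_clock;
};
```

Checking the time only every few tokens keeps the common path cheap, which matches the comment in the quoted code about currentTime() being expensive.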

In the runs observed here parsing was very fast, so the value stayed false and execution entered the loop.

Inside the loop, m_tokenizer->nextToken(m_input.current(), m_token) is a very important operation.

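At this point HTMLDocumentParser::m_input already holds the decoded string, and the HTMLTokenizer object m_tokenizer was created in the HTMLDocumentParser constructor. So the tokenizer exists and the string to be parsed is in place; everything is ready for parsing to begin.

The parsing process itself is analyzed in the next part.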

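To summarize:
When DocumentWriter, acting as the bridge, passes data to the HTML parser module, the encoding of the page is first determined from the head and meta tags, the input data is converted from that encoding, and the converted data is handed to the HTML parser module for the next stage of parsing.
Parsing is also subject to a time limit. If a pump does not finish within the allotted time, the parser yields and HTMLParserScheduler decides when it resumes.
The first time a page is loaded, an initial document style is chosen through CSSStyleSelector.
If the current style differs from the style of the loaded page, the renderer's style is set again.
By the time HTMLDocumentParser::pumpTokenizer in HTMLDocumentParser.cpp runs, the decoded data stream is in place, and the tokenizer m_tokenizer has already been created in the HTMLDocumentParser constructor.

With all the prerequisites in place, the next step is to analyze how HTMLTokenizer parses the data stream.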