Chunked decoding and gzip decompression in HTTP

cswhl

Category: Networking. Tags: socket, http

Article link: https://blog.csdn.net/cswhl/article/details/110443095

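Chunked encoding

Benefits of chunked encoding

When a dynamic page is requested, the server cannot know the size of the content in advance, so it has to generate data and send it at the same time, transmitting the data in chunks (the server announces this to the browser with the response header 'Transfer-Encoding: chunked'). The browser does not have to wait until every byte has been downloaded either: as soon as it receives a chunk it can start parsing the page and fetching the resources the HTML refers to, including js, css and images.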
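To make the "generate while sending" idea concrete, here is a minimal sketch of the server side. The names `encode_chunked` and `generate_page` are hypothetical and only illustrate the framing; a real server would write each yielded piece to the socket as soon as it is produced.

```python
def encode_chunked(parts):
    """Yield each non-empty bytes piece framed as one chunk, then the terminator."""
    for part in parts:
        if not part:
            continue  # a zero-length chunk would be read as the last-chunk
        # chunk-size in hex, CRLF, chunk-data, CRLF
        yield b"%X\r\n" % len(part) + part + b"\r\n"
    yield b"0\r\n\r\n"  # last-chunk, empty trailer, closing CRLF


def generate_page():
    # Hypothetical stand-in for a handler that produces output piece by piece
    yield b"Wiki"
    yield b"pedia "
    yield b"in \r\n\r\nchunks."


wire = b"".join(encode_chunked(generate_page()))
print(wire)
```

The total length is never needed up front: each chunk only states its own size.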

Format of chunked encoding

The exact format is as follows (BNF grammar):

Chunked-Body    = *chunk          // zero or more chunks
                  last-chunk      // the last chunk
                  trailer         // trailer
                  CRLF            // end marker
chunk           = chunk-size [ chunk-extension ] CRLF
                  chunk-data CRLF
chunk-size      = 1*HEX
last-chunk      = 1*("0") [ chunk-extension ] CRLF
chunk-extension = *( ";" chunk-ext-name [ "=" chunk-ext-val ] )
chunk-ext-name  = token
chunk-ext-val   = token | quoted-string
chunk-data      = chunk-size(OCTET)
trailer         = *(entity-header CRLF)
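Explanation:

1 Chunked-Body is the message body after chunked encoding
The body consists of four parts: chunks, the last-chunk, the trailer and the end marker. A body may contain zero chunks, and there is no upper limit.
2 Each chunk declares its own length
A chunk always starts with a string of hexadecimal digits giving the length in bytes of the chunk-data that follows. If that string consists only of "0" characters, the chunk-size is 0 and the chunk is the last-chunk, which has no chunk-data.
3 The optional chunk-extension is agreed on by the two communicating parties; a receiver that does not understand it can ignore it.
4 The trailer holds extra header fields appended at the end, usually carrying metadata; after decoding, these fields can be appended to the existing header fields.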
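The chunk-size line described in points 2 and 3 can be parsed as in the sketch below. `parse_chunk_header` is a hypothetical helper, and splitting on ";" is a simplification that assumes no quoted-string extension value contains a semicolon.

```python
def parse_chunk_header(line):
    """Return (size, extensions) for one chunk-size line, CRLF included."""
    if not line.endswith(b"\r\n"):
        raise ValueError("chunk-size line must end with CRLF")
    size_part, *ext_parts = line[:-2].split(b";")
    size = int(size_part.strip(), 16)  # chunk-size = 1*HEX
    extensions = {}
    for ext in ext_parts:
        # chunk-ext-name [ "=" chunk-ext-val ]
        name, _, value = ext.strip().partition(b"=")
        extensions[name.decode("ascii")] = value.decode("ascii").strip('"')
    return size, extensions


print(parse_chunk_header(b"1a;name=value\r\n"))
print(parse_chunk_header(b"0A\r\n"))   # leading zero, but size is 10
print(parse_chunk_header(b"000\r\n"))  # size 0: this is the last-chunk
```

Only a size that parses to zero marks the last-chunk; a receiver that does not care about extensions can simply discard the second element.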

The decoding process given in RFC 2616 is as follows (pseudocode):

length := 0         // reset the length counter

read chunk-size, chunk-extension (if any) and CRLF      // read the first chunk-size line
while (chunk-size > 0)  // not the last-chunk
{
    read chunk-data and CRLF            // read chunk-size bytes of chunk-data, skip the CRLF
    append chunk-data to entity-body     // append this chunk-data to the entity-body
    length := length + chunk-size
    read chunk-size and CRLF          // read the next chunk's chunk-size and CRLF
}
read entity-header      // an entity-header has the form name:value CRLF; an empty one is just CRLF
while (entity-header not empty)   // i.e. not a line containing only CRLF
{
    append entity-header to existing header fields
    read entity-header
}
Content-Length := length      // write the decoded body length as the Content-Length field
Remove "chunked" from Transfer-Encoding  // and drop the chunked token from Transfer-Encoding
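The pseudocode above, including the trailer and header bookkeeping, might be transcribed into Python roughly as follows. This is a sketch under assumptions: `decode_chunked_message` is a hypothetical name, headers are a plain dict keyed with canonical capitalisation, and the `X-Checksum` trailer field is invented for the demonstration.

```python
from io import BytesIO


def decode_chunked_message(headers, stream):
    """Follow the RFC 2616 pseudocode: return (new_headers, entity_body)."""
    headers = dict(headers)
    body = bytearray()
    length = 0

    def read_size():
        line = stream.readline()
        if not line.endswith(b"\r\n"):
            raise ValueError("truncated chunk-size line")
        return int(line[:-2].split(b";")[0], 16)  # ignore any chunk-extension

    chunk_size = read_size()
    while chunk_size > 0:
        data = stream.read(chunk_size)
        if len(data) != chunk_size or stream.read(2) != b"\r\n":
            raise ValueError("truncated or malformed chunk")
        body += data
        length += chunk_size
        chunk_size = read_size()

    # Trailer: zero or more header lines, ended by an empty line
    line = stream.readline()
    while line not in (b"\r\n", b""):
        name, _, value = line.decode("latin-1").partition(":")
        headers[name.strip()] = value.strip()
        line = stream.readline()

    headers["Content-Length"] = str(length)
    codings = [c.strip() for c in headers.get("Transfer-Encoding", "").split(",")]
    codings = [c for c in codings if c and c.lower() != "chunked"]
    if codings:
        headers["Transfer-Encoding"] = ", ".join(codings)
    else:
        headers.pop("Transfer-Encoding", None)
    return headers, bytes(body)


raw = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\nX-Checksum: abc\r\n\r\n"
new_headers, body = decode_chunked_message(
    {"Transfer-Encoding": "chunked"}, BytesIO(raw))
print(new_headers, body)
```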

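Chunked decoding

The purpose of decoding is to merge the separate chunk-data pieces back into a single message body and to record the length of that body; the final value of length is the sum of the chunk-size values of all chunks.
The receiver reads data according to each chunk's length value and treats the chunk of length 0 as the end of the body.

Code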

# Simulated chunked data
chunk1 = b'4\r\nWiki\r\n'
chunk2 = b'6\r\npedia \r\n'
chunk3 = b'E\r\nin \r\n\r\nchunks.\r\n'
last_chunk = b'0\r\n\r\n'  # last-chunk, empty trailer, closing CRLF

chunks = chunk1 + chunk2 + chunk3 + last_chunk

# Decode a chunked body held in a bytes object
def decode_chunked(content):
    # Return the decoded body; stop early if the data is truncated or malformed
    new_content = b''
    offset = 0
    while True:
        try:
            pos = content.index(b'\r\n', offset)  # CRLF that ends the chunk-size line
            # Drop any chunk-extension after ';' before parsing the hex size
            chunk_size = int(content[offset:pos].split(b';')[0], 16)
        except ValueError as err:
            print(f'Did not reach the last chunk: {err}')
            break
        if chunk_size == 0:
            break  # last-chunk
        offset = pos + 2
        new_content += content[offset:offset + chunk_size]
        offset += chunk_size + 2  # skip the chunk-data and its trailing CRLF
    return new_content

print(decode_chunked(chunks).decode('utf-8'))

# Decode a chunked body read from a stream
from io import BytesIO

class chunk(object):
    '''Read a stream and return the chunked-decoded payload.'''
    def __init__(self, chunks_data):
        self.content = chunks_data

    def decode_chunked(self):
        # Decode chunk by chunk
        buffer = b''
        with BytesIO(self.content) as stream:
            while True:
                try:
                    line = self.readline(stream)
                    # Strip the trailing CRLF and any chunk-extension
                    length = int(line[:-2].split(b';')[0], 16)
                except ValueError as err:
                    print(f'Did not reach the last chunk: {err}')
                    return buffer
                if length == 0:
                    # last-chunk
                    return buffer
                buffer += stream.read(length)
                # Consume the CRLF that follows the chunk-data
                stream.read(2)

    def readline(self, input_stream):
        # Read one line terminated by '\r\n'
        buffer = b''
        while True:
            byte = input_stream.read(1)
            if not byte:
                # EOF: return whatever was read so far (possibly empty)
                return buffer
            buffer += byte
            if buffer.endswith(b'\r\n'):
                return buffer

print(chunk(chunks).decode_chunked())
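The class above wraps a complete byte string in BytesIO. On a real socket the data arrives in arbitrary pieces, so a decoder has to keep state between reads. The following is a minimal sketch of that idea; `IncrementalChunkedDecoder` is a hypothetical class, it ignores trailers, and it buffers a whole chunk before emitting it.

```python
class IncrementalChunkedDecoder:
    """Accepts bytes in arbitrary pieces, e.g. successive socket.recv() results."""

    def __init__(self):
        self._pending = b""      # bytes received but not yet consumed
        self._remaining = None   # size of the current chunk; None = need a size line
        self.body = b""
        self.finished = False

    def feed(self, data):
        self._pending += data
        while not self.finished:
            if self._remaining is None:
                pos = self._pending.find(b"\r\n")
                if pos == -1:
                    return  # size line incomplete; wait for more data
                size = int(self._pending[:pos].split(b";")[0], 16)
                self._pending = self._pending[pos + 2:]
                if size == 0:
                    self.finished = True  # trailer handling omitted in this sketch
                    return
                self._remaining = size
            # Need the chunk-data plus its trailing CRLF before consuming
            if len(self._pending) < self._remaining + 2:
                return
            self.body += self._pending[:self._remaining]
            self._pending = self._pending[self._remaining + 2:]
            self._remaining = None


wire = b"4\r\nWiki\r\n6\r\npedia \r\nE\r\nin \r\n\r\nchunks.\r\n0\r\n\r\n"
decoder = IncrementalChunkedDecoder()
for i in range(0, len(wire), 3):   # simulate recv() returning 3 bytes at a time
    decoder.feed(wire[i:i + 3])
print(decoder.finished, decoder.body)
```

Because chunk-data is consumed by length rather than by searching for CRLF, a CRLF inside the payload does not confuse the decoder.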

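How to decompress a gzip-encoded response body?

A gzip file also stores a byte string, just a compressed one; opened directly in vim it shows up as garbage.
Python's gzip module can decompress a gzip file into plain text, or decompress a gzip-format byte string back into the original bytes.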
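As a small self-contained illustration, the sketch below compresses a made-up HTML snippet and restores it in two ways: in one shot with `gzip.decompress`, and piece by piece with `zlib.decompressobj`, which is useful when the compressed data arrives gradually. The sample content and the 64-byte slice size are arbitrary assumptions.

```python
import gzip
import zlib

original = b"<html><body>hello gzip</body></html>" * 20
compressed = gzip.compress(original)

# gzip data starts with the magic bytes 0x1f 0x8b
print(compressed[:2] == b"\x1f\x8b")

# One-shot decompression of a complete gzip byte string
assert gzip.decompress(compressed) == original

# Streaming decompression: wbits=16 + MAX_WBITS tells zlib to expect a gzip header
decompressor = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
restored = b""
for i in range(0, len(compressed), 64):
    restored += decompressor.decompress(compressed[i:i + 64])
restored += decompressor.flush()
assert restored == original
```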

# Decompress a gzip-compressed response body into the original bytes
import gzip
gz_bytes = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\xa5V\xddo\xdb6\x10\x7f\xf7_\xc1r@\xba\x01\xa1\xe4:\x08\xe6\r\x96\x87\xcd[\xb7=u\x0f)\xb0\x01\x03\x02\x8a:Kt(R#)\x7f\x14\xfd\xe3w\x14%EM\xd2-\xa9\x1fl\x91\xc7\xdf}\xdf\x91\xb7z\xf5\xf3\xbb\xcd\xcd_\x7f\xfcB*_\xab\xf5l\x15>Dq]f\x144k\x1d%Bq\xe72j*A\xd73\x04\x00/\xd63BV^z\x05\xebw\x96\x0b\x05\xe47P\r\xd9\x80\xf6`Wi<\x99\x05P\r\x9e\x13Qq\xeb\xc0g\xf4\xfd\xcd[\xb6\xa4\xeb\xf1@\xf3\x1a2\xba\x97ph\x8c\xf5\xa8\xca\xa0\x00\xed\xefIl+}&\xcc\x1e\xec%9\xc8\xc2WY\x01{)\x80u\x9bK"\xb5\xf4\x92+\xe6\x04W\x90\xbd\x99\x88\xae\xbco\x18\xfc\xd3\xca}F\xffd\xef\x7fd\x1bS7\xdc\xcb\\\xc1D\x8f\x84\x0c\x8a\x12\x1e\x99T;\xde4J\nd0\x9a\xddH\x05\x1b\xa3\x8c\x9dp~\xb5\x15\xdb|\xcb\xff\x87\x13\xe1[YN\xd8R\xd7\xdc\nS\xd7F\xa7\xb91w\xccC\xdd(\xee!\xc5\xf0\xb2\xfe@\xd6ej\x98\x14\x01c\xcd\xc1\x81\x8db\x92c\xad\x9ea\xe9\xef5/\xe1\xcbUn\xf9>|\xd9\xe2\xdby\xd2\xe8\xf2\xa5\x0ej\xa3\x81\xa6\xf7L\x8d5\rX\x7f\xcah\x01NX\xd9\x04\xb6\t\xfeW\xf0^\xea\x928\xcf\xad\x87\x82\x94\xadD\xe0%)\x8chkDtj.\x89o\xbd\xb1\x98i<\xe1VT\xd2\x83\xf0\xad\r@\xae\x0bR\x1b\x0b\x83H\xb25\x96\xf4U\x89\xca\x8bVx\xd7\x810\x8e\xa1t\\BI\xb4OI}G,\xa8\x8c\xba\nkM\xb4\x9e\x04\xcf)\xa9,l3*\x9c\xc3\xb8`,\xdd\x10\x93\x04\xff\xe8C\xde\xc8\xe2O\r\xc6\xa6\x83\xa7!j\xc4\xc9\x0f\x80]\xf3\xe6\xbb\xc5\x11\x7f\x83\xcc/\xcc\x05J\xb8\xcf\xc5\xb3U/\x96G\xfc\x9d\xabz\xb1|\xb9\xea\xab\xc5\xf1\xea\\\x9f\xaf\x9et9\xd4\x1e0oZQ\xb1h\xc3\xe8\xed\x1c\xbd\x9d\x9f\xed\xed\xfceZ\xaf1\xbd\xd7g\xa7\xf7\xfa\x85\xbe.\xd1\xd7\xe5\xd9\xbe.\x07_W\xaf\x18\xc3\xabt\x87=e\xec\xf7\xbe\x82\x1a\x08cH\x9ft\x88?)p\x15\x80\x9f\xb6\x07W\x9e\xa7\xcbd\x9e\xcc\xd3\x03\xe4\xdd6\xa9\xa5N\xf0\x8c\x12Yt\xa0\xae\xdb:\r\xa0\x8bAI\'\xfe\xd3\x16|RA\x01\xb5aA,s\xd8\xf2l\x94\x1d\x0b\xcf\xc3\xd1\xa7\x83\x8agH\xc3\x80\x9e\xc3\x8e\xc1\xf5\xce[\xde$\xffe\xc7*\xder\xeb\x83\xd4\x859$\x98\x83R\x99\x9c+\xf2\xf1\xe3x\xa9%\x07\x8b\xee|\xfd\xba\xc7\x12g\x05\xa6\x11tZ\xf4\xe9\xda\xb94r%;G\xd7\x7f\x1f\xaf6i/\xf6\xf57\xaba\xb9J\xe3[<[\xe5\xa68\x8d\xef\xf4
\x8ea.X(\x1e~2\xadg\xe10\x96V!\xf7]V\xa2\xe8\x9f\x02\xfd\xb3\\\r\xef\x9fE\xe4\x0bj\xc0\x12k\xf0\x8d\xa59\xd7\x1a\xec\xc8\xd8\xc3#\xa4g@\x16\x83\x8e\xb0v\xbe \xf1\x80u#AF\'c\x02E\xfb\x07T\xaf\'\x8d\xe0~\x17\xac\xad\xb9\xd4\x18|\xbc\xd9q1\xd1\x1a\xe8\x9b\x91:j\xfd\xe4\xd1`\xd1RT\xf3$y\xe0\xc1\x87\x0cJcO]s8D? \x0c\xb8-\xf0\xf0\xe2\x14lxT\x10\xfa\x986\xa0\x9dQm\xd0\xe6\x98\xc3\x82\xc7\x05\xa2\x1f\xd3\xeeeC\x91sq7\x01?"\x8dX\xac\xc21\x19\xfd\x9b\'\xf5\xd6\x8cQ@L\xab%\x8eL\x0e\x07\xa3\x1e=\xd8w\xcb?\xdc\xb6V\xc5b\x8b\xc3\x05W*\tC_\xc8\xc7C\xbeQi:\xddc@\xe5\xbe+\xa8\xb8\x08\xab\xbe\x90\'\xfd\xb0\xe3{\x1e\xa94\xd67\xd6\xb4\x92\xb9Km\x98\xca,\x0c\xdf\xae\xc2\xc7\x9a~\xbe\xacP\x02\x9f\xe5\xe5\xee\xa4\x11\xd6}z\x964\r\xe9\x0c\xcd\xe7m\xeb<$X{\xa96\x1eG\x82\x1f\n\xd3\x15\x9a\xe9\xa6\x86pp!2\x9c0"\xfeb\xe7\xb2<\xbf\x10\x85\xce\xde\\4\x02oX\t\x17\x91\xf1&\x98\x88g\xc1\xc8\x0c\xc5\x86\xb1\xcb\x1a\xe7pT)C\xe1N\x8d\x9b\xad\xd2\xd0\x89\xdd\xa2\x9b\xb5\xff\x05\x00\x88\x9d4{\x0b\x00\x00'
print(gzip.decompress(gz_bytes))

# output
b'<!DOCTYPE html>\n<html lang="en-us" class="ohc">\n\n<head>\n  <title>Oracle Help Center</title>\n\n  <meta charset="UTF-8">\n  <meta name="viewport" content="viewport-fit=cover, width=device-width, initial-scale=1">\n  <meta http-equiv="X-UA-Compatible" content="ie=edge">\n  <meta name="msapplication-TileColor" content="#fcfbfa">\n  <meta name="msapplication-config" content="/sp_common/book-template/ohc-common/img/o-icon/browserconfig.xml">\n  <meta name="msapplication-TileImage" content="/sp_common/book-template/ohc-common/img/o-icon/favicon-270.png">\n  <meta name="msapplication-config" content="none"/>\n  <meta property="description" content="Getting started guides, documentation, tutorials, architectures, and more content for Oracle products and services." />\n  <link rel="shortcut icon" href="css/images/favicon.ico"/>\n  <link rel="icon" type="image/png" sizes="192x192" href="/sp_common/book-template/ohc-common/img/o-icon/favicon-192.png">\n  <link rel="icon" type="image/png" sizes="128x128" href="/sp_common/book-template/ohc-common/img/o-icon/favicon-128.png">\n  <link rel="icon" type="image/png" sizes="32x32" href="/sp_common/book-template/ohc-common/img/o-icon/favicon-32.png">\n  <link rel="apple-touch-icon" sizes="120x120" href="/sp_common/book-template/ohc-common/img/o-icon/favicon-120.png">\n  <link rel="apple-touch-icon" sizes="152x152" href="/sp_common/book-template/ohc-common/img/o-icon/favicon-152.png">\n  <link rel="apple-touch-icon" sizes="180x180" href="/sp_common/book-template/ohc-common/img/o-icon/favicon-180.png">\n<!-- injector:theme -->\n<link rel="stylesheet" href="css/alta/8.0.0/web/alta.min.css" id="css" />\n<!-- endinjector -->\n  <link rel="stylesheet" href="css/demo-alta-site-min.css" type="text/css" />\n  <link rel="stylesheet" href="css/app.css" type="text/css" />\n  <link rel="stylesheet" href="css/bootstrap.min.css" type="text/css" />\n<script>window.ohcglobal || document.write(\'<script 
src="/en/dcommon/js/global.js">\\x3C/script>\')</script></head>\n\n<body class="oj-web-applayout-body">\n  <div id="globalBody" class="oj-web-applayout-page">\n    <header role="banner" class="layout-header">\n      <ocom-u02 header-title="Help Center"></ocom-u02>\n    </header>\n    <div main="container" class="mainContainer">\n      <documentation-banner></documentation-banner>\n      <category-icons></category-icons>\n      <featured-products></featured-products>\n      <solutions-section></solutions-section>\n      <feedback-section></feedback-section>\n      <footer role="contentinfo">\n        <universal-footer products_az_url="/en/browseall.html"></universal-footer>\n      </footer>\n    </div>\n  </div>\n\n  <script type="text/javascript" src="js/libs/require/require.js"></script>\n  <script type="text/javascript" src="js/main.js"></script>\n  <script async="async" src="//consent.truste.com/notice?domain=oracle.com&c=teconsent&js=bb&cdn=1&pcookie&noticeType=bb&text=true" crossorigin=""></script>\n\n</body>\n\n</html>'

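Handling gzip + chunked

In this mode the response body is first compressed with gzip and then split into chunks with chunked encoding (the individual chunks are not compressed separately). After receiving the body, the client should first chunked-decode it into a complete gzip stream and then decompress that stream with gzip. Test file: 13_.htm

# Handling gzip + chunked: chunked-decode first, then gzip-decompress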
with open('testFile/13_.htm', 'rb') as f:
    cc = f.read()
    cz = chunk(cc).decode_chunked()
    print(gzip.decompress(cz))
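
For readers without the test file, the same order of operations can be checked with a self-contained round trip. This is only a sketch: `to_chunks` and `from_chunks` are hypothetical helpers, the page content is made up, and `from_chunks` assumes a complete, well-formed body without chunk extensions.

```python
import gzip


def to_chunks(data, size):
    """Frame data as chunks of at most `size` bytes."""
    out = b""
    for i in range(0, len(data), size):
        piece = data[i:i + size]
        out += b"%X\r\n" % len(piece) + piece + b"\r\n"
    return out + b"0\r\n\r\n"


def from_chunks(wire):
    """Undo to_chunks by following each chunk-size."""
    body, offset = b"", 0
    while True:
        pos = wire.index(b"\r\n", offset)
        size = int(wire[offset:pos], 16)
        if size == 0:
            return body
        offset = pos + 2
        body += wire[offset:offset + size]
        offset += size + 2  # skip chunk-data and its trailing CRLF


page = b"<!DOCTYPE html><html><body>gzip + chunked</body></html>"
wire = to_chunks(gzip.compress(page), 16)       # server: compress, then chunk
restored = gzip.decompress(from_chunks(wire))   # client: de-chunk, then decompress
print(restored == page)
```

Calling `gzip.decompress(wire)` directly would fail, because the chunk-size lines are interleaved with the gzip data.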
