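Studying Telegram's MTProto 2.0 protocol: notes on data framing and the Telethon source code
1. Underlying transports
Five underlying transports are supported (https://core.telegram.org/mtproto#mtproto-transport):
- TCP (code: telethon\network\connection\connection.py)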
- Websocket
- Websocket over HTTPS
- HTTP (code: telethon\network\connection\http.py)
- HTTPS
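2. Framing formats
https://core.telegram.org/mtproto/mtproto-transports
There are four framing formats.
After the connection is established, the first few bytes the client sends identify the framing used for everything that follows, so the server can tell the formats apart.
2.1 Abridged
The lightest framing: the length field takes as little as 1 byte and at most 4 bytes.
After the connection is established (TCP handshake complete), the client first sends a single marker byte, 0xef. The server does not put this byte at the start of its own response. A packet takes one of two forms:
+-+----...----+
|l| payload |
+-+----...----+
or
+-+---+----...----+
|h|len| payload |
+-+---+----...----+
The payload length is first divided by 4 (payloads are 4-byte aligned). If the result is less than 127 (0x01..0x7e), it is written as a single length byte, followed by the payload.
If the result is 127 or more, one byte is not enough, so 4 bytes are used: 0x7f followed by the length as a 3-byte little-endian integer. The codec lives in telethon\network\connection\tcpabridged.py: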
import struct

# PacketCodec is Telethon's codec base class (not shown here)


class AbridgedPacketCodec(PacketCodec):
    tag = b'\xef'
    obfuscate_tag = b'\xef\xef\xef\xef'

    def encode_packet(self, data):
        # The length is counted in 4-byte words
        length = len(data) >> 2
        if length < 127:
            length = struct.pack('B', length)
        else:
            length = b'\x7f' + int.to_bytes(length, 3, 'little')
        return length + data

    async def read_packet(self, reader):
        length = struct.unpack('<B', await reader.readexactly(1))[0]
        if length >= 127:
            # Three more little-endian bytes hold the real word count
            length = struct.unpack(
                '<i', await reader.readexactly(3) + b'\0')[0]
        return await reader.readexactly(length << 2)
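To check my reading of the two header forms, here is a minimal standalone sketch that works on in-memory bytes instead of Telethon's async reader. The function names abridged_encode and abridged_decode are my own and are not part of Telethon.

```python
import struct


def abridged_encode(payload: bytes) -> bytes:
    """Prefix a 4-byte-aligned payload with an abridged length header."""
    if len(payload) % 4 != 0:
        raise ValueError("payload must be a multiple of 4 bytes")
    words = len(payload) // 4
    if words < 127:
        header = struct.pack("B", words)                  # 1-byte form
    else:
        header = b"\x7f" + words.to_bytes(3, "little")    # 4-byte form
    return header + payload


def abridged_decode(frame: bytes) -> bytes:
    """Return the payload of a single abridged frame held in memory."""
    first = frame[0]
    if first < 127:
        words, offset = first, 1
    else:
        words, offset = int.from_bytes(frame[1:4], "little"), 4
    return frame[offset:offset + words * 4]


short = abridged_encode(b"\x00" * 8)       # 2 words, 1-byte header
long_ = abridged_encode(b"\x00" * 1024)    # 256 words, 4-byte header
```

Here short starts with the single byte 0x02, while long_ starts with 0x7f followed by 256 as three little-endian bytes.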
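2.2 Intermediate
A fairly lightweight framing that always uses 4 bytes for the length.
After the connection is established (TCP handshake complete), the client first sends four marker bytes, 0xeeeeeeee.
+----+----...----+
|len.| payload |
+----+----...----+
len is the payload length in bytes (not divided by 4), encoded little-endian.
The codec lives in telethon\network\connection\tcpintermediate.py: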
import struct


class IntermediatePacketCodec(PacketCodec):
    tag = b'\xee\xee\xee\xee'
    obfuscate_tag = tag

    def encode_packet(self, data):
        # 4-byte little-endian byte count, then the payload
        return struct.pack('<i', len(data)) + data

    async def read_packet(self, reader):
        length = struct.unpack('<i', await reader.readexactly(4))[0]
        return await reader.readexactly(length)
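As a rough illustration of how a receiver would walk an intermediate stream, the sketch below reads frames back from an in-memory buffer. It assumes a synchronous io.BytesIO in place of Telethon's async reader, and the helper names are hypothetical.

```python
import io
import struct


def intermediate_encode(payload: bytes) -> bytes:
    """4-byte little-endian length followed by the payload."""
    return struct.pack("<i", len(payload)) + payload


def intermediate_read_all(stream: io.BytesIO) -> list:
    """Read frames until the stream runs out of complete headers."""
    frames = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break
        (length,) = struct.unpack("<i", header)
        frames.append(stream.read(length))
    return frames


# The client sends the 0xeeeeeeee tag once, then any number of frames.
wire = (b"\xee\xee\xee\xee"
        + intermediate_encode(b"abcd")
        + intermediate_encode(b"12345678"))
stream = io.BytesIO(wire)
tag = stream.read(4)
frames = intermediate_read_all(stream)
```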
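2.3 Padded intermediate
According to the official docs, this variant adds obfuscating padding to get past ISP blocking.
The length field is always 4 bytes.
After the connection is established (TCP handshake complete), the client first sends four marker bytes, 0xdddddddd.
+----+----...----+----...----+
|tlen| payload | padding |
+----+----...----+----...----+
Per the docs, the padding is a random 0 to 15 bytes, so packet sizes are not fixed. The initial 4-byte tag, however, is itself easy to spot!
The code is in telethon\network\connection\tcpintermediate.py. Telethon's version only appends 0 to 3 random bytes, and the decoder strips them by taking the total length modulo 4, which relies on the payload being 4-byte aligned: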
import os
import random


class RandomizedIntermediatePacketCodec(IntermediatePacketCodec):
    """
    Data packets are aligned to 4bytes. This codec adds random bytes of size
    from 0 to 3 bytes, which are ignored by decoder.
    """
    tag = None
    obfuscate_tag = b'\xdd\xdd\xdd\xdd'

    def encode_packet(self, data):
        pad_size = random.randint(0, 3)
        padding = os.urandom(pad_size)
        return super().encode_packet(data + padding)

    async def read_packet(self, reader):
        packet_with_padding = await super().read_packet(reader)
        # Anything past the last 4-byte boundary is padding
        pad_size = len(packet_with_padding) % 4
        if pad_size > 0:
            return packet_with_padding[:-pad_size]
        return packet_with_padding
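The modulo trick in the decoder is easy to miss, so here is a small sketch of the same idea on in-memory bytes. It mirrors the 0 to 3 byte padding of the Telethon codec above (not the 0 to 15 bytes from the docs), and padded_encode / padded_decode are names I made up.

```python
import os
import random
import struct


def padded_encode(payload: bytes) -> bytes:
    """Append 0..3 random bytes, then prefix the padded length."""
    body = payload + os.urandom(random.randint(0, 3))
    return struct.pack("<i", len(body)) + body


def padded_decode(frame: bytes) -> bytes:
    """Drop whatever extends past the last 4-byte boundary."""
    (total,) = struct.unpack("<i", frame[:4])
    body = frame[4:4 + total]
    pad = total % 4
    return body[:-pad] if pad else body


aligned = os.urandom(16)
round_trips = all(
    padded_decode(padded_encode(aligned)) == aligned for _ in range(50)
)
# A payload that is not 4-byte aligned cannot survive this decoder.
unaligned_result = padded_decode(padded_encode(b"12345"))
```

The decoder cannot tell padding from payload by itself; it only works because every real payload length is a multiple of 4.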
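2.4 Full
Full is the most basic framing. It has no marker tag, and the framing overhead is 12 bytes.
+----+----+----...----+----+
|len.|seq.| payload |crc.|
+----+----+----...----+----+
The length is 4 bytes, the sequence number 4 bytes and the checksum (CRC32) 4 bytes.
Note: the length covers the whole frame, that is, payload length + 12.
Seqno is the packet sequence number on the current TCP connection. It counts 0, 1, 2, ... and is reset on every reconnect. It is not the same thing as the MTProto message sequence number.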
import struct
from zlib import crc32

# InvalidChecksumError is one of Telethon's error classes (not shown here)


class FullPacketCodec(PacketCodec):
    tag = None

    def __init__(self, connection):
        super().__init__(connection)
        self._send_counter = 0  # Important or Telegram won't reply

    def encode_packet(self, data):
        # https://core.telegram.org/mtproto#tcp-transport
        # total length, sequence number, packet and checksum (CRC32)
        length = len(data) + 12
        data = struct.pack('<ii', length, self._send_counter) + data
        crc = struct.pack('<I', crc32(data))
        self._send_counter += 1
        return data + crc

    async def read_packet(self, reader):
        packet_len_seq = await reader.readexactly(8)  # 4 and 4
        packet_len, seq = struct.unpack('<ii', packet_len_seq)
        body = await reader.readexactly(packet_len - 8)
        checksum = struct.unpack('<I', body[-4:])[0]
        body = body[:-4]
        # The CRC covers the length, the sequence number and the payload
        valid_checksum = crc32(packet_len_seq + body)
        if checksum != valid_checksum:
            raise InvalidChecksumError(checksum, valid_checksum)
        return body
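A quick way to convince myself of the "+12" and of what the CRC covers is to build two frames by hand. This is a minimal sketch using only the standard library; FullFramer and full_decode are hypothetical names, and a plain ValueError stands in for Telethon's InvalidChecksumError.

```python
import struct
from zlib import crc32


class FullFramer:
    """Keeps the per-connection send counter used by the Full framing."""

    def __init__(self):
        self.send_seq = 0

    def encode(self, payload: bytes) -> bytes:
        head = struct.pack("<ii", len(payload) + 12, self.send_seq)
        self.send_seq += 1
        return head + payload + struct.pack("<I", crc32(head + payload))


def full_decode(frame: bytes):
    """Return (seq, payload) or raise ValueError on a CRC mismatch."""
    length, seq = struct.unpack("<ii", frame[:8])
    (checksum,) = struct.unpack("<I", frame[length - 4:length])
    if crc32(frame[:length - 4]) != checksum:
        raise ValueError("CRC32 mismatch")
    return seq, frame[8:length - 4]


framer = FullFramer()
first = framer.encode(b"ping")
second = framer.encode(b"pong")
```

Decoding first and second gives sequence numbers 0 and 1; flipping any payload byte makes full_decode raise.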
3. Related notes
3.1 Quick ACK
These MTProto transport protocols have support for quick acknowledgment.
In this case, the client sets the highest-order length bit in the query packet, and the server responds with a special 4 bytes as a separate packet.
They are the 32 higher-order bits of SHA256 of the encrypted portion of the packet prepended by 32 bytes from the authorization key (the same hash as computed for verifying the message key), with the most significant bit set to make clear that this is not the length of a regular server response packet; if the abridged version is used, bswap is applied to these four bytes.
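The bit manipulation described above can be sketched as follows. This assumes a framing with a 4-byte little-endian length field (as in intermediate); I have not verified it against a live server, and QUICK_ACK_FLAG, with_quick_ack and classify are names of my own.

```python
import struct

QUICK_ACK_FLAG = 0x80000000  # highest-order bit of a 32-bit value


def with_quick_ack(length: int) -> bytes:
    """Length field with the highest-order bit set to request a quick ack."""
    return struct.pack("<I", length | QUICK_ACK_FLAG)


def classify(first_four: bytes):
    """Tell a quick-ack token apart from a regular length field."""
    (value,) = struct.unpack("<I", first_four)
    if value & QUICK_ACK_FLAG:
        return ("quick_ack", value)
    return ("length", value)


request_header = with_quick_ack(16)
```

The computation of the token itself (the SHA256 over part of the auth key and the encrypted data) is left out here.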
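3.2 Errors
When something goes wrong at the transport level (missing auth key, flooding, and so on), the server sends a 4-byte signed integer as the error code; its absolute value identifies the specific error.
For example, error code 403 corresponds to cases where the HTTP protocol would have returned the matching HTTP error.
Error 404 (auth key not found): the DC could not find the corresponding authorization key.
Error 429 (transport flood): too many TCP connections were opened on the same IP.
Error 444 (invalid DC): returned while creating an auth key or connecting to an MTProxy when the DC ID is invalid.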
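On the client side, telling such an error apart from a normal payload could look roughly like this. The sketch assumes the code arrives as a negative little-endian 32-bit integer that makes up the whole payload; the table and the function name are my own.

```python
import struct

# Only the codes listed above; anything else is reported as unknown.
TRANSPORT_ERRORS = {
    404: "auth key not found",
    429: "transport flood",
    444: "invalid DC",
}


def parse_transport_error(payload: bytes):
    """Return (code, description) for a 4-byte negative code, else None."""
    if len(payload) != 4:
        return None
    (value,) = struct.unpack("<i", payload)
    if value >= 0:
        return None
    code = -value
    return code, TRANSPORT_ERRORS.get(code, "unknown transport error")


result = parse_transport_error(struct.pack("<i", -404))
```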
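3.3 Obfuscation
The docs say this technique must be used with the WebSocket transport: https://core.telegram.org/mtproto/transports#websocket
Its main purpose is to get past ISP gateways. After the TCP connection is established and before any MTProto-framed data is sent, the client sends a block of random bytes to confuse observers, subject to these constraints:
- The first byte must not be 0xef.
- The first four bytes must not be 0xdddddddd or 0xeeeeeeee.
- They must not be an HTTP method either: POST, GET, HEAD, and so on.
- The next four bytes (offsets 4-8) must not be 0x00000000, because the Full framing carries sequence number 0 in that position.
- The framing tag goes at offset 56 and is 4 bytes long. A tag shorter than 4 bytes is repeated to fill the field (0xef becomes 0xefefefef). The tag must not be sent again afterwards.
- Encryption key: offsets 8-40; encryption IV: offsets 40-56.
- Reverse the 48 bytes at offsets 8-56 to obtain the decryption key and IV (see the next item).
- Decryption key: the first 32 bytes of the reversed block; decryption IV: the remaining 16 bytes (equivalently, offsets 8-40 and 40-56 of the fully reversed 64-byte header). This relies on how AES-256-CTR works; I do not fully understand why it works yet!
- All subsequent data is encrypted before sending and decrypted after receiving.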
import os

# AESModeCTR is Telethon's AES-CTR helper (not shown here)


class ObfuscatedIO:
    header = None

    def __init__(self, connection):
        self._reader = connection._reader
        self._writer = connection._writer
        (self.header,
         self._encrypt,
         self._decrypt) = self.init_header(connection.packet_codec)

    @staticmethod
    def init_header(packet_codec):
        # Obfuscated messages secrets cannot start with any of these
        keywords = (b'PVrG', b'GET ', b'POST', b'\xee\xee\xee\xee')
        while True:
            random = os.urandom(64)
            if (random[0] != 0xef and
                    random[:4] not in keywords and
                    random[4:8] != b'\0\0\0\0'):
                break

        random = bytearray(random)
        # Take the 48 bytes at offsets [8, 56) in reverse order
        random_reversed = random[55:7:-1]  # Reversed (8, len=48)

        # Encryption has "continuous buffer" enabled
        # Set up the encryption and decryption keys and IVs
        encrypt_key = bytes(random[8:40])
        encrypt_iv = bytes(random[40:56])
        decrypt_key = bytes(random_reversed[:32])
        decrypt_iv = bytes(random_reversed[32:48])

        encryptor = AESModeCTR(encrypt_key, encrypt_iv)
        decryptor = AESModeCTR(decrypt_key, decrypt_iv)

        # Write the codec's tag (0xefefefef for abridged)
        random[56:60] = packet_codec.obfuscate_tag
        # Encrypt the whole header but keep only bytes 56..63 of the
        # ciphertext; the first 56 bytes stay in the clear so the receiver
        # can derive the keys and then recover the tag
        random[56:64] = encryptor.encrypt(bytes(random))[56:64]
        return (random, encryptor, decryptor)

    async def readexactly(self, n):
        # CTR mode: applying the keystream again decrypts
        return self._decrypt.encrypt(await self._reader.readexactly(n))

    def write(self, data):
        self._writer.write(self._encrypt.encrypt(data))
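Note: the implicit reasoning is that TCP framing usually follows a TLV-like layout. If the format keeps changing and the leading header is random, the traffic may slip past some monitoring.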
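To make the offsets concrete, the sketch below checks the header constraints and splits the key material without doing any AES (the standard library has no AES, so the encryption step is left out). The function names are my own, and the forbidden-prefix list follows the bullet list above rather than Telethon's keywords tuple.

```python
FORBIDDEN_PREFIXES = (
    b"\xdd\xdd\xdd\xdd", b"\xee\xee\xee\xee", b"POST", b"GET ", b"HEAD",
)


def header_is_acceptable(header: bytes) -> bool:
    """Apply the constraints on the first 8 bytes of the 64-byte header."""
    return (len(header) == 64
            and header[0] != 0xEF
            and header[:4] not in FORBIDDEN_PREFIXES
            and header[4:8] != b"\x00\x00\x00\x00")


def split_key_material(header: bytes) -> dict:
    """Derive both key/IV pairs from offsets 8..56 of the header."""
    forward = header[8:56]
    backward = forward[::-1]          # same bytes as header[55:7:-1]
    return {
        "encrypt_key": forward[:32],
        "encrypt_iv": forward[32:],
        "decrypt_key": backward[:32],
        "decrypt_iv": backward[32:],
    }


header = bytes(range(64))             # predictable bytes, for inspection only
keys = split_key_material(header)
```

With header = bytes(range(64)) the encryption key is bytes 8..39 in order, and the decryption key is bytes 55 down to 24.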
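4. Payload format
There are two kinds of payload:
- During the DH key exchange, a separate format is used, which can be regarded as plaintext. Code: telethon\network\mtprotoplainsender.py
- During normal communication, the MTProto 2.0 encoding is used. Code: telethon\network\mtprotosender.py
After the TCP connection is established, the client checks its session. If an authorization key already exists, the client uses it directly; if not, it has to log in and run the key exchange. Code: telethon\network\authenticator.py
The key is called the Auth Key because, besides encryption, it is also used for identity authentication. Once generated, it is tied to the device (client) rather than to the account or the TCP connection. https://core.telegram.org/mtproto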
+----+------+----+------+----...----+
|auth-key-id| msg-key | data |
+----+------+----+------+----...----+
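auth-key-id is 64 bits, i.e. 8 bytes: the lower-order 8 bytes of the 20-byte SHA1 of the Auth Key.
The auth key is computed from the DH key exchange and stored by the server. The ID can be used to uniquely identify the user, since an 8-byte space is large enough.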
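A minimal sketch of the ID computation and of splitting the envelope from the diagram above. I read "lower-order 8 bytes" as the last 8 bytes of the digest; the 256-byte key size and the 16-byte msg-key width are my assumptions from the MTProto 2.0 description, and both function names are hypothetical.

```python
import hashlib


def auth_key_id(auth_key: bytes) -> bytes:
    """Last 8 bytes of SHA1(auth_key)."""
    return hashlib.sha1(auth_key).digest()[-8:]


def split_envelope(packet: bytes):
    """Split into auth-key-id (8 bytes), msg-key (16 bytes), encrypted data."""
    return packet[:8], packet[8:24], packet[24:]


fake_key = bytes(256)                 # all-zero stand-in, never a real key
key_id = auth_key_id(fake_key)
parts = split_envelope(key_id + b"\x11" * 16 + b"encrypted-data")
```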
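Once data is received it has to be decrypted; the result can then be deserialized directly according to the TLObject encoding. Code: telethon\network\mtprotostate.py
If the key ID in a packet is wrong, the server replies with an error. Related errors include: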
AUTH_BYTES_INVALID,400,The provided authorization is invalid
AUTH_KEY_DUPLICATED,406,"The authorization key (session file) was used under two different IP addresses simultaneously, and can no longer be used. Use the same session exclusively, or use different sessions"
AUTH_KEY_INVALID,401,The key is invalid
AUTH_KEY_PERM_EMPTY,401,"The method is unavailable for temporary authorization key, not bound to permanent"
AUTH_KEY_UNREGISTERED,401,The key is not registered in the system
AUTH_RESTART,500,Restart the authorization process
AUTH_TOKEN_ALREADY_ACCEPTED,400,The authorization token was already used
AUTH_TOKEN_EXPIRED,400,The provided authorization token has expired and the updated QR-code must be re-scanned
AUTH_TOKEN_INVALID,400,An invalid authorization token was provided
AUTH_TOKEN_INVALID2,400,An invalid authorization token was provided
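For example, before logging in, the Telethon client still issues an RPC to query the current user's state, and receives:
AUTH_KEY_UNREGISTERED,401,The key is not registered in the system
which means the login flow has to be run.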
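5. MTProxy protocol format
This is similar to the obfuscation format in section 3.3, but the padding and the encryption differ: https://core.telegram.org/mtproto/mtproto-transports#transport-obfuscation
Configuring a proxy requires an IP, a port and a 16-byte secret. The secret may also be 17 bytes, in which case the extra first byte selects a specific framing; it is usually 0xdd, meaning the padded intermediate framing must be used afterwards.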
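With a plain 16-byte secret, the secret itself does not fix the framing: the code below writes whatever packet_codec.obfuscate_tag is, so the codec of the connection decides. (Full defines no obfuscate_tag in the code shown earlier, so it does not look usable here.)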
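Another difference: the random header is fixed at 64 bytes, and bytes [60, 62) carry the DC ID as a 2-byte little-endian integer.
The encryption and decryption keys are also derived differently from section 3.3! Each key is the SHA256 of the 32 key bytes followed by the secret, so the secret is what grants access to the service.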
import hashlib
import os

# AESModeCTR is Telethon's AES-CTR helper (not shown here)


class MTProxyIO:
    """
    It's very similar to tcpobfuscated.ObfuscatedIO, but the way
    encryption keys, protocol tag and dc_id are encoded is different.
    """
    header = None

    def __init__(self, connection):
        self._reader = connection._reader
        self._writer = connection._writer
        (self.header,
         self._encrypt,
         self._decrypt) = self.init_header(
             connection._secret, connection._dc_id, connection.packet_codec)

    @staticmethod
    def init_header(secret, dc_id, packet_codec):
        # Validate
        is_dd = (len(secret) == 17) and (secret[0] == 0xDD)
        is_rand_codec = issubclass(
            packet_codec, RandomizedIntermediatePacketCodec)
        if is_dd and not is_rand_codec:
            raise ValueError(
                "Only RandomizedIntermediate can be used with dd-secrets")
        secret = secret[1:] if is_dd else secret
        if len(secret) != 16:
            raise ValueError(
                "MTProxy secret must be a hex-string representing 16 bytes")

        # Obfuscated messages secrets cannot start with any of these
        keywords = (b'PVrG', b'GET ', b'POST', b'\xee\xee\xee\xee')
        while True:
            random = os.urandom(64)
            # The excerpt as copied read random[4:4], an empty slice that
            # never equals four zero bytes; [4:8] matches ObfuscatedIO
            if (random[0] != 0xef and
                    random[:4] not in keywords and
                    random[4:8] != b'\0\0\0\0'):
                break

        random = bytearray(random)
        random_reversed = random[55:7:-1]  # Reversed (8, len=48)

        # Encryption has "continuous buffer" enabled
        # Unlike ObfuscatedIO, each key is hashed together with the secret
        encrypt_key = hashlib.sha256(
            bytes(random[8:40]) + secret).digest()
        encrypt_iv = bytes(random[40:56])
        decrypt_key = hashlib.sha256(
            bytes(random_reversed[:32]) + secret).digest()
        decrypt_iv = bytes(random_reversed[32:48])

        encryptor = AESModeCTR(encrypt_key, encrypt_iv)
        decryptor = AESModeCTR(decrypt_key, decrypt_iv)

        random[56:60] = packet_codec.obfuscate_tag
        # Fill in the DC ID at offsets 60..61
        dc_id_bytes = dc_id.to_bytes(2, "little", signed=True)
        random = random[:60] + dc_id_bytes + random[62:]
        random[56:64] = encryptor.encrypt(bytes(random))[56:64]
        return (random, encryptor, decryptor)

    async def readexactly(self, n):
        # CTR mode: applying the keystream again decrypts
        return self._decrypt.encrypt(await self._reader.readexactly(n))

    def write(self, data):
        self._writer.write(self._encrypt.encrypt(data))
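Judging by the code comments, the authors are not fully satisfied with this class. I have not tested whether it actually works; Telethon will presumably keep improving it.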
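The parts of the MTProxy handshake that need no AES can be tried in isolation. The sketch below follows the logic of init_header above using only hashlib; the function names are my own, and the header used in the example is a predictable byte pattern rather than real random data.

```python
import hashlib


def parse_proxy_secret(hex_secret: str):
    """Return (is_dd, 16-byte secret) from a 16- or 17-byte hex secret."""
    raw = bytes.fromhex(hex_secret)
    is_dd = len(raw) == 17 and raw[0] == 0xDD
    secret = raw[1:] if is_dd else raw
    if len(secret) != 16:
        raise ValueError("secret must be 16 bytes, or 17 with a dd prefix")
    return is_dd, secret


def derive_proxy_keys(header: bytes, secret: bytes):
    """Return (encrypt_key, encrypt_iv, decrypt_key, decrypt_iv)."""
    forward = header[8:56]
    backward = forward[::-1]
    encrypt_key = hashlib.sha256(forward[:32] + secret).digest()
    decrypt_key = hashlib.sha256(backward[:32] + secret).digest()
    return encrypt_key, forward[32:], decrypt_key, backward[32:]


def dc_id_field(dc_id: int) -> bytes:
    """The two bytes written at offsets 60..61 of the header."""
    return dc_id.to_bytes(2, "little", signed=True)


is_dd, secret = parse_proxy_secret("dd" + "00" * 16)
enc_key, enc_iv, dec_key, dec_iv = derive_proxy_keys(bytes(range(64)), secret)
```

Only the keys change compared with section 3.3; the IVs are still taken directly from the header bytes.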