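A bitstream is simply a stream of binary digits, for example 10001010 00010110 10101110 10101000. When shown on screen it is usually converted to hexadecimal, most likely because a binary display takes up too much space.
To display it in hexadecimal, the bits are taken eight at a time, and each group of eight becomes two hexadecimal digits. The string above becomes 8A 16 AE A8. Adding the prefix 0x, which is how C/C++ marks a hexadecimal literal, gives 0x 8A 16 AE A8 (the spaces in between are optional).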
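The grouping described above can be sketched in Python. This is only an illustration: the variable names are made up here, and the bit string is the example from the text with its four groups separated by spaces.

```python
# Example bitstream from the text, written as four groups of eight bits.
bits = "10001010 00010110 10101110 10101000"

# Each group of eight bits becomes one byte (two hexadecimal digits).
data = bytes(int(group, 2) for group in bits.split())

hex_text = " ".join(f"{byte:02X}" for byte in data)
print(hex_text)                    # 8A 16 AE A8
print("0x" + data.hex().upper())   # 0x8A16AEA8
```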
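Whether the data is written in hexadecimal or binary, it must be paired with an encoding. The commonly used encodings include GSM 7-bit, ASCII, and UTF-8/16/32. Only with an encoding do these binary strings mean anything to a person: the encoding is what lets the machine turn them into human-readable text.
In a binary string such as 00001111, the leading zeros are sometimes omitted, leaving 1111; the Windows calculator does this, for example. When fewer than 8 bits are shown, pad the high-order side with zeros. Bits are numbered with the low-order bit on the right and the high-order bit on the left, so counting the string above from right to left gives bit0 to bit7, that is, bit0 = 1 and bit7 = 0.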
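As a small sketch of this bit numbering (the helper name `bit` is hypothetical, chosen only for this example), the following pads 1111 back to eight bits and reads bit0 and bit7:

```python
def bit(value: int, n: int) -> int:
    """Return bit n of value, where bit 0 is the rightmost (lowest) bit."""
    return (value >> n) & 1

value = int("1111", 2)          # the string with its leading zeros omitted
padded = format(value, "08b")   # pad on the high-order side to 8 bits
print(padded)                   # 00001111
print(bit(value, 0), bit(value, 7))  # 1 0
```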
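For GSM 7-bit encoding, once you have the hexadecimal representation you can find the corresponding character as in the following example.
Example 1:
Given the hexadecimal value: 5B
Convert to binary: 01011011
Look it up in Appendix Table 2 (b7 b6 b5 = 101 selects column 5, b4 b3 b2 b1 = 1011 selects row 11): the character is Ä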
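The table lookup in Example 1 can be sketched as a string index. The string below is my own transcription of the basic GSM 7 bit default alphabet (3GPP TS 23.038), one row per column of the table; the names `GSM7_BASIC` and `gsm7_char` are hypothetical, and the escape/extension mechanism is not handled.

```python
# Basic GSM 7 bit default alphabet, indexed by the 7-bit code (0..127).
# Each line holds the 16 characters of one table column (b7 b6 b5).
GSM7_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅå"
    "Δ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"    # \x1b stands in for the escape code
    " !\"#¤%&'()*+,-./"
    "0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNO"
    "PQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmno"
    "pqrstuvwxyzäöñüà"
)

def gsm7_char(code: int) -> str:
    """Look up one 7-bit code; the eighth bit, if present, is ignored."""
    return GSM7_BASIC[code & 0x7F]

print(gsm7_char(0x5B))  # Ä
```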
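Sometimes you need to convert between 7-bit and 8-bit form. A single character can be looked up directly in Appendix Table 2, but 7-bit data of two or more characters must first be converted to the other form before a table lookup yields a human-readable string. The conversion is illustrated with an example:
Example 2 (from the Internet):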
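As a concrete example, take the string 3132333435363738, which is 7-bit encoded (one 7-bit character per byte), and pack it into 8-bit form. In binary the characters are 00110001 (31), 00110010 (32), 00110011 (33), 00110100 (34), 00110101 (35), 00110110 (36), 00110111 (37), 00111000 (38). The conversion proceeds as follows:
a. Convert 31: move the lowest bit of 32 into the highest bit of 31. That bit is 0, so 31 is unchanged and the first packed byte is 31.
b. Convert 32: its lowest bit has been taken, which amounts to a right shift by one bit, giving 00011001. Place the low two bits of 33 in the high bits of the shifted 32, giving 11011001, i.e. D9.
c. Convert 33: its low two bits have been taken, which amounts to a right shift by two bits, giving 00001100. Place the low three bits of 34 in the high bits of the shifted 33, giving 10001100, i.e. 8C.
d. Convert 34: its low three bits have been taken, which amounts to a right shift by three bits, giving 00000110. Place the low four bits of 35 in the high bits of the shifted 34, giving 01010110, i.e. 56.
e. Convert 35: its low four bits have been taken, which amounts to a right shift by four bits, giving 00000011. Place the low five bits of 36 in the high bits of the shifted 35, giving 10110011, i.e. B3.
f. Convert 36: its low five bits have been taken, which amounts to a right shift by five bits, giving 00000001. Place the low six bits of 37 in the high bits of the shifted 36, giving 11011101, i.e. DD.
g. Convert 37: its low six bits have been taken, which amounts to a right shift by six bits, giving 00000000. Place all seven bits of 38 in the high bits of the shifted 37, giving 01110000, i.e. 70.
At this point the 7-bit encoded 3132333435363738 has been packed into the 7 bytes 31D98C56B3DD70, which saves one byte.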
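The steps above can be sketched with a bit accumulator instead of per-byte shifts. This is a minimal sketch, not a full SMS/USSD encoder: the function name `pack_7bit` is hypothetical, and padding rules for the final partial byte are not handled beyond flushing the leftover bits.

```python
def pack_7bit(septets: bytes) -> bytes:
    """Pack 7-bit characters (one per input byte) into 8-bit octets."""
    acc = 0        # bit accumulator, lowest bits are emitted first
    nbits = 0      # number of valid bits currently in acc
    out = bytearray()
    for s in septets:
        acc |= (s & 0x7F) << nbits   # append 7 new bits above the old ones
        nbits += 7
        while nbits >= 8:            # emit every complete octet
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                        # flush leftover bits, zero-padded
        out.append(acc & 0xFF)
    return bytes(out)

packed = pack_7bit(bytes.fromhex("3132333435363738"))
print(packed.hex().upper())  # 31D98C56B3DD70
```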
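For ASCII: if the data obtained by converting 7-bit to 8-bit with the method above is ASCII-encoded, convert each byte from its binary or hexadecimal form to decimal and look it up in Appendix Table 3.
Example 3:
Given the hexadecimal representation of the 7-bit encoding: 0x3118
Convert to the 8-bit representation (ASCII): 0x3130
In 8-bit form the characters can be looked up one at a time: 0x31 is 49 in decimal, which Appendix Table 3 gives as the character "1"; 0x30 is 48, which is the character "0". So the data represents "10".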
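Example 3 runs the packing in reverse. A minimal sketch follows, assuming the caller knows how many characters to expect; the name `unpack_7bit` and its `count` parameter are hypothetical. Decoding the result as ASCII works here only because digits have the same codes in ASCII and in the GSM 7 bit default alphabet.

```python
def unpack_7bit(octets: bytes, count: int) -> bytes:
    """Unpack 8-bit octets into `count` 7-bit characters, one per byte."""
    acc = 0        # bit accumulator, lowest bits are consumed first
    nbits = 0      # number of valid bits currently in acc
    out = bytearray()
    for o in octets:
        acc |= o << nbits            # append 8 new bits above the old ones
        nbits += 8
        while nbits >= 7 and len(out) < count:
            out.append(acc & 0x7F)   # take the lowest 7 bits as a character
            acc >>= 7
            nbits -= 7
    return bytes(out)

chars = unpack_7bit(bytes.fromhex("3118"), 2)
print(chars.hex())              # 3130
print(chars.decode("ascii"))    # 10
print([b for b in chars])       # [49, 48]
```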
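UTF-8 covers far too many characters to list here, so only the principle is given; for more detail, see the Wikipedia article on UTF-8:
UTF-8 uses one to four bytes to encode each character:
- The 128 US-ASCII characters need only one byte (Unicode range U+0000 to U+007F).
- Latin letters with diacritics, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana need two bytes (Unicode range U+0080 to U+07FF).
- The other characters in the Basic Multilingual Plane (BMP), which includes most commonly used characters, use three bytes.
- The rarely used characters in the Unicode supplementary planes use four bytes.
- For any byte B in UTF-8 encoded data, if the first (highest) bit of B is 0, then B is an ASCII code and represents a character on its own;
- If the first bit of B is 1 and the second bit is 0, then B is one byte of a non-ASCII character (a character represented by several bytes), and it is not the first byte of that character;
- If the first two bits of B are 1 and the third bit is 0, then B is the first byte of a non-ASCII character, and that character is represented by two bytes;
- If the first three bits of B are 1 and the fourth bit is 0, then B is the first byte of a non-ASCII character, and that character is represented by three bytes;
- If the first four bits of B are 1 and the fifth bit is 0, then B is the first byte of a non-ASCII character, and that character is represented by four bytes.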
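The byte rules in the list above translate directly into bit masks. The sketch below classifies a single byte; the function name and the returned labels are made up for this example, and it does not validate whole sequences (overlong forms, surrogates, and similar cases are ignored).

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte of UTF-8 data by its leading bits."""
    if (b & 0x80) == 0x00:   # 0xxxxxxx
        return "ascii"
    if (b & 0xC0) == 0x80:   # 10xxxxxx
        return "continuation"
    if (b & 0xE0) == 0xC0:   # 110xxxxx
        return "lead-2"
    if (b & 0xF0) == 0xE0:   # 1110xxxx
        return "lead-3"
    if (b & 0xF8) == 0xF0:   # 11110xxx
        return "lead-4"
    return "invalid"

for text in ["A", "é", "中"]:
    encoded = text.encode("utf-8")
    print(text, encoded.hex(), [classify_utf8_byte(b) for b in encoded])
```

For "中" the encoded bytes are E4 B8 AD, which the function labels as one three-byte lead followed by two continuation bytes.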
Appendix Table 1: USSD Data Coding Scheme
Coding Group Bits 7..4 | Use of bits 3..0 |
0000 | Language using the GSM 7 bit default alphabet |
| Bits 3..0 indicate the language: |
| 0000 German |
| 0001 English |
| 0010 Italian |
| 0011 French |
| 0100 Spanish |
| 0101 Dutch |
| 0110 Swedish |
| 0111 Danish |
| 1000 Portuguese |
| 1001 Finnish |
| 1010 Norwegian |
| 1011 Greek |
| 1100 Turkish |
| 1101 Hungarian |
| 1110 Polish |
| 1111 Language unspecified |
0001 | 0000 GSM 7 bit default alphabet; message preceded by language indication. The first 3 characters of the message are a two-character representation of the language encoded according to ISO 639 [12], followed by a CR character. The CR character is then followed by 90 characters of text. |
| 0001 UCS2; message preceded by language indication. The message starts with a two GSM 7-bit default alphabet character representation of the language encoded according to ISO 639 [12]. This is padded to the octet boundary with two bits set to 0 and then followed by 40 characters of UCS2-encoded message. An MS not supporting UCS2 coding will present the two character language identifier followed by improperly interpreted user data. |
| 0010..1111 Reserved |
0010 | 0000 Czech |
| 0001 Hebrew |
| 0010 Arabic |
| 0011 Russian |
| 0100 Icelandic |
| 0101..1111 Reserved for other languages using the GSM 7 bit default alphabet, with unspecified handling at the MS |
0011 | 0000..1111 Reserved for other languages using the GSM 7 bit default alphabet, with unspecified handling at the MS |
01xx | General Data Coding indication. Bits 5..0 indicate the following: |
| Bit 5, if set to 0, indicates the text is uncompressed |
| Bit 5, if set to 1, indicates the text is compressed using the compression algorithm defined in 3GPP TS 23.042 [13] |
| Bit 4, if set to 0, indicates that bits 1 to 0 are reserved and have no message class meaning |
| Bit 4, if set to 1, indicates that bits 1 to 0 have a message class meaning: |
| Bit 1 Bit 0 Message Class: |
| 0 0 Class 0 |
| 0 1 Class 1 Default meaning: ME-specific. |
| 1 0 Class 2 (U)SIM specific message. |
| 1 1 Class 3 Default meaning: TE-specific (see 3GPP TS 27.005 [8]) |
| Bits 3 and 2 indicate the character set being used, as follows: |
| Bit 3 Bit 2 Character set: |
| 0 0 GSM 7 bit default alphabet |
| 0 1 8 bit data |
| 1 0 UCS2 (16 bit) [10] |
| 1 1 Reserved |
1000 | Reserved coding groups |
1001 | Message with User Data Header (UDH) structure: |
| Bit 1 Bit 0 Message Class: |
| 0 0 Class 0 |
| 0 1 Class 1 Default meaning: ME-specific. |
| 1 0 Class 2 (U)SIM specific message. |
| 1 1 Class 3 Default meaning: TE-specific (see 3GPP TS 27.005 [8]) |
| Bits 3 and 2 indicate the alphabet being used, as follows: |
| Bit 3 Bit 2 Alphabet: |
| 0 0 GSM 7 bit default alphabet |
| 0 1 8 bit data |
| 1 0 UCS2 (16 bit) [10] |
| 1 1 Reserved |
1010..1101 | Reserved coding groups |
1110 | Defined by the WAP Forum [15] |
1111 | Data coding / message handling |
| Bit 3 is reserved, set to 0. |
| Bit 2 Message coding: |
| 0 GSM 7 bit default alphabet |
| 1 8 bit data |
| Bit 1 Bit 0 Message Class: |
| 0 0 No message class. |
| 0 1 Class 1 user defined. |
| 1 0 Class 2 user defined. |
| 1 1 Class 3 default meaning: TE specific (see 3GPP TS 27.005 [8]) |
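As an illustration of how the table is read, the sketch below decodes only the 01xx "General Data Coding indication" group. The function name and the dictionary keys are hypothetical; the other coding groups would need their own branches.

```python
CHARSETS = ["GSM 7 bit default alphabet", "8 bit data", "UCS2 (16 bit)", "Reserved"]

def decode_general_dcs(dcs: int) -> dict:
    """Decode a data coding scheme octet from the 01xx coding group."""
    if (dcs >> 6) != 0b01:
        raise ValueError("not in the 01xx General Data Coding group")
    has_class = bool(dcs & 0x10)           # bit 4: bits 1..0 carry a class
    return {
        "compressed": bool(dcs & 0x20),    # bit 5
        "message_class": (dcs & 0x03) if has_class else None,
        "charset": CHARSETS[(dcs >> 2) & 0x03],   # bits 3 and 2
    }

print(decode_general_dcs(0x48))
# {'compressed': False, 'message_class': None, 'charset': 'UCS2 (16 bit)'}
```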
Appendix Table 2: GSM 7 bit Default Alphabet character table:
| | | | b7 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| | | | b6 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| | | | b5 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
b4 | b3 | b2 | b1 | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
0 | 0 | 0 | 0 | 0 | @ | Δ | SP | 0 | ¡ | P | ¿ | p |
0 | 0 | 0 | 1 | 1 | £ | _ | ! | 1 | A | Q | a | q |
0 | 0 | 1 | 0 | 2 | $ | Φ | " | 2 | B | R | b | r |
0 | 0 | 1 | 1 | 3 | ¥ | Γ | # | 3 | C | S | c | s |
0 | 1 | 0 | 0 | 4 | è | Λ | ¤ | 4 | D | T | d | t |
0 | 1 | 0 | 1 | 5 | é | Ω | % | 5 | E | U | e | u |
0 | 1 | 1 | 0 | 6 | ù | Π | & | 6 | F | V | f | v |
0 | 1 | 1 | 1 | 7 | ì | Ψ | ' | 7 | G | W | g | w |
1 | 0 | 0 | 0 | 8 | ò | Σ | ( | 8 | H | X | h | x |
1 | 0 | 0 | 1 | 9 | Ç | Θ | ) | 9 | I | Y | i | y |
1 | 0 | 1 | 0 | 10 | LF | Ξ | * | : | J | Z | j | z |
1 | 0 | 1 | 1 | 11 | Ø | 1) | + | ; | K | Ä | k | ä |
1 | 1 | 0 | 0 | 12 | ø | Æ | , | < | L | Ö | l | ö |
1 | 1 | 0 | 1 | 13 | CR | æ | - | = | M | Ñ | m | ñ |
1 | 1 | 1 | 0 | 14 | Å | ß | . | > | N | Ü | n | ü |
1 | 1 | 1 | 1 | 15 | å | É | / | ? | O | § | o | à |
NOTE 1): This code is an escape to an extension of the GSM 7 bit default alphabet table. A receiving entity which does not understand the meaning of this escape mechanism shall display it as a space character.
Appendix Table 3:
(ASCII stands for American Standard Code for Information Interchange.)
Value | Symbol | Value | Symbol | Value | Symbol |
0 | NUL (null character) | 44 | , | 91 | [ |
32 | space | 45 | - | 92 | \ |
33 | ! | 46 | . | 93 | ] |
34 | " | 47 | / | 94 | ^ |
35 | # | 48 ~ 57 | 0 ~ 9 | 95 | _ |
36 | $ | 58 | : | 96 | ` |
37 | % | 59 | ; | 97 ~ 122 | a ~ z |
38 | & | 60 | < | 123 | { |
39 | ' | 61 | = | 124 | | |
40 | ( | 62 | > | 125 | } |
41 | ) | 63 | ? | 126 | ~ |
42 | * | 64 | @ | 127 | DEL (Delete key) |
43 | + | 65 ~ 90 | A ~ Z | | |