[Joker's Note] Common Character Sets and Encodings

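Before introducing ASCII, GB2312, Unicode, UTF-8, UTF-16 and so on, I think it is worth explaining the title first: what is a character set, and what is an encoding? In my humble view, a character set is simply a collection of characters, such as ASCII, GB2312 or Unicode, while an encoding is the mapping between the characters of a character set and the code values (bytes) used to store them. The Unicode character set, for example, has the UTF-8, UTF-16 and UTF-32 encodings.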
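To make the distinction concrete, here is a minimal Python sketch (Python and its built-in codecs are my choice for illustration, not something the original post uses). It takes one character, reads its number in the Unicode character set, and then produces three different byte sequences for that same number.

```python
# A character set assigns each character a number (its code point);
# an encoding turns that number into concrete bytes.
ch = "\u738b"                      # the character 王

code_point = ord(ch)               # position in the Unicode character set
utf8 = ch.encode("utf-8")          # three different encodings
utf16 = ch.encode("utf-16-be")     # of the very same code point
utf32 = ch.encode("utf-32-be")

print(f"U+{code_point:04X}")                   # U+738B
print(utf8.hex(), utf16.hex(), utf32.hex())    # e78e8b 738b 0000738b
```

The code point stays the same in all three cases; only the byte-level representation changes.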

ASCII

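Standard ASCII, also called basic ASCII, uses 7 binary bits (the remaining bit of the byte is 0) to represent all uppercase and lowercase letters, the digits 0 to 9, punctuation marks, and the special control characters used in American English.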
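The 7-bit property is easy to check. The following is a small Python sketch (the sample string is arbitrary and only for illustration): every byte of ASCII-encoded text has its top bit clear, and a character outside the 128-symbol range cannot be encoded as ASCII at all.

```python
text = "Hello, ASCII 0-9!"
data = text.encode("ascii")

# Every ASCII byte is below 0x80, i.e. its highest bit is 0.
all_7bit = all(b < 0x80 for b in data)
bits_of_A = format(ord("A"), "08b")    # 'A' is 65, shown as 8 binary digits

# A character outside the 128 ASCII symbols cannot be encoded.
try:
    "\u00e9".encode("ascii")           # é
    non_ascii_ok = True
except UnicodeEncodeError:
    non_ascii_ok = False

print(all_7bit, bits_of_A, non_ascii_ok)   # True 01000001 False
```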

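For English, 128 symbols are enough to represent everything, but they are not enough for other languages. So the unused highest bit of the byte was put to work to encode new symbols, which gave rise to extended ASCII encodings. Because alphabets differ from country to country, there are many kinds of extended ASCII. In all of them, however, codes 0-127 represent the same symbols, and only the range 128-255 differs. Only two of them are shown below.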
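As a rough illustration of "same below 128, different above", the sketch below decodes two bytes under three code pages that ship with Python. The choice of `latin-1`, `cp437` and `cp1251` is mine, picked only because they are well-known examples of extended ASCII.

```python
code_pages = ["latin-1", "cp437", "cp1251"]

low = bytes([0x41])    # inside 0-127
high = bytes([0xE9])   # inside 128-255

low_decoded = {name: low.decode(name) for name in code_pages}
high_decoded = {name: high.decode(name) for name in code_pages}

print(low_decoded)     # 'A' under every code page
print(high_decoded)    # a different character under each code page
```

Byte 0x41 is "A" everywhere, while byte 0xE9 comes out as a Latin, a Greek and a Cyrillic letter respectively.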

Extended ASCII (non-Latin variant)

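Extended ASCII (Latin-1)

Note that codes 128-159 in the table below are the Windows-1252 (CP1252) assignments. In ISO-8859-1 (Latin-1) proper, those positions hold C1 control characters; the two agree only for codes 160-255.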

| DEC | OCT | HEX | BIN | Symbol | Description |
|-----|-----|-----|-----|--------|-------------|
| 128 | 200 | 80 | 10000000 | € | Euro sign |
| 129 | 201 | 81 | 10000001 |  |  |
| 130 | 202 | 82 | 10000010 | ‚ | Single low-9 quotation mark |
| 131 | 203 | 83 | 10000011 | ƒ | Latin small letter f with hook |
| 132 | 204 | 84 | 10000100 | „ | Double low-9 quotation mark |
| 133 | 205 | 85 | 10000101 | … | Horizontal ellipsis |
| 134 | 206 | 86 | 10000110 | † | Dagger |
| 135 | 207 | 87 | 10000111 | ‡ | Double dagger |
| 136 | 210 | 88 | 10001000 | ˆ | Modifier letter circumflex accent |
| 137 | 211 | 89 | 10001001 | ‰ | Per mille sign |
| 138 | 212 | 8A | 10001010 | Š | Latin capital letter S with caron |
| 139 | 213 | 8B | 10001011 | ‹ | Single left-pointing angle quotation mark |
| 140 | 214 | 8C | 10001100 | Œ | Latin capital ligature OE |
| 141 | 215 | 8D | 10001101 |  |  |
| 142 | 216 | 8E | 10001110 | Ž | Latin capital letter Z with caron |
| 143 | 217 | 8F | 10001111 |  |  |
| 144 | 220 | 90 | 10010000 |  |  |
| 145 | 221 | 91 | 10010001 | ‘ | Left single quotation mark |
| 146 | 222 | 92 | 10010010 | ’ | Right single quotation mark |
| 147 | 223 | 93 | 10010011 | “ | Left double quotation mark |
| 148 | 224 | 94 | 10010100 | ” | Right double quotation mark |
| 149 | 225 | 95 | 10010101 | • | Bullet |
| 150 | 226 | 96 | 10010110 | – | En dash |
| 151 | 227 | 97 | 10010111 | — | Em dash |
| 152 | 230 | 98 | 10011000 | ˜ | Small tilde |
| 153 | 231 | 99 | 10011001 | ™ | Trade mark sign |
| 154 | 232 | 9A | 10011010 | š | Latin small letter s with caron |
| 155 | 233 | 9B | 10011011 | › | Single right-pointing angle quotation mark |
| 156 | 234 | 9C | 10011100 | œ | Latin small ligature oe |
| 157 | 235 | 9D | 10011101 |  |  |
| 158 | 236 | 9E | 10011110 | ž | Latin small letter z with caron |
| 159 | 237 | 9F | 10011111 | Ÿ | Latin capital letter Y with diaeresis |
| 160 | 240 | A0 | 10100000 |  | Non-breaking space |
| 161 | 241 | A1 | 10100001 | ¡ | Inverted exclamation mark |
| 162 | 242 | A2 | 10100010 | ¢ | Cent sign |
| 163 | 243 | A3 | 10100011 | £ | Pound sign |
| 164 | 244 | A4 | 10100100 | ¤ | Currency sign |
| 165 | 245 | A5 | 10100101 | ¥ | Yen sign |
| 166 | 246 | A6 | 10100110 | ¦ | Pipe, broken vertical bar |
| 167 | 247 | A7 | 10100111 | § | Section sign |
| 168 | 250 | A8 | 10101000 | ¨ | Spacing diaeresis - umlaut |
| 169 | 251 | A9 | 10101001 | © | Copyright sign |
| 170 | 252 | AA | 10101010 | ª | Feminine ordinal indicator |
| 171 | 253 | AB | 10101011 | « | Left double angle quotes |
| 172 | 254 | AC | 10101100 | ¬ | Not sign |
| 173 | 255 | AD | 10101101 |  | Soft hyphen |
| 174 | 256 | AE | 10101110 | ® | Registered trade mark sign |
| 175 | 257 | AF | 10101111 | ¯ | Spacing macron - overline |
| 176 | 260 | B0 | 10110000 | ° | Degree sign |
| 177 | 261 | B1 | 10110001 | ± | Plus-or-minus sign |
| 178 | 262 | B2 | 10110010 | ² | Superscript two - squared |
| 179 | 263 | B3 | 10110011 | ³ | Superscript three - cubed |
| 180 | 264 | B4 | 10110100 | ´ | Acute accent - spacing acute |
| 181 | 265 | B5 | 10110101 | µ | Micro sign |
| 182 | 266 | B6 | 10110110 | ¶ | Pilcrow sign - paragraph sign |
| 183 | 267 | B7 | 10110111 | · | Middle dot - Georgian comma |
| 184 | 270 | B8 | 10111000 | ¸ | Spacing cedilla |
| 185 | 271 | B9 | 10111001 | ¹ | Superscript one |
| 186 | 272 | BA | 10111010 | º | Masculine ordinal indicator |
| 187 | 273 | BB | 10111011 | » | Right double angle quotes |
| 188 | 274 | BC | 10111100 | ¼ | Fraction one quarter |
| 189 | 275 | BD | 10111101 | ½ | Fraction one half |
| 190 | 276 | BE | 10111110 | ¾ | Fraction three quarters |
| 191 | 277 | BF | 10111111 | ¿ | Inverted question mark |
| 192 | 300 | C0 | 11000000 | À | Latin capital letter A with grave |
| 193 | 301 | C1 | 11000001 | Á | Latin capital letter A with acute |
| 194 | 302 | C2 | 11000010 | Â | Latin capital letter A with circumflex |
| 195 | 303 | C3 | 11000011 | Ã | Latin capital letter A with tilde |
| 196 | 304 | C4 | 11000100 | Ä | Latin capital letter A with diaeresis |
| 197 | 305 | C5 | 11000101 | Å | Latin capital letter A with ring above |
| 198 | 306 | C6 | 11000110 | Æ | Latin capital letter AE |
| 199 | 307 | C7 | 11000111 | Ç | Latin capital letter C with cedilla |
| 200 | 310 | C8 | 11001000 | È | Latin capital letter E with grave |
| 201 | 311 | C9 | 11001001 | É | Latin capital letter E with acute |
| 202 | 312 | CA | 11001010 | Ê | Latin capital letter E with circumflex |
| 203 | 313 | CB | 11001011 | Ë | Latin capital letter E with diaeresis |
| 204 | 314 | CC | 11001100 | Ì | Latin capital letter I with grave |
| 205 | 315 | CD | 11001101 | Í | Latin capital letter I with acute |
| 206 | 316 | CE | 11001110 | Î | Latin capital letter I with circumflex |
| 207 | 317 | CF | 11001111 | Ï | Latin capital letter I with diaeresis |
| 208 | 320 | D0 | 11010000 | Ð | Latin capital letter ETH |
| 209 | 321 | D1 | 11010001 | Ñ | Latin capital letter N with tilde |
| 210 | 322 | D2 | 11010010 | Ò | Latin capital letter O with grave |
| 211 | 323 | D3 | 11010011 | Ó | Latin capital letter O with acute |
| 212 | 324 | D4 | 11010100 | Ô | Latin capital letter O with circumflex |
| 213 | 325 | D5 | 11010101 | Õ | Latin capital letter O with tilde |
| 214 | 326 | D6 | 11010110 | Ö | Latin capital letter O with diaeresis |
| 215 | 327 | D7 | 11010111 | × | Multiplication sign |
| 216 | 330 | D8 | 11011000 | Ø | Latin capital letter O with slash |
| 217 | 331 | D9 | 11011001 | Ù | Latin capital letter U with grave |
| 218 | 332 | DA | 11011010 | Ú | Latin capital letter U with acute |
| 219 | 333 | DB | 11011011 | Û | Latin capital letter U with circumflex |
| 220 | 334 | DC | 11011100 | Ü | Latin capital letter U with diaeresis |
| 221 | 335 | DD | 11011101 | Ý | Latin capital letter Y with acute |
| 222 | 336 | DE | 11011110 | Þ | Latin capital letter THORN |
| 223 | 337 | DF | 11011111 | ß | Latin small letter sharp s - ess-zed |
| 224 | 340 | E0 | 11100000 | à | Latin small letter a with grave |
| 225 | 341 | E1 | 11100001 | á | Latin small letter a with acute |
| 226 | 342 | E2 | 11100010 | â | Latin small letter a with circumflex |
| 227 | 343 | E3 | 11100011 | ã | Latin small letter a with tilde |
| 228 | 344 | E4 | 11100100 | ä | Latin small letter a with diaeresis |
| 229 | 345 | E5 | 11100101 | å | Latin small letter a with ring above |
| 230 | 346 | E6 | 11100110 | æ | Latin small letter ae |
| 231 | 347 | E7 | 11100111 | ç | Latin small letter c with cedilla |
| 232 | 350 | E8 | 11101000 | è | Latin small letter e with grave |
| 233 | 351 | E9 | 11101001 | é | Latin small letter e with acute |
| 234 | 352 | EA | 11101010 | ê | Latin small letter e with circumflex |
| 235 | 353 | EB | 11101011 | ë | Latin small letter e with diaeresis |
| 236 | 354 | EC | 11101100 | ì | Latin small letter i with grave |
| 237 | 355 | ED | 11101101 | í | Latin small letter i with acute |
| 238 | 356 | EE | 11101110 | î | Latin small letter i with circumflex |
| 239 | 357 | EF | 11101111 | ï | Latin small letter i with diaeresis |
| 240 | 360 | F0 | 11110000 | ð | Latin small letter eth |
| 241 | 361 | F1 | 11110001 | ñ | Latin small letter n with tilde |
| 242 | 362 | F2 | 11110010 | ò | Latin small letter o with grave |
| 243 | 363 | F3 | 11110011 | ó | Latin small letter o with acute |
| 244 | 364 | F4 | 11110100 | ô | Latin small letter o with circumflex |
| 245 | 365 | F5 | 11110101 | õ | Latin small letter o with tilde |
| 246 | 366 | F6 | 11110110 | ö | Latin small letter o with diaeresis |
| 247 | 367 | F7 | 11110111 | ÷ | Division sign |
| 248 | 370 | F8 | 11111000 | ø | Latin small letter o with slash |
| 249 | 371 | F9 | 11111001 | ù | Latin small letter u with grave |
| 250 | 372 | FA | 11111010 | ú | Latin small letter u with acute |
| 251 | 373 | FB | 11111011 | û | Latin small letter u with circumflex |
| 252 | 374 | FC | 11111100 | ü | Latin small letter u with diaeresis |
| 253 | 375 | FD | 11111101 | ý | Latin small letter y with acute |
| 254 | 376 | FE | 11111110 | þ | Latin small letter thorn |
| 255 | 377 | FF | 11111111 | ÿ | Latin small letter y with diaeresis |
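The numeric columns of a table like this (decimal, octal, hex, binary) plus the symbol can be regenerated for any single-byte code page. Here is a hedged sketch: `table_row` is a hypothetical helper name, and Python's `cp1252` codec is assumed as the code page because it matches the symbols listed for 128-159.

```python
def table_row(code: int, codec: str = "cp1252") -> str:
    """Format one byte value as DEC OCT HEX BIN SYMBOL (hypothetical helper)."""
    try:
        symbol = bytes([code]).decode(codec)
    except UnicodeDecodeError:
        symbol = ""  # this position is not assigned in the code page
    return f"{code:d} {code:03o} {code:02X} {code:08b} {symbol}".rstrip()


print(table_row(128))   # 128 200 80 10000000 €
print(table_row(129))   # 129 201 81 10000001
print(table_row(233))   # 233 351 E9 11101001 é
```

Positions that the code page leaves unassigned (such as 129) come out with an empty symbol, which matches the blank rows in the table.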

GB2312

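GB2312 is used for exchanging information between Chinese-character processing and communication systems, and it uses a double-byte encoding. The whole character set is divided into 94 rows (区), each with 94 cells (位). Each row/cell position holds exactly one character, so a character can be identified by its row and cell numbers; this is called the qu-wei code (区位码). Take the character "王" as an example: as the code chart (figure) shows, its qu-wei code is 4585, that is, row 45, cell 85. Adding 0xA0 to each of the two numbers gives the bytes that are actually stored, 0xCD and 0xF5, so the GB2312 encoding of "王" is 0xCDF5.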
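The relationship between the stored GB2312 bytes and the row/cell (qu-wei) numbers can be checked with Python's built-in `gb2312` codec. This is a minimal sketch; `quwei` is a hypothetical helper name, and it assumes the input is a single double-byte GB2312 character (an ASCII character would encode to one byte and fail the unpacking).

```python
def quwei(ch: str):
    """Return (row, cell) for one double-byte GB2312 character (hypothetical helper)."""
    hi, lo = ch.encode("gb2312")     # the two stored bytes, as integers
    return hi - 0xA0, lo - 0xA0      # undo the 0xA0 offset on each byte


encoded = "\u738b".encode("gb2312")  # the character 王
print(encoded.hex().upper())         # CDF5
print(quwei("\u738b"))               # (45, 85)
```

So 0xCDF5 is the stored two-byte value, while 4585 is the row/cell position in the 94 x 94 grid.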

UTF-8

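As mentioned at the start of this article, UTF-8 is one of the encodings of Unicode. Its most distinctive feature is that it is variable-length: it uses 1 to 4 bytes to represent a character, and the length depends on the character.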
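The variable length is easy to observe. The sketch below (the four sample characters are my own picks, one for each length) counts the UTF-8 bytes of each character.

```python
samples = [
    "A",            # U+0041, basic ASCII
    "\u00e9",       # é, U+00E9
    "\u738b",       # 王, U+738B
    "\U0001F600",   # an emoji, U+1F600
]

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)      # [1, 2, 3, 4]
```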

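For example, the Unicode code point of the Chinese character "王" is U+738B (0x738B). In binary that is 111 0011 1000 1011, 15 bits in total, so encoding it in UTF-8 takes 3 bytes (the 3-byte form covers U+0800 through U+FFFF).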

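The 3-byte UTF-8 bit pattern is 1110XXXX 10XXXXXX 10XXXXXX. The lowest six bits, 001011, go into the last byte; the next six bits, 001110, go into the middle byte; and the remaining bits, 111, padded with a leading zero to 0111, go into the first byte. The resulting UTF-8 encoding is 11100111 10001110 10001011, which is 0xE78E8B in hexadecimal.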
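The bit-filling steps above can be sketched in a few lines of Python and compared against the built-in encoder. `utf8_encode_3byte` is a hypothetical name; the function handles only the 3-byte case and does not reject surrogate code points, so it is an illustration rather than a complete UTF-8 encoder.

```python
def utf8_encode_3byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF by hand (3-byte case only)."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the 3-byte range")
    first = 0b11100000 | (code_point >> 12)                # 1110XXXX: top 4 bits
    middle = 0b10000000 | ((code_point >> 6) & 0b111111)   # 10XXXXXX: next 6 bits
    last = 0b10000000 | (code_point & 0b111111)            # 10XXXXXX: low 6 bits
    return bytes([first, middle, last])


manual = utf8_encode_3byte(0x738B)
print(manual.hex().upper())                  # E78E8B
print(manual == "\u738b".encode("utf-8"))    # True
```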


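That is about it for the common character sets and encodings. When you come across an encoded character and cannot tell which character set it belongs to, you can first look it up in the tables above. Alternatively, if you want to know a character's value in a particular character set, or its value under a particular encoding, you can do the following; the software I use here is UltraEdit (UE).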

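After choosing the encoding at the bottom of the window, type the text you want to look up, then choose Edit > Hex Mode from the top menu to see the encoded values of that text.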
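If no hex editor is at hand, roughly the same lookup can be done in Python. This is only a sketch of the idea, not a replacement for the editor workflow; `hex_dump` is a hypothetical helper name, and the codec names are the ones Python's standard library provides.

```python
def hex_dump(text: str, encoding: str) -> str:
    """Return the bytes of text under the given encoding as spaced hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


print(hex_dump("A", "ascii"))             # 41
print(hex_dump("\u738b", "utf-8"))        # E7 8E 8B
print(hex_dump("\u738b", "gb2312"))       # CD F5
print(hex_dump("\u738b", "utf-16-le"))    # 8B 73
```

Changing the `encoding` argument plays the role of picking a different encoding in the editor before switching to hex mode.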

Chinese characters work too, of course.
