Garbled Chinese Text on Linux

Latest recommended article published on 2024-04-29 13:32:26

wzb56

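Chinese text shows up garbled on a Linux system. How can this be fixed?
Answer:
First, what is a character set?
Put simply, it is a set of characters together with the codes assigned to them. Two commonly used ones:
GBK: 1 or 2 bytes per character (ASCII in one byte, Chinese characters in two); not an international standard, but supported by many systems
UTF-8: variable length, 1 to 4 bytes per character; widely supported, including by MySQL
There are many more character sets; those are a topic for later study.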
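
To make "characters plus their codes" concrete, the sketch below prints the bytes of the character 中 in both encodings. The helper name hex_of is made up for this illustration, and the bytes are written as octal escapes so that the result does not depend on how the script file itself is saved.

```shell
# hex_of is a hypothetical helper: $1 is a printf format made of octal
# escapes, i.e. the raw bytes to inspect. od prints them in hex and
# xargs collapses the surrounding whitespace.
hex_of() {
    printf "$1" | od -An -tx1 | xargs
}

hex_of '\326\320'        # 中 in GBK: prints d6 d0 (2 bytes)
hex_of '\344\270\255'    # 中 in UTF-8: prints e4 b8 ad (3 bytes)
```

A terminal that expects UTF-8 cannot make sense of d6 d0, and one that expects GBK misreads e4 b8 ad. That mismatch is where garbled text comes from.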

① On Linux the character set is determined by the locale, which is normally set through the LANG variable.
　[root@gagarin ~]# echo $LANG
　zh_CN.GB18030
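
LANG is only the fallback. Under the POSIX locale rules a non-empty LC_ALL overrides everything, and LC_CTYPE (the category that governs character encoding) overrides LANG. If echo $LANG looks right but text is still garbled, those two are worth checking. A minimal sketch; the function name effective_ctype is made up here:

```shell
# Print the locale name that actually governs character handling.
# Precedence: LC_ALL, then LC_CTYPE, then LANG, then the POSIX default.
# An empty value counts as "not set", hence ":-" rather than "-".
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-POSIX}}}"
}

effective_ctype
```

The standard locale command, run without arguments, reports the same information for every category.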

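② export LANG="zh_CN.GB18030" (temporary change: it applies only to the current shell session and its child processes, and is lost on logout or reboot). Locale names are case-sensitive: the language part is lowercase, zh_CN, not ZH_CN.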
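
For an even narrower scope, a variable assignment placed in front of a command applies to that command alone and leaves the current shell untouched. A small sketch of the difference (the sh -c child merely echoes what it received):

```shell
# The child process sees the value given on its command line...
LANG="zh_CN.GB18030" sh -c 'echo "child sees:  $LANG"'

# ...while the calling shell keeps whatever it had before.
echo "parent sees: ${LANG:-(not set)}"
```

This form is handy for running one program that needs a particular character set without changing the rest of the session.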

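③ In /etc/sysconfig/i18n (the file used by RHEL/CentOS 6 and earlier; systemd-based releases read /etc/locale.conf instead), add at the top of the file (permanent setting):
　LANG="zh_CN.GB18030"
　and comment out the previous LANG line with "#"
　. /etc/sysconfig/i18n (apply the change to the current shell)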
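
The manual edit in ③ can be scripted. The following is only a sketch under stated assumptions: set_lang_in_file is a hypothetical helper, it assumes the file holds plain NAME="value" lines, and it is best tried on a copy of the file first.

```shell
# set_lang_in_file FILE LOCALE
# Comment out every active LANG= line in FILE and put the new setting
# on the first line. Lines that are already commented stay as they are.
set_lang_in_file() {
    file=$1
    new_lang=$2
    tmp=$(mktemp) || return 1
    {
        printf 'LANG="%s"\n' "$new_lang"
        sed 's/^[[:space:]]*LANG=/#&/' "$file"
    } > "$tmp" && cat "$tmp" > "$file"   # cat keeps FILE's owner and mode
    rm -f "$tmp"
}

# Usage (as root):
#   set_lang_in_file /etc/sysconfig/i18n zh_CN.GB18030
```

Writing through a temporary file avoids reading and truncating the same file in one pipeline.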

④ echo 'export LANG="zh_CN.GB18030"' >>/etc/profile (the system-wide environment configuration file)
　source /etc/profile (apply it)
　echo $LANG (check the result)
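
Because >> appends unconditionally, running the command in ④ twice leaves two identical lines in /etc/profile. A minimal sketch of an append that is safe to repeat; append_once is a hypothetical helper name:

```shell
# append_once LINE FILE
# Add LINE to FILE only if an identical line is not already present.
# grep -x matches whole lines, -F treats LINE as a fixed string.
append_once() {
    line=$1
    file=$2
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (as root):
#   append_once 'export LANG="zh_CN.GB18030"' /etc/profile
```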

⑤ Scripts used in production sometimes set the character set themselves to avoid garbled Chinese:
　#!/bin/sh
　export LANG="zh_CN.GB18030"
　(script body)

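⑥ The SSH client (SecureCRT) must use the same character set as the Linux host.
　In SecureCRT's "Session Options" dialog, under "Terminal" → "Appearance", the "Character encoding" option must match the Linux setting.
　When Linux uses "zh_CN.GB18030", setting the SecureCRT option to "Default" is enough (this assumes a Chinese-language Windows client, whose default code page is GBK).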
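
Tools spell the same character set in different ways (GB18030, gb18030, UTF-8, utf8), so comparing the server locale with the client setting by eye is error-prone. The sketch below normalises both sides before comparing. Both function names are made up for this illustration, and the client-side value has to be read off the SecureCRT dialog by hand.

```shell
# Extract and normalise the codeset part of a locale name:
#   zh_CN.GB18030 -> gb18030,  en_US.UTF-8@euro -> utf8
# A bare charset name such as "UTF-8" is normalised as well. A locale
# name without a codeset (e.g. "zh_CN") is not handled specially.
normalise_codeset() {
    name=${1#*.}        # drop "language_TERRITORY." if present
    name=${name%%@*}    # drop an optional "@modifier"
    printf '%s\n' "$name" | tr 'A-Z' 'a-z' | tr -d '-'
}

# Succeed when both arguments name the same character set.
same_codeset() {
    [ "$(normalise_codeset "$1")" = "$(normalise_codeset "$2")" ]
}

if same_codeset "zh_CN.GB18030" "UTF-8"; then
    echo "match"
else
    echo "mismatch: expect garbled text"
fi
```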

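⑦ When the server and client character sets match, garbled text is effectively avoided.
　Chinese character set: zh_CN.GB18030
　Character set variable: LANG
　Character set configuration file: /etc/sysconfig/i18n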

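⑧ Commands used above:
　echo (with >>, appends a single line of text to a file)
　source and . (make modified variables take effect in the current shell)
　export (sets an environment variable)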

Do you know what UTF-8 is when you see it?

When coding we often see encoding names such as UTF-8 or GB2312 in our source code. Do you know what these encodings mean and why we need them? In this post, Julián Solórzano introduces the most widely used encoding in the world, one that accommodates all the different character sets.
UTF-8 is a method for encoding Unicode characters using sequences of 8-bit units (bytes). Unicode is a standard for representing a great variety of characters from many languages. ASCII, the earlier standard for information encoding, was created in the 1960s. It originally consisted of 128 characters, including lowercase and uppercase letters, digits, punctuation and control codes, each one encoded using 7 bits.

Then came "extended ASCII", which used all 8 bits to accommodate more characters such as á, é and ü. Many different code pages, such as latin1 and windows-1252, fill those extra 128 slots, i.e. there is no single correspondence chart for them; it depends on region, language, operating system and so on.

It became apparent that neither 128 (7-bit) nor 256 (8-bit) slots were enough to represent a very large number of characters consistently, so Unicode was created as a standard to represent characters from nearly all writing systems. Its code space holds more than 1,000,000 code points (1,114,112 in total, written with the prefix "U+").

UTF-8 is a method for encoding these code points. A character in UTF-8 is made up of one to four bytes. The first 128 code points are encoded exactly as in ASCII. Further code points are represented using more than one byte: the first byte announces the length of the sequence, and each following byte starts with the bits 10 to signal that it still belongs to the same character. The byte layout, as tabulated on Wikipedia:

Code point range       Byte 1    Byte 2    Byte 3    Byte 4
U+0000 to U+007F       0xxxxxxx
U+0080 to U+07FF       110xxxxx  10xxxxxx
U+0800 to U+FFFF       1110xxxx  10xxxxxx  10xxxxxx
U+10000 to U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

For example, the letter á is Unicode code point U+00E1, or 225 in decimal.

225 in binary is 11100001. As 8 bits are needed to represent this number, we have to use 2 bytes to encode it in UTF-8 (only the first 128 code points, those that fit in 7 bits, use a single byte). The two-byte pattern 110xxxxx 10xxxxxx has room for 11 payload bits, so the number is padded to 00011100001 and split into 00011 and 100001. The letter á is therefore encoded as 110 00011 10 100001 (or C3 A1 in hexadecimal, as bytes are more commonly written), where 00011 and 100001 carry the number 225 and the prefixes 110 and 10 are the bit pattern required by the encoding.

This way, if you open a text file that contains the bytes c3 a1 and the program interprets the encoding as UTF-8, you will see an á. If the program instead thinks the encoding is latin1 or something similar, you will see whatever c3 and a1 mean in that code page, i.e. two characters such as Ã¡.
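
The same bit manipulation can be written out with shell arithmetic. This is a sketch for illustration rather than a validating encoder: utf8_bytes is a made-up name, and it does not reject surrogates or values above U+10FFFF.

```shell
# utf8_bytes CODEPOINT
# Print the UTF-8 bytes of a code point (decimal or 0x-prefixed hex)
# as lowercase hex. Each continuation byte is 10xxxxxx, i.e. 0x80 plus
# six payload bits; the lead byte carries the length prefix.
utf8_bytes() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ]; then            # 0xxxxxxx
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then         # 110xxxxx 10xxxxxx
        printf '%02x %02x\n' \
            $(( 0xC0 | (cp >> 6) )) \
            $(( 0x80 | (cp & 0x3F) ))
    elif [ "$cp" -lt 65536 ]; then        # 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x\n' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else                                  # 11110xxx + three continuation bytes
        printf '%02x %02x %02x %02x\n' \
            $(( 0xF0 | (cp >> 18) )) \
            $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    fi
}

utf8_bytes 0xE1      # á  -> c3 a1
utf8_bytes 0x4E2D    # 中 -> e4 b8 ad
```

Sending those two bytes to a terminal with printf '\303\241' shows á when the terminal decodes UTF-8 and Ã¡ when it decodes latin1.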