Character Encoding

TheDreamOfGod

Article link: https://blog.csdn.net/TheDreamOfGod/article/details/84191218

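ASCII: the American Standard Code for Information Interchange. Standard ASCII is a 7-bit code, so it defines 128 characters: 95 printable characters (letters, digits, punctuation, and the space, at 0x20 to 0x7E) and 33 control characters (0x00 to 0x1F, plus DEL at 0x7F). The corresponding ISO standard is ISO 646. When ASCII is described as a 7- or 8-bit scheme that can assign values to up to 256 characters, the 8-bit half refers to extensions rather than to ASCII itself. Because 128 characters are too few for many applications, ISO defined ISO 2022, which specifies a uniform way to extend the code to 8 bits while staying compatible with ISO 646. ISO then published a series of extended character sets for different regions, each of which can add up to 128 code positions. These additional characters all use 8-bit codes with the high bit set to 1 (decimal 128 to 255) and are commonly called extended ASCII.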
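The ranges described above can be checked directly. The following is a minimal sketch, not part of the original article; it counts the space (0x20) as printable and DEL (0x7F) as a control character, and it uses ISO-8859-1 as one example of an 8-bit extension. The variable names are chosen for illustration only.

```python
# Code points 0x20..0x7E are the printable ASCII characters (space included);
# 0x00..0x1F and 0x7F (DEL) are control characters.
printable = [c for c in range(128) if 0x20 <= c <= 0x7E]
control = [c for c in range(128) if c < 0x20 or c == 0x7F]

# Every 7-bit value decodes as ASCII.
ascii_text = bytes(range(128)).decode("ascii")

# A byte with the high bit set is not ASCII, but an 8-bit extension such as
# ISO-8859-1 assigns it a character.
high_byte = bytes([0xE9])
try:
    high_byte.decode("ascii")
    ascii_accepts_high_byte = True
except UnicodeDecodeError:
    ascii_accepts_high_byte = False
latin1_char = high_byte.decode("iso-8859-1")

print(len(printable), len(control), ascii_accepts_high_byte, latin1_char)
```

Under this counting the two groups add up to the 128 code points of 7-bit ASCII, and the byte 0xE9 only becomes a character once an 8-bit extension is chosen.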

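ANSI encoding: to extend ASCII so that local languages could be displayed, different countries and regions defined their own standards, which produced GB2312, BIG5, JIS, and others. These encodings keep ASCII characters as single bytes and use two bytes for each extended character, such as a Chinese character. On Windows they are collectively called ANSI encodings, also known as MBCS (Multi-Byte Character Set). "ANSI" therefore means the local encoding of the system: on a Simplified Chinese system it means GBK. Each ANSI encoding has a converter to and from Unicode, so Unicode serves as the intermediate form for every conversion: any encoding can be converted to Unicode, and Unicode can be converted to any other encoding that contains the characters in question.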
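As an illustration of Unicode acting as the intermediate form, here is a small sketch using Python's built-in `gbk` and `big5` codecs. The helper name `transcode` and the sample string are assumptions made for this example, not something from the original article.

```python
text = "a中文"

gbk_bytes = text.encode("gbk")    # Simplified Chinese local encoding
big5_bytes = text.encode("big5")  # Traditional Chinese local encoding


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert between two legacy encodings by passing through Unicode."""
    return data.decode(source).encode(target)


converted = transcode(gbk_bytes, "gbk", "big5")

# The ASCII letter stays one byte; each Chinese character takes two bytes.
print(len(gbk_bytes), gbk_bytes.hex(), converted.hex())
```

The conversion only works when the target encoding contains every character; otherwise `encode` raises `UnicodeEncodeError`.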

Unicode: see https://www.cnblogs.com/liupp123/articles/8023861.html, https://www.cnblogs.com/fnlingnzb-learner/p/6163205.html, https://www.cnblogs.com/benbenalin/p/7152570.html, and https://www.cnblogs.com/notbecoder/p/4840783.html

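The Servlet specification makes ISO-8859-1 the default encoding for decoding request parameters. The specification states this default for the request body (POST); for query strings (GET), containers historically applied the same default, although the URI specification itself defines none. See the following passages from the Tomcat wiki (https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q1):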

Default encoding for GET

The character set for HTTP query strings (that's the technical term for 'GET parameters') can be found in sections 2 and 2.1 of the "URI Syntax" specification. The character set is defined to be US-ASCII. Any character that does not map to US-ASCII must be encoded in some way. Section 2.1 of the URI Syntax specification says that characters outside of US-ASCII must be encoded using % escape sequences: each character is encoded as a literal % followed by the two hexadecimal codes which indicate its character code. Thus, a (US-ASCII character code 97 = 0x61) is equivalent to %61. There is no default encoding for URIs specified anywhere, which is why there is a lot of confusion when it comes to decoding these values.
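The % escape mechanism can be tried with Python's `urllib.parse`. This is only an illustrative sketch; the sample character and the two encodings are chosen here and do not come from the wiki. It shows why the lack of a default encoding matters: the same character produces different escape sequences depending on which encoding turned it into bytes.

```python
from urllib.parse import quote, unquote

# %61 is the escaped form of the ASCII letter "a" (0x61).
decoded_a = unquote("%61")

# A non-ASCII character is first turned into bytes, then each byte is escaped.
# The result depends on the encoding chosen for that first step.
utf8_form = quote("中", encoding="utf-8")
gbk_form = quote("中", encoding="gbk")

print(decoded_a, utf8_form, gbk_form)
```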

Some notes about the character encoding of URIs:

ISO-8859-1 and ASCII are compatible for character codes 0x20 to 0x7E, so they are often used interchangeably. Most of the web uses ISO-8859-1 as the default for query strings.
Many browsers are starting to offer (default) options of encoding URIs using UTF-8 instead of ISO-8859-1. Some browsers appear to use the encoding of the current page to encode URIs for links (see the note above regarding browser behavior for POST encoding).

HTML 4.0 recommends the use of UTF-8 to encode the query string.

When in doubt, use POST for any data you think might have problems surviving a trip through the query string.
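The mismatch described in these notes can be reproduced in a few lines. In this sketch (the query string and variable names are made up for illustration), a client encodes a parameter as UTF-8 and the receiver decodes it as ISO-8859-1, which yields garbled text. Because ISO-8859-1 maps every byte to exactly one character, the original bytes can be recovered and decoded again.

```python
from urllib.parse import parse_qs

# "中文" percent-encoded by a client that used UTF-8.
query = "name=%E4%B8%AD%E6%96%87"

as_utf8 = parse_qs(query, encoding="utf-8")["name"][0]
as_latin1 = parse_qs(query, encoding="iso-8859-1")["name"][0]  # garbled

# Undo the wrong decoding: back to the raw bytes, then decode as UTF-8.
repaired = as_latin1.encode("iso-8859-1").decode("utf-8")

print(as_utf8, repr(as_latin1), repaired)
```

The repair step is the same idea as the common Java workaround of re-reading a parameter's ISO-8859-1 bytes as UTF-8; it only works because the wrong decoding was lossless.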

Default Encoding for POST

ISO-8859-1 is defined as the default character set for HTTP request and response bodies in the servlet specification (request encoding: section 4.9 for spec version 2.4, section 3.9 for spec version 2.5; response encoding: section 5.4 for both spec versions 2.4 and 2.5). This default is historical: it comes from sections 3.4.1 and 3.7.1 of the HTTP/1.1 specification.

Some notes about the character encoding of a POST request:

Section 3.4.1 of HTTP/1.1 states that recipients of an HTTP message must respect the character encoding specified by the sender in the Content-Type header if the encoding is supported. A missing character set allows the recipient to "guess" what encoding is appropriate.

Most web browsers today do not specify the character set of a request, even when it is something other than ISO-8859-1. This seems to be in violation of the HTTP specification. Most web browsers appear to send a request body using the encoding of the page used to generate the POST (for instance, the <form> element came from a page with a specific encoding... it is that encoding which is used to submit the POST data for that form).
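A receiver following these rules might look roughly like the sketch below. It is not Tomcat's implementation: the function names `body_charset` and `parse_form_body` are hypothetical, and the fallback to ISO-8859-1 simply mirrors the servlet default quoted above. The charset is read from the Content-Type header with the standard library's `email.message.Message`.

```python
from email.message import Message
from urllib.parse import parse_qs

# Fallback used when the sender names no charset (the servlet default).
DEFAULT_BODY_CHARSET = "iso-8859-1"


def body_charset(content_type: str) -> str:
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(DEFAULT_BODY_CHARSET)


def parse_form_body(body: bytes, content_type: str) -> dict:
    """Decode a form-urlencoded body using the sender's declared charset."""
    charset = body_charset(content_type)
    # The body itself is ASCII; the charset applies to the %-escaped bytes.
    return parse_qs(body.decode("ascii"), encoding=charset)


form = b"name=%E4%B8%AD"
print(parse_form_body(form, "application/x-www-form-urlencoded; charset=UTF-8"))
print(parse_form_body(form, "application/x-www-form-urlencoded"))
```

When the browser omits the charset, as the note says most do, the second call falls back to ISO-8859-1 and the UTF-8 bytes come out garbled.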

HTTP Headers

Section 3.1 of the ARPA Internet Text Messages spec states that headers are always in US-ASCII encoding. Anything outside of that needs to be encoded. See the section above regarding query strings in URIs.
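One way to keep a header value within US-ASCII is to reuse the % escape scheme from the query string section. The sketch below assumes that both sides agree on UTF-8 and on percent-encoding; the helper name `to_header_safe` and the sample value are hypothetical, and other schemes exist for specific headers.

```python
from urllib.parse import quote, unquote


def to_header_safe(value: str) -> str:
    """Percent-encode a value as UTF-8 so that only ASCII characters remain."""
    return quote(value, safe="", encoding="utf-8")


raw = "文件.txt"
header_value = to_header_safe(raw)

print(header_value, header_value.isascii())
```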

By default, Tomcat 9.0 decodes query string parameters as UTF-8 and request body parameters as ISO-8859-1. See the following passage from the Tomcat wiki (https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q1):

In Tomcat 8 starting with 8.0.0 (8.0.0-RC3, to be specific), the default value of URIEncoding attribute on the <Connector> element depends on "strict servlet compliance" setting. The default value (strict compliance is off) of URIEncoding is now UTF-8. If "strict servlet compliance" is enabled, the default value is ISO-8859-1.

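In the conf/server.xml file under the Tomcat installation directory, adding the URIEncoding="UTF-8" attribute to <Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" /> sets the default character set that Tomcat uses to decode query string parameters: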

<Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8" />
