Tomcat's default encoding settings and why garbled text (mojibake) occurs

desert3

Category: Java.Tomcat  Tags: java, database, web.xml

Article link: https://blog.csdn.net/desert3/article/details/84157240

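[color=red]Note: mojibake depends on the concrete request implementation class[/color]. What I have found so far: before RequestDispatcher.forward is called, the request is an org.apache.catalina.connector.RequestFacade, whereas after RequestDispatcher.forward it is an org.apache.catalina.core.ApplicationHttpRequest. When the two classes parse parameters internally, they choose the default charset for decoding by different logic, and the factors that lead to mojibake also differ with the protocol in use!
For details see: [url=http://desert3.iteye.com/admin/blogs/1420119]Tomcat source analysis: inside ServletRequest.getParameterValues, Request charset & QueryStringEncoding[/url]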

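[color=green]How mojibake arises[/color]
Take the Chinese character "中". Encoded in UTF-8 it becomes three bytes, written %E4%B8%AD in percent-encoded form, and these three bytes are submitted to the Tomcat container by GET or POST. If you do not tell Tomcat that the parameter was encoded in UTF-8, Tomcat assumes it was encoded in ISO-8859-1. [color=red]ISO-8859-1 (compatible with US-ASCII, the standard character set of URIs) is an ASCII-compatible single-byte encoding that uses the whole range of a byte[/color], so Tomcat takes what you sent to be three characters encoded in ISO-8859-1, decodes them with ISO-8859-1, and gets ä, ¸ and a soft hyphen (U+00AD, usually invisible or shown as "-"). After decoding, that string exists in the JVM as Unicode, [color=red]whereas what HTTP transmits and what a database stores are bytes[/color]. Depending on what each endpoint needs, you can therefore encode this three-character Unicode string with UTF-8 and store the resulting bytes in the database (three characters, which take six bytes in UTF-8), or you can take the three ISO-8859-1 bytes behind those three characters and decode them again as UTF-8 to obtain the Unicode character "中" [color=red](the property used here: a byte stream in any other encoding can be treated as ISO-8859-1 without losing anything)[/color], and then send it to the client through the response (the bytes actually sent depend on the Content-Type you set!).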
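[color=red]Summary:[/color]
[list]
[*]1. HTTP GET and POST transmit bytes, and a database stores bytes too (500 MB of space means 500 million bytes).
[*]2. Mojibake is caused by encoding and decoding with different charsets: the same bytes may map to different characters under different encoding schemes, and some bytes may have no character at all in a given encoding (which is where the ? in mojibake comes from).
[*]3. After decoding, a string exists in the JVM as Unicode.
[*]4. If the Unicode characters in the JVM are the ones you expect (the encoding and decoding charsets are the same or compatible), there is no problem. If they are not the ones you expect, as in the example above where the JVM holds three Unicode characters, you can take the three ISO-8859-1 bytes behind those three characters and decode those bytes as UTF-8 to produce the intended Unicode character: the Chinese character "中".
[*]5. ISO-8859-1 is an ASCII-compatible single-byte encoding that uses the whole range of a byte, so a system that supports ISO-8859-1 can transmit and store a byte stream in any other encoding without dropping anything. In other words, any byte stream can be treated as ISO-8859-1 without loss.
[/list]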
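The code below shows that [color=green]encoding with different charsets gives different results, and that mojibake appears whenever the encoder and decoder disagree or the Chinese character does not exist in ISO-8859-1[/color]!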


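import java.io.UnsupportedEncodingException;
import java.math.BigInteger;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.Arrays;

// Wrapper class added so the snippet compiles on its own; the name is arbitrary.
public class EncodingDemo {
    public static void main(String[] args) {
        try {
            // The UTF-8 bytes of "中" are E4 B8 AD (URL-encoded: %E4%B8%AD).
            String item = new String(new byte[] { (byte) 0xe4, (byte) 0xb8, (byte) 0xad }, "UTF-8");
            // 中
            System.out.println(item);

            // The same bytes read as ISO-8859-1 are three characters: ä, ¸ and a soft hyphen.
            item = new String(new byte[] { (byte) 0xe4, (byte) 0xb8, (byte) 0xad }, "ISO-8859-1");
            // ä¸
            System.out.println(item);

            // 253 (0xFD) as a two's-complement byte array: [0, -3]
            System.out.println(Arrays.toString(new BigInteger("253").toByteArray()));
            // 11111101
            System.out.println(Integer.toBinaryString(253));

            // Undo the wrong decoding: back to the raw bytes, then decode as UTF-8 -> 中
            item = new String(item.getBytes("ISO-8859-1"), "UTF-8");
            System.out.println(item);
            // Redo the wrong decoding: UTF-8 bytes read as ISO-8859-1 -> ä¸
            item = new String(item.getBytes("UTF-8"), "ISO-8859-1");
            System.out.println(item);

            // "中" URL-encoded with UTF-8 is %E4%B8%AD (3 bytes)
            System.out.println(URLEncoder.encode("中", "UTF-8"));
            // "中" URL-encoded with ISO-8859-1 is %3F (1 byte): the character does not exist
            // in ISO-8859-1, so the encoding of "?" is returned instead
            System.out.println(URLEncoder.encode("中", "ISO-8859-1"));
            // "中" URL-encoded with GB2312 is %D6%D0 (2 bytes)
            System.out.println(URLEncoder.encode("中", "GB2312"));

            // The UTF-8 form %E4%B8%AD decoded with UTF-8 gives the correct character 中
            System.out.println(URLDecoder.decode("%E4%B8%AD", "UTF-8"));
            // The ISO-8859-1 form %3F decoded with ISO-8859-1 gives ?
            System.out.println(URLDecoder.decode("%3F", "ISO-8859-1"));
            // The GB2312 form %D6%D0 decoded with GB2312 gives the correct character 中
            System.out.println(URLDecoder.decode("%D6%D0", "GB2312"));
            // The UTF-8 form %E4%B8%AD decoded with ISO-8859-1 gives ä¸ (the so-called mojibake:
            // each of the 3 bytes is mapped to its own ISO-8859-1 character, which always
            // succeeds because ISO-8859-1 uses the whole range of a byte)
            System.out.println(URLDecoder.decode("%E4%B8%AD", "ISO-8859-1"));
            // The UTF-8 form %E4%B8%AD decoded with GB2312 gives 涓 followed by a replacement
            // character: the first 2 bytes %E4%B8 are the GB2312 character 涓, and the
            // remaining byte %AD is not a valid GB2312 sequence on its own
            System.out.println(URLDecoder.decode("%E4%B8%AD", "GB2312"));
        } catch (UnsupportedEncodingException e) {
            // Thrown only if the JVM does not provide one of the charsets named above
            e.printStackTrace();
        }
    }
}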
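
Point 5 of the summary can also be checked directly. The sketch below is only an illustration, not Tomcat code: the class name Latin1RoundTrip and its two helper methods are made up for this example, and the character "中" is written as the escape \u4e2d so that the result does not depend on the source file encoding. It shows that all 256 byte values survive a decode and re-encode through ISO-8859-1, which is why a wrongly decoded string can be repaired, and that the opposite direction (encoding "中" into ISO-8859-1) loses the character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, written only for this illustration.
class Latin1RoundTrip {

    // Decode bytes as ISO-8859-1 and encode them back; every byte value survives.
    static byte[] roundTripLatin1(byte[] raw) {
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        return asLatin1.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Repair a string that was decoded as ISO-8859-1 although its bytes were UTF-8.
    static String repair(String wronglyDecoded) {
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        // Lossless: prints true
        System.out.println(Arrays.equals(all, roundTripLatin1(all)));

        // "\u4e2d" is the character 中; decode its UTF-8 bytes with the wrong charset
        String garbled = new String("\u4e2d".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        // Repairable: prints true
        System.out.println(repair(garbled).equals("\u4e2d"));

        // Lossy direction: 中 has no ISO-8859-1 byte, so a single '?' (63) is produced
        byte[] lossy = "\u4e2d".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(lossy.length + " " + lossy[0]); // prints "1 63"
    }
}
```

The repair only works because the wrong decoding step used ISO-8859-1. Once a "?" has been substituted, as in the last case, the original bytes are gone.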

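[color=green]Tomcat's default encoding settings and the relevant standards:[/color]
[color=red]For GET requests[/color], the "URI Syntax" specification requires HTTP query strings (also called GET parameters) to be written in US-ASCII; every character outside that range must go through [color=red]percent-encoding, i.e. the %61 form (encode)[/color]. Because ISO-8859-1 and ASCII agree on the characters in the range 0x20 to 0x7E, most web containers, Tomcat among them, decode the %xx bytes of a URI as ISO-8859-1 by default (this is the default described in the Tomcat 6 Connector documentation referenced at the end; Tomcat 8 and later default to UTF-8). The [color=red]URIEncoding[/color] attribute of the Connector changes the charset used to decode the %xx bytes of the URI. [color=green]URIEncoding must match the charset that was used to encode the query string of the GET request; alternatively, with useBodyEncodingForURI="true" on the Connector, the charset given in Content-Type tells the container which charset was used to encode the characters in the URL[/color]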
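
The effect of a mismatched URIEncoding can be reproduced outside the container. The following is a minimal sketch under stated assumptions: QueryStringDecoding and decodeValue are hypothetical names standing in for what the container does with one percent-encoded value, and the second argument plays the role of the Connector's URIEncoding. It is not Tomcat's implementation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical stand-in for the container's query string decoding.
class QueryStringDecoding {

    // Decode one percent-encoded value with the charset the "container" is configured for.
    static String decodeValue(String encoded, String uriEncoding) {
        try {
            return URLDecoder.decode(encoded, uriEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + uriEncoding, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The client encodes the character 中 (written \u4e2d) with UTF-8
        String sent = URLEncoder.encode("\u4e2d", "UTF-8");
        System.out.println(sent); // prints %E4%B8%AD

        // Matching charset: the original character comes back
        System.out.println(decodeValue(sent, "UTF-8").equals("\u4e2d")); // prints true

        // Mismatched charset: three ISO-8859-1 characters instead of one
        System.out.println(decodeValue(sent, "ISO-8859-1").length()); // prints 3
    }
}
```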

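[color=red]POST requests[/color] should state the charset they use themselves, in the Content-Type header. Because many clients do not set an explicit charset, Tomcat falls back to ISO-8859-1. [color=red]Note: the charset used to decode the URI, the request charset and the response charset are three different things! How these three relate to each other differs between Request implementations.[/color]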

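[color=red]For POST requests[/color], ISO-8859-1 is the default encoding of the HTTP request and response as defined by the Servlet specification. If the charset of the [color=red]request or response[/color] has not been set, the Servlet specification prescribes ISO-8859-1. The charset of a request or a response is specified through its [color=red]Content-Type[/color] header.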
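
As an illustration of the fallback rule, here is a small sketch of picking the charset out of a Content-Type value and defaulting to ISO-8859-1 when none is given. The class ContentTypeCharset and the method charsetOf are hypothetical names for this example; the parsing is deliberately simplified and is not the parser Tomcat uses.

```java
import java.util.Locale;

// Hypothetical, simplified Content-Type handling; not Tomcat's parser.
class ContentTypeCharset {

    // The default the Servlet specification prescribes when no charset is set.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Return the value of the charset parameter, or the default if it is absent.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? DEFAULT_CHARSET : value;
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        // prints UTF-8
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
        // prints ISO-8859-1
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```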

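If neither a GET nor a POST request sets a charset through Content-Type, Tomcat uses ISO-8859-1 by default. SetCharacterEncodingFilter can change this default request encoding (encoding: the charset to use; ignore: with true the charset is set whether or not the client specified one, with false it is set only when the client did not specify one; the default is true).
Note: [color=red]this Filter should generally be placed before all other Filters[/color] (before Servlet 3.0 the order follows the order of the filter-mapping elements in web.xml; from Servlet 3.0 on the order can be specified explicitly), because once a value has been read from the request, setting the encoding afterwards has no effect. [color=red]The first time a value is read from the request, Tomcat decodes the query string or the POSTed variables with the charset in force at that moment and stores them in the parameters array; later reads take the values straight from that array![/color]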
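
The "parse once, then reuse" behaviour described above is what makes the filter order matter. The sketch below imitates it with a plain Java class. LazyParamRequest is a hypothetical stand-in, not a servlet API class and not Tomcat's Request; only the method names setCharacterEncoding and getParameter are borrowed from ServletRequest to keep the analogy readable.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical request object that parses its parameters lazily, once.
class LazyParamRequest {
    private final String rawQuery;            // e.g. "name=%E4%B8%AD"
    private String encoding = "ISO-8859-1";   // default when nothing has been set
    private Map<String, String> parsed;       // filled on first access, then reused

    LazyParamRequest(String rawQuery) {
        this.rawQuery = rawQuery;
    }

    void setCharacterEncoding(String encoding) {
        this.encoding = encoding;
    }

    String getParameter(String name) {
        if (parsed == null) {
            // First access: decode everything with the charset in force right now
            parsed = new HashMap<>();
            for (String pair : rawQuery.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    continue;
                }
                parsed.put(decode(pair.substring(0, eq)), decode(pair.substring(eq + 1)));
            }
        }
        return parsed.get(name);
    }

    private String decode(String s) {
        try {
            return URLDecoder.decode(s, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // Encoding set before the first read: the value is decoded correctly
        LazyParamRequest early = new LazyParamRequest("name=%E4%B8%AD");
        early.setCharacterEncoding("UTF-8");
        System.out.println(early.getParameter("name").equals("\u4e2d")); // prints true

        // Encoding set after the first read: the cached mojibake is returned
        LazyParamRequest late = new LazyParamRequest("name=%E4%B8%AD");
        late.getParameter("name");
        late.setCharacterEncoding("UTF-8");
        System.out.println(late.getParameter("name").equals("\u4e2d")); // prints false
    }
}
```

A filter or valve that reads a parameter before the encoding filter runs puts the request into the second state, which is why the encoding filter belongs at the front of the chain.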

[color=green]Recommended steps for using UTF-8 everywhere:[/color]
[list]
[*]1. Set URIEncoding="UTF-8" on your <Connector> in server.xml. [color=red]This makes Tomcat decode HTTP GET requests as UTF-8.[/color]
[*]2. Use a character encoding filter with the default encoding set to UTF-8. [color=red]Many requests do not specify a charset themselves, so Tomcat uses ISO-8859-1 as the encoding of the HttpServletRequest by default; the filter changes that.[/color]
[*]3. Change all your JSPs to include charset name in their contentType. For example, use <%@page contentType="text/html; charset=UTF-8" %> for the usual JSP pages and <jsp:directive.page contentType="text/html; charset=UTF-8" /> for the pages in XML syntax (aka JSP Documents). [color=red]This specifies the encoding used by the JSP page.[/color]
[*]4. Change all your servlets to set the content type for responses and to include charset name in the content type to be UTF-8. Use response.setContentType("text/html; charset=UTF-8") or response.setCharacterEncoding("UTF-8"). [color=red]This sets the encoding of the response.[/color]
[*]5. Change any content-generation libraries you use (Velocity, Freemarker, etc.) to use UTF-8 and to specify UTF-8 in the content type of the responses that they generate. [color=red]This specifies the encoding used by every template engine.[/color]
[*]6. Disable any valves or filters that may read request parameters before your character encoding filter or jsp page has a chance to set the encoding to UTF-8. [color=red]SetCharacterEncodingFilter should generally come first, otherwise it may have no effect.[/color]
[/list]
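
Steps 3 to 5 matter because the response writer turns characters into bytes with whatever charset the response carries. The sketch below imitates that step with plain java.io classes rather than the servlet API. ResponseBytes and write are hypothetical names for this illustration, and the charset argument stands in for the charset of the response Content-Type.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical illustration of how a response charset decides the bytes on the wire.
class ResponseBytes {

    // Write the body through a writer that encodes with the given charset.
    static byte[] write(String body, String charsetName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, Charset.forName(charsetName))) {
            writer.write(body);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // \u4e2d is the character 中
        System.out.println(write("\u4e2d", "UTF-8").length);      // prints 3
        // ISO-8859-1 cannot represent 中, so a single '?' byte is written
        System.out.println(write("\u4e2d", "ISO-8859-1").length); // prints 1
        System.out.println(write("\u4e2d", "ISO-8859-1")[0]);     // prints 63
    }
}
```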


/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements.  See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License.  You may obtain a copy of the License at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package filters;


import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;


/**
 * <p>Example filter that sets the character encoding to be used in parsing the
 * incoming request, either unconditionally or only if the client did not
 * specify a character encoding.  Configuration of this filter is based on
 * the following initialization parameters:</p>
 * <ul>
 * <li><strong>encoding</strong> - The character encoding to be configured
 *     for this request, either conditionally or unconditionally based on
 *     the <code>ignore</code> initialization parameter.  This parameter
 *     is required, so there is no default.</li>
 * <li><strong>ignore</strong> - If set to "true", any character encoding
 *     specified by the client is ignored, and the value returned by the
 *     <code>selectEncoding()</code> method is set.  If set to "false",
 *     <code>selectEncoding()</code> is called <strong>only</strong> if the
 *     client has not already specified an encoding.  By default, this
 *     parameter is set to "true".</li>
 * </ul>
 *
 * <p>Although this filter can be used unchanged, it is also easy to
 * subclass it and make the <code>selectEncoding()</code> method more
 * intelligent about what encoding to choose, based on characteristics of
 * the incoming request (such as the values of the <code>Accept-Language</code>
 * and <code>User-Agent</code> headers, or a value stashed in the current
 * user's session).</p>
 *
 * @author Craig McClanahan
 * @version $Id: SetCharacterEncodingFilter.java 939521 2010-04-30 00:16:33Z kkolinko $
 */

public class SetCharacterEncodingFilter implements Filter {


    // ----------------------------------------------------- Instance Variables


    /**
     * The default character encoding to set for requests that pass through
     * this filter.
     */
    protected String encoding = null;


    /**
     * The filter configuration object we are associated with.  If this value
     * is null, this filter instance is not currently configured.
     */
    protected FilterConfig filterConfig = null;


    /**
     * Should a character encoding specified by the client be ignored?
     */
    protected boolean ignore = true;


    // --------------------------------------------------------- Public Methods


    /**
     * Take this filter out of service.
     */
    public void destroy() {

        this.encoding = null;
        this.filterConfig = null;

    }


    /**
     * Select and set (if specified) the character encoding to be used to
     * interpret request parameters for this request.
     *
     * @param request The servlet request we are processing
     * @param response The servlet response we are creating
     * @param chain The filter chain we are processing
     *
     * @exception IOException if an input/output error occurs
     * @exception ServletException if a servlet error occurs
     */
    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain)
            throws IOException, ServletException {

        // Conditionally select and set the character encoding to be used
        if (ignore || (request.getCharacterEncoding() == null)) {
            String encoding = selectEncoding(request);
            if (encoding != null)
                request.setCharacterEncoding(encoding);
        }

        // Pass control on to the next filter
        chain.doFilter(request, response);

    }


    /**
     * Place this filter into service.
     *
     * @param filterConfig The filter configuration object
     */
    public void init(FilterConfig filterConfig) throws ServletException {

        this.filterConfig = filterConfig;
        this.encoding = filterConfig.getInitParameter("encoding");
        String value = filterConfig.getInitParameter("ignore");
        if (value == null)
            this.ignore = true;
        else if (value.equalsIgnoreCase("true"))
            this.ignore = true;
        else if (value.equalsIgnoreCase("yes"))
            this.ignore = true;
        else
            this.ignore = false;

    }


    // ------------------------------------------------------ Protected Methods


    /**
     * Select an appropriate character encoding to be used, based on the
     * characteristics of the current request and/or filter initialization
     * parameters.  If no character encoding should be set, return
     * <code>null</code>.
     * <p>
     * The default implementation unconditionally returns the value configured
     * by the <strong>encoding</strong> initialization parameter for this
     * filter.
     *
     * @param request The servlet request we are processing
     */
    protected String selectEncoding(ServletRequest request) {

        return (this.encoding);

    }


}

References: [url=http://wiki.apache.org/tomcat/FAQ/CharacterEncoding]tomcat wiki faq Character Encoding Issues[/url]
[url=http://tomcat.apache.org/tomcat-6.0-doc/config/http.html]Apache Tomcat Configuration Reference - The HTTP Connector[/url]
