Encoding and Decoding in Java: Fixing Garbled Text by Looking at the Source Code and the Protocol

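Encoding and decoding:

Encoding is the conversion of text, numbers, or other objects into codes by a predefined method, or the conversion of information and data into prescribed electrical pulse signals. In software, encoding can be understood as converting a sequence of characters into a sequence of bytes according to an agreed scheme; UTF-8 and ISO-8859-1 are examples of such schemes. Decoding is the reverse process: restoring a sequence of bytes to characters according to the agreed scheme.

Conclusions:

1. Converting characters to bytes is called encoding.

2. Converting bytes to characters is called decoding.

Encoding and decoding in Java (UTF-8 encodes each of the Chinese characters used here with three bytes):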

package encoding;

import java.nio.charset.StandardCharsets;

public class Test {
	public static void main(String[] args) {
		
		String str="编码与解码";
		
		// Encode: characters -> bytes
		byte[] b=str.getBytes(StandardCharsets.UTF_8);
		System.out.println("Bytes:");
		for(byte bb:b) {
			System.out.print(bb+" ");
		}
		System.out.println();
		
		// The same 15 bytes written out by hand
		b= new byte[] {-25 ,-68, -106 ,-25 ,-96 ,-127 ,-28 ,-72 ,-114 ,-24 ,-89 ,-93 ,-25 ,-96 ,-127};
		
		// Decode: bytes -> characters
		System.out.println("Characters:");
		System.out.println(new String(b,StandardCharsets.UTF_8));
		
	}

}

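The output is the 15 signed byte values that appear in the array above, followed by the decoded string 编码与解码.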

 

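Because UTF-8 encodes each of these Chinese characters with three bytes, the byte array has 15 elements, representing five Chinese characters. (A real byte stream would be shown as eight-bit binary values; that is not demonstrated here.)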
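The remark about eight-bit binary values can be made concrete with a short sketch. The class and method names (Utf8Bytes, toBinary, utf8Length) are made up for illustration and are not part of the original post; the string literals use \u escapes so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    // Render each byte as an eight-bit binary string, separated by spaces.
    static String toBinary(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not sign-extend to 32 bits.
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to exactly eight digits.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    // Number of bytes UTF-8 needs for the given text.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // \u7F16 is the first character of the sample string (bytes -25 -68 -106).
        System.out.println(toBinary("\u7F16".getBytes(StandardCharsets.UTF_8)));
        System.out.println(utf8Length("a"));             // ASCII letter
        System.out.println(utf8Length("\u7F16"));        // one Chinese character
        System.out.println(utf8Length("\uD83D\uDE00"));  // a character outside the BMP
    }
}
```

Three bytes per character holds for the Chinese characters in this example; ASCII characters take one byte in UTF-8 and characters outside the Basic Multilingual Plane take four.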

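As shown, a byte array produced with a given encoding can be restored to the original text by decoding it with the same encoding. This conversion exists to make data convenient to transmit over a network and to store on a computer.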

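Garbled text:

Garbled text is a headache for many beginners. It usually appears because the encoding and the decoding charset do not match, or because the charset used for encoding does not support the characters being encoded (ISO-8859-1, for example, does not support Chinese, so the text is garbled however it is encoded and decoded). Next we go through common garbling scenarios in Java web applications, and their fixes, to explain encoding and decoding further.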
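Both causes can be reproduced with the standard library alone. The sketch below is illustrative, and its class and method names (MojibakeDemo, misdecode, repair, lossy) are invented here: misdecode decodes UTF-8 bytes as ISO-8859-1, repair reverses that mistake, and lossy encodes with a charset that cannot represent the text.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Cause 1: bytes encoded as UTF-8 but decoded as ISO-8859-1.
    // Every byte becomes one character, so the text is garbled but no data is lost.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // The mismatch is reversible: recover the original bytes, then decode them correctly.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Cause 2: the encoding charset cannot represent the characters.
    // getBytes substitutes '?' for each unmappable character, so the data is gone.
    static String lossy(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u7F16\u7801"; // two Chinese characters
        String garbled = misdecode(original);
        System.out.println(garbled.length());                // 6: one char per UTF-8 byte
        System.out.println(repair(garbled).equals(original)); // true
        System.out.println(lossy(original));                  // ??
    }
}
```

The distinction matters in practice: a decoding mismatch can be repaired after the fact, while text encoded with an unsupporting charset cannot.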

Garbled servlet responses:

Methods that set the encoding:

(1) public void setContentType(String type);

/**
     * Sets the content type of the response being sent to the client, if the
     * response has not been committed yet. The given content type may include a
     * character encoding specification, for example,
     * <code>text/html;charset=UTF-8</code>. The response's character encoding
     * is only set from the given content type if this method is called before
     * <code>getWriter</code> is called.
     * <p>
     * This method may be called repeatedly to change content type and character
     * encoding. This method has no effect if called after the response has been
     * committed. It does not set the response's character encoding if it is
     * called after <code>getWriter</code> has been called or after the response
     * has been committed.
     * <p>
     * Containers must communicate the content type and the character encoding
     * used for the servlet response's writer to the client if the protocol
     * provides a way for doing so. In the case of HTTP, the
     * <code>Content-Type</code> header is used.
     *
     * @param type
     *            a <code>String</code> specifying the MIME type of the content
     * @see #setLocale
     * @see #setCharacterEncoding
     * @see #getOutputStream
     * @see #getWriter
     */

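-----------------------------------------------------------------------------Rough translation-----------------------------------------------------------------------------
Sets the content type of the response sent to the client, provided the response has not been committed yet. The content type may include a character encoding, for example "text/html;charset=UTF-8". The response's character encoding is taken from the content type only if this method is called before getWriter().
The method may be called repeatedly to change the content type and the character encoding. It has no effect once the response has been committed, and it does not change the character encoding if it is called after getWriter() or after the response has been committed.
The container must communicate the content type and the character encoding used by the servlet response's writer to the client if the protocol provides a way to do so. For HTTP, the Content-Type header is used.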

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

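In short, this method sets the Content-Type header of the HTTP response. For a detailed explanation of the Content-Type header, see the post I reposted: https://blog.csdn.net/yanshaoshuai/article/details/83757976. The header declares the type of the content being sent; text/html, for example, means an HTML document.

We can also set the encoding of the text content, for example response.setContentType("text/html;charset=UTF-8");. The client browser uses this parameter when it parses the response body, so it plays the role of the decoding charset discussed earlier. If it matches the charset used for encoding, and that charset supports the content being encoded, the text will not be garbled.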
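To make the role of the charset parameter concrete, here is a rough sketch of how a client might pull it out of a Content-Type value. The helper charsetOf is hypothetical and deliberately simplified; real browsers and HTTP libraries follow the full header grammar and have their own fallback rules.

```java
import java.util.Locale;

class ContentTypeCharset {
    // Return the value of the charset parameter, or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type (e.g. text/html); parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html;charset=UTF-8"));
        System.out.println(charsetOf("text/html"));
    }
}
```

When the helper returns null, the client has no declared charset and must guess, which is where garbled pages tend to come from.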

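(2) public void setCharacterEncoding(String charset);

Sets the character encoding of the response sent to the client, for example UTF-8. If the character encoding has already been set by setContentType() or setLocale(), this method overrides it. Calling setContentType("text/html") together with setCharacterEncoding("UTF-8") is equivalent to calling setContentType("text/html;charset=UTF-8").
This method can be called repeatedly to change the character encoding. It has no effect if it is called after getWriter() or after the response has been committed.
The container must communicate the character encoding used by the servlet response's writer to the client if the protocol provides a way to do so. For HTTP, the character encoding is communicated as part of the Content-Type header for text media types. Note that if the servlet does not specify a content type, the character encoding cannot be communicated via HTTP headers; it is, however, still used to encode text written through the response's writer.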

/**
     * Sets the character encoding (MIME charset) of the response being sent to
     * the client, for example, to UTF-8. If the character encoding has already
     * been set by {@link #setContentType} or {@link #setLocale}, this method
     * overrides it. Calling {@link #setContentType} with the
     * <code>String</code> of <code>text/html</code> and calling this method
     * with the <code>String</code> of <code>UTF-8</code> is equivalent with
     * calling <code>setContentType</code> with the <code>String</code> of
     * <code>text/html; charset=UTF-8</code>.
     * <p>
     * This method can be called repeatedly to change the character encoding.
     * This method has no effect if it is called after <code>getWriter</code>
     * has been called or after the response has been committed.
     * <p>
     * Containers must communicate the character encoding used for the servlet
     * response's writer to the client if the protocol provides a way for doing
     * so. In the case of HTTP, the character encoding is communicated as part
     * of the <code>Content-Type</code> header for text media types. Note that
     * the character encoding cannot be communicated via HTTP headers if the
     * servlet does not specify a content type; however, it is still used to
     * encode text written via the servlet response's writer.
     *
     * @param charset
     *            a String specifying only the character set defined by IANA
     *            Character Sets
     *            (http://www.iana.org/assignments/character-sets)
     * @see #setContentType #setLocale
     * @since 2.4
     */

Some experiments with these two methods.

Printing the response content type and character encoding directly:

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		// response.setHeader("Content-type", "text/html;charset=UTF-8");
		//response.setCharacterEncoding("UTF-8");
		String str="用字节流输出汉字";
		System.out.println(response.getContentType());
		System.out.println(response.getCharacterEncoding());
		response.getOutputStream().write(str.getBytes());
	}
Output:
null
ISO-8859-1

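Content-Type was not set, so getContentType() prints null. The Javadoc of getCharacterEncoding() says: If no character encoding has been specified, <code>ISO-8859-1</code> is returned. In other words, when no encoding is set, the default is ISO-8859-1.

Will Chinese output be garbled this way?

Sure enough, it is garbled. ISO-8859-1 does not support Chinese, and with no Content-Type header the browser is never told which encoding the bytes are actually in.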
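The difference between writing bytes and writing characters is worth seeing in isolation. The sketch below uses only java.io, with a ByteArrayOutputStream standing in for the response body; that stand-in and the names DefaultEncodingWriter, viaStream, and viaWriter are assumptions made for illustration, not servlet API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultEncodingWriter {
    // Byte-stream path: the caller encodes, and the bytes pass through untouched.
    static byte[] viaStream(String text, Charset encodeWith) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        byte[] encoded = text.getBytes(encodeWith);
        body.write(encoded, 0, encoded.length);
        return body.toByteArray();
    }

    // Character-stream path: the writer encodes with its own configured charset.
    static byte[] viaWriter(String text, Charset writerCharset) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(body, writerCharset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587"; // two Chinese characters
        // Six UTF-8 bytes reach the client regardless of any response charset.
        System.out.println(viaStream(text, StandardCharsets.UTF_8).length);
        // An ISO-8859-1 writer cannot map the characters and emits '?' (63) for each.
        System.out.println(viaWriter(text, StandardCharsets.ISO_8859_1).length);
    }
}
```

So on the byte-stream path the default ISO-8859-1 never touches the data; what is missing is the declaration that tells the browser how to decode it. On the writer path the default would destroy the text before it is sent.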

 

Next, run the following test:

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		// response.setHeader("Content-type", "text/html;charset=UTF-8");
                //response.setContentType("text/html;charset=UTF-8");
		response.setCharacterEncoding("UTF-8");
		String str="输出中文";
		System.out.println(response.getContentType());
		System.out.println(response.getCharacterEncoding());
		response.getOutputStream().write(str.getBytes());
	}
Output:
null
UTF-8

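That should be as expected. But is the page garbled in the browser?

Yes, it is still garbled. The cause is not that getBytes("UTF-8") was left out: getBytes() encodes the characters with the platform default charset, which was already UTF-8 in this test. If you doubt it, check System.getProperty("file.encoding").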
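A quick way to check what the no-argument getBytes() does on a given machine is sketched below. The class name PlatformCharset and its method are illustrative; the printed values depend on the JVM and operating system, so no output is shown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PlatformCharset {
    // getBytes() with no argument encodes with the platform default charset.
    static boolean usesDefaultCharset(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        String text = "\u8F93\u51FA\u4E2D\u6587";
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
        System.out.println(usesDefaultCharset(text));
        // Only the explicit overload gives the same bytes on every platform.
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

If the default charset printed here is not UTF-8, a servlet that declares charset=UTF-8 but writes str.getBytes() will send bytes that do not match its own declaration.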

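The real cause this time is that no Content-Type was set. This matches the note in the setCharacterEncoding Javadoc above: the character encoding cannot be communicated via HTTP headers if the servlet does not specify a content type.

So to avoid garbled output, both the Content-Type and the character encoding must be set, as below: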

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		//response.setHeader("Content-type", "text/html;charset=UTF-8");
		response.setContentType("text/html;charset=UTF-8");
		//response.setCharacterEncoding("UTF-8");
		String str="输出中文";
		System.out.println(response.getContentType());
		System.out.println(response.getCharacterEncoding());
		response.getOutputStream().write(str.getBytes());
	}
Output:
text/html;charset=UTF-8
UTF-8

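The page is no longer garbled in the browser either.

Setting both is enough; the order and the way they are set do not matter (to avoid depending on the platform encoding, prefer the explicit getBytes("UTF-8")):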

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		//response.setHeader("Content-type", "text/html;charset=UTF-8");
		//response.setContentType("text/html;charset=UTF-8");
		response.setCharacterEncoding("UTF-8");
		response.setContentType("text/html");
		
		String str="输出中文";
		System.out.println(response.getContentType());
		System.out.println(response.getCharacterEncoding());
		System.out.println(System.getProperty("file.encoding"));
		response.getOutputStream().write(str.getBytes("UTF-8"));
	}
Or:
protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		//response.setHeader("Content-type", "text/html;charset=UTF-8");
		//response.setContentType("text/html;charset=UTF-8");
		response.setContentType("text/html");
		response.setCharacterEncoding("UTF-8");
		
		String str="输出中文";
		System.out.println(response.getContentType());
		System.out.println(response.getCharacterEncoding());
		System.out.println(System.getProperty("file.encoding"));
		response.getOutputStream().write(str.getBytes("UTF-8"));
	}

Both work.

With garbling solved for the byte stream, the character stream is no problem either:

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		response.setHeader("Content-type", "text/html;charset=UTF-8");
		//response.setContentType("text/html;charset=UTF-8");
		response.setCharacterEncoding("UTF-8");
		response.setContentType("text/html");
		
		String str="输出中文";
		System.out.println(response.getContentType());
		System.out.println(response.getCharacterEncoding());
		System.out.println(System.getProperty("file.encoding"));
		response.getWriter().print(str);
	}

The code above also produces correct output.

 

Garbled requests

Garbled POST submissions:

Test JSP:

<%@ page language="java" contentType="text/html; charset=UTF-8"
    pageEncoding="UTF-8"%>
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Insert title here</title>
</head>
<body>
	<form action="${pageContext.request.contextPath}/RequestTest" method="post">
	姓名<input type="text" name="姓名">
	<input type="submit" value="提交">
	</form>
</body>
</html>

Servlet (doPost calls doGet):

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		Enumeration<String>names=request.getParameterNames();
		while(names.hasMoreElements()) {
			String name=names.nextElement();
			System.out.println(name+":"+request.getParameter(name));
		}
		System.out.println(request.getCharacterEncoding());
	}

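Submitting the form with a Chinese name prints the parameter garbled, followed by null for the character encoding.

Although the encoding prints as null, ISO-8859-1 is what is actually used, and this can be verified: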

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		request.setCharacterEncoding("iso8859-1");
		Enumeration<String>names=request.getParameterNames();
		while(names.hasMoreElements()) {
			String name=names.nextElement();
			System.out.println(name+":"+request.getParameter(name));
		}
		System.out.println(request.getCharacterEncoding());
	}

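This code only sets the encoding to iso8859-1 and is otherwise the same as above. Looking at the output again:

Both runs produce the same garbled output, which indirectly shows that Tomcat uses ISO-8859-1 when no encoding is set.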

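Fixing garbled POST requests:

Set the encoding to UTF-8 with request.setCharacterEncoding("UTF-8"), before any parameter is read, and try again:

The output is now correct.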
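What the container does with the POST body can be imitated with java.net.URLDecoder. This is only a sketch under assumptions: the sample body below is what a browser would typically send for a form on a UTF-8 page, and the class and method names (FormBodyDecoding, decode, recover) are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class FormBodyDecoding {
    // Percent-decode a form-encoded token, interpreting the bytes with the given charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Repair text that was decoded as ISO-8859-1 although the bytes were UTF-8.
    static String recover(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed wire form of the field name from the test JSP, sent from a UTF-8 page.
        String wire = "%E5%A7%93%E5%90%8D";
        String wrong = decode(wire, "ISO-8859-1"); // what the default produces
        String right = decode(wire, "UTF-8");      // what setCharacterEncoding("UTF-8") produces
        System.out.println(wrong.length()); // 6 garbled characters
        System.out.println(right.length()); // 2 Chinese characters
        System.out.println(recover(wrong).equals(right)); // true
    }
}
```

The recover step is the manual workaround for cases where the request encoding cannot be set before the parameters are parsed.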

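Garbled GET submissions:

1) Typing Chinese directly into the address bar

Test directly with the RequestTest servlet above:

Unexpectedly, nothing is garbled, so some conversion must be happening underneath.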

 

Now look at what the query string contains:

	protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		//request.setCharacterEncoding("UTF-8");
		
		Enumeration<String>names=request.getParameterNames();
		while(names.hasMoreElements()) {
			String name=names.nextElement();
			System.out.println(name+":"+request.getParameter(name));
		}
		//System.out.println(URLDecoder.decode(request.getQueryString()));
		System.out.println(request.getQueryString());
		System.out.println(request.getCharacterEncoding());
	}

 

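As shown, the browser URL-encoded the Chinese characters, and my guess is that the container decodes the URL encoding underneath, which is why nothing is garbled. (Tomcat 8 and later decode the query string as UTF-8 by default through the connector's URIEncoding setting, which would explain this.) POST parameters are garbled for a different reason: they travel in the request body rather than in the URL, and they are encoded with the JSP page's encoding before being sent, so they must be decoded with the corresponding encoding when they are read.

If this guess is right, decoding the query string manually should also give the correct content: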

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		//request.setCharacterEncoding("UTF-8");
		
		Enumeration<String>names=request.getParameterNames();
		while(names.hasMoreElements()) {
			String name=names.nextElement();
			System.out.println(name+":"+request.getParameter(name));
		}
		System.out.println(URLDecoder.decode(request.getQueryString(), "UTF-8"));
		//System.out.println(request.getQueryString());
		System.out.println(request.getCharacterEncoding());
	}
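Decoding the whole query string at once works for this test, but it breaks if a value itself contains an encoded & or =. A slightly more careful sketch splits first and decodes each piece afterwards. QueryStringParser and its parse method are hypothetical helpers, not what Tomcat actually does internally; with repeated names the last value wins here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringParser {
    // Split on & and =, then percent-decode each name and value separately.
    static Map<String, String> parse(String rawQuery, String charset) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name, charset), decode(value, charset));
        }
        return params;
    }

    private static String decode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // The value contains an encoded '&' that must not be treated as a separator.
        Map<String, String> params = parse("%E5%A7%93%E5%90%8D=a%26b&empty=", "UTF-8");
        System.out.println(params.size());
        System.out.println(params.get("\u59D3\u540D"));
    }
}
```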

 

==========================================================================================

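URL encoding is the format browsers use to package form input. The browser takes every name and value from the form and encodes them as name/value parameters (escaping the characters that cannot be sent as they are, and so on), then sends them to the server either as part of the URL or separately.

URL encoding follows these rules: name/value pairs are separated by &, and within each pair the name and the value are separated by =. If the user enters no value for a name, the name still appears, just with an empty value. Any special character (anything that is not plain 7-bit ASCII, such as a Chinese character) is encoded as a percent sign % followed by hexadecimal digits; this also applies to characters such as =, &, and %.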

===========================================================================================
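The rules quoted above can be tried out with java.net.URLEncoder, which implements the form-encoding variant (a space becomes +). FormEncoder below is an illustrative helper rather than part of any servlet or browser API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormEncoder {
    // Join name=value pairs with &, percent-encoding each name and value as UTF-8.
    static String encode(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("\u59D3\u540D", ""); // Chinese field name with no value entered
        fields.put("q", "a b&c");       // value containing a space and a reserved character
        System.out.println(encode(fields)); // %E5%A7%93%E5%90%8D=&q=a+b%26c
    }
}
```

The empty field still appears as a name followed by =, and the & inside the value is escaped so it cannot be mistaken for a pair separator.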

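2) Hyperlinks with Chinese in a JSP

With this in mind, to keep a hyperlink with Chinese parameters from being garbled, encode the Chinese with URLEncoder.encode() in the page. Decoding the query string with URLDecoder.decode() on the server side then gives the correct text.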

<%@page import="java.net.URLEncoder"%>
<%@ page language="java" contentType="text/html; charset=UTF-8"
    pageEncoding="UTF-8"%>
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Insert title here</title>
</head>
<body>
	<form action="${pageContext.request.contextPath}/RequestTest" method="post">
	姓名<input type="text" name="姓名">
	<input type="submit" value="提交">
	<a href="${pageContext.request.contextPath}/RequestTest?<%=URLEncoder.encode("姓名", "UTF-8")%>=<%=URLEncoder.encode("闫绍帅", "UTF-8")%>" >get提交</a>
	</form>
</body>
</html>

 

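That is all for now. I have not traced how the raw byte stream is converted at the lower level, so the explanations above are guesses checked against observed behavior. When I have time I will read the Tomcat source and add more substance.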
