URL encoding problems

hurryshb


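1. URL encode and decode

Encoding and decoding characters is often a headache when Chinese text and special symbols are involved. URL encoding and decoding is one branch of the solution: a simple algorithm escapes the special characters. The algorithm is roughly as follows: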

The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.

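Put simply, a non-ASCII character is first converted to bytes with some charset (a Chinese character usually becomes three bytes in UTF-8), and each byte is then written as "%" followed by two hexadecimal digits for its 8 bits. A simplified Java version of the processing: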

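    import java.nio.charset.Charset;

    static String encode(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);             // full code point, not half a surrogate pair
            int len = Character.charCount(cp);
            if (cp == ' ') {
                sb.append('+');                    // rule 3: space becomes '+'
            } else if (isSafe(cp)) {
                sb.append((char) cp);              // rules 1 and 2: kept as-is
            } else {
                // rule 4: everything else, including '%' itself (which becomes %25)
                byte[] ba = s.substring(i, i + len).getBytes(charset);
                for (byte b : ba) {
                    sb.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += len;                              // advance past the character just handled
        }
        return sb.toString();
    }

    static boolean isSafe(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                || c == '.' || c == '-' || c == '*' || c == '_';
    }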

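With this scheme, "a中国" becomes "a%E4%B8%AD%E5%9B%BD". The sender encodes, and the receiver decodes with the reverse algorithm. However, a few of the special characters involved, "%" in particular, often cause subtle problems.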
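The round trip can be checked with the JDK's own java.net.URLEncoder and java.net.URLDecoder. The following is only a small sketch; the class name UrlRoundTrip and its two helper methods are made up for illustration, and the helpers wrap the checked exception so they are easy to call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UrlRoundTrip {
    // Encode with UTF-8 using the JDK encoder.
    static String encodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Decode with UTF-8 using the JDK decoder.
    static String decodeUtf8(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u4E2D\u56FD" is the two-character string from the example above
        String encoded = encodeUtf8("a\u4E2D\u56FD b%");
        System.out.println(encoded);              // a%E4%B8%AD%E5%9B%BD+b%25
        System.out.println(decodeUtf8(encoded));  // the original string again
    }
}
```

Note that the space becomes "+" and the literal "%" becomes "%25", which is exactly the character behind the problems described next.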

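2. Problem 1: Apache rewrite

During a unified domain migration we ran into such a problem. The symptom was that a parameter that had previously been passed correctly was now too long. Investigation showed that, to stay compatible with both domains, we had added a rewrite for certain URLs, and Apache's rewrite module by default escapes characters such as % to %25 before sending the rewrite response to the browser. As a result the parameter changed from %252BeNh to %25252BeNh and exceeded the length limit.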
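The growth of the parameter is easy to reproduce. The sketch below is not Apache's code; escapePercent is a hypothetical helper that only imitates the one effect described above, replacing each "%" with "%25".

```java
class DoubleEscapeDemo {
    // Hypothetical stand-in for the escaping step: every '%' becomes "%25".
    static String escapePercent(String s) {
        return s.replace("%", "%25");
    }

    public static void main(String[] args) {
        String original = "%252BeNh";               // parameter as it arrived
        String rewritten = escapePercent(original); // parameter after the redirect
        System.out.println(original + " -> " + rewritten);
        System.out.println("length grew by " + (rewritten.length() - original.length()));
    }
}
```

Each additional pass adds two characters per "%", so a value close to a length limit can overflow after a single redirect.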

The fix is to add the NE (noescape) flag to the Apache rewrite rule, as follows:

RewriteRule ^/martini/(.*)$    /eve/$1 [L,R,NE]

That solved the problem. For more on Apache rewrite configuration, see:

http://httpd.apache.org/docs/2.2/mod/mod_rewrite.html

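3. Problem 2: automatic decoding by the container

In eve's tracelog module, the target URL is passed through URLEncoder.encode and wrapped as a parameter in an eve URL, which is then emailed to the customer. When the customer clicks the link, the request goes through eve, which records the visit and redirects to the target. For example, the target URL http://www.taobao.com is wrapped as http://eve.alibaba.com/dispatch?targetUrl=http%3A%2F%2Fwww.taobao.com&logid=12345.

One case went wrong, though: the target URL itself carried a parameter, a Chinese value that had already been encoded. The eve user had first encoded the target URL with the GBK charset, producing: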

http://www.taobao.com?user=%EE%E2

eve then, as usual, encoded this URL again with UTF-8 and assembled the following URL:

http://eve.alibaba.com/dispatch?targetUrl=http%3A%2F%2Fwww.taobao.com%3Fuser%3D%25EE%25E2&logid=12345

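As a result, the customer's click on the redirect link failed.

In eve's redirect servlet, the handling logic was roughly: read targetUrl with request.getParameter, URL-decode it, then redirect.

The container had already decoded the value once inside getParameter, so the extra decode also decoded the %EE%E2 that belonged to the target URL itself, corrupting the GBK-encoded parameter. The final fix was therefore simply to remove the decode from the redirect code. It was a simple problem, but not knowing that the container had already done one decode caused some confusion.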
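The effect of the extra decode can be reproduced outside a container. This is a sketch under assumptions: the class and helper names are invented, the first decode call stands in for what the container does inside getParameter, and UTF-8 is assumed for the second decode.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class DoubleDecodeDemo {
    static String encode(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    static String decode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String target = "http://www.taobao.com?user=%EE%E2"; // already GBK-encoded by the caller
        String wrapped = encode(target, "UTF-8");            // what eve puts into targetUrl=
        String once = decode(wrapped, "UTF-8");              // stands in for the container's decode
        String twice = decode(once, "UTF-8");                // the redundant decode in the servlet
        System.out.println(wrapped);
        System.out.println(once.equals(target));   // true: one decode restores the target URL
        System.out.println(twice.equals(target));  // false: %EE%E2 was decoded a second time
    }
}
```

After one decode the value is exactly the target URL that should be redirected to; the second decode is what damages it.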

Appendix:

Call chain for decoding in Jetty's getParameter:

--UrlEncoded.decodeUtf8To(..);
--org.eclipse.jetty.http.HttpURI.decodeQueryTo(MultiMap parameters)
--org.eclipse.jetty.server.Request.extractParameters()
--org.eclipse.jetty.server.Request.getParameter(String name)

Call chain for decoding in Tomcat's getParameter:

--org.apache.tomcat.util.http.Parameters.urlDecode
--org.apache.tomcat.util.http.Parameters.processParameters
--org.apache.tomcat.util.http.Parameters.handleQueryParameters
--org.apache.catalina.connector.Request.handleQueryParameters
--org.apache.catalina.connector.Request.parseParameters
--org.apache.catalina.connector.Request.getParameter(String name)
--org.apache.catalina.connector.RequestFacade.getParameter(String name)

In addition, both encode and decode require a charset, such as UTF-8, GBK, or ISO-8859-1. When none is specified, Tomcat decodes both the query string and the body with ISO-8859-1.
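What an ISO-8859-1 decode does to UTF-8 data can be shown with plain JDK calls. The sketch below uses invented names (Latin1MojibakeDemo, repair); the repair step is a common workaround, not something the original servlet code is known to do, and it only works because ISO-8859-1 maps every byte to exactly one character.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class Latin1MojibakeDemo {
    static String decode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Re-read the characters of a wrongly decoded string as UTF-8 bytes.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = "%E4%B8%AD";                     // one Chinese character, UTF-8 encoded
        String garbled = decode(sent, "ISO-8859-1");   // three Latin-1 characters instead of one
        System.out.println(garbled.length());          // 3
        System.out.println(repair(garbled).equals("\u4E2D")); // true
    }
}
```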

4. Container encoding issues

Tomcat handles the charset in getParameter with the following code:

        String enc = getCharacterEncoding();
        boolean useBodyEncodingForURI = connector.getUseBodyEncodingForURI();
        if (enc != null) {
            parameters.setEncoding(enc);
            if (useBodyEncodingForURI) {
                parameters.setQueryStringEncoding(enc);
            }
        } else {
            parameters.setEncoding
                (org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING);
            if (useBodyEncodingForURI) {
                parameters.setQueryStringEncoding
                    (org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING);
            }
        }

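This code shows that if the request's character encoding has been set (either by calling request.setCharacterEncoding(charset) in a filter, or by an HTTP header such as Content-type: application/x-www-form-urlencoded; charset=UTF-8), the specified encoding is used to parse the parameters in the request body.

If the HTML previously sent to the browser contains something like <meta http-equiv="Content-Type" content="text/html; charset=GBK" />, forms in that page submit their data in the encoding the HTML specifies, and the body should be decoded with that charset. Note, however, that although IE submits in this encoding, it does not put the charset into the Content-type header (as in Content-type: application/x-www-form-urlencoded; charset=UTF-8), so request.getCharacterEncoding() returns null in the container. Our LocaleFilter exists precisely to solve this.

Whether the query string uses this encoding also depends on useBodyEncodingForURI. This is set on the connector, that is, in the container configuration, and cannot vary per request. If it is not set, the query string is decoded (including URL decoding) as ISO-8859-1.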
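The decision described above can be restated as two small functions. This is a sketch of the behaviour as the article describes it, not Tomcat's code: the class EncodingChoice and its method names are invented, and the connector's separate URIEncoding setting is deliberately not modelled.

```java
class EncodingChoice {
    // Stands in for org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING.
    static final String DEFAULT = "ISO-8859-1";

    // Body parameters: the request encoding if one was set, otherwise the default.
    static String bodyEncoding(String requestEncoding) {
        return requestEncoding != null ? requestEncoding : DEFAULT;
    }

    // Query string: follows the body encoding only when useBodyEncodingForURI is on;
    // otherwise (per the description above) it stays on the default.
    static String queryEncoding(String requestEncoding, boolean useBodyEncodingForURI) {
        return useBodyEncodingForURI ? bodyEncoding(requestEncoding) : DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(bodyEncoding("UTF-8"));         // UTF-8
        System.out.println(queryEncoding("UTF-8", false)); // ISO-8859-1
        System.out.println(queryEncoding("UTF-8", true));  // UTF-8
        System.out.println(queryEncoding(null, true));     // ISO-8859-1
    }
}
```

The second line of output is the case that surprises people: setting the request encoding in a filter changes how the body is read but leaves the query string alone unless the connector flag is enabled.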

On the browser side, for a query string on Windows, IE transmits it in GBK as-is, while Firefox URL-encodes it using GBK before transmitting.
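The difference between a GBK-based and a UTF-8-based URL encoding of the same character can be seen directly. A small sketch with an invented class name; it assumes the running JDK ships the GBK charset, which standard JDK distributions do but minimal runtimes may not.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class GbkVersusUtf8 {
    static String encode(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            // Thrown if the charset (for example GBK) is missing from this runtime.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ch = "\u4E2D";                     // a single Chinese character
        System.out.println(encode(ch, "GBK"));    // %D6%D0     (two bytes)
        System.out.println(encode(ch, "UTF-8"));  // %E4%B8%AD  (three bytes)
    }
}
```

A server that decodes the two-byte form as UTF-8, or the three-byte form as GBK, gets garbage, which is why the container's query string charset has to match what the browser actually sent.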

Finally, containers usually also have a URIEncoding setting that controls the encoding of the whole URI.

5. Jetty's handling

1. URL handling

For the URL and query string, Jetty runs the following code when each HttpConnection is initialized:

 _uri = StringUtil.__UTF8.equals(URIUtil.__CHARSET)?new HttpURI():new EncodedHttpURI(URIUtil.__CHARSET);

In URIUtil:

final String __CHARSET=System.getProperty("org.eclipse.jetty.util.URI.charset",StringUtil.__UTF8);

So for the URL, Jetty uses UTF-8 by default; if org.eclipse.jetty.util.URI.charset is set, it uses that charset instead.

2. Query string handling

First, the code:

        if (_uri!=null && _uri.hasQuery())
        {
            if (_queryEncoding==null)
                _uri.decodeQueryTo(_baseParameters);
            else
            {
                _uri.decodeQueryTo(_baseParameters,_queryEncoding);
            }
        }

As shown, if queryEncoding is set, the query is parsed with that encoding. Request has the method
public void setQueryEncoding(String queryEncoding);
It can also be set through request.setAttribute:

    public void setAttribute(String name, Object value) {
        if ("org.eclipse.jetty.server.Request.queryEncoding".equals(name))
            setQueryEncoding(value==null?null:value.toString());
    }

What happens if queryEncoding is not set?
EncodedHttpURI contains the following code:

    public void decodeQueryTo(MultiMap parameters)
    {
        if (_query==_fragment)
            return;
        UrlEncoded.decodeTo(StringUtil.toString(_raw,_query+1,_fragment-_query-1,_encoding),parameters,_encoding);
    }

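As shown, it uses the _encoding field, which is the argument passed in when the EncodedHttpURI was created, i.e. the value of org.eclipse.jetty.util.URI.charset.

So for the query string: if _queryEncoding is set on the request, that encoding is used; otherwise the charset from the system property org.eclipse.jetty.util.URI.charset; otherwise the default, UTF-8.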
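The fallback order can be written as one small function. This is only a sketch of the rule as summarized above, not Jetty's code: JettyQueryCharset and resolve are invented names, and unlike Jetty, which reads the property once into a constant, this version reads it on every call.

```java
class JettyQueryCharset {
    static final String PROP = "org.eclipse.jetty.util.URI.charset";

    // Per-request queryEncoding wins; then the system property; then UTF-8.
    static String resolve(String queryEncoding) {
        if (queryEncoding != null) {
            return queryEncoding;
        }
        return System.getProperty(PROP, "UTF-8");
    }

    public static void main(String[] args) {
        System.clearProperty(PROP);
        System.out.println(resolve("GBK"));  // GBK: set on the request
        System.out.println(resolve(null));   // UTF-8: nothing configured
        System.setProperty(PROP, "ISO-8859-1");
        System.out.println(resolve(null));   // ISO-8859-1: taken from the system property
        System.clearProperty(PROP);
    }
}
```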

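3. The body

UTF-8 is used by default; the encoding can of course be set beforehand with request.setCharacterEncoding.

Note: some material online says POST parameters are decoded by default with the charset from Content-type. Judging from the source, Tomcat does this: getCharacterEncoding falls back to the Content-Type header when the value is null. Jetty does not appear to.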
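Where a container does not take the charset from the header, an application can extract it itself before reading any parameters. The following is a minimal sketch, not code from Tomcat or Jetty; ContentTypeCharset and charsetOf are invented names, and the parsing is deliberately simple.

```java
class ContentTypeCharset {
    // Returns the charset parameter of a Content-Type value, or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the "charset=" prefix (8 characters).
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1); // strip optional quotes
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8")); // UTF-8
        System.out.println(charsetOf("application/x-www-form-urlencoded"));                // null
    }
}
```

In a servlet filter, a non-null result could be passed to request.setCharacterEncoding before the first getParameter call; a null result is the IE case described earlier, where the application has to choose a charset by other means.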
