URLEncoder和URLDecoder实现转码和解码

最新推荐文章于 2023-03-20 18:16:01 发布

I_Am_Zou

最新推荐文章于 2023-03-20 18:16:01 发布

阅读量3.4k

点赞数 2

分类专栏： java 文章标签： URLEncoder URLDecoder

本文链接：https://blog.csdn.net/I_Am_Zou/article/details/52074721

版权

java 专栏收录该内容

27 篇文章 0 订阅

订阅专栏

在Java开发中，URL跳转经常遇到中文乱码问题。实际上，如果细心的话，我们会发现在访问网页时经常会在URL中看到一些16进制格式的字符串,如:http://xxx.com/s?w=%e7%bc

这其实就是用到Java.net包下的URLEncoder和URLDecoder这两个类来对URL参数实现转码和解码。

1、URLDecoder（解码）

源码上对此解释是：

Utility class for HTML form decoding. This class contains static methods for decoding a String from the <CODE>application/x-www-form-urlencoded</CODE>MIME format.

即这是一个HTML格式的解码工具类。该类包含了对一个字符串解码的静态方法！

从源码可看出给出了两种解码方法：

（1）默认格式解码

 public static String decode(String s) {

        String str = null;

        try {
            str = decode(s, dfltEncName);
        } catch (UnsupportedEncodingException e) {
            // The system should always have the platform default
        }

        return str;
    }

（2）指定格式解码

 public static String decode(String s, String enc)
        throws UnsupportedEncodingException{

        boolean needToChange = false;
        int numChars = s.length();
        StringBuffer sb = new StringBuffer(numChars > 500 ? numChars / 2 : numChars);
        int i = 0;

        if (enc.length() == 0) {
            throw new UnsupportedEncodingException ("URLDecoder: empty string enc parameter");
        }

        char c;
        byte[] bytes = null;
        while (i < numChars) {
            c = s.charAt(i);
            switch (c) {
            case '+':
                sb.append(' ');
                i++;
                needToChange = true;
                break;
            case '%':
                /*
                 * Starting with this instance of %, process all
                 * consecutive substrings of the form %xy. Each
                 * substring %xy will yield a byte. Convert all
                 * consecutive  bytes obtained this way to whatever
                 * character(s) they represent in the provided
                 * encoding.
                 */

                try {

                    // (numChars-i)/3 is an upper bound for the number
                    // of remaining bytes
                    if (bytes == null)
                        bytes = new byte[(numChars-i)/3];
                    int pos = 0;

                    while ( ((i+2) < numChars) &&
                            (c=='%')) {
                        int v = Integer.parseInt(s.substring(i+1,i+3),16);
                        if (v < 0)
                            throw new IllegalArgumentException("URLDecoder: Illegal hex characters in escape (%) pattern - negative value");
                        bytes[pos++] = (byte) v;
                        i+= 3;
                        if (i < numChars)
                            c = s.charAt(i);
                    }

                    // A trailing, incomplete byte encoding such as
                    // "%x" will cause an exception to be thrown

                    if ((i < numChars) && (c=='%'))
                        throw new IllegalArgumentException(
                         "URLDecoder: Incomplete trailing escape (%) pattern");

                    sb.append(new String(bytes, 0, pos, enc));
                } catch (NumberFormatException e) {
                    throw new IllegalArgumentException(
                    "URLDecoder: Illegal hex characters in escape (%) pattern - "
                    + e.getMessage());
                }
                needToChange = true;
                break;
            default:
                sb.append(c);
                i++;
                break;
            }
        }

        return (needToChange? sb.toString() : s);
    }

附：解码规则

 <ul>
 * <li>The alphanumeric characters "{@code a}" through
 *     "{@code z}", "{@code A}" through
 *     "{@code Z}" and "{@code 0}"
 *     through "{@code 9}" remain the same.
 * <li>The special characters "{@code .}",
 *     "{@code -}", "{@code *}", and
 *     "{@code _}" remain the same.
 * <li>The space character "   " is
 *     converted into a plus sign "{@code +}".
 * <li>All other characters are unsafe and are first converted into
 *     one or more bytes using some encoding scheme. Then each byte is
 *     represented by the 3-character string
 *     "<i>{@code %xy}</i>", where <i>xy</i> is the
 *     two-digit hexadecimal representation of the byte.
 *     The recommended encoding scheme to use is UTF-8. However,
 *     for compatibility reasons, if an encoding is not specified,
 *     then the default encoding of the platform is used.
 * </ul>

翻译过来就是：

字母数字字符 "a" 到 "z"、"A" 到 "Z" 和 "0" 到 "9" 保持不变。
特殊字符 "."、"-"、"*" 和 "_" 保持不变。
加号 "+" 转换为空格字符 " "。
将把 "%xy" 格式序列视为一个字节，其中 xy 为 8 位的两位十六进制表示形式。然后，所有连续包含一个或多个这些字节序列的子字符串，将被其编码可生成这些连续字节的字符所代替。可以指定对这些字符进行解码的编码机制，或者如果未指定的话，则使用平台的默认编码机制。

示例如下：

  public static void main(String[] args) throws Exception {
    	String encodedString = “%e7%bc%96%e7%a0%81%e6%a0%bc%e5%bc%8f”;
        URLDecoder.decode(encodedString, "UTF-8");
    }

2、URLEncoder（转码）

源码上对此解释是：

Utility class for HTML form encoding. This class contains static methods for converting a String to the <CODE>application/x-www-form-urlencoded</CODE> MIME format.

即这是一个HTML格式的转码工具类。该类包含了对一个字符串转码的静态方法！

从源码可看出给出了两种转码方法：

（1）默认格式转码

 public static String encode(String s) {

        String str = null;

        try {
            str = encode(s, dfltEncName);
        } catch (UnsupportedEncodingException e) {
            // The system should always have the platform default
        }

        return str;
    }

（2）指定格式转码

  public static String encode(String s, String enc)
        throws UnsupportedEncodingException {

        boolean needToChange = false;
        StringBuffer out = new StringBuffer(s.length());
        Charset charset;
        CharArrayWriter charArrayWriter = new CharArrayWriter();

        if (enc == null)
            throw new NullPointerException("charsetName");

        try {
            charset = Charset.forName(enc);
        } catch (IllegalCharsetNameException e) {
            throw new UnsupportedEncodingException(enc);
        } catch (UnsupportedCharsetException e) {
            throw new UnsupportedEncodingException(enc);
        }

        for (int i = 0; i < s.length();) {
            int c = (int) s.charAt(i);
            //System.out.println("Examining character: " + c);
            if (dontNeedEncoding.get(c)) {
                if (c == ' ') {
                    c = '+';
                    needToChange = true;
                }
                //System.out.println("Storing: " + c);
                out.append((char)c);
                i++;
            } else {
                // convert to external encoding before hex conversion
                do {
                    charArrayWriter.write(c);
                    /*
                     * If this character represents the start of a Unicode
                     * surrogate pair, then pass in two characters. It's not
                     * clear what should be done if a bytes reserved in the
                     * surrogate pairs range occurs outside of a legal
                     * surrogate pair. For now, just treat it as if it were
                     * any other character.
                     */
                    if (c >= 0xD800 && c <= 0xDBFF) {
                        /*
                          System.out.println(Integer.toHexString(c)
                          + " is high surrogate");
                        */
                        if ( (i+1) < s.length()) {
                            int d = (int) s.charAt(i+1);
                            /*
                              System.out.println("\tExamining "
                              + Integer.toHexString(d));
                            */
                            if (d >= 0xDC00 && d <= 0xDFFF) {
                                /*
                                  System.out.println("\t"
                                  + Integer.toHexString(d)
                                  + " is low surrogate");
                                */
                                charArrayWriter.write(d);
                                i++;
                            }
                        }
                    }
                    i++;
                } while (i < s.length() && !dontNeedEncoding.get((c = (int) s.charAt(i))));

                charArrayWriter.flush();
                String str = new String(charArrayWriter.toCharArray());
                byte[] ba = str.getBytes(charset);
                for (int j = 0; j < ba.length; j++) {
                    out.append('%');
                    char ch = Character.forDigit((ba[j] >> 4) & 0xF, 16);
                    // converting to use uppercase letter as part of
                    // the hex value if ch is a letter.
                    if (Character.isLetter(ch)) {
                        ch -= caseDiff;
                    }
                    out.append(ch);
                    ch = Character.forDigit(ba[j] & 0xF, 16);
                    if (Character.isLetter(ch)) {
                        ch -= caseDiff;
                    }
                    out.append(ch);
                }
                charArrayWriter.reset();
                needToChange = true;
            }
        }

        return (needToChange? out.toString() : s);
    }

附：转码规则

<ul>
 * <li>The alphanumeric characters "{@code a}" through
 *     "{@code z}", "{@code A}" through
 *     "{@code Z}" and "{@code 0}"
 *     through "{@code 9}" remain the same.
 * <li>The special characters "{@code .}",
 *     "{@code -}", "{@code *}", and
 *     "{@code _}" remain the same.
 * <li>The space character "   " is
 *     converted into a plus sign "{@code +}".
 * <li>All other characters are unsafe and are first converted into
 *     one or more bytes using some encoding scheme. Then each byte is
 *     represented by the 3-character string
 *     "<i>{@code %xy}</i>", where <i>xy</i> is the
 *     two-digit hexadecimal representation of the byte.
 *     The recommended encoding scheme to use is UTF-8. However,
 *     for compatibility reasons, if an encoding is not specified,
 *     then the default encoding of the platform is used.
 * </ul>

翻译过来就是：

字母数字字符 "a" 到 "z"、"A" 到 "Z" 和 "0" 到 "9" 保持不变。
特殊字符 "."、"-"、"*" 和 "_" 保持不变。
空格字符 " " 转换为一个加号 "+"。
所有其他字符都是不安全的，因此首先使用一些编码机制将它们转换为一个或多个字节。然后每个字节用一个包含 3 个字符的字符串 "%xy" 表示，其中 xy 为该字节的两位十六进制表示形式。推荐的编码机制是 UTF-8。但是，出于兼容性考虑，如果未指定一种编码，则使用相应平台的默认编码。

示例如下：

  public static void main(String[] args) throws Exception {
    	String encodedString = “编码格式”;
        URLEncoder.encode(encodedString, "UTF-8");
    }