Reposted from: http://blog.csdn.net/woaigaolaoshi/article/details/51160999
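While crawling, you may run into URLs that contain a plain, literal % character. Calling URLDecoder.decode() on such a URL directly fails with the error in the title (an IllegalArgumentException). The fix is to encode the '%' as '%25' first and then decode the URL.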
public static void main(String[] args) throws Exception {
    String test = "http://www.baidu.com?123%";
    // Escape every '%' as "%25" so the decoder turns it back into a literal '%'.
    System.out.println(URLDecoder.decode(test.replaceAll("%", "%25"), "utf8"));
}
Output:
http://www.baidu.com?123%
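For comparison, the failure that the replacement avoids can be reproduced with a small sketch. The class name DecodeFailureDemo and the helper tryDecode are hypothetical names chosen for illustration. The helper only reports whether URLDecoder.decode() rejects the input; the exact exception message may differ between JDK versions, so it is printed rather than assumed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeFailureDemo {
    // Hypothetical helper: returns the decoded text, or a description of the
    // exception when the input contains a malformed '%' escape.
    public static String tryDecode(String url) {
        try {
            return URLDecoder.decode(url, "UTF-8");
        } catch (IllegalArgumentException e) {
            // Thrown by URLDecoder for an incomplete or non-hex escape.
            return "IllegalArgumentException: " + e.getMessage();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The trailing '%' is not followed by two hex digits, so decoding fails.
        System.out.println(tryDecode("http://www.baidu.com?123%"));
        // With the '%' escaped as "%25", decoding succeeds.
        System.out.println(tryDecode("http://www.baidu.com?123%25"));
    }
}
```

The first call reports an IllegalArgumentException, while the second prints the URL with its literal % restored.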
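That is the simplest case. In most real URLs, however, literal % characters are mixed with % characters that begin genuine percent-encoded sequences. Replacing every % with %25 then also escapes the genuine sequences, so decoding no longer yields the correct URL, as shown here: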
public static void main(String[] args) throws Exception {
    // %e4%b8%ad%e5%9b%bd is the UTF-8 percent-encoding of 中国
    String test = "http://www.baidu.com?%e4%b8%ad%e5%9b%bd123%";
    System.out.println(URLDecoder.decode(test.replaceAll("%", "%25"), "utf8"));
}
Output:
http://www.baidu.com?%e4%b8%ad%e5%9b%bd123%
Solution: escape only those % characters that are not followed by two hexadecimal digits.
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class ConverPercent {
    // Returns true if c is a hexadecimal digit (0-9, a-f, A-F).
    public static boolean isHex(char c) {
        return (c >= '0' && c <= '9')
            || (c >= 'a' && c <= 'f')
            || (c >= 'A' && c <= 'F');
    }

    // Rewrites every '%' that does not start a valid %XX escape as "%25".
    public static String convertPercent(String str) {
        StringBuilder sb = new StringBuilder(str);
        for (int i = 0; i < sb.length(); i++) {
            if (sb.charAt(i) != '%') {
                continue;
            }
            // A valid escape needs two more characters, both hex digits.
            // The bound is i + 2 < length, so an escape that ends exactly
            // at the end of the string (e.g. "...%41") is still recognized.
            boolean validEscape = (i + 2 < sb.length())
                && isHex(sb.charAt(i + 1))
                && isHex(sb.charAt(i + 2));
            if (!validEscape) {
                sb.insert(i + 1, "25");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String test = "http://www.baidu.com?%e4%b8%ad%e5%9b%bd123%";
        String url = convertPercent(test);
        System.out.println(url);
        System.out.println(URLDecoder.decode(url, "utf8"));
    }
}
Output:
http://www.baidu.com?%e4%b8%ad%e5%9b%bd123%25
http://www.baidu.com?中国123%
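The same rule (escape a % only when it is not followed by two hex digits) can also be written as a regular expression with a negative lookahead. The following is a minimal sketch of that alternative, not the original author's method; the class name PercentEscaper and the methods escapeStrayPercent and safeDecode are hypothetical names introduced for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Pattern;

public class PercentEscaper {
    // Matches a '%' that is NOT immediately followed by two hex digits.
    private static final Pattern STRAY_PERCENT =
            Pattern.compile("%(?![0-9a-fA-F]{2})");

    // Hypothetical helper: escapes only the stray '%' characters as "%25".
    public static String escapeStrayPercent(String s) {
        return STRAY_PERCENT.matcher(s).replaceAll("%25");
    }

    // Hypothetical helper: escapes stray '%' characters, then decodes as UTF-8.
    public static String safeDecode(String s) throws UnsupportedEncodingException {
        return URLDecoder.decode(escapeStrayPercent(s), "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String test = "http://www.baidu.com?%e4%b8%ad%e5%9b%bd123%";
        System.out.println(escapeStrayPercent(test));
        System.out.println(safeDecode(test));
    }
}
```

Both this sketch and the loop-based version are heuristics: a literal % that happens to be followed by two hex digits (for example "100%ab") is indistinguishable from a real escape and will be decoded as one. Note also that URLDecoder.decode() converts '+' to a space, which may matter for URLs that contain a literal plus sign.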