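C++ has no ready-made urlencode and urldecode functions, so I wrote two of my own.
Much of the code found online only handles Chinese characters in GB2312. Mine comes in two versions, one for UTF-8 and one for GB.
First, the difference between the two:
Take the Chinese word 中国 as an example. Encoded from UTF-8 it becomes "%e4%b8%ad%e5%9b%bd",
while encoded from GB2312 it becomes "%d6%d0%b9%fa".
On Chinese Windows the default (ANSI) encoding for Chinese text is GB, so to encode as UTF-8 the string first has to be converted to wide characters with MultiByteToWideChar and then to UTF-8 with WideCharToMultiByte.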
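To make the difference concrete, here is a minimal sketch that percent-encodes raw bytes. The helper name `percentEncodeBytes` and the two constants are hypothetical, and the byte sequences are written as hex escapes so the result does not depend on how the source file itself is encoded.

```cpp
#include <string>

// Hypothetical helper: percent-encode every byte of `bytes` as %xx
// (lowercase hex), whether or not the byte is printable.
std::string percentEncodeBytes(const std::string &bytes)
{
    static const char hexDigits[] = "0123456789abcdef";
    std::string out;
    for (std::string::size_type i = 0; i < bytes.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(bytes[i]);
        out += '%';
        out += hexDigits[c >> 4];   // high four bits
        out += hexDigits[c & 0x0F]; // low four bits
    }
    return out;
}

// The same two characters (中国) as raw bytes in each encoding.
const std::string kChinaUtf8 = "\xe4\xb8\xad\xe5\x9b\xbd"; // 3 bytes per character
const std::string kChinaGb2312 = "\xd6\xd0\xb9\xfa";       // 2 bytes per character
```

Passing `kChinaUtf8` yields "%e4%b8%ad%e5%9b%bd" and passing `kChinaGb2312` yields "%d6%d0%b9%fa", so the receiver has to know which encoding the sender used.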
My build environment:
C++ under VS.NET 2005.
Reference:
http://www.codeguru.com/Cpp/Cpp/cpp_mfc/article.php/c4029
URL Encoding
Chandrasekhar Vuppalapati, May 22, 2003
Environment: VC++, MFC
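Introduction
The purpose of this article is to design a C++ class that implements URL encoding. In my previous project, a VC++ 6.0 program I wrote had to submit data, and the data had to be encoded before submission. I searched MSDN for a suitable class or API function and found none, so I decided to write my own.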
The URLEncoder.exe is a MFC dialog-based application that uses the URLEncode class.
Process
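URL encoding ensures that the characters you pass to a web page are safe. Some characters have a special meaning while data travels over the network. For example, a program sending data may treat ASCII character 13 (carriage return) as the end of a line.
Typically, web applications transfer data between client and server over HTTP or HTTPS. The server can receive data from the client in two ways:
- through the HTTP headers (via cookies or a submitted form), or
- as part of the request URL
When data is part of the URL, it must be encoded to conform to URL syntax. On the server side the data is decoded automatically. Consider the following URL, in which the data is passed as a request parameter string.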
Example: http://WebSite/ResourceName?Data=Data
Web Site: the domain name.
Resource Name: the name of the ASP page or servlet.
Data: the data passed to the server. If the MIME type is Content-Type: application/x-www-form-urlencoded, you need to encode the data.
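As an illustration of data travelling in the query part of a URL, the sketch below encodes one name/value pair and appends it to a base address. The function names `formEncode` and `buildUrl` are hypothetical, and the rule used (letters, digits and `*-._` pass through, a space becomes `+`, everything else becomes `%XX`) is the usual form-encoding convention, assumed here rather than taken from the article.

```cpp
#include <string>

// Hypothetical helper: encode one form field.
// Letters, digits and "*-._" are copied, a space becomes '+',
// and every other byte becomes %XX (uppercase hex).
std::string formEncode(const std::string &value)
{
    static const char hexDigits[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < value.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(value[i]);
        bool alnum = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
                     (c >= 'a' && c <= 'z');
        if (alnum || c == '*' || c == '-' || c == '.' || c == '_')
            out += static_cast<char>(c);
        else if (c == ' ')
            out += '+';
        else
        {
            out += '%';
            out += hexDigits[c >> 4];
            out += hexDigits[c & 0x0F];
        }
    }
    return out;
}

// Hypothetical helper: append name=value to a base URL as its query part.
std::string buildUrl(const std::string &base,
                     const std::string &name,
                     const std::string &value)
{
    return base + "?" + formEncode(name) + "=" + formEncode(value);
}
```

For example, `buildUrl("http://WebSite/ResourceName", "Data", "a b&c=d")` gives `http://WebSite/ResourceName?Data=a+b%26c%3Dd`. Without encoding, the `&` and `=` inside the value would be read as parameter separators.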
RFC 1738
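RFC 1738 specifies that Uniform Resource Locators (URLs) may contain only a subset of the US-ASCII character set. HTML, on the other hand, allows the entire ISO-8859-1 (ISO-Latin) character set in documents. As a result, any string submitted from an HTML page has to be encoded before it can be placed in a URL.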
ISO-8859-1 (ISO-Latin) Character Set
The table below covers the whole ISO-8859-1 character set, 256 characters in all. For each range it gives the decimal values, a description, the characters themselves, and whether they are safe to use unencoded. The characters fall broadly into two groups, safe and unsafe.
Character range(decimal) | Type | Values | Safe/Unsafe |
0-31 | ASCII Control Characters | These characters are not printable | Unsafe |
32-47 | Reserved Characters | (space)!"#$%&'()*+,-./ | Unsafe |
48-57 | ASCII Digits | 0-9 | Safe |
58-64 | Reserved Characters | :;<=>?@ | Unsafe |
65-90 | ASCII Characters | A-Z | Safe |
91-96 | Reserved Characters | [\]^_` | Unsafe |
97-122 | ASCII Characters | a-z | Safe |
123-126 | Reserved Characters | {|}~ | Unsafe |
127 | Control Characters | DEL | Unsafe |
128-255 | Non-ASCII Characters | Not part of US-ASCII | Unsafe |
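All unsafe ASCII characters must be encoded, for example those in the ranges 32-47, 58-64, 91-96, and 123-126. This classification is deliberately coarse: the later RFC 3986 treats `-`, `.`, `_` and `~` as unreserved characters that do not need encoding, although encoding them is harmless.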
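The ranges in the table translate directly into a small predicate. This is a minimal sketch of that classification only. The name `isSafeByte` is hypothetical, and the rule is the table's coarse one, in which only digits and letters count as safe.

```cpp
// Hypothetical helper: returns true when the table above marks the byte
// as safe, i.e. 0-9 (48-57), A-Z (65-90) or a-z (97-122).
bool isSafeByte(unsigned char c)
{
    return (c >= 48 && c <= 57) ||  // 0-9
           (c >= 65 && c <= 90) ||  // A-Z
           (c >= 97 && c <= 122);   // a-z
}
```

Taking an `unsigned char` matters here: with a plain signed `char`, bytes in the 128-255 range would appear as negative numbers and the range comparisons would be easy to get wrong.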
The table below explains why these characters are unsafe.
Character | Unsafe Reason | Character Encode |
< | Delimiters around URLs in free text | %3C |
> | Delimiters around URLs in free text | %3E |
" | Delimits URLs in some systems | %22 |
# | It is used in the World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. | %23 |
{ | Gateways and other transport agents are known to sometimes modify such characters | %7B |
} | Gateways and other transport agents are known to sometimes modify such characters | %7D |
| | Gateways and other transport agents are known to sometimes modify such characters | %7C |
\ | Gateways and other transport agents are known to sometimes modify such characters | %5C |
^ | Gateways and other transport agents are known to sometimes modify such characters | %5E |
~ | Gateways and other transport agents are known to sometimes modify such characters | %7E |
[ | Gateways and other transport agents are known to sometimes modify such characters | %5B |
] | Gateways and other transport agents are known to sometimes modify such characters | %5D |
` | Gateways and other transport agents are known to sometimes modify such characters | %60 |
+ | Indicates a space in form-encoded data (spaces cannot be used in a URL), so a literal + must itself be encoded | %2B |
/ | Separates directories and subdirectories | %2F |
? | Separates the actual URL and the parameters | %3F |
& | Separator between parameters specified in the URL | %26 |
How It Is Done
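URL-encoding a character means replacing it with % followed by its ASCII code in hexadecimal. For example, the space character in the US-ASCII character set is 20 in hexadecimal, so it is encoded as %20.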
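The arithmetic behind this is a division by 16: a space is decimal 32, 32 / 16 = 2 and 32 % 16 = 0, which gives the digits 2 and 0. A minimal sketch of that step follows. The name `encodeByte` is hypothetical.

```cpp
#include <string>

// Hypothetical helper: percent-encode a single byte as '%' followed by
// its two hexadecimal digits.
std::string encodeByte(unsigned char c)
{
    static const char hexDigits[] = "0123456789ABCDEF";
    std::string out = "%";
    out += hexDigits[c / 16]; // high digit, e.g. 32 / 16 = 2
    out += hexDigits[c % 16]; // low digit,  e.g. 32 % 16 = 0
    return out;
}
```

Values below 16 still need two digits, so a carriage return (decimal 13) becomes `%0D` rather than `%D`.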
URLEncode: URLEncode is a C++ class (named CURLEncode in the code) used to encode strings. The CURLEncode class has the following members:
- isUnsafeString
- decToHex
- convert
- URLEncode
The URLEncode() method performs the encoding. It checks each character to see whether it is unsafe and, if it is, converts it to % followed by its hexadecimal ASCII code.
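The loop described above can be sketched as a small class. This is not the article's CURLEncode code: the class name `UrlEncoderSketch`, the `std::string` signatures, and the rule that only letters and digits are safe are all assumptions made for illustration. Only the member names `decToHex`, `convert` and `URLEncode` are borrowed from the list above.

```cpp
#include <string>

// Sketch only: signatures and the unsafe-character rule are assumptions,
// not the original article's implementation.
class UrlEncoderSketch
{
public:
    // Copy safe bytes unchanged and replace each unsafe byte with %XX.
    std::string URLEncode(const std::string &input) const
    {
        std::string out;
        for (std::string::size_type i = 0; i < input.size(); ++i)
        {
            unsigned char c = static_cast<unsigned char>(input[i]);
            if (isUnsafe(c))
                out += convert(c);
            else
                out += static_cast<char>(c);
        }
        return out;
    }

private:
    // Assumed rule: everything except ASCII letters and digits is unsafe.
    static bool isUnsafe(unsigned char c)
    {
        bool alnum = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
                     (c >= 'a' && c <= 'z');
        return !alnum;
    }

    // Convert a value in 0-15 to one uppercase hexadecimal digit.
    static char decToHex(unsigned char value)
    {
        return "0123456789ABCDEF"[value & 0x0F];
    }

    // Build the "%XX" form of one byte.
    static std::string convert(unsigned char c)
    {
        std::string out = "%";
        out += decToHex(c >> 4);
        out += decToHex(c & 0x0F);
        return out;
    }
};
```

With this sketch, `UrlEncoderSketch().URLEncode("a b<c>")` returns `a%20b%3Cc%3E`. Note that it writes a space as `%20`, whereas form encoding writes it as `+`.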
End of the reference article.
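Below are the two functions I wrote for Chinese text in UTF-8:
#include <windows.h>  // MultiByteToWideChar, WideCharToMultiByte
#include <cctype>     // isxdigit
#include <cstdlib>    // strtol
#include <string>
using std::string;

// Decodes a URL-encoded string whose escaped bytes are UTF-8 and returns the
// text in the system ANSI code page (GB on Chinese Windows).
// Returns an empty string if the decoded bytes cannot be converted.
string URLDecode(const string &strSrc)
{
    string buffer;
    const string::size_type len = strSrc.length();
    for (string::size_type i = 0; i < len; i++)
    {
        char ch = strSrc[i];
        // Treat '%' as an escape only when two hex digits follow it, so a
        // truncated escape such as "%e" is copied through instead of throwing.
        if (ch == '%' && i + 2 < len
            && isxdigit((unsigned char)strSrc[i + 1])
            && isxdigit((unsigned char)strSrc[i + 2]))
        {
            char hex[3] = { strSrc[i + 1], strSrc[i + 2], 0 };
            buffer += (char)strtol(hex, NULL, 16);
            i += 2;
        }
        else if (ch == '+')
        {
            buffer += ' ';
        }
        else
        {
            buffer += ch;
        }
    }
    if (buffer.empty())
        return buffer;
    const int bufLen = (int)buffer.length();
    // Step 1: UTF-8 bytes -> UTF-16.
    int wlen = MultiByteToWideChar(CP_UTF8, 0, buffer.c_str(), bufLen, NULL, 0);
    if (wlen <= 0)
        return string();
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, buffer.c_str(), bufLen, &wide[0], wlen);
    // Step 2: UTF-16 -> ANSI code page.
    int alen = WideCharToMultiByte(CP_ACP, 0, wide.c_str(), wlen, NULL, 0, NULL, NULL);
    if (alen <= 0)
        return string();
    string ansi(alen, '\0');
    WideCharToMultiByte(CP_ACP, 0, wide.c_str(), wlen, &ansi[0], alen, NULL, NULL);
    return ansi;
}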
// Converts a string in the system ANSI code page (GB on Chinese Windows)
// to UTF-8 and URL-encodes it.
// Returns an empty string if the conversion fails.
string URLEncode(const string &strSrc)
{
    if (strSrc.empty())
        return string();
    const int srcLen = (int)strSrc.length();
    // Step 1: ANSI code page -> UTF-16.
    int wlen = MultiByteToWideChar(CP_ACP, 0, strSrc.c_str(), srcLen, NULL, 0);
    if (wlen <= 0)
        return string();
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, strSrc.c_str(), srcLen, &wide[0], wlen);
    // Step 2: UTF-16 -> UTF-8.
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), wlen, NULL, 0, NULL, NULL);
    if (ulen <= 0)
        return string();
    string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), wlen, &utf8[0], ulen, NULL, NULL);
    // Step 3: escape special characters as %xx; a space becomes '+'.
    // Letters, digits and the characters in passThrough are copied unchanged.
    static const char hexDigits[] = "0123456789abcdef";
    const string passThrough = "'()*-._";
    string result;
    for (string::size_type i = 0; i < utf8.length(); i++)
    {
        unsigned char c = (unsigned char)utf8[i];
        bool alnum = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
                  || (c >= 'a' && c <= 'z');
        if (c == ' ')
        {
            result += '+';
        }
        else if (alnum || passThrough.find((char)c) != string::npos)
        {
            result += (char)c;
        }
        else
        {
            // Always two hex digits, so a tab becomes %09 rather than "% 9".
            result += '%';
            result += hexDigits[c >> 4];
            result += hexDigits[c & 0x0F];
        }
    }
    return result;
}
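Below are the two functions for GB:
#include <cctype>   // isxdigit
#include <cstdlib>  // strtol
#include <string>
using std::string;

// Decodes a URL-encoded string and returns the raw bytes unchanged, which is
// already GB text when the escapes were produced from GB.
string URLDecode(const string &strSrc)
{
    string buffer;
    const string::size_type len = strSrc.length();
    for (string::size_type i = 0; i < len; i++)
    {
        char ch = strSrc[i];
        // Treat '%' as an escape only when two hex digits follow it, so a
        // truncated escape such as "%d" is copied through instead of throwing.
        if (ch == '%' && i + 2 < len
            && isxdigit((unsigned char)strSrc[i + 1])
            && isxdigit((unsigned char)strSrc[i + 2]))
        {
            char hex[3] = { strSrc[i + 1], strSrc[i + 2], 0 };
            buffer += (char)strtol(hex, NULL, 16);
            i += 2;
        }
        else if (ch == '+')
        {
            buffer += ' ';
        }
        else
        {
            buffer += ch;
        }
    }
    return buffer;
}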
// URL-encodes the bytes of strSrc as they are (GB text stays GB).
// Escapes special characters as %xx; a space becomes '+'.
string URLEncode(const string &strSrc)
{
    // Letters, digits and the characters in passThrough are copied unchanged.
    static const char hexDigits[] = "0123456789abcdef";
    const string passThrough = "'()*-._";
    string result;
    for (string::size_type i = 0; i < strSrc.length(); i++)
    {
        unsigned char c = (unsigned char)strSrc[i];
        bool alnum = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
                  || (c >= 'a' && c <= 'z');
        if (c == ' ')
        {
            result += '+';
        }
        else if (alnum || passThrough.find((char)c) != string::npos)
        {
            result += (char)c;
        }
        else
        {
            // Always two hex digits, so a tab becomes %09 rather than "% 9".
            result += '%';
            result += hexDigits[c >> 4];
            result += hexDigits[c & 0x0F];
        }
    }
    return result;
}