A flaw in the WebClient class's DownloadString method

weixin_30408675

Tags: c#

Original link: http://www.cnblogs.com/ukessi/archive/2009/02/09/1386567.html

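The problem:

When the source of a web page is fetched with the code below, most of the Chinese text displays correctly, but some of it turns into ??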

Code
using System;
using System.Net;
using System.Text;
namespace Test_GetUTF8Website
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient client = new WebClient();
            // DownloadString() has already decoded the bytes with a guessed encoding.
            string s = client.DownloadString("http://www.cnblogs.com");
            // Attempted repair: encode back with Encoding.Default, then decode as UTF-8.
            // This is the step that yields the ?? characters.
            string res = Encoding.UTF8.GetString(Encoding.Default.GetBytes(s));
            Console.WriteLine(res);
            Console.ReadLine();
        }
    }
}

Analysis:

First, look at a fragment of the DownloadString() source:

WebRequest request;
byte[] bytes = this.DownloadDataInternal(address, out request);
string retValue = this.GuessDownloadEncoding(request).GetString(bytes);

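DownloadString() first downloads the data as a byte[] and then guesses the encoding of that data. The problem is precisely that it guesses.

Next, look at how the guess is made: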

Code
private Encoding GuessDownloadEncoding(WebRequest request)
{
    try
    {
        string contentType = request.ContentType;
        if (contentType == null)
        {
            return this.Encoding;
        }
        string[] strArray = contentType.ToLower(CultureInfo.InvariantCulture).Split(new char[] { ';', '=', ' ' });
        bool flag = false;
        foreach (string str2 in strArray)
        {
            if (str2 == "charset")
            {
                flag = true;
            }
            else if (flag)
            {
                return Encoding.GetEncoding(str2);
            }
        }
    }
    catch (Exception exception)
    {
        if (((exception is ThreadAbortException) || (exception is StackOverflowException)) || (exception is OutOfMemoryException))
        {
            throw;
        }
    }
    catch
    {
    }
    return this.Encoding;
}

As you can see, it reads the request's ContentType to guess the charset. Everything seems to be OK so far.
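To make the parsing step easier to follow, here is a minimal standalone sketch of the same idea: pull the charset parameter out of a Content-Type value and fall back to a default when it is missing. It is not the framework's code; the class CharsetSketch and the method FromContentType are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

static class CharsetSketch
{
    // Hypothetical helper: returns the encoding named by the charset parameter
    // of a Content-Type value, or the fallback when none is usable.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return fallback; // same outcome as "return this.Encoding" above
        }
        foreach (string part in contentType.Split(';'))
        {
            // Split "charset=utf-8" into at most two pieces: name and value.
            string[] pair = part.Split(new[] { '=' }, 2);
            if (pair.Length == 2 &&
                pair[0].Trim().Equals("charset", StringComparison.OrdinalIgnoreCase))
            {
                try
                {
                    return Encoding.GetEncoding(pair[1].Trim().Trim('"'));
                }
                catch (ArgumentException)
                {
                    return fallback; // unknown charset name
                }
            }
        }
        return fallback;
    }

    static void Main()
    {
        Console.WriteLine(FromContentType("text/html; charset=utf-8", Encoding.ASCII).WebName);
        Console.WriteLine(FromContentType(null, Encoding.ASCII).WebName);
    }
}
```

With a null or charset-free value the fallback wins, which is the branch the rest of this article is concerned with.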

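The question is: who assigns the request's ContentType?
Inside DownloadString(), between the declaration of the WebRequest and the guess, the only call is DownloadDataInternal(), so it is DownloadDataInternal() that assigns request. Inside DownloadDataInternal(), in turn, it is GetWebRequest() that assigns request.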

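CopyHeadersTo(), called inside GetWebRequest(), does assign ContentType, but for both http://www.cnblogs.com and http://g.cn no ContentType can be obtained. Note that request.ContentType is the Content-Type header of the outgoing request, not of the server's response, so for a plain GET it is normally unset.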

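Since no ContentType is found, WebClient decodes the byte[] into a string with Encoding.Default (on Simplified Chinese Windows this is GB2312, code page 936). The conventional workaround is to determine the page's encoding yourself, encode the string back into a byte[] with Encoding.Default, and then decode that byte[] with the page's real encoding.
The strange part is that if we decode a byte[] into a string with Encoding.Default and then encode it back into a byte[] with the same Encoding.Default, the two byte[] have different lengths. So the byte[] recovered from the string returned by DownloadString() has already been altered, and decoding it with the correct encoding naturally produces a damaged string.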

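P.S. I noticed that the places shown as ?? are spots where, in the original HTML, a Chinese character is immediately followed by a symbol (an angle bracket, a comma, and so on). My guess is that the byte alignment of the symbol causes the encode/decode error; this needs further verification.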
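The underlying effect is that decoding with the wrong encoding is lossy: any byte sequence that is not valid in that encoding is swapped for a replacement character, so encoding the string back cannot restore the original bytes. One plausible reading of the observation above is that a UTF-8 Chinese character takes three bytes while a double-byte decoder consumes them in pairs, so a leftover byte gets paired with the following symbol; symbols such as < (0x3C) and , (0x2C) are not valid second bytes in code page 936, and the pair becomes ?. The sketch below shows only the general lossy round trip, using Encoding.ASCII as a stand-in because the availability of code page 936 depends on the runtime. RoundTripSketch and SurvivesRoundTrip are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text;

static class RoundTripSketch
{
    // Decodes raw bytes with the given (possibly wrong) encoding, encodes the
    // resulting string back, and reports whether the bytes came back unchanged.
    public static bool SurvivesRoundTrip(byte[] raw, Encoding wrong)
    {
        string decoded = wrong.GetString(raw);
        byte[] back = wrong.GetBytes(decoded);
        return raw.SequenceEqual(back);
    }

    static void Main()
    {
        // Two Chinese characters as UTF-8: six bytes, all above 0x7F.
        byte[] chinese = Encoding.UTF8.GetBytes("\u4E2D\u6587");
        // ASCII cannot represent these bytes, so each one decodes to '?'.
        Console.WriteLine(SurvivesRoundTrip(chinese, Encoding.ASCII));

        // Pure ASCII text is valid in both encodings and survives intact.
        byte[] markup = Encoding.UTF8.GetBytes("<html>");
        Console.WriteLine(SurvivesRoundTrip(markup, Encoding.ASCII));
    }
}
```

The first call reports that the bytes did not survive, the second that they did, which matches the pattern of mostly readable output with scattered ??.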

Solution: use the WebClient.DownloadData method to get the untouched byte[], then decode it manually with the appropriate Encoding.
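A minimal sketch of that fix follows. It assumes the page is UTF-8, which you would need to confirm for your own target; DownloadSketch and Decode are hypothetical names, and the URL is the one used earlier in this article.

```csharp
using System;
using System.Net;
using System.Text;

static class DownloadSketch
{
    // Decodes the raw bytes exactly once, with an explicitly chosen encoding.
    public static string Decode(byte[] raw, Encoding pageEncoding)
    {
        return pageEncoding.GetString(raw);
    }

    static void Main()
    {
        using (WebClient client = new WebClient())
        {
            // DownloadData returns the response body without any decoding.
            byte[] raw = client.DownloadData("http://www.cnblogs.com");
            // Assumption: the page is served as UTF-8.
            Console.WriteLine(Decode(raw, Encoding.UTF8));
        }
    }
}
```

Because GuessDownloadEncoding() falls back to this.Encoding, setting client.Encoding = Encoding.UTF8 before calling DownloadString() should have a similar effect; either way the bytes are decoded once, with the right encoding, instead of being decoded wrongly and then repaired.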

Reposted from: https://www.cnblogs.com/ukessi/archive/2009/02/09/1386567.html
