The CharacterSet property of HttpWebResponse always returns an empty string

quou2002

My original goal was to fetch the source of a web page using HttpWebRequest and HttpWebResponse. The program is as follows:



     
     
      
      
      
    // Fragment: needs "using System; using System.IO; using System.Net;".
    // site and sourceCode are declared elsewhere in the program.
    HttpWebRequest httpReq;
    HttpWebResponse httpResp;
    Uri httpUrl = new Uri("http://www.microsoft.com");
    httpReq = (HttpWebRequest)WebRequest.Create(httpUrl);

    httpResp = (HttpWebResponse)httpReq.GetResponse();
    site.ResponseUrl = httpResp.ResponseUri.ToString(); // assign ResponseUrl here
    // httpReq.KeepAlive = false; // gets or sets a value indicating whether to make a persistent connection to the Internet resource

    StreamReader reader = new StreamReader(httpResp.GetResponseStream(), System.Text.Encoding.Default);
    sourceCode = reader.ReadToEnd(); // page source

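However, for sites whose character set is UTF-8, the source I get back is garbled. If I change the encoding argument passed to StreamReader (System.Text.Encoding.Default in the code above) to System.Text.Encoding.UTF8, those sites decode correctly, but sites encoded in GB2312 then break. According to MSDN, the CharacterSet property of HttpWebResponse should return the site's encoding, yet whatever kind of site I try, its value is always an empty string. (My intent was to get the encoding name from this property and then call System.Text.Encoding.GetEncoding().)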
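The intent described above, turning a charset name into an Encoding and reading the stream with it, can be sketched roughly as follows. This is only an illustration under assumptions: CharsetDecoder and Decode are hypothetical names that do not appear in the original program, and the fallback encoding is whatever the caller chooses.

```csharp
using System;
using System.IO;
using System.Text;

static class CharsetDecoder
{
    // Hypothetical helper (not from the original program): decodes body using
    // the named charset, or the fallback when the name is empty or unknown.
    public static string Decode(Stream body, string charsetName, Encoding fallback)
    {
        Encoding encoding = fallback;
        if (!string.IsNullOrEmpty(charsetName))
        {
            try
            {
                encoding = Encoding.GetEncoding(charsetName.Trim());
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name: keep the fallback.
            }
        }

        using (StreamReader reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With a real response this would be called as CharsetDecoder.Decode(httpResp.GetResponseStream(), charsetName, Encoding.Default), where charsetName is whatever name could be obtained for the site. On .NET Core and later, names such as gb2312 are only recognized after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called; on .NET Framework they work out of the box.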

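I searched Google and found no similar problem. At http://channel9.msdn.com/ShowPost.aspx?PostID=166867#166867 someone reported a bug in the get_CharacterSet() function of HttpWebResponse, but it is unrelated to what I am trying to solve; it is a bug, but it has no effect here. I also found HttpWebResponse.cs (the source code of the HttpWebResponse class) at http://dotnet.di.unipi.it/content/sscli/docs/doxygen/fx/bcl/httpwebresponse_8cs-source.html, which did not solve the problem either.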

----I'll post this on the blog for now and also ask on CSDN; that is all I can do for the moment.

=========update next day======

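Thanks to 'net_lover' for the reply (http://community.csdn.net/Expert/topic/4633/4633372.xml?temp=.4076044). The problem is solved. I borrowed net_lover's idea of using Content-Type to determine the charset. For the details I again consulted HttpWebResponse.cs (the source code of the HttpWebResponse class). The code is as follows: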



     
     
      
      
      
    // Get the CharacterSet
    private string getEncoding(HttpWebResponse httpResp)
    {
        // Looks like "Content-Type: text/html; charset=utf-8;", "Content-Type: text/html; charset=utf-8" or "Content-Type: text/html"
        // Note: there may be no semicolon after utf-8
        string contentType = httpResp.ContentType;
        int i = contentType.IndexOf("charset=");
        if (i >= 0)
        {
            i += 8; // skip past "charset="
            int j = contentType.IndexOf(';', i);
            if (j >= i)
            {
                return contentType.Substring(i, j - i).Trim();
            }
            return contentType.Substring(i);
        }
        return string.Empty;
    }

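What I still do not understand is why the CharacterSet property of HttpWebResponse was always an empty string across the many sites in different languages that I tested. Microsoft evidently wrote it so that the charset could be obtained directly (my personal view). The CharacterSet implementation in HttpWebResponse.cs (the source code of the HttpWebResponse class, URL given above) also involves the MediaType property (which was likewise always an empty string in my tests). I will look for the exact cause when I have time.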
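I also found that the charset in some sites' ContentType does not conform to the RFC; for example, groups.msn.com has an extra semicolon after utf-8. That is why the code above handles both forms: when a semicolon follows the charset value it takes only the text before it, and otherwise it falls through to return contentType.Substring(i).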
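Since real-world Content-Type values vary this much, a slightly more tolerant parser may be worth considering. The following is a minimal sketch, not the code from this post: ContentTypeCharset and Extract are hypothetical names, and the handling of quotes and letter case is my own assumption about what such sites might send.

```csharp
using System;

static class ContentTypeCharset
{
    // Hypothetical helper (not from the original post): returns the charset
    // parameter of a Content-Type value, or "" when there is none.
    public static string Extract(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return string.Empty;
        }

        // Parameters are separated by ';'; the first segment is the media type.
        string[] parts = contentType.Split(';');
        for (int k = 1; k < parts.Length; k++)
        {
            string part = parts[k].Trim();
            if (part.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                // Drop the "charset=" prefix, then surrounding spaces and quotes.
                return part.Substring(8).Trim().Trim('"', '\'');
            }
        }
        return string.Empty;
    }
}
```

For example, Extract("text/html; charset=utf-8;") and Extract("text/html; charset=utf-8") both give "utf-8", while Extract("text/html") gives an empty string. Splitting on the semicolon means a trailing semicolon simply produces an empty last segment that is ignored.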