Fixing garbled text when scraping Chinese pages with HtmlAgilityPack

ArvinStudy

Source: http://outofmemory.cn/code-snippet/2002/HtmlAgilityPack-zhuaqu-zhongwen-page-luanma-question-jiejuefangan

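HtmlAgilityPack is an open-source HTML parser written in C#. Some parts of its design are less than polished, though: fetching a Chinese web page the standard way often returns garbled text. Take the Xinhua home page (http://xinhua.org) as an example. Following the HtmlAgilityPack sample, the scraping code looks like this: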

  HtmlWeb hw = new HtmlWeb();
  string url = @"http://xinhua.org";
  HtmlDocument doc = hw.Load(url);
  doc.Save("output.html");

Opened in IE, the saved page is garbled.

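After working through the maze of HtmlAgilityPack source code, I found that the problem lies in the Get(Uri uri, string method, string path, HtmlDocument doc) method of the HtmlWeb class. That method contains the following code: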

  HttpWebResponse resp;

  try
  {
      resp = req.GetResponse() as HttpWebResponse;
  }
  ……
  if ((resp.ContentEncoding != null) && (resp.ContentEncoding.Length>0))
  {
      respenc = System.Text.Encoding.GetEncoding(resp.ContentEncoding);
  }
  else
  {
      respenc = null;
  }
  ……
  Stream s = resp.GetResponseStream();
  if (s != null)
  {
      if (UsingCache)
      {
          // NOTE: LastModified does not contain milliseconds, so we remove them to the file
          SaveStream(s, cachePath, RemoveMilliseconds(resp.LastModified), _streamBufferSize);

          // save headers
          SaveCacheHeaders(req.RequestUri, resp);

          if (path != null)
          {
              // copy and touch the file
              IOLibrary.CopyAlways(cachePath, path);
              File.SetLastWriteTime(path, File.GetLastWriteTime(cachePath));
          }
      }
      else
      {
          // try to work in-memory
          if ((doc != null) && (html))
          {
              if (respenc != null)
              {
                  doc.Load(s, respenc);
              }
              else
              {
                  doc.Load(s, true);
              }
          }
      }
      resp.Close();
  }

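Here resp is the response to the HTTP request. Setting a breakpoint shows that resp.ContentEncoding is empty, so the final load goes through doc.Load(s, true). Note that ContentEncoding reflects the Content-Encoding header (typically a compression scheme such as gzip), not the page's charset, so it would not identify a Chinese encoding even when it is set. The doc.Load(s, true) overload probably also fails to pick the right encoding, and the result is garbled text.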
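To make the failure mode concrete, here is a minimal sketch of what probably happens inside doc.Load(s, true). It assumes that the overload behaves like a StreamReader created with BOM detection turned on, which treats a stream without a byte order mark as UTF-8. It also assumes the page is served as GBK/GB2312, which the original investigation did not confirm. The class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomFallbackDemo
{
    // Reads bytes through a StreamReader with BOM detection enabled.
    // With no BOM in the data, the reader decodes the bytes as UTF-8.
    public static string ReadWithBomDetection(byte[] bytes)
    {
        using (var ms = new MemoryStream(bytes))
        using (var reader = new StreamReader(ms, true))
        {
            return reader.ReadToEnd();
        }
    }

    // "\u4E2D\u6587" is the two-character word for "Chinese (language)".
    public static readonly string Sample = "\u4E2D\u6587";

    // The same two characters encoded as GBK/GB2312 (no BOM).
    public static readonly byte[] SampleAsGbk = { 0xD6, 0xD0, 0xCE, 0xC4 };

    // True when the decoded text contains U+FFFD, the replacement
    // character emitted for byte sequences that are invalid UTF-8.
    public static bool LooksGarbled(string text)
    {
        return text.IndexOf('\uFFFD') >= 0;
    }
}
```

Feeding Encoding.UTF8.GetBytes(BomFallbackDemo.Sample) to ReadWithBomDetection gives the original text back, while BomFallbackDemo.SampleAsGbk comes out as replacement characters. That matches the symptom described above: BOM detection alone cannot recognise a GBK page.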

Solution:

Do not use HtmlWeb; the class is immature. Issue the HTTP request yourself instead, as follows:

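// Requires: using System; using System.IO; using System.Net; using HtmlAgilityPack;
string url = @"http://xinhua.org";
HttpWebRequest req = WebRequest.Create(new Uri(url)) as HttpWebRequest;
req.Method = "GET";
try
{
    // Dispose the response and its stream once the document has been loaded.
    using (WebResponse rs = req.GetResponse())
    using (Stream rss = rs.GetResponseStream())
    {
        HtmlDocument doc = new HtmlDocument();
        // Load straight from the response stream, bypassing HtmlWeb.
        doc.Load(rss);
        doc.Save("output.html");
    }
}
catch (Exception e)
{
    // Network and parsing failures both end up here.
    Console.WriteLine(e.Message);
    Console.WriteLine(e.StackTrace);
}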
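If doc.Load(rss) still picks the wrong encoding on your machine, one possible refinement is to read the charset from the response's Content-Type header and pass the matching Encoding to the load call explicitly. The helper below is a sketch of that idea, not part of HtmlAgilityPack; the names CharsetHelper and FromContentType are hypothetical, and the parsing is deliberately simple.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper: returns the Encoding named by the "charset"
    // parameter of a Content-Type header value such as
    // "text/html; charset=gb2312". Falls back to the supplied encoding
    // when the parameter is missing or names an unsupported encoding.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return fallback;
        }

        const string key = "charset=";
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (!p.StartsWith(key, StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            // Strip optional quotes around the charset name.
            string name = p.Substring(key.Length).Trim().Trim('"', '\'');
            if (name.Length == 0)
            {
                return fallback;
            }

            try
            {
                return Encoding.GetEncoding(name);
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name on this platform.
                return fallback;
            }
        }

        return fallback;
    }
}
```

In the manual request above, this would replace doc.Load(rss) with doc.Load(rss, CharsetHelper.FromContentType(rs.ContentType, Encoding.UTF8)), using the Load(Stream, Encoding) overload already seen in the HtmlWeb excerpt. The UTF-8 fallback is an arbitrary choice for this sketch. On .NET Core and later, GB2312/GBK are likely unavailable until the code-pages encoding provider is registered, in which case the helper silently returns the fallback.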