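I have seen many people run into garbled text when downloading web pages. Plenty of fixes have been suggested, but none of them struck me as very proper, for example the common trick of pulling the charset out of the HTML with a regular expression. In fact, HttpWebRequest and HttpWebResponse are enough to handle this. The code is as follows: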
using System.IO;
using System.Net;
using System.Text;

private static string DownloadHtml(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Timeout = 600000; // in milliseconds, i.e. 10 minutes
    request.AllowAutoRedirect = true;
    // Content-Type describes a request body; a plain GET has none, so this line is optional.
    request.ContentType = "application/x-www-form-urlencoded";
    request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2";

    // The using blocks release the response, stream, and reader even if an exception is thrown.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    // Decode the body with the charset the server declared for this response.
    using (StreamReader srHtml = new StreamReader(stream,
        Encoding.GetEncoding(response.CharacterSet)))
    {
        return srHtml.ReadToEnd();
    }
}
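The page's encoding is already available in response.CharacterSet, which reflects the charset declared in the response's Content-Type header, so there is no need to extract it with a regular expression.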
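One caveat: when the server does not declare a charset, or declares one the runtime does not recognize, Encoding.GetEncoding can throw an ArgumentException. The sketch below is one possible way to guard against that. The CharsetHelper class and its ResolveEncoding method are hypothetical names that are not part of the original code, and falling back to UTF-8 is an assumption that may not suit every site.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original post.
public static class CharsetHelper
{
    // Maps a charset name to an Encoding, falling back to UTF-8 (an assumed default)
    // when the name is missing or not recognized by the runtime.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }

        try
        {
            // Strip whitespace and surrounding quotes defensively before the lookup.
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // The name is not a valid or supported encoding on this platform.
            return Encoding.UTF8;
        }
    }
}
```

To use it in DownloadHtml, pass CharsetHelper.ResolveEncoding(response.CharacterSet) to the StreamReader constructor in place of Encoding.GetEncoding(response.CharacterSet).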