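1. The first step in scraping data from a web page is to fetch the page content by URL.
1.1 Note: pay attention to the page's character set. Decoding the response with the wrong one produces garbled text.
1.2 Recommended approach: determine the page's character set while fetching the content.
1.3 Reference code: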
// Required namespaces:
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static string GetHtml(string url)
{
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Timeout = 30000;
    webRequest.Method = "GET";
    webRequest.UserAgent = "Mozilla/4.0";
    // Let the framework send Accept-Encoding and decompress gzip/deflate bodies itself,
    // instead of advertising deflate and then only handling gzip by hand.
    webRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    using (Stream streamReceive = webResponse.GetResponseStream())
    {
        // Get the target site's encoding from the Content-Type header.
        // Fall back to UTF-8 when the header is missing or names an unknown charset.
        Encoding encoding = Encoding.UTF8;
        string contentType = webResponse.Headers["Content-Type"] ?? string.Empty;
        Match match = Regex.Match(contentType, "charset\\s*=\\s*[\\W]?\\s*([\\w-]+)", RegexOptions.IgnoreCase);
        if (match.Success)
        {
            try
            {
                encoding = Encoding.GetEncoding(match.Groups[1].Value.Trim());
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the UTF-8 fallback.
            }
        }

        // The same detected encoding is used whether or not the body was compressed.
        using (StreamReader sr = new StreamReader(streamReceive, encoding))
        {
            return sr.ReadToEnd();
        }
    }
}
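The reference code above reads the charset only from the Content-Type header. Some pages declare it only in a meta tag inside the HTML, so one possible fallback is to look at the first bytes of the body as well. The following is a minimal sketch, not part of the original code: the class name CharsetSniffer, its two methods, and the 4096-byte prefix size are all hypothetical choices made for illustration, and it assumes the response body has already been read into a non-null byte array.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original reference code.
public static class CharsetSniffer
{
    // Same pattern as the reference code. It matches both the header form
    // (charset=gb2312) and the meta-tag form (charset="utf-8").
    private static readonly Regex CharsetRegex =
        new Regex("charset\\s*=\\s*[\\W]?\\s*([\\w-]+)", RegexOptions.IgnoreCase);

    // Returns the charset name from the Content-Type header if present,
    // otherwise from the start of the body, otherwise null.
    public static string FindCharsetName(string contentType, byte[] body)
    {
        Match m = CharsetRegex.Match(contentType ?? string.Empty);
        if (m.Success) return m.Groups[1].Value;

        // Meta tags are plain ASCII, so decoding a prefix as ASCII is enough to find them.
        // The 4096-byte limit is an arbitrary assumption.
        int prefixLength = Math.Min(body.Length, 4096);
        string head = Encoding.ASCII.GetString(body, 0, prefixLength);
        m = CharsetRegex.Match(head);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Decodes the body with the detected charset, falling back to UTF-8.
    public static string Decode(string contentType, byte[] body)
    {
        Encoding encoding = Encoding.UTF8;
        string name = FindCharsetName(contentType, body);
        if (name != null)
        {
            try { encoding = Encoding.GetEncoding(name); }
            catch (ArgumentException) { /* unknown charset name: keep UTF-8 */ }
        }
        return encoding.GetString(body);
    }
}
```

The header is checked first because it is cheaper and, when present, normally takes precedence. On .NET Core and .NET 5+, code-page encodings such as gb2312 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); without it, this sketch silently falls back to UTF-8.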
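2. Add a reference to HtmlAgilityPack.dll and convert the fetched page content into an HtmlNode.
Reference code:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(htmlCode); // htmlCode: the page content returned by GetHtml
HtmlNode rootNode = document.DocumentNode;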
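3. Extract the required content from the page nodes.
Tip: if you have Firefox with its debugging tools installed, press F12, select the page node, right-click it, and choose "Copy XPath" to get the node's path quickly.
Reference code:
HtmlNode cityCodeNode = rootNode.SelectSingleNode("node path"); // replace "node path" with the copied XPath
// SelectSingleNode returns null when the path matches nothing, so guard before reading InnerHtml.
string cityCode = cityCodeNode != null ? cityCodeNode.InnerHtml : null;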
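
When a path matches several nodes, or may match none, the single-node call above is not enough on its own. The following is a minimal sketch of one way to collect the text of every match. It assumes the HtmlAgilityPack package is referenced; the class name NodeTextExtractor and the method ExtractTexts are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of the original reference code.
public static class NodeTextExtractor
{
    // Returns the trimmed text of every node matched by the XPath expression,
    // or an empty list when nothing matches.
    public static List<string> ExtractTexts(string html, string xpath)
    {
        var results = new List<string>();
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return results;

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops the markup; DeEntitize turns entities such as &amp; back into characters.
            results.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return results;
    }
}
```

For example, ExtractTexts(GetHtml(url), "//li") would collect the text of every list item on the page. InnerText is used here instead of InnerHtml because the sketch assumes only the visible text is wanted; keep InnerHtml if the nested markup matters.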