Source code: https://github.com/KaiZons/RenrenyingshiSpider
1. Download the HTML:
// Requires: using System.IO; using System.Net; using System.Text;
public static string DownloadHtml(string url)
{
    HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
    request.Timeout = 30 * 1000; // 30-second timeout
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63";
    // To find the page's encoding, enter document.charset in the console of the browser's developer tools
    request.ContentType = "text/html; charset=UTF-8";
    using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
    {
        if (response.StatusCode != HttpStatusCode.OK)
        {
            return null;
        }
        // The using block disposes the reader (and the underlying stream) even if ReadToEnd throws
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
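Notes:
To find the User-Agent, open the page in a browser, press F12, select the "Network" tab, refresh the page, select the request for the URL, and look at its User-Agent header.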
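DownloadHtml hard-codes UTF-8 when reading the response. As a minimal sketch, assuming the server reports the charset in its Content-Type header, the encoding could instead be picked from that header with a UTF-8 fallback. CharsetHelper and ResolveEncoding are hypothetical names for illustration and are not part of the original project.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper (not in the original project): chooses an Encoding
    // from a Content-Type header value and falls back to UTF-8.
    public static Encoding ResolveEncoding(string contentTypeHeader)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
        {
            return Encoding.UTF8; // header missing
        }
        try
        {
            // ContentType parses values such as "text/html; charset=utf-8"
            string charset = new ContentType(contentTypeHeader).CharSet;
            return string.IsNullOrEmpty(charset)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            return Encoding.UTF8; // malformed header value
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8; // charset name unknown to this runtime
        }
    }

    public static void Main()
    {
        Console.WriteLine(ResolveEncoding("text/html; charset=utf-8").WebName);
        Console.WriteLine(ResolveEncoding("text/html").WebName);
    }
}
```

Inside DownloadHtml, the result of ResolveEncoding(response.ContentType) could be passed to the StreamReader in place of Encoding.UTF8. On .NET Core and later, legacy code pages such as GBK are only available after registering a code-page provider; without that, this sketch falls back to UTF-8 for them.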
2. Filter the data: use the HtmlAgilityPack package, which supports XPath queries. XPath uses path expressions to select nodes in an XML document.
// Requires: using System; using HtmlAgilityPack;
static void Main(string[] args)
{
    string url = "http://yyetss.com/list-jingsong-all-1.html";
    string html = HttpHelper.DownloadHtml(url);
    HtmlDocument document = new HtmlDocument();
    document.LoadHtml(html);
    // Select the div nodes via XPath
    HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("/html/body/div[2]/div/div[1]/div[2]/div");
    if (nodes == null) // SelectNodes returns null when nothing matches
    {
        Console.WriteLine("No matching nodes found.");
        return;
    }
    foreach (HtmlNode node in nodes)
    {
        string imgPath = node.SelectSingleNode("a/img").Attributes["src"].Value; // the src attribute of a/img under each div
        string teleplayName = node.SelectSingleNode("div/a/p[1]").InnerText;
        string description = node.SelectSingleNode("div/a/p[2]/span").InnerText;
        Console.WriteLine($"Name: {teleplayName} Description: {description} Image path: {imgPath}");
    }
    Console.ReadLine();
}
Notes:
HtmlDocument is a class provided by HtmlAgilityPack.
document.DocumentNode.SelectNodes("/html/body/div[2]/div/div[1]/div[2]/div"); selects nodes by absolute path.
string imgPath = node.SelectSingleNode("a/img").Attributes["src"].Value; selects a node by a path relative to the current node.
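The difference between absolute and relative paths can be sketched on a small inline document. To stay runnable without any NuGet package, this example swaps in the standard library's System.Xml in place of HtmlAgilityPack; the XPath expressions behave the same way. The markup, XPathDemo, and ReadImageSources are made up for illustration.

```csharp
using System;
using System.Xml;

public static class XPathDemo
{
    // Made-up, well-formed markup standing in for a downloaded page.
    private const string Sample =
        "<html><body>" +
        "<div class='item'><a><img src='a.jpg'/></a><p>First</p></div>" +
        "<div class='item'><a><img src='b.jpg'/></a><p>Second</p></div>" +
        "</body></html>";

    public static string[] ReadImageSources()
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(Sample);

        // Absolute path: the leading '/' means it is resolved from the document root.
        XmlNodeList items = document.SelectNodes("/html/body/div");
        string[] sources = new string[items.Count];
        for (int i = 0; i < items.Count; i++)
        {
            // Relative path: no leading '/', so it is resolved from the current div.
            XmlNode img = items[i].SelectSingleNode("a/img");
            sources[i] = img.Attributes["src"].Value;
        }
        return sources;
    }

    public static void Main()
    {
        foreach (string src in ReadImageSources())
        {
            Console.WriteLine(src);
        }
    }
}
```

Unlike HtmlAgilityPack, XmlDocument only accepts well-formed XML, so it is not a substitute for parsing real web pages; it is used here only to show how the two kinds of path are resolved.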
3. Display the results: