Building a Simple Web Crawler in C# with HtmlAgilityPack

KelonsByCsdn

Article link: https://blog.csdn.net/u013986317/article/details/115283161

Source code: https://github.com/KaiZons/RenrenyingshiSpider

1. Download the HTML:

using System.IO;
using System.Net;
using System.Text;

public static class HttpHelper
{
        public static string DownloadHtml(string url)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = 30 * 1000; // 30-second timeout
            request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63";
            // To find the page's encoding, enter document.charset in the console of the browser's developer tools.
            request.ContentType = "text/html; charset=UTF-8";
            // Note: GetResponse throws a WebException for timeouts and most 4xx/5xx responses.
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                if (response.StatusCode != HttpStatusCode.OK)
                {
                    return null;
                }
                // Dispose the reader (and the underlying stream) even if reading fails.
                using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
                {
                    return reader.ReadToEnd();
                }
            }
        }
}

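Notes:

To find the UserAgent value, open the browser's developer tools (F12), select the "Network" tab, refresh the page, select the request for the URL, and look at its User-Agent request header.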
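
One point about the status check in DownloadHtml: HttpWebRequest.GetResponse throws a WebException for timeouts, connection failures, and most 4xx/5xx responses instead of returning a response object, so the caller may want to handle that case as well. The following is only a minimal sketch of one way to do that. The names SafeDownloader and TryDownloadHtml are hypothetical and do not come from the original source.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper for illustration; not part of the original project.
public static class SafeDownloader
{
    // Returns the page body, or null if the request fails with a WebException
    // or the server answers with a status other than 200 OK.
    public static string TryDownloadHtml(string url, int timeoutMs = 30 * 1000)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = timeoutMs;
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
            {
                return response.StatusCode == HttpStatusCode.OK ? reader.ReadToEnd() : null;
            }
        }
        catch (WebException ex)
        {
            // ex.Response is non-null when the server replied with an error status.
            if (ex.Response is HttpWebResponse errorResponse)
            {
                Console.Error.WriteLine($"HTTP {(int)errorResponse.StatusCode} for {url}");
            }
            else
            {
                Console.Error.WriteLine($"Request failed ({ex.Status}) for {url}");
            }
            return null;
        }
    }
}
```

With a wrapper like this, the calling code only has to check for null before passing the result to the parser.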

2. Filter the data: use the HtmlAgilityPack package, which supports XPath queries. XPath uses path expressions to select nodes in an XML document.

using System;
using HtmlAgilityPack;

class Program
{
        static void Main(string[] args)
        {
            string url = "http://yyetss.com/list-jingsong-all-1.html";
            string html = HttpHelper.DownloadHtml(url);
            if (html == null) return; // download failed
            HtmlDocument document = new HtmlDocument();
            document.LoadHtml(html);
            // Select the div of each list entry via XPath.
            HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("/html/body/div[2]/div/div[1]/div[2]/div");
            if (nodes == null) // SelectNodes returns null when nothing matches
            {
                Console.WriteLine("No matching nodes found.");
                return;
            }
            foreach (HtmlNode node in nodes)
            {
                string imgPath = node.SelectSingleNode("a/img").Attributes["src"].Value; // src attribute of a/img under each div
                string teleplayName = node.SelectSingleNode("div/a/p[1]").InnerText;
                string description = node.SelectSingleNode("div/a/p[2]/span").InnerText;
                Console.WriteLine($"Name: {teleplayName} Description: {description} Image path: {imgPath}");
            }
            Console.ReadLine();
        }
}

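Notes:

HtmlDocument is a class provided by HtmlAgilityPack;

document.DocumentNode.SelectNodes("/html/body/div[2]/div/div[1]/div[2]/div"); selects nodes through an absolute path, resolved from the document root;

string imgPath = node.SelectSingleNode("a/img").Attributes["src"].Value; selects a node through a relative path, resolved from the current node;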
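
To make the difference between the two kinds of path concrete, here is a small sketch that runs both against an inline HTML string, so it needs no network access. The markup, the class name XPathDemo, and the method name Extract are all made up for illustration and do not mirror the structure of the real page. It also guards against missing nodes, since SelectNodes and SelectSingleNode return null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical demo class; the sample markup below is invented.
public static class XPathDemo
{
    public const string SampleHtml =
        "<html><body><div id='list'>" +
        "<div><a href='/a'><img src='a.jpg'></a><h3>Show A</h3></div>" +
        "<div><a href='/b'></a><h3>Show B</h3></div>" +
        "</div></body></html>";

    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        // Absolute path: starts with "/" and is resolved from the document root.
        HtmlNodeCollection items = document.DocumentNode.SelectNodes("/html/body/div/div");
        if (items == null)
        {
            return results; // nothing matched
        }

        foreach (HtmlNode item in items)
        {
            // Relative paths: no leading "/", resolved from the current node.
            HtmlNode img = item.SelectSingleNode("a/img");
            string src = img?.GetAttributeValue("src", "(no image)") ?? "(no image)";
            string name = item.SelectSingleNode("h3")?.InnerText ?? "(no name)";
            results.Add($"{name}: {src}");
        }
        return results;
    }

    public static void Main()
    {
        foreach (string line in Extract(SampleHtml))
        {
            Console.WriteLine(line);
        }
    }
}
```

The second entry in the sample has no img element, which exercises the null-handling branch that the direct .Attributes["src"].Value access would not survive.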

3. Display the results:
