Writing a Crawler with .NET Core: HtmlAgilityPack Usage in Detail

卷儿哥

Category: .NET. Tags: c#, crawler, html, https

Article link: https://blog.csdn.net/DahlinSky/article/details/104587512

HtmlAgilityPack Usage in Detail

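The previous article, "Writing a Crawler with .NET Core: HttpClient Usage in Detail", covered how to send HTTP requests and fetch data. The next step is to parse that data and extract the information we want. In Python the common parsing libraries are PyQuery, BeautifulSoup, and lxml; the .NET counterpart is HtmlAgilityPack. It parses the HTML into a DOM tree and lets you query the nodes with XPath, which is simple to use and carries over from other languages.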

1. Introduction to HtmlAgilityPack

HtmlAgilityPack, HAP for short, is a third-party library written in C# for parsing HTML DOM and XML. The official site describes it this way: "HAP is an HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT." The official site is https://html-agility-pack.net/

2. Using HtmlAgilityPack

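According to the official documentation, the API is organized into several basic groups: Parser, Selectors, Manipulation, Traversing, Writer, Utilities, and Attributes. The names largely explain themselves: Parser loads and parses documents, Selectors picks out nodes, Manipulation modifies nodes, and so on. The official documentation covers each group clearly, and its English is simple enough to follow mostly without a dictionary.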

Official documentation: https://html-agility-pack.net/documentation
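To make those groups concrete, here is a minimal sketch that touches several of them in one pass. The HTML string is made up for illustration, and the snippet assumes C# 9 or later (top-level statements) with the HtmlAgilityPack NuGet package installed; it is not taken from the official documentation.

```csharp
using System;
using HtmlAgilityPack;

// Parser: build a DOM from an HTML string (made-up markup for illustration).
var doc = new HtmlDocument();
doc.LoadHtml("<html><body><h1 id='title'>Demo</h1><a href='/a.html'>first</a><a href='/b.html'>second</a></body></html>");

// Selectors: XPath queries. SelectSingleNode returns null when nothing matches.
HtmlNode title = doc.DocumentNode.SelectSingleNode("//h1[@id='title']");
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");

// Attributes: read with a fallback value, so a missing attribute does not throw.
string firstHref = links[0].GetAttributeValue("href", "");

// Manipulation: change an attribute in place.
links[1].SetAttributeValue("href", "/changed.html");

// Traversing: move from a node to its parent.
string parentName = title.ParentNode.Name;

// Writer: OuterHtml serializes a node back to HTML.
Console.WriteLine(title.InnerText);
Console.WriteLine(firstHref);
Console.WriteLine(parentName);
Console.WriteLine(links[1].OuterHtml);
```

The null checks are omitted here only because the markup is fixed; on real pages both selector calls can return null.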

2.1 Getting the HTML string

Before parsing, we need the HTML document. Here I fetch it with a simple HttpClient call.

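using System;
using System.Net.Http;
using System.Threading.Tasks;

static string urlRoot = "https://www.haolizi.net/examples/csharp_{0}.html";

// One shared HttpClient; creating a new instance per request can exhaust sockets.
static readonly HttpClient httpClient = CreateClient();

static HttpClient CreateClient()
{
	var client = new HttpClient();
	// "Method", "KeepAlive" and "UserAgent" are not HTTP header names.
	// GetStringAsync already issues a GET; the two real headers are set below.
	client.DefaultRequestHeaders.ConnectionClose = true; // sends "Connection: close"
	client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
	return client;
}

/// <summary>
/// Fetches an HTML page.
/// </summary>
/// <param name="requestUrl">URL of the page</param>
/// <returns>The response body as a string</returns>
public static async Task<string> HtmlRequest(string requestUrl)
{
	return await httpClient.GetStringAsync(requestUrl);
}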

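string requestUrl = string.Format(urlRoot, 1);
Console.WriteLine(requestUrl);
// Await from an async caller (for example static async Task Main); blocking on .Result can deadlock in some hosts.
string html = await HtmlRequest(requestUrl);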

Of course, we can also load the HTML document directly with HAP's built-in method:

HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(requestUrl);

2.2 Parsing the HTML string

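Next we parse the HTML DOM and extract the element data we need. First use the browser's developer tools (F12) to locate the element tags, as shown in the figure; the DOM tree structure quickly becomes clear, and XPath can then pull out the useful information. The elements in the red boxes of the screenshot are the ones to extract: every string that carries a hyperlink is content we want.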
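The XPath patterns this kind of extraction relies on can be tried offline first. The sketch below runs them against a small made-up fragment that only loosely imitates a list page; the real site's markup may differ, and the class name `tags` is invented for the sample. It assumes C# 9 or later and the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Made-up markup for illustration; not copied from the real site.
string sample = @"
<div class='content-box'><ul>
  <li>
    <div class='baseinfo'><h3><a href='/example/1.html'>Demo A</a></h3>
      <div class='xj'><span class='rq'><em>12</em></span><span>34</span></div></div>
    <div class='tags'><span class='zwch hot'><a href='/tag/socket'>socket</a></span>
      <span class='zwch'><a href='/tag/tcp'>tcp</a></span></div>
  </li>
  <li>
    <div class='baseinfo'><h3><a href='/example/2.html'>Demo B</a></h3>
      <div class='xj'><span class='rq'><em>5</em></span><span>6</span></div></div>
  </li>
</ul></div>";

var doc = new HtmlDocument();
doc.LoadHtml(sample);

var results = new List<string>();
HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//div[@class='content-box']/ul/li");
foreach (HtmlNode li in items)
{
    // "./" keeps the query relative to the current <li>; a leading "//" would search the whole document.
    string name = li.SelectSingleNode("./div[@class='baseinfo']/h3/a").InnerText;
    // span[2] is positional: the second <span> child, counted from 1.
    string downloads = li.SelectSingleNode("./div[@class='baseinfo']/div[@class='xj']/span[2]").InnerText;
    // contains(@class, 'zwch') also matches elements that carry several class names.
    HtmlNodeCollection tagNodes = li.SelectNodes(".//span[contains(@class, 'zwch')]/a");
    int tagCount = tagNodes == null ? 0 : tagNodes.Count; // null, not empty, when nothing matches
    results.Add($"{name}: downloads={downloads}, tags={tagCount}");
}

foreach (string line in results) Console.WriteLine(line);
// Prints:
// Demo A: downloads=34, tags=2
// Demo B: downloads=6, tags=0
```

The second list item has no tag spans, which is why the null check on `tagNodes` matters.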

Figure: locating elements in the DOM with F12.
Source code for parsing and extraction:

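// Requires the HtmlAgilityPack and Newtonsoft.Json NuGet packages.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using Newtonsoft.Json;

/// <summary>
/// Parses the page and extracts the fields.
/// </summary>
/// <param name="htmlStr">HTML source of the list page</param>
public static void GetExampleData(string htmlStr)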
{

	#region Fields
	string rootUrl = @"https://www.haolizi.net";
	string name = string.Empty;
	string detailUrl = string.Empty;
	string category = string.Empty;
	string categoryUrl = string.Empty;
	int hotNum = -1;
	int downloadCount = -1;
	int needScore = 0;
	string devLanguage = string.Empty;
	string downloadSize = string.Empty;
	string pubdate = string.Empty;
	string pubPersion = string.Empty;
	string downloadUrl = string.Empty;
	#endregion
	
	HtmlDocument htmlDoc = new HtmlDocument();
	htmlDoc.LoadHtml(htmlStr);
	
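	var liNodes = htmlDoc.DocumentNode.SelectNodes("//div[@class='content-box']/ul/li");
	// SelectNodes returns null (not an empty collection) when nothing matches.
	if (liNodes == null) return;
	foreach (HtmlNode node in liNodes)
	{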
		List<string> tags = new List<string>();
		
		#region Extract elements
		// Example title
		HtmlNode aNode = node.SelectSingleNode("./div[@class='baseinfo']/h3/a");
		name = aNode.InnerText;
		detailUrl = rootUrl + aNode.Attributes["href"].Value;
		// Example category
		HtmlNode categoryNode = node.SelectSingleNode("./div[@class='baseinfo']/a");
		category = categoryNode.InnerText;
		categoryUrl = rootUrl + categoryNode.Attributes["href"].Value;
		// Popularity
		HtmlNode hotNumNode = node.SelectSingleNode("./div[@class='baseinfo']/div[@class='xj']/span[@class='rq']/em");
		hotNum = Convert.ToInt32(hotNumNode.InnerText);
		// Download count
		HtmlNode downloadCountNode = node.SelectSingleNode("./div[@class='baseinfo']/div[@class='xj']/span[2]");
		downloadCount = Convert.ToInt32(downloadCountNode.InnerText);
		// Points required to download
		HtmlNode needScoreNode = node.SelectSingleNode("./div[@class='baseinfo']/div[@class='xj']/span[3]");
		needScore = Convert.ToInt32(needScoreNode.InnerText);
		// Development language (the text node that follows the first span)
		HtmlNode devLanguageNode = node.SelectSingleNode("./div[@class='sinfo']/div/p[@class='fun']/span[1]");
		devLanguage = devLanguageNode.NextSibling.InnerText.Replace("&nbsp;", "").Replace("|", "");
		// Download size
		HtmlNode downloadSizeNode = node.SelectSingleNode("./div[@class='sinfo']/div/p[@class='fun']/span[2]");
		downloadSize = downloadSizeNode.InnerText;
		// Publish date
		HtmlNode pubdateNode = node.SelectSingleNode("./div[@class='sinfo']/div/p[@class='fun']/span[3]");
		pubdate = pubdateNode.InnerText;
		// Publisher
		HtmlNode pubPersionNode = node.SelectSingleNode("./div[@class='sinfo']/div/p[@class='fun']/span[4]/a");
		pubPersion = pubPersionNode.InnerText;
		// Related tags
		var tagNodes = node.SelectNodes("./div[@class='sinfo']/div/p[@class='fun']/span[contains(@class , 'zwch')]");
		if (tagNodes != null)
		{
			foreach (var tnode in tagNodes)
			{
				tags.Add(tnode.SelectSingleNode("./a").InnerText);
				// Console.WriteLine(name + " tag:" + tnode.SelectSingleNode("./a").InnerText);
			}
		}
		#endregion
		
		string jsonStr = JsonConvert.SerializeObject(new {
			Name = name,
			Category = category,
			DevLanguage = devLanguage,
			DownloadCount = downloadCount,
			// Strip the "大小：" ("Size:") label that the page puts before the value.
			DownloadSize = downloadSize.Replace("大小：", "").Trim(),
			HotNum = hotNum,
			NeedScore = needScore,
			// Strip the "发布时间：" ("Published:") label before parsing the date.
			Pubdate = Convert.ToDateTime(pubdate.Replace("发布时间：", "").Trim()),
			PubPersion = pubPersion
		});
		Console.WriteLine(jsonStr);
	}
}
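The method above assumes that every node exists and that every number parses cleanly, so one missing element or one extra label in the text throws an exception. A more defensive variant could look like the sketch below. `TextOrEmpty` and `IntOrDefault` are hypothetical helper names, not part of HtmlAgilityPack, and the markup is made up; the snippet assumes C# 9 or later and the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper: returns the decoded, trimmed text of a node, or "" when the XPath matches nothing.
static string TextOrEmpty(HtmlNode parent, string xpath)
{
    HtmlNode node = parent.SelectSingleNode(xpath);
    // HtmlEntity.DeEntitize turns entities such as &amp; back into characters.
    return node == null ? string.Empty : HtmlEntity.DeEntitize(node.InnerText).Trim();
}

// Hypothetical helper: parses the ASCII digits found in the text, or returns the fallback.
// Note that it concatenates all digits, so "12 of 34" would yield 1234.
static int IntOrDefault(string text, int fallback)
{
    string digits = new string(text.Where(c => c >= '0' && c <= '9').ToArray());
    return int.TryParse(digits, out int value) ? value : fallback;
}

// Made-up markup for illustration.
var doc = new HtmlDocument();
doc.LoadHtml("<ul><li><span class='rq'>Hot: <em>12</em></span><span>Tom &amp; Jerry</span></li></ul>");
HtmlNode li = doc.DocumentNode.SelectSingleNode("//li");

int hot = IntOrDefault(TextOrEmpty(li, "./span[@class='rq']"), -1);
string owner = TextOrEmpty(li, "./span[2]");
int missing = IntOrDefault(TextOrEmpty(li, "./span[9]"), -1); // no ninth span exists

Console.WriteLine(hot);     // 12
Console.WriteLine(owner);   // Tom & Jerry
Console.WriteLine(missing); // -1
```

Whether to fall back to a default or skip the whole item is a design choice; for a crawler that runs unattended, logging the item and continuing is often preferable to aborting the page.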