.NET Research Institute: Web Crawlers (the Third-Party Package HtmlAgilityPack)

qq_33931256
Category column: .NET Research Institute

Article link: https://blog.csdn.net/qq_33931256/article/details/102482605

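Why build a crawler?
Content sites (novels / films / anime): Alibaba Cloud + crawler + web front end
Data-collection crawlers: tender notices / Taobao data / job postings
Competitor analysis: scraping competitors' data
Is crawling illegal?
"Taking without asking is stealing." Still, everything a crawler can fetch is something a browser can also reach, i.e. public data.
Do not crawl for profit (small crawlers are generally left alone). The 360 search engine was ordered to pay damages for violating robots.txt.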

Writing a crawler is an iterative loop: analyze -> try -> test -> analyze -> try -> test

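Crawler attack and defense:
robots protocol (the moral line of defense): <root domain>/robots.txt, a weak, voluntary convention
---Server-side checks on request headers (Referer / User-Agent) {bypassed once the crawler imitates them correctly}
---User login {replay the request with the cookie attached}
---IP blacklists and whitelists {send requests through proxies}
---Periodic CAPTCHAs once a crawler is detected {switch IPs / use a CAPTCHA-solving service}
---JS dynamic loading / dynamic modification / data rendered as images {all solvable}
Crawlers and defenses keep escalating against each other; in the end, no information that is served publicly can be fully shielded from scraping.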
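
The header, cookie, and proxy countermeasures in the list above can be sketched with the standard `HttpClient` API. This is only an illustration, not the article's `HttpHelper`: the class name `RequestDemo`, the User-Agent string, the cookie value, and the proxy address parameter are all placeholders introduced here.

```csharp
using System;
using System.Net;
using System.Net.Http;

public static class RequestDemo
{
    // Builds a GET request carrying the headers a browser would normally send.
    public static HttpRequestMessage BuildRequest(string url, string referer, string userAgent, string cookie)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.Referrer = new Uri(referer);
        // TryAddWithoutValidation stores the raw value without strict header validation.
        request.Headers.TryAddWithoutValidation("User-Agent", userAgent);
        if (!string.IsNullOrEmpty(cookie))
        {
            request.Headers.TryAddWithoutValidation("Cookie", cookie);
        }
        return request;
    }

    // Creates a client; pass a proxy address to route traffic through it, or null for a direct connection.
    public static HttpClient CreateClient(string proxyAddress)
    {
        var handler = new HttpClientHandler
        {
            // Cookies are supplied by hand in BuildRequest, so the built-in cookie container is turned off.
            UseCookies = false
        };
        if (!string.IsNullOrEmpty(proxyAddress))
        {
            handler.Proxy = new WebProxy(proxyAddress);
            handler.UseProxy = true;
        }
        return new HttpClient(handler);
    }

    public static void Main()
    {
        HttpRequestMessage request = BuildRequest(
            "https://movie.douban.com/subject/30282387/reviews",
            "https://movie.douban.com/",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)", // placeholder User-Agent
            "sessionid=PLACEHOLDER");                     // placeholder cookie
        // Print the headers only; nothing is sent over the network here.
        Console.WriteLine(request.Headers);
    }
}
```

To actually send the request, something like `CreateClient(null).SendAsync(request)` would be the next step; the proxy argument is what an IP-rotation scheme would vary between calls.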

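Pipeline: HTML download -> data filtering, cleaning, and storage -> multithreading
Data filtering: regular expressions (tedious) / IndexOf + Substring + Replace / the third-party package HtmlAgilityPack, which supports XPath (XPath queries run against the parsed node tree, not against the raw text as a regex does)
Security controls: ActiveX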
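
To make the first two filtering options concrete, here is a minimal sketch that pulls the same text out of an HTML fragment once with a regular expression and once with IndexOf + Substring. The fragment, the class name `ExtractDemo`, and the `short-content` class are invented for illustration and are not taken from the Douban page.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtractDemo
{
    // Hypothetical fragment, shaped loosely like one review entry.
    public const string Sample =
        "<div class=\"review-short\"><div class=\"short-content\"> A quiet, moving film. </div></div>";

    // Approach 1: regular expression with a lazy capture group.
    public static string ByRegex(string html)
    {
        Match m = Regex.Match(html, "<div class=\"short-content\">(.*?)</div>", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    // Approach 2: IndexOf + Substring between two known markers.
    public static string ByIndexOf(string html)
    {
        const string open = "<div class=\"short-content\">";
        const string close = "</div>";
        int start = html.IndexOf(open, StringComparison.Ordinal);
        if (start < 0) return null;
        start += open.Length;
        int end = html.IndexOf(close, start, StringComparison.Ordinal);
        if (end < 0) return null;
        return html.Substring(start, end - start).Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ByRegex(Sample));
        Console.WriteLine(ByIndexOf(Sample));
    }
}
```

Both approaches depend on the exact markup text, so a change in attribute order or quoting breaks them; that fragility is the usual argument for querying a parsed tree with XPath instead.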

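1 Deep crawling & efficient batch matching and extraction of data
2 Extracting attributes whose value in the page source differs from what is displayed
3 Fetching Ajax data
4 Multithreaded crawling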
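
Item 4 is not shown in the example that follows, so here is one possible sketch of concurrent downloading using tasks throttled by a `SemaphoreSlim`. The names `ParallelCrawl` and `FetchAllAsync` are hypothetical, the `?start=N` pagination scheme is an assumption, and the download function is passed in so the demo can use a fake fetcher instead of real network calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelCrawl
{
    // Downloads all URLs with at most maxConcurrency requests in flight.
    // URLs are expected to be distinct, since they become dictionary keys.
    public static async Task<Dictionary<string, string>> FetchAllAsync(
        IEnumerable<string> urls, Func<string, Task<string>> fetch, int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();   // wait for a free slot
                try
                {
                    return new KeyValuePair<string, string>(url, await fetch(url));
                }
                finally
                {
                    gate.Release();       // free the slot even if fetch throws
                }
            }).ToList();

            KeyValuePair<string, string>[] pairs = await Task.WhenAll(tasks);
            return pairs.ToDictionary(p => p.Key, p => p.Value);
        }
    }

    public static void Main()
    {
        // Assumed pagination scheme (?start=N); the fake fetcher avoids real traffic.
        var urls = Enumerable.Range(0, 3)
            .Select(i => $"https://movie.douban.com/subject/30282387/reviews?start={i * 20}");
        Func<string, Task<string>> fakeFetch = async url =>
        {
            await Task.Delay(50);
            return $"<html>{url}</html>";
        };
        var pages = FetchAllAsync(urls, fakeFetch, 2).GetAwaiter().GetResult();
        Console.WriteLine($"Downloaded {pages.Count} pages");
    }
}
```

Keeping the concurrency limit low also reduces the chance of tripping the IP-based defenses described earlier.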

The following example crawls Douban movie reviews.

Source files:

The reviews are crawled from this URL: https://movie.douban.com/subject/30282387/reviews

Program.cs

using Newtonsoft.Json;
using Ruanmou.Crawler;
using Ruanmou.Crawler.Model;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace TestCrawler
{
    class Program
    {
        static void Main(string[] args)
        {
            // The Url field is the part that matters here.
            string testCategory = "{\"Id\":73,\"Code\":\"02f01s01T\",\"ParentCode\":\"02f01s\",\"Name\":\"影评\",\"Url\":\"https://movie.douban.com/subject/30282387/reviews\",\"Level\":3}";
            
            Category category = JsonConvert.DeserializeObject<Category>(testCategory);
            ISearch search = new DoubanSearch(category);
            search.Crawler();

            Console.ReadKey();
        }
    }
}

DoubanSearch.cs

using HtmlAgilityPack;
using Ruanmou.Crawler;
using Ruanmou.Crawler.Model;
using Ruanmou.Crawler.Utility;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace TestCrawler
{
    public class DoubanSearch : ISearch
    {
        private Logger logger = new Logger(typeof(DoubanSearch));
        private Category category = null;
        private int time = 0;
        public DoubanSearch(Category _category)
        {
            category = _category;
        }
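        public void Crawler()
        {
            try
            {
                // Nothing to crawl if the category has no URL.
                if (string.IsNullOrEmpty(category.Url))
                {
                    return;
                }
                string html = HttpHelper.DownloadUrl(category.Url); // download the page HTML

                HtmlDocument document = new HtmlDocument();
                document.LoadHtml(html);

                // Select every review <div> in the list; one page holds 20 reviews.
                string path = "//*[@id='content']/div/div[1]/div[1]/div";
                HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(path);

                // SelectNodes returns null (not an empty collection) when nothing matches.
                if (nodes == null)
                {
                    Console.WriteLine("No review nodes matched the XPath.");
                    return;
                }

                foreach (HtmlNode data in nodes)
                {
                    FindDoubanCommentSingle(data);
                }
            }
            catch (Exception ex)
            {
                // Report the failure instead of silently swallowing it.
                Console.WriteLine($"Crawl failed: {ex.Message}");
            }
        }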
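        /// <summary>
        /// Extracts and prints a single review.
        /// </summary>
        /// <param name="node">A review node selected in Crawler().</param>
        public void FindDoubanCommentSingle(HtmlNode node)
        {
            try
            {
                // Re-parse the fragment so the absolute XPath below ("//...") only sees this review.
                HtmlDocument htmlDocument = new HtmlDocument();
                htmlDocument.LoadHtml(node.OuterHtml);
                HtmlNode node1 = htmlDocument.DocumentNode;

                // XPath of the first text node inside the review summary
                string xpath = "//*[@class='review-short']/div/text()[1]";

                HtmlNode textNode = node1.SelectSingleNode(xpath);
                if (textNode == null)
                {
                    // Not a review block (or the page layout changed); skip it.
                    return;
                }
                string str = textNode.ParentNode.InnerText.Trim();

                time++;
                Console.WriteLine($"Crawled review -> {time}");
                Console.WriteLine(str);
                Console.WriteLine();
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Failed to parse a review: {ex.Message}");
            }
        }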
    }
}
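
As an alternative to re-parsing each node's OuterHtml, an XPath that starts with "." is evaluated relative to the current node. The sketch below shows that variant on a small hand-written fragment; the markup, the `review-item` and `short-content` class names, and the class `RelativeXPathDemo` are assumptions for illustration, and the HtmlAgilityPack NuGet package must be referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical markup: two review blocks, each with a short-content <div>.
    public const string Sample =
        "<div id='content'>" +
        "<div class='review-item'><div class='short-content'>First review &amp; notes</div></div>" +
        "<div class='review-item'><div class='short-content'>Second review</div></div>" +
        "</div>";

    public static List<string> ExtractReviews(string html)
    {
        var results = new List<string>();
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNodeCollection items = document.DocumentNode.SelectNodes("//div[@class='review-item']");
        if (items == null) return results; // no match yields null, not an empty collection

        foreach (HtmlNode item in items)
        {
            // The leading "." scopes the query to this item instead of the whole document.
            HtmlNode content = item.SelectSingleNode(".//div[@class='short-content']");
            if (content == null) continue;
            // InnerText keeps entities such as &amp; as written, so decode them explicitly.
            results.Add(HtmlEntity.DeEntitize(content.InnerText).Trim());
        }
        return results;
    }

    public static void Main()
    {
        foreach (string review in ExtractReviews(Sample))
        {
            Console.WriteLine(review);
        }
    }
}
```

This avoids building a second HtmlDocument per review, at the cost of having to remember the leading dot on every nested query.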

Result: 20 reviews in total.
