A Practical Crawler with PuppeteerSharp + AngleSharp: Scraping Data from Autohome (汽车之家)

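This post is based on the DotNetSpider sample.
DotNetSpider felt too heavyweight for this job; it is a fairly complete crawler framework.
After comparing the headless browsers listed below, I settled on PuppeteerSharp + AngleSharp to write a crawler sample.
Like the DotNetSpider sample, it uses the Autohome page https://store.mall.autohome.com.cn/83106681.html as the scraping target.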

Headless Browsers

A list of (almost) all headless web browsers in existence

A web browser without a graphical user interface, controlled programmatically. Used for automation, testing, and other purposes.

Browser engines

These browser engines fully render web pages or run JavaScript in a virtual DOM

| Name | About | Supported Languages | License |
| --- | --- | --- | --- |
| Chromium Embedded Framework | CEF is an open source project based on the Google Chromium project. | JavaScript | BSD |
| Erik | Headless browser on top of Kanna and WebKit. | Swift | MIT |
| jBrowserDriver | A Selenium-compatible headless browser which is written in pure Java. WebKit-based. Works with any of the Selenium Server bindings. | Java | Apache License v2.0 |
| PhantomJS | [Unmaintained] PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG. | JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R (via Selenium) | BSD 3-Clause |
| Splash | Splash is a JavaScript rendering service with an HTTP API. It's a lightweight browser with an HTTP API, implemented in Python using Twisted and QT. | Any | BSD 3-Clause |

Multi drivers

These libraries can control multiple browser engines (typically using Selenium)

| Name | About | Supported Languages | License |
| --- | --- | --- | --- |
| CasperJS | CasperJS is an open source navigation scripting & testing utility written in JavaScript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). | JavaScript | MIT |
| Geb | Geb is a Groovy interface to WebDriver. | Groovy | Apache |
| Selenium | Selenium is a suite of tools to automate web browsers across many platforms. | JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R | Apache |
| Splinter | Splinter is an open source tool for testing web applications using Python. It lets you automate browser actions, such as visiting URLs and interacting with their items. | Python | - |
| SST | SST (selenium-simple-test) is a web test framework that uses Python to generate functional browser-based tests. | Python | - |
| Watir | The most elegant way to use Selenium WebDriver with Ruby. | Ruby | MIT |

PhantomJS drivers

These libraries control PhantomJS

| Name | About | Supported Languages | License |
| --- | --- | --- | --- |
| Ghostbuster | Automated browser testing via phantom.js, with all of the pain taken out! That means you get a real browser, with a real DOM, and can do real testing! | JavaScript | Not specified |
| jedi-crawler | Lightsabing Node/PhantomJS crawler; scrape dynamic content without the hassle | JavaScript | Not specified |
| Lotte | Lotte is a headless, automated testing framework built on top of PhantomJS and inspired by Ghostbuster. | JavaScript | MIT |
| phantompy | Phantompy is a headless WebKit engine with a powerful Pythonic API built on top of Qt5 WebKit | Python | LGPL-2.1 |
| X-RAY | Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP) | JavaScript | MIT |
| Horseman | Promise based Node.js module for PhantomJS. Features chainable API, understandable control-flow, support for multiple tabs, and built-in jQuery. | JavaScript | MIT |

Chromium drivers

These libraries control Chromium

| Name | About | Supported Languages | License |
| --- | --- | --- | --- |
| Awesomium | Chromium-based headless browser engine | C++, .NET | Free/Commercial |
| Headless Chromium | Chromium feature activated with the --headless flag, currently available in the nightly build of Chromium, not yet released | C++ | Open source |
| Puppeteer | Headless Chrome Node API from the Chrome DevTools team | JavaScript | Apache |
| PuppeteerSharp | PuppeteerSharp is a port of the official Headless Chrome Node.JS Puppeteer API | .NET | MIT |
| chrome-remote-interface | Chrome Debugging Protocol interface for Node.js | JavaScript | MIT |
| Chromy | Features chainable API, mobile emulation, fundamental API such as JavaScript evaluation. | JavaScript | MIT |
| chromedp | A faster, simpler way to drive browsers (Chrome, Edge, Safari, Android, etc) without external dependencies (ie, Selenium, PhantomJS, etc) using the Chrome Debugging Protocol. | Go | MIT |
| Chromeless | Chrome automation made simple. Runs locally or headless on AWS Lambda. | JavaScript | MIT |

Webkit drivers

These drivers control an in-process instance of Webkit

| Name | About | Supported Languages | License |
| --- | --- | --- | --- |
| Browserjet | Runs a custom build of WebKit, controlled by a Node.js interface. | JavaScript | Not specified |
| ghost.py | ghost.py is a WebKit web client written in Python. | Python | MIT |
| headless_browser | Headless browser based on WebKit written in C++. | C++ | Not specified |
| Jabba-Webkit | Jabba's headless WebKit browser for scraping AJAX-powered webpages. | Python | Not specified |
| Jasmine-Headless-Webkit | jasmine-headless-webkit uses the QtWebKit widget to run your specs without needing to render a pixel. | Python, JavaScript, Ruby | Free |
| Python-Webkit | Python-Webkit is a Python extension to WebKit to add full, complete access to WebKit's DOM | Python | GNU |
| Spynner | Programmatic web browsing module with AJAX support for Python | Python | Not specified |
| Webloop | Scriptable, headless WebKit with a Go API. | Go | BSD 3-Clause |
| wkhtmltopdf wkhtmltox wkhtmltoimage | Command line tool rendering HTML into PDF and other image formats. | shell, C | LGPLv3 |
| WKZombie | Functional headless browser (with JSON support) for iOS using WebKit and hpple/libxml2. | Swift | MIT |

Other drivers

These libraries control lesser known browsers or OS-provided web libraries

| Name | About | Supported Languages | License |
| --- | --- | --- | --- |
| Nightmare | Nightmare is a high-level browser automation library built as an easier alternative to PhantomJS. It runs on the Electron engine. | JavaScript | MIT |
| grope | A RubyCocoa interface to the macOS WebKit Framework | RubyCocoa | MIT |
| SlimerJS | SlimerJS is similar to PhantomJS, except that it runs Gecko, the browser engine of Mozilla Firefox, instead of WebKit (and it is not yet truly headless). | JavaScript | Mozilla 2.0 |
| SpecterJS | A scriptable headless Internet Explorer port of PhantomJS. | JavaScript | MIT |
| trifleJS | A headless Internet Explorer browser using the WebBrowser Class with a JavaScript API running on the V8 engine. | JavaScript | MIT |

Fake Browser Engine

These libraries are typically naive or HTML-only browsers

| Name | About | Supported Languages | License |
| --- | --- | --- | --- |
| AngleSharp | HTML parsing library | .NET | MIT |
| Guillotine | A headless browser, written in C# | .NET | LGPL-3.0 |
| benv | Stub a browser environment in Node.js and headlessly test your client-side code. | JavaScript | MIT |
| browser.rb | Headless Ruby browser on top of Nokogiri and TheRubyRacer | Ruby | Not specified |
| BrowserKit | BrowserKit simulates the behavior of a web browser. | PHP | MIT |
| DamonJS | Bot navigating URLs and doing tasks. | JavaScript | Apache |
| Headless | Headless browser support for fast web acceptance testing in .NET | .NET | MIT |
| HeadlessBrowser | A very miniature headless browser, for testing the DOM on Node.js | JavaScript | Not specified |
| HtmlUnit | HtmlUnit is a "GUI-Less browser for Java programs". | Java | Apache |
| Jaunt | Java Web Scraping & Automation API | Java | Not specified |
| JSDom | A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js. | JavaScript | MIT |
| MechanicalSoup | A Python library for automating interaction with websites. | Python | MIT |
| mechanize | Stateful programmatic web browsing. | Python | BSD 3-Clause, ZPL 2.1 |
| node-as-browser | Create a browser-like environment within Node.js | JavaScript | MIT |
| RoboBrowser | A simple, Pythonic library for browsing the web without a standalone web browser. | Python | BSD 3-Clause |
| SimpleBrowser | A flexible and intuitive web browser engine designed for automation tasks. Built on the .NET 4 framework. | .NET | BSD 3-Clause |
| stanislaw | Naive, mechanize-like HTML parser/form driver. | Python | Not specified |
| twill | Twill is a simple language that interacts with basic HTML pages (no JavaScript support). | Python | MIT |
| WeasyPrint | WeasyPrint is a visual rendering engine for HTML and CSS that can export to PDF. It aims to support web standards for printing. | Python | BSD 3-Clause |
| WWW::Mechanize | Headless browser for Perl with many plugins and extensions, notably Test::WWW::Mechanize for testing | Perl | Perl 5 |
| X-RAY | Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP) | JavaScript | MIT |
| Xidel (Internet Tools) | An XQuery-based CLI web scraper for static X/HTML pages and JSON APIs. | FreePascal, XQuery | GPL-2 |
| Zombie.js | Zombie.js is a lightweight framework for testing client-side JavaScript code in a simulated environment. No browser required. | JavaScript | MIT |

Runs in a browser

| Name | About | Supported Languages | License |
| --- | --- | --- | --- |
| DalekJS | [Unmaintained; TestCafé is recommended instead] Automated cross browser testing with JavaScript. | JavaScript | MIT |
| TestCafé | Automated browser testing for the modern web development stack. | JavaScript | MIT |
| Sahi | Sahi is a cross-browser automation/testing tool with the facility to record and playback scripts. | JavaScript, Java, Ruby, PHP | Apache / Commercial |
| WatiN | Web Application Testing In .NET | .NET | Apache 2.0 |

Misc tools

| Name | About | Supported Languages | License |
| --- | --- | --- | --- |
| browser-launcher | Detect and launch browser versions, headlessly or otherwise | JavaScript | MIT |

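In fact, if the page does not load its data with JavaScript, AngleSharp alone is enough.
Once the data is loaded by JavaScript, though, you need a real headless browser component to get at it.
AngleSharp currently executes only simple JavaScript; anything slightly more complex fails. Full JavaScript support is reportedly planned, so stay tuned!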
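
To illustrate the AngleSharp-only case, here is a minimal sketch that parses static markup with no browser involved. The class name StaticHtmlSample, the method ExtractTitlesAsync and the sample markup are hypothetical and not part of the original project; the selector only mimics the carbox structure used later in this post.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using AngleSharp;

// Hypothetical helper for illustration; not part of the original sample project.
public static class StaticHtmlSample
{
    // Parses an HTML string with AngleSharp only (no browser, no JavaScript execution)
    // and returns the trimmed text of every "li.carbox div.carbox-title" element.
    public static async Task<List<string>> ExtractTitlesAsync(string html)
    {
        var context = BrowsingContext.New(Configuration.Default);

        // Feed the markup as a virtual response instead of requesting a URL.
        var document = await context.OpenAsync(req => req.Content(html));

        var titles = new List<string>();
        foreach (var element in document.QuerySelectorAll("li.carbox div.carbox-title"))
        {
            titles.Add(element.TextContent.Trim());
        }

        return titles;
    }
}
```

Calling ExtractTitlesAsync on markup that is already complete when it arrives works fine; on a page whose list is filled in by script after load, the same call would simply return an empty list, which is why the sample below renders the page in Chromium first.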

Code

/*
 * A sample console app: a crawler built with PuppeteerSharp + AngleSharp
 */
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;
using AngleSharp.Parser.Html;
using Newtonsoft.Json;
using PuppeteerSharp;

namespace CrawlerSamples
{
    internal class Program
    {
        private const string Url = "https://store.mall.autohome.com.cn/83106681.html";
        private const int ChromiumRevision = Downloader.DefaultRevision;

        private static async Task Main(string[] args)
        {
            // Download the Chromium revision that PuppeteerSharp will drive
            await Downloader.CreateDefault().DownloadRevisionAsync(ChromiumRevision);

            // Fetch the page and parse it with AngleSharp
            await TestAngleSharp();

            Console.ReadKey();
        }

        private static async Task TestAngleSharp()
        {
            /*
             * Option 1 (disabled): load the HTML document with AngleSharp itself.
             * Note: WithJavaScript requires the AngleSharp.Scripting.Javascript NuGet package.
             * Its JavaScript support is experimental and cannot run complex scripts.
             */
            //IConfiguration config = Configuration.Default.WithDefaultLoader().WithCss().WithCookies().WithJavaScript();
            //IBrowsingContext context = BrowsingContext.New(config);
            //IDocument document = await context.OpenAsync(Url);

            // Option 2: load the rendered HTML with PuppeteerSharp
            var htmlString = await TestPuppeteerSharp();

            // Parse the HTML string into a document
            var parser = new HtmlParser(Configuration.Default);
            var document = parser.Parse(htmlString);

            // Select the list of carbox elements
            var carboxList = document.QuerySelectorAll("div.customGoodsList div.content div.list li.carbox");

            var carModelList = new List<CarModel>();
            foreach (var carbox in carboxList)
            {
                // Convert the element into a CarModel object
                var model = CreateModelWithAngleSharp(carbox);
                carModelList.Add(model);

                // Print the model to the console as JSON
                var jsonString = JsonConvert.SerializeObject(model);
                Console.WriteLine(jsonString);
                Console.WriteLine();
            }

            Console.WriteLine("Total count:" + carModelList.Count);
        }

        private static async Task<string> TestPuppeteerSharp()
        {
            // Run Chromium without a visible window
            var launchOptions = new LaunchOptions { Headless = true };
            // Start the headless browser
            var browser = await Puppeteer.LaunchAsync(launchOptions, ChromiumRevision);

            try
            {
                // Open a new tab
                var page = await browser.NewPageAsync();
                // Navigate to the target URL
                await page.GoToAsync(Url);

                // Return the rendered HTML content of the page
                return await page.GetContentAsync();
            }
            finally
            {
                // Closing the browser also closes every open tab;
                // doing it in finally avoids leaking a Chromium process if navigation fails.
                await browser.CloseAsync();
            }
        }

        private static CarModel CreateModelWithAngleSharp(IParentNode node)
        {
            // QuerySelector returns null when nothing matches, so ?. is used to leave
            // a property null instead of throwing if an entry lacks one of these elements.
            var model = new CarModel
            {
                Title = node.QuerySelector("a div.carbox-title")?.TextContent,
                ImageUrl = node.QuerySelector("a div.carbox-carimg img")?.GetAttribute("src"),
                ProductUrl = node.QuerySelector("a")?.GetAttribute("href"),
                Tip = node.QuerySelector("a div.carbox-tip")?.TextContent,
                OrdersNumber = node.QuerySelector("a div.carbox-number span")?.TextContent
            };

            return model;
        }
    }
}
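
The listing above uses a CarModel type without showing it. The following is a minimal sketch of what it presumably looks like, inferred only from the five properties assigned in CreateModelWithAngleSharp; the actual class in the repository may differ. All properties are assumed to be strings because they are filled from TextContent and GetAttribute, which both return strings.

```csharp
namespace CrawlerSamples
{
    // Assumed shape of the model; inferred from the property names used in the listing above.
    public class CarModel
    {
        // Text of the "carbox-title" element
        public string Title { get; set; }

        // "src" attribute of the product image
        public string ImageUrl { get; set; }

        // "href" attribute of the product link
        public string ProductUrl { get; set; }

        // Text of the "carbox-tip" element
        public string Tip { get; set; }

        // Text of the "carbox-number" span, kept as the raw string from the page
        public string OrdersNumber { get; set; }
    }
}
```

Plain auto-properties are enough here, since JsonConvert.SerializeObject serializes public properties by default.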

Result

[Image: 402416-20180627162713099-235883644.png, screenshot of the run result]

Note

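Note that on the first run, this line:

await Downloader.CreateDefault().DownloadRevisionAsync(ChromiumRevision);

downloads a portable browser package, download-Win64-536395.zip, from the network to your local machine; unpacking it yields a Chromium browser. Expect this step to take a while.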

Source

https://github.com/VAllens/CrawlerSamples

Reposted from: https://www.cnblogs.com/VAllen/p/PuppeteerSharp-AngleSharp-CrawlerSamples.html
