AngleSharp Examples

XBMY

Article link: https://blog.csdn.net/cxb2011/article/details/102910616

Deploying AngleSharp

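Install the AngleSharp packages through NuGet (the original post listed them in a screenshot). When installing, make sure the versions of the related packages match one another.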

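API Documentation

Core API

AngleSharp was built around a practical, standards-compliant API. If you only care about parsing a single document or stylesheet, you can always use the individual parsers directly, e.g. HtmlParser or CssParser. In most scenarios (parsing and working with web pages) we recommend BrowsingContext, which is located in the AngleSharp namespace. This namespace also contains some types with extension methods and the Url class, which strictly follows the algorithms described in the WHATWG specification.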
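
For the single-document case mentioned above, a minimal sketch might look like this. It assumes AngleSharp 0.10 or later, where HtmlParser lives in the AngleSharp.Html.Parser namespace and offers ParseDocument; the ParserSketch class and the CountParagraphs helper are invented for illustration only.

```csharp
using System;
using AngleSharp.Html.Parser;

public static class ParserSketch
{
    // Hypothetical helper: parses markup without a browsing context
    // and counts the <p> elements that end up in the DOM.
    public static int CountParagraphs(string html)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);
        return document.QuerySelectorAll("p").Length;
    }

    public static void Main()
    {
        // The second <p> implicitly closes the first one, as in a browser.
        Console.WriteLine(CountParagraphs("<p>One<p>Two"));
    }
}
```

The parser alone is enough here because nothing has to be loaded or scripted; once requests, cookies, or scripting matter, a BrowsingContext is the more suitable entry point.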

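The AngleSharp.Attributes namespace additionally provides attributes for decorating interfaces (as well as enumerations and delegates). We have:

DomAccessorAttribute, to define special accessors such as getters, setters, or deleters.
DomHistoricalAttribute, to indicate an obsolete status.
DomDescriptionAttribute, to store description strings for parts of the DOM.
DomNoInterfaceObjectAttribute, to declare types that are interface-only, i.e. for which no object implementation is available.
DomPutForwardsAttribute, to set the name of the member of the related object to which the input should be forwarded.
DomNameAttribute, to denote the original API name.

The API of the interfaces / DOM objects has been changed in such a way that it is still the original DOM API (nothing is missing), but naming and types conform to .NET conventions and are (hopefully) easier to use.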

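Configuration

IConfiguration is the main interface for extending AngleSharp. If we do not care about it, we do not have to provide an IConfiguration to AngleSharp when parsing a document or a URL. In that scenario the default configuration is used.

This default configuration can be chosen by calling the SetDefault method of the Configuration class. We only need to pass an instance of a custom configuration to serve as the default. The Configuration class is the default implementation of the IConfiguration interface, and it is usually a good starting point for providing a custom configuration.

The role of an IConfiguration object is to provide an enumeration of the services (if any) that a browsing context should use or create. These services extend the DOM beyond its static version.

The following example sets the Culture on a new IConfiguration object: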

var config = Configuration.Default.WithCulture("de-de");

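If no Culture is specified, the culture is taken from the currently active thread.

Note that Configuration is immutable by default: every extension method returns either the current instance (unmodified) or a new object (modified relative to the one originally passed in).

We may also create our own class that is more convenient and flexible than the immutable implementation. We can either extend the default implementation or implement the interface. Keep in mind, however, that the extension methods are meant to make working with IConfiguration simple and straightforward already.

Implementing the interface is possible as well, but it requires more work, because every property and method needs to be reimplemented. However, since the properties of the default implementation are not virtual, this may be the only way to provide the desired setup. In general there are few reasons to implement IConfiguration ourselves.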
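
The immutability described above can be observed with a small sketch. ConfigSketch and DerivedIsNewObject are hypothetical names, and the check relies on the assumption that WithCulture actually changes the configuration and therefore returns a new object rather than the instance it was called on.

```csharp
using System;
using AngleSharp;

public static class ConfigSketch
{
    // Hypothetical helper: derives a German-culture configuration and
    // reports whether the result is a different object than the input.
    public static bool DerivedIsNewObject()
    {
        var baseConfig = Configuration.Default;
        var german = baseConfig.WithCulture("de-de");

        // baseConfig itself is not modified by the call above.
        return !ReferenceEquals(baseConfig, german);
    }

    public static void Main()
    {
        Console.WriteLine(DerivedIsNewObject());
    }
}
```

Because each call yields a fresh object, a base configuration can be shared safely and specialised per browsing context by chaining further With... calls.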

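Extension Points

AngleSharp is intended to be a general-purpose HTML5 parser that is accessible from the .NET world and written entirely in managed code. However, some applications may want to go beyond a parser. Even the parser on its own needs a lot of outside help to create a DOM.

To give users a complete DOM, AngleSharp extends the HTML5 parser with a real DOM implementation. This, in turn, brings some additional dependencies. What if, for example, an HtmlAudioElement should play audio from its source? One could simply write a wrapper around the element that reads Source and watches for changes. However, such a wrapper may be incompatible with the rest of the DOM (and/or behave differently from it).

Defining such external behavior in the form of interfaces avoids this inconsistency. An object that implements such an interface can be registered for use in AngleSharp.

This page lists and explains the individual extension points. Interfaces that are currently missing (but planned) are only sketched.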

IStylingService

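When a style sheet (including one embedded in the markup) or a script block is encountered, AngleSharp tries to find a matching engine. An implementation of IStylingService looks, for example, at the provided MIME type and returns the associated IStyleEngine object or null. In the latter case the next IStylingService object is asked, until an appropriate engine has been found or all styling services have been asked. The same algorithm applies to IScriptingService. These services merely describe the capabilities of factories, lightweight repositories, or bindings.

AngleSharp already includes a lightweight CSS parser (for CSS selectors). This was one of AngleSharp's design goals. It is also a necessity, because the (HTML) DOM is strongly coupled to CSS in some places. One example is the querySelector and querySelectorAll methods. They require a CSS parser (or at least a CSS selector parser), which in the end produces an object that is able to match certain elements.

However, an HTML browser may or may not know about styling languages other than CSS. Currently CSS is used by default, i.e. when nothing such as text/css is specified in the type attribute. A style parser can nevertheless be registered for a specific type (or for several types). In this way it is also possible to register a custom CSS parser that overrides the one currently provided.

One example is the official AngleSharp.Css library, which implements the full CSSOM.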
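
The lookup described above (ask each service in turn until one returns an engine) can be sketched in plain C#. The interfaces below are hypothetical stand-ins, deliberately not named like the real AngleSharp types; the actual IStylingService and IStyleEngine have different members.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a style engine.
public interface ISketchEngine { string Name { get; } }

// Hypothetical stand-in for a styling service.
public interface ISketchStylingService
{
    // Returns an engine for the MIME type, or null if the type is unsupported.
    ISketchEngine GetEngine(string mimeType);
}

public sealed class SketchEngine : ISketchEngine
{
    public SketchEngine(string name) { Name = name; }
    public string Name { get; }
}

public sealed class SingleTypeService : ISketchStylingService
{
    private readonly string _mimeType;
    private readonly ISketchEngine _engine;

    public SingleTypeService(string mimeType, ISketchEngine engine)
    {
        _mimeType = mimeType;
        _engine = engine;
    }

    public ISketchEngine GetEngine(string mimeType) =>
        string.Equals(mimeType, _mimeType, StringComparison.OrdinalIgnoreCase) ? _engine : null;
}

public static class EngineLookup
{
    // Asks each service in registration order; the first non-null engine wins.
    public static ISketchEngine Find(IEnumerable<ISketchStylingService> services, string mimeType)
    {
        foreach (var service in services)
        {
            var engine = service.GetEngine(mimeType);
            if (engine != null) return engine;
        }
        return null; // every service was asked and none matched
    }

    public static void Main()
    {
        var services = new List<ISketchStylingService>
        {
            new SingleTypeService("text/less", new SketchEngine("less")),
            new SingleTypeService("text/css", new SketchEngine("css")),
        };
        Console.WriteLine(Find(services, "text/css")?.Name ?? "(none)");
    }
}
```

In this sketch registration order decides which service wins when several support the same type, so a custom parser would have to be placed ahead of the built-in one to override it.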

IScriptingService

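By default AngleSharp does not provide a scripting engine. Any JavaScript engine would of course be a great addition, since JavaScript is the programming language of the Web.

However, there is currently no intention to provide an official, integrated solution. The AngleSharp.Scripting project, hosted in our organization, is a sample and experimental project that demonstrates how simple it is to write such an extension. Here we could also start to think about allowing C# as a scripting language, which is certainly possible, for example backed by scriptcs or any other solution. That would be a nice addition, and it could also be something entirely different. In the long run it would be great for AngleSharp to support multiple scripting engines.

Officially, we are trying to establish AngleSharp.Js as the solution. It is a project independent of the core, and it requires a high degree of maintenance and a lot of effort. Any contribution would be greatly appreciated.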

ISpellCheckService

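Allows individual spell checkers to be registered. Each spell checker is registered for its culture, which makes spell checking sensitive to whatever culture a web page or text may use.

The API allows ignoring certain words, querying whether a word is spelled correctly, and getting suggestions for the correct spelling of a word. At present the API is synchronous, but it may switch to a better, fully asynchronous version in the future.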

IResourceService

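This extension is implemented in several interfaces, see:

IAudioService, for HtmlAudioElement
IVideoService, for HtmlVideoElement
IImageService, for HtmlImageElement
IObjectService, for HtmlObjectElement

The basic idea is to determine certain properties of the included resource. A specialized implementation of IResourceInfo carries the information related to the resource; for an image this would be its dimensions and more. In the case of HtmlAudioElement a whole media controller would also be implemented, allowing playback of the media stream.

In the case of HtmlObjectElement this leads directly to plugins such as Adobe Flash or others. Obviously, the AngleSharp core is not responsible for such highly specialized tasks.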

INavigator

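Creates an INavigator instance for an IWindow instance. INavigator may seem fairly generic at first, but it can be specialized for the underlying IWindow instance, especially with regard to accessing media resources (e.g. a webcam or a microphone).

AngleSharp does not implement this interface itself, in order to leave room for fuller and more specialized implementations. In addition, the ability described above to improve the user experience with external peripherals seems very attractive.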

IHistory

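The IHistory interface is retrieved from a service (usually in the form of a creator). The service describes the ability to create a new IHistory DOM object, which is then associated with a browsing context.

IWindow

Several DOM elements can be implemented in a more specialized and useful way. One key element is the IWindow implementation. It is not strictly required, because all DOM interaction goes through IDocument, which does not depend on IWindow. However, especially in a scripting environment, the IWindow instance plays a very important role.

For more advanced scenarios such as rendering, the IWindow interface looks essential. A dedicated service therefore lets users register a custom IWindow creator. Note that such a custom creator currently has to derive at least from EventTarget (which implements IEventTarget). This requirement is expected to be dropped in the future.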

ILoader

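Loading of documents and resources in AngleSharp can be fully customized. The main interface is ILoader, which is the basis for two more specialized loaders: IDocumentLoader and IResourceLoader. The former is used to load real documents in a browsing context (hence there is at most one IDocumentLoader per IBrowsingContext); the latter is used to load resources within an IDocument. Obviously we need at most one IResourceLoader per IDocument.

This architecture has two major advantages:

Responsibilities are clearly separated, and each context (browsing context or document) can keep track of its own requests.
Resource loading can easily be switched off (even for specific elements) without affecting document loading or form submission.

There is also a base implementation called BaseLoader.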
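
As a rough illustration of this separation, the sketch below checks which loaders a browsing context ends up with. It assumes the WithDefaultLoader extension method and the LoaderOptions type of AngleSharp 0.10 and later, which are not part of the original text; LoaderSketch and its helpers are hypothetical names.

```csharp
using System;
using AngleSharp;
using AngleSharp.Io;

public static class LoaderSketch
{
    // Hypothetical helper: is a document loader registered for this configuration?
    public static bool HasDocumentLoader(IConfiguration config) =>
        BrowsingContext.New(config).GetService<IDocumentLoader>() != null;

    // Hypothetical helper: is a resource loader registered for this configuration?
    public static bool HasResourceLoader(IConfiguration config) =>
        BrowsingContext.New(config).GetService<IResourceLoader>() != null;

    public static void Main()
    {
        // Assumption: the default options load documents but not resources.
        var documentsOnly = Configuration.Default.WithDefaultLoader();

        // Assumption: resource loading has to be switched on explicitly.
        var withResources = Configuration.Default.WithDefaultLoader(
            new LoaderOptions { IsResourceLoadingEnabled = true });

        Console.WriteLine(HasDocumentLoader(documentsOnly));
        Console.WriteLine(HasResourceLoader(documentsOnly));
        Console.WriteLine(HasResourceLoader(withResources));
    }
}
```

Keeping the two loaders as separate services is what makes it possible to fetch pages while leaving images, stylesheets, and scripts untouched.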

Getting Started

Examples

AngleSharp examples: https://github.com/AngleSharp/AngleSharp/tree/master/doc

Parsing a Well-Defined Document

AngleSharp handles well-defined documents well:

var source = @"
<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content=""initial-scale=1, minimum-scale=1, width=device-width"">
  <title>Error 404 (Not Found)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/errors/logo_sm_2.png) no-repeat}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/errors/logo_sm_2_hr.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/errors/logo_sm_2_hr.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/errors/logo_sm_2_hr.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:55px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>404.</b> <ins>That’s an error.</ins>
  <p>The requested URL <code>/error</code> was not found on this server.  <ins>That’s all we know.</ins>";

//Use the default configuration for AngleSharp
var config = Configuration.Default;

//Create a new context for evaluating webpages with the given config
var context = BrowsingContext.New(config);

//Just get the DOM representation
var document = await context.OpenAsync(req => req.Content(source));

//Serialize it back to the console
Console.WriteLine(document.DocumentElement.OuterHtml);

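We define some source code, create a BrowsingContext, and call its OpenAsync method. OpenAsync lets us parse documents from any kind of request, for example a request to a web server. The callback style used here is called a "virtual request": no real request is issued, everything stays inside the code.

In this case we use the provided source code as the content of the response. The content of the response is then parsed as an HTML document. Afterwards we serialize the DOM back to a string, and finally we print that string to the console.

Simple Document Manipulation

AngleSharp constructs the DOM according to the official HTML5 specification. This also means that the resulting model is fully interactive and can be used for simple manipulation. The following example creates a document and changes the tree structure by inserting another paragraph element containing some text.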

//Namespaces used by the method samples: System, System.Linq, System.Threading.Tasks, AngleSharp
static async Task FirstExample()
{
    //Use the default configuration for AngleSharp
    var config = Configuration.Default;

    //Create a new context for evaluating webpages with the given config
    var context = BrowsingContext.New(config);

    //Parse the document from the content of a response to a virtual request
    var document = await context.OpenAsync(req => req.Content("<h1>Some example source</h1><p>This is a paragraph element"));

    //Do something with document like the following
    Console.WriteLine("Serializing the (original) document:");
    Console.WriteLine(document.DocumentElement.OuterHtml);

    var p = document.CreateElement("p");
    p.TextContent = "This is another paragraph.";

    Console.WriteLine("Inserting another element in the body ...");
    document.Body.AppendChild(p);

    Console.WriteLine("Serializing the document again:");
    Console.WriteLine(document.DocumentElement.OuterHtml);
}

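The parser creates a new IHtmlDocument instance, which can then be queried to find matching nodes. In the sample code above we also create another IElement, in this case an IHtmlParagraphElement, and append it to the Body node.

Getting Certain Elements

AngleSharp exposes all DOM lists as IEnumerable<T>, for example IEnumerable<INode> for the NodeList class. This allows us to combine LINQ with the DOM functionality that is already provided, such as the QuerySelectorAll method.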

static async Task UsingLinq()
{
    //Create a new context for evaluating webpages with the default config
    var context = BrowsingContext.New(Configuration.Default);

    //Create a document from a virtual request / response pattern
    var document = await context.OpenAsync(req => req.Content("<ul><li>First item<li>Second item<li class='blue'>Third item!<li class='blue red'>Last item!</ul>"));

    //Do something with LINQ
    var blueListItemsLinq = document.All.Where(m => m.LocalName == "li" && m.ClassList.Contains("blue"));

    //Or directly with CSS selectors
    var blueListItemsCssSelector = document.QuerySelectorAll("li.blue");

    Console.WriteLine("Comparing both ways ...");

    Console.WriteLine();
    Console.WriteLine("LINQ:");

    foreach (var item in blueListItemsLinq)
    {
        Console.WriteLine(item.TextContent);
    }

    Console.WriteLine();
    Console.WriteLine("CSS:");

    foreach (var item in blueListItemsCssSelector)
    {
        Console.WriteLine(item.TextContent);
    }
}

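Because the All property of IDocument returns every IElement node contained in the document, we can use LINQ very effectively. QuerySelectorAll, in turn, returns the same kind of object as All, an IHtmlCollection<IElement>. It can therefore be filtered further with LINQ as well, and the list it returns has already been filtered by the selector.

The same set as All can also be obtained with a selector, namely the special asterisk (*) selector: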

//Same as document.All
var blueListItemsLinq = document.QuerySelectorAll("*").Where(m => m.LocalName == "li" && m.ClassList.Contains("blue"));
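
Since the result of QuerySelectorAll is enumerable, it can also be projected instead of filtered. The following sketch collects attribute values; LinkSketch and GetHrefsAsync are hypothetical names, and the markup is made up for the example.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

public static class LinkSketch
{
    // Hypothetical helper: returns the raw href value of every <a> that has one.
    public static async Task<string[]> GetHrefsAsync(string html)
    {
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(html));

        // The attribute selector filters, LINQ's Select projects.
        return document.QuerySelectorAll("a[href]")
            .Select(a => a.GetAttribute("href"))
            .ToArray();
    }

    public static async Task Main()
    {
        var hrefs = await GetHrefsAsync("<a href='/a'>A</a><a>B</a><a href='/c'>C</a>");
        foreach (var href in hrefs)
        {
            Console.WriteLine(href);
        }
    }
}
```

GetAttribute returns the attribute exactly as written in the markup, so relative URLs stay relative in this sketch.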

Getting Single Elements

static async Task SingleElements()
{
    //Create a new context for evaluating webpages with the default config
    var context = BrowsingContext.New(Configuration.Default);

    //Create a new document
    var document = await context.OpenAsync(req => req.Content("<b><i>This is some <em> bold <u>and</u> italic </em> text!</i></b>"));

    var emphasize = document.QuerySelector("em");

    Console.WriteLine("Difference between several ways of getting text:");
    Console.WriteLine();
    Console.WriteLine("Only from C# / AngleSharp:");
    Console.WriteLine();
    Console.WriteLine(emphasize.ToHtml());   //<em> bold <u>and</u> italic </em>
    Console.WriteLine(emphasize.TextContent);   // bold and italic

    Console.WriteLine();
    Console.WriteLine("From the DOM:");
    Console.WriteLine();
    Console.WriteLine(emphasize.InnerHtml);  // bold <u>and</u> italic
    Console.WriteLine(emphasize.OuterHtml);  //<em> bold <u>and</u> italic </em>
    Console.WriteLine(emphasize.TextContent);// bold and italic
}

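The output statements try to demonstrate the differences between several ways of obtaining a string from a node. The DOM property OuterHtml actually uses the ToHtml() version to generate the HTML code. The other variants all differ. Text() is just a helper for stripping text (omitting unwanted text content such as ...).

Extension methods such as ToHtml() and Text() can be found in the AngleSharp.Extensions namespace.

Running JS Statements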

static async Task SimpleScriptingSample()
{
    //We require a custom configuration
    var config = Configuration.Default.WithJs();

    //Create a new context for evaluating webpages with the given config
    var context = BrowsingContext.New(config);

    //This is our sample source, we will set the title and write on the document
    var source = @"<!doctype html>
        <html>
        <head><title>Sample</title></head>
        <body>
        <script>
        document.title = 'Simple manipulation...';
        document.write('<span class=greeting>Hello World!</span>');
        </script>
        </body>";

    var document = await context.OpenAsync(req => req.Content(source));

    //Modified HTML will be output
    Console.WriteLine(document.DocumentElement.OuterHtml);
}

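This code simply parses the given HTML, encounters the provided JavaScript, and executes it. The JavaScript manipulates the document: it changes the document's title and appends more HTML to be parsed. What is printed to the console is the modified (serialized) HTML.

More Complex JavaScript DOM Interaction

With JavaScript enabled in AngleSharp, DOM manipulation such as creating, appending, or removing elements is straightforward.

The following sample code performs a DOM query, creates a new element, and removes an existing one.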

static async Task ExtendedScriptingSample()
{
    //We require a custom configuration with JavaScript and CSS
    var config = Configuration.Default.WithJs().WithCss();

    //Create a new context for evaluating webpages with the given config
    var context = BrowsingContext.New(config);

    //This is our sample source, we will do some DOM manipulation
    var source = @"<!doctype html>
        <html>
        <head><title>Sample</title></head>
        <style>
        .bold {
        font-weight: bold;
        }
        .italic {
        font-style: italic;
        }
        span {
        font-size: 12pt;
        }
        div {
        background: #777;
        color: #f3f3f3;
        }
        </style>
        <body>
        <div id=content></div>
        <script>
        (function() {
        var doc = document;
        var content = doc.querySelector('#content');
        var span = doc.createElement('span');
        span.id = 'myspan';
        span.classList.add('bold', 'italic');
        span.textContent = 'Some sample text';
        content.appendChild(span);
        var script = doc.querySelector('script');
        script.parentNode.removeChild(script);
        })();
        </script>
        </body>";

    var document = await context.OpenAsync(req => req.Content(source));

    //HTML will have changed completely (e.g., no more script element)
    Console.WriteLine(document.DocumentElement.OuterHtml);
}

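In principle, other JavaScript engines can be added as well. Manually wrapped objects naturally offer better performance than the automatic, reflection-based version. Nevertheless, the AngleSharp.Js library (available on NuGet) shows the possibilities and the basics of binding an existing JavaScript engine to AngleSharp.

Events in JavaScript and C#

The following example starts exactly like the previous two. We create a custom configuration that includes the JavaScript engine. After enabling scripting (and, where needed, styling), we can parse our document.

The sample document consists of a script that calls the console.log method, once before the listeners are added and once afterwards.

The load listener is called once the document has been loaded completely. This happens after the provided JavaScript has been executed, so we should see this event at the end. We also register another event listener, which is called when a custom event is dispatched.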

public static async Task EventScriptingExample()
{
    //We require a custom configuration
    var config = Configuration.Default.WithJs();

    //Create a new context for evaluating webpages with the given config
    var context = BrowsingContext.New(config);

    //This is our sample source, we will trigger the load event
    var source = @"<!doctype html>
        <html>
        <head><title>Event sample</title></head>
        <body>
        <script>
        console.log('Before setting the handler!');

        document.addEventListener('load', function() {
        console.log('Document loaded!');
        });

        document.addEventListener('hello', function() {
        console.log('hello world from JavaScript!');
        });

        console.log('After setting the handler!');
        </script>
        </body>";

    var document = await context.OpenAsync(req => req.Content(source));

    //HTML should be output in the end
    Console.WriteLine(document.DocumentElement.OuterHtml);

    //Register Hello event listener from C# (we also have one in JS)
    document.AddEventListener("hello", (s, ev) =>
    {
        Console.WriteLine("hello world from C#!");
    });

    var e = document.CreateEvent("event");
    e.Init("hello", false, false);
    document.Dispatch(e);
}
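
The C# half of this example does not depend on a script engine. A reduced sketch using only the calls already shown (AddEventListener, CreateEvent, Init, Dispatch) on a default configuration could look like this; EventSketch and CountHelloEventsAsync are hypothetical names.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;

public static class EventSketch
{
    // Hypothetical helper: dispatches one custom "hello" event and
    // returns how often the C# listener was invoked.
    public static async Task<int> CountHelloEventsAsync()
    {
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content("<p>Events</p>"));

        var calls = 0;
        document.AddEventListener("hello", (sender, ev) => calls++);

        // Same event setup as in the sample above: no bubbling, not cancelable.
        var e = document.CreateEvent("event");
        e.Init("hello", false, false);
        document.Dispatch(e);

        return calls;
    }

    public static async Task Main()
    {
        Console.WriteLine(await CountHelloEventsAsync());
    }
}
```

Counting invocations instead of writing to the console makes the listener easy to check in a unit test.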
