PulsarRPA 教程 3 - 数据提取

PlatonAI

已于 2023-05-28 21:49:05 修改

阅读量616

点赞数

分类专栏： PulsarRPA 文章标签： kotlin 网络爬虫数据挖掘 java 大数据

于 2022-10-31 12:00:43 首次发布

本文链接：https://blog.csdn.net/weixin_48738961/article/details/127534242

版权

PulsarRPA 专栏收录该内容

24 篇文章 12 订阅

订阅专栏

数据提取

PulsarRPA 使用 jsoup 从 HTML 文档中提取数据。 Jsoup 将 HTML 解析为与现代浏览器相同的 DOM。查看 selector-syntax 以查阅所有受支持的 CSS 选择器，这里也有一份中文版标准 CSS 选择器的详细表格。

现代网页源代码变化非常频繁，但是一个网站的网页“看上去”不会变太多，从而保证一致的用户体验。这时候，从视觉特征去审视网页元素就特别有效。为了更好地从视觉特征和数字特征来看待网页，PulsarR 扩展了 CSS，从而可以从视觉特征和数字特征角度解决最复杂的现实问题。

首先准备一个文档：

// 加载一个页面，如果该页面已过期，或者该页面为首次加载，则从互联网上下载该页面
val page = session.load(url, "-expires 1d")
// 将一个网页内容解析为Jsoup文档
val document = session.parse(page)

最简单的数据提取：

// 使用该文档做一些事情
val title = document.selectFirst('.title').text()
val price = document.selectFirst('.price').text()

稍稍复杂一点的例子：

// a with href
val links = document.select("a[href]")

// img with src ending .png
val pngs = document.select("img[src$=.png]")

// div with class=masthead
val masthead = document.select("div.masthead").first()

// direct a after h3
val resultLinks = document.select("h3.r > a")

// finds sibling td element preceded by th who contains text "Best Sellers Rank"
val bsr = document.select("th:contains(Best Sellers Rank) ~ td")

PulsarRPA 扩展了 CSS 和 Jsoup，来解决复杂的现实问题：

为 DOM 的每个 Node 计算了数值特征
扩展 CSS 语法，支持在 CSS 查询中使用数学运算
提供一批实用方法来简化和增强 DOM 操作

PulsarR 为 DOM 的每个 Node 计算了数值特征。目前，PulsarR 支持以下数值特征：

top,       // the top coordinate in pixel of the element
left,      // the left coordinate in pixel of the element
width,     // the width in pixel of the element
height,    // the height in pixel of the element
char,      // number of chars inside this node
txt_nd,    // number of descend text nodes
img,       // number of descend images
a,         // number of descend anchors
sibling,   // number of sibling nodes
child,     // number of children
dep,       // node depth
seq,       // node sequence in the document
txt_dns    // text node density

PulsarRPA 扩展 CSS 语法，支持在 CSS 查询中使用数学运算。

为了弥补现有 CSS 表达式的不足，PulsarR 引入了 :expr(expression) 伪选择器，使得我们可以在 CSS 中做数学运算。

譬如：

/* 选择左侧栏宽 500 的链接 */
a:expr(left < 100 && width == 500)

/* 选择面积超过 1600 的图片 */
img:expr(width * height > 1600)

/* 选择文档中所有 div 元素，这些元素需要包含一张图片，宽和高都在 400 ~ 500 之间 */
div:expr(img == 1 && width > 400 && width < 500 && height > 400 && height < 500)

选择文档中所有 400 x 400 的图片：

val imgs400x400 = document.select("img:expr(width == 400 && height == 400)")

选择文档中所有 div 元素，这些元素宽和高都在 400 ~ 500 之间，并且包含一张图片：

val expr = "img == 1 && width > 400 && width < 500 && height > 400 && height < 500"
val elements = doc.select("div:expr($expr)")

下面的表格列出了标准运算符。

算术运算符

Name	Description
-	The prefix minus operator, like in “-2”
+	The prefix minus operator, like in “+2”
-	The infix minus operator, like in “5-2”
+	The infix plus operator, like in “5+2”
*	The multiplication operator
/	The division operator
^	The power-of operator
%	The modulo operator (remainder)

布尔运算符

Name	Description
=, ==	The equals operator
!=, <>	The not equals operator
!	The prefix not operator, like in !a
>	The greater than operator
>=	The greater equals operator
<	The less than operator
<=	The less equals operator
&&	The and operator
\|\|	The or operator

在 PulsarR 专业版中，我们将会引入更多有趣的数值特征来支持机器学习和人工智能，譬如拓扑结构相关的特征。

在 X-SQL 章节，我们将会详细介绍如何在 X-SQL 中使用 CSS 选择器来选择元素及其属性。一个综合性较强的真实案例是 x-asin.sql（国内镜像），它使用了各种各样的 CSS 选择器来解决最复杂的电商网页数据提取问题，下面是一些片段：

dom_first_attr(dom, '#landingImage, #imgTagWrapperId img, #imageBlock img:expr(width>400)', 'data-old-hires') as `imgsrc`,
dom_first_attr(dom, '#landingImage, #imgTagWrapperId img, #imageBlock img:expr(width>400)', 'data-a-dynamic-image') as `dynamicimgsrcs`,
dom_first_slim_html(dom, '#landingImage, #imgTagWrapperId img, #imageBlock img:expr(width>400)') as `img`,
dom_first_text(dom, '#price tr td:contains(List Price) ~ td') as `listprice`,
dom_first_text(dom, '#price tr td:matches(^Price) ~ td, #price_inside_buybox') as `price`,
dom_first_text(dom, '#price #priceblock_dealprice, #price tr td:contains(Deal of the Day) ~ td') as `withdeal`,
dom_first_text(dom, '#price #dealprice_savings .priceBlockSavingsString, #price tr td:contains(You Save) ~ td') as `yousave`,

PlatonAI

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
PulsarRPA 教程 3 - 数据提取

现代网页源代码变化非常频繁，但是一个网站的网页“看上去”不会变太多，从而保证一致的用户体验。这时候，从视觉特征去审视网页元素就特别有效。为了更好地从视觉特征和数字特征来看待网页，PulsarRPA 扩展了 CSS，从而可以从视觉特征和数字特征角度解决最复杂的现实问题。
复制链接

扫一扫