Scraping and displaying web HTML: collecting scraped data from a multi-screen web page

> RSelenium Basics

> RSelenium Headless Browsing

> RSelenium Vignette

Code example

# We want to make this as easy as possible to use

# So we need to install required packages for the user...

#

if (!require(RSelenium)) install.packages("RSelenium")

if (!require(XML)) install.packages("XML")

if (!require(RJSONIO)) install.packages("RJSONIO")

if (!require(stringr)) install.packages("stringr")

# Data

#

mainPage

businessPage

# StartServer

# We assume RSelenium is not set up, so we check whether the RSelenium

# server is available; if not, we install the RSelenium server.

checkForServer()

# OK, now we start the server

RSelenium::startServer()

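# Assumed reconstruction: the assignment was lost from the source listing

remDr <- remoteDriver(browserName = "firefox")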

# We assume the user has installed Firefox and the Selenium IDE

# https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/

#

# OK, we open Firefox

remDr$open(silent = TRUE) # Open up a Firefox window...

# Now we open the browser and required URL...

# This is the page that matters...

remDr$navigate(businessPage)

# First things first: on the first page, let's get the ids for the radio button,

# name element, and search button. We need all three.

#

radioButton

nameElement

searchButton

# Optional: we can highlight the radio elements returned

# lapply(radioButton,function(x){x$highlightElement()})

# Optional: we can highlight the nameElement returned

# lapply(nameElement,function(x){x$highlightElement()})

# Optional: we can highlight the searchButton returned

# lapply(searchButton,function(x){x$highlightElement()})

# Now we can select and press the third radio button

radioButton[[3]]$clickElement()

# We fill in the required name...

nameElement[[1]]$sendKeysToElement(list("PROAÑO & ASOCIADOS CIA. LTDA."))

# This is subtle but required: the page triggers a drop-down list, so rather than

# hitting the searchButton, we first select and hit enter in the drop-down menu...

selectElement

selectElement[[1]]$clickElement()

# OK, now we can click the search button, which will cause the next page to open

searchButton[[1]]$clickElement()

# New Page opens...

#

# OK, so now we first pull the list of buttons...

finPageButton

# Now we can press the required button to open the page we want to get to...

finPageButton[[9]]$clickElement()

# We are now on the required page.

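We are now on the target page [see image].

Extracting the table values

The next step is to extract the table values. To do this, we pull the elements matching the .z-listitem CSS selector. We can then check that we are actually seeing the rows of data, and go on to extract the returned values and populate a list or data frame.

# OK, now we need to extract the table; we identify and pull out the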

# '.z-listitem' and assign to modalWindow

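# Assumed reconstruction: the assignment was lost from the source listing

modalWindow <- remDr$findElements(using = "css selector", value = ".z-listitem")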

# Now we can extract the lines from modalWindow. Each row is

# returned as a single string of text, so we split it into three based on the

# line marker "\n"

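# Assumed reconstruction: the assignment was lost from the source listing

lineText <- strsplit(modalWindow[[1]]$getElementText()[[1]], "\n")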

lineText

Here, the result is:

> lineText

[[1]]

[1] "10"

[2] "OPERACIONES DE INGRESO CON PARTES RELACIONADAS EN PARAÍSOS FISCALES,JURISDICCIONES DE MENOR IMPOSICIÓN Y REGÍMENES FISCALES PREFERENTES"

[3] "0.00"
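To make the "populate a list or data frame" step concrete, here is a minimal sketch in base R. It does not talk to Selenium: `rawRows` is made-up placeholder text standing in for what `getElementText()` might return for each row, and the column names `code`, `description`, and `value` are assumptions chosen for illustration, not names from the original page.

```r
# Placeholder strings standing in for the text of each '.z-listitem' row.
# Both rows are invented for illustration; real rows come from the browser.
rawRows <- c(
  "10\nEXAMPLE ROW LABEL A\n0.00",
  "20\nEXAMPLE ROW LABEL B\n15.50"
)

# Split each row on the newline marker; strsplit() returns a list of
# character vectors, one per row.
lineText <- strsplit(rawRows, "\n", fixed = TRUE)

# Keep only rows that split into exactly three fields, then bind them row-wise
# into a character matrix.
complete <- Filter(function(fields) length(fields) == 3, lineText)
rowMatrix <- do.call(rbind, complete)

# Assemble a data frame; the column names are assumptions.
tableData <- data.frame(
  code = rowMatrix[, 1],
  description = rowMatrix[, 2],
  value = as.numeric(rowMatrix[, 3]),
  stringsAsFactors = FALSE
)

print(tableData)
```

Filtering on the field count before binding is a defensive choice: a row with a missing or extra line would otherwise be silently recycled by `rbind()`.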

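Handling hidden data

Selenium WebDriver and RSelenium only interact with the visible elements of a web page. If we try to read the whole table, we only get back the visible (non-hidden) table items.

We can work around this by scrolling to the bottom of the table. The scroll forces the table to populate, after which we can extract the complete table.

# Select the .z-listBox-body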

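# Assumed reconstruction: the assignment was lost from the source listing

modalWindow <- remDr$findElements(using = "css selector", value = ".z-listBox-body")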

# Now we tell the window we want to scroll to the bottom of the table

# This triggers the table to populate all the rows

modalWindow[[1]]$executeScript("window.scrollTo(0,document.body.scrollHeight)")

# Now we can extract the complete table

modalWindow

lineText

lineText

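What the code does

The code example above is self-contained. By that I mean it should install everything you need, including the required packages. Once the dependent R packages are installed, the R code calls checkForServer(), which installs the Selenium server if it is not already installed. This may take a while.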
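As a rough sketch of the dependency check described here, the snippet below reports which of the required packages are missing before anything is installed. The helper name `missingPackages` is hypothetical, and the actual `install.packages()` call is left commented out so that running the sketch changes nothing on your machine.

```r
# Packages the walkthrough relies on (taken from the listing above).
requiredPackages <- c("RSelenium", "XML", "RJSONIO", "stringr")

# Hypothetical helper: return the subset of `packages` that cannot be found.
# requireNamespace() checks availability without attaching the package.
missingPackages <- function(packages) {
  isInstalled <- vapply(
    packages,
    function(pkg) requireNamespace(pkg, quietly = TRUE),
    logical(1)
  )
  packages[!isInstalled]
}

toInstall <- missingPackages(requiredPackages)

if (length(toInstall) > 0) {
  message("Missing packages: ", paste(toInstall, collapse = ", "))
  # Uncomment to install; left disabled so the sketch has no side effects.
  # install.packages(toInstall)
} else {
  message("All required packages are available.")
}
```

Note that the one-line `if (!require(pkg)) install.packages("pkg")` pattern installs a missing package but does not attach it afterwards, so a `library()` call is still needed on the first run.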

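My suggestion is that you step through the code, because I have not included any delays (which you would want in production). Note also that I have not optimized for speed, but rather for a bit of clarity [from my perspective]...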
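One way to add the delays mentioned here is a small polling helper that retries until the page has produced something. This is only a sketch under assumptions: `waitFor` is a hypothetical helper, not part of RSelenium, and the `slowProbe` function merely simulates a page that needs a few attempts to load.

```r
# Hypothetical helper: poll `probe` until it returns a non-empty result
# or `timeout` seconds elapse. `probe` is any zero-argument function.
waitFor <- function(probe, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat an error from the probe the same as "nothing found yet".
    result <- tryCatch(probe(), error = function(e) NULL)
    if (length(result) > 0) {
      return(result)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out after ", timeout, " seconds waiting for the page")
    }
    Sys.sleep(interval)
  }
}

# Stand-in probe that only succeeds on its third call, simulating a slow page.
attempts <- 0
slowProbe <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) list() else list("element")
}

found <- waitFor(slowProbe, timeout = 5, interval = 0.1)
print(attempts)
```

With a live session, the probe could wrap a lookup such as `function() remDr$findElements(using = "css selector", value = ".z-listitem")`, so that the script only continues once at least one row has appeared.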

The code was shown to work on:

> Mac OS X 10.11.5

> RStudio 0.99.893

> R version 3.2.4 (2016-03-10) -- "Very Secure Dishes"
