HTML table scraping in R

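This article shows how to combine PhantomJS with R to scrape dynamically rendered web pages using a headless browser, with particular attention to batch-processing multiple URLs and the file handling involved. Using a scrape.js script together with R code, the table data is scraped and parsed; the approach suits data analysts and developers who need to pull information from web pages quickly.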

The issue here is that the page is rendered via JavaScript, so rvest alone will not work: it only sees the HTML the server sends, not the table the script builds afterwards. One of the simplest ways to scrape it is to use a headless web browser such as PhantomJS.

First, download the appropriate version of PhantomJS for your platform and place the executable in your working directory. On Windows, that means phantomjs.exe sits in the same working directory as the R script.
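Before going further, it can help to confirm from R that the executable really is where the script expects it. The following is only a sketch: `find_phantomjs` is a hypothetical helper name, not part of any package, and the fallback to the system PATH is an assumption about how you might have installed PhantomJS.

```r
# Hypothetical helper: look for a PhantomJS binary in `dir`,
# then fall back to whatever is on the system PATH.
find_phantomjs <- function(dir = getwd()) {
  exe <- if (.Platform$OS.type == "windows") "phantomjs.exe" else "phantomjs"
  candidate <- file.path(dir, exe)
  if (file.exists(candidate)) {
    return(normalizePath(candidate))
  }
  # Sys.which() returns "" when the command is not found
  on_path <- unname(Sys.which("phantomjs"))
  if (nzchar(on_path)) on_path else NA_character_
}
```

If `find_phantomjs()` returns `NA`, the `system()` calls later in this answer will fail before any page is rendered.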

Create a scrape.js file:

// scrape.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'page.html';

page.open('https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html', function (status) {
  // page.content is the DOM after the page's JavaScript has run
  var content = page.content;
  // Write the rendered HTML to page.html, overwriting any existing file
  fs.write(path, content, 'w');
  phantom.exit();
});

This scrape.js file, once run, will create a page.html file in your working directory. Back in R or RStudio you can do the following:

library(tidyverse)
library(rvest)

# Run scrape.js with PhantomJS to create the file page.html
system("./phantomjs scrape.js")

# Now we should be in business as usual:
read_html('page.html') %>%
  html_nodes("table#Tabc") %>%
  html_table(header = TRUE) %>%
  .[[1]] %>%
  as_tibble()

# A tibble: 504 x 38
   Codigo `Estatus asigna~  Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
 1 BTY5W~ ECO                  1               35               20           43212.              1.5            1762.              1.5
 2 BTY5W~ ECO                  2               35               20           43212.              1.5            1762.              1.5
 3 BTY5W~ ECO                  3               35               20           43212.              1.5            1762.              1.5
 4 BTY5W~ ECO                  4               35               20           43212.              1.5            1762.              1.5
 5 BTY5W~ ECO                  5               35               20           43212.              1.5            1762.              1.5
 6 BTY5W~ ECO                  6               35               20           43212.              1.5            1762.              1.5
 7 BTY5W~ ECO                  7               35               20           43212.              1.5            1762.              1.5
 8 BTY5W~ ECO                  8               35               20           43212.              1.5            1762.              1.5
 9 BTY5W~ ECO                  9               35               20           43212.              1.5            1762.              1.5
10 BTY5W~ ECO                 10               35               20           43212.              1.5            1762.              1.5
# ... with 494 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)`, `Bloque de Potencia 03 (MW)`,
#   `Costo Incremental de generacion Bloque 03 ($/MWh)`, `Bloque de Potencia 04 (MW)`, `Costo Incremental de generacion Bloque 04
#   ($/MWh)`, `Bloque de Potencia 05 (MW)`, `Costo Incremental de generacion Bloque 05 ($/MWh)`, `Bloque de Potencia 06
#   (MW)`, `Costo Incremental de generacion Bloque 06 ($/MWh)`, `Bloque de Potencia 07 (MW)`, `Costo Incremental de generacion
#   Bloque 07 ($/MWh)`, `Bloque de Potencia 08 (MW)`, `Costo Incremental de generacion Bloque 08 ($/MWh)`, `Bloque de Potencia
#   09 (MW)`, `Costo Incremental de generacion Bloque 09 ($/MWh)`, `Bloque de Potencia 10 (MW)`, `Costo Incremental de
#   generacion Bloque 10 ($/MWh)`, `Bloque de Potencia 11 (MW)`, `Costo Incremental de generacion Bloque 11 ($/MWh)`, `Reserva
#   rodante 10 min (MW)`, `Costo Reserva rodante 10 min ($/MW)`, `Reserva no rodante 10 min (MW)`, `Costo Reserva no rodante 10
#   min ($/MW)`, `Reserva rodante suplementaria (MW)`, `Costo Reserva rodante suplementaria ($/MW)`, `Reserva no rodante
#   suplementaria (MW)`, `Costo Reserva no rodante suplementaria ($/MW)`, `Reserva regulacion secundaria (MW)`, `Costo Reserva
#   regulacion secundaria ($/MW`
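Since `system()` returns the exit status of the command it ran, you may want to check that PhantomJS actually produced a file before parsing it. This is a minimal sketch; `render_succeeded` is a hypothetical helper name introduced here for illustration. Note its limits: it catches a missing binary or a crash, but because scrape.js does not inspect `status`, a failed page load can still leave a small, table-less page.html behind.

```r
# Hypothetical helper: TRUE only if the command exited with status 0
# and left a non-empty output file behind.
render_succeeded <- function(status, out_file) {
  isTRUE(status == 0) &&
    file.exists(out_file) &&
    file.size(out_file) > 0
}

# Possible usage (commented out because it needs PhantomJS on disk):
# status <- system("./phantomjs scrape.js")
# if (!render_succeeded(status, "page.html")) {
#   stop("PhantomJS did not produce page.html")
# }
```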

Update: Scaling to Multiple URLs

First, change the scrape.js file to accept arguments:

// scrape2.js
var webPage = require('webpage');
var page = webPage.create();
var system = require('system');
var args = system.args;
var fs = require('fs');

// args[0] is the script name; args[1] is the URL, args[2] the output file
var path = args[2];

page.open(args[1], function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});

Next, create lists to loop/walk/map over (obviously this could be cleaned up / abstracted to be easier to maintain and require less typing):

urls <- c(
  'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html',
  'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html',
  'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html'
)

paths <- c(
  'page1.html',
  'page2.html',
  'page3.html'
)

# Pair each URL with its output file as a single "url path" string
args_list <- map2(urls, paths, paste)

# We are only using this function for the file creation side-effects,
# so we can use walk instead of map.
# This creates the files: page1.html, page2.html, and page3.html
walk(args_list, ~ system(paste("./phantomjs scrape2.js", .)))
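Pasting the URL and path straight into a command string works for these inputs, but it would break on a path containing spaces or shell metacharacters. One possible hardening, sketched here with base R's `shQuote()` and `system2()`, is to quote each argument separately. The helper names `phantom_args` and `render_page` are hypothetical, and the default script name simply mirrors scrape2.js above.

```r
# Hypothetical helper: build the argument vector for one PhantomJS run,
# quoting the URL and the output path for the shell.
phantom_args <- function(url, path, script = "scrape2.js") {
  c(script, shQuote(url), shQuote(path))
}

# Hypothetical wrapper: runs PhantomJS once and returns its exit status.
render_page <- function(url, path, phantomjs = "./phantomjs") {
  system2(phantomjs, args = phantom_args(url, path))
}

# Possible usage (commented out because it needs PhantomJS on disk):
# purrr::walk2(urls, paths, render_page)
```

With this variant the intermediate `args_list` is no longer needed, since `walk2()` iterates over the URLs and paths in parallel.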

At this point, you'll probably want to throw the scraping stuff into a function:

read_page <- function(page) {
  read_html(page) %>%
    html_nodes("table#Tabc") %>%
    html_table(header = TRUE) %>%
    .[[1]] %>%
    as_tibble()
}

And from there you can reuse the paths list to map your new function over:

paths %>%
  map(~ read_page(.)) %>%
  bind_rows()

# A tibble: 9,000 x 38
   Codigo `Estatus asigna~  Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
 1 BTY5W~ ECO                  1               35               20           43212.              1.5            1762.              1.5
 2 BTY5W~ ECO                  2               35               20           43212.              1.5            1762.              1.5
 3 BTY5W~ ECO                  3               35               20           43212.              1.5            1762.              1.5
 4 BTY5W~ ECO                  4               35               20           43212.              1.5            1762.              1.5
 5 BTY5W~ ECO                  5               35               20           43212.              1.5            1762.              1.5
 6 BTY5W~ ECO                  6               35               20           43212.              1.5            1762.              1.5
 7 BTY5W~ ECO                  7               35               20           43212.              1.5            1762.              1.5
 8 BTY5W~ ECO                  8               35               20           43212.              1.5            1762.              1.5
 9 BTY5W~ ECO                  9               35               20           43212.              1.5            1762.              1.5
10 BTY5W~ ECO                 10               35               20           43212.              1.5            1762.              1.5
# ... with 8,990 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)`, `Bloque de Potencia 03 (MW)`,
#   `Costo Incremental de generacion Bloque 03 ($/MWh)`, `Bloque de Potencia 04 (MW)`, `Costo Incremental de generacion Bloque 04
#   ($/MWh)`, `Bloque de Potencia 05 (MW)`, `Costo Incremental de generacion Bloque 05 ($/MWh)`, `Bloque de Potencia 06
#   (MW)`, `Costo Incremental de generacion Bloque 06 ($/MWh)`, `Bloque de Potencia 07 (MW)`, `Costo Incremental de generacion
#   Bloque 07 ($/MWh)`, `Bloque de Potencia 08 (MW)`, `Costo Incremental de generacion Bloque 08 ($/MWh)`, `Bloque de Potencia
#   09 (MW)`, `Costo Incremental de generacion Bloque 09 ($/MWh)`, `Bloque de Potencia 10 (MW)`, `Costo Incremental de
#   generacion Bloque 10 ($/MWh)`, `Bloque de Potencia 11 (MW)`, `Costo Incremental de generacion Bloque 11 ($/MWh)`, `Reserva
#   rodante 10 min (MW)`, `Costo Reserva rodante 10 min ($/MW)`, `Reserva no rodante 10 min (MW)`, `Costo Reserva no rodante 10
#   min ($/MW)`, `Reserva rodante suplementaria (MW)`, `Costo Reserva rodante suplementaria ($/MW)`, `Reserva no rodante
#   suplementaria (MW)`, `Costo Reserva no rodante suplementaria ($/MW)`, `Reserva regulacion secundaria (MW)`, `Costo Reserva
#   regulacion secundaria ($/MW`
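Once the pages are stacked into one tibble, there is no longer any way to tell which file a row came from. If that matters, one option is to name the list before binding and let `bind_rows()` record the names in an extra column. This is a sketch only: `combine_pages` and the column name `source` are hypothetical choices, and `reader` stands for any function that turns a path into a data frame.

```r
library(purrr)
library(dplyr)

# Hypothetical helper: read every path with `reader` and stack the results,
# keeping the originating path in a `source` column.
combine_pages <- function(paths, reader) {
  paths %>%
    set_names() %>%             # name each element after its own path
    map(reader) %>%             # one data frame per path
    bind_rows(.id = "source")   # list names become the `source` column
}
```

A call along the lines of `combine_pages(paths, read_page)` would then return the same rows as above plus a leading `source` column.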
