The issue here is that the page is rendered via JavaScript, so rvest alone will not work: it only sees the HTML the server sends, before any scripts have run. One of the simplest ways to scrape such a page is to use a headless web browser. We can use PhantomJS.
First, download the appropriate version of PhantomJS for your platform and place the executable (phantomjs.exe on Windows) in your working directory, that is, the same directory the R script runs from.
Create a scrape.js file:
// scrape.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'page.html';
page.open('https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});
This scrape.js file, once run, will create a page.html file in your working directory. Back in R or RStudio you can do the following:
library(tidyverse)
library(rvest)
# Run scrape.js with PhantomJS to create the file page.html
system("./phantomjs scrape.js")
# Now we should be in business as usual:
read_html('page.html') %>%
  html_nodes("table#Tabc") %>%
  html_table(header = TRUE) %>%
  .[[1]] %>%
  as_tibble()
# A tibble: 504 x 38
Codigo `Estatus asigna~ Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
1 BTY5W~ ECO 1 35 20 43212. 1.5 1762. 1.5
2 BTY5W~ ECO 2 35 20 43212. 1.5 1762. 1.5
3 BTY5W~ ECO 3 35 20 43212. 1.5 1762. 1.5
4 BTY5W~ ECO 4 35 20 43212. 1.5 1762. 1.5
5 BTY5W~ ECO 5 35 20 43212. 1.5 1762. 1.5
6 BTY5W~ ECO 6 35 20 43212. 1.5 1762. 1.5
7 BTY5W~ ECO 7 35 20 43212. 1.5 1762. 1.5
8 BTY5W~ ECO 8 35 20 43212. 1.5 1762. 1.5
9 BTY5W~ ECO 9 35 20 43212. 1.5 1762. 1.5
10 BTY5W~ ECO 10 35 20 43212. 1.5 1762. 1.5
# ... with 494 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)` , `Bloque de Potencia 03 (MW)` ,
# `Costo Incremental de generacion Bloque 03 ($/MWh)` , `Bloque de Potencia 04 (MW)` , `Costo Incremental de generacion Bloque 04
# ($/MWh)` , `Bloque de Potencia 05 (MW)` , `Costo Incremental de generacion Bloque 05 ($/MWh)` , `Bloque de Potencia 06
# (MW)` , `Costo Incremental de generacion Bloque 06 ($/MWh)` , `Bloque de Potencia 07 (MW)` , `Costo Incremental de generacion
# Bloque 07 ($/MWh)` , `Bloque de Potencia 08 (MW)` , `Costo Incremental de generacion Bloque 08 ($/MWh)` , `Bloque de Potencia
# 09 (MW)` , `Costo Incremental de generacion Bloque 09 ($/MWh)` , `Bloque de Potencia 10 (MW)` , `Costo Incremental de
# generacion Bloque 10 ($/MWh)` , `Bloque de Potencia 11 (MW)` , `Costo Incremental de generacion Bloque 11 ($/MWh)` , `Reserva
# rodante 10 min (MW)` , `Costo Reserva rodante 10 min ($/MW)` , `Reserva no rodante 10 min (MW)` , `Costo Reserva no rodante 10
# min ($/MW)` , `Reserva rodante suplementaria (MW)` , `Costo Reserva rodante suplementaria ($/MW)` , `Reserva no rodante
# suplementaria (MW)` , `Costo Reserva no rodante suplementaria ($/MW)` , `Reserva regulacion secundaria (MW)` , `Costo Reserva
# regulacion secundaria ($/MW`
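One caveat with the system() call above: if PhantomJS fails to launch, read_html('page.html') may either fail with a confusing error or quietly read a stale file left over from an earlier run. A minimal sketch of a guard follows, assuming a hypothetical helper named run_and_check() (not part of any package) built on base R's system2(), file.exists(), and file.size():

```r
# Hypothetical helper (not from any package): run a command, then verify
# that it exited cleanly and actually produced a non-empty output file.
run_and_check <- function(command, args, out_file) {
  # Remove a stale file from a previous run so it cannot pass for fresh output
  if (file.exists(out_file)) file.remove(out_file)

  # system2() returns the exit status when stdout/stderr are not captured;
  # a command that cannot be run yields a non-zero status plus a warning
  exit_code <- suppressWarnings(system2(command, args = args))
  if (exit_code != 0) {
    stop(sprintf("'%s' exited with status %s", command, exit_code))
  }

  if (!file.exists(out_file) || file.size(out_file) == 0) {
    stop(sprintf("'%s' did not produce a usable file at '%s'", command, out_file))
  }

  invisible(out_file)
}
```

With PhantomJS in the working directory, the call would be run_and_check("./phantomjs", "scrape.js", "page.html"). Note that scrape.js as written calls phantom.exit() even when the page fails to load, so this guard mainly catches launch failures and missing output, not a bad response from the server.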
Update: Scaling to Multiple URLs
First, create a variant of scrape.js (saved here as scrape2.js) that takes the URL and the output path as command-line arguments:
// scrape2.js
var webPage = require('webpage');
var page = webPage.create();
var system = require('system');
var args = system.args;
var fs = require('fs');
var path = args[2];
page.open(args[1], function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});
Next, create the vectors of URLs and output paths to loop/walk/map over (obviously this could be cleaned up / abstracted to be easier to maintain and require less typing):
urls <- c(
'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html',
'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html',
'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html'
)
paths <- c(
'page1.html',
'page2.html',
'page3.html'
)
# Pair each URL with its output path as a single "url path" string
args_list <- map2(urls, paths, paste)
# We are only using this function for the file creation side-effects,
# so we can use walk instead of map.
# This creates the files: page1.html, page2.html, and page3.html
walk(args_list, ~ system(paste("./phantomjs scrape2.js", .)))
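To see what the walk() call actually hands to PhantomJS, it can help to build the command strings without running them. The sketch below uses placeholder example.com URLs (assumptions for illustration, not the real CENACE pages) and derives the output file names from the number of URLs instead of typing them out. In each resulting string, the URL becomes args[1] and the output path becomes args[2] inside scrape2.js.

```r
library(purrr)

# Placeholder URLs for illustration only
urls <- c(
  "https://example.com/a.html",
  "https://example.com/b.html",
  "https://example.com/c.html"
)

# Derive page1.html, page2.html, ... from the number of URLs
paths <- paste0("page", seq_along(urls), ".html")

# Pair each URL with its output path, separated by a space
args_list <- map2(urls, paths, paste)

# Build (but do not run) the commands that walk() would pass to system()
cmds <- map_chr(args_list, ~ paste("./phantomjs scrape2.js", .x))
print(cmds)
```

Because the pieces are joined with spaces, a URL or path containing a literal space would be split into separate arguments. The URLs in this answer encode spaces as %20, so they pass through intact.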
At this point, you'll probably want to wrap the parsing code in a function:
# Parse one saved page into a tibble
read_page <- function(page) {
  read_html(page) %>%
    html_nodes("table#Tabc") %>%
    html_table(header = TRUE) %>%
    .[[1]] %>%
    as_tibble()
}
And from there you can reuse the paths vector and map your new function over it:
paths %>%
  map(~ read_page(.)) %>%
  bind_rows()
# A tibble: 9,000 x 38
Codigo `Estatus asigna~ Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
1 BTY5W~ ECO 1 35 20 43212. 1.5 1762. 1.5
2 BTY5W~ ECO 2 35 20 43212. 1.5 1762. 1.5
3 BTY5W~ ECO 3 35 20 43212. 1.5 1762. 1.5
4 BTY5W~ ECO 4 35 20 43212. 1.5 1762. 1.5
5 BTY5W~ ECO 5 35 20 43212. 1.5 1762. 1.5
6 BTY5W~ ECO 6 35 20 43212. 1.5 1762. 1.5
7 BTY5W~ ECO 7 35 20 43212. 1.5 1762. 1.5
8 BTY5W~ ECO 8 35 20 43212. 1.5 1762. 1.5
9 BTY5W~ ECO 9 35 20 43212. 1.5 1762. 1.5
10 BTY5W~ ECO 10 35 20 43212. 1.5 1762. 1.5
# ... with 8,990 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)` , `Bloque de Potencia 03 (MW)` ,
# `Costo Incremental de generacion Bloque 03 ($/MWh)` , `Bloque de Potencia 04 (MW)` , `Costo Incremental de generacion Bloque 04
# ($/MWh)` , `Bloque de Potencia 05 (MW)` , `Costo Incremental de generacion Bloque 05 ($/MWh)` , `Bloque de Potencia 06
# (MW)` , `Costo Incremental de generacion Bloque 06 ($/MWh)` , `Bloque de Potencia 07 (MW)` , `Costo Incremental de generacion
# Bloque 07 ($/MWh)` , `Bloque de Potencia 08 (MW)` , `Costo Incremental de generacion Bloque 08 ($/MWh)` , `Bloque de Potencia
# 09 (MW)` , `Costo Incremental de generacion Bloque 09 ($/MWh)` , `Bloque de Potencia 10 (MW)` , `Costo Incremental de
# generacion Bloque 10 ($/MWh)` , `Bloque de Potencia 11 (MW)` , `Costo Incremental de generacion Bloque 11 ($/MWh)` , `Reserva
# rodante 10 min (MW)` , `Costo Reserva rodante 10 min ($/MW)` , `Reserva no rodante 10 min (MW)` , `Costo Reserva no rodante 10
# min ($/MW)` , `Reserva rodante suplementaria (MW)` , `Costo Reserva rodante suplementaria ($/MW)` , `Reserva no rodante
# suplementaria (MW)` , `Costo Reserva no rodante suplementaria ($/MW)` , `Reserva regulacion secundaria (MW)` , `Costo Reserva
# regulacion secundaria ($/MW`
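bind_rows() stacks the tables without recording which file each row came from. If that matters, one option is to name the list elements and pass .id to bind_rows(). The sketch below is self-contained so it can run without PhantomJS: it writes two tiny stand-in HTML files to a temporary directory (hypothetical data with only two made-up columns, via a hypothetical make_page() helper) and repeats the read_page() definition from above.

```r
library(rvest)
library(purrr)
library(dplyr)

# Same parsing function as above
read_page <- function(page) {
  read_html(page) %>%
    html_nodes("table#Tabc") %>%
    html_table(header = TRUE) %>%
    .[[1]] %>%
    as_tibble()
}

# Hypothetical helper: write a minimal page containing a table with id "Tabc"
make_page <- function(path, codigo) {
  html <- sprintf(
    paste0(
      '<html><body><table id="Tabc">',
      '<tr><th>Codigo</th><th>Hora</th></tr>',
      '<tr><td>%s</td><td>1</td></tr>',
      '<tr><td>%s</td><td>2</td></tr>',
      '</table></body></html>'
    ),
    codigo, codigo
  )
  writeLines(html, path)
  path
}

demo_paths <- c(
  make_page(file.path(tempdir(), "demo1.html"), "AAA"),
  make_page(file.path(tempdir(), "demo2.html"), "BBB")
)

# Name each element by its file name, then let bind_rows() store that
# name in a new "source" column
combined <- demo_paths %>%
  set_names(basename(demo_paths)) %>%
  map(read_page) %>%
  bind_rows(.id = "source")

print(combined)
```

Applied to the real files, the same pattern would start from paths instead of demo_paths.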