HTML table scraping in R

Latest recommended article published on 2023-01-11 12:50:55

李泽维



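This article shows how to combine PhantomJS with R to scrape dynamically rendered web pages using a headless browser, with particular attention to batch-processing multiple URLs and the file handling involved. With a scrape.js script and some R code, the table data is fetched and parsed. The approach suits data analysts and developers who need to pull information from web pages quickly.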


The issue here is that the page is rendered via JavaScript, so the table is not in the HTML that rvest downloads and rvest alone will not work. One of the simplest ways to scrape it is to use a headless web browser. We can use PhantomJS.
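To make the point about JavaScript rendering concrete, here is a minimal sketch using a made-up page (not the real CENACE page) whose table is created by a script. The HTML string and the `static_source` name are assumptions for illustration; only `read_html()` and `html_nodes()` from rvest are used.

```r
library(rvest)

# Illustrative page only: the table is built by a script at run time,
# so it does not exist in the HTML source that rvest parses.
static_source <- paste0(
  '<html><body><div id="container"></div>',
  '<script>',
  'var t = document.createElement("table");',
  't.id = "Tabc";',
  'document.getElementById("container").appendChild(t);',
  '</script></body></html>'
)

page <- read_html(static_source)

# rvest parses the markup but never executes the script,
# so the selector matches nothing.
n_tables <- length(html_nodes(page, "table#Tabc"))
print(n_tables)
```

A headless browser runs the script first, which is why the saved, rendered HTML contains the table while the raw source does not.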

First, download the appropriate version of PhantomJS and place the executable (assuming Windows) in your working directory. That is, phantomjs.exe should sit in the same working directory as the R script.
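Before going further it can help to confirm that R can actually see the executable. The following is a small sketch, not part of the original answer; `find_phantomjs()` is a hypothetical helper that looks in a directory first and then falls back to the system PATH.

```r
# Hypothetical helper: locate the PhantomJS binary in `dir`,
# falling back to whatever is on the system PATH.
find_phantomjs <- function(dir = getwd()) {
  exe <- if (.Platform$OS.type == "windows") "phantomjs.exe" else "phantomjs"
  local_copy <- file.path(dir, exe)
  if (file.exists(local_copy)) {
    return(normalizePath(local_copy))
  }
  on_path <- unname(Sys.which("phantomjs"))
  if (nzchar(on_path)) {
    return(on_path)
  }
  stop("PhantomJS not found in ", dir, " or on the PATH")
}

# Usage: stops with an error if the binary is missing.
# phantomjs_bin <- find_phantomjs()
```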

Create a scrape.js file:

// scrape.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'page.html';

// Open the URL, let PhantomJS render it, then save the rendered HTML to disk.
page.open('https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});

This scrape.js file, once run, will create a page.html file in your working directory. Back in R or RStudio you can do the following:

library(tidyverse)
library(rvest)

# Run scrape.js with PhantomJS to create the file page.html
system("./phantomjs scrape.js")

# Now we should be in business as usual:
read_html('page.html') %>%
  html_nodes("table#Tabc") %>%
  html_table(header = TRUE) %>%
  .[[1]] %>%
  as_tibble()

# A tibble: 504 x 38
   Codigo `Estatus asigna~ Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
 1 BTY5W~ ECO 1 35 20 43212. 1.5 1762. 1.5
 2 BTY5W~ ECO 2 35 20 43212. 1.5 1762. 1.5
 3 BTY5W~ ECO 3 35 20 43212. 1.5 1762. 1.5
 4 BTY5W~ ECO 4 35 20 43212. 1.5 1762. 1.5
 5 BTY5W~ ECO 5 35 20 43212. 1.5 1762. 1.5
 6 BTY5W~ ECO 6 35 20 43212. 1.5 1762. 1.5
 7 BTY5W~ ECO 7 35 20 43212. 1.5 1762. 1.5
 8 BTY5W~ ECO 8 35 20 43212. 1.5 1762. 1.5
 9 BTY5W~ ECO 9 35 20 43212. 1.5 1762. 1.5
10 BTY5W~ ECO 10 35 20 43212. 1.5 1762. 1.5
# ... with 494 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)`, `Bloque de Potencia 03 (MW)`,
#   `Costo Incremental de generacion Bloque 03 ($/MWh)`, `Bloque de Potencia 04 (MW)`, `Costo Incremental de generacion Bloque 04
#   ($/MWh)`, `Bloque de Potencia 05 (MW)`, `Costo Incremental de generacion Bloque 05 ($/MWh)`, `Bloque de Potencia 06
#   (MW)`, `Costo Incremental de generacion Bloque 06 ($/MWh)`, `Bloque de Potencia 07 (MW)`, `Costo Incremental de generacion
#   Bloque 07 ($/MWh)`, `Bloque de Potencia 08 (MW)`, `Costo Incremental de generacion Bloque 08 ($/MWh)`, `Bloque de Potencia
#   09 (MW)`, `Costo Incremental de generacion Bloque 09 ($/MWh)`, `Bloque de Potencia 10 (MW)`, `Costo Incremental de
#   generacion Bloque 10 ($/MWh)`, `Bloque de Potencia 11 (MW)`, `Costo Incremental de generacion Bloque 11 ($/MWh)`, `Reserva
#   rodante 10 min (MW)`, `Costo Reserva rodante 10 min ($/MW)`, `Reserva no rodante 10 min (MW)`, `Costo Reserva no rodante 10
#   min ($/MW)`, `Reserva rodante suplementaria (MW)`, `Costo Reserva rodante suplementaria ($/MW)`, `Reserva no rodante
#   suplementaria (MW)`, `Costo Reserva no rodante suplementaria ($/MW)`, `Reserva regulacion secundaria (MW)`, `Costo Reserva
#   regulacion secundaria ($/MW`
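The pipeline above assumes the PhantomJS run succeeded and that page.html was actually written. As a hedged sketch (the `check_scrape()` helper is hypothetical, not part of the original answer), the exit status returned by `system()` and the output file can be checked before parsing:

```r
# Hypothetical helper: fail early if the PhantomJS run did not produce
# a usable file. `status` is the exit status returned by system().
check_scrape <- function(status, output) {
  if (status != 0) {
    stop("PhantomJS exited with status ", status)
  }
  if (!file.exists(output) || file.size(output) == 0) {
    stop("PhantomJS did not write ", output)
  }
  invisible(output)
}

# Usage, assuming ./phantomjs and scrape.js are set up as described:
# status <- system("./phantomjs scrape.js")
# check_scrape(status, "page.html")
```

Failing here gives a clearer error than the one read_html() raises when the file is missing or empty.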

Update to Scale to Multiple URLs

First, create a variant of scrape.js, saved as scrape2.js, that accepts the URL and the output path as arguments:

// scrape2.js
var webPage = require('webpage');
var page = webPage.create();
var system = require('system');
var args = system.args;  // args[0] is the script name
var fs = require('fs');
var path = args[2];      // second argument: output file

// First argument: the URL to render and save.
page.open(args[1], function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});

Next, create lists to loop/walk/map over (obviously this could be cleaned up / abstracted to be easier to maintain and require less typing):

urls <- c(
  'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html',
  'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html',
  'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html'
)

paths <- c(
  'page1.html',
  'page2.html',
  'page3.html'
)

# Pair each URL with its output path as a single "url path" string
args_list <- map2(urls, paths, paste)

# We are only using this function for the file creation side-effects,
# so we can use walk instead of map.
# This creates the files: page1.html, page2.html, and page3.html
walk(args_list, ~ system(paste("./phantomjs scrape2.js", .)))
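As noted, the lists could be built with less typing, and the URLs are pasted into a shell command unquoted. Here is a minimal sketch of one way to tidy that up; `make_paths()` and `phantom_args()` are hypothetical helpers, the example.com URLs are placeholders, and `system2()` with `shQuote()` is a swapped-in alternative to the `system(paste(...))` call above, not the original author's method.

```r
# Hypothetical helper: one output file per URL (page1.html, page2.html, ...)
make_paths <- function(urls, prefix = "page") {
  sprintf("%s%d.html", prefix, seq_along(urls))
}

# Hypothetical helper: shell-quoted argument vector for one PhantomJS run,
# so characters such as & or ? in a URL are not interpreted by the shell.
phantom_args <- function(url, path, script = "scrape2.js") {
  shQuote(c(script, url, path))
}

# Placeholder URLs, for illustration only
example_urls <- c(
  "https://example.com/a%20b.html",
  "https://example.com/c.html?x=1&y=2"
)
example_paths <- make_paths(example_urls)
print(example_paths)
print(phantom_args(example_urls[2], example_paths[2]))

# With PhantomJS in the working directory, the calls would look like:
# purrr::walk2(urls, paths, ~ system2("./phantomjs", args = phantom_args(.x, .y)))
```

Deriving the paths from the URL vector keeps the two in step when URLs are added or removed.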

At this point, you'll probably want to throw the scraping stuff into a function:

read_page <- function(page) {
  read_html(page) %>%