I am using the httr package to download climate projection files from the FTP server of the NASA NEX project. [27]
My script is:
library(httr)

var  <- c("pr", "tasmin", "tasmax")
rcp  <- c("rcp45", "rcp85")
mod  <- c("inmcm4", "GFDL-CM3")
year <- seq(2040, 2080, 1)

for (v in var) {
  for (r in rcp) {
    # Directory on the server for this scenario and variable
    url <- paste0("ftp://ftp.nccs.nasa.gov/BCSD/", r, "/day/atmos/", v, "/r1i1p1/v1.0/")
    for (m in mod) {
      for (y in year) {
        # e.g. pr_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc
        nfile <- paste0(v, "_day_BCSD_", r, "_r1i1p1_", m, "_", y, ".nc")
        url1 <- paste0(url, nfile)
        destfile <- paste0("mypath", r, "/", v, "/", nfile)
        GET(url = url1,
            authenticate(user = "NEXGDDP", password = "", type = "basic"),
            write_disk(path = destfile, overwrite = FALSE))
        Sys.sleep(0.5)
      }
    }
  }
}
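After a while, the server drops my connection with the following error:
421 There are too many connections from your internet address.
I read here that this is caused by the number of open connections, and that I should close them at every iteration (I am not sure this really makes sense!).
Is there a way to close the FTP connection with the httr package?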
Best answer
Proposed solution (summary answer)
Suggested solution: set the maximum number of connections to the FTP server for httr
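# config() only builds the option object; set_config() applies it to later requests.
# httr expects its own option name (maxconnects), not the libcurl constant.
set_config(config(maxconnects = 5))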
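Explanation
Preamble:
The httr package is a wrapper around curl. This matters because it abstracts the curl interface. In this case, we want to change curl's behaviour by modifying curl's configuration through the httr abstraction.
By default, httr handles automatic connection sharing between requests to the same website (curl handles are managed automatically), keeps cookies across requests, and uses an up-to-date root-level SSL certificate store.
Here we do not control the FTP server, only the client's requests to it. We can therefore change curl's default behaviour through httr::config to reduce the number of simultaneous FTP requests.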
Querying the httr curl FTP options
To retrieve the current options, we can run the following command:
httr_options("ftp")
     httr                     libcurl                          type
49   ftp_account              CURLOPT_FTP_ACCOUNT              string
50   ftp_alternative_to_user  CURLOPT_FTP_ALTERNATIVE_TO_USER  string
51   ftp_create_missing_dirs  CURLOPT_FTP_CREATE_MISSING_DIRS  integer
52   ftp_filemethod           CURLOPT_FTP_FILEMETHOD           integer
53   ftp_response_timeout     CURLOPT_FTP_RESPONSE_TIMEOUT     integer
54   ftp_skip_pasv_ip         CURLOPT_FTP_SKIP_PASV_IP         integer
55   ftp_ssl_ccc              CURLOPT_FTP_SSL_CCC              integer
56   ftp_use_eprt             CURLOPT_FTP_USE_EPRT             integer
57   ftp_use_epsv             CURLOPT_FTP_USE_EPSV             integer
58   ftp_use_pret             CURLOPT_FTP_USE_PRET             integer
59   ftpport                  CURLOPT_FTPPORT                  string
60   ftpsslauth               CURLOPT_FTPSSLAUTH               integer
196  tftp_blksize             CURLOPT_TFTP_BLKSIZE             integer
To open the libcurl documentation for an option, we can call curl_docs("CURLOPT_FTP_ACCOUNT").
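The same table can also be queried programmatically, which is handy for finding the name httr expects for a given libcurl option. A minimal sketch, assuming the httr package is installed and that `httr_options()` returns a data frame with the `httr`, `libcurl` and `type` columns shown above; `httr_name_for()` is a hypothetical helper, not part of httr:

```r
library(httr)

# Hypothetical helper: map a libcurl option name to the name httr expects.
httr_name_for <- function(libcurl_name) {
  opts <- httr_options()                    # full option table
  opts$httr[opts$libcurl == libcurl_name]   # character(0) if nothing matches
}

epsv_name <- httr_name_for("CURLOPT_FTP_USE_EPSV")
print(epsv_name)
```

The returned name is the one to use as an argument name in `config()`.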
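Modifying the httr request configuration
You can change httr's global curl configuration with set_config(), or just wrap your request in with_config(). In this case we want to limit the maximum number of connections to the FTP server.
So: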
httr_options("max")
     httr                  libcurl                       type
95   max_recv_speed_large  CURLOPT_MAX_RECV_SPEED_LARGE  number
96   max_send_speed_large  CURLOPT_MAX_SEND_SPEED_LARGE  number
97   maxconnects           CURLOPT_MAXCONNECTS           integer
98   maxfilesize           CURLOPT_MAXFILESIZE           integer
99   maxfilesize_large     CURLOPT_MAXFILESIZE_LARGE     number
100  maxredirs             CURLOPT_MAXREDIRS             integer
We can now look up curl_docs("CURLOPT_MAXCONNECTS") and confirm that this is what we want.
Now we have to set it.
# Use the httr option name from the table above and register it with set_config()
set_config(config(maxconnects = 5))
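The option can also be scoped to a single request rather than set for the whole session. A minimal sketch, assuming the httr option name `maxconnects` listed in the table above; `fetch_with_limit()` and its arguments are hypothetical names used for illustration, and no download is actually performed here:

```r
library(httr)

# Build the option object; nothing is sent to the server at this point.
conn_limit <- config(maxconnects = 5)

# Hypothetical helper: download one file with the limit scoped to this call.
fetch_with_limit <- function(url, dest, cfg = conn_limit) {
  with_config(cfg, GET(
    url,
    authenticate(user = "NEXGDDP", password = "", type = "basic"),
    write_disk(path = dest, overwrite = FALSE)
  ))
}

# Alternatively, register the option for every later request in the session:
# set_config(conn_limit)
# ... and undo it again with reset_config()
```

Scoping the option with `with_config()` keeps the rest of the session on httr's defaults, which may be preferable when the same script talks to other servers.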
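Reference: https://cran.r-project.org/web/packages/httr/httr.pdf [29]
Alternative RCurl approach
I know this is somewhat redundant; I am including it to offer another approach. Why? Because of network bandwidth, there is a subtle issue here: running several simultaneous FTP sessions may be slower than running them serially. My alternative is to run the R script below, or to use curl directly from a Unix shell command line.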
library(RCurl)
library(stringr)

opts <- curlOptions(userpwd = "NEXGDDP:", netrc = TRUE)
rcpDir <- c("rcp45", "rcp85")
varDir <- c("pr", "tasmin", "tasmax")

for (rcp in rcpDir) {
  for (var in varDir) {
    url <- paste0("ftp://ftp.nccs.nasa.gov/BCSD/", rcp, "/day/atmos/", var, "/r1i1p1/v1.0/")
    print(url)
    # List the remote directory (file names only)
    filenames <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE, .opts = opts)
    filelist <- unlist(str_split(filenames, "\n"))
    filelist <- filelist[filelist != ""]
    # Keep the two models of interest; the pattern matches 2040, 2050, ..., 2080 only
    filesavg <- filelist[str_detect(filelist, "inmcm4_20[4-8]0|GFDL-CM3_20[4-8]0")]
    urlsavg <- str_c(url, filesavg)
    for (file in seq_along(urlsavg)) {
      fname <- str_c("data/", filesavg[file])  # the data/ directory must already exist
      # Skip files that were already downloaded
      if (!file.exists(fname)) {
        print(urlsavg[file])
        bin <- getBinaryURL(urlsavg[file], .opts = opts)
        writeBin(bin, fname)
        Sys.sleep(1)
      }
    }
  }
}
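One detail worth checking before running the loop: the pattern `20[4-8]0` matches only the years 2040, 2050, ..., 2080, not every year in between. A small base R sketch of the filter, using made-up file names that follow the naming scheme on the server (`sample_files` is illustrative, not a real directory listing, and `all_years` is a hypothetical alternative pattern):

```r
# Illustrative file names (not a real directory listing)
sample_files <- c(
  "pr_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc",
  "pr_day_BCSD_rcp45_r1i1p1_inmcm4_2041.nc",
  "pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc",
  "pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2090.nc",
  "pr_day_BCSD_rcp45_r1i1p1_CCSM4_2040.nc"
)

pattern <- "inmcm4_20[4-8]0|GFDL-CM3_20[4-8]0"

# grepl() is the base R counterpart of stringr::str_detect()
selected <- sample_files[grepl(pattern, sample_files)]
print(selected)

# Hypothetical alternative: match every year from 2040 to 2080
all_years <- "(inmcm4|GFDL-CM3)_20([4-7][0-9]|80)"
selected_all <- sample_files[grepl(all_years, sample_files)]
print(selected_all)
```

With the original pattern only the 2040 and 2050 files for the two models of interest are kept; the alternative pattern would also keep the 2041 file.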
Code output
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"
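Other answer 1
(Not sure this should be an answer, but I could not fit all of this in a comment.)
To sum up, here are two alternative solutions that combine my approach with the one proposed by Technophobe. I am posting the final code for both in case it helps anyone facing the same problem.
httr approach: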
library(httr)

# Configure a proxy, in case you are on an office/university network
# (replace the placeholder URL and port with your own values)
set_config(use_proxy(url = "http://~in_case_you_need_a_proxy", port = paste_here_port_no))

# Limit the number of simultaneous connections, as suggested by Technophobe
# (default is 5); set_config() is needed for the option to take effect
set_config(config(maxconnects = 3))

var  <- c("pr", "tasmax", "tasmin")
rcp  <- c("rcp45", "rcp85")
mod  <- c("inmcm4", "GFDL-CM3")
year <- c(seq(2036, 2050, 1), seq(2061, 2080, 1))

for (v in var) {
  for (r in rcp) {
    url <- paste0("ftp://ftp.nccs.nasa.gov/BCSD/", r, "/day/atmos/", v, "/r1i1p1/v1.0/")
    for (m in mod) {
      for (y in year) {
        # e.g. pr_day_BCSD_rcp45_r1i1p1_inmcm4_2036.nc
        nfile <- paste0(v, "_day_BCSD_", r, "_r1i1p1_", m, "_", y, ".nc")
        url1 <- paste0(url, nfile)
        destfile <- paste0("D:/destination_path/", r, "/", v, "/", nfile)
        GET(url = url1,
            authenticate(user = "NEXGDDP", password = "", type = "basic"),
            write_disk(path = destfile, overwrite = FALSE))
        gc()          # run garbage collection between downloads
        Sys.sleep(1)  # pause before the next request
      }
    }
  }
}
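The loop above assumes that the scenario/variable sub-folders already exist under the destination path. A base R sketch that builds the full list of files up front and creates any missing folders; `plan` and `dest_root` are illustrative names, the file names follow the naming pattern of the files on the server, and a temporary directory stands in for the real destination:

```r
var  <- c("pr", "tasmax", "tasmin")
rcp  <- c("rcp45", "rcp85")
mod  <- c("inmcm4", "GFDL-CM3")
year <- c(2036:2050, 2061:2080)

# One row per file to download (every combination of the four vectors)
plan <- expand.grid(v = var, r = rcp, m = mod, y = year,
                    stringsAsFactors = FALSE)

plan$nfile <- sprintf("%s_day_BCSD_%s_r1i1p1_%s_%d.nc",
                      plan$v, plan$r, plan$m, plan$y)
plan$url <- sprintf("ftp://ftp.nccs.nasa.gov/BCSD/%s/day/atmos/%s/r1i1p1/v1.0/%s",
                    plan$r, plan$v, plan$nfile)

dest_root <- file.path(tempdir(), "nex_downloads")  # stand-in destination
plan$destfile <- file.path(dest_root, plan$r, plan$v, plan$nfile)

# Create each missing scenario/variable folder once
for (d in unique(dirname(plan$destfile))) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

print(nrow(plan))
print(plan$url[1])
```

Having the plan in a data frame also makes it easy to skip files that already exist, for example with `plan[!file.exists(plan$destfile), ]`, before starting the downloads.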
Alternative approach using RCurl
library(RCurl)

# Replace the proxy placeholder with your own proxy URL and port, or drop the argument
opts <- curlOptions(proxy = "http://~in_case_you_need_a_proxy:paste_here_port_no",
                    userpwd = "NEXGDDP:", netrc = TRUE)

var  <- c("pr", "tasmax", "tasmin")
rcp  <- c("rcp45", "rcp85")
mod  <- c("inmcm4", "GFDL-CM3")
year <- c(seq(2036, 2050, 1), seq(2061, 2080, 1))

for (v in var) {
  for (r in rcp) {
    url <- paste0("ftp://ftp.nccs.nasa.gov/BCSD/", r, "/day/atmos/", v, "/r1i1p1/v1.0/")
    for (m in mod) {
      for (y in year) {
        # e.g. pr_day_BCSD_rcp45_r1i1p1_inmcm4_2036.nc
        nfile <- paste0(v, "_day_BCSD_", r, "_r1i1p1_", m, "_", y, ".nc")
        url1 <- paste0(url, nfile)
        destfile <- paste0("D:/destination_path/", r, "/", v, "/", nfile)
        # Download into memory, then write the bytes to disk
        bin <- getBinaryURL(url1, .opts = opts)
        writeBin(bin, destfile)
        Sys.sleep(1)  # pause before the next request
        gc()          # run garbage collection between downloads
      }
    }
  }
}
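Both methods have been tested and work. The second one may still be affected by the 421 error, but only on a very limited number of occasions (I downloaded more than 900 files, about 600 GB in total). I hope this is a useful reference for others working in this field.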
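Since the 421 error can still show up occasionally, one option is to retry a failed download after a pause instead of stopping the whole loop. A minimal base R sketch, assuming the download call signals an R error when the server refuses the connection; `with_retry()`, `max_tries` and `wait` are hypothetical names, and a fake function stands in for the real download:

```r
# Hypothetical helper: call fun() until it succeeds or max_tries is reached,
# doubling the pause after each failure.
with_retry <- function(fun, max_tries = 5, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) {
      Sys.sleep(wait)
      wait <- wait * 2
    }
  }
  stop("All ", max_tries, " attempts failed")
}

# Stand-in for a download that fails twice before succeeding
calls <- 0
flaky_download <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("421 too many connections")
  "downloaded"
}

status <- with_retry(flaky_download, wait = 0.1)
print(status)
```

In the RCurl loop this could be used as `bin <- with_retry(function() getBinaryURL(url1, .opts = opts))`, with a longer `wait` so the server has time to release the old connections.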