Scraping Lianjia with the RSelenium package (Part 2: data storage and fault tolerance)

Article link: https://blog.csdn.net/Joyliness/article/details/78817758

Continued

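The previous post, "Scraping Lianjia with the RSelenium package (Part 1: simulated clicks and page scraping)", focused on automating clicks through the pages. That code can collect the complete data set, but only if no error or warning interrupts the run. LinkinfoFunc and HouseinfoFunc are both wrapper functions that hand their results back only when they return, so once a call is interrupted, everything scraped before the interruption is lost and never reaches a data frame or list.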
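To make this failure mode concrete, here is a minimal sketch that needs neither a browser nor a network connection. `flaky_fetch` and `collect_all` are hypothetical stand-ins invented for illustration, not functions from either post; `tryCatch` is used only so that the script keeps running after the simulated failure.

```r
# Hypothetical stand-in for a page fetch: fails on the third item.
flaky_fetch <- function(i) {
  if (i == 3) stop("simulated network failure")
  data.frame(id = i, value = i^2)
}

# Same shape as the wrapper functions: results accumulate in a local
# variable and are only handed back by return() at the very end.
collect_all <- function(ids) {
  result <- data.frame()
  for (i in ids) {
    result <- rbind(result, flaky_fetch(i))
  }
  return(result)
}

# The error on item 3 aborts the call before return() runs, so the rows
# already collected for items 1 and 2 never reach the caller.
outcome <- tryCatch(collect_all(1:5), error = function(e) e)
print(class(outcome))
print(is.data.frame(outcome))
```

The caller ends up with an error object and none of the rows, which is why the storage step is moved out of the wrapper function in this post.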

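When the amount of data is large and the run takes a long time, problems such as dropped network connections are hard to avoid. This post therefore builds on the previous one by adding a for loop and using tryCatch for simple error handling. The changes relative to Part 1 are:

The page setup and Step 1 code are unchanged
Step 2 drops the data frame result; storing the data is now handled in Step 3
Step 3 adds a for loop and introduces tryCatch

Also, code repeated from Part 1 is not annotated again here.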

Page setup

library(rvest)
library(stringr)
library(RSelenium)
remDr <- remoteDriver(browserName = "chrome")
base1 <- "https://hui.lianjia.com/ershoufang/"
base2 <- c("danshui", "huiyangqu", "nanzhanxincheng")
url   <- paste(base1, base2, "/", sep = "")

Step 1: the wrapper function LinkinfoFunc

LinkinfoFunc <- function(remDr, url) {
  result <- data.frame()
  remDr$open()
  for (i in seq_along(url)) {
    remDr$navigate(url[i])
    j = 0
    while (TRUE) {
      j = j + 1
      destination <- remDr$getPageSource()[[1]] %>% read_html()
      link        <- destination %>% html_nodes("li.clear div.title a") %>% html_attr("href")
      pageinfo    <- destination %>% html_nodes("div.house-lst-page-box") %>% 
                     html_attr("page-data") %>% str_extract_all(., ":[\\d]+") %>% 
                     unlist() %>% gsub(":", "", .)
      totalpage   <- pageinfo[1]
      curpage     <- pageinfo[2]
      data        <- data.frame(link, stringsAsFactors = FALSE)
      result      <- rbind(result, data)
      # Report progress for this page, then click through to the next page or stop on the last one
      cat(sprintf("Region [%d], page [%d] scraped successfully", i, j), sep = "\n")
      if (curpage != totalpage) {
        remDr$executeScript("arguments[0].click();", 
                            list(remDr$findElement("css", "div.house-lst-page-box a.on+a")))
      } else {
        break
      }
    }
    cat(sprintf("Region [%d] scraped successfully", i), sep = "\n")
  }
  remDr$close()
  cat("All work is done!", sep = "\n")
  return(result)
}
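The page counter inside LinkinfoFunc relies on pulling the numbers out of the `page-data` attribute. The sketch below replays that step on a made-up attribute value; the string `{"totalPage":8,"curPage":1}` is an assumption about the format (the real value comes from the live page), and base R's `regmatches` is swapped in for `str_extract_all` so the example runs without extra packages.

```r
# Assumed example of the page-data attribute value (not taken from a live page).
page_data <- '{"totalPage":8,"curPage":1}'

# Base-R counterpart of str_extract_all(., ":[\\d]+"): every ":<digits>" run.
matches  <- regmatches(page_data, gregexpr(":[0-9]+", page_data))
pageinfo <- gsub(":", "", unlist(matches))

totalpage <- pageinfo[1]
curpage   <- pageinfo[2]

# Both values are character strings, so != compares text, which is
# enough to tell whether the last page has been reached.
print(pageinfo)
print(curpage != totalpage)
```

Under this assumption the first number is the total page count and the second is the current page, which matches how `pageinfo[1]` and `pageinfo[2]` are used in the function.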

Run the function

linkinfo  <- LinkinfoFunc(remDr, url) %>% unlist()
# Run LinkinfoFunc; unlist() flattens the returned data frame, so linkinfo is a named character vector of links
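A quick check of what `unlist()` does to a one-column data frame, using two placeholder URLs (the `example.com` addresses are made up for illustration):

```r
# A one-column data frame shaped like the return value of LinkinfoFunc.
links <- data.frame(
  link = c("https://example.com/a.html", "https://example.com/b.html"),
  stringsAsFactors = FALSE
)

flat <- unlist(links)

print(class(flat))    # type of the flattened object
print(is.list(flat))  # whether it is still a list
print(names(flat))    # names are built from the column name plus an index
```

The flattened object is a named character vector rather than a list, which is why `linkinfo[i]` in Step 3 can be passed straight to `read_html()` and reused as an element name.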

Step 2: the wrapper function HouseinfoFunc

HouseinfoFunc <- function(link) {
    # Scrape one listing page and return its 9 fields as a one-row data frame
    destination  <- read_html(link, encoding = "UTF-8")
    location     <- destination %>% html_nodes("a.no_resblock_a") %>% html_text()
    unit         <- destination %>% html_nodes(".price span.unit") %>% html_text()
    totalprice   <- destination %>% html_nodes(".price span.total:nth-child(1)") %>%
                    html_text() %>% paste(., unit, sep = "")
    downpayment  <- destination %>% html_nodes(".taxtext span") %>% html_text() %>% .[1]
    persquare    <- destination %>% html_nodes("span.unitPriceValue") %>% html_text()
    area         <- destination %>% html_nodes(".area .mainInfo") %>% html_text()
    title        <- destination %>% html_nodes(".title h1") %>% html_text()
    subtitle     <- destination %>% html_nodes(".title div.sub") %>% html_text()
    room         <- destination %>% html_nodes(".room .mainInfo") %>% html_text()
    floor        <- destination %>% html_nodes(".room .subInfo") %>% html_text()
    data         <- data.frame(location, totalprice, downpayment, persquare, 
                               area, title, subtitle, room, floor)
    return(data)
}
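One thing worth knowing about the final `data.frame()` call: it needs every field to have the same length. The sketch below simulates a page where one selector matches nothing, assuming that an unmatched selector yields a zero-length character vector. `first_or_na` is a hypothetical helper, not part of the original function, shown only as one possible way to keep such a row.

```r
# Simulated fields: the title was found, the subtitle selector matched nothing.
title    <- "Sample listing"
subtitle <- character(0)

# Columns of length 1 and 0 cannot be combined, so data.frame() raises an error.
attempt <- tryCatch(
  data.frame(title, subtitle, stringsAsFactors = FALSE),
  error = function(e) e
)
print(inherits(attempt, "error"))

# Hypothetical helper: keep the first match, or NA when nothing matched.
first_or_na <- function(x) if (length(x) == 0) NA_character_ else x[1]

row <- data.frame(
  title    = first_or_na(title),
  subtitle = first_or_na(subtitle),
  stringsAsFactors = FALSE
)
print(row)
```

Without such a guard, a live page with one missing field is handled like a failed page: the error is caught in Step 3, the link is retried, and the entry is dropped during the final filtering.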

Step 3: a for loop with tryCatch to catch errors

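result <- list()
# Alternatively: result <- vector("list", length(linkinfo))
# Create an empty list, result, to hold the scraped data
# Run this line only once: re-running it before resuming an interrupted loop discards what has been collected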
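for (i in seq_along(linkinfo)) {
  if (!(linkinfo[i] %in% names(result))) {
  # Only continue if the data for linkinfo[i] has not been written to result yet
  # This check skips links that were already scraped, so nothing is fetched twice
    cat(paste("Doing", i, linkinfo[i], "..."))
    # Print which link is currently being scraped
    ok <- FALSE
    # Initial value of the success flag
    counter <- 0
    # Initial value of the attempt counter
    while (!ok && counter < 5) {
    # ok handles the normal case: it starts as FALSE, becomes TRUE once data is scraped, and the while loop then ends
    # counter handles failing links: it starts at 0 and is incremented on every attempt, so a failing link is tried at most 5 times
      counter <- counter + 1
      # Count this attempt (the first pass is attempt 1)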
      output <- tryCatch({                  
        HouseinfoFunc(linkinfo[i])
        # Run HouseinfoFunc on linkinfo[i] to scrape the page
      },
      error=function(e){
      # If an error occurs
        Sys.sleep(2)
        # wait 2 seconds
        e
        # and return the error object, which then becomes the value of output
      }
      )
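      if (inherits(output, "error")) {
      # If output is an error object
        cat("NA...")
        # print "NA..." and go back to the while loop for the next attempt (up to 5 in total)
      } else {
      # If output holds scraped data rather than an error
        ok <- TRUE
        # set the flag to TRUE so that the while loop ends
        cat("Done.")
        # and print a completion message
      }
    }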
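    cat("\n")
    result[[i]] <- output
    # Each pass of the for loop stores its output in result
    names(result)[[i]] <- linkinfo[i]
    # Name the i-th element of result after the corresponding listing URL
  }
} 
# The resulting list contains both the error objects returned for 404 Not Found pages and the target data, so the two still have to be separated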
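The retry logic in the loop above can be tried out without touching the network. The following is a minimal sketch under stated assumptions: `with_retry` is a hypothetical helper that packages the same while/tryCatch idea, and `unstable` is a simulated scraper that fails twice before succeeding. Neither appears in the original code.

```r
# Hypothetical helper: call f() until it succeeds or max_tries attempts
# have failed, pausing `wait` seconds after each failure.
with_retry <- function(f, max_tries = 5, wait = 0) {
  output <- NULL
  for (attempt in seq_len(max_tries)) {
    output <- tryCatch(f(), error = function(e) e)
    if (!inherits(output, "error")) break
    Sys.sleep(wait)
  }
  output  # either the value of f() or the last error object
}

# Simulated scraper that fails on its first two calls.
state <- new.env()
state$calls <- 0
unstable <- function() {
  state$calls <- state$calls + 1
  if (state$calls < 3) stop("simulated timeout")
  "payload"
}

good <- with_retry(unstable)
bad  <- with_retry(function() stop("HTTP error 404."), max_tries = 2)

print(good)
print(state$calls)
print(inherits(bad, "error"))
```

As in the loop above, a link that never succeeds leaves an error object behind instead of stopping the run, so the caller decides later what to do with it.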

Separating and extracting the data

result <- lapply(result, function(x) {
  if (unlist(x) %>% length() == 9) {
    return(x)
  } else {
    return(NULL)
  }
})
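# Flatten each element of result: the target data has 9 variables, so a successfully scraped one-row data frame has length 9 once flattened
# Use this property to keep the target elements and set every other element to NULL
result <- result[!sapply(result, is.null)]
# Drop the NULL elements so that result holds only target data
houseinfo <- do.call(rbind, result)
# rbind the remaining elements into houseinfo (a data frame)
View(houseinfo)
write.table(houseinfo, file = "houseinfo.csv", row.names = FALSE, sep = ",")
# Inspect the data with View() and export it to a local CSV file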
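An alternative way to express the same separation is to test the class of each element instead of its flattened length. This is a sketch on toy data, not the author's method: the two small data frames and the `simpleError` below are made up to mimic what Step 3 leaves in `result`.

```r
# Toy stand-in for the list produced by Step 3: two rows and one failure.
result <- list(
  a = data.frame(location = "A", room = "3-2", stringsAsFactors = FALSE),
  b = simpleError("HTTP error 404."),
  c = data.frame(location = "C", room = "2-1", stringsAsFactors = FALSE)
)

# Record which links failed before discarding them.
failed <- names(Filter(function(x) inherits(x, "error"), result))

# Keep only the data frames, then stack them.
kept      <- Filter(is.data.frame, result)
houseinfo <- do.call(rbind, kept)

print(failed)
print(dim(houseinfo))
```

Testing with `is.data.frame` does not depend on the number of variables, and keeping the names of the failed links makes it possible to retry them later.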

Summary

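The wrapper function LinkinfoFunc from Step 1 can collect every listing link into linkinfo (a character vector), but each link is only valid for a limited time: once a listing is taken down, its page returns 404 Not Found. Not only can that page not be scraped, an unhandled error also stops the scraping of every link after it. The code in Step 3 therefore does the following:
1. A for loop iterates over every link in linkinfo
2. For each link, tryCatch wraps every scraping attempt to detect and capture a failing linkinfo[i]
3. For a failing linkinfo[i], the while loop retries, for up to 5 attempts in total, waiting 2 seconds after each failure
4. If the loop is interrupted manually, the if (!(linkinfo[i] %in% names(result))) check skips links that were already scraped, so re-running the for loop resumes from the point of interruption
5. For failing links that yield no data, the stored value would by default be NULL (Step 3 stores the error object instead); these invalid entries must be removed before rbind, otherwise rbind fails with an error about inconsistent lengths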
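The resume behaviour in point 4 can be sketched with a toy loop. `scrape_pending` and its `limit` argument are hypothetical and exist only to simulate a manual interruption; unlike the original loop, the sketch stores each element directly by name rather than by index, and `toupper()` is a placeholder for the real scrape.

```r
# Hypothetical helper: process only links not yet present in names(result),
# stopping after `limit` new links to simulate an interruption.
scrape_pending <- function(links, result, limit = length(links)) {
  done <- 0
  for (i in seq_along(links)) {
    if (!(links[i] %in% names(result))) {
      if (done >= limit) break
      result[[links[i]]] <- toupper(links[i])  # placeholder for the real scrape
      done <- done + 1
    }
  }
  result
}

links  <- c("u1", "u2", "u3", "u4")
result <- list()                                    # created once, never reset

result     <- scrape_pending(links, result, limit = 2)  # "interrupted" after 2 links
first_pass <- names(result)

result <- scrape_pending(links, result)                 # picks up at the third link

print(first_pass)
print(names(result))
```

The second call does no work for the first two links because their names are already in `result`, which is the same check the Step 3 loop performs.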
Example output and errors:

[For links that connect and are scraped normally, the for loop prints the following]

Doing 1 https://hui.lianjia.com/ershoufang/105101098943.html ...Done.
Doing 2 https://hui.lianjia.com/ershoufang/105101085455.html ...Done.

[For links that still cannot be connected to or scraped after 5 attempts, the for loop prints the following, together with these warnings]

Doing 3 https://hui.lianjia.com/ershoufang/105101261413.html ...NA...NA...NA...NA...NA...
Doing 4 https://hui.lianjia.com/ershoufang/105112912491.html ...NA...NA...NA...NA...NA...
Warning messages:
1: closing unused connection 11 (https://hui.lianjia.com/ershoufang/105101261413.html) 
2: closing unused connection 10 (https://hui.lianjia.com/ershoufang/105101261413.html) 
3: closing unused connection 9 (https://hui.lianjia.com/ershoufang/105101261413.html) 
4: closing unused connection 8 (https://hui.lianjia.com/ershoufang/105101261413.html) 
5: closing unused connection 7 (https://hui.lianjia.com/ershoufang/105101261413.html)

[The result obtained in Step 3 contains both target elements (the first 2) and non-target elements (the last 2)]

$`https://hui.lianjia.com/ershoufang/105101098943.html`
          location totalprice downpayment   persquare       area                                            title
1 花样年别样城一期       96万   首付29万  9343元/平米 102.76平米 惠州南站   南北通透  满五唯一  诚心诚售 随时看房
                                                  subtitle   room         floor
1 此房满五唯一， 税费少，中高楼层，户型方正， 业主诚心出售 3室2厅 中楼层/共11层

$`https://hui.lianjia.com/ershoufang/105101085455.html`
  location totalprice downpayment   persquare      area                                          title
1   鹏城里      105万   首付32万  9981元/平米 105.2平米 满五唯一 临深片区 区政府中 心地段 地铁14号线旁
                                                subtitle   room         floor
1 此房满五唯一，无增值税个税，客厅出阳台看花园，无遮挡。 3室2厅 中楼层/共28层

$`https://hui.lianjia.com/ershoufang/105101261413.html`
<simpleError in open.connection(x, "rb"): HTTP error 404.>

$`https://hui.lianjia.com/ershoufang/105112912491.html`
<simpleError in open.connection(x, "rb"): HTTP error 404.>

[Calling rbind directly on result at this point produces the following error]

Error in rbind(deparse.level, ...) : 
  invalid list argument: all variables should have the same length
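The error can be reproduced offline with a two-element list. This is a small sketch on made-up data: one data frame standing in for a scraped row and one `simpleError` standing in for a 404 page.

```r
# One valid row and one error object, as left behind by a failed link.
ok_row  <- data.frame(location = "A", room = "3-2", stringsAsFactors = FALSE)
failure <- simpleError("HTTP error 404.")
mixed   <- list(ok_row, failure)

# rbind cannot treat the error object as a row, so the call fails;
# tryCatch captures the failure so the message can be inspected.
err <- tryCatch(do.call(rbind, mixed), error = function(e) e)

print(inherits(err, "error"))
print(conditionMessage(err))
```

Removing the error objects from the list before calling `rbind` avoids the problem.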

References:
Iterating rvest scrape function gives: "Error in open.connection(x, "rb") : Timeout was reached"