Scraping Autohome (汽车之家) with RCurl

**码上人生**

Category: Data Mining. Tags: RCurl, scraping

Article link: https://blog.csdn.net/qq_16365849/article/details/51201439

Scraping Autohome

library(RCurl)

## Loading required package: bitops

#install.packages("XML")
library(XML)
library(reshape)

# Spoof browser request headers
myheader <- c(
"User-Agent"="Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6)",
"Accept"="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language"="en-us",
"Connection"="keep-alive",
"Accept-Charset"="GB2312,utf-8;q=0.7,*;q=0.7"
)

# 1) Test scrape: microcars (a00)
a00url <- "http://www.autohome.com.cn/a00/"
temp <- getURL(a00url, httpheader=myheader, .encoding = "gb2312")

# Transcode from GB2312 to UTF-8
temp1 <- iconv(temp, "gb2312", "UTF-8")
Encoding(temp1)

## [1] "UTF-8"
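The transcoding step can be tried offline. The following is a minimal sketch, assuming the platform's `iconv` supports GB2312; the byte values are the GB2312 encoding of the word "汽车" ("car") and are used here only as illustrative input, not as data from the scraped page.

```r
# Bytes of the two-character word "汽车" ("car") encoded as GB2312
gb_bytes <- as.raw(c(0xC6, 0xFB, 0xB3, 0xB5))
gb_text <- rawToChar(gb_bytes)

# Same call shape as in the post: convert from GB2312 to UTF-8
utf8_text <- iconv(gb_text, from = "GB2312", to = "UTF-8")

# Inspect the result as Unicode code points instead of printing it,
# so the check does not depend on the console's locale
code_points <- utf8ToInt(utf8_text)
```

Comparing code points is a convenient way to confirm that a conversion worked even when the console itself displays garbled characters.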

# Parse the page as UTF-8
k <- htmlParse(temp1, asText = TRUE, encoding = "UTF-8")

# Printing the parsed doc shows garbled characters, but that is harmless: the parsed tables are not garbled
tables <- readHTMLTable(k, header = FALSE)
#getNodeSet(k,'//div[@class="uibox"]')

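# Car manufacturers
#getNodeSet(k,'//div[@class="h3-tit"]/text()')

# Car models (greylink: grey links mark models that are not on sale)
model<-getNodeSet(k,'//a[contains(@class,"greylink")]/text()')

# Car models (including those on sale). This XPath can return the same car four times; I have not found a better XPath
model<-getNodeSet(k,'//li/h4/a/text()')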
class(model)

## [1] "XMLNodeSet"

# Convert the XMLNodeSet to a character vector
a00 <- sapply(model, xmlValue)
class(a00)

## [1] "character"

a00 <- as.data.frame(a00)
a00$tips <- rep("a00/", length(a00$a00))

# Rename the columns
a00 <- rename(a00, c(a00="model", tips="tips"))
data1 <- a00
write.csv(a00, file = "E:\\新技术\\爬虫\\汽车之家/微型车.csv")

#2)################ URLs for each vehicle class #######################
# Microcar http://www.autohome.com.cn/a00/
# Small car http://www.autohome.com.cn/a0/
# Compact car http://www.autohome.com.cn/a/
# Mid-size car http://www.autohome.com.cn/b/
# Mid-to-large car http://www.autohome.com.cn/c/
# Luxury car http://www.autohome.com.cn/d/
# MPV http://www.autohome.com.cn/mpv/
# Sports car http://www.autohome.com.cn/s/
# Pickup http://www.autohome.com.cn/p/
# Microvan http://www.autohome.com.cn/mb/
# Light bus http://www.autohome.com.cn/qk/
# Small SUV http://www.autohome.com.cn/suva0/
# Compact SUV http://www.autohome.com.cn/suva/
# Mid-size SUV http://www.autohome.com.cn/suvb/
# Mid-to-large SUV http://www.autohome.com.cn/suvc/
# Full-size SUV http://www.autohome.com.cn/suvd/

series<-c("a0/","a/","b/","c/","d/","mpv/","s/","p/","mb/","qk/","suva0/","suva/","suvb/","suvc/","suvd/")

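# Build urllist; if this were written as a function, urllist would probably not be needed
urllist <- character(length(series))
for(i in seq_along(series)){
  url <- "http://www.autohome.com.cn/"
  # paste0() has no sep argument, so the stray sep="" is dropped
  urllist[i] <- paste0(url, series[i])
}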
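Because `paste0` is vectorized, the same list of URLs can be built without an explicit loop. A minimal sketch, reusing the base URL and the `series` suffixes from the post (the name `base_url` is introduced here for illustration):

```r
# Base URL and class suffixes, as used in the post
base_url <- "http://www.autohome.com.cn/"
series <- c("a0/", "a/", "b/", "c/", "d/", "mpv/", "s/", "p/", "mb/", "qk/",
            "suva0/", "suva/", "suvb/", "suvc/", "suvd/")

# paste0() recycles the single base URL across every suffix
urllist <- paste0(base_url, series)
```

The result is a character vector with one URL per vehicle class, in the same order as `series`.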

# Scraping loop
for (i in seq_along(series)){
  url<-paste0("http://www.autohome.com.cn/",series[i])
  temp<-getURL(url,httpheader=myheader,.encoding="gb2312")
  temp1<-iconv(temp,"gb2312","UTF-8") # transcode
  k<-htmlParse(temp1,asText=TRUE,encoding="UTF-8") # parse the page as UTF-8
  model<-getNodeSet(k,'//li/h4/a/text()')
  table<-sapply(model,xmlValue) # convert the XMLNodeSet to a character vector
  table<-as.data.frame(table)
  table$tips<-rep(series[i],length(table$table))
  table<-rename(table,c(table="model",tips="tips")) # rename the columns
  data2<-table
  data1<-rbind(data1,data2) # append to the microcar rows collected in step 1
}
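The loop above could also be written as a function, as the earlier comment about `urllist` suggests. The sketch below is one possible shape, not the author's code: `collect_models` and `fake_fetch` are hypothetical names, and the fetching step is passed in as a function so the example runs offline. In real use, the fetcher would wrap the `getURL`, `iconv`, `htmlParse` and `getNodeSet` calls from the loop.

```r
# Hypothetical helper: `fetch_models` is any function that takes a class
# suffix (e.g. "a0/") and returns a character vector of model names.
collect_models <- function(series, fetch_models) {
  pieces <- lapply(series, function(s) {
    models <- fetch_models(s)
    data.frame(model = models,
               tips = rep(s, length(models)),
               stringsAsFactors = FALSE)
  })
  # Bind all per-class data frames in one step instead of growing data1
  do.call(rbind, pieces)
}

# Offline stand-in for the real scraper; the model names are placeholders
fake_fetch <- function(s) paste0(s, c("car1", "car2"))
result <- collect_models(c("a0/", "a/"), fake_fetch)
```

Collecting the pieces in a list and binding once avoids repeatedly copying `data1` inside the loop, which matters little for 15 pages but keeps the code easier to test.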

# De-duplicate the results after exporting them.
write.csv(data1, file="E:\\新技术\\爬虫\\汽车之家/auto全车型.csv")
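The de-duplication mentioned above can also be done in R before exporting. A minimal sketch using base R's `unique`, with placeholder rows standing in for the scraped `data1` (the model names here are invented for illustration):

```r
# Placeholder data shaped like data1: a model column and a class suffix column
data1 <- data.frame(
  model = c("A", "A", "B", "A"),
  tips = c("a00/", "a00/", "a00/", "a0/"),
  stringsAsFactors = FALSE
)

# unique() on a data frame drops rows that repeat an earlier row exactly
deduped <- unique(data1)
```

A model that appears under two different classes is kept once per class, because the rows differ in the `tips` column.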

Author: junjun, April 20, 2016. Reference: http://blog.sina.com.cn/s/blog_6f2336820102v13n.html