R Notes: Web Page Text Analysis

weixin_34082854

Original link: http://www.cnblogs.com/zhoufan/p/5111996.html

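These R notes grew out of reading 《R语言与网站分析》 (R and Website Analysis) and putting its methods to use.

I am posting them in the hope of finding others to study and improve with.

Web scraping and text analysis have always seemed very cool, but once the code gets long, all sorts of problems inevitably creep in.

Below are a few working examples, for reference only.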

###Example 1: warm-up, scrape the price field from the page below (Chinese text is not read correctly)####
library(XML)
url<-"http://product.dangdang.com/1310721937.html#ddclick?act=click&pos=1310721937_3_1_m&cat=4008120&key=&qinfo=&pinfo=&minfo=4219_9_58&ninfo=&custid=&permid=20160105154317363605090772959201669&ref=&rcount=&type=&t=1451984626000"
url.html<-htmlParse(url,encoding="utf-8")
url.html
url.xpath<-getNodeSet(url.html,"//*[@class='d_price']/span[@id='promo_price']")
##Study P396 of the textbook carefully; to find the XPath, press F12 in the browser, open Elements, and use the element picker (magnifier icon)###
url.xpath
price<-xmlValue(url.xpath[[1]])
#drop the leading currency symbol, then convert to a number
price<-as.numeric(substr(price,2,nchar(price)))
price
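
Example 1 removes the first character of the price string with substr, which assumes the string starts with exactly one currency symbol. A slightly more defensive sketch using only base R follows; the helper name parse_price and the sample strings are hypothetical and are not taken from the page being scraped.

```r
# Hypothetical helper: keep only digits and the decimal point, then convert.
# This also tolerates surrounding whitespace and thousands separators.
parse_price <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

p1 <- parse_price("\u00a545.80")        # yen/yuan sign followed by the amount
p2 <- parse_price(" \u00a51,299.00 ")   # extra spaces and a thousands separator
p1
p2
```

Note that this sketch would misread a string containing more than one number, so it only suits fields known to hold a single price.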

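###Example 2: a sample case found online###
url1<-"http://data.caixin.com/macro/macro_indicator_more.html?id=F0001&cpage=2&pageSize=30&url=macro_indicator_more.html#top";
url<-htmlParse(url1,encoding="UTF-8")
test <- getNodeSet(url,'//meta[@name]')#locate HTML elements with XPath syntax; Chinese text displays correctly here
#Read a node's content: xmlValue takes a single node, not a node list
test_text_list<-sapply(test, xmlValue)#extract the content; multiple results are stored as a vector
test_text<-xmlValue(test[[2]])#extract the content of the 2nd node, same as test_text_list[2]. Note: even if test holds only one node, use test[[1]]; test itself is a list of nodes and cannot be passed directly
#Read a node's attribute: xmlGetAttr also takes a single node
content1<-xmlGetAttr(test[[1]], "content")#read the content attribute of test[[1]]. Note: passing test directly does not work. Chinese text displays incorrectly here
content1<-iconv(content1,"UTF-8","gbk")#convert from UTF-8 to GBK so the Chinese text displays correctly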
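
The difference between a node's text content and its attributes in Example 2 is easier to see on a page that needs no network access. The sketch below parses a small inline HTML string (made-up content, not the caixin page) and reads the content attribute of each meta node; it assumes the XML package is installed.

```r
library(XML)

# A tiny inline page (made-up content) so the example runs offline.
html <- '<html><head>
<meta name="keywords" content="macro, indicator"/>
<meta name="description" content="sample page"/>
<title>Demo</title></head><body><p>hello</p></body></html>'

doc <- htmlParse(html, asText = TRUE)
metas <- getNodeSet(doc, "//meta[@name]")

# getNodeSet returns a list of nodes, so apply the accessor to each element.
contents <- sapply(metas, xmlGetAttr, "content")
contents

# xpathSApply combines the XPath query and the per-node call in one step.
contents2 <- xpathSApply(doc, "//meta[@name]", xmlGetAttr, "content")

# A <meta> element has no text between its tags, so its data lives in
# attributes; xmlValue is the right tool for elements such as <p> or <title>.
title_text <- xmlValue(getNodeSet(doc, "//title")[[1]])
title_text
```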

###Example 3: textbook P396-397###
url<-"http://category.dangdang.com/cid4008120-pg1.html"
url.html<-htmlParse(url,encoding="UTF-8")
xpath<-"//*[@name='lb']/div[@class='inner']/a"
url.node<-getNodeSet(url.html,xpath)
xmlGetAttr(url.node[[1]],'href')

####Main part: completed####
library(XML)
#build the URLs of the 9 category listing pages
url<-c()
for(i in 1:9){
  url<-c(url,paste("http://category.dangdang.com/cid4008120-pg",i,".html",sep=""))
}
#collect the product-page links from every listing page
read_url<-function(url){
  url_vector<-c()
  for(i_url in url){
    i_url.html<-htmlParse(i_url,encoding="UTF-8")
    url.xpath<-getNodeSet(i_url.html,"//*[@name='lb']/div[@class='inner']/a")
    #iterating over the nodes directly is safe even when nothing matched
    for(node in url.xpath){
      url_vector<-c(url_vector,xmlGetAttr(node,'href'))
    }
  }
  url_vector
}
urls<-read_url(url)
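
The listing-page URLs in the main part follow a fixed pattern, so the same vector can also be built without a loop. The following is a minimal sketch using base R only; the variable names base, page_urls and page_urls2 are introduced here for illustration.

```r
# paste0 is vectorized: the page numbers 1:9 are recycled against the fixed parts.
base <- "http://category.dangdang.com/cid4008120-pg"
page_urls <- paste0(base, 1:9, ".html")

# sprintf gives the same result with the pattern written in one piece.
page_urls2 <- sprintf("http://category.dangdang.com/cid4008120-pg%d.html", 1:9)

identical(page_urls, page_urls2)
page_urls[1]
```

Growing a vector with c() inside a loop copies it on every iteration, which is harmless for 9 pages but becomes slow for large crawls; the vectorized form avoids that.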

#visit each product page and collect its id and price
read_xml<-function(urls){
  n<-min(99,length(urls))#process at most 99 product pages
  id_vector<-character(n)
  price_vector<-character(n)
  for(i in seq_len(n)){
    url.html<-htmlParse(urls[i],encoding="UTF-8")
    id.xpath<-getNodeSet(url.html,"//*[@id='stock_span']")
    id_vector[i]<-xmlGetAttr(id.xpath[[1]],'prd_id')
    price.xpath<-getNodeSet(url.html,"//*[@id='salePriceTag']")
    #price.xpath<-getNodeSet(url.html,"//*[@id='promo_price']")
    price_vector[i]<-xmlGetAttr(price.xpath[[1]],'ddsp')
  }
  data.frame(id=id_vector,price=price_vector)
}
zzz<-read_xml(urls)
zzz
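
In a crawl over many product pages, a single page with a missing node or a failed download stops the whole loop. One way to keep going is to wrap the per-page step in tryCatch and record NA for the pages that fail. The sketch below shows only that pattern: fetch_price is a hypothetical stand-in for the real scraping step, and the example.com URLs are placeholders, so it runs without network access.

```r
# Hypothetical stand-in for the per-page scraping step; it fails for one input
# to imitate a page where the price node is missing.
fetch_price <- function(u) {
  if (grepl("bad", u)) stop("no price node found")
  "45.80"
}

# Return NA instead of aborting when a single page fails.
safe_fetch <- function(u) {
  tryCatch(fetch_price(u), error = function(e) NA_character_)
}

demo_urls <- c("http://example.com/ok1",
               "http://example.com/bad",
               "http://example.com/ok2")
prices <- vapply(demo_urls, safe_fetch, character(1), USE.NAMES = FALSE)
result <- data.frame(url = demo_urls, price = as.numeric(prices))
result
```

The rows with NA can be inspected or retried afterwards, which is usually cheaper than restarting the crawl from the first page.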

Reposted from: https://www.cnblogs.com/zhoufan/p/5111996.html
