How to Read and Compute on a Large Text Data File

【Question】

I am new to Revolution R, so I have a basic question. I am trying to open a large (13 GB) CSV file. It is a dataset from a Kaggle competition.

R is not able to open it, so I turned to Revolution R Enterprise. Can you please help me read a CSV file on my system, convert it to XDF format, and load it in Revolution R Enterprise for further analysis?

My file path is "C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv".

I tried something like this but got an error.

sampleDataDir <- rxGetOption("Kaggle")  
inputFile <- file.path("C:\\Users\\admin\\Desktop\\Kaggle\\dog_1_both_marked.csv", "dog_1_both_marked.csv")  
outputFile <- file.path(tempdir(), "basicClaims.xdf")  
rxTextToXdf(inFile = inputFile, outFile = outputFile, overwrite = TRUE)  
rxGetInfo(data = outputFile, getVarInfo = TRUE, numRows = 100000)  
file.remove(outputFile)

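【Answer】

R can read a large file in chunks and can also process it in parallel, but the code is cumbersome and performs very poorly. R's strength is mathematical and statistical computation; for structured operations on a large text file like this one it is not a good tool, and SPL (the scripting language of esProc) is more convenient. For example: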

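1. Open the large text file with a cursor (@t reads the first line as field names, @c uses a comma as the separator):

    A
1   =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc()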
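For readers who do not use SPL, the cursor idea (pull a bounded batch of rows at a time instead of loading the whole file) can be sketched in Python with only the standard library. This is an illustration under assumptions, not the SPL implementation: `iter_batches`, the batch size, and the sample columns are hypothetical, and the in-memory sample stands in for the real 13 GB file.

```python
import csv
import io
from itertools import islice

def iter_batches(text_stream, batch_size):
    """Yield lists of up to batch_size rows (as dicts) from a CSV stream."""
    reader = csv.DictReader(text_stream)  # the first line supplies the field names
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch

# Tiny in-memory stand-in; real use would pass open(path, newline="") instead.
sample = io.StringIO("ID,GENDER\n1,F\n2,M\n3,F\n")
batch_sizes = [len(batch) for batch in iter_batches(sample, 2)]
print(batch_sizes)
```

Only one batch is held in memory at a time, so memory use depends on `batch_size` rather than on the file size.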

2. Filter:

    A
1   =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc()
2   =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F")
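The same filter can be expressed as a streaming pass in Python, shown here as a rough sketch. It assumes `BIRTHDAY` is stored in ISO format (`YYYY-MM-DD`); the `NAME` column, `select_rows`, and the sample rows are hypothetical and exist only for the illustration.

```python
import csv
import io
from datetime import date

def select_rows(text_stream, predicate):
    """Lazily yield the CSV rows (as dicts) for which predicate(row) is true."""
    for row in csv.DictReader(text_stream):
        if predicate(row):
            yield row

def born_since_1981_and_female(row):
    # Assumes ISO-formatted dates; other formats would need datetime.strptime.
    born = date.fromisoformat(row["BIRTHDAY"])
    return born >= date(1981, 1, 1) and row["GENDER"] == "F"

sample = io.StringIO(
    "NAME,BIRTHDAY,GENDER\n"
    "Ann,1985-03-02,F\n"
    "Bob,1990-07-15,M\n"
    "Cat,1975-11-30,F\n"
)
matches = [row["NAME"] for row in select_rows(sample, born_since_1981_and_female)]
print(matches)
```

Because `select_rows` is a generator, rows that fail the test are discarded as they are read and never accumulate in memory.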

3. Group and aggregate:

    A
1   =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc()
2   =A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary)
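A grouped count and sum can also be computed in a single streaming pass. The Python sketch below keeps one running entry per group, so it assumes the number of distinct `DEPT` values is small enough to fit in memory; `group_count_and_sum` and the sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict

def group_count_and_sum(text_stream, key_field, value_field):
    """Return {key: (row_count, value_sum)} computed in one pass over the CSV."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [row count, running sum]
    for row in csv.DictReader(text_stream):
        entry = totals[row[key_field]]
        entry[0] += 1
        entry[1] += float(row[value_field])
    # Sort by key so the result order does not depend on input order.
    return {key: (count, total) for key, (count, total) in sorted(totals.items())}

sample = io.StringIO("DEPT,SALARY\nSales,100\nHR,80\nSales,50\n")
result = group_count_and_sum(sample, "DEPT", "SALARY")
print(result)
```

Only the per-group totals are retained, never the rows themselves, which is what makes this workable on a file far larger than memory.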

4. Sort:

    A
1   =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc()
2   =A1.sortx(BIRTHDAY)
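Sorting a file that does not fit in memory is usually done as an external merge sort: sort bounded chunks, write each one to a temporary file, then merge the sorted runs. The Python sketch below illustrates that general technique and makes no claim about how SPL implements it internally. `external_sort`, `chunk_rows`, and the sample columns are hypothetical, and the key is compared as a string, which matches chronological order only for ISO-formatted dates.

```python
import csv
import heapq
import io
import tempfile
from itertools import islice

def external_sort(text_stream, key_field, chunk_rows):
    """Yield CSV rows (as dicts) ordered by key_field, holding one chunk in memory."""
    reader = csv.DictReader(text_stream)
    fields = reader.fieldnames
    runs = []
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            break
        chunk.sort(key=lambda row: row[key_field])
        run = tempfile.TemporaryFile("w+", newline="")  # one sorted run on disk
        csv.DictWriter(run, fieldnames=fields).writerows(chunk)
        run.seek(0)
        runs.append(run)
    try:
        readers = [csv.DictReader(run, fieldnames=fields) for run in runs]
        # heapq.merge lazily merges the already-sorted runs.
        yield from heapq.merge(*readers, key=lambda row: row[key_field])
    finally:
        for run in runs:
            run.close()

sample = io.StringIO(
    "NAME,BIRTHDAY\n"
    "Ann,1985-03-02\n"
    "Bob,1990-07-15\n"
    "Cat,1975-11-30\n"
    "Dan,1980-01-01\n"
)
ordered = [row["NAME"] for row in external_sort(sample, "BIRTHDAY", 2)]
print(ordered)
```

Memory use is bounded by `chunk_rows` during the first phase and by one row per run during the merge.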

··· ···

For details, see the [Text Data] section of the esProc tutorial.
