transwarpR Research Notes (work in progress)

I have been learning transwarpR. At first I tried to call the open-source SparkR functions directly and found they cannot be used as-is, which was quite confusing. Below is a record of what I have learned so far.

kmeans

Ran successfully!

Step 1: Create the HDFS path and upload the file
$ hdfs dfs -mkdir /user/Rtest/kmeans                     # create the HDFS directory
$ hdfs dfs -put data/kmeans.txt /user/Rtest/kmeans/      # upload the local file to HDFS

Step 2: Create an external table over the file on HDFS
$ transwarp -t -h <inceptor-server IP>
> create external table kmeans(F1 string, F2 string, F3 string, F4 string, F5 string,
F6 string, F7 string) row format delimited fields terminated by ' ' location
'/user/Rtest/kmeans';

Step 3: Run the kmeans algorithm (either directly from the HDFS file or via SQL; the HDFS-file route is shown here)
> km <- txTextFile(sc, path="/user/Rtest/kmeans/kmeans.txt", 10)
> txKmeans(inputData=km, centers=10, iter.max=5, nstart=1, sep=" ")
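
As a side note, the input file from Step 1 needs seven space-separated numeric columns to match the external table (F1..F7) defined in Step 2. A minimal sketch, using random toy data and the same data/kmeans.txt name, of generating such a file locally in R:

# Sketch only: write a toy 7-column, space-delimited numeric file
# matching the external table definition (F1..F7); values are random.
set.seed(1)
m <- matrix(round(runif(100 * 7), 4), ncol = 7)
writeLines(apply(m, 1, paste, collapse = " "), "data/kmeans.txt")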

SVM

SVM training

The transwarpR SVM call kept failing because of the data format: the first column must contain a 0/1 label.

# transwarpR SVM wrong!
svmdata <- txTextFile(sc, path='/tmp/datascaled.txt', minSplits=10)  # datascaled: features scaled to [0,1], no 0/1 label column
txSVM(inputData=svmdata, sep=" ", iter.max=10)

The error output:

15/04/13 14:01:49 INFO FileInputFormat: Total input paths to process : 1
15/04/13 14:01:49 INFO SharkContext: Starting job: first at GeneralizedLinearAlgorithm.scala:121
15/04/13 14:01:49 INFO SharkContext: Job finished: first at GeneralizedLinearAlgorithm.scala:121, took 0.010427414 s
15/04/13 14:01:49 INFO SharkContext: Starting job: count at DataValidators.scala:37
15/04/13 14:01:49 INFO SharkContext: Job finished: count at DataValidators.scala:37, took 0.273154369 s
15/04/13 14:01:49 ERROR DataValidators: Classification labels should be 0 or 1. Found 100 invalid labels
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  org.apache.spark.SparkException: Input validation failed.
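
The DataValidators message points at the label column. A quick sanity check, assuming the file is also readable from the local filesystem (it may live only on HDFS, in which case fetch it with hdfs dfs -get first), could look like this:

# Sketch: verify that the first space-separated field of every line is a
# 0/1 label, which is exactly what DataValidators rejects above.
first_field <- vapply(strsplit(readLines("/tmp/datascaled.txt"), " "),
                      function(f) f[1], character(1))
table(first_field %in% c("0", "1"))   # any FALSE means invalid labels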

The source of txSVM:

> txSVM
function (inputData, sep, iter.max = .Machine$integer.max) 
{
    require(SparkR)
    res <- inSVM(inputData, sep, iter.max)
    structure(list(intercept = res$intercept, weight = res$weight), 
        class = "txSVM")
}
<environment: namespace:SparkR>

It calls inSVM, but I could not find that function.
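
inSVM is presumably an unexported function inside the (modified) SparkR namespace, so it does not show up with a plain name lookup. If it is defined there, the usual base R tools can locate it:

getAnywhere("inSVM")          # search attached packages and loaded namespaces
SparkR:::inSVM                # access the unexported object directly, if it exists
ls(getNamespace("SparkR"))    # list everything defined in the SparkR namespace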

As noted above, the failure turned out to be a data-format problem: the first column must be a 0/1 label. With corrected data, the test succeeded.
The format looks like this:

[ruser01@host05 data]$ head svmtraindata.txt
0 0.0535077518444709 0.075163698302942 0.0356734429963183 0.0649726646754262 0.0802592151166999 0.048412235030713 0.0152913757412867 0.075163698302942 0.050959993437592 0.0675204230823051 0.0789853359132604 0.050959993437592 0.0152913757412867 0.075163698302942 0.0458644766238341 0.0675204230823051 0.0815330943201394
1 0.075163698302942 0.0356734429963183 0.0649726646754262 0.0802592151166999 0.0891763695407762 0.0152913757412867 0.075163698302942 0.050959993437592 0.0675204230823051 0.0789853359132604 0.0891763695407762 0.0152913757412867 0.075163698302942 0.0458644766238341 0.0675204230823051 0.0815330943201394 0.0891763695407762
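
For reference, here is a minimal sketch (random toy data; 17 features to match the sample rows and the weight vector reported below) of producing a file in this "label feature1 ... feature17" layout; the working transwarpR call then follows:

# Sketch only: toy training file with a 0/1 label in column 1 followed by
# space-separated features scaled to [0, 1].
set.seed(42)
n <- 1000; p <- 17
labels   <- sample(0:1, n, replace = TRUE)
features <- round(matrix(runif(n * p), nrow = n), 6)
writeLines(apply(cbind(labels, features), 1, paste, collapse = " "),
           "svmtraindata.txt")   # then put it where txTextFile can read it
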
# transwarpR SVM ok!
svmdata <- txTextFile(sc, path='/tmp/svmtraindata.txt', minSplits=10)  # first column is a 0/1 label, features scaled to [0,1]
txSVM(inputData=svmdata, sep=" ", iter.max=10)

Result (the transcript below first shows one more failed attempt on svmtestdata01.txt, whose labels are still not 0/1, and then the successful run with svmtraindata.txt):

> svmmodel <- txSVM(inputData=svmdata,sep=" ",iter.max=10)
15/04/13 17:46:03 INFO SharkContext: Starting job: first at GeneralizedLinearAlgorithm.scala:121
15/04/13 17:46:03 INFO SharkContext: Job finished: first at GeneralizedLinearAlgorithm.scala:121, took 0.011357668 s
15/04/13 17:46:03 INFO SharkContext: Starting job: count at DataValidators.scala:37
15/04/13 17:46:03 INFO SharkContext: Job finished: count at DataValidators.scala:37, took 0.106446621 s
15/04/13 17:46:03 ERROR DataValidators: Classification labels should be 0 or 1. Found 990 invalid labels
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  org.apache.spark.SparkException: Input validation failed.
> svmdata <- txTextFile(sc, path='/tmp/svmtraindata.txt', minSplits =10)  # first column is a 0/1 label
> svmmodel <- txSVM(inputData=svmdata,sep=" ",iter.max=10)
15/04/13 17:46:10 INFO FileInputFormat: Total input paths to process : 1
15/04/13 17:46:10 INFO SharkContext: Starting job: first at GeneralizedLinearAlgorithm.scala:121
15/04/13 17:46:10 INFO SharkContext: Job finished: first at GeneralizedLinearAlgorithm.scala:121, took 0.008440158 s
15/04/13 17:46:10 INFO SharkContext: Starting job: count at DataValidators.scala:37
15/04/13 17:46:11 INFO SharkContext: Job finished: count at DataValidators.scala:37, took 0.201416346 s
15/04/13 17:46:11 INFO SharkContext: Starting job: count at GradientDescent.scala:147
15/04/13 17:46:11 INFO SharkContext: Job finished: count at GradientDescent.scala:147, took 0.088323363 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.085295251 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.086834834 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.096076692 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.08626758 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.087121957 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.084390574 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.08545617 s
15/04/13 17:46:12 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:12 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.08593105 s
15/04/13 17:46:12 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:12 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.083459259 s
15/04/13 17:46:12 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:12 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.083601028 s
The model is:
$intercept
[1] 0.01536595

$weight
 [1] 0.008492136 0.012761007 0.013263176 0.015358918 0.008434623 0.014440761 0.012009752
 [8] 0.012496932 0.011796871 0.016320325 0.008644178 0.014011241 0.010635220 0.012469466
[15] 0.011434735 0.015619591 0.008198688
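
The model is just an intercept plus a weight vector. Assuming the usual linear SVM decision rule (predict 1 when weight . x + intercept >= 0), which is what the MLlib GeneralizedLinearAlgorithm/GradientDescent classes in the log suggest, a single row could be scored by hand with a hypothetical helper like the one below; txSVMPredict in the next section is of course the supported way.

# Sketch, not transwarpR API: score one 17-feature row with the linear
# decision rule label = 1 if sum(weight * x) + intercept >= 0, else 0.
score_row <- function(model, x) {
  as.integer(sum(model$weight * x) + model$intercept >= 0)
}
# e.g. with the features (label dropped) of the first line of svmtraindata.txt:
# x <- as.numeric(strsplit(readLines("svmtraindata.txt", n = 1), " ")[[1]])[-1]
# score_row(svmmodel, x)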

SVM prediction

# train and predict on the same data
svmdata <- txTextFile(sc, path='/tmp/svmtestdata.txt', minSplits =10)
txSVMPredict(model=svmmodel,inputData=svmdata,sep=" ",outputFilePath="./svm/svmpredictdata")

Result:

15/04/13 17:48:39 INFO FileInputFormat: Total input paths to process : 1
15/04/13 17:48:39 INFO SharkContext: Starting job: count at NativeMethodAccessorImpl.java:-2
15/04/13 17:48:39 INFO SharkContext: Job finished: count at NativeMethodAccessorImpl.java:-2, took 0.285311085 s
15/04/13 17:48:39 INFO SharkContext: Starting job: collect at NativeMethodAccessorImpl.java:-2
15/04/13 17:48:39 INFO SharkContext: Job finished: collect at NativeMethodAccessorImpl.java:-2, took 0.037396875 s
The label: 
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [42] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [83] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[124] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[165] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[206] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[247] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[288] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[329] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[370] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[411] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[452] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[493] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[534] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[575] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[616] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[657] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[698] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[739] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[780] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[821] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[862] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[903] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[944] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[985] 1 1 1 1 1 1 1 1
Start to write result to ./svm/svmpredictdata 
15/04/13 17:48:40 INFO SharkContext: Starting job: saveAsTextFile at NativeMethodAccessorImpl.java:-2
15/04/13 17:48:40 INFO SharkContext: Job finished: saveAsTextFile at NativeMethodAccessorImpl.java:-2, took 0.834582747 s
Writing Finished.
$label
[1] "Java-Object{MappedRDD[353] at map at GeneralizedLinearAlgorithm.scala:62}"

attr(,"class")
[1] "txSVMPredict"

Since the input was random data, the results are not very meaningful, but it ran!
