transwarpR Research Notes (work in progress)

I have been learning transwarpR. At first I tried to call the open-source SparkR functions directly and found they cannot be used as-is, which was quite confusing. Below is a record of what I have learned so far.

kmeans

Ran successfully!

Step 1: Create the HDFS path and upload the file
$ hdfs dfs -mkdir /user/Rtest/kmeans                     # create the HDFS directory
$ hdfs dfs -put data/kmeans.txt /user/Rtest/kmeans/      # upload the local file to HDFS

Step 2: Create an external table over the file on HDFS
$ transwarp -t -h <inceptor-server IP>
> create external table kmeans(F1 string, F2 string, F3 string, F4 string, F5 string,
F6 string, F7 string) row format delimited fields terminated by ' ' location
'/user/Rtest/kmeans';

Step 3: Run the kmeans algorithm (either directly from the HDFS file or via SQL; the HDFS-file route is shown here)
> km <- txTextFile(sc, path="/user/Rtest/kmeans/kmeans.txt", 10)
> txKmeans(inputData=km, centers=10, iter.max=5, nstart=1, sep=" ")
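
As a side note, the input file from Step 1 needs seven space-separated numeric columns to match the external table (F1..F7) defined in Step 2. A minimal sketch, using random toy data and the same data/kmeans.txt name, of generating such a file locally in R:

# Sketch only: write a toy 7-column, space-delimited numeric file
# matching the external table definition (F1..F7); values are random.
set.seed(1)
m <- matrix(round(runif(100 * 7), 4), ncol = 7)
writeLines(apply(m, 1, paste, collapse = " "), "data/kmeans.txt")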

SVM

SVM training

The transwarpR SVM call kept failing because of the data format: the first column must contain a 0/1 label.

# transwarpR SVM wrong!
svmdata <- txTextFile(sc, path='/tmp/datascaled.txt', minSplits=10)  # datascaled: features scaled to [0,1], no 0/1 label column
txSVM(inputData=svmdata, sep=" ", iter.max=10)

The error output:

15/04/13 14:01:49 INFO FileInputFormat: Total input paths to process : 1
15/04/13 14:01:49 INFO SharkContext: Starting job: first at GeneralizedLinearAlgorithm.scala:121
15/04/13 14:01:49 INFO SharkContext: Job finished: first at GeneralizedLinearAlgorithm.scala:121, took 0.010427414 s
15/04/13 14:01:49 INFO SharkContext: Starting job: count at DataValidators.scala:37
15/04/13 14:01:49 INFO SharkContext: Job finished: count at DataValidators.scala:37, took 0.273154369 s
15/04/13 14:01:49 ERROR DataValidators: Classification labels should be 0 or 1. Found 100 invalid labels
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  org.apache.spark.SparkException: Input validation failed.
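
The DataValidators message points at the label column. A quick sanity check, assuming the file is also readable from the local filesystem (it may live only on HDFS, in which case fetch it with hdfs dfs -get first), could look like this:

# Sketch: verify that the first space-separated field of every line is a
# 0/1 label, which is exactly what DataValidators rejects above.
first_field <- vapply(strsplit(readLines("/tmp/datascaled.txt"), " "),
                      function(f) f[1], character(1))
table(first_field %in% c("0", "1"))   # any FALSE means invalid labels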

The source of txSVM:

> txSVM
function (inputData, sep, iter.max = .Machine$integer.max) 
{
    require(SparkR)
    res <- inSVM(inputData, sep, iter.max)
    structure(list(intercept = res$intercept, weight = res$weight), 
        class = "txSVM")
}
<environment: namespace:SparkR>

It calls inSVM, but I could not find that function.
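
inSVM is presumably an unexported function inside the (modified) SparkR namespace, so it does not show up with a plain name lookup. If it is defined there, the usual base R tools can locate it:

getAnywhere("inSVM")          # search attached packages and loaded namespaces
SparkR:::inSVM                # access the unexported object directly, if it exists
ls(getNamespace("SparkR"))    # list everything defined in the SparkR namespace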

As noted above, the failure turned out to be a data-format problem: the first column must be a 0/1 label. With corrected data, the test succeeded.
The format looks like this:

[ruser01@host05 data]$ head svmtraindata.txt
0 0.0535077518444709 0.075163698302942 0.0356734429963183 0.0649726646754262 0.0802592151166999 0.048412235030713 0.0152913757412867 0.075163698302942 0.050959993437592 0.0675204230823051 0.0789853359132604 0.050959993437592 0.0152913757412867 0.075163698302942 0.0458644766238341 0.0675204230823051 0.0815330943201394
1 0.075163698302942 0.0356734429963183 0.0649726646754262 0.0802592151166999 0.0891763695407762 0.0152913757412867 0.075163698302942 0.050959993437592 0.0675204230823051 0.0789853359132604 0.0891763695407762 0.0152913757412867 0.075163698302942 0.0458644766238341 0.0675204230823051 0.0815330943201394 0.0891763695407762
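
For reference, here is a minimal sketch (random toy data; 17 features to match the sample rows and the weight vector reported below) of producing a file in this "label feature1 ... feature17" layout; the working transwarpR call then follows:

# Sketch only: toy training file with a 0/1 label in column 1 followed by
# space-separated features scaled to [0, 1].
set.seed(42)
n <- 1000; p <- 17
labels   <- sample(0:1, n, replace = TRUE)
features <- round(matrix(runif(n * p), nrow = n), 6)
writeLines(apply(cbind(labels, features), 1, paste, collapse = " "),
           "svmtraindata.txt")   # then put it where txTextFile can read it
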
# transwarpR SVM ok!
svmdata <- txTextFile(sc, path='/tmp/svmtraindata.txt', minSplits=10)  # first column is a 0/1 label, features scaled to [0,1]
txSVM(inputData=svmdata, sep=" ", iter.max=10)

Result (the transcript below first shows one more failed attempt on svmtestdata01.txt, whose labels are still not 0/1, and then the successful run with svmtraindata.txt):

> svmmodel <- txSVM(inputData=svmdata,sep=" ",iter.max=10)
15/04/13 17:46:03 INFO SharkContext: Starting job: first at GeneralizedLinearAlgorithm.scala:121
15/04/13 17:46:03 INFO SharkContext: Job finished: first at GeneralizedLinearAlgorithm.scala:121, took 0.011357668 s
15/04/13 17:46:03 INFO SharkContext: Starting job: count at DataValidators.scala:37
15/04/13 17:46:03 INFO SharkContext: Job finished: count at DataValidators.scala:37, took 0.106446621 s
15/04/13 17:46:03 ERROR DataValidators: Classification labels should be 0 or 1. Found 990 invalid labels
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  org.apache.spark.SparkException: Input validation failed.
> svmdata <- txTextFile(sc, path='/tmp/svmtraindata.txt', minSplits =10)  # first column is a 0/1 label
> svmmodel <- txSVM(inputData=svmdata,sep=" ",iter.max=10)
15/04/13 17:46:10 INFO FileInputFormat: Total input paths to process : 1
15/04/13 17:46:10 INFO SharkContext: Starting job: first at GeneralizedLinearAlgorithm.scala:121
15/04/13 17:46:10 INFO SharkContext: Job finished: first at GeneralizedLinearAlgorithm.scala:121, took 0.008440158 s
15/04/13 17:46:10 INFO SharkContext: Starting job: count at DataValidators.scala:37
15/04/13 17:46:11 INFO SharkContext: Job finished: count at DataValidators.scala:37, took 0.201416346 s
15/04/13 17:46:11 INFO SharkContext: Starting job: count at GradientDescent.scala:147
15/04/13 17:46:11 INFO SharkContext: Job finished: count at GradientDescent.scala:147, took 0.088323363 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.085295251 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.086834834 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.096076692 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.08626758 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.087121957 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.084390574 s
15/04/13 17:46:11 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:11 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.08545617 s
15/04/13 17:46:12 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:12 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.08593105 s
15/04/13 17:46:12 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:12 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.083459259 s
15/04/13 17:46:12 INFO SharkContext: Starting job: reduce at GradientDescent.scala:162
15/04/13 17:46:12 INFO SharkContext: Job finished: reduce at GradientDescent.scala:162, took 0.083601028 s
The model is:
$intercept
[1] 0.01536595

$weight
 [1] 0.008492136 0.012761007 0.013263176 0.015358918 0.008434623 0.014440761 0.012009752
 [8] 0.012496932 0.011796871 0.016320325 0.008644178 0.014011241 0.010635220 0.012469466
[15] 0.011434735 0.015619591 0.008198688
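
The model is just an intercept plus a weight vector. Assuming the usual linear SVM decision rule (predict 1 when weight . x + intercept >= 0), which is what the MLlib GeneralizedLinearAlgorithm/GradientDescent classes in the log suggest, a single row could be scored by hand with a hypothetical helper like the one below; txSVMPredict in the next section is of course the supported way.

# Sketch, not transwarpR API: score one 17-feature row with the linear
# decision rule label = 1 if sum(weight * x) + intercept >= 0, else 0.
score_row <- function(model, x) {
  as.integer(sum(model$weight * x) + model$intercept >= 0)
}
# e.g. with the features (label dropped) of the first line of svmtraindata.txt:
# x <- as.numeric(strsplit(readLines("svmtraindata.txt", n = 1), " ")[[1]])[-1]
# score_row(svmmodel, x)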

SVM prediction

# train and predict on the same data
svmdata <- txTextFile(sc, path='/tmp/svmtestdata.txt', minSplits =10)
txSVMPredict(model=svmmodel,inputData=svmdata,sep=" ",outputFilePath="./svm/svmpredictdata")

Result:

15/04/13 17:48:39 INFO FileInputFormat: Total input paths to process : 1
15/04/13 17:48:39 INFO SharkContext: Starting job: count at NativeMethodAccessorImpl.java:-2
15/04/13 17:48:39 INFO SharkContext: Job finished: count at NativeMethodAccessorImpl.java:-2, took 0.285311085 s
15/04/13 17:48:39 INFO SharkContext: Starting job: collect at NativeMethodAccessorImpl.java:-2
15/04/13 17:48:39 INFO SharkContext: Job finished: collect at NativeMethodAccessorImpl.java:-2, took 0.037396875 s
The label: 
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [42] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [83] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[124] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[165] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[206] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[247] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[288] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[329] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[370] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[411] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[452] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[493] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[534] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[575] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[616] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[657] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[698] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[739] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[780] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[821] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[862] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[903] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[944] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[985] 1 1 1 1 1 1 1 1
Start to write result to ./svm/svmpredictdata 
15/04/13 17:48:40 INFO SharkContext: Starting job: saveAsTextFile at NativeMethodAccessorImpl.java:-2
15/04/13 17:48:40 INFO SharkContext: Job finished: saveAsTextFile at NativeMethodAccessorImpl.java:-2, took 0.834582747 s
Writing Finished.
$label
[1] "Java-Object{MappedRDD[353] at map at GeneralizedLinearAlgorithm.scala:62}"

attr(,"class")
[1] "txSVMPredict"

Since the input was random data, the results are not very meaningful, but it ran!
