1. Spark: Problems and Solutions
1.1 Problem 1
Spark is deployed on a remote server, and only its IP and port are known. Solution: just point .master() at that address, e.g. .master("spark://master:7077") (a raw IP works the same way as a hostname).
…I spent quite a while digging for this…; it is convenient to pull the SparkSession configuration into a separate utility class:
package MLModel;

import org.apache.spark.sql.SparkSession;

public class UtilityForSparkSession {
    public static SparkSession mySession() {
        SparkSession spark = SparkSession.builder()
                .appName("RFTest971642")
                //.master("local[*]")
                .master("spark://master:7077")
                //.config("spark.sql.warehouse.dir", "E:/Exp/")
                .getOrCreate();
        return spark;
    }
}
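Other classes then obtain the session through this helper instead of rebuilding it everywhere. A minimal usage sketch (the class name below is only for illustration):
package MLModel;

import org.apache.spark.sql.SparkSession;

public class SessionSmokeTest {
    public static void main(String[] args) {
        // Grab the shared session and confirm the connection to the standalone master.
        SparkSession spark = UtilityForSparkSession.mySession();
        System.out.println("Spark version: " + spark.version());
        spark.stop();
    }
}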
1.2 Problem 2: error "class file for scala.Serializable not found"
Error:(19, 45) java: cannot access scala.Serializable
class file for scala.Serializable not found
Solution: adding more and more dependencies to the pom did not help… In the end I took the brute-force route: copy the jars folder from the Spark installation directory (all of Spark's jar files live there) to the local machine, then add that whole folder under Project Structure -> Modules -> Dependencies -> add folder… solved.
1.3 Problem 3: NoSuchMethodError
Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.mutable.Buffer$.empty()Lscala/collection/GenTraversable;
Solution: Scala version mismatch. Roll Scala back from 2.13.0 to 2.11.12 and keep the Spark dependencies in the pom on the matching Scala version (the _2.11 artifact suffix).
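A quick way to confirm which scala-library the driver actually runs against is to read the version from the jar manifest (a sanity-check sketch; the Implementation-Version entry may be missing in some builds, in which case it prints unknown):
package MLModel;

public class ScalaVersionCheck {
    public static void main(String[] args) {
        // Reads the version recorded in the scala-library jar's manifest, if present.
        Package scalaPkg = scala.Option.class.getPackage();
        String version = (scalaPkg == null) ? null : scalaPkg.getImplementationVersion();
        System.out.println("scala-library on classpath: "
                + (version != null ? version : "unknown (no manifest version)"));
    }
}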
1.4 Problem 4: error: file not found (the training data is a local .csv file)
Solution: after some searching, the local path itself looked fine (I never found the real cause), so I simply uploaded the file to HDFS and read it from there (the complete code is at the end of this post). The upload helper is the @Test method below; the surrounding class, imports and the hdfs/conf initialization are filled in here so the snippet is self-contained, since my original notes omitted them:
package MLModel;

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Before;
import org.junit.Test;

public class HdfsUploadTest {

    private Configuration conf;
    private FileSystem hdfs;

    // Assumed setup (not shown in the original notes): point the client at the NameNode.
    @Before
    public void setUp() throws IOException {
        conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://master:9000");
        hdfs = FileSystem.get(conf);
    }

    // Upload a file from the local file system to HDFS
    @Test
    public void copyFile() throws IOException {
        // Local file path
        String localSrc = "/home/train_data.csv";
        // Target directory on HDFS; see the end of the post for how it was created
        String hdfsDst = "/ML/";
        Path src = new Path(localSrc);
        Path dst = new Path(hdfsDst);
        // The local file does not exist
        if (!(new File(localSrc)).exists()) {
            System.out.println("Error: local dir \t" + localSrc
                    + "\t not exists.");
            return;
        }
        // The HDFS target directory does not exist
        if (!hdfs.exists(dst)) {
            System.out.println("Error: dest dir \t" + dst.toUri()
                    + "\t not exists.");
            return;
        }
        String dstPath = dst.toUri() + "/" + src.getName();
        //System.out.println(dstPath);// e.g. "/test1/3931.jpg"
        // Check whether the file already exists in the HDFS directory
        if (hdfs.exists(new Path(dstPath))) {
            System.out.println("Warn: dest file \t" + dstPath
                    + "\t already exists.");
        } else {
            // Copy the local file to HDFS
            hdfs.copyFromLocalFile(src, dst);
            // List all files in the target directory
            FileStatus[] files = hdfs.listStatus(dst);
            System.out.println("Upload to \t" + conf.get("fs.default.name")
                    + hdfsDst);
            for (FileStatus file : files) {
                System.out.println(file.getPath());
            }
        }
    }
}
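With the file on HDFS, the training job reads it from there instead of the local path. A minimal sketch (the header/inferSchema options are assumptions about the CSV; adapt them to the actual file):
package MLModel;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LoadTrainingData {
    public static void main(String[] args) {
        SparkSession spark = UtilityForSparkSession.mySession();

        // Read the uploaded CSV directly from HDFS.
        Dataset<Row> data = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs://master:9000/ML/train_data.csv");

        data.printSchema();
        data.show(5);
        spark.stop();
    }
}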
1.5 Problem 5: hadoop.security.AccessControlException when uploading a file to HDFS
org.apache.hadoop.security.AccessControlException: Permission denied: user=master, access=WRITE, inode="/":hadoop:supergroup:drwxr-xr-x
Solution: create the target directory with the hdfs command-line tool and open up its permissions:
[hadoop@master bin]$ hdfs dfs -mkdir /ML
[hadoop@master bin]$ hdfs dfs -chmod 777 /ML
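An alternative I did not end up using (sketched here only for reference) is to have the Java client connect as the hadoop user, the owner of the directory in the error above, instead of the local OS user that was denied; FileSystem.get accepts an explicit user name:
package MLModel;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsAsUser {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode as the "hadoop" user rather than the denied local user.
        FileSystem hdfs = FileSystem.get(new URI("hdfs://master:9000"), conf, "hadoop");
        System.out.println("Home directory: " + hdfs.getHomeDirectory());
        hdfs.close();
    }
}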
1.6 Problem 6: error when saving the model
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 44.0 failed 4 times, most recent failure: Lost task 0.3 in stage 44.0 (TID 862, 10.108.22.222, executor 2): java.io.IOException: Mkdirs failed to create
Solution: save the model straight to HDFS and be done with it.
// Save the model
model.write().overwrite().save("hdfs://master:9000/ML/RandomForestTestModel");
// Save the pipeline
pipeline.write().overwrite().save("hdfs://master:9000/ML/PipLineRandomForestTestModel");
// Load the model back
//PipelineModel rfModelLoad = PipelineModel.load("hdfs://master:9000/ML/RandomForestTestModel");
// Apply the same transform with the loaded model
//Dataset<Row> predictions = rfModelLoad.transform(test);
// Show the first 5 predictions
// predictions.select("prediction", "Label","indexedFeatures","rawPrediction", "probability").show(5);
2. Full Random Forest Code Using a Pipeline
package MLModel;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org