When submitting from the Xshell command line with spark-submit, if the cluster has Kerberos enabled, the authentication has to be performed inside the packaged jar. The authentication (keytab) files must be uploaded and distributed to every node, and the nodes need passwordless SSH access between them.
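Below is a minimal sketch of what such an authentication helper might look like, assuming a keytab-based login. The class name KerberosService, the principal, and the file paths are placeholders for illustration only; the real helper imported in the job further down may differ.

package kerberos;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch only: logs in from a keytab that has been distributed to every node.
public class KerberosService {
    public static void login() throws IOException {
        // Placeholder paths and principal -- replace with the cluster's real values.
        System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "sparkuser@EXAMPLE.COM",
                "/etc/security/keytabs/sparkuser.keytab");
    }
}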
Because the program is submitted through spark-submit, the SparkConf in the code is set with
.setMaster("yarn-cluster")
If the submission reports ClassNotFound, the current user may not have permission to access the packaged jar on the cluster, or the option --master yarn-cluster may be missing. Here --master yarn-cluster must match .setMaster("yarn-cluster"); otherwise connection exceptions occur between the nodes and Xshell keeps showing the application stuck in the ACCEPTED state.
The spark-submit command is as follows:
spark-submit \
--class sparkDemo.ParseAttachment \
--master yarn-cluster \
--num-executors 5 \
--driver-memory 5g \
--driver-cores 4 \
--executor-memory 10g \
--executor-cores 5 \
hdfs://1.2.3.4:8020/bulkload/Jars/sub/SparkDemo.jar \
"param1" "param2" "param3"
The param values here are passed into the args[] array of the Spark program's main method.
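As an illustration only (the ParseAttachment class below does not actually read its arguments), a main method could pick them up like this; the variable names are hypothetical:

public static void main(String[] args) {
    String inputPath  = args[0];  // "param1"
    String outputPath = args[1];  // "param2"
    String mode       = args[2];  // "param3"
    // ...
}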
The class to package is shown below:
package sparkDemo;

import java.util.Arrays;

import kerberos.KerberosService;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

import common.HadoopUtil;

public class ParseAttachment {

    public static void main(String[] args) {
        SparkConf conf_ = new SparkConf()
                .setMaster("yarn-cluster")
                .setAppName("parseAttachment");
        JavaSparkContext sc = new JavaSparkContext(conf_);

        // Read the input files from HDFS.
        JavaRDD<String> text = sc.textFile(HadoopUtil.hdfs_url + "/bulkload/Spark_in");
        System.out.println("ok");

        // Split each line on spaces and flatten the result into a list of words.
        JavaRDD<String> words = text.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterable<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" "));
            }
        });

        // Map each word to a (word, 1) pair.
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Sum the counts for each word.
        JavaPairRDD<String, Integer> results = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Integer call(Integer value1, Integer value2) throws Exception {
                return value1 + value2;
            }
        });

        // Swap (word, count) to (count, word) so the RDD can be sorted by count.
        JavaPairRDD<Integer, String> temp = results.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<Integer, String> call(Tuple2<String, Integer> tuple) throws Exception {
                return new Tuple2<Integer, String>(tuple._2, tuple._1);
            }
        });

        // Sort by count in descending order, then swap back to (word, count).
        JavaPairRDD<String, Integer> sorted = temp.sortByKey(false).mapToPair(new PairFunction<Tuple2<Integer, String>, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(Tuple2<Integer, String> tuple) throws Exception {
                return new Tuple2<String, Integer>(tuple._2, tuple._1);
            }
        });

        // Print each (word, count) pair; note this runs on the executors.
        sorted.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public void call(Tuple2<String, Integer> tuple) throws Exception {
                System.out.println("word:" + tuple._1 + " count:" + tuple._2);
            }
        });

        sc.close();
    }
}
As for the Spark dependency: simply add spark-assembly-1.5.2-hadoop2.6.0.jar to the project's build path.
Notes:
1. If you hit the error "the directory item limit is exceeded: limit=1048576": Spark is usually used for batch processing, so the number of output files can be very large, and HDFS limits how many items a single directory may hold (the NameNode property dfs.namenode.fs-limits.max-directory-items, default 1048576). Raise the limit in hdfs-site.xml as described at https://blog.csdn.net/sparkexpert/article/details/51852944; a sample property is shown after this list.
2. Because the volume of files being processed is usually huge, pay attention to coding discipline: close file streams promptly and explicitly, or tune the memory allocation via the parameter spark.yarn.executor.memoryOverhead (for example --conf spark.yarn.executor.memoryOverhead=2048, value in MB).
3. Error: YarnScheduler: Lost executor. Increase the connection ack wait timeout (here 10 minutes): --conf spark.core.connection.ack.wait.timeout=600.
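For note 1, a sketch of the hdfs-site.xml change; the value below is only an example, not a recommendation for any particular cluster:

<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <value>3200000</value>
</property>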