1. Error: org.apache.spark.SparkException: A master URL must be set in your configuration. Set the master URL on the SparkConf, for example:
val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://myhost:7077")
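The same setting in Java (a minimal sketch; the host/port and app name are placeholders, and "local[*]" can be used instead for local testing):
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("spark://myhost:7077");
JavaSparkContext sc = new JavaSparkContext(conf);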
2. The Spark master can also be configured in spark-defaults.conf or in spark-env.sh, for example in spark-env.sh:
export JAVA_HOME=/opt/jdk1.8.0_91
export SCALA_HOME=/usr/share/scala
export SPARK_MASTER_IP=192.168.40.128
export SPARK_WORKER_MEMORY=10G
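Equivalently, the master URL can be placed in spark-defaults.conf (the host/port below are illustrative), or passed with --master on the spark-submit command line:
spark.master    spark://192.168.40.128:7077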
3. Error org.apache.spark.SparkException: java.lang.ClassNotFoundException: test$1
Add setJars to the SparkConf: the code must first be packaged into a jar (e.g. with mvn package), and the jar's path passed to setJars, as shown below:
String[] jar_path = {"C:\\Users\\wuminglang\\IdeaProjects\\test\\target\\spark_remote-1.0-SNAPSHOT.jar"};
SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount").setMaster("spark://192.168.40.128:7077").setJars(jar_path);
4. A problem encountered when calling map through the DataFrame Java API:
// SQL can be run over RDDs that have been registered as tables.
DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
List<String> teenagerNames = teenagers.map(
    new Function<Row, String>() {
        public String call(Row row) {
            return "Name: " + row.getString(0);
        }
    }).collect();
// The error reported is:
/* The method map(Function1, ClassTag) in the type DataFrame is not applicable for the arguments (new Function(){}) */
Solution:
Java 6 & 7
List<String> teenagerNames = teenagers.javaRDD().map(
    new Function<Row, String>() {
        public String call(Row row) {
            return "Name: " + row.getString(0);
        }}).collect();
Java 8
List<String> t2 = teenagers.javaRDD().map(
    row -> "Name: " + row.getString(0)).collect();
Once you call javaRDD() it works just like any other RDD map function. This works with Spark 1.3.0 and up.
5. The following exception appeared when Hadoop had not been started (caused by textFile reading no data from HDFS):
16/10/28 11:23:37 INFO SparkContext: Created broadcast 0 from textFile at test.java:43
Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Solution:
The root cause is that the textFile call below read no data, which triggered the error above. Check the file path and whether HDFS is actually running.
JavaRDD<String> lines = ctx.textFile("hdfs://192.168.40.128:9000/user/input/ips_unpack_2016_08_05.txt", 1);
In my environment, Hadoop is started with:
$HADOOP_HOME/sbin/start-all.sh
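To confirm that the NameNode is up and the input path exists before calling textFile, a quick check can also be done from Java (a minimal sketch, assuming the Hadoop client libraries are on the classpath; the URI and path are the ones used above):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.40.128:9000"), new Configuration());
        // Prints true only when the NameNode is reachable and the file exists
        System.out.println(fs.exists(new Path("/user/input/ips_unpack_2016_08_05.txt")));
    }
}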
6. With Spark's Java API, mapping over a file and then printing the result showed an object reference rather than the data. What gets printed is the object itself, so the return type of the map has to be handled explicitly.
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class sparkLearn {
    public static void main(String[] args) {
        String[] jar_path = {"C:\\Users\\wuminglang\\IdeaProjects\\test\\target\\spark_remote-1.0-SNAPSHOT.jar"};
        SparkConf conf = new SparkConf().setAppName("LearningSpark").setMaster("spark://192.168.40.128:7077").setJars(jar_path);
        JavaSparkContext java_sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = java_sc.textFile("hdfs://192.168.40.128:9000/user/input/ips_unpack_2016_08_05.txt");
        System.out.println(lines.count()); // count the lines read from the file

        // split every line into fields with flatMap
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String x) {
                List<String> xx = Arrays.asList(x.split(","));
                return xx;
            }
        });

        JavaRDD<String[]> line = lines.map(x -> x.split(",")); // each element is a String[] (RDD[Array[String]] in Scala)
        List<String> out = words.top(10);
        System.out.println(out);
        List<String[]> l = line.take(50); // the first 50 rows
        String[] ll = l.get(0);           // the first row as a String array
        System.out.println(ll[0]);        // inspect a single field
        java_sc.stop();
    }
}
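To make the point in item 6 concrete, here is a self-contained sketch (run with a local[*] master; the file path is illustrative): printing a String[] directly yields an object reference such as [Ljava.lang.String;@..., so convert it to text first, e.g. with Arrays.toString.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PrintRows {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PrintRows").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile("data.csv");        // illustrative local path
        JavaRDD<String[]> rows = lines.map(x -> x.split(","));  // one String[] per line
        for (String[] row : rows.take(5)) {
            System.out.println(row);                  // prints an object reference, e.g. [Ljava.lang.String;@1b2c3d
            System.out.println(Arrays.toString(row)); // prints the actual field values
        }
        sc.stop();
    }
}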