Code:
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class Wordcount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("sparkBoot").setMaster("local");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

        // Create an RDD from external data: HDFS, a local file (which must be
        // present on every node), or any other Hadoop-supported file system.
        // Without Hadoop installed, "/config" defaults to the local file system,
        // so the file:// prefix can be omitted.
        // With Hadoop installed, paths default to HDFS; reading from the local
        // file system requires file:///config.
        JavaRDD<String> lines = sparkContext.textFile("/config").cache();

        JavaRDD<String> mapped = lines.map(new Function<String, String>() {
            public String call(String arg0) throws Exception {
                return arg0; // identity mapping; real word-count logic would go here
            }
        });

        // An action is needed to actually trigger the read; this is where the
        // InvalidInputException below surfaces.
        System.out.println(mapped.count());

        sparkContext.stop();
    }
}
The input directory is /config. Package the job and run it on a machine that has only Spark installed (no Hadoop):
spark-submit --master local --class spark_test1.Wordcount spark-test1-0.0.1-SNAPSHOT.jar
Error:
20/12/22 11:07:44 INFO SparkContext: Created broadcast 0 from textFile at Wordcount.java:31
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/config
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:297)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
After creating /config on the host, the job runs successfully.
After installing Hadoop, run it again:
spark-submit --master local --class spark_test1.Wordcount spark-test1-0.0.1-SNAPSHOT.jar
Error:
20/12/22 17:05:12 INFO SparkContext: Created broadcast 0 from textFile at Wordcount.java:31
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://master:9000/config
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:297)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
At this point HDFS is used by default.
Changing sparkContext.textFile("/config") to sparkContext.textFile("file:///config") makes the job read from the local file system, and it runs successfully.
Alternatively, run hadoop fs -put /root/.kube/config / to upload the local file to HDFS; the job then also runs successfully.
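If it is unclear which file system a bare path will hit, a small diagnostic sketch can print it. This uses the standard Hadoop client classes (Configuration, FileSystem, Path) that ship with any Spark build; the class name WhichFs is made up for illustration, and new Configuration() resolves the default file system from whatever core-site.xml is on the classpath, which mirrors what textFile("/config") does:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhichFs {
    public static void main(String[] args) throws IOException {
        // Spark delegates path resolution to Hadoop's FileSystem API,
        // so this shows what textFile("/config") will read from.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // "file:///" without Hadoop configured, "hdfs://master:9000" with it.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        System.out.println("/config resolves to " + fs.makeQualified(new Path("/config")));
        System.out.println("exists: " + fs.exists(new Path("/config")));
    }
}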
Summary of SparkContext.textFile(path) usage:
1. Without Hadoop installed, the local file system is used by default; the file:// prefix can be omitted from the path.
2. With Hadoop installed, HDFS is used by default. To load a local file, prefix the path with file://.
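A minimal sketch of both rules in one place; the class name TextFilePaths is made up for illustration, the hdfs:// host master:9000 is taken from the error log above, and each count succeeds only where the corresponding file actually exists:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TextFilePaths {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("textFilePaths").setMaster("local"));

        // A bare path follows the configured default file system:
        // the local FS when Hadoop is absent, HDFS once Hadoop is installed.
        JavaRDD<String> byDefault = sc.textFile("/config");

        // Explicit schemes behave the same in both environments.
        JavaRDD<String> local = sc.textFile("file:///config");           // local file system
        JavaRDD<String> hdfs = sc.textFile("hdfs://master:9000/config"); // HDFS

        System.out.println(byDefault.count() + " / " + local.count() + " / " + hdfs.count());
        sc.stop();
    }
}

Spelling out the scheme makes the code behave identically regardless of whether Hadoop is installed, which avoids both errors shown above.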