1. Starting with Spark 2.2.x, Java 7 and earlier are no longer supported, so the JDK has to be upgraded to 1.8.
2. Copy the conf files from the 1.4 installation into 2.2.1's conf directory; starting and stopping the cluster works the same as before.
3. The big pitfall, which cost me a whole day.
Running the job fails with java.io.EOFException: Unexpected end of input stream:
17/12/27 14:50:17 INFO scheduler.DAGScheduler: ShuffleMapStage 13 (map at PidCount.scala:86) failed in 2.215 s due to Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 18, 10.26.238.178, executor 1): java.io.EOFException: Unexpected end of input stream
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:208)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:271)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
After a lot of Googling, the explanation is that something is wrong with the input files: either 1. a file is corrupted, or 2. with
val textFile = sc.textFile(inputFile + "/*.gz")
one of the files being read is empty, which also causes this IO error.
After digging for a long time I could not find a way to let the job continue when a file is empty, so I created an "empty" .gz file whose size is not 0 B (a valid gzip file containing no data), and the error went away.
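Another possible way around it (just a sketch I have not verified on this cluster; it assumes the same sc and inputFile as above): list the input directory with the Hadoop FileSystem API, drop any .gz file whose length is 0, and pass only the non-empty paths to textFile:

import org.apache.hadoop.fs.{FileSystem, Path}

// Keep only .gz files that are not 0 bytes: a 0-byte .gz has no gzip header,
// which is what makes the decompressor throw "Unexpected end of input stream".
val fs = FileSystem.get(sc.hadoopConfiguration)
val nonEmptyGz = fs.listStatus(new Path(inputFile))
  .filter(s => s.getPath.getName.endsWith(".gz") && s.getLen > 0)
  .map(_.getPath.toString)

// textFile accepts a comma-separated list of paths; fall back to an empty RDD
// when nothing matches so the downstream stages still run.
val textFile =
  if (nonEmptyGz.nonEmpty) sc.textFile(nonEmptyGz.mkString(","))
  else sc.emptyRDD[String]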
There are also people on Google who say that if you are using Hive:
Starting from Spark 2.1 you can ignore corrupt files by enabling the spark.sql.files.ignoreCorruptFiles option. Add this to your spark-submit or pyspark command:
--conf spark.sql.files.ignoreCorruptFiles=true
Since I am not going through Hive, adding this option still did not seem to filter out the empty files for me (see the sketch below for where it does apply).
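For reference, a minimal sketch of where this option takes effect. As far as I can tell it only applies to the Spark SQL file sources (the DataFrame/Dataset readers), not to the RDD path used by sc.textFile above; spark here is assumed to be an existing SparkSession:

// The SQL option is read by the DataFrame file sources, so it is set here
// (or passed via --conf as above) and then used with spark.read, not sc.textFile.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val lines = spark.read.textFile(inputFile + "/*.gz")

There is also a core setting, spark.files.ignoreCorruptFiles (I believe it was added around Spark 2.1 as well), that targets the Hadoop RDD input path and might be closer to what is needed here, though I have not tested whether it actually skips the empty .gz files.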
Hope this helps.