1. Problem description
Converting an RDD to a DataFrame with the following code:
// Info must be a case class defined outside the method so toDF() can derive its schema
// (field names assumed from the data: id, name, age)
case class Info(id: Int, name: String, age: Int)

val spark = SparkSession.builder().appName("RDD2DataFrameSpark").master("local[2]").getOrCreate()
// RDD ==> DataFrame
val rdd = spark.sparkContext.textFile("datas/info.txt")
// For implicit conversions from RDDs to DataFrames
import spark.implicits._
val infoDF = rdd.map(_.split(",")).map(line => Info(line(0).toInt, line(1), line(2).toInt)).toDF()
infoDF.show()
The job fails with the following error:
18/10/01 17:11:56 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at _1001MoocSparkSQL.RDD2DataFrameSparkSQL$$anonfun$2.apply(RDD2DataFrame.scala:24)
at _1001MoocSparkSQL.RDD2DataFrameSparkSQL$$anonfun$2.apply(RDD2DataFrame.scala:24)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2. Cause
The stack trace shows java.lang.NumberFormatException: For input string: "", thrown from StringOps.toInt. An empty string was passed to toInt, which is a parse failure rather than an array index out of bounds. The data file datas/info.txt is malformed: it contains a blank line or a line with an empty field, so that line cannot be parsed into the expected (Int, String, Int) shape.
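The failure is easy to reproduce outside Spark. A minimal sketch in plain Scala (no Spark required) showing why a blank line triggers exactly this exception:

```scala
// Calling toInt on an empty string throws the same NumberFormatException seen in the stack trace
println(scala.util.Try("".toInt))
// Failure(java.lang.NumberFormatException: For input string: "")

// Splitting a blank line on "," yields a single empty field,
// so line(0).toInt fails in the same way inside the map
val fields = "".split(",")
println(fields.length)                              // 1
println(scala.util.Try(fields(0).toInt).isFailure)  // true
```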
3. Fix
Recreate info.txt with well-formed data: no blank lines, and exactly three comma-separated fields per line, for example:
1,jason,50
2,lisa,40
3,sam,50
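Rather than relying on the input file always being clean, the parsing step can also be made defensive so that malformed lines are dropped instead of failing the whole stage. A hedged sketch (parseInfo is a helper introduced here for illustration, not part of the original code; the field names id/name/age are assumed):

```scala
import scala.util.Try

// Field names assumed from the data: id, name, age
case class Info(id: Int, name: String, age: Int)

// Safely parse one line; malformed lines (blank, wrong field count,
// non-numeric id/age) yield None instead of throwing
def parseInfo(line: String): Option[Info] = {
  val fields = line.split(",").map(_.trim)
  if (fields.length != 3) None
  else Try(Info(fields(0).toInt, fields(1), fields(2).toInt)).toOption
}

// In the Spark job, flatMap drops the unparsable lines:
//   val infoDF = rdd.flatMap(parseInfo).toDF()

println(parseInfo("1,jason,50")) // Some(Info(1,jason,50))
println(parseInfo(""))           // None
```

With flatMap over an Option, only lines that parse cleanly reach the DataFrame, so a stray blank line no longer aborts the job.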