When you process data in Spark as a Dataset[Row] (that is, a DataFrame), change the row structure, and supply the result's Encoder implicitly, the rows can end up without a valid schema, and later stages can no longer read fields with row.getAs[T](fieldName).
The row-level schema seems to be dropped during processing: calling .schema on the Dataset still returns the structure declared for the implicit conversion, but inside a transformation each Row's schema is null. This is also why writing the data out as text and reading it back works fine, while chaining RDD or Dataset operations directly fails with: java.lang.UnsupportedOperationException: fieldIndex on a Row without schema is undefined.
For a Row that has lost this information, reattaching the schema solves the problem:
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

new GenericRowWithSchema(row.toSeq.toArray, schema)
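If this pattern is needed in several places, it can be wrapped in a small helper. A minimal sketch (the name withSchema is mine, not part of Spark's API):

// hypothetical helper: rebuild a Row with the given schema attached
def withSchema(row: Row, schema: StructType): Row =
  new GenericRowWithSchema(row.toSeq.toArray, schema)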
A normal transformation that changes the row structure
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// build the Spark application configuration
val sparkConf = new SparkConf()
  // set the application name and run locally with two threads
  .setAppName("ModuleSpark Application")
  .setMaster("local[2]")
val sc = SparkContext.getOrCreate(sparkConf)
val spark = SparkSession.builder.config(sparkConf).getOrCreate()
val hadoopConf = sc.hadoopConfiguration
// create an RDD of Rows and convert it to a DataFrame as usual
val rdd: RDD[Row] = sc.parallelize(Array[Row](Row(1L), Row(2L), Row(3L), Row(4L), Row(5L)))
val schemads1 = StructType(Array(StructField("num", LongType, true)))
val ds1 = spark.createDataFrame(rdd, schemads1)
println("dataset1:")
ds1.show()
// read fields from the Row normally and build a Dataset with a new structure
val schemads2 = StructType(Array(StructField("num1", LongType, true),
  StructField("num2", LongType, true)))
// signature: def map[U](func: T => U)(implicit evidence$6: Encoder[U]): Dataset[U]
val ds2 = ds1.map(
  num => {
    Row(num.getAs[Long]("num") + 10, num.getAs[Long]("num"))
  }
)(RowEncoder.apply(schemads2))
println("dataset2: ")
println(ds2.show())
Output:
dataset1:
+---+
|num|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
+---+
dataset2:
+----+----+
|num1|num2|
+----+----+
| 11| 1|
| 12| 2|
| 13| 3|
| 14| 4|
| 15| 5|
+----+----+
Operating on dataset2
// this fails: the Rows produced through the RowEncoder no longer carry a schema
ds2.map(
  num => {
    val num1 = num.getAs[Long]("num1") + 10
    val num2 = num.getAs[Long]("num2") + 10
    Row(num1, num2)
  }
)(RowEncoder.apply(schemads2)).show()
Error message:
java.lang.UnsupportedOperationException: fieldIndex on a Row without schema is undefined.
at org.apache.spark.sql.Row$class.fieldIndex(Row.scala:342)
at org.apache.spark.sql.catalyst.expressions.GenericRow.fieldIndex(rows.scala:166)
at org.apache.spark.sql.Row$class.getAs(Row.scala:333)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:166)
at com.aotain.iptv.contentmatch.SchemaTest$$anonfun$main$2.apply(SchemaTest.scala:53)
at com.aotain.iptv.contentmatch.SchemaTest$$anonfun$main$2.apply(SchemaTest.scala:52)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
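The stack trace shows that only the name-based lookup fails: getAs(fieldName) goes through fieldIndex, which is undefined on a Row without a schema. Positional getters never consult the schema, so accessing fields by index should still work on these rows. A minimal sketch of that alternative:

// alternative sketch: positional getters bypass fieldIndex entirely
ds2.map(
  num => {
    Row(num.getLong(0) + 10, num.getLong(1) + 10)
  }
)(RowEncoder.apply(schemads2)).show()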
Reattaching the schema with GenericRowWithSchema
println("dataset3: ")
ds2.map(
num => {
val newrow = new GenericRowWithSchema(num.toSeq.toArray, schemads2)
val num1 = newrow.getAs[Long]("num1")+10
val num2 = newrow.getAs[Long]("num2")+10
Row(num1,num2)
}
)(RowEncoder.apply(schemads2)).show()
Output:
+----+----+
|num1|num2|
+----+----+
| 21| 11|
| 22| 12|
| 23| 13|
| 24| 14|
| 25| 15|
+----+----+
That said, I have not verified whether this approach can handle complex nested types; the more conventional solution is to define a class to hold the data instead of working with Row directly.
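For reference, a minimal sketch of that conventional approach, assuming a hypothetical case class NumPair: with a typed Dataset the encoder is derived from the case class, fields become ordinary members, and the schema problem never arises.

// hypothetical case class replacing Row in this example
// (define it at top level, outside the method, so Spark can derive its Encoder)
case class NumPair(num1: Long, num2: Long)

import spark.implicits._ // supplies the implicit Encoder[NumPair]

val ds3 = ds1.map(row => NumPair(row.getAs[Long]("num") + 10, row.getAs[Long]("num")))
// fields are plain members now; no schema lookup is involved
ds3.map(p => NumPair(p.num1 + 10, p.num2 + 10)).show()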