Splitting and Merging Spark DataFrame Columns
1. Column Splitting
Contents of the file /home/test/testSplitTxt.txt on HDFS:
a,b,c,d
123,345,789,5
34,45,90,9878
Read the text file:
scala> val df = sqlContext.read.text("/home/test/testSplitTxt.txt")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.show
+-------------+
| value|
+-------------+
| a,b,c,d|
|123,345,789,5|
|34,45,90,9878|
+-------------+
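The transcript above uses the Spark 1.x sqlContext entry point. In Spark 2.x the equivalent read (a sketch, assuming a SparkSession named spark) would be:
val df = spark.read.text("/home/test/testSplitTxt.txt")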
1.1 Method 1: Generate field attributes and bind them to the data
Generate the field attributes:
import org.apache.spark.ml.attribute.{Attribute, NumericAttribute}
val fields = Array("a", "b", "c", "d")
// One NumericAttribute per target column, each carrying the column name
val attributes: Array[Attribute] = {
  Array.tabulate(fields.length)(i => NumericAttribute.defaultAttr.withName(fields(i)))
}
Bind the attributes to columns extracted from the value column:
import org.apache.spark.sql.functions._
val fieldCols = attributes.zipWithIndex.map(x => {
  // UDF that extracts the i-th comma-separated field from the value string
  val assembleFunc = udf {
    str: String => str.split(",")(x._2)
  }
  assembleFunc(df("value")).as(x._1.name.get, x._1.toMetadata())
})
Show the data:
scala> df.select(col("*") +: fieldCols: _*).show
+-------------+---+---+---+----+
| value| a| b| c| d|
+-------------+---+---+---+----+
| a,b,c,d| a| b| c| d|
|123,345,789,5|123|345|789| 5|
|34,45,90,9878| 34| 45| 90|9878|
+-------------+---+---+---+----+
scala> df.select(fieldCols: _*).show
+---+---+---+----+
| a| b| c| d|
+---+---+---+----+
| a| b| c| d|
|123|345|789| 5|
| 34| 45| 90|9878|
+---+---+---+----+
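The attribute metadata travels with each generated column. As a quick check (a sketch, not part of the original transcript), the schema of the split result can be printed to confirm the attached metadata:
val splitted = df.select(fieldCols: _*)
// Print each field name together with the metadata carried from its Attribute
splitted.schema.fields.foreach(f => println(s"${f.name} -> ${f.metadata}"))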
1.2 Method 2: Split into an array column, then iterate
Split the value column into an array column on the given delimiter:
scala> val splitDF = df.withColumn("splitCols",split(col("value"), ","))
splitDF: org.apache.spark.sql.DataFrame = [value: string, splitCols: array<string>]
scala> splitDF.show
+-------------+------------------+
| value| splitCols|
+-------------+------------------+
| a,b,c,d| [a, b, c, d]|
|123,345,789,5|[123, 345, 789, 5]|
|34,45,90,9878|[34, 45, 90, 9878]|
+-------------+------------------+
Iterate over the array to produce each column:
scala> splitDF.select(col("splitCols").getItem(0).as("a"),col("splitCols").getItem(1).as("b"),col("splitCols").getItem(2).as("c"),col("splitCols").getItem(3).as("d")).show
+---+---+---+----+
| a| b| c| d|
+---+---+---+----+
| a| b| c| d|
|123|345|789| 5|
| 34| 45| 90|9878|
+---+---+---+----+
Alternatively, split the value column on the delimiter to produce a splitCols column, then add one column per field in a loop:
var newDF = df.withColumn("splitCols", split(col("value"), ","))
attributes.zipWithIndex.foreach(x => {
  // x._1 is an Attribute, so use its name as the new column name
  newDF = newDF.withColumn(x._1.name.get, col("splitCols").getItem(x._2))
})
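Either way, the helper splitCols column is no longer needed once the field columns exist; a minimal follow-up (sketch) drops it so only the named columns remain:
// Drop the intermediate array column, leaving value, a, b, c, d
newDF = newDF.drop("splitCols")
newDF.show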
2. Column Merging
The DataFrame now holds:
scala> df.show
+---+---+---+----+
| a| b| c| d|
+---+---+---+----+
| a| b| c| d|
|123|345|789| 5|
| 34| 45| 90|9878|
+---+---+---+----+
2.1 Method 1: Use the concat_ws function
Get the Column object for each column of df:
scala> val columnArr = df.columns.map { colName => df.col(colName)}
columnArr: Array[org.apache.spark.sql.Column] = Array(a, b, c, d)
Concatenate the columns with the specified delimiter and select the result as value:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
scala> df.select(concat_ws(",", columnArr: _*).cast(StringType).as("value")).show
+-------------+
| value|
+-------------+
| a,b,c,d|
|123,345,789,5|
|34,45,90,9878|
+-------------+
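Note that concat_ws skips null values, so a row with a null column would lose that position in the output string. If empty placeholders should be preserved instead, each column can be wrapped in coalesce first (a sketch, assuming an empty string as the placeholder):
// Replace nulls with "" so every row keeps the same number of delimiters
val nonNullCols = df.columns.map(c => coalesce(col(c), lit("")))
df.select(concat_ws(",", nonNullCols: _*).as("value")).show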
2.2 Method 2: Use the map function
Use map to iterate over each row and concatenate its columns (this converts the DataFrame into an RDD):
scala> val rdd = df.map(_.toSeq.foldLeft("")(_ + "," + _).substring(1))
rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[24] at map at <console>:36
scala> rdd.take(3)
res17: Array[String] = Array(a,b,c,d, 123,345,789,5, 34,45,90,9878)
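If the merged result is needed as a DataFrame again rather than an RDD[String], it can be converted back (a sketch, assuming the Spark 1.x implicits shown in this transcript):
import sqlContext.implicits._
// Turn the RDD[String] back into a single-column DataFrame named value
val mergedDF = rdd.toDF("value")
mergedDF.show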