Assembling a JSON array with Spark SQL
Source data
| 活动 | 行业 | 个数 | 占比 |
|---|---|---|---|
| 活动A | 行业A | 60 | 30% |
| 活动A | 行业B | 20 | 10% |
| 活动A | 行业C | 120 | 60% |
| 活动B | 行业A | 6 | 30% |
| 活动B | 行业B | 2 | 10% |
| 活动B | 行业C | 12 | 60% |
Result data
| 活动 | 结果 |
|---|---|
| 活动A | [{"行业":"行业A","个数":60,"占比":0.3},{"行业":"行业B","个数":20,"占比":0.1},{"行业":"行业C","个数":120,"占比":0.6}] |
| 活动B | [{"行业":"行业A","个数":6,"占比":0.3},{"行业":"行业B","个数":2,"占比":0.1},{"行业":"行业C","个数":12,"占比":0.6}] |
Key functions
- to_json: converts a Column of struct, array, or map type into a JSON string (usage sketch after the excerpt below)
```scala
/**
 * Converts a column containing a `StructType`, `ArrayType` or
 * a `MapType` into a JSON string with the specified schema.
 * Throws an exception, in the case of an unsupported type.
 *
 * @param e a column containing a struct, an array or a map.
 *
 * @group collection_funcs
 * @since 2.1.0
 */
def to_json(e: Column): Column =
  to_json(e, Map.empty[String, String])
```
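A minimal usage sketch, assuming a `SparkSession` named `spark` is already in scope (as in `spark-shell`); the DataFrame and column values are made up for illustration:

```scala
import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._

// Turn two plain columns into a struct, then serialize that struct to a JSON string.
val demo = Seq(("行业A", 60)).toDF("行业", "个数")
demo.select(to_json(struct($"行业", $"个数")).as("json")).show(false)
// prints: {"行业":"行业A","个数":60}
```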
- struct: combines one or more columns into a single struct column (field-naming sketch after the excerpt below)
```scala
/**
 * Creates a new struct column.
 * If the input column is a column in a `DataFrame`, or a derived column expression
 * that is named (i.e. aliased), its name would be retained as the StructField's name,
 * otherwise, the newly generated StructField's name would be auto generated as
 * `col` with a suffix `index + 1`, i.e. col1, col2, col3, ...
 *
 * @group normal_funcs
 * @since 1.4.0
 */
@scala.annotation.varargs
def struct(cols: Column*): Column = withExpr { CreateStruct(cols.map(_.expr)) }
```
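A small sketch of the field-naming rule described above (again a spark-shell style session; the DataFrame is hypothetical):

```scala
import org.apache.spark.sql.functions.{col, lit, struct}
import spark.implicits._

// Named columns keep their names as struct field names; an unnamed expression
// such as lit(0.3) falls back to colN (here it is the third field, so col3).
val rows = Seq(("行业A", 60)).toDF("行业", "个数")
rows.select(struct(col("行业"), col("个数"), lit(0.3)).as("s")).printSchema()
// s: struct<行业: string, 个数: int, col3: double>
```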
- collect_set: aggregates many rows into one, returning an array with duplicates removed
```scala
/**
 * Aggregate function: returns a set of objects with duplicate elements eliminated.
 *
 * @note The function is non-deterministic because the order of collected results depends
 * on order of rows which may be non-deterministic after a shuffle.
 *
 * @group agg_funcs
 * @since 1.6.0
 */
def collect_set(e: Column): Column = withAggregateFunction { CollectSet(e.expr) }
```
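A quick sketch of the aggregation behaviour (hypothetical data, spark-shell style session); if duplicate rows need to be preserved, collect_list would be the alternative:

```scala
import org.apache.spark.sql.functions.{col, collect_set}
import spark.implicits._

// The duplicate (活动A, 行业A) rows collapse into one array element; element order is not guaranteed.
val sample = Seq(("活动A", "行业A"), ("活动A", "行业A"), ("活动A", "行业B")).toDF("活动", "行业")
sample.groupBy("活动").agg(collect_set(col("行业")).as("industries")).show(false)
// 活动A -> [行业A, 行业B] (or [行业B, 行业A])
```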
Implementation
- Wrap the source data in a DataFrame named sourceDF, build a struct per row, group by 活动, and serialize the collected structs to JSON (a fully runnable sketch follows the snippet):

```scala
sourceDF
  .withColumn("json", struct(col("行业"), col("个数"), col("占比")))  // one struct per row
  .groupBy("活动")                                                     // one result row per 活动
  .agg(to_json(collect_set(col("json"))).as("res_json"))               // structs -> array -> JSON string
  .show(false)                                                         // false: do not truncate the JSON
```
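Putting it all together, a self-contained sketch; the SparkSession setup and the literal 占比 values (stored as fractions so the JSON matches the result table) are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_set, struct, to_json}

object JsonArrayDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-array-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Source rows from the table above; 占比 is kept as a fraction (0.3 = 30%).
    val sourceDF = Seq(
      ("活动A", "行业A", 60, 0.3), ("活动A", "行业B", 20, 0.1), ("活动A", "行业C", 120, 0.6),
      ("活动B", "行业A", 6, 0.3), ("活动B", "行业B", 2, 0.1), ("活动B", "行业C", 12, 0.6)
    ).toDF("活动", "行业", "个数", "占比")

    sourceDF
      .withColumn("json", struct(col("行业"), col("个数"), col("占比")))
      .groupBy("活动")
      .agg(to_json(collect_set(col("json"))).as("res_json"))
      .show(false)
    // 活动A | [{"行业":"行业A","个数":60,"占比":0.3}, ...]   (element order inside the array is not guaranteed)
    // 活动B | [{"行业":"行业A","个数":6,"占比":0.3}, ...]

    spark.stop()
  }
}
```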