Assembling a JSON array with Spark SQL
Source data
| 活动 | 行业 | 个数 | 占比 |
|---|---|---|---|
| 活动A | 行业A | 60 | 30% |
| 活动A | 行业B | 20 | 10% |
| 活动A | 行业C | 120 | 60% |
| 活动B | 行业A | 6 | 30% |
| 活动B | 行业B | 2 | 10% |
| 活动B | 行业C | 12 | 60% |
Result data
| 活动 | 结果 |
|---|---|
| 活动A | [{"行业":"行业A","个数":60,"占比":0.3},{"行业":"行业B","个数":20,"占比":0.1},{"行业":"行业C","个数":120,"占比":0.6}] |
| 活动B | [{"行业":"行业A","个数":6,"占比":0.3},{"行业":"行业B","个数":2,"占比":0.1},{"行业":"行业C","个数":12,"占比":0.6}] |
Key functions
- to_json: converts a Column of struct, array, or map type into a JSON string (usage sketch after the excerpt below)
```scala
/**
 * Converts a column containing a `StructType`, `ArrayType` or
 * a `MapType` into a JSON string with the specified schema.
 * Throws an exception, in the case of an unsupported type.
 *
 * @param e a column containing a struct, an array or a map.
 *
 * @group collection_funcs
 * @since 2.1.0
 */
def to_json(e: Column): Column =
  to_json(e, Map.empty[String, String])
```
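A minimal usage sketch, assuming a `SparkSession` named `spark` is already in scope (as in `spark-shell`); the DataFrame and column values are made up for illustration:

```scala
import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._

// Turn two plain columns into a struct, then serialize that struct to a JSON string.
val demo = Seq(("行业A", 60)).toDF("行业", "个数")
demo.select(to_json(struct($"行业", $"个数")).as("json")).show(false)
// prints: {"行业":"行业A","个数":60}
```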
- struct: combines one or more columns into a single struct column (field-naming sketch after the excerpt below)
```scala
/**
 * Creates a new struct column.
 * If the input column is a column in a `DataFrame`, or a derived column expression
 * that is named (i.e. aliased), its name would be retained as the StructField's name,
 * otherwise, the newly generated StructField's name would be auto generated as
 * `col` with a suffix `index + 1`, i.e. col1, col2, col3, ...
 *
 * @group normal_funcs
 * @since 1.4.0
 */
@scala.annotation.varargs
def struct(cols: Column*): Column = withExpr { CreateStruct(cols.map(_.expr)) }
```
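A small sketch of the field-naming rule described above (again a spark-shell style session; the DataFrame is hypothetical):

```scala
import org.apache.spark.sql.functions.{col, lit, struct}
import spark.implicits._

// Named columns keep their names as struct field names; an unnamed expression
// such as lit(0.3) falls back to colN (here it is the third field, so col3).
val rows = Seq(("行业A", 60)).toDF("行业", "个数")
rows.select(struct(col("行业"), col("个数"), lit(0.3)).as("s")).printSchema()
// s: struct<行业: string, 个数: int, col3: double>
```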
- collect_set: aggregates many rows into one, returning an array with duplicates removed
```scala
/**
 * Aggregate function: returns a set of objects with duplicate elements eliminated.
 *
 * @note The function is non-deterministic because the order of collected results depends
 * on order of rows which may be non-deterministic after a shuffle.
 *
 * @group agg_funcs
 * @since 1.6.0
 */
def collect_set(e: Column): Column = withAggregateFunction { CollectSet(e.expr) }
```
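A quick sketch of the aggregation behaviour (hypothetical data, spark-shell style session); if duplicate rows need to be preserved, collect_list would be the alternative:

```scala
import org.apache.spark.sql.functions.{col, collect_set}
import spark.implicits._

// The duplicate (活动A, 行业A) rows collapse into one array element; element order is not guaranteed.
val sample = Seq(("活动A", "行业A"), ("活动A", "行业A"), ("活动A", "行业B")).toDF("活动", "行业")
sample.groupBy("活动").agg(collect_set(col("行业")).as("industries")).show(false)
// 活动A -> [行业A, 行业B] (or [行业B, 行业A])
```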
Implementation
- Wrap the source data in a DataFrame named sourceDF, build a struct per row, group by 活动, and serialize the collected structs to JSON (a fully runnable sketch follows the snippet):

```scala
sourceDF
  .withColumn("json", struct(col("行业"), col("个数"), col("占比")))  // one struct per row
  .groupBy("活动")                                                     // one result row per 活动
  .agg(to_json(collect_set(col("json"))).as("res_json"))               // structs -> array -> JSON string
  .show(false)                                                         // false: do not truncate the JSON
```
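Putting it all together, a self-contained sketch; the SparkSession setup and the literal 占比 values (stored as fractions so the JSON matches the result table) are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_set, struct, to_json}

object JsonArrayDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-array-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Source rows from the table above; 占比 is kept as a fraction (0.3 = 30%).
    val sourceDF = Seq(
      ("活动A", "行业A", 60, 0.3), ("活动A", "行业B", 20, 0.1), ("活动A", "行业C", 120, 0.6),
      ("活动B", "行业A", 6, 0.3), ("活动B", "行业B", 2, 0.1), ("活动B", "行业C", 12, 0.6)
    ).toDF("活动", "行业", "个数", "占比")

    sourceDF
      .withColumn("json", struct(col("行业"), col("个数"), col("占比")))
      .groupBy("活动")
      .agg(to_json(collect_set(col("json"))).as("res_json"))
      .show(false)
    // 活动A | [{"行业":"行业A","个数":60,"占比":0.3}, ...]   (element order inside the array is not guaranteed)
    // 活动B | [{"行业":"行业A","个数":6,"占比":0.3}, ...]

    spark.stop()
  }
}
```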