Commonly used functions for processing JSON files in Spark

  • get_json_object()
  • from_json()
  • to_json()
  • explode()
  • selectExpr()

Preparation

1. First, create a sample case class and the imports for JSON records without any nesting.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._


case class DeviceData (id: Int, device: String)
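Before any of the snippets below will run, a SparkSession is needed; a minimal sketch (the app name and local master are assumptions), which also brings toDF and the $"col" syntax into scope:

    val spark: SparkSession = SparkSession.builder()
      .appName("JsonFunctionsDemo") // hypothetical app name
      .master("local[*]")           // assumption: run locally
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    import spark.implicits._ // enables toDF and the $"col" syntax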

Convert the JSON strings into a DataFrame

    val evetDS = Seq(
      (0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cca3": "USA", "cn": "United States", "temp": 25, "signal": 23, "battery_level": 8, "c02_level": 917, "timestamp" :1475600496 }"""),
      (1, """{"device_id": 1, "device_type": "sensor-igauge", "ip": "213.161.254.1", "cca3": "NOR", "cn": "Norway", "temp": 30, "signal": 18, "battery_level": 6, "c02_level": 1413, "timestamp" :1475600498 }"""),
      (2, """{"device_id": 2, "device_type": "sensor-ipad", "ip": "88.36.5.1", "cca3": "ITA", "cn": "Italy", "temp": 18, "signal": 25, "battery_level": 5, "c02_level": 1372, "timestamp" :1475600500 }"""),
      (3, """{"device_id": 3, "device_type": "sensor-inest", "ip": "66.39.173.154", "cca3": "USA", "cn": "United States", "temp": 47, "signal": 12, "battery_level": 1, "c02_level": 1447, "timestamp" :1475600502 }"""),
      (4, """{"device_id": 4, "device_type": "sensor-ipad", "ip": "203.82.41.9", "cca3": "PHL", "cn": "Philippines", "temp": 29, "signal": 11, "battery_level": 0, "c02_level": 983, "timestamp" :1475600504 }"""),
      (5, """{"device_id": 5, "device_type": "sensor-istick", "ip": "204.116.105.67", "cca3": "USA", "cn": "United States", "temp": 50, "signal": 16, "battery_level": 8, "c02_level": 1574, "timestamp" :1475600506 }"""),
      (6, """{"device_id": 6, "device_type": "sensor-ipad", "ip": "220.173.179.1", "cca3": "CHN", "cn": "China", "temp": 21, "signal": 18, "battery_level": 9, "c02_level": 1249, "timestamp" :1475600508 }"""),
      (7, """{"device_id": 7, "device_type": "sensor-ipad", "ip": "118.23.68.227", "cca3": "JPN", "cn": "Japan", "temp": 27, "signal": 15, "battery_level": 0, "c02_level": 1531, "timestamp" :1475600512 }"""),
      (8, """ {"device_id": 8, "device_type": "sensor-inest", "ip": "208.109.163.218", "cca3": "USA", "cn": "United States", "temp": 40, "signal": 16, "battery_level": 9, "c02_level": 1208, "timestamp" :1475600514 }"""),
      (9, """{"device_id": 9, "device_type": "sensor-ipad", "ip": "88.213.191.34", "cca3": "ITA", "cn": "Italy", "temp": 19, "signal": 11, "battery_level": 0, "c02_level": 1171, "timestamp" :1475600516 }"""),
      (10, """{"device_id": 10, "device_type": "sensor-igauge", "ip": "68.28.91.22", "cca3": "USA", "cn": "United States", "temp": 32, "signal": 26, "battery_level": 7, "c02_level": 886, "timestamp" :1475600518 }"""),
      (11, """{"device_id": 11, "device_type": "sensor-ipad", "ip": "59.144.114.250", "cca3": "IND", "cn": "India", "temp": 46, "signal": 25, "battery_level": 4, "c02_level": 863, "timestamp" :1475600520 }"""),
      (12, """{"device_id": 12, "device_type": "sensor-igauge", "ip": "193.156.90.200", "cca3": "NOR", "cn": "Norway", "temp": 18, "signal": 26, "battery_level": 8, "c02_level": 1220, "timestamp" :1475600522 }"""),
      (13, """{"device_id": 13, "device_type": "sensor-ipad", "ip": "67.185.72.1", "cca3": "USA", "cn": "United States", "temp": 34, "signal": 20, "battery_level": 8, "c02_level": 1504, "timestamp" :1475600524 }"""),
      (14, """{"device_id": 14, "device_type": "sensor-inest", "ip": "68.85.85.106", "cca3": "USA", "cn": "United States", "temp": 39, "signal": 17, "battery_level": 8, "c02_level": 831, "timestamp" :1475600526 }"""),
      (15, """{"device_id": 15, "device_type": "sensor-ipad", "ip": "161.188.212.254", "cca3": "USA", "cn": "United States", "temp": 27, "signal": 26, "battery_level": 5, "c02_level": 1378, "timestamp" :1475600528 }"""),
      (16, """{"device_id": 16, "device_type": "sensor-igauge", "ip": "221.3.128.242", "cca3": "CHN", "cn": "China", "temp": 10, "signal": 24, "battery_level": 6, "c02_level": 1423, "timestamp" :1475600530 }"""),
      (17, """{"device_id": 17, "device_type": "sensor-ipad", "ip": "64.124.180.215", "cca3": "USA", "cn": "United States", "temp": 38, "signal": 17, "battery_level": 9, "c02_level": 1304, "timestamp" :1475600532 }"""),
      (18, """{"device_id": 18, "device_type": "sensor-igauge", "ip": "66.153.162.66", "cca3": "USA", "cn": "United States", "temp": 26, "signal": 10, "battery_level": 0, "c02_level": 902, "timestamp" :1475600534 }"""),
      (19, """{"device_id": 19, "device_type": "sensor-ipad", "ip": "193.200.142.254", "cca3": "AUT", "cn": "Austria", "temp": 32, "signal": 27, "battery_level": 5, "c02_level": 1282, "timestamp" :1475600536 }""")
    ).toDF("id", "device")
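If these records lived in a file instead (one JSON object per line), spark.read.json could load them with an inferred schema; a small sketch, where the path is hypothetical:

    // Assumption: a line-delimited JSON file at this (made-up) path
    val fileDF: DataFrame = spark.read.json("in/devices.json")
    fileDF.printSchema()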

1. get_json_object()

This function also exists in Hive.

// json_string: a string column containing JSON data
// path: a JSONPath expression locating the value to extract
get_json_object(json_string, path)


// Usage: every value extracted this way comes back as a string
    val jsdf: DataFrame = evetDS.select(
      $"id",
      get_json_object($"device","$.device_type").alias("device_type"),
      get_json_object($"device","$.ip").alias("ip"),
      get_json_object($"device","$.cca3").alias("cca3"),
      get_json_object($"device","$.cn").alias("cn"),
      get_json_object($"device","$.temp").alias("temp"),
      get_json_object($"device","$.signal").alias("signal"),
      get_json_object($"device","$.battery_level").alias("battery_level"),
      get_json_object($"device","$.c02_level").alias("c02_level"),
      get_json_object($"device","$.timestamp").alias("timestamp")
    )
    jsdf.printSchema()
    jsdf.show()

Result:
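Because get_json_object() always returns StringType, numeric fields usually need an explicit cast before comparisons or arithmetic; a minimal sketch:

    // Cast the string results to proper numeric types
    val typedDF: DataFrame = jsdf
      .withColumn("temp", $"temp".cast(IntegerType))
      .withColumn("c02_level", $"c02_level".cast(LongType))
    typedDF.printSchema()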

2. from_json()

// jsonColumn: the column containing JSON strings
// schema: the structure of the JSON data, either a StructType or a schema string
// options: optional map of extra JSON parsing options
from_json(jsonColumn, schema[, options])
     
---------------------------------------------------------------------

    val jsonSchema = new StructType()
      .add("battery_level", LongType)
      .add("c02_level", LongType)
      .add("cca3", StringType)
      .add("cn", StringType)
      .add("device_id", LongType)
      .add("device_type", StringType)
      .add("signal", LongType)
      .add("ip", StringType)
      .add("temp", LongType)
      .add("timestamp", TimestampType)


    val frame: DataFrame =
      evetDS.select($"id", from_json($"device", jsonSchema).as("colum1"))
        .select($"id",$"colum1.*")
//        .select($"id",$"colum1.device_type",$"colum1.cca3")
        .filter($"temp" > 10 and $"signal" > 15)

      frame.show()

Result:
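One behavior worth knowing: in the default parsing mode, rows whose JSON cannot be parsed against the schema come back as a null struct rather than failing the job. A quick sanity check, as a sketch:

    // Unparseable rows yield a null struct; count them to spot bad input
    evetDS.select($"id", from_json($"device", jsonSchema).as("colum1"))
      .filter($"colum1".isNull)
      .show()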

3. to_json()

// structColumn: the structured column (struct, map, or array) to convert to a JSON string
// options: optional JSON generation options
to_json(structColumn[, options])



    val framejson: DataFrame =
      evetDS.select($"id", from_json($"device", jsonSchema).as("colum1"))
        .select($"id",$"colum1.device_type",$"colum1.cca3")
    val jsonDF: DataFrame = framejson.select(to_json(struct($"*")))
    jsonDF.show(false)

Result:
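to_json is not limited to structs; in recent Spark versions it also accepts map and array columns. A small sketch with a map column (the data here is made up for illustration):

    val mapDF: DataFrame = Seq((1, Map("lang" -> "scala", "engine" -> "spark"))).toDF("id", "m")
    mapDF.select($"id", to_json($"m").as("m_json")).show(false)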

4. explode()

explode() only works on array or map columns, and two columns cannot be exploded in the same select. A JSON file read as plain text yields string columns, so from_json is used to give the JSON arrays a concrete type. A minimal standalone sketch follows; the full walkthrough starts after it.
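A minimal sketch of the basic mechanics (the data here is made up): explode turns each array element into its own row, duplicating the other columns:

    // Each element of "letters" becomes a separate row
    val arrDF: DataFrame = Seq((1, Seq("a", "b"))).toDF("id", "letters")
    arrDF.select($"id", explode($"letters").as("letter")).show()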

   println("---------准备工作----------")

 val op2RDD: RDD[String] = sc.textFile("in/op2.log")
    val op2DF: DataFrame = op2RDD.toDF("json")


     val op2StrDf: DataFrame = op2DF.select(
      get_json_object($"json", "$.address").alias("address"),
      get_json_object($"json", "$.area").alias("area"),
      get_json_object($"json", "$.classess").alias("classess"),
      get_json_object($"json", "$.level").alias("level"),
      get_json_object($"json", "$.schoolName").alias("schoolName"),
      get_json_object($"json", "$.teachers").alias("teachers")
    )

   
    println("-------------编写数组结构------------")
    val classschema: ArrayType = ArrayType(
      StructType(
        Array(
          StructField("classesName", StringType),
          StructField("num", StringType)
        )
      )
    )

    val teachdschema: ArrayType = ArrayType(
      StructType(
        Array(
          StructField("name", StringType),
          StructField("year", StringType)
        )
      )
    )


    println("-------------展开两列------------")
    val tpDF: DataFrame = op2StrDf.select(
      $"address", $"area", from_json($"classess", classschema).as("classess"), $"level", $"schoolName"
      , from_json($"teachers", teachdschema).as("teachers"))


    val tp2Df: DataFrame = tpDF.select(
      $"address", $"area", $"classess", $"level",
      $"schoolName",
      explode($"teachers")
    )

    val colDF: DataFrame = tp2Df.select(
      $"address", $"area", $"classess", $"level",
      $"schoolName", $"col.name", $"col.year"
    )

    val resultDF: DataFrame = colDF.select(
      $"address", $"area", explode($"classess"), $"level",
      $"name", $"year"
    )




Result:

    tpDF.show()
    colDF.show()
    resultDF.show()
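Two related variants are worth knowing here: explode_outer keeps rows whose array is null or empty, and posexplode also emits each element's position. A short sketch against the same tpDF:

    // posexplode yields two extra columns: pos (element index) and col (the struct)
    tpDF.select($"schoolName", posexplode($"teachers")).show()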

5. selectExpr()

selectExpr() is a very useful Spark DataFrame method that lets you select and transform columns with SQL expressions. It combines select() with the flexibility of SQL, making data transformations more direct and concise.
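A minimal sketch reusing the evetDS DataFrame from the preparation step (the column choices are just for illustration; get_json_object is also available as a SQL function here):

    val exprDF: DataFrame = evetDS.selectExpr(
      "id",
      "get_json_object(device, '$.device_type') AS device_type",
      "CAST(get_json_object(device, '$.temp') AS INT) AS temp"
    )
    exprDF.show()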

 
