Data Lake Iceberg Series (5) - Processing Data in Real Time with Spark

1 Receive network data and write it to an Iceberg table in real time
Start an nc service to simulate a data source:

nc -lk 9999
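
Each line typed into the nc session should be a comma-separated id,name,age record, matching what the streaming job below parses. For example (illustrative values):

    1,zhangsan,20
    2,lisi,31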

2 Spark reads the stream and writes it into the Iceberg table

import org.apache.iceberg.spark.SparkCatalog
import org.apache.spark.sql.{DataFrame, SparkSession}

    // Build the SparkSession and register an Iceberg catalog named hadoop_prod
    val spark = SparkSession.builder()
      .config("spark.sql.catalog.hadoop_prod.type", "hadoop") // catalog type: hadoop (metadata kept on HDFS)
      .config("spark.sql.catalog.hadoop_prod", classOf[SparkCatalog].getName)
      // Root directory of the Hadoop catalog warehouse
      .config("spark.sql.catalog.hadoop_prod.warehouse", "hdfs://linux01:8020/doit/iceberg/warehouse/")
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()

    // Receive the socket stream from the nc server
    val lines = spark.readStream.format("socket").option("host", "linux01").option("port", 9999).load()

    // Parse each "id,name,age" line into a DataFrame
    import spark.implicits._
    val data: DataFrame = lines.map(row => row.getAs[String]("value")).map(s => {
      val split: Array[String] = s.split(",")
      (split(0).toInt, split(1), split(2).toInt)
    }).toDF("id", "name", "age")

    // Location of the Iceberg table inside the Hadoop warehouse
    val tableIdentifier: String = "hdfs://linux01:8020/doit/iceberg/warehouse/default/tb_user"

    // Append each micro-batch to the Iceberg table; use a dedicated, durable checkpoint directory
    val query = data.writeStream
      .outputMode("append")
      .format("iceberg")
      .option("path", tableIdentifier)
      .option("checkpointLocation", "/doit/iceberg/checkpoints/tb_user")
      .start()
    query.awaitTermination()
    spark.close()
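
The streaming sink appends to an existing table, so if tb_user has not been created yet, one way to create it (a minimal sketch using the hadoop_prod catalog configured above) is a one-time Spark SQL DDL from the same session:

    // Create the target Iceberg table once, before starting the stream
    spark.sql(
      """
        |CREATE TABLE IF NOT EXISTS hadoop_prod.default.tb_user (
        |  id   INT,
        |  name STRING,
        |  age  INT
        |) USING iceberg
        |""".stripMargin)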
3 Spark reads data from the Iceberg table

import org.apache.iceberg.spark.SparkCatalog
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .config("spark.sql.catalog.hadoop_prod.type", "hadoop") // catalog type: hadoop (metadata kept on HDFS)
      .config("spark.sql.catalog.hadoop_prod", classOf[SparkCatalog].getName)
      // Root directory of the Hadoop catalog warehouse
      .config("spark.sql.catalog.hadoop_prod.warehouse", "hdfs://linux01:8020/doit/iceberg/warehouse/")
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()

    // Load the Iceberg table by its HDFS location
    val lines = spark.read.format("iceberg").load("hdfs://linux01:8020/doit/iceberg/warehouse/default/tb_user")
    lines.printSchema()
    lines.createTempView("tb_user")

    // List all data files and all snapshots through Iceberg's metadata tables
    spark.sql("select * from hadoop_prod.default.tb_user.files").show()
    spark.sql("select * from hadoop_prod.default.tb_user.snapshots").show()

    // Time travel: read the data as of a specific snapshot
    val lines2 = spark.read.format("iceberg")
      .option("snapshot-id", 9146975902480919479L)
      .load("hdfs://linux01:8020/doit/iceberg/warehouse/default/tb_user")
    lines2.show()

    spark.close()
  }
The output looks like this:

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

+-------+--------------------+-----------+------------+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------+-------------+------------+
|content|           file_path|file_format|record_count|file_size_in_bytes|        column_sizes|        value_counts|   null_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|
+-------+--------------------+-----------+------------+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------+-------------+------------+
|      0|hdfs://linux01:80...|    PARQUET|           1|               833|[1 -> 46, 2 -> 53...|[1 -> 1, 2 -> 1, ...|[1 -> 0, 2 -> 0, ...|[1 -> !   , 2 -> ...|[1 -> !   , 2 -> ...|        null|          [4]|        null|
|      0|hdfs://linux01:80...|    PARQUET|           1|               835|[1 -> 47, 2 -> 53...|[1 -> 1, 2 -> 1, ...|[1 -> 0, 2 -> 0, ...|[1 -> ... , 2 -> ...|[1 -> ... , 2 -> ...|        null|          [4]|        null|
|      0|hdfs://linux01:80...|    PARQUET|           1|               840|[1 -> 47, 2 -> 53...|[1 -> 1, 2 -> 1, ...|[1 -> 0, 2 -> 0, ...|[1 -> ... , 2 -> ...|[1 -> ... , 2 -> ...|        null|          [4]|        null|
|      0|hdfs://linux01:80...|    PARQUET|           1|               842|[1 -> 47, 2 -> 54...|[1 -> 1, 2 -> 1, ...|[1 -> 0, 2 -> 0, ...|[1 -> ... , 2 -> ...|[1 -> ... , 2 -> ...|        null|          [4]|        null|
|      0|hdfs://linux01:80...|    PARQUET|           1|               842|[1 -> 47, 2 -> 54...|[1 -> 1, 2 -> 1, ...|[1 -> 0, 2 -> 0, ...|[1 -> ... , 2 -> ...|[1 -> ... , 2 -> ...|        null|          [4]|        null|
|      0|hdfs://linux01:80...|    PARQUET|           1|               849|[1 -> 47, 2 -> 55...|[1 -> 1, 2 -> 1, ...|[1 -> 0, 2 -> 0, ...|[1 -> ... , 2 -> ...|[1 -> ... , 2 -> ...|        null|          [4]|        null|
+-------+--------------------+-----------+------------+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------+-------------+------------+
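
Each micro-batch commit wrote one small Parquet file, which is why the files metadata table lists several files with record_count = 1. The lower_bounds and upper_bounds columns hold per-column min/max values in Iceberg's binary encoding, so they print as unreadable bytes in the console.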

+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|        committed_at|        snapshot_id|          parent_id|operation|       manifest_list|             summary|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|2020-12-05 15:13:...|4974727741303617264|               null|   append|hdfs://linux01:80...|[spark.app.id -> ...|
|2020-12-05 15:13:...|6649969826606152854|4974727741303617264|   append|hdfs://linux01:80...|[spark.app.id -> ...|
|2020-12-05 15:14:...|9146975902480919479|6649969826606152854|   append|hdfs://linux01:80...|[spark.app.id -> ...|
|2020-12-05 15:26:...|3789248833638708269|9146975902480919479|   append|hdfs://linux01:80...|[spark.app.id -> ...|
|2020-12-05 15:27:...| 145534978715502615|3789248833638708269|   append|hdfs://linux01:80...|[spark.app.id -> ...|
|2020-12-05 15:43:...| 677713801965958716| 145534978715502615|   append|hdfs://linux01:80...|[spark.app.id -> ...|
|2020-12-05 15:44:...|3022463020588869964| 677713801965958716|   append|hdfs://linux01:80...|[spark.app.id -> ...|
|2020-12-05 15:44:...|4764864293483030282|3022463020588869964|   append|hdfs://linux01:80...|[spark.app.id -> ...|
|2020-12-05 15:44:...|8363256205651138549|4764864293483030282|   append|hdfs://linux01:80...|[spark.app.id -> ...|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
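
The snapshot-id 9146975902480919479 used in the time-travel query above is one of the entries in this list, so that read returns the table exactly as of that commit. Iceberg reads can also time-travel by timestamp; a minimal sketch (the epoch-milliseconds value here is illustrative, not taken from the run above):

    // Read the table as it existed at a point in time; as-of-timestamp is in milliseconds since the epoch
    val asOf = spark.read.format("iceberg")
      .option("as-of-timestamp", "1607152380000") // illustrative value
      .load("hdfs://linux01:8020/doit/iceberg/warehouse/default/tb_user")
    asOf.show()

Since the hadoop_prod catalog is registered, the same table can also be read by name with spark.table("hadoop_prod.default.tb_user") instead of by HDFS path.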


————————————————
Copyright notice: this is an original article by the CSDN blogger 白眼黑刺猬, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/qq_37933018/article/details/110690749
