If your log data is already in Avro format, you can use the approach from the previous post (
https://blog.csdn.net/qq_29829081/article/details/80518671) to dump the Avro data to Parquet directly. In practice, however, log data is usually JSON rather than Avro, so this post covers how to stream JSON into Parquet with Morphline. The example here is deliberately simple; production JSON is far more complex, but Morphline can still handle it with a generic approach, which the next post will describe in detail.
This post shows how to use Flume's Morphline interceptor to convert JSON records to Avro on the fly, and then write them out in Parquet format with the Kite DatasetSink. The key pieces are the Flume configuration and the chain of Morphline commands.
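For reference, a minimal schema matching the name/age fields extracted in the morphline config below might look like the following. The actual contents of litao.avsc are not shown here, so this is only an assumed sketch; note that the copy at hdfs://bigocluster/user/litao/litao.avsc (used by the static interceptor) and the local copy at /home/litao/litao.avsc (used by the toAvro command) must be identical.

```json
{
  "type": "record",
  "name": "litao",
  "fields": [
    {"name": "name", "type": ["null", "string"], "default": null},
    {"name": "age",  "type": ["null", "int"],    "default": null}
  ]
}
```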
1 Flume configuration:
(1) Flume configuration on the nginx side
# Name the components on this agent
a1.sources = r
a1.sinks = k_kafka
a1.channels = c_mem
# Channels info
a1.channels.c_mem.type = memory
a1.channels.c_mem.capacity = 2000
a1.channels.c_mem.transactionCapacity = 300
a1.channels.c_mem.keep-alive = 60
# Sources info
a1.sources.r.type = exec
a1.sources.r.shell = /bin/bash -c
a1.sources.r.command = tail -F /home/litao/litao.json
a1.sources.r.channels = c_mem
# Sinks info
a1.sinks.k_kafka.channel = c_mem
a1.sinks.k_kafka.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k_kafka.kafka.bootstrap.servers = kafka1:9093,kafka2:9093,kafka3:9093,kafka4:9093,kafka5:9093,kafka6:9093
a1.sinks.k_kafka.kafka.topic = test_2018-03-14
a1.sinks.k_kafka.kafka.flumeBatchSize = 5
a1.sinks.k_kafka.kafka.producer.acks = 1
(2) Flume configuration on the Kafka side
# Name the components on this agent
a1.channels = c1
a1.sources = r1
a1.sinks = k1
# Channel config
a1.channels.c1.type = memory
a1.channels.c1.capacity = 500000
a1.channels.c1.transactionCapacity = 100000
a1.channels.c1.keep-alive = 50
# Sources info
a1.sources.r1.type = com.bigo.flume.source.kafka.KafkaSource
a1.sources.r1.channels = c1
a1.sources.r1.kafka.bootstrap.servers = kafka1:9093,kafka2:9093,kafka3:9093,kafka4:9093,kafka5:9093,kafka6:9093
a1.sources.r1.kafka.topics = test_2018-03-14
a1.sources.r1.kafka.consumer.group.id = test_2018-03-14.conf_flume_group
a1.sources.r1.kafka.consumer.timeout.ms = 100
a1.sources.r1.batchSize = 2000
# Config Interceptors
a1.sources.r1.interceptors = i1 morphline
# Inject the Schema into the header so the AvroEventSerializer can pick it up
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = flume.avro.schema.url
#a1.sources.r1.interceptors.i1.value = file:/home/litao/litao.avsc
a1.sources.r1.interceptors.i1.value = hdfs://bigocluster/user/litao/litao.avsc
# Morphline interceptor config
a1.sources.r1.interceptors.morphline.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
a1.sources.r1.interceptors.morphline.morphlineFile = /etc/flume/conf/a1/morphline.conf
a1.sources.r1.interceptors.morphline.morphlineId = convertJsonToAvro
# Sink config
a1.sinks.k1.type = org.apache.flume.sink.kite.DatasetSink
a1.sinks.k1.channel = c1
a1.sinks.k1.kite.dataset.uri = dataset:hdfs://bigocluster/flume/hellotalk/parquet
a1.sinks.k1.kite.batchSize = 100
a1.sinks.k1.kite.rollInterval = 30
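The Kite DatasetSink requires the target dataset to exist before the agent starts writing. One way to create it up front is with the kite-dataset CLI; the invocation below mirrors the kite.dataset.uri above and assumes /home/litao/litao.avsc holds the record schema:

```shell
# Create the target Parquet dataset once, before starting the Flume agent.
# The URI matches a1.sinks.k1.kite.dataset.uri in the config above; the
# schema file's exact contents are an assumption (name/age fields).
kite-dataset create dataset:hdfs://bigocluster/flume/hellotalk/parquet \
  --schema /home/litao/litao.avsc \
  --format parquet
```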
2 Morphline configuration:
morphlines: [
  {
    id: convertJsonToAvro
    importCommands: [ "org.kitesdk.**" ]
    commands: [
      # read the JSON blob
      { readJson: {} }

      # extract JSON objects into fields
      { extractJsonPaths {
          flatten: true
          paths: {
            name: /name
            age: /age
          }
      } }

      # add a creation timestamp to the record
      #{ addCurrentTime {
      #  field: timestamp
      #  preserveExisting: true
      #} }

      # convert the extracted fields to an Avro record
      # described by the schema in this file
      { toAvro {
          schemaFile: /home/litao/litao.avsc
      } }

      # serialize the record as Avro
      { writeAvroToByteArray: {
          format: containerlessBinary
      } }
    ]
  }
]
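To make the command chain concrete, here is a small stdlib-only Python sketch of what the first two morphline steps (readJson and extractJsonPaths) do to a log line. The sample record is hypothetical, the sketch ignores morphline's flatten handling of arrays, and the Avro steps (toAvro, writeAvroToByteArray) are omitted since they require an Avro library:

```python
import json

def extract_json_paths(record, paths):
    """Mimic morphline's extractJsonPaths for simple /a/b style paths.

    Missing paths yield None; array flattening is not modeled here.
    """
    out = {}
    for field, path in paths.items():
        node = record
        for part in path.strip("/").split("/"):
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                node = None
                break
        out[field] = node
    return out

line = '{"name": "litao", "age": 30}'   # hypothetical log line from litao.json
record = json.loads(line)               # readJson
fields = extract_json_paths(record, {"name": "/name", "age": "/age"})
print(fields)                           # → {'name': 'litao', 'age': 30}
```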
3 Required jar files:
config-1.3.1.jar
metrics-healthchecks-3.0.2.jar
kite-morphlines-core-1.1.0.jar
kite-morphlines-json-1.1.0.jar
kite-morphlines-avro-1.1.0.jar