If your log data is already in Avro format, you can use the approach from the previous post (
https://blog.csdn.net/qq_29829081/article/details/80518671) to dump the Avro data to Parquet directly. In practice, however, log data is usually JSON rather than Avro, so this post covers how to stream JSON into Parquet with Morphline. The example here is deliberately simple; production JSON is far more complex, but Morphline can still handle it with a generic approach, which the next post will describe in detail.
This post shows how to use Flume's Morphline interceptor to convert JSON records to Avro on the fly, and then write them out in Parquet format with the Kite DatasetSink. The key pieces are the Flume configuration and the chain of Morphline commands.
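For reference, a minimal schema matching the name/age fields extracted in the morphline config below might look like the following. The actual contents of litao.avsc are not shown here, so this is only an assumed sketch; note that the copy at hdfs://bigocluster/user/litao/litao.avsc (used by the static interceptor) and the local copy at /home/litao/litao.avsc (used by the toAvro command) must be identical.

```json
{
  "type": "record",
  "name": "litao",
  "fields": [
    {"name": "name", "type": ["null", "string"], "default": null},
    {"name": "age",  "type": ["null", "int"],    "default": null}
  ]
}
```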
1 Flume configuration:
(1) Flume configuration on the nginx side
# Name the components on this agent
a1.sources = r
a1.sinks = k_kafka
a1.channels = c_mem
# Channels info
a1.channels.c_mem.type = memory
a1.channels.c_mem.capacity = 2000
a1.channels.c_mem.transactionCapacity = 300
a1.channels.c_mem.keep-alive = 60
# Sources info
a1.sources.r.type = exec
a1.sources.r.shell = /bin/bash -c
a1.sources.r.command = tail -F /home/litao/litao.json
a1.sources.r.channels = c_mem
# Sinks info
a1.sinks.k_kafka.channel = c_mem
a1.sinks.k_kafka.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k_kafka.kafka.bootstrap.servers = kafka1:9093,kafka2:9093,kafka3:9093,kafka4:9093,kafka5:9093,kafka6:9093
a1.sinks.k_kafka.kafka.topic = test_2018-03-14
a1.sinks.k_kafka.kafka.flumeBatchSize = 5
a1.sinks.k_kafka.kafka.producer.acks = 1
(2) Flume configuration on the Kafka side
# Name the components on this agent
a1.channels = c1
a1.sources = r1
a1.sinks = k1
# Channel config
a1.channels.c1.type = memory
a1.channels.c1.capacity = 500000
a1.channels.c1.transactionCapacity = 100000
a1.channels.c1.keep-alive = 50
# Sources info
a1.sources.r1.type = com.bigo.flume.source.kafka.KafkaSource
a1.sources.r1.channels = c1
a1.sources.r1.kafka.bootstrap.servers = kafka1:9093,kafka2:9093,kafka3:9093,kafka4:9093,kafka5:9093,kafka6:9093
a1.sources.r1.kafka.topics = test_2018-03-14
a1.sources.r1.kafka.consumer.group.id = test_2018-03-14.conf_flume_group
a1.sources.r1.kafka.consumer.timeout.ms = 100
a1.sources.r1.batchSize = 2000
# Config Interceptors
a1.sources.r1.interceptors = i1 morphline
# Inject the Schema into the header so the AvroEventSerializer can pick it up
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = flume.avro.schema.url
#a1.sources.r1.interceptors.i1.value = file:/home/litao/litao.avsc
a1.sources.r1.interceptors.i1.value = hdfs://bigocluster/user/litao/litao.avsc
# Morphline interceptor config
a1.sources.r1.interceptors.morphline.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
a1.sources.r1.interceptors.morphline.morphlineFile = /etc/flume/conf/a1/morphline.conf
a1.sources.r1.interceptors.morphline.morphlineId = convertJsonToAvro
# Sink config
a1.sinks.k1.type = org.apache.flume.sink.kite.DatasetSink
a1.sinks.k1.channel = c1
a1.sinks.k1.kite.dataset.uri = dataset:hdfs://bigocluster/flume/hellotalk/parquet
a1.sinks.k1.kite.batchSize = 100
a1.sinks.k1.kite.rollInterval = 30
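The Kite DatasetSink requires the target dataset to exist before the agent starts writing. One way to create it up front is with the kite-dataset CLI; the invocation below mirrors the kite.dataset.uri above and assumes /home/litao/litao.avsc holds the record schema:

```shell
# Create the target Parquet dataset once, before starting the Flume agent.
# The URI matches a1.sinks.k1.kite.dataset.uri in the config above; the
# schema file's exact contents are an assumption (name/age fields).
kite-dataset create dataset:hdfs://bigocluster/flume/hellotalk/parquet \
  --schema /home/litao/litao.avsc \
  --format parquet
```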
2 Morphline configuration:
morphlines: [
  {
    id: convertJsonToAvro
    importCommands: [ "org.kitesdk.**" ]
    commands: [
      # read the JSON blob
      { readJson: {} }

      # extract JSON objects into fields
      { extractJsonPaths {
          flatten: true
          paths: {
            name: /name
            age: /age
          }
      } }

      # add a creation timestamp to the record
      #{ addCurrentTime {
      #  field: timestamp
      #  preserveExisting: true
      #} }

      # convert the extracted fields to an Avro record
      # described by the schema in this file
      { toAvro {
          schemaFile: /home/litao/litao.avsc
      } }

      # serialize the record as Avro
      { writeAvroToByteArray: {
          format: containerlessBinary
      } }
    ]
  }
]
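To make the command chain concrete, here is a small stdlib-only Python sketch of what the first two morphline steps (readJson and extractJsonPaths) do to a log line. The sample record is hypothetical, the sketch ignores morphline's flatten handling of arrays, and the Avro steps (toAvro, writeAvroToByteArray) are omitted since they require an Avro library:

```python
import json

def extract_json_paths(record, paths):
    """Mimic morphline's extractJsonPaths for simple /a/b style paths.

    Missing paths yield None; array flattening is not modeled here.
    """
    out = {}
    for field, path in paths.items():
        node = record
        for part in path.strip("/").split("/"):
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                node = None
                break
        out[field] = node
    return out

line = '{"name": "litao", "age": 30}'   # hypothetical log line from litao.json
record = json.loads(line)               # readJson
fields = extract_json_paths(record, {"name": "/name", "age": "/age"})
print(fields)                           # → {'name': 'litao', 'age': 30}
```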
3 Required jar files:
config-1.3.1.jar
metrics-healthchecks-3.0.2.jar
kite-morphlines-core-1.1.0.jar
kite-morphlines-json-1.1.0.jar
kite-morphlines-avro-1.1.0.jar