The company previously collected log data in JSON format (since the raw logs were JSON to begin with). JSON's advantages are simple development and good human readability; its drawbacks are that it takes up far too much storage and the corresponding Hive tables are slow to query. We therefore started an investigation aimed at fixing these two pain points. The investigation went through quite a few steps, and I will write several posts to walk through them. Since this is still exploratory work and some of it is not yet thorough, suggestions and discussion are very welcome.
To improve data transfer efficiency, we plan to switch the log output format to Avro. Avro is schema-based: when Avro data is read, the schema it was written with is always available. As a result there is no per-record schema overhead on write, which speeds up serialization and keeps the serialized data small.
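As a rough illustration of that last point (this is not part of the pipeline itself), the following Python sketch binary-encodes one sample record with the avro package and compares its size with the JSON text form; the record and schema are the same ones used in the test below.

# avro_size_demo.py -- illustrative only; assumes the `avro` PyPI package (>=1.10,
# where avro.schema.parse is available; older avro-python3 uses schema.Parse).
import io
import json

import avro.schema
from avro.io import BinaryEncoder, DatumWriter

SCHEMA_JSON = '''
{
  "namespace": "com.howdy",
  "name": "some_schema",
  "type": "record",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
'''

schema = avro.schema.parse(SCHEMA_JSON)
record = {"name": "litao", "age": 18}

# Binary-encode the record; note that the schema itself is not written per record.
buf = io.BytesIO()
DatumWriter(schema).write(record, BinaryEncoder(buf))

print("json bytes:", len(json.dumps(record).encode("utf-8")))
print("avro bytes:", len(buf.getvalue()))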
Before changing the log output format, we first ran a quick proof of concept: streaming Avro data through Flume to Kafka and then on to HDFS. The results were as follows:
1 With an exec source, tail -F breaks the Avro data structure (an Avro container file carries its schema in a header and groups records into blocks separated by sync markers, so splitting it line by line corrupts it). Not feasible!
2 With a spooldir source, each file is ingested as one complete Avro file, so the original data structure is preserved when it is read.
(1) JSON data:
{"name": "litao", "age": 18}
{"name": "
litao", "age": 18}
{"name": "
litao", "age": 18}
(2) Schema:
{
  "namespace": "com.howdy",
  "name": "some_schema",
  "type": "record",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
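Since the spooldir test needs a complete Avro container file sitting on disk, here is a minimal Python sketch (again assuming the avro PyPI package; the output file name is just for illustration) that turns the sample records from (1) into such a file using the schema from (2), saved locally as litao.avsc:

# make_test_avro.py -- a sketch for producing the test file picked up by the
# spooldir source; the output file name is an assumption.
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# litao.avsc holds the schema shown in (2).
with open("litao.avsc") as f:
    schema = avro.schema.parse(f.read())

records = [{"name": "litao", "age": 18}] * 3  # the sample records from (1)

# An Avro container file stores the schema once in its header, followed by
# blocks of binary-encoded records.
writer = DataFileWriter(open("/home/litao/avro_file/test.avro", "wb"),
                        DatumWriter(), schema)
for rec in records:
    writer.append(rec)
writer.close()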
(3) Flume configuration files:
To try out Flume's Avro Sink, I deliberately configured Flume on both the nginx log server and the Kafka server (the Kafka and HDFS configurations could be merged into a single agent; I split them to make testing easier, so adjust to your own setup).
Nginx configuration:
# Name the components on this agent
a1.sources = r
a1.sinks = k_kafka
a1.channels = c_mem
# Channels info
a1.channels.c_mem.type = memory
a1.channels.c_mem.capacity = 2000
a1.channels.c_mem.transactionCapacity = 300
a1.channels.c_mem.keep-alive = 60
# Sources info
a1.sources.r.type = spooldir
a1.sources.r.channels = c_mem
a1.sources.r.spoolDir = /home/litao/avro_file/
a1.sources.r.fileHeader = true
a1.sources.r.deserializer = avro
# Sinks info
a1.sinks.k_kafka.type = avro
a1.sinks.k_kafka.hostname = localhost
a1.sinks.k_kafka.port = 55555
a1.sinks.k_kafka.channel = c_mem
Kafka configuration:
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Sources info
a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = localhost
a1.sources.r1.port = 55555
# Channels info
a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 500
a1.channels.c1.keep-alive = 50
# Sinks info
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = kafka1:9093,kafka2:9093,kafka3:9093,kafka4:9093,kafka5:9093,kafka6:9093
a1.sinks.k1.kafka.topic = test_2018-03-14
a1.sinks.k1.kafka.flumeBatchSize = 5
a1.sinks.k1.kafka.producer.acks = 1
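Before wiring up the HDFS agent, it is worth sanity-checking what actually lands in the topic. With the spooldir source's avro deserializer, each Flume event body should be a single binary-encoded Avro datum (no container header), and by default the Kafka sink writes that body as the message value. Under those assumptions, a small Python sketch using kafka-python can decode a few messages with the schema (broker list shortened here):

# kafka_check.py -- hedged sketch: assumes each Kafka message value is a raw
# binary-encoded Avro datum, as produced by the spooldir source's avro deserializer.
import io

import avro.schema
from avro.io import BinaryDecoder, DatumReader
from kafka import KafkaConsumer  # pip install kafka-python

with open("litao.avsc") as f:
    schema = avro.schema.parse(f.read())
datum_reader = DatumReader(schema)

consumer = KafkaConsumer(
    "test_2018-03-14",
    bootstrap_servers=["kafka1:9093", "kafka2:9093", "kafka3:9093"],
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop after 10s with no new messages
)

for msg in consumer:
    datum = datum_reader.read(BinaryDecoder(io.BytesIO(msg.value)))
    print(datum)  # e.g. {'name': 'litao', 'age': 18}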
HDFS configuration:
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Sources info
a1.sources.r1.channels = c1
a1.sources.r1.type = com.bigo.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = kafka1:9093,kafka2:9093,kafka3:9093,kafka4:9093,kafka5:9093,kafka6:9093
a1.sources.r1.kafka.topics = test_2018-03-14
a1.sources.r1.kafka.consumer.group.id = test_2018-03-14.conf_flume_group
a1.sources.r1.kafka.consumer.timeout.ms = 100
# Inject the schema into the header so the AvroEventSerializer can pick it up
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = flume.avro.schema.url
a1.sources.r1.interceptors.i1.value = hdfs://bigocluster/user/litao/litao.avsc
# Channels info
a1.channels.c1.type = memory
a1.channels.c1.capacity = 5000
a1.channels.c1.transactionCapacity = 1000
a1.channels.c1.keep-alive = 50
# Sinks info
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.serializer = org.apache.flume.serialization.AvroEventSerializer$Builder
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.path = hdfs://bigocluster/flume/bigolive/test_2018-03-14
a1.sinks.k1.hdfs.filePrefix = test.%Y-%m-%d
a1.sinks.k1.hdfs.fileSuffix = .avro
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.idleTimeout = 603
a1.sinks.k1.hdfs.useLocalTimeStamp = false
a1.sinks.k1.hdfs.fileType = DataStream
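Once files start rolling into HDFS, you can copy one down (for example with hdfs dfs -get) and check that it is a valid Avro container file. A minimal sketch, where part.avro stands in for one of the rolled files under hdfs://bigocluster/flume/bigolive/test_2018-03-14:

# check_hdfs_output.py -- a minimal sketch; "part.avro" is a placeholder for a
# rolled file copied out of HDFS.
from avro.datafile import DataFileReader
from avro.io import DatumReader

# DataFileReader takes the schema from the file header itself, so no external
# schema is needed to decode the records.
reader = DataFileReader(open("part.avro", "rb"), DatumReader())
for record in reader:
    print(record)  # e.g. {'name': 'litao', 'age': 18}
reader.close()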
(4) Hive table creation statements (either variant works):
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
CREATE EXTERNAL TABLE tmp.test_hdfs_litao
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://bigocluster/user/litao/litao.avsc')
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://bigocluster/flume/bigolive/test_2018-03-14';
CREATE EXTERNAL TABLE tmp.test_hdfs_litao
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://bigocluster/flume/bigolive/test_2018-03-14'
TBLPROPERTIES (
'avro.schema.literal'='{
"namespace": "com.howdy",
"name": "some_schema",
"type": "record",
"fields": [ { "name":"name","type":"string"},
{ "name":"age","type":"int"}
]
}'
);
(5) Verifying the data
SELECT * FROM tmp.test_hdfs_litao;
(6) The raw Avro data
3 Conclusion
If our logs are generated periodically as individual Avro files, they can be picked up with Flume's spooldir source, streamed through the pipeline (here via Kafka), stored on HDFS, and then read through Hive.