Code optimization points for ETL of log data into an HBase table

Latest recommended article published on 2021-12-20 14:28:15

乔尼娜沙德星


Category: spark

Article link: https://blog.csdn.net/qq_35315363/article/details/98663663


(1) When creating the table

Enable data compression on the table

      // Enable data compression
      family.setCompressionType(Compression.Algorithm.SNAPPY)

Pre-split the table into regions

      admin.createTable(desc,Array(
        Bytes.toBytes("145057118"),Bytes.toBytes("145057138"),
        Bytes.toBytes("145057158"),Bytes.toBytes("145057188")
      ))
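To see what the four split keys above buy you, here is a minimal sketch in plain Scala (no HBase dependency) of how they divide the key space into five regions. The object `PreSplitSketch`, the helper names, and the sample row keys are hypothetical and exist only for illustration; the comparison mirrors the unsigned, byte-wise lexicographic ordering HBase applies to row keys.

```scala
object PreSplitSketch {
  // The four split keys passed to admin.createTable in the text.
  val splitKeys: Vector[Array[Byte]] =
    Vector("145057118", "145057138", "145057158", "145057188").map(_.getBytes("UTF-8"))

  // Lexicographic comparison of unsigned bytes (the ordering used for HBase row keys).
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val diff = (a(i) & 0xff) - (b(i) & 0xff)
      if (diff != 0) return diff
      i += 1
    }
    a.length - b.length
  }

  // Region i covers [splitKeys(i - 1), splitKeys(i)); the first region has no
  // start key and the last has no end key, so 4 split keys give regions 0 to 4.
  def regionIndex(rowKey: String): Int = {
    val key = rowKey.getBytes("UTF-8")
    splitKeys.count(split => compareBytes(key, split) >= 0)
  }
}
```

For example, `PreSplitSketch.regionIndex("1450571250000_abc")` falls between the first and second split keys, so writes with that prefix go to the second region instead of piling onto a single region of a freshly created table.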

Configure reads from the table so that the data read is not cached (the cache blocks setting)

(2) Optimizing the Spark program

.filter(tuple =>eventTypeList.contains(EventEnum.valueOfAlias(tuple._1)))

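eventTypeList lives in the Driver, while the filter runs inside tasks on the Executors.

If the RDD has 3 partitions sitting on different executors, eventTypeList has to be stored 3 times.

In real development, tens of GB of data are processed per day and the number of partitions can be large: one data block maps to one partition, and one partition maps to one Task. With 1000 partitions and an eventTypeList of 1 MB, the copies alone consume about 1 GB.

A better option is to store one copy per executor: with 10 executors, only 10 MB is needed.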
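The arithmetic above can be written out as a small sketch. The object `BroadcastMemorySketch` and the function `totalCopiesMb` are hypothetical names; the numbers (a 1 MB list, 1000 tasks, 10 executors) are the ones assumed in the text, and the model ignores serialization overhead.

```scala
object BroadcastMemorySketch {
  // Total memory in MB taken up by `copies` copies of a value of `valueSizeMb` MB.
  def totalCopiesMb(copies: Int, valueSizeMb: Int): Int = copies * valueSizeMb

  def main(args: Array[String]): Unit = {
    val listSizeMb = 1    // size of eventTypeList assumed in the text
    val tasks      = 1000 // one task per partition
    val executors  = 10

    // One copy shipped with every task closure.
    println(s"per-task copies:     ${totalCopiesMb(tasks, listSizeMb)} MB")
    // One copy held per executor.
    println(s"per-executor copies: ${totalCopiesMb(executors, listSizeMb)} MB")
  }
}
```

The gap grows with the number of partitions, not with the number of executors, which is why the saving matters most for jobs with many small tasks.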

Spark supports two types of shared variables, that is, two ways of sharing variables across tasks.

broadcast variables

which can be used to cache a value in memory on all nodes,

Use a broadcast variable to ship the list out: the data is sent once to every executor.

    // Broadcast the List so that each executor holds a single copy
    val eventTypeBroadcast = sc.broadcast(eventTypeList)

    val eventPutRDD = parseEventLogRDD
        // Filter on the event type (eventType)
        .filter(tuple => eventTypeBroadcast.value.contains(EventEnum.valueOfAlias(tuple._1)))
        .map {
          case (eventAlias, logInfo) => {
            // ...
          }
        }

accumulators

which are variables that are only “added” to, such as counters and sums.
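As a minimal sketch of an accumulator in use, the following counts blank log lines while a job runs. It assumes Spark 2.x or later (`spark-core`) on the classpath and a local master; the object `AccumulatorSketch`, the function `countBlankLines`, and the accumulator name are hypothetical and not taken from the original program.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSketch {
  // Counts blank lines with a long accumulator; tasks only add to it,
  // and the driver reads the total once the action has finished.
  def countBlankLines(lines: Seq[String]): Long = {
    val conf = new SparkConf().setAppName("accumulator-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val blankLines = sc.longAccumulator("blankLines")
      sc.parallelize(lines).foreach { line =>
        if (line.trim.isEmpty) blankLines.add(1L)
      }
      blankLines.value.longValue()
    } finally {
      sc.stop()
    }
  }
}
```

The update happens inside `foreach`, an action. Spark only guarantees that each task's update is applied exactly once for updates made inside actions, so counters placed in transformations such as `map` can over-count when tasks are retried.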

(3) Use HFileOutputFormat

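There are two ways to store data into an HBase table.

The put approach

putData -> WAL -> MemStore -> StoreFile (HFile)

The HFile approach

Data -> HFile -> load into the table

Flushing the cache: flush 'ns1:t1'

This forces the data in the MemStore to be written out to a StoreFile (HFile).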
