Some Spark bulk load issues

南风知我意丿

Last modified 2022-09-04 17:35:06

Category: Spark-Hbase. Tags: spark, hbase, big data

First published 2022-06-02 17:22:46

Article link: https://blog.csdn.net/Lzx116/article/details/125032489

1. Version issues

The class used to run the bulk load depends on the HBase version:

// HBase 2.1.10
val load = new LoadIncrementalHFiles(HBaseConfiguration.create())
load.doBulkLoad(new Path(tmpdir), conn.getAdmin, table, regionLocator)

// HBase 2.3.5
val load = new BulkLoadHFilesTool(HBaseConfiguration.create())
load.doBulkLoad(new Path(tmpdir), conn.getAdmin, table, regionLocator)

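2. Reduce-stage issues

When bulk loading data into HBase with Spark, I ran into the following problem. To line the output up with the HBase table, the Spark write parallelism is kept equal to the number of regions. The drawback is that a table with too few regions gets low parallelism, and with a large data volume that low parallelism easily leads to OOM. What can be done in that case?

The map stage of a bulk load is fast, but the reduce stage is painfully slow. The reduce stage is supplied by HBase, so we cannot set the number of reducers ourselves, and that makes the job very inefficient.

Solution

Pre-split the table sensibly when creating it. The number of split points determines the number of reducers (split points + 1 = reducers), so more pre-split regions mean more parallelism and a faster load.

How do you pre-split an HBase table?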

hbase shell

create 'TEST.TEST_REGION', 'i', SPLITS => ['1','2','3','4','5','6']

phoenix

CREATE TABLE TEST.TEST_REGION (
     "rk" VARCHAR NOT NULL PRIMARY KEY,
     "i"."sha1" VARCHAR,
     "i"."task_type" VARCHAR,
     "i"."task_id" BIGINT,
     "i"."task_createtime" BIGINT
)column_encoded_bytes=0 SPLIT ON ('1','2','3','4','5','6');

With these 6 split points the table has 7 regions, so the job runs 6 + 1 = 7 reducers and the load speeds up.
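The "split points + 1" rule can be sketched in a few lines of plain Java. This is only an illustration, not HBase code: the class and method names are made up for this example, and it compares keys with String.compareTo, which matches HBase's byte-wise ordering only for plain ASCII row keys such as the ones used here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of HBase or Spark.
class PreSplitDemo {
    // N sorted split keys cut the row key space into N + 1 regions.
    static int regionCount(List<String> splitKeys) {
        return splitKeys.size() + 1;
    }

    // Index of the region (and so of the reducer) that a row key falls into.
    // A region covers [startKey, endKey), so a key equal to a split key
    // belongs to the region that starts at that split key.
    static int regionIndex(List<String> sortedSplitKeys, String rowKey) {
        int index = 0;
        for (String split : sortedSplitKeys) {
            if (rowKey.compareTo(split) >= 0) {
                index++;
            } else {
                break;
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("1", "2", "3", "4", "5", "6");
        System.out.println("regions = " + regionCount(splits));
        for (String key : Arrays.asList("0abc", "1abc", "35xyz", "9zzz")) {
            System.out.println(key + " -> region " + regionIndex(splits, key));
        }
    }
}
```

The sketch also shows why the split keys should match the real key distribution: if every row key started with "9", all rows would land in the last region and the extra reducers would sit idle.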

3. Too much data (32 HFiles)

Log output

Exception in thread "main" java.io.IOException: Trying to load more than 32 hfiles to one family of one region
	at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:377)
	at hbase_Insert.Hbase_Insert.main(Hbase_Insert.java:241)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(

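Analysis:
More than 32 HFiles were being loaded into a single column family of a single region, which exceeds the limit HBase applies by default.

Solution

hbase.hregion.max.filesize
Maximum size of a region's store files for a single column family. Under ConstantSizeRegionSplitPolicy the region splits automatically once this value is exceeded. The default assumed here is 1 GB; the shipped default differs between HBase versions (recent releases ship 10 GB), so check hbase-default.xml for your version.

hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily
Maximum number of HFiles allowed per region per column family in one bulk load. The default is 32.

Taken together, these two defaults mean that a single bulk load can bring at most about 1 GB × 32 = 32 GB into one column family of one region. Beyond that, the load fails.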
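The arithmetic behind that ceiling can be sketched as follows. This is a rough model under stated assumptions, not HBase's implementation: the class and method names are invented for this example, and the rejection rule is assumed to be "file count greater than the limit", as the wording of the error message suggests.

```java
// Hypothetical helper for illustration; not part of HBase.
class HFileLimitDemo {
    static final int DEFAULT_MAX_HFILES = 32;

    // Rough ceiling on the data one bulk load can put into a single
    // column family of a single region: max file size times max file count.
    static long maxBytesPerRegionFamily(long maxFileSizeBytes, int maxHFiles) {
        return Math.multiplyExact(maxFileSizeBytes, (long) maxHFiles);
    }

    // Assumed rule: the load is rejected when the count exceeds the limit.
    static boolean exceedsLimit(int hfileCount, int maxHFiles) {
        return hfileCount > maxHFiles;
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        System.out.println("ceiling in bytes: "
                + maxBytesPerRegionFamily(oneGb, DEFAULT_MAX_HFILES));
        System.out.println("33 files rejected: " + exceedsLimit(33, DEFAULT_MAX_HFILES));
        System.out.println("32 files rejected: " + exceedsLimit(32, DEFAULT_MAX_HFILES));
    }
}
```

Note that the check is on the number of files, not on their total size, so many small HFiles aimed at one region can trigger the error long before 32 GB is reached.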

Set both parameters in hbase-site.xml (a one-time, permanent fix):

<!-- Maximum region size per column family, in bytes (10737418240 = 10 GB) -->
<property>
	<name>hbase.hregion.max.filesize</name>
	<value>10737418240</value>
</property>
<!-- Maximum number of HFiles per region per family in one bulk load -->
<property>
	<name>hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily</name>
	<value>3200</value>
</property>

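Alternatively, set them in code:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hbase.HBaseConfiguration

    val conf: Configuration = HBaseConfiguration.create()

    // Raise the limits so that a load with many HFiles is not rejected.
    // 10737418240 does not fit in an Int, so it must be set as a Long.
    conf.setLong("hbase.hregion.max.filesize", 10737418240L)
    conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 3200)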
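One detail worth checking here is the type of the value: 10737418240 is larger than the biggest 32-bit integer, so it has to be passed as a long (Hadoop's Configuration offers setLong for that). The stand-alone sketch below uses only the JDK; the class name is made up for this example.

```java
// Hypothetical helper for illustration; uses only the JDK.
class LongSettingDemo {
    static final long TEN_GB = 10L * 1024 * 1024 * 1024; // 10737418240

    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println("value:             " + TEN_GB);
        System.out.println("Integer.MAX_VALUE: " + Integer.MAX_VALUE);
        System.out.println("fits in int:       " + fitsInInt(TEN_GB));
        // A narrowing cast does not fail; it silently wraps around.
        System.out.println("cast to int:       " + (int) TEN_GB);
    }
}
```

An int literal of that size does not even compile, and a cast would silently produce a negative number, so a long literal with the L suffix is the safe choice.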

4. HBaseConfiguration cannot be found

Error

java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

Solution

1. In hadoop-env.sh, add: export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/export/server/hbase-1.3.1/lib/*

2. In yarn-site.xml, append the HBase lib directory (the same hbase-1.3.1/lib/* path as in step 1) to the yarn.application.classpath property:

<property>
	<name>yarn.application.classpath</name>
	<value>
	$HADOOP_CONF_DIR,/usr/hdp/${hdp.version}/hadoop-client/*,/export/server/hbase-1.3.1/lib/*
	</value>
</property>
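To confirm that the classpath change took effect, a small probe run with the same classpath as the job can help. The sketch below is an assumption-laden illustration rather than an official tool: the class name is invented, and it only reports whether a class can be located by the class loader that loaded the probe itself.

```java
// Hypothetical diagnostic for illustration; uses only the JDK.
class ClasspathCheck {
    // Returns true if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0
                ? args[0]
                : "org.apache.hadoop.hbase.HBaseConfiguration";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

If the probe prints false for HBaseConfiguration, the HBase jars are still missing from the classpath the process actually sees.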

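5. HBase reports "ClusterId read in ZooKeeper is null"

1. Symptom:

When connecting to HBase, hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort are both set correctly, yet the client keeps logging: INFO client.ZooKeeperRegistry: ClusterId read in ZooKeeper is null
This happens when the Configuration object was obtained with new Configuration().

2. Analysis:

The key setting involved is zookeeper.znode.parent, whose default value is /hbase.
If the cluster uses a different value (ours, for example, uses /hbase-unsecure), the client looks under the wrong znode and reports this message.
A Configuration object obtained with new Configuration() does not read the HBase configuration file hbase-site.xml, so the client falls back to the default.

3. Solution:

Set the property in code to the value configured in hbase-site.xml:
conf.set("zookeeper.znode.parent", "/hbase-unsecure");
Alternatively, create the Configuration object with HBaseConfiguration.create().
Either change resolves the problem.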
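The mismatch can be pictured with a toy model in plain Java. It does not talk to ZooKeeper or HBase at all: a Map stands in for the client configuration, and the class and method names are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model for illustration only; no HBase or ZooKeeper involved.
class ZnodeParentDemo {
    static final String KEY = "zookeeper.znode.parent";
    static final String DEFAULT_PARENT = "/hbase";

    // The parent znode the client will use: the explicit setting, else the default.
    static String effectiveParent(Map<String, String> clientConf) {
        return clientConf.getOrDefault(KEY, DEFAULT_PARENT);
    }

    // The client finds the cluster ID only if it looks under the cluster's parent znode.
    static boolean findsClusterId(Map<String, String> clientConf, String clusterParent) {
        return effectiveParent(clientConf).equals(clusterParent);
    }

    public static void main(String[] args) {
        String clusterParent = "/hbase-unsecure";

        // Stands in for new Configuration(): hbase-site.xml was never loaded.
        Map<String, String> bare = new HashMap<>();
        // Stands in for a configuration that carries the cluster's value.
        Map<String, String> fixed = new HashMap<>();
        fixed.put(KEY, clusterParent);

        System.out.println("bare conf finds cluster id:  " + findsClusterId(bare, clusterParent));
        System.out.println("fixed conf finds cluster id: " + findsClusterId(fixed, clusterParent));
    }
}
```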

6. Can not create a Path from a null string

1. Symptom:

The error looks like this:

Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from a null string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:122)
        at org.apache.hadoop.fs.Path.<init>(Path.java:134)
        at org.apache.hadoop.fs.Path.<init>(Path.java:88)
        at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.configurePartitioner(HFileOutputFormat2.java:596)
        at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.configureIncrementalLoad(HFileOutputFormat2.java:445)
        at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.configureIncrementalLoad(HFileOutputFormat2.java:410)
        at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.configureIncrementalLoad(HFileOutputFormat2.java:372)
        at mastercom.cn.hbase.helper.AddPaths.addUnCombineConfigJob(AddPaths.java:272)
        at mastercom.cn.hbase.config.HbaseBulkloadConfigMain.CreateJob(HbaseBulkloadConfigMain.java:129)
        at mastercom.cn.hbase.config.HbaseBulkloadConfigMain.main(HbaseBulkloadConfigMain.java:141)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

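The stack trace shows that the error is raised inside the HFileOutputFormat2 class, which is central to loading data with the bulk load approach.
Let's track the error down step by step.
The exception is thrown by this method of the Path class: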

private void checkPathArg( String path ) throws IllegalArgumentException {
    // disallow construction of a Path from an empty string
    if ( path == null ) {
      throw new IllegalArgumentException(
          "Can not create a Path from a null string");
    }
    if( path.length() == 0 ) {
       throw new IllegalArgumentException(
           "Can not create a Path from an empty string");
    }   
  }

Combined with the error message, this tells us that path is null.
So where does the null come from? Let's keep reading the source:

static void configurePartitioner(Job job, List<ImmutableBytesWritable> splitPoints)
      throws IOException {
    Configuration conf = job.getConfiguration();
    // create the partitions file
    FileSystem fs = FileSystem.get(conf);
    Path partitionsPath = new Path(conf.get("hbase.fs.tmp.dir"), "partitions_" + UUID.randomUUID());
    fs.makeQualified(partitionsPath);
    writePartitions(conf, partitionsPath, splitPoints);
    fs.deleteOnExit(partitionsPath);
    
    // configure job to use it
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(conf, partitionsPath);
  }

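Solution:

In the source above, the only place that can produce a null related to a path is this line:
Path partitionsPath = new Path(conf.get("hbase.fs.tmp.dir"), "partitions_" + UUID.randomUUID());
To confirm, print the value of hbase.fs.tmp.dir right after obtaining the conf object: it is indeed null.
Adding this line to the code fixes it: conf.set("hbase.fs.tmp.dir", "/wangyou/mingtong/mt_wlyh/tmp/hbase-staging");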
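A fail-fast guard makes this kind of problem easier to diagnose than a null deep inside a library call. The sketch below is a hypothetical helper in plain Java: a Map stands in for the Hadoop Configuration, and the class and method names are not part of HBase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical guard for illustration; a Map stands in for the job configuration.
class TmpDirGuard {
    static final String KEY = "hbase.fs.tmp.dir";

    // Returns the configured directory, or fails with a message that names the key.
    static String requireTmpDir(Map<String, String> conf) {
        String dir = conf.get(KEY);
        if (dir == null || dir.isEmpty()) {
            throw new IllegalStateException(
                    KEY + " is not set; set it before configuring the bulk load job");
        }
        return dir;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        try {
            requireTmpDir(conf);
        } catch (IllegalStateException e) {
            System.out.println("missing: " + e.getMessage());
        }
        conf.put(KEY, "/wangyou/mingtong/mt_wlyh/tmp/hbase-staging");
        System.out.println("configured: " + requireTmpDir(conf));
    }
}
```

Running such a check right after building the configuration points straight at the missing key, instead of at the Path constructor.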

7. Error when querying HBase

Log output:

Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hbase.util.ByteStringer
  at org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:989)
  at org.apache.hadoop.hbase.protobuf.RequestConverter.buildScanRequest(RequestConverter.java:485)
  at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:195)
  at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:181)
  at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
  ... 6 more
java.lang.NullPointerException
  at mastercom.cn.bigdata.util.hbase.HbaseDBHelper.qureyAsList(HbaseDBHelper.java:86)
  at conf.config.CellBuildInfo.loadCellBuildHbase(CellBuildInfo.java:150)
  at mro.loc.MroXdrDeal.init(MroXdrDeal.java:200)
  at mapr.mro.loc.MroLableFileReducers$MroDataFileReducers.reduce(MroLableFileReducers.java:80)
  at mapr.mro.loc.MroLableFileReducers$MroDataFileReducers.reduce(MroLableFileReducers.java:1)
  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
  at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hbase.util.ByteStringer
  at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(RpcRetryingCaller.java:229)
  at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:140)
  at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:310)
  at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:291)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)

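Solution:

In some cases a query by row key can return a null result set, but my code had no handling for a possible null pointer and went straight into a for loop over that result set:
for (Result result : results)
Iterating over a null result set naturally throws a NullPointerException.
Fix: add a null check before the loop. This guard only removes the NullPointerException; the NoClassDefFoundError for ByteStringer in the same log is a separate failure that the null check does not address.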
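The null check can be written once as a small helper. This is a generic sketch in plain Java, with String rows standing in for HBase Result objects and names invented for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; String stands in for HBase's Result type.
class NullSafeResults {
    // Replaces a null list with an empty one so that a for-each loop is always safe.
    static <T> List<T> orEmpty(List<T> results) {
        return results == null ? Collections.<T>emptyList() : results;
    }

    static int countRows(List<String> results) {
        int count = 0;
        for (String row : orEmpty(results)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("null result set: " + countRows(null));
        System.out.println("two rows:        " + countRows(Arrays.asList("r1", "r2")));
    }
}
```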

8. HMaster dies right after starting

The log contains the following error:

 FATAL [kiwi02:60000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
  org.apache.hadoop.hbase.util.FileSystemVersionException: HBase file layout needs to be upgraded. 
  You  have version null and I want version 8. 
  Consult http://hbase.apache.org/book.html for further information about upgrading HBase. 
  Is your hbase.rootdir valid? If so, you may need to run 'hbase hbck -fixVersionFile'.

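Solution:

Delete the HBase directory in HDFS, then restart the HBase master. Be aware that this removes all HBase data stored under that directory, so it only suits a cluster whose data can be thrown away; the error message itself suggests running 'hbase hbck -fixVersionFile' as an alternative.

Which directory is the HBase directory?
It is configured as hbase.rootdir in $HBASE_HOME/conf/hbase-site.xml and is usually /hbase: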

<property>
      <name>hbase.rootdir</name>
      <value>/hbase</value>
  </property>
