Building a Solr + Key-Value Store Indexer + HBase secondary-index framework on CDH (Part 2): offline data, HBase bulk load, and simple optimizations for building the Solr index

Article link: https://blog.csdn.net/yzh865318761/article/details/82899758

Please credit the source when reposting!

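While importing user information into HBase, the error below was reported. Judging from the log, it looks like a communication problem with one misbehaving machine. Most of the fixes I found online deal with hosts configuration and similar issues, but after checking, the hosts configuration turned out to be fine.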

18/09/26 12:41:05 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.2.0.8:36423, server: hadoop09/10.2.0.9:2181
18/09/26 12:41:05 INFO zookeeper.ClientCnxn: Session establishment complete on server hadoop09/10.2.0.9:2181, sessionid = 0x36610ce89590826, negotiated timeout = 60000
18/09/26 12:41:05 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.2.0.9:8020/tmp/bulkload_test/data_db_t_data_user_info_new/_SUCCESS
18/09/26 12:41:05 INFO hfile.CacheConfig: Created cacheConfig: CacheConfig:disabled
18/09/26 12:41:05 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://10.2.0.9:8020/tmp/bulkload_test/data_db_t_data_user_info_new/user_info_new/4418676bb95d4ba097d5b1069fa8f681 first=0 last=999996974494548601
18/09/26 12:51:49 INFO client.RpcRetryingCaller: Call exception, tries=10, retries=35, started=643698 ms ago, cancelled=false, msg=row '' on table 'data_db_t_data_user_info_new' at region=data_db_t_data_user_info_new,,1537877718932.b822d81bee85aa0fe6f0addfc4ebbdbc., hostname=hadoop11,60020,1537935485446, seqNum=10
18/09/26 12:52:54 INFO client.RpcRetryingCaller: Call exception, tries=11, retries=35, started=708740 ms ago, cancelled=false, msg=row '' on table 'data_db_t_data_user_info_new' at region=data_db_t_data_user_info_new,,1537877718932.b822d81bee85aa0fe6f0addfc4ebbdbc., hostname=hadoop11,60020,1537935485446, seqNum=10
18/09/26 12:54:09 INFO client.RpcRetryingCaller: Call exception, tries=12, retries=35, started=783789 ms ago, cancelled=false, msg=row '' on table 'data_db_t_data_user_info_new' at region=data_db_t_data_user_info_new,,1537877718932.b822d81bee85aa0fe6f0addfc4ebbdbc., hostname=hadoop11,60020,1537935485446, seqNum=10
18/09/26 12:55:24 INFO client.RpcRetryingCaller: Call exception, tries=13, retries=35, started=858898 ms ago, cancelled=false, msg=row '' on table 'data_db_t_data_user_info_new' at region=data_db_t_data_user_info_new,,1537877718932.b822d81bee85aa0fe6f0addfc4ebbdbc., hostname=hadoop11,60020,1537935485446, seqNum=10
18/09/26 12:56:39 INFO client.RpcRetryingCaller: Call exception, tries=14, retries=35, started=934049 ms ago, cancelled=false, msg=row '' on table 'data_db_t_data_user_info_new' at region=data_db_t_data_user_info_new,,1537877718932.b822d81bee85aa0fe6f0addfc4ebbdbc., hostname=hadoop11,60020,1537935485446, seqNum=10

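In the end, checking the HFiles generated by the MR stage showed that it was a permissions problem.
The MR job that generates the HFiles was run on the server as the hdfs user, out of habit, so the generated files belong to hdfs. The bulk load, however, runs as the HBase cluster's default user, hbase, which therefore could not access the files (bulk load moves the HFiles into the table's region directories, so read permission alone is not enough).

Changing the permissions of the output directory after the MR job finishes, or running the MR job as the hbase user, solves the problem.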
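
The permission mismatch can be illustrated with a small plain-Java sketch of a POSIX-style owner/group/others check, which is the model HDFS permissions follow. Everything here is an assumption for illustration: the class and method names are hypothetical, the directory mode `rwxr-xr-x` is a guess at what a default umask would produce, and the group names are made up. It does not call HDFS.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Collections;
import java.util.Set;

public class HdfsPermissionSketch {

    public enum Action { READ, WRITE, EXECUTE }

    /** POSIX-style check: pick the owner, group, or others class, then test one bit. */
    public static boolean allowed(String user, Set<String> userGroups,
                                  String owner, String group,
                                  String mode, Action action) {
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString(mode);
        String permClass;
        if (user.equals(owner)) {
            permClass = "OWNER";
        } else if (userGroups.contains(group)) {
            permClass = "GROUP";
        } else {
            permClass = "OTHERS";
        }
        return perms.contains(PosixFilePermission.valueOf(permClass + "_" + action.name()));
    }

    public static void main(String[] args) {
        // Assumed group membership of the hbase user.
        Set<String> hbaseGroups = Collections.singleton("hbase");
        // Output directory written by the hdfs user, with an assumed mode of rwxr-xr-x.
        System.out.println(allowed("hbase", hbaseGroups, "hdfs", "supergroup", "rwxr-xr-x", Action.READ));
        System.out.println(allowed("hbase", hbaseGroups, "hdfs", "supergroup", "rwxr-xr-x", Action.WRITE));
        // The same directory after a recursive chmod to 777.
        System.out.println(allowed("hbase", hbaseGroups, "hdfs", "supergroup", "rwxrwxrwx", Action.WRITE));
    }
}
```

Under these assumptions the hbase user can read the directory but cannot write to it until the mode is opened up, which matches the symptom: the load keeps retrying instead of completing.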

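In the Hadoop ecosystem every component has its own permissions. When working across components, always analyze the permission relationships first!
Appendix (bulk load code, hive-hbase v0.1; a first version that could be made configurable if it ends up being used often):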

package com.xiaoying.utils;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

public class GeneratePutHFileAndBulkLoadToHBase {

    public static class ConvertUserInfoOutToHFileMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Fields are separated by \001 (Ctrl-A), Hive's default field delimiter.
            String line = value.toString();
            String[] userInfoArr = line.split("\001");

            // Build the HBase row key.
            // The first column (uid) was already reversed in Hive during data preparation and is used
            // as the row key. The reversal only spreads rows across regions and has no business meaning;
            // depending on the results, hashing or similar could be added later.
            String reverskey = userInfoArr[0];
            byte[] rowKey = Bytes.toBytes(reverskey);
            ImmutableBytesWritable rowKeyWritable = new ImmutableBytesWritable(rowKey);

            // Column family name.
            byte[] family = Bytes.toBytes("user_info_new");

            // A Put carries several columns of one column family; with a single column a KeyValue would do:
            // KeyValue keyValue = new KeyValue(rowKey, family, qualifier, hbaseVal);
            Put put = new Put(rowKey);
            // The real data has about 120 columns; only three are shown here to keep the example readable.
            String clumstr = "reviceuid,fuiuid,fstremail";
            String[] clumstrArr = clumstr.split(",");

            // Set each column name and value. Column 0 (the reversed fuiuid) is already the row key,
            // so the loop starts at 1. The second bound guards against rows with fewer fields than columns.
            for (int i = 1; i < clumstrArr.length && i < userInfoArr.length; i++) {
                byte[] cluName = Bytes.toBytes(clumstrArr[i]);  // column qualifier
                byte[] cluVal = Bytes.toBytes(userInfoArr[i]);  // field value read from the input line
                put.addColumn(family, cluName, cluVal);  // addColumn replaces the deprecated Put.add in HBase 1.x
            }

            context.write(rowKeyWritable, put);
        }

    }

    public static void main(String[] args) throws Exception {

        Configuration hadoopConfiguration = new Configuration();
        String[] dfsArgs = new GenericOptionsParser(hadoopConfiguration, args).getRemainingArgs();
        if (dfsArgs.length < 3) {
            System.err.println("Usage: <input path> <hfile output path> <tohfile|loadhfile>");
            System.exit(2);
        }
        // Read the stage from dfsArgs rather than args, so generic options (-D ...) do not shift the index.
        String stage = dfsArgs[2];

        // The Hive table files are the input; only a Mapper is needed, which converts each line into a Put for HBase.
        Job toHFileJob = Job.getInstance(hadoopConfiguration);

        toHFileJob.setJarByClass(GeneratePutHFileAndBulkLoadToHBase.class);
        toHFileJob.setMapperClass(ConvertUserInfoOutToHFileMapper.class);

        // No reducer class is set: the framework picks KeyValueSortReducer or PutSortReducer
        // based on the MapOutputValueClass.
        toHFileJob.setMapOutputKeyClass(ImmutableBytesWritable.class);
        toHFileJob.setMapOutputValueClass(Put.class);

        // Text files as the input source.
        FileInputFormat.addInputPath(toHFileJob, new Path(dfsArgs[0]));
        FileOutputFormat.setOutputPath(toHFileJob, new Path(dfsArgs[1]));

        // HBase configuration.
        Configuration hbaseConfiguration = HBaseConfiguration.create();
        hbaseConfiguration.set("hbase.zookeeper.quorum", "hadoop08:2181,hadoop09:2181,hadoop10:2181");

        int exitCode = 0;
        // try-with-resources closes the connection and the table handle when done.
        try (Connection conn = ConnectionFactory.createConnection(hbaseConfiguration);
             HTable table = (HTable) conn.getTable(TableName.valueOf("data_db_t_data_user_info_new2"))) {
            // Configure the job against the target table.
            HFileOutputFormat2.configureIncrementalLoad(toHFileJob, table, table.getRegionLocator());
            // The reducer count is tied to the number of regions; a new table has only one, which is slow.
            // Setting it manually with toHFileJob.setNumReduceTasks(3) does not help: the job fails with an error.
            if ("tohfile".equals(stage)) {
                exitCode = toHFileJob.waitForCompletion(true) ? 0 : 1;
            }
            if ("loadhfile".equals(stage)) {
                // Once the MR job has finished, bulk load its output into the table.
                LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConfiguration);
                // First argument: the job's output directory holding the HFiles; second: the target table.
                loader.doBulkLoad(new Path(dfsArgs[1]), table);
            }
        }

        // Exit with the job status instead of always reporting success.
        System.exit(exitCode);

    }

}
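
The mapper comments mention that the uid was reversed in Hive before being used as the row key. Here is a minimal plain-Java sketch of why that helps, assuming a hypothetical block of 100 consecutive 7-digit uids (the class and method names are made up for illustration, and the real reversal happens in Hive, not in Java):

```java
import java.util.HashSet;
import java.util.Set;

public class RowKeyReversalSketch {

    /** Reverses the uid string, e.g. "1234567" becomes "7654321". */
    public static String reverse(String uid) {
        return new StringBuilder(uid).reverse().toString();
    }

    /** Counts distinct two-character key prefixes over uids in [from, to). */
    public static int distinctPrefixes(long from, long to, boolean reversed) {
        Set<String> prefixes = new HashSet<>();
        for (long uid = from; uid < to; uid++) {
            String key = String.valueOf(uid);
            if (reversed) {
                key = reverse(key);
            }
            prefixes.add(key.substring(0, 2));
        }
        return prefixes.size();
    }

    public static void main(String[] args) {
        // Hypothetical uid range, chosen only for illustration.
        System.out.println(distinctPrefixes(1000000, 1000100, false));
        System.out.println(distinctPrefixes(1000000, 1000100, true));
    }
}
```

Without reversal all 100 uids share the prefix `10` and would land in the same key range; reversed, the fast-changing low-order digits come first and the same uids cover 100 different prefixes.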

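How to run it:
Jar invocation parameters: 1: jar name 2: class name 3: MR input path 4: MR output path, which is also the source directory for the HBase load 5: stage (tohfile/loadhfile)
1. Generate the HFiles: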

hadoop jar Streaming-1.0-SNAPSHOT.jar com.xiaoying.utils.GeneratePutHFileAndBulkLoadToHBase  hdfs://10.*.*.25:8020/user/hive/warehouse/xy_temp.db/data_db_t_data_user_info_new2 hdfs://10.*.*.9:8020/tmp/bulkload_test/data_db_t_data_user_info_new2 tohfile

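2. Change the permissions of the generated HFiles (the old cluster does not use one unified user, so it is full of permission problems; a big pitfall)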

hadoop fs -chmod -R  777 /tmp/bulkload_test/data_db_t_data_user_info_new2

This is heavy-handed, but since these are temporary files that get deleted periodically, it is not a problem.

3. Load the HFiles into HBase

hadoop jar Streaming-1.0-SNAPSHOT.jar com.xiaoying.utils.GeneratePutHFileAndBulkLoadToHBase  hdfs://10.*.*.25:8020/user/hive/warehouse/xy_temp.db/data_db_t_data_user_info_new2 hdfs://10.*.*.9:8020/tmp/bulkload_test/data_db_t_data_user_info_new2 loadhfile

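Optimization:

The first run of the whole process above took nearly 2 hours. The YARN view showed the job stuck in the reduce stage: after the maps finished, only a single reducer was running while the rest of the cluster sat idle.
The first idea was therefore to increase the number of reducers, but setting it in code fails with an error. In fact the number of reducers is tied to the number of regions of the HBase table, and so is the number of tasks when Solr later builds its index from the HBase data.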
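
As far as I understand `HFileOutputFormat2.configureIncrementalLoad`, it sets the number of reduce tasks to the number of regions and routes every row key to the reducer of its region through a total-order partitioner, which would also explain why overriding the reducer count afterwards fails. A minimal plain-Java sketch of that routing follows. The class and method names are hypothetical, and `String` comparison stands in for HBase's byte-wise comparison (the two agree for ASCII keys); this is not HBase's actual code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RegionPartitionSketch {

    /** One reducer per region: n split points give n + 1 regions. */
    public static int numReducers(List<String> splitKeys) {
        return splitKeys.size() + 1;
    }

    /**
     * Returns the index of the region (and reducer) that owns rowKey.
     * splitKeys must be sorted; region i covers [splitKeys[i-1], splitKeys[i]).
     */
    public static int partition(String rowKey, List<String> splitKeys) {
        int pos = Collections.binarySearch(splitKeys, rowKey);
        // A key equal to a split point starts the next region;
        // otherwise binarySearch encodes the insertion point as -(point) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        List<String> noSplits = Collections.emptyList();
        List<String> splits = Arrays.asList("02|", "04|");

        // A table with a single region funnels every key into reducer 0.
        System.out.println(numReducers(noSplits) + " reducer(s), key 99123 -> " + partition("99123", noSplits));
        // With split points the keys spread over several reducers.
        System.out.println(numReducers(splits) + " reducer(s), key 03123 -> " + partition("03123", splits));
    }
}
```

With no split points every key maps to partition 0, which is the single busy reducer seen in YARN; adding split points is what adds reducers.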
##### 1. Change how the table is created:

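# Original table definition: one region by default, which splits automatically once the data reaches the threshold (the shorter the column family name the better; this one is not ideal)
create 'data_db_t_data_user_info_new', {NAME => 'user_info_new', REPLICATION_SCOPE => 1}
# Pre-split table definition with 50 regions. The company's uids are not contiguous: each business line generates them in its own way, with different lengths. The uid is therefore reversed and its first two characters are used as start/end keys, which avoids the uneven distribution the differing lengths would cause
create 'data_db_t_data_user_info_new2', {NAME =>'user_info_new'}, SPLITS_FILE => '/data/yzh/hbase/create_table/split.txt'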

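The split.txt file is shown below. Because each business line generates uids of a different length, the split points use the first two characters of the reversed uid described above, which spreads the data roughly evenly. This yields 50 regions: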

02|
04|
06|
08|
10|
12|
14|
16|
18|
20|
22|
24|
26|
28|
30|
32|
34|
36|
38|
40|
42|
44|
46|
48|
50|
52|
54|
56|
58|
60|
62|
64|
66|
68|
70|
72|
74|
76|
78|
80|
82|
84|
86|
88|
90|
92|
94|
96|
98|
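
Rather than typing the 49 split points by hand, a file like this can be generated. The sketch below is one way to do it in plain Java; the class name, the method name and the default output file name are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SplitFileGenerator {

    /** Builds split points such as "02|", "04|", ... over the two-digit prefixes 00..99. */
    public static List<String> splitKeys(int regions) {
        if (regions < 2 || 100 % regions != 0) {
            throw new IllegalArgumentException("regions must be at least 2 and divide 100");
        }
        int step = 100 / regions;
        List<String> keys = new ArrayList<>();
        // n regions need n - 1 split points, so the loop stops before 100.
        for (int prefix = step; prefix < 100; prefix += step) {
            keys.add(String.format("%02d|", prefix));
        }
        return keys;
    }

    public static void main(String[] args) throws IOException {
        // Output path is an assumption; pass another path as the first argument.
        Path out = Paths.get(args.length > 0 ? args[0] : "split.txt");
        Files.write(out, splitKeys(50), StandardCharsets.UTF_8);
    }
}
```

A note on the `|` suffix: it sorts after every digit, so `02|` is greater than any key that starts with `02`. Under lexicographic ordering the first region therefore holds prefixes 00 to 02, each middle region holds two prefixes, and the last region holds only 99, so the spread is close to even rather than exact.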

Check the table; if it looks like the following, it is OK:

[Image not available]

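##### 2. Modify the HBase configuration (optional if the steps above run without errors; these settings prevent failures when a single region receives too much data during a pre-split bulk load)

hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily = 320  // default is 32
hbase.hregion.max.filesize = 1073741824  // 1 GB (the raw property value is in bytes): maximum size of each file; together the two settings determine the maximum capacity of each region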
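
To see how the two settings interact, here is a rough back-of-the-envelope sketch in plain Java. It assumes a single column family, data spread evenly over the regions, and HFiles that roll over at roughly `hbase.hregion.max.filesize`; real files will not be this tidy, and the class and method names are hypothetical.

```java
public class BulkLoadCapacitySketch {

    public static final long GB = 1024L * 1024 * 1024;

    /** Estimated HFiles per region for one column family, assuming an even spread. */
    public static long hfilesPerRegion(long totalBytes, int regions, long maxFileBytes) {
        long bytesPerRegion = (totalBytes + regions - 1) / regions;   // ceiling division
        return (bytesPerRegion + maxFileBytes - 1) / maxFileBytes;    // ceiling division
    }

    public static void main(String[] args) {
        long total = 170 * GB;  // roughly the HFile volume reported in this post
        System.out.println("1 region:   " + hfilesPerRegion(total, 1, GB) + " HFiles");
        System.out.println("50 regions: " + hfilesPerRegion(total, 50, GB) + " HFiles per region");
        System.out.println("limit 320 x 1 GB = " + (320 * GB / GB) + " GB per region and family");
    }
}
```

By this estimate a single region would receive about 170 HFiles of 1 GB, above the default limit of 32 but below 320, while 50 evenly filled regions would get about 4 each. Skewed data can push individual regions much higher, which is where the raised limit helps.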

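After these optimizations, generating about 170 GB of HFiles and loading them into HBase takes under 20 minutes, and the offline Solr index build on this table dropped from 150 minutes to under 20 minutes. That barely meets the requirement; production hardware should perform better.
Note: the effect of compression was not tested here. Compression is the next area to optimize.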