1. Using Spark to load data into HBase via bulk load
Before switching to bulk load, we wrapped each record into a Tuple2<ImmutableBytesWritable, Put> pair RDD and wrote it to HBase with saveAsNewAPIHadoopDataset. That was very slow: roughly 400 GB of data was still not finished after 2+ hours, so we moved to bulk loading instead. A rough sketch of that original write path is shown below for comparison.
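(Sketch only, not the exact production code: the "info" family, the column positions, and the pre-configured Job with TableOutputFormat.OUTPUT_TABLE are placeholders for illustration.)
// Requires: org.apache.hadoop.hbase.client.Put, TableOutputFormat, Bytes, ImmutableBytesWritable, scala.Tuple2
// Assumes a JavaRDD<Row> named javaRDD whose column 0 is the rowkey, and a Job whose
// configuration already contains TableOutputFormat.OUTPUT_TABLE.
JavaPairRDD<ImmutableBytesWritable, Put> putRDD = javaRDD.mapToPair(row -> {
    byte[] rowkey = Bytes.toBytes(row.getString(0));
    Put put = new Put(rowkey);
    // one addColumn call per column; "info"/"name" are made-up examples
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(row.getString(1)));
    return new Tuple2<>(new ImmutableBytesWritable(rowkey), put);
});
job.setOutputFormatClass(TableOutputFormat.class);
putRDD.saveAsNewAPIHadoopDataset(job.getConfiguration());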
Most of the material online is Scala-based and only handles a single column. In real production there are usually multiple column families and columns, and there are quite a few pitfalls along the way.
The code first:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ClusterConnection;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HRegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class HbaseSparkUtils {

    private static Configuration hbaseConf;

    static {
        hbaseConf = HBaseConfiguration.create();
        // ConfigUtils is the project's own config helper; each getter returns a (property name, value) tuple
        hbaseConf.set(ConfigUtils.getHbaseZK()._1(), ConfigUtils.getHbaseZK()._2());
        hbaseConf.set(ConfigUtils.getHbaseZKPort()._1(), ConfigUtils.getHbaseZKPort()._2());
    }

    public static void saveHDFSHbaseHFile(SparkSession spark,   // Spark session
                                          Dataset<Row> ds,      // source dataset
                                          String table_name,    // HBase table name
                                          Integer rowKeyIndex,  // index of the rowkey column in the dataset
                                          String fields) throws Exception { // comma-separated column list of the dataset

        // see issue 5 below: allow more than 32 HFiles per family per region
        hbaseConf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 1024);
        hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, table_name);

        Job job = Job.getInstance();
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        job.setOutputFormatClass(HFileOutputFormat2.class);

        Connection conn = ConnectionFactory.createConnection(hbaseConf);
        TableName tableName = TableName.valueOf(table_name);
        HRegionLocator regionLocator = new HRegionLocator(tableName, (ClusterConnection) conn);
        Table realTable = ((ClusterConnection) conn).getTable(tableName);
        HFileOutputFormat2.configureIncrementalLoad(job, realTable, regionLocator);

        JavaRDD<Row> javaRDD = ds.toJavaRDD();
        JavaPairRDD<ImmutableBytesWritable, KeyValue> javaPairRDD =
                javaRDD.mapToPair(new PairFunction<Row, ImmutableBytesWritable, List<Tuple2<ImmutableBytesWritable, KeyValue>>>() {
            @Override
            public Tuple2<ImmutableBytesWritable, List<Tuple2<ImmutableBytesWritable, KeyValue>>> call(Row row) throws Exception {
                List<Tuple2<ImmutableBytesWritable, KeyValue>> tps = new ArrayList<>();
                String rowkey = row.getString(rowKeyIndex);
                ImmutableBytesWritable writable = new ImmutableBytesWritable(Bytes.toBytes(rowkey));

                // Sort the columns: within one row the qualifiers must be in lexical order,
                // otherwise the HFile writer fails (see issue 3 below).
                ArrayList<Tuple2<Integer, String>> tuple2s = new ArrayList<>();
                String[] columns = fields.split(",");
                for (int i = 0; i < columns.length; i++) {
                    tuple2s.add(new Tuple2<Integer, String>(i, columns[i]));
                }
                tuple2s.sort((a, b) -> a._2().compareTo(b._2()));

                for (Tuple2<Integer, String> t : tuple2s) {
                    String[] fieldNames = row.schema().fieldNames();
                    // the rowkey field itself is not stored as a column
                    if (t._2().equals(fieldNames[rowKeyIndex])) {
                        System.out.println(String.format("%s == %s continue", t._2(), fieldNames[rowKeyIndex]));
                        continue;
                    }
                    // the 'main' field goes into its own column family and is handled below
                    if ("main".equals(t._2())) {
                        continue;
                    }
                    // getRowValue(...) is a project helper (not shown) that reads column t._1() of the Row as a String
                    String value = getRowValue(row, t._1(), tuple2s.size());
                    KeyValue kv = new KeyValue(Bytes.toBytes(rowkey),
                            Bytes.toBytes(ConfigUtils.getFamilyInfo()._2()),
                            Bytes.toBytes(t._2()), Bytes.toBytes(value));
                    tps.add(new Tuple2<>(writable, kv));
                }

                for (Tuple2<Integer, String> t : tuple2s) {
                    String value = getRowValue(row, t._1(), tuple2s.size());
                    if ("main".equals(t._2())) { // field == 'main' goes into the main column family
                        KeyValue kv = new KeyValue(Bytes.toBytes(rowkey),
                                Bytes.toBytes(ConfigUtils.getFamilyMain()._2()),
                                Bytes.toBytes(t._2()), Bytes.toBytes(value));
                        tps.add(new Tuple2<>(writable, kv));
                        break;
                    }
                }
                return new Tuple2<>(writable, tps);
            }
        // The records must be sorted by rowkey here. This is expensive, and no better
        // alternative has been found so far.
        }).sortByKey().flatMapToPair(new PairFlatMapFunction<Tuple2<ImmutableBytesWritable, List<Tuple2<ImmutableBytesWritable, KeyValue>>>,
                ImmutableBytesWritable, KeyValue>() {
            @Override
            public Iterator<Tuple2<ImmutableBytesWritable, KeyValue>> call(Tuple2<ImmutableBytesWritable,
                    List<Tuple2<ImmutableBytesWritable, KeyValue>>> tuple2s) throws Exception {
                return tuple2s._2().iterator();
            }
        });

        // temporary HDFS directory for the generated HFiles
        String temp = "/tmp/bulkload/" + table_name + "_" + System.currentTimeMillis();
        javaPairRDD.saveAsNewAPIHadoopFile(temp, ImmutableBytesWritable.class,
                KeyValue.class, HFileOutputFormat2.class, job.getConfiguration());

        // hand the generated HFiles over to the region servers
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConf);
        Admin admin = conn.getAdmin();
        loader.doBulkLoad(new Path(temp), admin, realTable, regionLocator);
    }
}
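For completeness, a call might look roughly like this. The Hive table, column names, and rowkey position below are made up for illustration; the only requirement is that rowKeyIndex points at the rowkey column inside fields, and that fields matches the dataset's column order.
SparkSession spark = SparkSession.builder()
        .appName("hbase-bulkload")
        .enableHiveSupport()
        .getOrCreate();
// "user_id" is the rowkey (index 0); "main" goes to the main family, the rest to the info family
Dataset<Row> ds = spark.sql("SELECT user_id, name, age, main FROM db.user_profile");
HbaseSparkUtils.saveHDFSHbaseHFile(spark, ds, "user_profile", 0, "user_id,name,age,main");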
2. Exceptions encountered along the way
1. Can not create a Path from a null string
Source-code analysis: the output directory is read from the job configuration and is still null at this point, so it has to be set explicitly:
job.getConfiguration().set("mapred.output.dir","/user/wangwei/tmp/"+tableName);
job.getConfiguration().set("mapreduce.output.fileoutputformat.outputdir", "/tmp/"+tableName); // the recommended (newer) property
2. Bulk load operation did not find any files to load in directory /tmp/wwtest. Does it contain files in subdirectories that correspond to column family names?
17/10/11 15:54:09 WARN LoadIncrementalHFiles: Skipping non-directory file:/tmp/wwtest/_SUCCESS
17/10/11 15:54:09 WARN LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory /tmp/wwtest. Does it contain files in subdirectories that correspond to column family names?
1. Check whether the input data is actually empty.
2. Check that the classes set via setMapOutputKeyClass/setMapOutputValueClass match the ones passed to saveAsNewAPIHadoopFile.
3. A bug in your own code.
3. Added a key not lexically larger than previous
java.io.IOException: Added a key not lexically larger than previous key=\x00\x02Mi\x0BsearchIndexuserId\x00\x00\x01>\xD5\xD6\xF3\xA3\x04, lastkey=\x00\x01w\x0BsearchIndexuserId\x00\x00\x01>\xD5\xD6\xF3\xA3\x04
The root cause: when you generate HFiles yourself, the keys must already be sorted. Data written through the normal Put path is sorted automatically by HBase, but nothing sorts the KeyValues you put into an HFile. So everything - rowkey, column family, and column qualifier - has to be sorted manually before the HFiles are generated, otherwise this exception is thrown. A small sketch of the ordering requirement follows.
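A minimal sketch, assuming HBase 1.x where KeyValue.COMPARATOR is still available; the rowkey and column names are made up.
// Requires: org.apache.hadoop.hbase.KeyValue, org.apache.hadoop.hbase.util.Bytes, java.util.*
byte[] rowkey = Bytes.toBytes("row-001");
byte[] info = Bytes.toBytes("info");
List<KeyValue> kvs = new ArrayList<>();
kvs.add(new KeyValue(rowkey, info, Bytes.toBytes("name"), Bytes.toBytes("alice")));
kvs.add(new KeyValue(rowkey, info, Bytes.toBytes("age"), Bytes.toBytes("30")));
// KeyValue.COMPARATOR orders by rowkey, then family, then qualifier;
// after sorting, "age" comes before "name", which is the order the HFile writer expects
Collections.sort(kvs, KeyValue.COMPARATOR);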
4. Caused by: java.lang.ClassCastException: org.apache.hadoop.hbase.client.Put cannot be cast to org.apache.hadoop.hbase.Cell (appears when Put is used as the map output type; with KeyValue the problem does not occur)
Not solved directly; the workaround is to emit KeyValue objects, collect them into a List, and then flatMap it, as done in the code above.
5. java.io.IOException: Trying to load more than 32 hfiles to one family of one region
Raise the per-region, per-family HFile limit:
hbaseConf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 1024);