I have a plain text file, possibly millions of lines long, that needs custom parsing, and I want to load it into an HBase table as fast as possible (using Hadoop or the HBase Java client).
My current solution is based on a MapReduce job with no reduce part. I read the text file with FileInputFormat so that each line is passed to the map method of my Mapper class. At that point the line is parsed into a Put object, which is written to the context. TableOutputFormat then takes the Put and inserts it into the table.
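To give an idea of what the custom parsing does: parseLine just turns one text line into a column-name to value map. A minimal sketch, assuming a simple tab-separated key=value layout (my real format is messier), looks like this:

// Minimal sketch of parseLine, assuming tab-separated key=value tokens;
// the real parser handles a more complicated format.
private static Map<String, String> parseLine(String line) {
    Map<String, String> parsed = new HashMap<String, String>();
    for (String token : line.split("\t")) {
        int eq = token.indexOf('=');
        if (eq > 0) {
            parsed.put(token.substring(0, eq), token.substring(eq + 1));
        }
    }
    return parsed;
}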
This solution averages about 1,000 rows inserted per second, which is less than I expected (at that rate, 1,000,000 rows take roughly 16-17 minutes). My HBase setup runs in pseudo-distributed mode on a single server.
One interesting thing is that while inserting 1,000,000 rows, 25 Mappers (tasks) are spawned, but they run serially, one after another; is this normal?
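Regarding the serial mappers: as far as I understand, how many map tasks can run at once depends on whether the job goes through a real JobTracker or the local runner, and on the per-TaskTracker map slot count. This is a small check I can drop into main() to print what my job actually sees (the property names are the classic Hadoop 0.20/1.x ones; I'm not certain these are the right knobs for my setup):

// Prints the scheduler mode and map-slot setting that the job configuration resolves to.
// Property names are Hadoop 0.20/1.x style; the defaults shown are only my assumption.
Configuration checkConf = HBaseConfiguration.create();
System.out.println("mapred.job.tracker = "
        + checkConf.get("mapred.job.tracker", "local"));
System.out.println("mapred.tasktracker.map.tasks.maximum = "
        + checkConf.get("mapred.tasktracker.map.tasks.maximum", "2"));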
Here is the code for my current solution:
public static class CustomMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException {
        // Parse the raw line into column-name -> value pairs
        Map<String, String> parsedLine = parseLine(value.toString());

        // keys is a field defined elsewhere in the class; keys[1] holds the row-key column
        Put row = new Put(Bytes.toBytes(parsedLine.get(keys[1])));
        for (String currentKey : parsedLine.keySet()) {
            // The parsed key is used as both column family and qualifier
            row.add(Bytes.toBytes(currentKey), Bytes.toBytes(currentKey), Bytes.toBytes(parsedLine.get(currentKey)));
        }

        try {
            context.write(new ImmutableBytesWritable(Bytes.toBytes(parsedLine.get(keys[1]))), row);
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
public int run(String[] args) throws Exception {
    if (args.length != 2) {
        return -1;
    }

    Configuration conf = getConf(); // conf comes from ToolRunner (see main)
    conf.set("hbase.mapred.outputtable", args[1]);

    // I got these conf parameters from a presentation about Bulk Load
    conf.set("hbase.hstore.blockingStoreFiles", "25");
    conf.set("hbase.hregion.memstore.block.multiplier", "8");
    conf.set("hbase.regionserver.handler.count", "30");
    conf.set("hbase.regions.percheckin", "30");
    conf.set("hbase.regionserver.globalMemcache.upperLimit", "0.3");
    conf.set("hbase.regionserver.globalMemcache.lowerLimit", "0.15");

    Job job = new Job(conf);
    job.setJarByClass(BulkLoadMapReduce.class);
    job.setJobName(NAME);

    TextInputFormat.setInputPaths(job, new Path(args[0]));
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(CustomMap.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0);   // map-only job, no reduce phase
    job.setOutputFormatClass(TableOutputFormat.class);

    job.waitForCompletion(true);
    return 0;
}
public static void main(String[] args) throws Exception {
    Long startTime = Calendar.getInstance().getTimeInMillis();
    System.out.println("Start time : " + startTime);

    int errCode = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadMapReduce(), args);

    Long endTime = Calendar.getInstance().getTimeInMillis();
    System.out.println("End time : " + endTime);
    System.out.println("Duration milliseconds: " + (endTime - startTime));

    System.exit(errCode);
}