Hadoop can clean and process data at the terabyte scale. Once cleaning finishes, the results can be written back to HDFS, or stored in HBase, where they can be queried quickly and conveniently.
1. Create a table in HBase to store the data cleaned from HDFS:
hbase(main):014:0> create_namespace 'hdfs' // create the namespace
0 row(s) in 0.9030 seconds
hbase(main):015:0> create 'hdfs:product','info' // create the table under the namespace, with column family 'info'
0 row(s) in 1.2830 seconds
=> Hbase::Table - hdfs:product
hbase(main):016:0> desc 'hdfs:product'
Table hdfs:product is ENABLED
hdfs:product
COLUMN FAMILIES DESCRIPTION
{NAME => 'info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.1070 seconds
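As an aside, the same namespace and table can also be created from Java. A minimal sketch using the HBase 1.x client Admin API (not from the original post; the ZooKeeper quorum setting is an assumption, adjust it to your cluster):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateProductTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // assumption: a local single-node setup; point this at your quorum
        conf.set("hbase.zookeeper.quorum", "localhost");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // equivalent of create_namespace 'hdfs'
            admin.createNamespace(NamespaceDescriptor.create("hdfs").build());
            // equivalent of create 'hdfs:product','info'
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("hdfs:product"));
            table.addFamily(new HColumnDescriptor("info"));
            admin.createTable(table);
        }
    }
}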
2. Prepare the data. The source is two Oracle tables: an orders (sales) table and a products table. The requirement: compute the total sales amount of each product per year, showing the product name, the year, and the yearly total, and store the result in HBase. Judging from the mapper code below, both tables are exported as comma-separated files: the sales file carries the product id in column 0, the order date (starting with the four-digit year) in column 2, and the sales amount in column 6, while the products file carries the product id in column 0 and the product name in column 1.
3. Build the Hadoop program.
The Mapper side is where Hadoop cleans the data; the framework then groups all records that share the same key before they reach the Reducer.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;
public class ProductMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    // two input tables, so we need the name of the file each split comes from
    String fileName;
    Text v = new Text();
    IntWritable k = new IntWritable();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // get the file name once per split
        FileSplit split = (FileSplit) context.getInputSplit();
        fileName = split.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // emit different values depending on the source file, so the reducer
        // can tell product names apart from sales amounts
        String data = value.toString();
        String[] words = data.split(",");
        if (fileName.equals("sales")) {
            // sales record: key = product id, value = "year:amount"
            k.set(Integer.parseInt(words[0]));
            v.set(words[2].substring(0, 4) + ":" + words[6]);
            context.write(k, v);
        } else {
            // product record: key = product id, value = "*" + product name
            k.set(Integer.parseInt(words[0]));
            v.set("*" + words[1]);
            context.write(k, v);
        }
    }
}
Next, integrate HBase: the Reducer side uses a TableReducer, which processes the data and inserts it into the table:
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
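The post breaks off at the reducer's imports, so here is a minimal sketch of how such a TableReducer could look. It is not the original author's code: the row-key scheme (`<productId>_<year>`) and the column qualifiers under 'info' are assumptions made here.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class ProductReducer extends TableReducer<IntWritable, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String productName = "";
        Map<String, Double> totalByYear = new HashMap<>();
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("*")) {
                // product record emitted by the mapper: "*<name>"
                productName = v.substring(1);
            } else {
                // sales record emitted by the mapper: "<year>:<amount>"
                String[] parts = v.split(":");
                totalByYear.merge(parts[0], Double.parseDouble(parts[1]), Double::sum);
            }
        }
        // one Put per product/year; the "<productId>_<year>" row key is an assumption
        for (Map.Entry<String, Double> e : totalByYear.entrySet()) {
            byte[] rowKey = Bytes.toBytes(key.get() + "_" + e.getKey());
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(productName));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("year"), Bytes.toBytes(e.getKey()));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("total"), Bytes.toBytes(String.valueOf(e.getValue())));
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }
}
Emitting one Put per product/year gives each yearly total its own row, which matches the required output columns (product name, year, yearly total). A possible driver that wires the mapper and this reducer together via TableMapReduceUtil follows; the input path and job name are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ProductDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "product-sales-to-hbase");
        job.setJarByClass(ProductDriver.class);

        job.setMapperClass(ProductMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);

        // assumption: the exported sales and product files sit in one HDFS directory
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // wires the TableReducer to the target table 'hdfs:product'
        TableMapReduceUtil.initTableReducerJob("hdfs:product", ProductReducer.class, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}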