Some problems encountered:
1. Creating a Kudu internal table from Impala
The Impala client shows the table name as kudutable, but on the Kudu side the table is registered as impala::database.kudutable.
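This matters when opening such a table from the Kudu Java client: the prefixed name must be used. A minimal sketch (the master address and the database/table names are placeholders):
// imports: org.apache.kudu.client.KuduClient, org.apache.kudu.client.KuduTable
KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build(); // placeholder master address
// The table was created from Impala as database.kudutable, but Kudu stores it
// under the prefixed name, so open it with the full name:
KuduTable table = client.openTable("impala::database.kudutable");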
2. Impala and Kudu timestamps
The timestamp types of the two systems do not line up: Kudu's UNIXTIME_MICROS stores microseconds, while Impala's TIMESTAMP uses a different precision. As a result, TIMESTAMP columns on an Impala-mapped Kudu table cannot be used normally; it is best to store the value as another type.
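A common workaround is to keep the value as an INT64/BIGINT of microseconds since the epoch and convert on the application side. A minimal sketch (the input format string is an assumption):
// imports: java.time.LocalDateTime, java.time.ZoneId, java.time.format.DateTimeFormatter
// Convert a date string to microseconds since the epoch, the unit that a Kudu
// UNIXTIME_MICROS (or plain INT64) column expects from addLong().
DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"); // assumed input format
LocalDateTime dt = LocalDateTime.parse("2018-11-01 12:00:00", fmt);
long micros = dt.atZone(ZoneId.systemDefault()).toInstant().toEpochMilli() * 1000L; // millis to micros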
3. Other issues
See https://blog.csdn.net/xueyao0201/article/details/80874583?utm_source=blogxgwz9
API:
I. Creating a table
(1) Create a list of ColumnSchema objects
List<ColumnSchema> columnSchemas = new ArrayList<>();
(2) Add a ColumnSchema (one per column) to the list
columnSchemas.add(new ColumnSchema.ColumnSchemaBuilder
        (kuduColumn.getColumnName(), kuduColumn.getColumnType()) // column name and type
        .key(kuduColumn.isPrimaryKey())             // whether the column is part of the primary key
        .nullable(kuduColumn.isNullAble())          // whether NULL values are allowed
        .defaultValue(kuduColumn.getDefaultValue()) // default value
        .typeAttributes(columnTypeAttributes)       // DECIMAL columns need precision/scale (see the sketch below); omit this line otherwise
        .build());                                  // build the ColumnSchema
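For the DECIMAL case, the columnTypeAttributes object can be built like this (precision 18 and scale 2 are example values):
// imports: org.apache.kudu.ColumnTypeAttributes
ColumnTypeAttributes columnTypeAttributes =
        new ColumnTypeAttributes.ColumnTypeAttributesBuilder()
                .precision(18) // example precision
                .scale(2)      // example scale
                .build();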
(3) Create the Kudu table's Schema
Schema schema = new Schema(columnSchemas);
(4) Create the table through the Kudu client (table name, schema, partitioning)
client.createTable(tablename, schema, new CreateTableOptions().addHashPartitions(primaryKeys, partitions)); // hash partitioning shown here; range partitioning is sketched below
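For the range-partition alternative, a minimal sketch (partitioning on a hypothetical INT32 "id" column with a single [0, 1000) range):
// imports: org.apache.kudu.client.CreateTableOptions, org.apache.kudu.client.PartialRow, java.util.Collections
PartialRow lower = schema.newPartialRow();
lower.addInt("id", 0);
PartialRow upper = schema.newPartialRow();
upper.addInt("id", 1000); // upper bound is exclusive
client.createTable(tablename, schema,
        new CreateTableOptions()
                .setRangePartitionColumns(Collections.singletonList("id"))
                .addRangePartition(lower, upper));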
(5) Create the Kudu external table in Impala
CREATE EXTERNAL TABLE ${impalatablename}
STORED AS KUDU
TBLPROPERTIES(
'kudu.table_name' = '${kudutablename}');
II. Writing data
1. Batch writes with a Spark Dataset
(1) Obtain a Dataset, e.g. by reading a file or by querying a Hive table through Spark SQL
String sql = "select * from " + tableName + " where dt=to_date(date_sub(now(),1))"; // yesterday's partition
Dataset<Row> rowDataset = sqlContext.sql(sql);
(2) Write to Kudu; this requires the "org.apache.kudu" %% "kudu-spark2" % "1.8.0" dependency (a fuller sketch follows below)
KuduContext kuduContext = new KuduContext(mastersString, SparkUtil.javaSparkContext.sc()); // SparkUtil is the author's helper exposing the JavaSparkContext
kuduContext.upsertRows(rowDataset, kuduTableName, new KuduWriteOptions(false, false)); // (ignoreDuplicateRowErrors, ignoreNull)
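Put together, a minimal end-to-end sketch of the batch write; the SparkSession setup, master address, and table names here are assumptions, not the exact setup behind SparkUtil:
// imports: org.apache.spark.sql.Dataset, org.apache.spark.sql.Row, org.apache.spark.sql.SparkSession,
//          org.apache.kudu.spark.kudu.KuduContext, org.apache.kudu.spark.kudu.KuduWriteOptions
SparkSession spark = SparkSession.builder()
        .appName("kudu-batch-write")
        .enableHiveSupport() // needed to read Hive tables through Spark SQL
        .getOrCreate();
Dataset<Row> rows = spark.sql("select * from db.src_table"); // hypothetical source table
KuduContext kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext());
// Upsert the whole Dataset; remember the impala:: prefix for tables created from Impala
kuduContext.upsertRows(rows, "impala::db.kudutable", new KuduWriteOptions(false, false));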
2. Single-row writes
(1) Create an Operation object (insert, delete, upsert, or update) and populate the PartialRow returned by its getRow()
public Operation writetoKudu(String tablename, Map<String, String> msg, String type, KuduClient client) {
    // Builds an upsert for the given table from a column-name -> value map.
    // (The `type` parameter is meant to select insert/delete/update/upsert;
    // this version always builds an upsert.)
    // imports: org.apache.kudu.ColumnSchema, org.apache.kudu.client.*, java.math.BigDecimal, java.util.List, java.util.Map
    Operation upsert = null;
    try {
        KuduTable table = client.openTable(tablename);
        // getTableInfo is the author's helper returning the table's columns,
        // equivalent to table.getSchema().getColumns()
        List<ColumnSchema> columnInfo = getTableInfo(tablename);
        upsert = table.newUpsert();
        PartialRow row = upsert.getRow();
        for (ColumnSchema column : columnInfo) {
            String value = msg.get(column.getName());
            if (value != null && !value.isEmpty()) {
                switch (column.getType()) {
                    case BOOL:
                        row.addBoolean(column.getName(), Boolean.valueOf(value));
                        break;
                    case BINARY:
                        row.addBinary(column.getName(), value.getBytes());
                        break;
                    case STRING:
                        row.addString(column.getName(), value);
                        break;
                    case INT8:
                        row.addByte(column.getName(), Byte.valueOf(value));
                        break;
                    case INT16:
                        row.addShort(column.getName(), Short.valueOf(value));
                        break;
                    case INT32:
                        row.addInt(column.getName(), Integer.valueOf(value));
                        break;
                    case INT64:
                        row.addLong(column.getName(), Long.valueOf(value));
                        break;
                    case FLOAT:
                        row.addFloat(column.getName(), Float.valueOf(value));
                        break;
                    case DOUBLE:
                        row.addDouble(column.getName(), Double.valueOf(value));
                        break;
                    case DECIMAL:
                        row.addDecimal(column.getName(), new BigDecimal(value));
                        break;
                    case UNIXTIME_MICROS:
                        // datetotimestamp is the author's helper converting a date
                        // string to microseconds since the epoch
                        row.addLong(column.getName(), this.datetotimestamp(value));
                        break;
                    default:
                        break;
                }
            } else if (column.isNullable()) {
                // setNull throws on non-nullable columns, so guard on the schema
                row.setNull(column.getName());
            }
        }
    } catch (KuduException e) {
        System.out.println(e);
    }
    return upsert;
}
(2) Submit through a KuduSession; the returned OperationResponse lets you check the result of each write.
Note the session's flush mode; there are three. AUTO_FLUSH_SYNC commits each operation as soon as apply() is called. AUTO_FLUSH_BACKGROUND buffers operations and flushes them in the background, with no ordering guarantee. MANUAL_FLUSH buffers operations until you call flush() yourself, so you must set the mutation buffer size and flush manually, otherwise apply() will return an error once the buffer fills:
AUTO_FLUSH_BACKGROUND (background flush, unordered),
AUTO_FLUSH_SYNC (immediate flush),
MANUAL_FLUSH (manual flush, ordered);
KuduSession session = client.newSession();
SessionConfiguration.FlushMode flushMode = SessionConfiguration.FlushMode.MANUAL_FLUSH;
int batch = 50;
session.setTimeoutMillis(60000);
session.setFlushMode(flushMode);
session.setMutationBufferSpace(batch); // max operations buffered before apply() errors under MANUAL_FLUSH
session.apply(upsert);
session.flush();
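Under MANUAL_FLUSH, flush() returns one OperationResponse per buffered operation; that is where the write result mentioned above can be inspected. A minimal sketch (variable names are illustrative):
// imports: org.apache.kudu.client.OperationResponse, java.util.List
// Check each response for a row error instead of assuming the write succeeded.
List<OperationResponse> responses = session.flush();
for (OperationResponse response : responses) {
    if (response.hasRowError()) {
        System.out.println(response.getRowError().toString());
    }
}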