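Hive metastore: overall code analysis and walkthrough
Directory structure of the metadata module:
The hivemeta directory contains metastore (client and server call logic), events (listener implementations for checks, authorization and similar steps in the table life cycle), hooks (here only the interfaces related to the JDO connection), parser (parsing of expression trees), spec (proxy classes for partitions), tools (methods related to JDO execute), plus txn and model.
Below we go through the metadata code category by category, with comments. We start with the Hive class, because it is the entry point for metastore metadata calls. The life cycle is analyzed in this order: creation and loading of the HiveMetaStoreClient client, creation and loading of the HiveMetaStore server, createTable, dropTable, alterTable, createPartition, dropPartition, alterPartition. This is, of course, only a small part of the complete metadata code.
1. Creating and loading the HiveMetaStoreClient client
We start with the Hive class: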
private HiveConf conf = null;
private IMetaStoreClient metaStoreClient;
private UserGroupInformation owner;
// metastore calls timing information
private final Map<String, Long> metaCallTimeMap = new HashMap<String, Long>();
private static ThreadLocal<Hive> hiveDB = new ThreadLocal<Hive>() {
@Override
protected synchronized Hive initialValue() {
return null;
}
@Override
public synchronized void remove() {
if (this.get() != null) {
this.get().close();
}
super.remove();
}
};
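Declared here are the HiveConf object, the metaStoreClient, the owning user (UserGroupInformation) and a call-timing Map that records how long each metastore call takes. The class also keeps a ThreadLocal named hiveDB. If the current thread has no Hive object yet (or a refresh is requested, or the current user is no longer the owner), a new Hive object is created. The code is as follows: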
public static Hive get(HiveConf c, boolean needsRefresh) throws HiveException {
Hive db = hiveDB.get();
if (db == null || needsRefresh || !db.isCurrentUserOwner()) {
if (db != null) {
LOG.debug("Creating new db. db = " + db + ", needsRefresh = " + needsRefresh +
", db.isCurrentUserOwner = " + db.isCurrentUserOwner());
}
closeCurrent();
c.set("fs.scheme.class", "dfs");
Hive newdb = new Hive(c);
hiveDB.set(newdb);
return newdb;
}
db.conf = c;
return db;
}
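The per-thread caching in Hive.get can be illustrated with a minimal, self-contained sketch. PerThreadSession and its owner field are hypothetical stand-ins for Hive and its UserGroupInformation check; they are not part of the Hive code base.

```java
// Hypothetical stand-in for Hive: at most one live instance per thread,
// rebuilt when a refresh is requested or the calling user changes.
final class PerThreadSession {
    private static final ThreadLocal<PerThreadSession> CURRENT = new ThreadLocal<>();

    final String owner;
    private boolean closed;

    private PerThreadSession(String owner) {
        this.owner = owner;
    }

    static PerThreadSession get(String currentUser, boolean needsRefresh) {
        PerThreadSession session = CURRENT.get();
        if (session == null || needsRefresh || !session.owner.equals(currentUser)) {
            if (session != null) {
                session.close(); // release the old instance before replacing it
            }
            session = new PerThreadSession(currentUser);
            CURRENT.set(session);
        }
        return session;
    }

    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }

    public static void main(String[] args) {
        PerThreadSession first = get("alice", false);
        PerThreadSession again = get("alice", false);
        PerThreadSession refreshed = get("alice", true);
        System.out.println("reused: " + (first == again));
        System.out.println("replaced on refresh: " + (first != refreshed));
    }
}
```

Because the instance lives in a ThreadLocal, two threads never share a session, which is why the original code needs no locking around the client itself.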
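We then find that functions are already registered when the Hive class is loaded (see the static initializer below). What is a function here? Based on the earlier analysis of the table structure, it can be understood as the stored metadata for UDFs and similar jars. The code is as follows: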
// register all permanent functions. need improvement
static {
try {
reloadFunctions();
} catch (Exception e) {
LOG.warn("Failed to access metastore. This class should not accessed in runtime.",e);
}
}
public static void reloadFunctions() throws HiveException {
// Get the Hive object used for the calls below
Hive db = Hive.get();
// Iterate over every dbName
for (String dbName : db.getAllDatabases()) {
// Query all functions registered under this db by dbName.
for (String functionName : db.getFunctions(dbName, "*")) {
Function function = db.getFunction(dbName, functionName);
try {
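// register stores the function data just queried in a Map<String,FunctionInfo> inside the Registry class, so that the compute engine does not have to query the database again when it calls the function.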
FunctionRegistry.registerPermanentFunction(
FunctionUtils.qualifyFunctionName(functionName, dbName), function.getClassName(),
false, FunctionTask.toFunctionResource(function.getResourceUris()));
} catch (Exception e) {
LOG.warn("Failed to register persistent function " +
functionName + ":" + function.getClassName() + ". Ignore and continue.");
}
}
}
}
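The idea behind reloadFunctions (read every function once, cache it in memory under a qualified name, skip entries that fail) can be sketched as follows. FunctionCacheSketch, the nested-map catalog and the lower-case "db.function" key format are assumptions made for illustration; the real code goes through FunctionRegistry and FunctionUtils.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class FunctionCacheSketch {
    // qualified function name -> implementing class name
    private final Map<String, String> registry = new ConcurrentHashMap<>();

    // Assumed key format for this sketch: "db.function", lower-cased.
    static String qualify(String dbName, String functionName) {
        return (dbName + "." + functionName).toLowerCase(Locale.ROOT);
    }

    void register(String qualifiedName, String className) {
        if (className == null || className.isEmpty()) {
            throw new IllegalArgumentException("missing class name for " + qualifiedName);
        }
        registry.put(qualifiedName, className);
    }

    // catalog stands in for the metastore: db name -> (function name -> class name).
    // Returns the number of functions registered; bad entries are logged and skipped.
    int reload(Map<String, Map<String, String>> catalog) {
        int registered = 0;
        for (Map.Entry<String, Map<String, String>> db : catalog.entrySet()) {
            for (Map.Entry<String, String> fn : db.getValue().entrySet()) {
                try {
                    register(qualify(db.getKey(), fn.getKey()), fn.getValue());
                    registered++;
                } catch (IllegalArgumentException e) {
                    System.err.println("Failed to register " + fn.getKey() + ". Ignore and continue.");
                }
            }
        }
        return registered;
    }

    String lookup(String dbName, String functionName) {
        return registry.get(qualify(dbName, functionName));
    }

    public static void main(String[] args) {
        Map<String, String> functions = new HashMap<>();
        functions.put("my_upper", "com.example.MyUpper");
        Map<String, Map<String, String>> catalog = new HashMap<>();
        catalog.put("default", functions);

        FunctionCacheSketch cache = new FunctionCacheSketch();
        cache.reload(catalog);
        System.out.println(cache.lookup("default", "my_upper"));
    }
}
```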
Calling getMSC() creates the metastore client (through createMetaStoreClient()). The code is as follows:
private IMetaStoreClient createMetaStoreClient() throws MetaException {
// Implement the HiveMetaHookLoader interface here
HiveMetaHookLoader hookLoader = new HiveMetaHookLoader() {
@Override
public HiveMetaHook getHook(org.apache.hadoop.hive.metastore.api.Table tbl)throws MetaException {
try {
if (tbl == null) {
return null;
}
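// Load the storage handler instance that matches the table's key/value parameters, e.g. HBase, Redis or another extended storage exposed as an external table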
HiveStorageHandler storageHandler =HiveUtils.getStorageHandler(conf,tbl.getParameters().get(META_TABLE_STORAGE));
if (storageHandler == null) {
return null;
}
return storageHandler.getMetaHook();
} catch (HiveException ex) {
LOG.error(StringUtils.stringifyException(ex));
throw new MetaException(
"Failed to load storage handler: " + ex.getMessage());
}
}
};
return RetryingMetaStoreClient.getProxy(conf, hookLoader, metaCallTimeMap,
SessionHiveMetaStoreClient.class.getName());
}
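As you can see, creating the MetaStoreClient also sets up a HiveMetaHook loader. The hook takes part in every meta operation. In createTable, for example, if the table is not stored as files (say it is integrated with HBase), HiveMetaStoreClient calls the hook's preCreateTable to prepare before the table is created and to distinguish external tables from managed ones. If something fails along the way, it calls the hook's rollbackCreateTable to roll back.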
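The pre/commit/rollback sequence around a metadata write can be sketched in isolation. HookedCreateSketch, TableHook and MetadataWriter are hypothetical names for this illustration; only the three hook method names come from the text above.

```java
final class HookedCreateSketch {
    // Simplified stand-in for a storage handler's meta hook.
    interface TableHook {
        void preCreateTable(String table);
        void commitCreateTable(String table);
        void rollbackCreateTable(String table);
    }

    // Stand-in for the call that persists the table metadata.
    interface MetadataWriter {
        void write(String table);
    }

    static void createTable(String table, TableHook hook, MetadataWriter writer) {
        if (hook != null) {
            hook.preCreateTable(table); // prepare the external storage first
        }
        boolean success = false;
        try {
            writer.write(table);
            if (hook != null) {
                hook.commitCreateTable(table);
            }
            success = true;
        } finally {
            // Undo the preparation if the write or the commit failed.
            if (!success && hook != null) {
                hook.rollbackCreateTable(table);
            }
        }
    }

    public static void main(String[] args) {
        TableHook printing = new TableHook() {
            public void preCreateTable(String t) { System.out.println("pre " + t); }
            public void commitCreateTable(String t) { System.out.println("commit " + t); }
            public void rollbackCreateTable(String t) { System.out.println("rollback " + t); }
        };
        createTable("t1", printing, t -> System.out.println("write " + t));
    }
}
```

Placing the rollback in a finally block means it runs for any failure after preCreateTable, including unchecked exceptions, while a failure inside preCreateTable itself triggers no rollback.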
2. Creating and loading the HiveMetaStore server
public HiveMetaStoreClient(HiveConf conf, HiveMetaHookLoader hookLoader)throws MetaException {
this.hookLoader = hookLoader;
if (conf == null) {
conf = new HiveConf(HiveMetaStoreClient.class);
}
this.conf = conf;
filterHook = loadFilterHooks();
// Based on hive.metastore.uris in hive-site.xml: if it is set, the connection is remote; otherwise it is local (embedded)
String msUri = conf.getVar(HiveConf.ConfVars.METASTOREURIS);
localMetaStore = HiveConfUtil.isEmbeddedMetaStore(msUri);
if (localMetaStore) {
// Local mode talks to HiveMetaStore directly
client = HiveMetaStore.newRetryingHMSHandler("hive client", conf, true);
isConnected = true;
snapshotActiveConf();
return;
}
// Read the retry count and the retry delay from the configuration
retries = HiveConf.getIntVar(conf, HiveConf.ConfVars.METASTORETHRIFTCONNECTIONRETRIES);
retryDelaySeconds = conf.getTimeVar(
ConfVars.METASTORE_CLIENT_CONNECT_RETRY_DELAY, TimeUnit.SECONDS);
// Split and parse the metastore URIs
if (conf.getVar(HiveConf.ConfVars.METASTOREURIS) != null) {
String metastoreUrisString[] = conf.getVar(HiveConf.ConfVars.METASTOREURIS).split(",");
metastoreUris = new URI[metastoreUrisString.length];
try {
int i = 0;
for (String s : metastoreUrisString) {
URI tmpUri = new URI(s);
if (tmpUri.getScheme() == null) {
throw new IllegalArgumentException("URI: " + s+ " does not have a scheme");
}
metastoreUris[i++] = tmpUri;
}
} catch (IllegalArgumentException e) {
throw (e);
} catch (Exception e) {
MetaStoreUtils.logAndThrowMetaException(e);
}
} else {
LOG.error("NOT getting uris from conf");
throw new MetaException("MetaStoreURIs not found in conf file");
}
// Call open() to create the connection
open();
}
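The decision the constructor makes (embedded when no URI is configured, otherwise split the comma-separated list and insist on a scheme) can be tried out with a small sketch. MetastoreUriSketch is a hypothetical helper, and its isEmbedded rule (null or blank means embedded) is an assumption about what HiveConfUtil.isEmbeddedMetaStore checks.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

final class MetastoreUriSketch {
    // Assumption for this sketch: no configured URI means embedded (local) mode.
    static boolean isEmbedded(String uris) {
        return uris == null || uris.trim().isEmpty();
    }

    // Split a comma-separated list and reject entries without a scheme.
    static List<URI> parse(String uris) {
        List<URI> result = new ArrayList<>();
        for (String s : uris.split(",")) {
            URI uri;
            try {
                uri = new URI(s.trim());
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException("Invalid URI: " + s, e);
            }
            if (uri.getScheme() == null) {
                throw new IllegalArgumentException("URI: " + s + " does not have a scheme");
            }
            result.add(uri);
        }
        return result;
    }

    public static void main(String[] args) {
        String configured = "thrift://ms1:9083,thrift://ms2:9083";
        System.out.println("embedded: " + isEmbedded(configured));
        for (URI uri : parse(configured)) {
            System.out.println(uri.getHost() + ":" + uri.getPort());
        }
    }
}
```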
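The code above shows that a remote connection requires hive.metastore.uris to be set in hive-site.xml. If the client and the server are not on the same machine, this setting is needed to connect remotely. Let us continue with the open method, which creates the connection: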
private void open() throws MetaException {
isConnected = false;
TTransportException tte = null;
// Whether to use SASL
boolean useSasl = conf.getBoolVar(ConfVars.METASTORE_USE_THRIFT_SASL);
// If true, the metastore Thrift interface will use TFramedTransport. When false (default) a standard TTransport is used.
boolean useFramedTransport = conf.getBoolVar(ConfVars.METASTORE_USE_THRIFT_FRAMED_TRANSPORT);
// If true, the metastore Thrift interface will use TCompactProtocol. When false (default) TBinaryProtocol will be used.
boolean useCompactProtocol = conf.getBoolVar(ConfVars.METASTORE_USE_THRIFT_COMPACT_PROTOCOL);
// Read the socket timeout
int clientSocketTimeout = (int) conf.getTimeVar(ConfVars.METASTORE_CLIENT_SOCKET_TIMEOUT, TimeUnit.MILLISECONDS);
for (int attempt = 0; !isConnected && attempt < retries; ++attempt) {
for (URI store : metastoreUris) {
LOG.info("Trying to connect to metastore with URI " + store);
try {
transport = new TSocket(store.getHost(), store.getPort(), clientSocketTimeout);
if (useSasl) {
// Wrap thrift connection with SASL for secure connection.
try {
// Create the HadoopThriftAuthBridge client
HadoopThriftAuthBridge.Client authBridge =ShimLoader.getHadoopThriftAuthBridge().createClient();
// Authentication
// check if we should use delegation tokens to authenticate
// the call below gets hold of the tokens if they are set up by hadoop
// this should happen on the map/reduce tasks if the client added the
// tokens into hadoop's credential store in the front end during job
// submission.
String tokenSig = conf.get("hive.metastore.token.signature");
// tokenSig could be null
tokenStrForm = Utils.getTokenStrForm(tokenSig);
if(tokenStrForm != null) {
// authenticate using delegation tokens via the "DIGEST" mechanism
transport = authBridge.createClientTransport(null, store.getHost(),"DIGEST", tokenStrForm, transport,MetaStoreUtils.getMetaStoreSaslProperties(conf));
} else {
String principalConfig =conf.getVar(HiveConf.ConfVars.METASTORE_KERBEROS_PRINCIPAL);
transport = authBridge.createClientTransport(principalConfig, store.getHost(), "KERBEROS", null,transport,MetaStoreUtils.getMetaStoreSaslProperties(conf));
}
} catch (IOException ioe) {
LOG.error("Couldn't create client transport", ioe);
throw new MetaException(ioe.toString());
}
} else if (useFramedTransport) {
transport = new TFramedTransport(transport);
}
final TProtocol protocol;
if (useCompactProtocol) {
protocol = new TCompactProtocol(transport);
} else {
protocol = new TBinaryProtocol(transport);
}
// Create the ThriftHiveMetastore client
client = new ThriftHiveMetastore.Client(protocol);
try {
transport.open();
isConnected = true;
} catch (TTransportException e) {
tte = e;
if (LOG.isDebugEnabled()) {
LOG.warn("Failed to connect to the MetaStore Server...", e);
} else {
// Don't print full exception trace if DEBUG is not on.
LOG.warn("Failed to connect to the MetaStore Server...");
}
}
// Send the user and its groups to the server
if (isConnected && !useSasl && conf.getBoolVar(ConfVars.METASTORE_EXECUTE_SET_UGI)){
// Call set_ugi, only in unsecure mode.
try {
UserGroupInformation ugi = Utils.getUGI();
client.set_ugi(ugi.getUserName(), Arrays.asList(ugi.getGroupNames()));
} catch (LoginException e) {
LOG.warn("Failed to do login. set_ugi() is not successful, " +
"Continuing without it.", e);
} catch (IOException e) {
LOG.warn("Failed to find ugi of client set_ugi() is not successful, " +
"Continuing without it.", e);
} catch (TException e) {
LOG.warn("set_ugi() not successful, Likely cause: new client talking to old server. "
+ "Continuing without it.", e);
}
}
} catch (MetaException e) {
LOG.error("Unable to connect to metastore with URI " + store
+ " in attempt " + attempt, e);
}
if (isConnected) {
break;
}
}
// Wait before launching the next round of connection retries.
if (!isConnected && retryDelaySeconds > 0) {
try {
LOG.info("Waiting " + retryDelaySeconds + " seconds before next connection attempt.");
Thread.sleep(retryDelaySeconds * 1000);
} catch (InterruptedException ignore) {
}
}
}
if (!isConnected) {
throw new MetaException("Could not connect to meta store using any of the URIs provided." +
" Most recent failure: " + StringUtils.stringifyException(tte));
}
snapshotActiveConf();
LOG.info("Connected to metastore.");
}
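Stripped of the Thrift and SASL details, open() is a two-level loop: every round tries each URI in order, and rounds are separated by a delay. The sketch below shows only that control flow. ConnectRetrySketch and the tryOpen predicate are hypothetical; tryOpen stands in for building the transport and calling transport.open().

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class ConnectRetrySketch {
    // Returns the first URI that tryOpen accepts, trying the whole list up to `retries` times.
    static String connect(List<String> uris, int retries, long retryDelayMillis,
                          Predicate<String> tryOpen) {
        for (int attempt = 0; attempt < retries; attempt++) {
            for (String uri : uris) {
                if (tryOpen.test(uri)) {
                    return uri; // connected, stop immediately
                }
            }
            // Wait before the next round of attempts.
            if (retryDelayMillis > 0) {
                try {
                    Thread.sleep(retryDelayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", e);
                }
            }
        }
        throw new IllegalStateException("Could not connect using any of the URIs provided");
    }

    public static void main(String[] args) {
        List<String> uris = Arrays.asList("thrift://ms1:9083", "thrift://ms2:9083");
        // Pretend only the second metastore is reachable.
        String connected = connect(uris, 3, 0, uri -> uri.contains("ms2"));
        System.out.println("connected to " + connected);
    }
}
```

Unlike the original, which swallows InterruptedException, this sketch restores the interrupt flag and aborts; that is a deliberate design difference, not a description of Hive's behavior.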
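The code shows that the HiveMetaStore server side is built on ThriftHiveMetastore. It is a class, but it defines the interfaces Iface and AsyncIface inside it, which makes them convenient to implement and extend. Next, let us look at how HMSHandler is initialized. In local mode, newRetryingHMSHandler is called directly, which initializes HMSHandler right away. The code is as follows: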
public HMSHandler(String name, HiveConf conf, boolean init) throws MetaException {
super(name);
hiveConf = conf;
if (init) {
init();
}
}
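Names such as newRetryingHMSHandler and RetryingMetaStoreClient.getProxy suggest that the handler and the client are wrapped in a retrying proxy. A common way to build such a wrapper around an interface like Iface is java.lang.reflect.Proxy. The sketch below shows that general technique only; RetryingProxySketch, its Iface and the choice of IllegalStateException as the retryable error are assumptions, not Hive's actual retry policy.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

final class RetryingProxySketch {
    // Hypothetical service interface, standing in for a Thrift-generated Iface.
    interface Iface {
        String getTable(String name);
    }

    // Wraps target so that calls failing with IllegalStateException are retried.
    static <T> T wrap(Class<T> iface, T target, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        InvocationHandler handler = (proxy, method, args) -> {
            Throwable last = null;
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    last = e.getCause();
                    if (!(last instanceof IllegalStateException)) {
                        throw last; // not retryable in this sketch
                    }
                }
            }
            throw last; // retries exhausted
        };
        Object proxy = Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler);
        return iface.cast(proxy);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        Iface flaky = name -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "table:" + name;
        };
        Iface retrying = wrap(Iface.class, flaky, 5);
        System.out.println(retrying.getTable("t1") + " after " + calls[0] + " calls");
    }
}
```

Callers keep programming against the interface, so the retry behavior can be added or removed without touching the handler or the call sites.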
Next, let us look at its init method:
public void init() throws MetaException {
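// Get the class name of the implementation that talks to the database. By default this is ObjectStore, the RawStore implementation responsible for the JDO interaction with the database.
rawStoreClassName = hiveConf.getVar(HiveConf.ConfVars.METASTORE_RAW_STORE_IMPL);
// Load the init listeners from hive.metastore.init.hooks; you can implement and plug in your own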
initListeners = MetaStoreUtils.getMetaStoreListeners(MetaStoreInitListener.class, hiveConf,hiveConf.getVar(HiveConf.ConfVars.METASTORE_INIT_HOOKS));
for (MetaStoreInitListener singleInitListener: initListeners) {
MetaStoreInitContext context = new MetaStoreInitContext();
singleInitListener.onInit(context);
}
// Initialize the alter handler implementation
String alterHandlerName = hiveConf.get("hive.metastore.alter.impl",HiveAlterHandler.class.getName());
alterHandler = (AlterHandler) ReflectionUtils.newInstance(MetaStoreUtils.getClass(alterHandlerName), hiveConf);
// Initialize the warehouse
wh = new Warehouse(hiveConf);
// Create the default db, the default roles and the admin users, and record currentUrl
synchronized (HMSHandler.class) {
if (currentUrl == null || !currentUrl.equals(MetaStoreInit.getConnectionURL(hiveConf))) {
createDefaultDB();
createDefaultRoles();
addAdminUsers();
currentUrl = MetaStoreInit.getConnectionURL(hiveConf);
}
}
// Initialize metrics
if (hiveConf.getBoolean("hive.metastore.metrics.enabled", false)) {
try {
Metrics.init();
} catch (Exception e) {
// log exception, but ignore inability to start
LOG.error("error in Metrics init: " + e.getClass().getName() + " "+ e.getMessage(), e);
}
}
// Initialize the pre-event listeners, the event listeners and the end-function listeners
preListeners = MetaStoreUtils.getMetaStoreListeners(MetaStorePreEventListener.class,hiveConf,hiveConf.getVar(HiveConf.ConfVars.METASTORE_PRE_EVENT_LISTENERS));
listeners = MetaStoreUtils.getMetaStoreListeners(MetaStoreEventListener.class, hiveConf,hiveConf.getVar(HiveConf.ConfVars.METASTORE_EVENT_LISTENERS));
listeners.add(new SessionPropertiesListener(hiveConf));
endFunctionListeners = MetaStoreUtils.getMetaStoreListeners(
MetaStoreEndFunctionListener.class, hiveConf,
hiveConf.getVar(HiveConf.ConfVars.METASTORE_END_FUNCTION_LISTENERS));
// Regex validation for partition names; configurable through hive.metastore.partition.name.whitelist.pattern
String partitionValidationRegex =hiveConf.getVar(HiveConf.ConfVars.METASTORE_PARTITION_NAME_WHITELIST_PATTERN);
if (partitionValidationRegex != null && !partitionValidationRegex.isEmpty()) {
partitionValidationPattern = Pattern.compile(partitionValidationRegex);
} else {
partitionValidationPattern = null;
}
long cleanFreq = hiveConf.getTimeVar(ConfVars.METASTORE_EVENT_CLEAN_FREQ, TimeUnit.MILLISECONDS);
if (cleanFreq > 0) {
// In default config, there is no timer.
Timer cleaner = new Timer("Metastore Events Cleaner Thread", true);
cleaner.schedule(new EventCleanerTask(this), cleanFreq, cleanFreq);
}
}
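The whitelist handling in init() compiles the configured regex once and leaves the pattern null when nothing is configured. How such a pattern might then be applied is sketched below. PartitionNameCheckSketch is hypothetical, and the assumption that a value must match the whole pattern (matches() rather than find()) is mine, not confirmed by the code shown here.

```java
import java.util.regex.Pattern;

final class PartitionNameCheckSketch {
    // Mirrors the init() logic: no configured regex means no validation.
    static Pattern compileOrNull(String regex) {
        return (regex == null || regex.isEmpty()) ? null : Pattern.compile(regex);
    }

    // Assumption: the whole partition value must match the whitelist pattern.
    static boolean isValid(Pattern pattern, String partitionValue) {
        return pattern == null || pattern.matcher(partitionValue).matches();
    }

    public static void main(String[] args) {
        Pattern whitelist = compileOrNull("[A-Za-z0-9_-]+");
        System.out.println(isValid(whitelist, "2024-01-01"));
        System.out.println(isValid(whitelist, "a/b"));
    }
}
```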
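It initializes the rawStore implementation class that talks to the database, the warehouse used for physical operations, and the events and listeners. The metadata life cycle methods are then called through the interface to operate on tables.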
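Listeners in init() are configured as comma-separated class names and instantiated at startup. A minimal sketch of that loading pattern follows. ListenerLoaderSketch, EventListener and LoggingListener are hypothetical, and the sketch uses a no-argument constructor for brevity, which may differ from what MetaStoreUtils.getMetaStoreListeners expects of real listener classes.

```java
import java.util.ArrayList;
import java.util.List;

final class ListenerLoaderSketch {
    // Hypothetical listener contract for this sketch.
    interface EventListener {
        void onEvent(String event);
    }

    static final class LoggingListener implements EventListener {
        public void onEvent(String event) {
            System.out.println("event: " + event);
        }
    }

    // Instantiates every class named in a comma-separated list and checks its type.
    static <T> List<T> load(Class<T> type, String classNames) {
        List<T> listeners = new ArrayList<>();
        if (classNames == null || classNames.trim().isEmpty()) {
            return listeners; // nothing configured
        }
        for (String name : classNames.split(",")) {
            try {
                Class<?> cls = Class.forName(name.trim());
                listeners.add(type.cast(cls.getDeclaredConstructor().newInstance()));
            } catch (ReflectiveOperationException | ClassCastException e) {
                throw new IllegalArgumentException("Failed to instantiate listener " + name, e);
            }
        }
        return listeners;
    }

    public static void main(String[] args) {
        String configured = LoggingListener.class.getName();
        for (EventListener listener : load(EventListener.class, configured)) {
            listener.onEvent("CREATE_TABLE");
        }
    }
}
```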
3. createTable
public void createTable(String tableName, List<String> columns, List<String> partCols,
Class<? extends InputFormat> fileInputFormat,
Class<?> fileOutputFormat, int bucketCount, List<String> bucketCols,
Map<String, String> parameters) throws HiveException {
if (columns == null) {
throw