As EMR/Hadoop clusters are transient, tracking databases and tables across clusters can be difficult. Instead of using a different warehouse directory on each cluster, you can use a single permanent Hive warehouse across all EMR clusters. S3 is a great choice because it is persistent storage with a robust architecture that provides redundancy and read-after-write consistency.
For each cluster:
This can be configured using the hive.metastore.warehouse.dir property in hive-site.xml.
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>s3n://bucket/hive_warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
You may need to update this setting on all nodes.
On a single Hive session:
This can be configured from within the session using a command like set hive.metastore.warehouse.dir=s3n://bucket/hive_warehouse;
or by starting the Hive CLI with the following invocation: hive -hiveconf hive.metastore.warehouse.dir=s3n://bucket/hive_warehouse
Note that with the above configuration, all databases and tables created without an explicit location will be stored on S3 under a path like s3n://bucket/hive_warehouse/myHiveDatabase.db/
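To check that the setting took effect, a short HiveQL session along these lines may help. It is only a sketch: the database name myHiveDatabase is the illustrative one used above, and the exact output format depends on your Hive version.

```sql
-- Print the warehouse directory the current session is using.
SET hive.metastore.warehouse.dir;

-- Create a database without an explicit LOCATION clause,
-- so that it falls under the configured warehouse directory.
CREATE DATABASE IF NOT EXISTS myHiveDatabase;

-- Show the database metadata; the reported location should sit
-- under the warehouse path configured above.
DESCRIBE DATABASE myHiveDatabase;
```

If the reported location still points at a local or HDFS path, the session is probably not picking up the new value, and hive-site.xml on that node is worth rechecking.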