First attempt at using partition archiving to address the problem of too many small files under a partition.
Steps:
1. Run: set hive.archive.enabled=true;
2. Run: ALTER TABLE test.users_part ARCHIVE PARTITION(ds='20200401');
The second statement failed. The error shown in the SQL CLI was:
hive (test)> ALTER TABLE test.users_part ARCHIVE PARTITION(ds='20200401');
intermediate.archived is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ARCHIVED
intermediate.original is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ORIGINAL
Creating data.har for hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401
in hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401/.hive-staging_hive_2020-05-13_18-46-48_858_3143515583523997534-1/-ext-10000/partlevel
Please wait... (this may take a while)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/tools/HadoopArchives
hive (test)>
The detailed error in Hive's log4j log (the log directory is set in the log4j config file under hive/conf; by default it is /tmp/$USER) was:
2020-05-13T18:46:48,825 INFO [main] conf.HiveConf: Using the default value passed in for log id: 06ab61b1-0eab-44f7-a818-6cf2cf8f0b08
2020-05-13T18:46:48,825 INFO [main] session.SessionState: Updating thread name to 06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main
2020-05-13T18:46:48,828 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Compiling command(queryId=ytc_20200513184648_9ab2659a-ef3b-4e37-a3f0-1ff344b2721b): ALTER TABLE test.users_part ARCHIVE PARTITION(ds='20200401')
2020-05-13T18:46:48,859 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Concurrency mode is disabled, not creating a lock manager
2020-05-13T18:46:48,859 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] metastore.HiveMetaStore: 0: get_table : tbl=hive.test.users_part
2020-05-13T18:46:48,859 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] HiveMetaStore.audit: ugi=ytc ip=unknown-ip-addr cmd=get_table : tbl=hive.test.users_part
2020-05-13T18:46:48,875 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] metastore.HiveMetaStore: 0: get_partitions_ps_with_auth : tbl=hive.test.users_part[15]
2020-05-13T18:46:48,875 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] HiveMetaStore.audit: ugi=ytc ip=unknown-ip-addr cmd=get_partitions_ps_with_auth : tbl=hive.test.users_part[15]
2020-05-13T18:46:48,916 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Semantic Analysis Completed (retrial = false)
2020-05-13T18:46:48,916 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
2020-05-13T18:46:48,916 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Completed compiling command(queryId=ytc_20200513184648_9ab2659a-ef3b-4e37-a3f0-1ff344b2721b); Time taken: 0.088 seconds
2020-05-13T18:46:48,916 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] reexec.ReExecDriver: Execution #1 of query
2020-05-13T18:46:48,916 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Concurrency mode is disabled, not creating a lock manager
2020-05-13T18:46:48,916 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Executing command(queryId=ytc_20200513184648_9ab2659a-ef3b-4e37-a3f0-1ff344b2721b): ALTER TABLE test.users_part ARCHIVE PARTITION(ds='20200401')
2020-05-13T18:46:48,916 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Starting task [Stage-0:DDL] in serial mode
2020-05-13T18:46:48,917 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] metastore.HiveMetaStore: 0: get_table : tbl=hive.test.users_part
2020-05-13T18:46:48,917 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] HiveMetaStore.audit: ugi=ytc ip=unknown-ip-addr cmd=get_table : tbl=hive.test.users_part
2020-05-13T18:46:48,932 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] metastore.HiveMetaStore: 0: get_partitions_ps_with_auth : tbl=hive.test.users_part[15]
2020-05-13T18:46:48,932 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] HiveMetaStore.audit: ugi=ytc ip=unknown-ip-addr cmd=get_partitions_ps_with_auth : tbl=hive.test.users_part[15]
2020-05-13T18:46:48,975 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.Task: intermediate.archived is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds='20200401'_INTERMEDIATE_ARCHIVED
2020-05-13T18:46:48,975 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.Task: intermediate.original is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds='20200401'_INTERMEDIATE_ORIGINAL
2020-05-13T18:46:48,978 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] common.FileUtils: Creating directory if it doesn't exist: hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds='20200401'/.hive-staging_hive_2020-05-13_18-46-48_858_3143515583523997534-1
2020-05-13T18:46:48,980 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.Task: Creating data.har for hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds='20200401'
2020-05-13T18:46:48,980 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.Task: in hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds='20200401'/.hive-staging_hive_2020-05-13_18-46-48_858_3143515583523997534-1/-ext-10000/partlevel
2020-05-13T18:46:48,980 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.Task: Please wait... (this may take a while)
2020-05-13T18:46:48,980 ERROR [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.DDLTask: Failed
java.lang.NoClassDefFoundError: org/apache/hadoop/tools/HadoopArchives
at org.apache.hadoop.hive.ql.exec.DDLTask.archive(DDLTask.java:1756) ~[hive-exec-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:479) ~[hive-exec-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:205) ~[hive-exec-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97) ~[hive-exec-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2664) ~[hive-exec-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:2335) ~[hive-exec-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:2011) ~[hive-exec-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1709) ~[hive-exec-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1703) ~[hive-exec-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157) ~[hive-exec-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:218) ~[hive-exec-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:239) ~[hive-cli-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188) ~[hive-cli-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:402) ~[hive-cli-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821) ~[hive-cli-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759) ~[hive-cli-3.1.2.jar:3.1.2]
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683) ~[hive-cli-3.1.2.jar:3.1.2]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_211]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_211]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_211]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_211]
at org.apache.hadoop.util.RunJar.run(RunJar.java:318) ~[hadoop-common-3.1.3.jar:?]
at org.apache.hadoop.util.RunJar.main(RunJar.java:232) ~[hadoop-common-3.1.3.jar:?]
2020-05-13T18:46:48,980 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] reexec.ReOptimizePlugin: ReOptimization: retryPossible: false
2020-05-13T18:46:48,980 ERROR [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/tools/HadoopArchives
2020-05-13T18:46:48,980 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Completed executing command(queryId=ytc_20200513184648_9ab2659a-ef3b-4e37-a3f0-1ff344b2721b); Time taken: 0.064 seconds
2020-05-13T18:46:48,980 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Concurrency mode is disabled, not creating a lock manager
2020-05-13T18:46:48,981 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] conf.HiveConf: Using the default value passed in for log id: 06ab61b1-0eab-44f7-a818-6cf2cf8f0b08
2020-05-13T18:46:48,982 INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] session.SessionState: Resetting thread name to main
The key line is: java.lang.NoClassDefFoundError: org/apache/hadoop/tools/HadoopArchives
This class is not on the classpath when Hive runs, so the jar that contains it has to be copied into the lib directories of the runtime environment. Since I run the MapReduce engine on YARN, I copied
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-archives-3.1.3.jar into the following two directories:
$HADOOP_HOME/share/hadoop/yarn/lib/
$HADOOP_HOME/share/hadoop/mapreduce/lib/
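The copy step above can be sketched as a small shell helper. This is a hedged sketch: the function name `copy_archives_jar` is made up for illustration, and the directory layout under `$HADOOP_HOME/share/hadoop` assumes a standard Hadoop 3.x install; adjust the jar version to match your distribution.

```shell
# Hypothetical helper: copy hadoop-archives-<version>.jar from the tools lib
# into the yarn and mapreduce lib directories of a standard Hadoop layout.
# Function name and layout are assumptions; verify against your install.
copy_archives_jar() {
  local hadoop_home=$1 version=$2
  local jar="$hadoop_home/share/hadoop/tools/lib/hadoop-archives-$version.jar"
  cp "$jar" "$hadoop_home/share/hadoop/yarn/lib/"
  cp "$jar" "$hadoop_home/share/hadoop/mapreduce/lib/"
}
```

On the cluster above this would be called as `copy_archives_jar "$HADOOP_HOME" 3.1.3`. Restart the Hive CLI afterward so the new jars are picked up.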
3. Rerun the hive command to re-enter the SQL CLI:
ytc@ytc:/apps/hive/log$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/apps/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/apps/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = ad93aab7-55df-4181-a52c-245a2220978b
Logging initialized using configuration in file:/apps/hive/conf/hive-log4j2.properties Async: true
Hive Session ID = 52f38cb6-e313-455d-a6cd-8b333fcab3ce
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive (default)>
>
>
> use test;
OK
Time taken: 0.811 seconds
hive (test)> set hive.archive.enabled=true;
hive (test)>
>
> ALTER TABLE test.users_part ARCHIVE PARTITION(ds=20200401);
intermediate.archived is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ARCHIVED
intermediate.original is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ORIGINAL
Creating data.har for hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401
in hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401/.hive-staging_hive_2020-05-13_18-47-23_023_5539564871141800018-1/-ext-10000/partlevel
Please wait... (this may take a while)
Moving hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401/.hive-staging_hive_2020-05-13_18-47-23_023_5539564871141800018-1/-ext-10000/partlevel to hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ARCHIVED
Moving hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401 to hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ORIGINAL
Moving hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ARCHIVED to hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401
OK
Time taken: 18.526 seconds
hive (test)>
With that, partition archiving works as expected.
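For completeness, archiving can be reversed with UNARCHIVE, which is standard Hive DDL; sketched here as a non-interactive `hive -e` call rather than the interactive session used above (untested here, since it needs a live cluster).

```shell
# Unarchive the partition, restoring the original files from data.har.
# hive.archive.enabled must be set in the same session.
hive -e "
set hive.archive.enabled=true;
ALTER TABLE test.users_part UNARCHIVE PARTITION(ds='20200401');
"
```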