In the previous lesson we covered Hive basics and partitioned tables. In this lesson we will work through a comprehensive Hive exercise.
Step 1: Create the MySQL tables
We need two tables, trade_detail and user_info, with the following structures.
Table 1: trade_detail, with id as the primary key
Table 2: user_info, with id as the primary key
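For reference, the MySQL DDL might look like the sketch below. The column names and rough types are inferred from the Hive tables we create in Step 3; the exact MySQL types (VARCHAR lengths, DOUBLE for the amounts) are assumptions, so adjust them to your own data.
CREATE TABLE trade_detail (
    id BIGINT PRIMARY KEY,
    account VARCHAR(50),
    income DOUBLE,
    expenses DOUBLE,
    trade_time VARCHAR(50)
);
CREATE TABLE user_info (
    id BIGINT PRIMARY KEY,
    account VARCHAR(50),
    name VARCHAR(50),
    age INT
);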
Step 2: Join query in MySQL
First, let's run the join query directly against the relational database (MySQL). For each account we want the corresponding name, total income, total expenses, and the balance. The SQL statement is:
SELECT t.account,u.name,t.income,t.expenses,t.surplus FROM user_info u JOIN (SELECT account,SUM(income) AS income,SUM(expenses) AS expenses,SUM(income-expenses) AS surplus FROM trade_detail GROUP BY account) t ON u.account=t.account;
The query result is shown in the figure below.
Step 3: Create the Hive tables
Since we want Hive to run this query for us, we first need to create two corresponding Hive tables.
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
hive> create table trade_detail (id bigint,account string,income double,expenses double,trade_time string) row format delimited fields terminated by '\t';
OK
Time taken: 0.842 seconds
hive> show tables;
OK
beauties
ext_student
people
student
trade_detail
Time taken: 0.068 seconds, Fetched: 5 row(s)
hive> create table user_info (id bigint,account string,name string,age int) row format delimited fields terminated by '\t';
OK
Time taken: 0.091 seconds
hive> show tables;
OK
beauties
ext_student
people
student
trade_detail
user_info
Time taken: 0.024 seconds, Fetched: 6 row(s)
hive>
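Before moving on, you can double-check the new schemas with describe; the exact output layout varies slightly between Hive versions, but it should list the columns defined above.
hive> describe trade_detail;
hive> describe user_info;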
Step 4: Import the MySQL data directly into the Hive tables with Sqoop
We run the import from Sqoop's bin directory, as shown below.
[root@itcast03 ~]# cd /itcast/sqoop-1.4.6
[root@itcast03 sqoop-1.4.6]# ls
bin CHANGELOG.txt conf ivy lib NOTICE.txt README.txt sqoop-patch-review.py src
build.xml COMPILING.txt docs ivy.xml LICENSE.txt pom-old.xml sqoop-1.4.6.jar sqoop-test-1.4.6.jar testdata
[root@itcast03 sqoop-1.4.6]# cd bin
[root@itcast03 bin]# ./sqoop import --connect jdbc:mysql://169.254.254.1:3306/test --username root --password root --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'
Running the import above produces an error:
16/11/06 21:15:52 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Cannot run program "hive": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at java.lang.Runtime.exec(Runtime.java:617)
at java.lang.Runtime.exec(Runtime.java:528)
at org.apache.sqoop.util.Executor.exec(Executor.java:76)
at org.apache.sqoop.hive.HiveImport.executeExternalHiveScript(HiveImport.java:386)
at org.apache.sqoop.hive.HiveImport.executeScript(HiveImport.java:339)
at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:240)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:514)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 14 more
So what caused this error? The root cause is that Hive is not on our PATH. A Sqoop Hive import actually runs in two phases: first it copies the MySQL data into a temporary directory on HDFS, then it loads the data from that temporary directory into the Hive table. The second phase has to invoke the hive command, and since we are sitting in Sqoop's bin directory with no Hive environment variables configured, that command cannot be found. To make the hive command available everywhere, we add Hive to the environment variables, as shown below.
[root@itcast03 bin]# vim /etc/profile
*)
if [ "$2" = "after" ] ; then
PATH=$PATH:$1
else
PATH=$1:$PATH
fi
esac
}
# ... (some lines omitted) ...
unset i
unset -f pathmunge
export JAVA_HOME=/usr/java/jdk1.7.0_80
export HADOOP_HOME=/itcast/hadoop-2.2.0
export SQOOP_HOME=/itcast/sqoop-1.4.6
export HIVE_HOME=/itcast/apache-hive-0.13.0-bin/
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$SQOOP_HOME/bin:$HIVE_HOME/bin:$PATH
After adding the environment variables, save and exit, then run source /etc/profile to make the configuration take effect.
[root@itcast03 bin]# source /etc/profile
[root@itcast03 bin]#
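You can quickly confirm that the hive command now resolves from anywhere on the PATH; which should print a path under the $HIVE_HOME/bin we just exported.
[root@itcast03 bin]# which hive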
With the configuration in effect, we run the import again, only to hit a different error, as shown below.
[root@itcast03 bin]# ./sqoop import --connect jdbc:mysql://169.254.254.1:3306/test --username root --password root --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'
Warning: /itcast/sqoop-1.4.6/../hbase does not exist! HBase imports will fail.
16/11/06 21:26:53 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ns1/user/root/trade_detail already exists
16/11/06 21:26:53 ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException:Output directory hdfs://ns1/user/root/trade_detail already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:456)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:342)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286)
at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:196)
at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:169)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:266)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
at org.apache.sqoop.manager.MySQLManager.importTable(MySQLManager.java:118)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
This happens because our first, failed import had already created the temporary directory trade_detail under /user/root on HDFS; since its contents were never loaded into the Hive table, the temporary directory was left behind. You can verify this by looking at the current contents of /user/root on HDFS, as shown in the figure below.
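Equivalently, you can list the directory from the command line:
[root@itcast03 bin]# hadoop fs -ls /user/root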
Since we know this leftover directory is what causes the error, we simply delete it:
[root@itcast03 bin]# hadoop fs -rm -r /user/root/trade_detail
16/11/06 21:27:55 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/root/trade_detail
[root@itcast03 bin]#
After deleting it, we check the /user/root directory on HDFS again and confirm that the trade_detail temporary directory is indeed gone.
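As an aside, Sqoop's import tool also has a --delete-target-dir option that removes an existing target directory before importing; depending on your Sqoop version and how the option interacts with --hive-import, it can spare you this manual cleanup. A sketch:
./sqoop import --connect jdbc:mysql://169.254.254.1:3306/test --username root --password root --table trade_detail --delete-target-dir --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'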
Now we run the import for real, and this time it succeeds. A few parameters in the import command deserve an explanation: test is the name of the MySQL database that holds our two tables; --hive-overwrite overwrites any data already in the Hive table (without it, the imported rows would be appended to whatever is there); --fields-terminated-by '\t' specifies that columns are separated by a tab character.
[root@itcast03 bin]# ./sqoop import --connect jdbc:mysql://169.254.254.1:3306/test --username root --password root --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'
16/11/06 21:29:27 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/11/06 21:29:30 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
16/11/06 21:29:30 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
16/11/06 21:29:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1478396540774_0005
16/11/06 21:29:31 INFO impl.YarnClientImpl: Submitted application application_1478396540774_0005 to ResourceManager at itcast03/169.254.254.30:8032
16/11/06 21:29:31 INFO mapreduce.Job: The url to track the job: http://itcast03:8088/proxy/application_1478396540774_0005/
16/11/06 21:29:31 INFO mapreduce.Job: Running job: job_1478396540774_0005
16/11/06 21:29:37 INFO mapreduce.Job: Job job_1478396540774_0005 running in uber mode : false
16/11/06 21:29:37 INFO mapreduce.Job: map 0% reduce 0%
16/11/06 21:29:44 INFO mapreduce.Job: map 50% reduce 0%
16/11/06 21:29:45 INFO mapreduce.Job: map 100% reduce 0%
16/11/06 21:29:46 INFO mapreduce.Job: Job job_1478396540774_0005 completed successfully
16/11/06 21:29:53 INFO hive.HiveImport: Hive import complete.
After the import finishes, we check HDFS to confirm the data arrived, as shown in the figure below.
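Assuming the default warehouse location (/user/hive/warehouse), the imported files can also be listed from the command line:
[root@itcast03 bin]# hadoop fs -ls /user/hive/warehouse/trade_detail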
Next we import the user_info data the same way, as shown below.
[root@itcast03 bin]# ./sqoop import --connect jdbc:mysql://169.254.254.1:3306/test --username root --password root --table user_info --hive-import --hive-overwrite --hive-table user_info --fields-terminated-by '\t'
16/11/06 21:51:01 INFO mapreduce.ImportJobBase: Beginning import of user_info
16/11/06 21:51:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1478396540774_0006
16/11/06 21:51:06 INFO impl.YarnClientImpl: Submitted application application_1478396540774_0006 to ResourceManager at itcast03/169.254.254.30:8032
16/11/06 21:51:06 INFO mapreduce.Job: The url to track the job: http://itcast03:8088/proxy/application_1478396540774_0006/
16/11/06 21:51:06 INFO mapreduce.Job: Running job: job_1478396540774_0006
16/11/06 21:51:12 INFO mapreduce.Job: Job job_1478396540774_0006 running in uber mode : false
16/11/06 21:51:12 INFO mapreduce.Job: map 0% reduce 0%
16/11/06 21:51:19 INFO mapreduce.Job: map 25% reduce 0%
16/11/06 21:51:20 INFO mapreduce.Job: map 75% reduce 0%
16/11/06 21:51:21 INFO mapreduce.Job: map 100% reduce 0%
16/11/06 21:51:21 INFO mapreduce.Job: Job job_1478396540774_0006 completed successfully
After importing user_info, we check HDFS again and confirm that the user_info data has also been imported successfully.
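As a final sanity check before the big query, you can also spot-check both tables from the Hive shell (the rows you see will of course depend on your data):
hive> select * from trade_detail limit 3;
hive> select * from user_info limit 3;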
Step 5: Run the query in Hive
We now execute the statement below. When it finishes, the result is exactly the same as what we got from MySQL, which shows that the Hive query works. Notice from the log that Hive broke the query into stages and launched two MapReduce jobs (plus a local map-join task), all planned and run automatically; we never had to write a line of MapReduce code ourselves, which is what makes Hive so powerful.
hive> select t.account,u.name,t.income,t.expenses,t.surplus from user_info u join (select account,sum(income) as income,sum(expenses) as expenses,sum(income-expenses) as surplus from trade_detail group by account) t on u.account=t.account;
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1478396540774_0007, Tracking URL = http://itcast03:8088/proxy/application_1478396540774_0007/
Kill Command = /itcast/hadoop-2.2.0/bin/hadoop job -kill job_1478396540774_0007
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2016-11-06 22:26:58,461 Stage-2 map = 0%, reduce = 0%
2016-11-06 22:27:04,741 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.41 sec
2016-11-06 22:27:10,971 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.57 sec
MapReduce Total cumulative CPU time: 2 seconds 570 msec
Ended Job = job_1478396540774_0007
16/11/06 22:27:15 WARN conf.Configuration: file:/tmp/root/hive_2016-11-06_22-26-50_274_777420487936050051-1/-local-10009/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
16/11/06 22:27:15 WARN conf.Configuration: file:/tmp/root/hive_2016-11-06_22-26-50_274_777420487936050051-1/-local-10009/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
Execution log at: /tmp/root/root_20161106222626_52c07a9b-087e-4b62-ab85-cbb138a3526f.log
2016-11-06 10:27:15 Starting to launch local task to process map join; maximum memory = 518979584
2016-11-06 10:27:16 Dump the side-table into file: file:/tmp/root/hive_2016-11-06_22-26-50_274_777420487936050051-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2016-11-06 10:27:16 Uploaded 1 File to: file:/tmp/root/hive_2016-11-06_22-26-50_274_777420487936050051-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (423 bytes)
2016-11-06 10:27:16 End of local task; Time Taken: 0.961 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1478396540774_0008, Tracking URL = http://itcast03:8088/proxy/application_1478396540774_0008/
Kill Command = /itcast/hadoop-2.2.0/bin/hadoop job -kill job_1478396540774_0008
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2016-11-06 22:27:22,454 Stage-3 map = 0%, reduce = 0%
2016-11-06 22:27:27,608 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.93 sec
MapReduce Total cumulative CPU time: 930 msec
Ended Job = job_1478396540774_0008
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 2.57 sec HDFS Read: 815 HDFS Write: 266 SUCCESS
Job 1: Map: 1 Cumulative CPU: 0.93 sec HDFS Read: 617 HDFS Write: 123 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 500 msec
OK
lisi@163.com 李四 0.0 2300.0 -2300.0
wangwu@gmail.com 王五 900.0 0.0 900.0
zhangsan@126.com 张三 4000.0 400.0 3600.0
Time taken: 39.561 seconds, Fetched: 3 row(s)
hive>
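One detail worth noticing in the log above: Hive converted the join into a map join, dumping the small side of the join into a local hash table ("Starting to launch local task to process map join") and then running only map tasks in the second job. This optimization is controlled by hive.auto.convert.join; if you want to see the query run as a plain reduce-side join instead, disable it before running the query:
hive> set hive.auto.convert.join=false;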
That's all for this lesson.