Learning Hadoop, Lesson 32 (Hive Exercise: Multi-Table Join Query)

In the previous lesson we covered Hive basics and partitioned tables. In this lesson we will work through a comprehensive Hive exercise.

Step 1: Create the tables

We need two tables, trade_detail and user_info, whose layouts are described below.

Table 1: trade_detail, with id as the primary key.

Table 2: user_info, with id as the primary key.
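The original table layouts are given only as screenshots. As a minimal sketch, the MySQL DDL could look like the following; the column names and general types are taken from the Hive table definitions created later in this article, while the exact MySQL types and lengths are my own assumptions:

CREATE TABLE trade_detail (
  id          BIGINT PRIMARY KEY,   -- primary key, as noted above
  account     VARCHAR(50),          -- account, the join key
  income      DOUBLE,               -- income of one transaction
  expenses    DOUBLE,               -- expense of one transaction
  trade_time  VARCHAR(50)           -- transaction time (type assumed)
);

CREATE TABLE user_info (
  id       BIGINT PRIMARY KEY,      -- primary key, as noted above
  account  VARCHAR(50),             -- account, the join key
  name     VARCHAR(50),             -- user name
  age      INT                      -- user age
);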

Step 2: Join query in MySQL

Before touching Hive, let's run the join query directly against the two tables in MySQL. For each account we want the corresponding name, the total income, the total expenses, and the balance (income minus expenses). The SQL statement is:

SELECT t.account,u.name,t.income,t.expenses,t.surplus FROM user_info u JOIN (SELECT account,SUM(income) AS income,SUM(expenses) AS expenses,SUM(income-expenses) AS surplus FROM trade_detail GROUP BY account) t ON u.account=t.account;

The query result is shown in the figure below.
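The test data itself appears only in screenshots in the original post. Purely for illustration, hypothetical rows such as the following would reproduce the totals shown in the final result set at the end of this article (all values, including the dates and ages, are made up):

-- hypothetical sample data; the aggregates match the result shown at the end of the article
INSERT INTO user_info VALUES
  (1, 'zhangsan@126.com', '张三', 30),
  (2, 'lisi@163.com',     '李四', 28),
  (3, 'wangwu@gmail.com', '王五', 26);

INSERT INTO trade_detail VALUES
  (1, 'zhangsan@126.com', 3000, 0,    '2016-11-01'),
  (2, 'zhangsan@126.com', 1000, 400,  '2016-11-02'),
  (3, 'lisi@163.com',     0,    2300, '2016-11-03'),
  (4, 'wangwu@gmail.com', 900,  0,    '2016-11-04');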

Step 3: Create the Hive tables

Since we want Hive to run this query for us, we first need to create two corresponding Hive tables.

[root@itcast03 bin]#  ./hive
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/11/06 20:57:38 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
Logging initialized using configuration in jar:file:/itcast/apache-hive-0.13.0-bin/lib/hive-common-0.13.0.jar!/hive-log4j.properties
hive>  create table trade_detail (id bigint,account string,income double,expenses double,trade_time string) row format delimited fields terminated by '\t';
OK
Time taken: 0.842 seconds
hive>  show tables;
OK
beauties
ext_student
people
student
trade_detail
Time taken: 0.068 seconds, Fetched: 5 row(s)
hive>  create table user_info (id bigint,account string,name string,age int) row format delimited fields terminated by '\t';
OK
Time taken: 0.091 seconds
hive> show tables;
OK
beauties
ext_student
people
student
trade_detail
user_info
Time taken: 0.024 seconds, Fetched: 6 row(s)
hive>
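Before importing any data, it can be worth confirming that the column layout of each new Hive table matches the MySQL side; a quick optional check in the Hive CLI:

hive> describe trade_detail;
hive> describe user_info;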

Step 4: Import the MySQL data into the Hive tables with Sqoop

We run the import from Sqoop's bin directory, as shown below.

[root@itcast03 ~]# cd /itcast/sqoop-1.4.6
[root@itcast03 sqoop-1.4.6]# ls
bin        CHANGELOG.txt  conf  ivy      lib          NOTICE.txt   README.txt       sqoop-patch-review.py  src
build.xml  COMPILING.txt  docs  ivy.xml  LICENSE.txt  pom-old.xml  sqoop-1.4.6.jar  sqoop-test-1.4.6.jar   testdata
[root@itcast03 sqoop-1.4.6]# cd bin
[root@itcast03 bin]# ./sqoop import --connect jdbc:mysql://169.254.254.1:3306/test --username root --password root --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'

Running the import above produces the following error:

16/11/06 21:15:52 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Cannot run program "hive": error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
        at java.lang.Runtime.exec(Runtime.java:617)
        at java.lang.Runtime.exec(Runtime.java:528)
        at org.apache.sqoop.util.Executor.exec(Executor.java:76)
        at org.apache.sqoop.hive.HiveImport.executeExternalHiveScript(HiveImport.java:386)
        at org.apache.sqoop.hive.HiveImport.executeScript(HiveImport.java:339)
        at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:240)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:514)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Caused by: java.io.IOException: error=2, No such file or directory
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:187)
        at java.lang.ProcessImpl.start(ProcessImpl.java:130)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
        ... 14 more

So what causes this error? We never added Hive to the environment variables. A Sqoop --hive-import actually runs in two phases: in the first phase Sqoop copies the MySQL data into a temporary directory on HDFS, and in the second phase it loads that data from the temporary directory into the Hive table, which means it has to invoke the hive command. Since we are working in Sqoop's bin directory and hive is not on the PATH, that invocation fails. To make the hive command available everywhere, we add Hive to the environment variables, as shown in the /etc/profile edit a little further below.
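As a rough illustration of those two phases (a conceptual sketch only, not Sqoop's exact internals; --target-dir and LOAD DATA INPATH are standard Sqoop and Hive features used here to mimic what --hive-import does):

# phase 1: land the MySQL rows in a staging directory on HDFS
./sqoop import --connect jdbc:mysql://169.254.254.1:3306/test --username root --password root \
  --table trade_detail --target-dir /user/root/trade_detail --fields-terminated-by '\t'

# phase 2: load the staged files into the Hive table (run inside the hive CLI)
hive> load data inpath '/user/root/trade_detail' overwrite into table trade_detail;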

[root@itcast03 bin]# vim /etc/profile
        *)
            if [ "$2" = "after" ] ; then
                PATH=$PATH:$1
            else
                PATH=$1:$PATH
            fi
    esac
}

// (part of the file omitted here)

unset i
unset -f pathmunge
export JAVA_HOME=/usr/java/jdk1.7.0_80
export HADOOP_HOME=/itcast/hadoop-2.2.0
export SQOOP_HOME=/itcast/sqoop-1.4.6
export HIVE_HOME=/itcast/apache-hive-0.13.0-bin/
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$SQOOP_HOME/bin:$HIVE_HOME/bin:$PATH

After adding the environment variables, save and exit, then run source /etc/profile so the changes take effect.

[root@itcast03 bin]# source /etc/profile
[root@itcast03 bin]#
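An optional quick check, not part of the original steps, to confirm that the hive command is now resolvable from anywhere:

[root@itcast03 bin]# which hive
# should print the hive script under /itcast/apache-hive-0.13.0-bin/bin/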

Once the configuration is in effect, we run the import again, and this time we hit a different error:

[root@itcast03 bin]# ./sqoop import --connect jdbc:mysql://169.254.254.1:3306/test --username root --password root --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'
Warning: /itcast/sqoop-1.4.6/../hbase does not exist! HBase imports will fail.
16/11/06 21:26:53 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ns1/user/root/trade_detail already exists
16/11/06 21:26:53 ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ns1/user/root/trade_detail already exists
        at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
        at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:456)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:342)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286)
        at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:196)
        at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:169)
        at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:266)
        at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
        at org.apache.sqoop.manager.MySQLManager.importTable(MySQLManager.java:118)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

This happens because our first, failed import already created the staging directory trade_detail under /user/root on HDFS. Since its contents were never successfully loaded into the Hive table, the directory was left behind. You can see it in the current listing of /user/root on HDFS, shown in the figure below.
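If you would rather check from the command line than from the figure, listing the directory works too (output omitted here, since it depends on your cluster):

[root@itcast03 bin]# hadoop fs -ls /user/root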

Since this leftover staging directory is what causes the error, we simply delete it:

[root@itcast03 bin]# hadoop fs -rm -r /user/root/trade_detail
16/11/06 21:27:55 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/root/trade_detail
[root@itcast03 bin]#

After the deletion, we check the /user/root directory on HDFS again and confirm that the trade_detail staging directory is indeed gone.

Now we run the import for real, and this time it succeeds. A few parameters in the import command are worth explaining: test is the MySQL database that holds the two tables; --hive-overwrite overwrites any data already in the Hive table (without it, the import appends to the existing data); and --fields-terminated-by '\t' sets the column delimiter to a tab, matching the Hive table definitions.

[root@itcast03 bin]# ./sqoop import --connect jdbc:mysql://169.254.254.1:3306/test --username root --password root --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'
16/11/06 21:29:27 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/11/06 21:29:30 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
16/11/06 21:29:30 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
16/11/06 21:29:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1478396540774_0005
16/11/06 21:29:31 INFO impl.YarnClientImpl: Submitted application application_1478396540774_0005 to ResourceManager at itcast03/169.254.254.30:8032
16/11/06 21:29:31 INFO mapreduce.Job: The url to track the job: http://itcast03:8088/proxy/application_1478396540774_0005/
16/11/06 21:29:31 INFO mapreduce.Job: Running job: job_1478396540774_0005
16/11/06 21:29:37 INFO mapreduce.Job: Job job_1478396540774_0005 running in uber mode : false
16/11/06 21:29:37 INFO mapreduce.Job:  map 0% reduce 0%
16/11/06 21:29:44 INFO mapreduce.Job:  map 50% reduce 0% 
16/11/06 21:29:45 INFO mapreduce.Job:  map 100% reduce 0%
16/11/06 21:29:46 INFO mapreduce.Job: Job job_1478396540774_0005 completed successfully

16/11/06 21:29:53 INFO hive.HiveImport: Hive import complete.

After the import succeeds, we check HDFS and confirm the trade_detail data is there, as shown in the figure below.

Next we import the user_info data in the same way:

[root@itcast03 bin]# ./sqoop import --connect jdbc:mysql://169.254.254.1:3306/test --username root --password root --table user_info --hive-import --hive-overwrite --hive-table user_info --fields-terminated-by '\t'
16/11/06 21:51:01 INFO mapreduce.ImportJobBase: Beginning import of user_info
16/11/06 21:51:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1478396540774_0006
16/11/06 21:51:06 INFO impl.YarnClientImpl: Submitted application application_1478396540774_0006 to ResourceManager at itcast03/169.254.254.30:8032
16/11/06 21:51:06 INFO mapreduce.Job: The url to track the job: http://itcast03:8088/proxy/application_1478396540774_0006/
16/11/06 21:51:06 INFO mapreduce.Job: Running job: job_1478396540774_0006
16/11/06 21:51:12 INFO mapreduce.Job: Job job_1478396540774_0006 running in uber mode : false
16/11/06 21:51:12 INFO mapreduce.Job:  map 0% reduce 0%
16/11/06 21:51:19 INFO mapreduce.Job:  map 25% reduce 0%
16/11/06 21:51:20 INFO mapreduce.Job:  map 75% reduce 0%
16/11/06 21:51:21 INFO mapreduce.Job:  map 100% reduce 0%
16/11/06 21:51:21 INFO mapreduce.Job: Job job_1478396540774_0006 completed successfully

After importing user_info, we check HDFS again and confirm that this table was imported successfully as well.
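Besides checking HDFS, you can also sanity-check both imports from inside the Hive CLI; each count launches a small MapReduce job, and the numbers should match the row counts in MySQL:

hive> select count(*) from trade_detail;
hive> select count(*) from user_info;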

Step 5: Query with Hive

We now execute the statement below, the same query we ran in MySQL. Looking at the output, the result is exactly the same as what MySQL returned, so querying through Hive works. Note how Hive breaks the work into stages: as the log shows, it launches two MapReduce jobs plus a local map-join task, and none of this requires anything from us, because Hive plans and runs the jobs automatically. That is what makes it so convenient.

hive> select t.account,u.name,t.income,t.expenses,t.surplus from user_info u join (select account,sum(income) as income,sum(expenses) as expenses,sum(income-expenses) as surplus from trade_detail group by account) t on u.account=t.account;
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1478396540774_0007, Tracking URL = http://itcast03:8088/proxy/application_1478396540774_0007/
Kill Command = /itcast/hadoop-2.2.0/bin/hadoop job  -kill job_1478396540774_0007
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2016-11-06 22:26:58,461 Stage-2 map = 0%,  reduce = 0%
2016-11-06 22:27:04,741 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 1.41 sec
2016-11-06 22:27:10,971 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 2.57 sec
MapReduce Total cumulative CPU time: 2 seconds 570 msec
Ended Job = job_1478396540774_0007
16/11/06 22:27:15 WARN conf.Configuration: file:/tmp/root/hive_2016-11-06_22-26-50_274_777420487936050051-1/-local-10009/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
16/11/06 22:27:15 WARN conf.Configuration: file:/tmp/root/hive_2016-11-06_22-26-50_274_777420487936050051-1/-local-10009/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/11/06 22:27:15 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
Execution log at: /tmp/root/root_20161106222626_52c07a9b-087e-4b62-ab85-cbb138a3526f.log
2016-11-06 10:27:15     Starting to launch local task to process map join;      maximum memory = 518979584
2016-11-06 10:27:16     Dump the side-table into file: file:/tmp/root/hive_2016-11-06_22-26-50_274_777420487936050051-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2016-11-06 10:27:16     Uploaded 1 File to: file:/tmp/root/hive_2016-11-06_22-26-50_274_777420487936050051-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (423 bytes)
2016-11-06 10:27:16     End of local task; Time Taken: 0.961 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1478396540774_0008, Tracking URL = http://itcast03:8088/proxy/application_1478396540774_0008/
Kill Command = /itcast/hadoop-2.2.0/bin/hadoop job  -kill job_1478396540774_0008
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2016-11-06 22:27:22,454 Stage-3 map = 0%,  reduce = 0%
2016-11-06 22:27:27,608 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.93 sec
MapReduce Total cumulative CPU time: 930 msec
Ended Job = job_1478396540774_0008
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 2.57 sec   HDFS Read: 815 HDFS Write: 266 SUCCESS
Job 1: Map: 1   Cumulative CPU: 0.93 sec   HDFS Read: 617 HDFS Write: 123 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 500 msec
OK
lisi@163.com    李四    0.0     2300.0  -2300.0
wangwu@gmail.com        王五    900.0   0.0     900.0
zhangsan@126.com        张三    4000.0  400.0   3600.0

Time taken: 39.561 seconds, Fetched: 3 row(s)
hive>
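If you are curious why Hive chose two MapReduce jobs plus a local map-join task for this statement, you can ask it to print the execution plan without running the query; an optional check:

hive> explain
    > select t.account, u.name, t.income, t.expenses, t.surplus
    > from user_info u
    > join (select account, sum(income) as income, sum(expenses) as expenses,
    >              sum(income - expenses) as surplus
    >       from trade_detail group by account) t
    > on u.account = t.account;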

That wraps up this lesson.
