Applications of Hive, the Hadoop Data Warehouse

Introduction to Hive

1. Hive is a data warehouse tool built on Hadoop. It maps structured data files onto database tables and provides full SQL query capability by translating SQL statements into MapReduce jobs for execution. Its main advantage is a low learning curve: simple MapReduce-style statistics can be expressed quickly in SQL-like statements, with no need to develop a dedicated MapReduce application, which makes it well suited to statistical analysis over a data warehouse.

2. Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL) work, and is a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive defines a simple SQL-like query language, called HQL, which lets users who already know SQL query the data. The same language also lets developers familiar with MapReduce plug in custom mappers and reducers for complex analyses that the built-in mappers and reducers cannot handle.
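
For example, a simple aggregation like the one below (written against the records table created later in this post) is compiled by Hive into a MapReduce job behind the scenes, with the GROUP BY column becoming the shuffle key:

SELECT year, MAX(temperature)
FROM records
GROUP BY year;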

Comparison of Apache big-data products (MapReduce, Hive, Pig, HBase)

Hand-written MapReduce jobs
Pros: somewhat better performance than Hive or Pig
Cons: high development effort
Typical cases: 1) search-engine page processing and PageRank computation (Google); 2) classic ETL (full scans); 3) machine learning: clustering, classification, recommendation (Baidu Ecomm)

SQL-based analysis with Hive
Pros: SQL is thoroughly familiar to data analysts
Cons: in some scenarios performance falls short of raw MapReduce
Typical cases: 1) user access log processing / online advertising (Yahoo, Facebook, hulu, Amazon); 2) e-commerce (Taobao's Yunti cluster)

Data analysis with Pig
Cons: Pig's syntax is not widely known; in some scenarios performance falls short of raw MapReduce
Typical cases: statistics and machine learning (Yahoo, twitter)

Systems built on HBase
Pros: can achieve near-real-time statistical analysis
Cons: no open-source implementation yet, so development cost is high
Typical cases: mostly in-house systems, e.g. Google's Percolator and Taobao's prom

Hive architecture
(Figure: Hive architecture diagram; image not reproduced here.)

Installation

Set the environment variables:

export HIVE_HOME=/usr/local/hive

export PATH=$PATH:$PIG_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin

export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib
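
To make these settings survive new shell sessions, a common approach is to append them to the shell profile and check that the hive binary resolves (a minimal sketch; the profile path is an assumption, and $PIG_HOME/$HADOOP_HOME are assumed to be set by an earlier Hadoop/Pig installation):

cat >> ~/.bashrc <<'EOF'
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$PIG_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib
EOF
source ~/.bashrc
which hive    # should print /usr/local/hive/bin/hive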


The three modes in which Hive can run
1. Embedded mode: the metastore is kept in the embedded Derby database. Only one session can connect at a time, so this mode is generally used only for testing.
2. Local standalone mode: MySQL is installed on the local machine and the metastore is kept in MySQL (see the sample configuration below).
3. Remote mode: the metastore is kept in a remote MySQL instance.
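
For the MySQL-backed modes, the connection settings go in hive-site.xml. The following is a minimal sketch; the host, database name, and credentials are placeholders, and for remote mode you would additionally point hive.metastore.uris at the remote metastore service:

<configuration>
  <!-- JDBC connection to the MySQL database that holds the metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>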

Creating tables:
CREATE TABLE test_hive(a int, b int, c int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
CREATE TABLE records(year STRING, temperature INT, quality INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
CREATE TABLE log_history(ip STRING, year INT, month STRING, day INT, hour INT, url STRING, status INT, size INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';


Loading data:
LOAD DATA LOCAL INPATH '/usr/local/hive/t_hive.txt' OVERWRITE INTO TABLE test_hive; (load from the local filesystem; the file is copied into Hive's warehouse directory)
LOAD DATA INPATH '/usr/local/hive/t_hive.txt' OVERWRITE INTO TABLE test_hive; (load from HDFS; the file is moved, not copied)
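
For reference, t_hive.txt is expected to be a tab-separated file matching the three INT columns of test_hive; hypothetical contents:

16	2	3
61	12	13
41	2	31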

INSERT OVERWRITE TABLE target SELECT col1, col2 FROM source;
CREATE TABLE target AS SELECT col1, col2 FROM source;

Viewing table structure:
show tables;
describe test_hive;

Viewing table data:
select * from test_hive;

Altering a table:
ALTER TABLE test_hive ADD COLUMNS (new_col STRING);

Dropping a table:
DROP TABLE test_hive;
 
Log analysis example:
Suppose we have web logs in the following format:

196.13.32.71 - - [09/Aug/1995:03:21:17 -0400] "GET /icons/image.xbm HTTP/1.0" 200 509

pc-128-78.ntc.nokia.com - - [09/Aug/1995:03:22:01 -0400] "GET /shuttle/technology/sts-newsref/ HTTP/1.0" 200 16376

gate.germany.eu.net - - [09/Aug/1995:03:22:02 -0400] "GET /ksc.html HTTP/1.0" 200 7131

Option A: write MapReduce programs for the log analysis (hits per day, hits per hour, hits per IP, hits per domain).

Option B: load the log files into the Hive data warehouse and then write HQL to perform the same analyses.
1. Data loaded into a Hive table must be structured, so the web logs first have to be normalized into a tab-separated file. A small MapReduce program can do this formatting; the code is as follows:

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LogHive extends Configured implements Tool {
    // Input: raw web server logs on HDFS; output: tab-separated records for Hive.
    private static final String INPUT_PATH = "hdfs://192.168.56.103:9000/feixu/logInput1";
    private static final String OUTPUT_PATH = "hdfs://192.168.56.103:9000/feixu/logOutput2";

    public static class LogHiveMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parse one raw log line and re-emit it as a tab-separated record
            // matching the columns of the log_history table.
            LogParser parser = LogParser.parser(value.toString());
            Text out = new Text(parser.getIp() + "\t" + parser.getYear() + "\t"
                    + parser.getMonth() + "\t" + parser.getDay() + "\t"
                    + parser.getHour() + "\t" + parser.getUrl() + "\t"
                    + parser.getStatus() + "\t" + parser.getSize());
            // Only the value matters, so the key is NullWritable.
            context.write(NullWritable.get(), out);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(LogHive.class);
        job.setJobName("Log Hive Job");

        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        job.setMapperClass(LogHiveMapper.class);
        job.setNumReduceTasks(0); // map-only job: no reduce phase

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        // waitForCompletion() submits the job and blocks until it finishes.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new LogHive(), args);
        System.exit(exitCode);
    }
}


Note: this is a map-only job; no reducer is needed. The map key is NullWritable because only the value needs to be written out. The LogParser helper used in the mapper is not shown in the original post; a sketch follows.
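
Since LogParser is not included in the post, here is a hypothetical minimal implementation, assuming the Common Log Format of the sample lines above (host, two dashes, bracketed timestamp, quoted request, status, size):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for Common Log Format lines such as:
// 196.13.32.71 - - [09/Aug/1995:03:21:17 -0400] "GET /icons/image.xbm HTTP/1.0" 200 509
public class LogParser {
    // host - - [dd/Mon/yyyy:HH:mm:ss zone] "METHOD url PROTO" status size
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[(\\d{2})/(\\w{3})/(\\d{4}):(\\d{2}):\\d{2}:\\d{2} [^\\]]+\\] \"\\S+ (\\S+) [^\"]*\" (\\d{3}) (\\S+)");

    private final Matcher m;

    private LogParser(String line) {
        m = LINE.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Unparseable log line: " + line);
        }
    }

    public static LogParser parser(String line) {
        return new LogParser(line);
    }

    public String getIp()     { return m.group(1); }
    public int    getDay()    { return Integer.parseInt(m.group(2)); }
    public String getMonth()  { return m.group(3); }
    public int    getYear()   { return Integer.parseInt(m.group(4)); }
    public int    getHour()   { return Integer.parseInt(m.group(5)); }
    public String getUrl()    { return m.group(6); }
    public int    getStatus() { return Integer.parseInt(m.group(7)); }
    // CLF uses "-" when the response size is unknown; treat it as 0.
    public int    getSize()   { return "-".equals(m.group(8)) ? 0 : Integer.parseInt(m.group(8)); }
}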

2. Create the log_history Hive table and load the formatted logs into it.

Create the table:

CREATE TABLE log_history(ip STRING, year INT, month STRING, day INT, hour INT, url STRING, status INT, size INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';


Load the data:

LOAD DATA INPATH '/usr/local/log_history' OVERWRITE INTO TABLE log_history;

Analyze the number of hits on each day of the month (1-31), aggregated across all months:

hive> select day, count(day) from log_history group by day;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201312030304_0006, Tracking URL = http://feixu-master:50030/jobdetails.jsp?jobid=job_201312030304_0006
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201312030304_0006
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2013-12-04 10:43:13,586 Stage-1 map = 0%, reduce = 0%
2013-12-04 10:43:26,703 Stage-1 map = 12%, reduce = 0%
2013-12-04 10:43:28,728 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:29,743 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:30,759 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:31,777 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:32,797 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:33,816 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:34,831 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:35,864 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 9.47 sec
2013-12-04 10:43:36,879 Stage-1 map = 88%, reduce = 0%, Cumulative CPU 9.47 sec
2013-12-04 10:43:37,890 Stage-1 map = 88%, reduce = 0%, Cumulative CPU 9.47 sec
2013-12-04 10:43:38,904 Stage-1 map = 88%, reduce = 0%, Cumulative CPU 9.47 sec
2013-12-04 10:43:39,919 Stage-1 map = 88%, reduce = 0%, Cumulative CPU 9.47 sec
2013-12-04 10:43:40,943 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.49 sec
2013-12-04 10:43:41,959 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.49 sec
2013-12-04 10:43:42,975 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.49 sec
2013-12-04 10:43:43,989 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 12.49 sec
2013-12-04 10:43:45,002 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.7 sec
2013-12-04 10:43:46,019 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.7 sec
2013-12-04 10:43:47,031 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.7 sec
2013-12-04 10:43:48,046 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.7 sec
MapReduce Total cumulative CPU time: 13 seconds 700 msec
Ended Job = job_201312030304_0006
MapReduce Jobs Launched:
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 13.7 sec HDFS Read: 303841163 HDFS Write: 288 SUCCESS
Total MapReduce CPU Time Spent: 13 seconds 700 msec
OK
1 98710
2 60265
3 130972
4 130009
5 126468
6 133380
7 144595
8 99024
9 95730
10 134108
11 141653
12 130607
13 170683
14 143981
15 104379
16 104507
17 133969
18 120528
19 104832
20 99556
21 120169
22 93029
23 97296
24 116811
25 120020
26 90457
27 94503
28 82617
29 67988
30 80641
31 90125
Time taken: 42.899 seconds, Fetched: 31 row(s)
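
The remaining analyses from Option A follow the same pattern. A few sketches against the same log_history table (the domain query is an assumption: it uses regexp_extract to keep the last two labels of the hostname, which is only meaningful for rows where the client resolved to a hostname rather than a bare IP):

-- hits per hour of day
SELECT hour, COUNT(*) FROM log_history GROUP BY hour;

-- top 20 clients by hits
SELECT ip, COUNT(*) AS hits FROM log_history GROUP BY ip ORDER BY hits DESC LIMIT 20;

-- hits per domain (last two labels of the hostname)
SELECT regexp_extract(ip, '[^.]+\\.[^.]+$', 0) AS domain, COUNT(*) AS hits
FROM log_history
GROUP BY regexp_extract(ip, '[^.]+\\.[^.]+$', 0);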


References:
http://hive.apache.org/ (Apache Hive official site)
http://blog.fens.me/hadoop-hive-intro/ (Hive installation and usage guide, in Chinese)
