Introduction to Hive

Hive offers an SQL-like query language (HiveQL). The statements a user submits are translated into MapReduce jobs and executed on the cluster, which makes it practical to run SQL-style queries over data sets far larger than a traditional database can comfortably handle. Because Hive supports familiar SQL syntax, users are spared from writing MapReduce programs by hand, which lowers development cost: being fluent in SQL is enough, and there is no need to learn MapReduce separately, so the barrier to entry is low and Hive is correspondingly popular. Moreover, Hive was built for large-scale batch processing from the start, and it addresses the bottleneck traditional relational databases run into when processing big data.

Hive is a data warehouse built on top of Hadoop. It has no storage or computation capability of its own and depends entirely on HDFS and MapReduce, so you can think of it as a client that turns our SQL operations into MapReduce jobs and then runs them on Hadoop. Consider the following example:

id      city    name    sex           
0001    beijing zhangli man 
0002    guizhou lifang  woman 
0003    tianjin wangwei man 
0004    chengde wanghe  woman 
0005    beijing lidong  man 
0006    lanzhou wuting  woman 
0007    beijing guona   woman 
0008    chengde houkuo  man 
Given the table above, in which the fields on each line are separated by "\t", suppose we need to count how many records have city equal to beijing. A traditional MapReduce implementation looks like this:

package IT;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class Consumer
{
    public static String path1 = "hdfs://192.168.80.80:9000/consumer.txt";
    public static String path2 = "hdfs://192.168.80.80:9000/dir";
    public static void main(String[] args) throws Exception
    {
          FileSystem fileSystem = FileSystem.get(new URI(path1) , new Configuration());
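          // delete the output directory if it already exists, otherwise the job cannot write to it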
          if(fileSystem.exists(new Path(path2)))
          {
              fileSystem.delete(new Path(path2), true);
          }

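          // configure the job: input path and format, the mapper, and its map output key/value types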
          Job job = new Job(new Configuration(),"Consumer");
          FileInputFormat.setInputPaths(job, new Path(path1));
          job.setInputFormatClass(TextInputFormat.class);
          job.setMapperClass(MyMapper.class);
          job.setMapOutputKeyClass(Text.class);
          job.setMapOutputValueClass(LongWritable.class);

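          // a single reducer with the default hash partitioner is enough for this small job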
          job.setNumReduceTasks(1);
          job.setPartitionerClass(HashPartitioner.class);

          job.setReducerClass(MyReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(LongWritable.class);
          job.setOutputFormatClass(TextOutputFormat.class);
          FileOutputFormat.setOutputPath(job, new Path(path2));
          job.waitForCompletion(true);
          // print the job's output from HDFS
          FSDataInputStream fr = fileSystem.open(new Path(path2 + "/part-r-00000"));
          IOUtils.copyBytes(fr, System.out, 1024, true);
     }
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>
    {
            // per-mapper running count of records whose city is beijing
            private long sum = 0L;
            protected void map(LongWritable k1, Text v1, Context context) throws IOException, InterruptedException
            {
                  String[] splited = v1.toString().split("\t");
                  if(splited[1].equals("beijing"))
                  {
                      sum++;
                  }
            }
            // emit the partial count once, after this mapper has processed all of its records
            protected void cleanup(Context context) throws IOException, InterruptedException
            {
                  context.write(new Text("beijing"), new LongWritable(sum));
            }
    }
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable>
    {
            protected void reduce(Text k2, Iterable<LongWritable> v2s, Context context) throws IOException, InterruptedException
            {
                  // add up the partial counts emitted by the mappers
                  long total = 0L;
                  for (LongWritable v2 : v2s)
                  {
                      total += v2.get();
                  }
                  context.write(k2, new LongWritable(total));
            }
    }
}
Running the code above prints the result beijing 3: there are 3 records whose city is beijing.

With Hive, the same count takes nothing more than a single HiveQL query, and the result is:

OK
beijing 3
Time taken: 19.768 seconds, Fetched: 1 row(s)
As you can see, when we run HQL like this, what happens internally is exactly what the handwritten MapReduce code above does: Hive generates a MapReduce job for us and executes it. When we write a slightly more involved HiveQL statement, the console log shows the MapReduce processing explicitly:

hive> select city,count(*)
    > from t4    
    > where city='beijing'
    > group by city;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1478233923484_0902, Tracking URL = http://hadoop22:8088/proxy/application_1478233923484_0902/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1478233923484_0902
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-11-09 11:36:36,688 Stage-1 map = 0%,  reduce = 0%
2016-11-09 11:36:42,018 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-11-09 11:36:43,062 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-11-09 11:36:44,105 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-11-09 11:36:45,149 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-11-09 11:36:46,193 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-11-09 11:36:47,237 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-11-09 11:36:48,283 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-11-09 11:36:49,329 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.7 sec
2016-11-09 11:36:50,384 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.7 sec
MapReduce Total cumulative CPU time: 3 seconds 700 msec
Ended Job = job_1478233923484_0902
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.7 sec   HDFS Read: 419 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 700 msec
OK
beijing 3
Time taken: 19.768 seconds, Fetched: 1 row(s)
Again the final result is beijing 3, but the log makes it plain that MapReduce is doing the work underneath: Hive is, at heart, an SQL parsing engine.
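If you only want to see the plan Hive would generate, without running the job at all, HiveQL's EXPLAIN statement can be placed in front of a query. Here is a minimal sketch against the same t4 table; instead of launching a job, Hive prints the stage plan (for this query, a MapReduce stage followed by a fetch stage):

hive> explain
    > select city,count(*)
    > from t4
    > where city='beijing'
    > group by city;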

Hive supports many operations, such as creating managed (internal) tables, external tables, and partitioned tables. A managed table is the ordinary kind; its creation syntax looks like this:

create table tablename (
      id int,                     -- column name and type
      city string,
      name string,
      sex string
)
row format delimited              -- one line of the input file maps to one row of the table
fields terminated by '\t';        -- fields in the input file are separated by tab
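To tie this back to the earlier example, here is a hedged sketch of loading the sample file into such a table and running the count. The table name t4 matches the query log above, while the local file path is only an illustration:

load data local inpath '/home/hadoop/consumer.txt' into table t4;

select city, count(*)
from t4
where city = 'beijing'
group by city;

This returns beijing 3, the same answer the handwritten MapReduce program produced.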







