大数据学习日志（二）————基础知识

最新推荐文章于 2024-07-24 15:00:45 发布

夜色如墨

最新推荐文章于 2024-07-24 15:00:45 发布

阅读量97

点赞数

分类专栏：大数据文章标签： hive

本文链接：https://blog.csdn.net/yeserumo/article/details/108151199

版权

大数据专栏收录该内容

11 篇文章 0 订阅

订阅专栏

复杂数据类型

Array：数组类型，对应了Java中的数组或者集合类型
- 原始数据（注意中间有空格）
  
  2,3,4,1,3,4 3,4,2,4,5,6,2,3
  
  4,5,7,4,7,3,2 6,3,3,6,2,5,2
  
  7,3,6,8 2,4,47,7,2
  
  4,5,7,9,0,3 2,4,5,6,8,5
- 建表：create table num(nums1 array<int>,nums2 array<int>) row format delimited fields terminated by ' ' collection items terminated by ',';
- 加载数据：load data local inpath '/home/hivedata/arr.txt' into table num;
- 非空查询：select nums1[6] from num where nums1[6] is not null;
Map：映射类型，对应了Java中的Map类型
- 原始数据：
  
  1 tom,15
  2 sam,18
  3 jack,20
  4 lucy,16
  5 mark,22
- 建表：create table infos(id int,info map<string,int>) row format delimited fields terminated by ' ' map keys terminated by ',';
- 加载数据：load data local inpath '/home/hivedata/info.txt' into table infos;
- 非空查询：select nums1['jack'] from infos where nums1['jack'] is not null;
Struct：结构体类型，对应了Java中的对象
- 原始数据：
  
  1 tom,15,male alex,16,male
  2 amy,14,female bob,19,male
  3 lucy,15,female lily,16,female
- 建表： create table groups(groupid int,m1 struct<name:string,age:int,gender:string>,m2 struct<name:string,age:int,gender:string>) row format delimited fields terminated by ' ' collection items terminated by ',';
- 加载数据：load data local inpath '/home/hivedata/group.txt' into table groups;
- 查询数据：select m1.age from groups;

函数

为了对数据进行较为方便地操作，Hive提供了大量的内置运算符和内置函数：数学、类型转换、日期、条件、字符串、聚合等
expolde：将数组中的每一个元素提取出来成为单独一行
实例：单词统计
- 原始数据：
  
  hello tom hello bob hello joy
  hello rose hello joy hello
  jerry hello tom
  hello rose hello joy
- 建表（外部表）：create external table words (word array<string>) row format delimited collection items terminated by ' ' location '/words';
- 查询每个单词出现的次数：select w,count(w) from (select explode(word) w from words) ws group by w;
- 统计次数并且把结果写到本地目录中：insert overwrite local directory '/home/hivedata/' row format delimited fields terminated by '\t' select w,count(w) from (select explode(word) w from words) ws group by w;

Hive中的自定义函数分为三大类

UDF（User Defined Function）：用户定义函数，定入（每次都
- split：一行输入一行输出
UDAF
- Max（age）：多行输入、一行输出
UDTF
- explode：一行输入、多行输出

自定义函数：

package cn.tedu.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class RepeatString extends UDF {
    // 覆盖evaluate方法
    // 为什么父类中没有将此方法设计为抽象方法
    // 原因：业务不同，返回值和参数类型就不能确定
    public String evaluate(String str,int num){
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < num; i++) {
            sb.append(str);
        }
        return sb.toString();
    }
}

打成jar包后导入虚拟机中
add jar /hive_test-1.0-SNAPSHOT.jar
create temporary function stringrepeat as 'cn.tedu.udf.RepeatString';
select stringrepeat('abcd',5);

夜色如墨

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
打赏
0
评论
大数据学习日志（二）————基础知识

复杂数据类型Array：数组类型，对应了Java中的数组或者集合类型原始数据（注意中间有空格）2,3,4,1,3,4 3,4,2,4,5,6,2,34,5,7,4,7,3,2 6,3,3,6,2,5,27,3,6,8 2,4,47,7,24,5,7,9,0,3 2,4,5,6,8,5建表：create table num(nums1 array<int>,nums2 array<int>) row format delimited fields termin
复制链接

扫一扫