Hive Tutorial

Concepts

What Is Hive

Hive is a data warehouse infrastructure based on Apache Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.

Hive is designed to enable easy data summarization, ad hoc querying, and analysis of large volumes of data. It provides SQL that enables users to do ad hoc querying, summarization, and data analysis easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality to do custom analysis, such as user-defined functions (UDFs).

What Hive Is NOT

Hive is not designed for online transaction processing (OLTP). It is best used for traditional data warehousing tasks.

Data Units

By granularity, Hive data is organized into: Databases, Tables, Partitions, and Buckets (or Clusters).

Partitions: each table can have one or more partition keys that determine how the data is stored.
      Besides being storage units, partitions allow the user to efficiently identify the rows that satisfy a specified partition criterion.
Buckets (or Clusters): data in each partition may in turn be divided into buckets, based on the value of a hash function of some column of the table.

Type System

Hive supports primitive and complex data types.
1. Primitive types

Integers
    TINYINT — 1 byte integer
    SMALLINT — 2 byte integer
    INT — 4 byte integer
    BIGINT — 8 byte integer
Boolean type
    BOOLEAN — TRUE/FALSE
Floating point numbers
    FLOAT — single precision
    DOUBLE — double precision
Fixed point numbers
    DECIMAL — a fixed point value of user defined scale and precision
String types
    STRING — sequence of characters in a specified character set
    VARCHAR — sequence of characters in a specified character set with a maximum length
    CHAR — sequence of characters in a specified character set with a defined length
Date and time types
    TIMESTAMP — a specific point in time, up to nanosecond precision
    DATE — a date
Binary types
    BINARY — a sequence of bytes

2. Complex types
Complex types can be built up from primitive types and other composite types.

Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the field a is accessed by the expression c.a.
Maps (key-value tuples): elements are accessed using the [element_name] notation.
Arrays (indexable lists): the elements in an array have to be of the same type. Elements are accessed using the [index] notation.
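
As a minimal sketch (the table and column names here are illustrative, not from the original tutorial), a table using all three complex types could be declared as:

CREATE TABLE complex_demo(
    s   STRUCT<a:INT, b:INT>,
    m   MAP<STRING, INT>,
    arr ARRAY<STRING>
);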

Built In Operators and Functions


In Beeline or the Hive CLI, use the following commands to show the latest documentation:
  SHOW FUNCTIONS;
  DESCRIBE FUNCTION <function_name>;
  DESCRIBE FUNCTION EXTENDED <function_name>;

Note: Hive keywords, including operator and function names, are case-insensitive.
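
For example, to pull up the documentation for a specific function such as concat:

  DESCRIBE FUNCTION concat;
  DESCRIBE FUNCTION EXTENDED concat;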

1. Operators
⑴ Relational operators: return TRUE or FALSE.

=   !=  <   <=  >   >=   IS NULL    IS NOT NULL     LIKE    RLIKE   REGEXP

Relational operator (operand types): description

A = B (all primitive types): TRUE if expression A is equivalent to expression B; otherwise FALSE.
A != B (all primitive types): TRUE if expression A is not equivalent to expression B; otherwise FALSE.
A < B (all primitive types): TRUE if expression A is less than expression B; otherwise FALSE.
A <= B (all primitive types): TRUE if expression A is less than or equal to expression B; otherwise FALSE.
A > B (all primitive types): TRUE if expression A is greater than expression B; otherwise FALSE.
A >= B (all primitive types): TRUE if expression A is greater than or equal to expression B; otherwise FALSE.
A IS NULL (all types): TRUE if expression A evaluates to NULL; otherwise FALSE.
A IS NOT NULL (all types): FALSE if expression A evaluates to NULL; otherwise TRUE.
A LIKE B (strings): TRUE if string A matches the SQL simple regular expression B; otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' evaluates to TRUE, and so does 'foobar' LIKE 'foo%'. To escape %, use \ (\% matches one % character). If the data contains a semicolon and you want to search for it, it needs to be escaped: columnValue LIKE 'a\;b'.
A RLIKE B (strings): NULL if A or B is NULL; TRUE if any (possibly empty) substring of A matches the Java regular expression B (see the Java regular expression syntax); otherwise FALSE. For example, 'foobar' RLIKE 'foo' evaluates to TRUE, and so does 'foobar' RLIKE '^f.*r$'.
A REGEXP B (strings): same as RLIKE.

Notes:
A LIKE B: operand types (A, B) are strings. TRUE if string A matches the SQL simple regular expression B, otherwise FALSE.
    _ matches any single character (like . in a regular expression);
    % matches any number of characters (like .* in a regular expression).
        Examples: 'foobar' LIKE 'foo'      --> FALSE
                  'foobar' LIKE 'foo___'   --> TRUE
                  'foobar' LIKE 'foo%'     --> TRUE
        Use \% to match a literal %.
A RLIKE B: operand types (A, B) are strings. NULL if A or B is NULL; TRUE if any substring of A matches the Java regular expression B, otherwise FALSE.
        Examples: 'foobar' RLIKE 'foo'      --> TRUE
                  'foobar' RLIKE '^f.*r$'   --> TRUE
A REGEXP B: operand types (A, B) are strings; same as RLIKE.
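
These semantics are easy to verify from a Hive session with constant expressions (a sketch; recent Hive versions allow SELECT without a FROM clause):

SELECT 'foobar' LIKE 'foo',       -- false
       'foobar' LIKE 'foo___',    -- true
       'foobar' LIKE 'foo%',      -- true
       'foobar' RLIKE '^f.*r$';   -- true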

⑵ Arithmetic operators: all return values are numeric types.

+   -   *   /   %   &    |  ^   ~

Arithmetic operator (operand types): description

A + B (all number types): gives the result of adding A and B. The type of the result is the common parent (in the type hierarchy) of the types of the operands; for example, since every integer is a float, float is a containing type of integer, so the + operator on a float and an int results in a float.
A - B (all number types): gives the result of subtracting B from A. The type of the result is the common parent (in the type hierarchy) of the types of the operands.
A * B (all number types): gives the result of multiplying A and B. The type of the result is the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
A / B (all number types): gives the result of dividing A by B. The type of the result is the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, the result is the quotient of the division.
A % B (all number types): gives the remainder resulting from dividing A by B. The type of the result is the common parent (in the type hierarchy) of the types of the operands.
A & B (all number types): gives the result of bitwise AND of A and B. The type of the result is the common parent (in the type hierarchy) of the types of the operands.
A | B (all number types): gives the result of bitwise OR of A and B. The type of the result is the common parent (in the type hierarchy) of the types of the operands.
A ^ B (all number types): gives the result of bitwise XOR of A and B. The type of the result is the common parent (in the type hierarchy) of the types of the operands.
~A (all number types): gives the result of bitwise NOT of A. The type of the result is the same as the type of A.
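
A quick sketch of type promotion and the bitwise operators, again using constant expressions (assumes a Hive version that allows SELECT without FROM):

SELECT 1 + 2.0,   -- int + double promotes to double: 3.0
       7 % 3,     -- remainder: 1
       5 & 3,     -- bitwise AND: 1
       ~0;        -- bitwise NOT: -1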

⑶ Logical operators: all return values are boolean.

AND     &&      OR      ||      NOT     !

Logical operator (operand types): description

A AND B (boolean): TRUE if both A and B are TRUE; otherwise FALSE.
A && B (boolean): same as A AND B.
A OR B (boolean): TRUE if either A or B or both are TRUE; otherwise FALSE.
A || B (boolean): same as A OR B.
NOT A (boolean): TRUE if A is FALSE; otherwise FALSE.
!A (boolean): same as NOT A.

Notes:
  Operand types are all boolean.
  AND is the same as &&, OR is the same as ||, and NOT is the same as !.

⑷ Operators on complex types

Operator (operand types): description

A[n] (A is an Array and n is an int): returns the nth element of the array A. The first element has index 0; for example, if A is an array comprising ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.
M[key] (M is a Map<K, V> and key has type K): returns the value corresponding to the key in the map; for example, if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
S.x (S is a struct): returns the x field of S; for example, for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.
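
Continuing the hypothetical complex_demo table sketched earlier, all three access operators in one query:

SELECT s.a,          -- struct field access with the DOT operator
       m['key1'],    -- map lookup by key ('key1' is illustrative)
       arr[0]        -- array access; indexes start at 0
FROM complex_demo;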

2. Built-in functions

Return type, function signature, and description:

BIGINT round(double a): returns the rounded BIGINT value of the double.
BIGINT floor(double a): returns the maximum BIGINT value that is equal to or less than the double.
BIGINT ceil(double a): returns the minimum BIGINT value that is equal to or greater than the double.
double rand(), rand(int seed): returns a random number (that changes from row to row). Specifying the seed makes the generated random number sequence deterministic.
string concat(string A, string B, ...): returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them.
string substr(string A, int start): returns the substring of A starting from the start position until the end of A. For example, substr('foobar', 4) results in 'bar'.
string substr(string A, int start, int length): returns the substring of A starting from the start position with the given length. For example, substr('foobar', 4, 2) results in 'ba'.
string upper(string A): returns the string resulting from converting all characters of A to upper case. For example, upper('fOoBaR') results in 'FOOBAR'.
string ucase(string A): same as upper.
string lower(string A): returns the string resulting from converting all characters of A to lower case. For example, lower('fOoBaR') results in 'foobar'.
string lcase(string A): same as lower.
string trim(string A): returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar'.
string ltrim(string A): returns the string resulting from trimming spaces from the beginning (left-hand side) of A. For example, ltrim(' foobar ') results in 'foobar '.
string rtrim(string A): returns the string resulting from trimming spaces from the end (right-hand side) of A. For example, rtrim(' foobar ') results in ' foobar'.
string regexp_replace(string A, string B, string C): returns the string resulting from replacing all substrings in A that match the Java regular expression B with C. For example, regexp_replace('foobar', 'oo|ar', '') returns 'fb'.
int size(Map<K,V>): returns the number of elements in the map type.
int size(Array<T>): returns the number of elements in the array type.
value of <type> cast(<expr> as <type>): converts the result of the expression expr to <type>. For example, cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
string from_unixtime(int unixtime): converts the number of seconds from the UNIX epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
string to_date(string timestamp): returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
int year(string date): returns the year part of a date or timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int month(string date): returns the month part of a date or timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int day(string date): returns the day part of a date or timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
string get_json_object(string json_string, string path): extracts a JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted object. Returns NULL if the input JSON string is invalid.
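
A few of these functions exercised together in one statement, as a sketch with constant inputs (again assuming SELECT without FROM is available):

SELECT concat('foo', 'bar'),            -- 'foobar'
       substr('foobar', 4),             -- 'bar'
       upper('fOoBaR'),                 -- 'FOOBAR'
       trim('  foobar  '),              -- 'foobar'
       cast('1' as BIGINT),             -- 1
       to_date('1970-01-01 00:00:00');  -- '1970-01-01'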

The following built-in aggregate functions are supported in Hive:

Return type, aggregate function signature, and description:

BIGINT count(*), count(expr), count(DISTINCT expr[, expr...]): count(*) returns the total number of retrieved rows, including rows containing NULL values; count(expr) returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr...]) returns the number of rows for which the supplied expression(s) are unique and non-NULL.
DOUBLE sum(col), sum(DISTINCT col): returns the sum of the elements in the group, or the sum of the distinct values of the column in the group.
DOUBLE avg(col), avg(DISTINCT col): returns the average of the elements in the group, or the average of the distinct values of the column in the group.
DOUBLE min(col): returns the minimum value of the column in the group.
DOUBLE max(col): returns the maximum value of the column in the group.

Language Capabilities

Hive's SQL provides the basic SQL operations. These operations work on tables or partitions. They are:

  • Ability to filter rows from a table using a WHERE clause.
  • Ability to select certain columns from a table using a SELECT clause.
  • Ability to do equi-joins between two tables.
  • Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
  • Ability to store the results of a query into another table.
  • Ability to download the contents of a table to a local (e.g., NFS) directory.
  • Ability to store the results of a query in a Hadoop DFS directory.
  • Ability to manage tables and partitions (create, drop, and alter).
  • Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.

Usage and Examples

Note: many of the examples below are out of date. More up-to-date information can be found in the LanguageManual.

1. Creating, Showing, Altering, and Dropping Tables

See Hive Data Definition Language for details on creating, showing, altering, and dropping tables.

⑴ Creating Tables

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
-- The PARTITIONED BY clause defines partitioning columns, which are distinct from the data columns.
-- Data in the files is assumed to be field-delimited by ASCII 001 (ctrl-A) and row-delimited by newline.

To customize the field delimiter, use ROW FORMAT DELIMITED FIELDS TERMINATED BY:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;

The row delimiter currently cannot be changed, since it is not determined by Hive but by Hadoop.

It is also a good idea to bucket the table on certain columns so that efficient sampling queries can be run against the dataset. Without bucketing, random sampling can still be done on the table, but the query has to scan the whole table, so it is not efficient. The following example illustrates bucketing the page_view table on the userid column:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
-- The table is clustered by a hash function of userid into 32 buckets.
-- Within each bucket, the data is sorted in increasing order of viewTime.
-- Such an organization allows the user to do efficient sampling on the clustered column, here userid.

⑵ Showing table and partition information

List all tables: show tables;
List all tables matching a pattern: show tables 'page.*';
List the partitions of a table: show partitions table_name;
List all columns of a table and their types: describe table_name;
List all columns and all other properties of a table: describe extended table_name;
List the columns and all other properties of one partition of a table: describe extended table_name partition (partition_col=val);
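
For example, against the page_view table created above (the partition values match those used in the later examples):

show tables 'page.*';
show partitions page_view;
describe extended page_view partition (dt='2008-06-08', country='US');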

⑶ Altering tables

Rename an existing table. If the new table name already names another table, an error is returned:
    alter table old_name rename to new_name;
Rename the columns of an existing table. Be sure to use the same column types and to include an entry for each pre-existing column (see the sketch after this list):
    alter table table_name replace columns (col1 type, ...);  -- Note: when only renaming a column, you must still list every column name and type; any column omitted or mistyped is changed accordingly
Add columns:
    alter table table_name add columns (c1 int comment 'a new int column', c2 string default 'def val');
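
A minimal sketch of the rename-via-replace pattern (the table t1 and its columns are hypothetical): to rename column b of t1(a int, b string) to c, the full column list is repeated:

    alter table t1 replace columns (a int, c string);  -- keeps a unchanged, renames b to c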

⑷ Dropping tables and partitions

Dropping a table also implicitly drops any indexes built on the table:
    drop table table_name;
Dropping a partition does not remove the partition column itself; only the data of the given partition value is removed (note that .* would drop the data of all partitions):
    alter table table_name drop partition (ds='2008-08-08');

2. Loading Data

There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location in HDFS, copy files into that location with the HDFS put or copy commands, and create a table pointing to that location with all the relevant row format information. Once this is done, the user can transform the data and insert it into any other Hive table.

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
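
As an alternative to hadoop dfs -put, the same staging file can be brought in with Hive's LOAD DATA statement (a sketch, using the path from the example above):

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08.txt' INTO TABLE page_view_stg;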

3、Querying and Inserting Data

⑴ Simple Query

INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;

SELECT user.*
FROM user
WHERE user.active = 1;

⑵ Partition Based Query

INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
      page_views.referrer_url like '%xyz.com';
-- The partitions were defined when the table was created: PARTITIONED BY(date DATETIME, country STRING)

⑶ Joins
Hive supports only equality-based join conditions (equi-joins):

-- ------ join ------
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

-- ------ LEFT OUTER, RIGHT OUTER or FULL OUTER ------
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

-- ------ LEFT SEMI JOIN ------
-- checks for the existence of keys in the other table
INSERT OVERWRITE TABLE pv_users
SELECT u.*
FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

-- ------ multiple JOINs ------
INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
WHERE pv.date = '2008-03-03';

⑷ Aggregations

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

⑸ Multi Table/File Inserts

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender

INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;

⑹ Dynamic-Partition Insert

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';

The statement above is a rather bad example: we have to know all of the country values in advance, and whenever dt changes we have to add new insert statements, e.g. when another country='DC' appears or dt='2008-09-10'.
Dynamic-partition insert was designed to solve this problem, so the whole job reduces to:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country;

Notes:
 - In the statement above, dt is a static partition column (its value is always 2008-06-08) and country is a dynamic partition column.
 - The values of dynamic partition columns come from input columns.
 - Currently, dynamic partition columns must be the last column(s) in the PARTITION clause, because the partition column order indicates its hierarchical order; a partition clause like (dt, country='US') is therefore not allowed.
 - The extra pvs.country column in the select statement is the input column corresponding to the dynamic partition column. Note that you do not need to add an input column for the static partition column, because its value is already known from the PARTITION clause.

Note that dynamic partition values are selected by ordering, not by name, and are taken as the last columns of the select clause (i.e., the value of a dynamic partition column comes from the trailing column(s) of the select clause, not from name matching).

Semantics of the dynamic partition insert statement:
  - If a non-empty partition already exists for a dynamic partition column (for example, country='CA' exists under some ds root partition), it is overwritten when the dynamic partition insert sees the same value (say 'CA') in the input data.
  - Because a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS path format. Any character with a special meaning in a URI (e.g., '%', ':', '/', '#') is escaped as '%' followed by its 2-byte ASCII value.
  - If the input column has a type other than string, its value is first converted to a string to be used to construct the HDFS path.
  - If the input column value is NULL or an empty string, the row is put into a special partition whose name is controlled by the Hive parameter hive.exec.default.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically this partition contains all the "bad" rows whose values are not valid partition names. The caveat of this approach is that the bad value is lost and replaced by __HIVE_DEFAULT_PARTITION__ if you select these rows in Hive. JIRA HIVE-1309 proposes letting the user specify a "bad file" to retain the input partition column values as well.
  - Dynamic partition insert can be a resource hog, since it can generate a large number of partitions in a short time. To guard against this, three parameters are defined:
  - hive.exec.max.dynamic.partitions.pernode (default 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than the threshold, a fatal error is raised from the mapper/reducer (through a counter) and the whole job is killed.
  - hive.exec.max.dynamic.partitions (default 1000) is the total number of dynamic partitions that may be created by one DML. If no single mapper/reducer exceeds its limit but the total number of dynamic partitions does, an exception is raised at the end of the job before the intermediate data is moved to the final destination.
  - hive.exec.max.created.files (default 100000) is the maximum total number of files created by all mappers and reducers. It is implemented by each mapper/reducer updating a Hadoop counter whenever a new file is created. If the total exceeds hive.exec.max.created.files, a fatal error is raised and the job is killed.
  - Another situation we want to protect against is a user accidentally specifying all partitions as dynamic partitions without declaring one static partition, when the original intention was only to overwrite the sub-partitions of one root partition. The parameter hive.exec.dynamic.partition.mode=strict prevents this all-dynamic-partition case: in strict mode you must specify at least one static partition. The default mode is strict. In addition, the parameter hive.exec.dynamic.partition=true/false controls whether dynamic partitions are allowed at all; the default was false before Hive 0.9.0 and is true from Hive 0.9.0 on. (A combined example follows this list.)
  - In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns the merge parameters off. Merging files with dynamic partition inserts is supported in Hive 0.7 (see JIRA HIVE-1307 for details).
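
Putting the parameters and the statement together, a fully dynamic insert (both dt and country dynamic) could look like the sketch below; deriving dt from viewTime assumes viewTime holds epoch seconds, which the original schema does not state:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;  -- no static partition left, so strict mode must be relaxed

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
              null, null, pvs.ip,
              from_unixtime(pvs.viewTime, 'yyyy-MM-dd'),  -- assumption: viewTime is epoch seconds
              pvs.country;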

⑺ Inserting into Local Files

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;

⑻ Sampling
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
Select the third bucket out of the 32 buckets of pv_gender_sum:

INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

⑼ Union All

INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid
    FROM action_video av
    WHERE av.date = '2008-06-03'

    UNION ALL

    SELECT ac.uid AS uid
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
    ) actions JOIN users u ON(u.id = actions.uid);

Further topics covered in the full tutorial (https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-ArrayOperations):
Array Operations
Map (Associative Arrays) Operations
Custom Map/Reduce Scripts
Co-Groups
