Big Data: Hive Basic Operations


DDL


create

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];
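
A minimal sketch of how these clauses combine; the database, table, and column names below are made up for illustration:

CREATE TABLE IF NOT EXISTS demo_db.page_click (
    user_id    BIGINT    COMMENT 'user identifier',
    page_url   STRING,
    click_time TIMESTAMP
)
COMMENT 'example click log table'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES ('creator'='demo');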

Column data types (data_type)

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)

primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)

array_type
  : ARRAY < data_type >

map_type
  : MAP < primitive_type, data_type >

struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>

union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
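
A sketch of a table that uses these complex types (hypothetical column names); elements are read with array indexes, map keys, and dot notation:

CREATE TABLE complex_demo (
    tags    ARRAY<STRING>,
    props   MAP<STRING, INT>,
    address STRUCT<city:STRING, zip:STRING>
);

SELECT tags[0], props['height'], address.city FROM complex_demo;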

Field delimiters: ROW FORMAT row_format

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
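
For example, a delimited text table whose collection and map delimiters are spelled out (the delimiter characters here are only an assumption, adjust them to the actual data):

CREATE TABLE delim_demo (
    name    STRING,
    hobbies ARRAY<STRING>,
    scores  MAP<STRING, INT>
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '|'
    MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;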

Storage format: STORED AS file_format

file_format:    -- the default is controlled by the hive.default.fileformat setting
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
                -- TEXTFILE data is split with the DELIMITED row format; enable escaping for the delimiter characters
                -- with the 'ESCAPED BY' clause (such as ESCAPED BY '\') if the data itself can contain those characters.
                -- A custom NULL format can also be specified using the 'NULL DEFINED AS' clause (default is '\N').
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
                -- the default is TextInputFormat, where one line is one record. For multi-line, structured documents
                -- you need a custom InputFormat, OutputFormat, and SerDe. Reference: https://www.coder4.com/archives/4031
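
As a sketch, STORED AS TEXTFILE is shorthand for spelling out the input and output classes explicitly (the same class pair listed for TEXTFILE further down):

CREATE TABLE textfile_demo (line STRING)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';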

constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE ]
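
A sketch of these informational (DISABLE NOVALIDATE, i.e. not enforced) constraints, assuming a Hive version that supports them (2.1.0+) and made-up table names:

CREATE TABLE dept_demo (id INT, name STRING,
    PRIMARY KEY (id) DISABLE NOVALIDATE);

CREATE TABLE emp_demo (id INT, dept_id INT,
    CONSTRAINT fk_dept FOREIGN KEY (dept_id) REFERENCES dept_demo(id) DISABLE NOVALIDATE);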


Different storage types use different INPUTFORMAT and OUTPUTFORMAT classes; you can also specify custom INPUTFORMAT and OUTPUTFORMAT classes.

For example:
TEXTFILE corresponds to:
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

RCFILE corresponds to:
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat

Besides differing in INPUTFORMAT and OUTPUTFORMAT, the SerDes they can be paired with are also restricted. For example:
RCFILE defaults to:
org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
TEXTFILE can be configured with SerDes such as RegEx, JSON, or CSV/TSV.
If the file format is RCFILE but the SerDe is configured as JSON, the query fails with: Failed with exception Wrong file format. Please check the file's format.

ROW FORMAT SERDE & STORED AS INPUTFORMAT OUTPUTFORMAT

This part is more involved, so it is only covered briefly here.
References: https://www.coder4.com/archives/4031
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
SerDe is a short name for "Serializer and Deserializer."
Hive uses SerDe (and FileFormat) to read and write table rows.
HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

To summarize, when Hive reads a file on HDFS it proceeds as follows:

(1) The InputFormat splits the file into records; each record is one row (Row).
(2) The SerDe's Deserializer splits each row into its fields.

When Hive executes an INSERT and writes rows out to files, it calls the OutputFormat and the SerDe's Serializer, in the reverse order of the read path.

hive -e "create table hive_json_test(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE;"

An example using a JSON SerDe:
Reference: https://github.com/rcongiu/Hive-JSON-Serde/tree/develop

1. Download the hcatalog package matching your Hive version (it contains the JsonSerDe class) into the Hive lib directory: https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core
2. Add the jar path to hive-site.xml:
<property>
    <name>hive.aux.jars.path</name>
    <value>file:///usr/local/execpackage/apache-hive-0.13.0-bin/lib/hive-hcatalog-core-0.13.0.jar</value>
</property>
3. Test:
hive -e "load data local inpath '/Users/hadoop/Desktop/ws/tempfiles/hive_json.1.txt' into table hive_json_test;"
hive -e "select * from hive_json_test;"
hive -e "alter table hive_json_test add columns(c string);"
hive -e "load data local inpath '/Users/hadoop/Desktop/ws/tempfiles/hive_json.2.txt' into table hive_json_test;"
where:
hive_json.1.txt:
{"a":"va2","b":"vb2"}
hive_json.2.txt:
{"a":"va2","b":"vb2","c":"vc"}
Output:
va2 vb2 NULL
va2 vb2 vc


If a record is not well-formed JSON, it will fail the whole job.

Use the following setting to return NULL values instead:
ALTER TABLE json_table SET SERDEPROPERTIES ( 'ignore.malformed.json' = 'true');
https://stackoverflow.com/questions/14137688/hive-elasticmapreduce-how-bring-jsonserde-to-ignore-malformed-json

Specifying table properties when creating a table

TBLPROPERTIES ("comment"="table_comment") – a description of the table
TBLPROPERTIES ("hbase.table.name"="table_name") – see HBase Integration.
TBLPROPERTIES ("immutable"="true") or ("immutable"="false") in release 0.13.0+ (HIVE-6406) – see Inserting Data into Hive Tables from Queries.
TBLPROPERTIES ("orc.compress"="ZLIB") or ("orc.compress"="SNAPPY") or ("orc.compress"="NONE") and other ORC properties – see ORC Files.
TBLPROPERTIES ("transactional"="true") or ("transactional"="false") in release 0.14.0+, the default is "false" – see Hive Transactions.
TBLPROPERTIES ("NO_AUTO_COMPACTION"="true") or ("NO_AUTO_COMPACTION"="false"), the default is "false" – see Hive Transactions.
TBLPROPERTIES ("compactor.mapreduce.map.memory.mb"="mapper_memory") – see Hive Transactions.
TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.num.threshold"="threshold_num") – see Hive Transactions.
TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.pct.threshold"="threshold_pct") – see Hive Transactions.
TBLPROPERTIES ("auto.purge"="true") or ("auto.purge"="false") in release 1.2.0+ (HIVE-9118) – see Drop Table and Drop Partitions.
TBLPROPERTIES ("EXTERNAL"="TRUE") in release 0.6.0+ (HIVE-1329) – Change a managed table to an external table and vice versa for "FALSE".
TBLPROPERTIES("parquet.mr.int96.write.zone"="timeZoneId") (HIVE-12767) – Property set on Parquet tables only, storing the Java time zone ID used 
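
For example, a sketch that combines a few of these properties on an ORC table (the values are illustrative only):

CREATE TABLE orc_demo (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES (
    'comment'      = 'example ORC table',
    'orc.compress' = 'SNAPPY',
    'auto.purge'   = 'true'
);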

STORED BY

Specifies a non-native table format via a storage handler, e.g. for HBase or Druid.
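
For example, the HBase integration described in the Hive wiki uses a storage handler like this (the column mapping and HBase table name depend on your setup):

CREATE TABLE hbase_table_1(key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");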

The difference between managed (internal) and external tables

Managed tables:
1. Both the metadata and the data files are managed by Hive.
2. The data location is determined by the hive.metastore.warehouse.dir parameter.
3. Dropping the table deletes the underlying data.
4. Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.
External tables:
1. They wrap existing external data (e.g. on HDFS), creating only the metadata for it.
2. When the external data layout or partitions change, run `MSCK REPAIR TABLE table_name;` or `ALTER TABLE table_name ADD PARTITION ...`; the EMR equivalent is
`ALTER TABLE table_name RECOVER PARTITIONS;` (see the sketch after this list).
If a partition directory contains subdirectories, msck reports an error; set hive.msck.path.validation=ignore;
to suppress it, or remove the subdirectories.
3. Dropping the table does not delete the external data.
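
A minimal sketch of point 2 (all paths and names here are hypothetical): after new partition directories are created directly on HDFS, refresh the metastore:

CREATE EXTERNAL TABLE ext_logs (line STRING)
PARTITIONED BY (dt STRING)
LOCATION '/data/ext_logs';

-- new directories such as /data/ext_logs/dt=20200101 are written outside of Hive, then:
MSCK REPAIR TABLE ext_logs;
-- or register a single partition explicitly:
ALTER TABLE ext_logs ADD PARTITION (dt='20200102') LOCATION '/data/ext_logs/dt=20200102';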

Convert a managed table to an external table (and vice versa):
hive -e "ALTER TABLE table_name SET TBLPROPERTIES ('EXTERNAL'='TRUE');"

Partitions

1. Multiple partition columns can be specified.
2. Partition column names must not duplicate regular column names.
  • Managed table partitions
1. Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
  • External table partitions
1. Create the external table and specify its location:
CREATE EXTERNAL TABLE my_daily_report(  
    last_update string,  
    col_a string,  
    col_b string,  
    col_c string)  
    PARTITIONED BY (  
    par_dt string) location '/user/chenshu/data/daily'; 
2. Add a partition and attach its data:
alter table my_daily_report add partition (par_dt='20140530') location '/user/chenshu/data/daily/my_daily_report/20140530';  

Create Table As Select (CTAS)
* Example

CREATE TABLE new_key_value_store
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
   STORED AS RCFile
   AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;

1. You can freely choose the target table's SerDe and storage file format.
2. The target table's column names come from the SELECT list. If (key % 1024) is given the alias new_key, the target table gets a new_key column; otherwise columns are auto-named _col0, _col1, _col2, and so on.
  • Properties:
1. The operation is atomic: while it runs, the target table is not visible to other users.
2. The target table cannot be a partitioned table.
3. The target table cannot be an external table.
4. The target table cannot be a list bucketing table.

Temporary Tables
* Example

CREATE TEMPORARY TABLE TABLE_NAME_HERE (key string, value string)

The steps would be:
1. Create the table
2. Fill the table: INSERT OVERWRITE TABLE temptbl <select_query>
3. Run the query
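
A sketch of those steps with hypothetical table names:

CREATE TEMPORARY TABLE temptbl (key STRING, value STRING);
INSERT OVERWRITE TABLE temptbl SELECT key, value FROM some_source_table;
SELECT count(*) FROM temptbl;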

By default the data is stored in the user's scratch directory; with restrictive permission configurations this may not be usable.
  • Properties
1. Visible only to the current session; data lives in the user's scratch directory and is deleted when the session ends.
2. Partitions and indexes are not supported.
3. The storage medium is set with hive.exec.temporary.table.storage: memory, ssd, or default.
  • Reference
http://stackoverflow.com/questions/5385163/create-temporary-table-in-hive

Create Table Like
* Example

CREATE TABLE empty_key_value_store
LIKE key_value_store;
  • Properties
1. No data is copied.
2. The table definition is copied (e.g. SerDe and file formats).

Bucketed Sorted Tables
* Example

1. Create the table:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS;

2. Write data into the table:
set hive.enforce.bucketing = true;  -- (Note: Not needed in Hive 2.x onward)
FROM user_info
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds='2009-02-25')
SELECT userid, firstname, lastname WHERE ds='2009-02-25';

Note:
  set hive.enforce.bucketing = true is only needed for Hive 0.x and 1.x.
  If the hive.enforce.bucketing property is not used, set the number of reducers to match the number of buckets and add CLUSTER BY to the SELECT:
    set mapred.reduce.tasks=256;
    FROM user_info
    INSERT OVERWRITE TABLE user_info_bucketed
    PARTITION (ds='2009-02-25')
    SELECT userid, firstname, lastname WHERE ds='2009-02-25' CLUSTER BY user_id;
  • Properties
1. Higher query efficiency: joining buckets that hold the same column values greatly reduces the amount of data a JOIN has to process.
2. Sampling becomes more efficient (see the TABLESAMPLE sketch after this list).
3. Bucketing does not help write performance.
4. Bucket assignment: hash_function(bucketing_column) mod num_buckets, where hash_function depends on the type of bucketing_column.
   Set hive.enforce.bucketing = true (for Hive 0.x and 1.x) so that data is bucketed correctly.
5. If the data being inserted (as in the insert ... select ... example) has types that differ from the table's columns, or the manually specified CLUSTER BY column differs from the table definition, the rows end up in the wrong buckets.
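
A sketch of bucket sampling on the table above (point 2); TABLESAMPLE only reads the selected bucket instead of scanning the whole partition:

SELECT * FROM user_info_bucketed TABLESAMPLE(BUCKET 1 OUT OF 256 ON user_id) s
WHERE s.ds = '2009-02-25';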

UNION


Concepts

1. Before version 1.2.0, only UNION ALL is supported, which does not remove duplicates.
   From version 1.2.0 on, UNION removes duplicates by default.
2. UNION DISTINCT also removes duplicates, but since 1.2.0 it is no different from the default UNION behavior.
3. UNION ALL does not remove duplicates.
4. The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown.

Syntax

select_statement UNION [ALL | DISTINCT] select_statement UNION [ALL | DISTINCT] select_statement ...

Example

SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid, av.date AS date
    FROM action_video av
    WHERE av.date = '2008-06-03'
    UNION ALL
    SELECT ac.uid AS uid, ac.date AS date
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
 ) actions JOIN users u ON (u.id = actions.uid)

To apply ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, or LIMIT to the UNION result, place the clause after the last select_statement:

SELECT key FROM src
UNION
SELECT key FROM src1 
ORDER BY key LIMIT 10

References


Official documentation
