Hive Basic Operations

Using LOAD: creating the table records its metadata in MySQL (the metastore); the LOAD command then loads a local file into the table:

load data local inpath '/home/hadoop/emp.txt' into table emp_1;

load data local inpath '<file path>' [overwrite] into table <table name>
hive> create table emp_1
    >      (empno int,
    >      ename string,
    >      job string,
    >      mgr int,
    >      hirdate string,
    >      sal double,
    >      comm double,
    >      deptno int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.157 seconds
hive> load data local inpath '/home/hadoop/emp.txt' overwrite into table emp_1; 
Loading data to table default.emp_1
Table default.emp_1 stats: [numFiles=1, numRows=0, totalSize=700, rawDataSize=0]
OK
Time taken: 0.444 seconds


 

hive> select * from emp_1;
OK
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
8888    HIVE    PROGRAM 7839    1988-1-23       10300.0 NULL    NULL
Time taken: 0.125 seconds, Fetched: 15 row(s)

Check job progress in YARN:

http://192.168.83.11:8088/cluster

A field delimiter has to be specified when the table is created (the default delimiter is \001, i.e. ^A); spaces and tabs (\t) are common choices.
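For instance, if the file were comma-separated instead of tab-separated, the delimiter clause would have to match. A minimal sketch (dept_csv and dept.txt are hypothetical):

create table dept_csv
     (deptno int,
     dname string,
     loc string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- the file's fields must then be comma-separated, e.g.  10,ACCOUNTING,NEW YORK
load data local inpath '/home/hadoop/dept.txt' into table dept_csv;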

Using LOAD to load an HDFS file into a table:

Upload the file emp1.txt to HDFS:

hadoop fs -put /home/hadoop/emp1.txt / 

create table emp_2 as select * from emp_1 where 1>1;   -- CTAS: copies emp_1's structure; 1>1 is never true, so no rows are copied

load data inpath '/emp1.txt' into table emp_2;

Note: a table created this way (CTAS) carries no explicit delimiter clause in its definition, as the SHOW CREATE TABLE output below shows. Also note that LOAD DATA INPATH without LOCAL moves the HDFS file into the table's directory rather than copying it.

hive> show create table emp_2;
OK
CREATE TABLE `emp_2`(
  `empno` int, 
  `ename` string, 
  `job` string, 
  `mgr` int, 
  `hirdate` string, 
  `sal` double, 
  `comm` double, 
  `deptno` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hd1:9000/user/hive/warehouse/emp_2'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false', 
  'numFiles'='2', 
  'numRows'='-1', 
  'rawDataSize'='-1', 
  'totalSize'='700', 
  'transient_lastDdlTime'='1540480415')
Time taken: 0.138 seconds, Fetched: 24 row(s)
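If the new table should keep emp_1's '\t' delimiter, CREATE TABLE ... LIKE copies the full definition, storage format included (a sketch; emp_2b is a hypothetical name):

create table emp_2b like emp_1;   -- columns, delimiter, and storage format are copied; data is not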

load data local inpath '/home/hadoop/emp1.txt' into table emp1;   -- LOCAL is needed here: /home/hadoop is a local path, not an HDFS one

Creating an external table:

The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.

 create external table exter_emp
     (empno int,
     ename string,
     job string,
     mgr int,
     hirdate string,
     sal double,
     comm double,
     deptno int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
     location '/external/';

location '/external' is an HDFS path: drop files into it and the table picks the data up automatically according to the delimiter. If there are several files under '/external/', the external table exter_emp loads data from all of them.

Of course, an external table can also load local data:

load data local inpath '/home/hadoop/emp1.txt' into table exter_emp;
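To confirm the drop behavior quoted above, drop the table and list the directory (a sketch):

-- dropping an EXTERNAL table removes only the metadata
drop table exter_emp;
-- the files under /external/ are still there afterwards
dfs -ls /external/;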

Partitioned tables:

Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.

create table part_emp(
     empno int,
     ename string,
     job string,
     hirdate string,
     sal double,
     comm double,
     deptno int) 
   partitioned by(mgr int)
   row format
   delimited fields terminated by '\t';

load data local inpath '/home/hadoop/emp.txt' into table part_emp PARTITION (mgr=10);
-- a LOAD targets exactly one partition; repeat with mgr=20, mgr=30, ... for the others

One thing to note: a partition column must not also appear among the table's regular columns. If you load data with LOAD, you first have to remove the mgr column from the txt file; otherwise the fields are misaligned and the data in the partitioned table no longer matches the txt file.
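Instead of editing the file by hand, you can stage the complete rows in emp_1 and let dynamic partitioning split mgr out (a sketch; the two set commands are required before a dynamic-partition insert):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- mgr goes last in the SELECT so that it feeds the partition column
insert overwrite table part_emp partition(mgr)
select empno, ename, job, hirdate, sal, comm, deptno, mgr from emp_1;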

Inserting data with INSERT

Usage: INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;

Note: the SELECT columns must correspond one-to-one to the columns of the INSERT target.

hive> insert overwrite table t1 select * from emp_1;

hive> insert overwrite table t2 partition(mgr=10) select empno,ename,job,hirdate,sal,comm,deptno from part_emp;

Note: t2 must itself be a partitioned table, and the number of SELECT columns must match the number of t2's non-partition columns.

Multi-table insert:

hive> create table emp2 (j string,s double,c double);
OK
Time taken: 0.114 seconds
hive> from emp_1
    > insert into emp1 select empno,ename
    > insert into emp2 select job,sal,comm ;

Query ID = hadoop_20181026015757_f2abf70b-7249-44e5-895d-ad22eddfcd7a
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1540475767311_0010, Tracking URL = http://hd1:8088/proxy/application_1540475767311_0010/
Kill Command = /home/hadoop/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1540475767311_0010
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 0
2018-10-26 02:02:47,297 Stage-2 map = 0%,  reduce = 0%
2018-10-26 02:02:54,835 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 1.51 sec
MapReduce Total cumulative CPU time: 1 seconds 510 msec
Ended Job = job_1540475767311_0010
Stage-5 is selected by condition resolver.
Stage-4 is filtered out by condition resolver.
Stage-6 is filtered out by condition resolver.
Stage-11 is selected by condition resolver.
Stage-10 is filtered out by condition resolver.
Stage-12 is filtered out by condition resolver.
Moving data to: hdfs://hd1:9000/user/hive/warehouse/emp1/.hive-staging_hive_2018-10-26_02-02-34_121_1621965597337759090-1/-ext-10000
Moving data to: hdfs://hd1:9000/user/hive/warehouse/emp2/.hive-staging_hive_2018-10-26_02-02-34_121_1621965597337759090-1/-ext-10002
Loading data to table default.emp1
Loading data to table default.emp2
Table default.emp1 stats: [numFiles=1, numRows=0, totalSize=164, rawDataSize=0]
Table default.emp2 stats: [numFiles=1, numRows=0, totalSize=278, rawDataSize=0]
MapReduce Jobs Launched: 
Stage-Stage-2: Map: 1   Cumulative CPU: 1.51 sec   HDFS Read: 4567 HDFS Write: 580 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 510 msec
OK
Time taken: 22.873 seconds

 

Basic Hive operations

create database if not exists <dbname>;                     # create a database
alter database dbname set dbproperties('edited-by'='joe');  # modify a database (properties can be set but not deleted or "reset")
describe database extended dbname;                          # describe a database
drop database [if exists] dbname;                           # drop a database
desc database extended <dbname>;                            # show a database's extended info

hive>create external table dat0204(filmname string ,filmdate date ,filmscore string)
      >comment 'table comment'
      >row format delimited 
      >fields terminated by '\t' 
      >lines terminated by '\n' 
      >stored as textfile; # create an external table; external tables are relatively safer, organize data more flexibly, and make it easy to share the source data.

hive>create table if not exists dat0204(id int , name string , age int)
       >comment 'table comment'
       >row format delimited 
       >fields terminated by '\t' 
       >lines terminated by '\n' 
       >stored as textfile; # create a managed (internal) table

 

desc <table>;                       # describe the table
desc formatted <table>;             # show the table structure, formatted
desc extended <table>;              # show the table's extended info
select * from <table>;              # query the table's data
create table db1.t1 like db2.t2;    # copy a table's structure (LIKE copies the definition only, not the data; see the sketch below)
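A sketch of the difference (db1.t1 and db2.t2 are placeholders): LIKE copies the definition but no rows, while CREATE TABLE ... AS SELECT copies the rows but falls back to the default storage format:

create table db1.t1 like db2.t2;                    -- columns, delimiter, storage format; no data
create table db1.t1_copy as select * from db2.t2;   -- data included; default storage format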

alter table hive1.test2 add partition(province='hebei',city='baoding');            # add a partition

show partitions hive1.test2;                # list the table's partitions

insert overwrite table test2 partition(province='hebei',city='shijiazhuang') select id , name , age from test1;      # insert data into a partition

drop table <table>;                 # drop a table (empty or not)
drop database <dbname> cascade;     # drop a non-empty database (CASCADE applies to databases, not tables)

show tables like '*name*'; # fuzzy-match table names

 

Inserting data (loading into HDFS)

hive>load data local inpath 'path/filename' overwrite into table <table>;                  # import local data into a Hive table

hive>load data inpath 'path/filename' into table <table>;                                  # import data already on HDFS into a Hive table

hive> insert overwrite directory '<hdfs dir>'  select user, login_time from user_login;    # write query results to an HDFS directory (the directory is created/overwritten)

$ hive -e "sql statement" > /tmp/out.txt                                                   # save query results to a local file
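Relatedly, hive -f runs a whole script file, and -S silences the progress logs (a sketch; /tmp/query.sql is a placeholder):

$ hive -f /tmp/query.sql                                     # run a script file
$ hive -S -e "select count(*) from emp_1" > /tmp/out.txt     # -S (silent) keeps progress logs out of the output file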

 

Hive shell commands

hive>dfs -lsr /;                    // list HDFS files: path/database/table/file
hive>dfs -rmr /<dir>;               // dfs command: delete a directory
hive>!clear;                        // run a shell command from inside Hive (the ! prefix)
hive>!hadoop fs -lsr /;             // run an HDFS command from inside Hive via the shell
 

All of the metadata is stored in MySQL

use <hive metastore database>;

select * from VERSION;                                                    # check the Hive version
select * from TBLS \G                                                     # list the tables (\G prints one column per line)
select * from SDS \G                                                      # view each table's HDFS-location metadata
select * from PARTITIONS where TBL_ID=1 \G                                # view a table's partitions
select * from COLUMNS_V2;                                                 # view a table's columns
select * from PARTITION_KEYS;                                             # view a table's partition keys
select * from DBS;                                                        # view the databases (warehouse info)

Tuning
1. explain — show the execution plan
explain select count(*) from test2;
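explain extended prints additional detail, such as file paths and serialization info (a sketch):

explain extended select count(*) from test2;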

 

Origin and applications of Hive

1. Origin: open-sourced by Facebook to run statistics over massive volumes of structured log data.
2. Architecture: Hive is a data-warehouse tool built on Hadoop that maps structured data files onto tables and provides SQL-like querying;
(HQL as the query interface; HDFS for storage; MapReduce for computation.)

In essence: Hive translates HQL into MapReduce programs.
3. Applications: suited to offline (batch) data processing.
4. Schema: the table metadata is kept in a database.
5. Databases and tables are simply HDFS paths.

6. Hive does not validate data on write, only on read (schema-on-read; see the sketch below).
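A quick sketch of point 6 (t_sor and bad.txt are hypothetical):

create table t_sor (id int, amount double)
row format delimited fields terminated by '\t';
-- nothing is validated on write: the load succeeds even if bad.txt holds lines like "abc<TAB>xyz"
load data local inpath '/home/hadoop/bad.txt' into table t_sor;
-- validation happens on read: unparseable fields come back as NULL
select * from t_sor;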

 

We have created quite a few tables in Hive by now. As mentioned earlier, Hive's metadata is stored in MySQL, so where exactly does it live?


mysql> select * from TBLS;
+--------+-------------+-------+------------------+--------+-----------+-------+-----------+----------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME  | TBL_TYPE       | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+--------+-----------+-------+-----------+----------------+--------------------+--------------------+
|      1 |  1540322901 |     1 |                0 | hadoop |         0 |     1 | test      | MANAGED_TABLE  | NULL               | NULL               |
|      6 |  1540378940 |     6 |                0 | hadoop |         0 |     6 | t1        | MANAGED_TABLE  | NULL               | NULL               |
|     13 |  1540477772 |     1 |                0 | hadoop |         0 |    13 | emp_1     | MANAGED_TABLE  | NULL               | NULL               |
|     16 |  1540481399 |     1 |                0 | hadoop |         0 |    16 | emp_2     | MANAGED_TABLE  | NULL               | NULL               |
|     17 |  1540481750 |     1 |                0 | hadoop |         0 |    17 | emp_3     | MANAGED_TABLE  | NULL               | NULL               |
|     21 |  1540482372 |     1 |                0 | hadoop |         0 |    21 | exter_emp | EXTERNAL_TABLE | NULL               | NULL               |
|     26 |  1540484035 |     1 |                0 | hadoop |         0 |    26 | part_emp  | MANAGED_TABLE  | NULL               | NULL               |
+--------+-------------+-------+------------------+--------+-----------+-------+-----------+----------------+--------------------+--------------------+
7 rows in set (0.00 sec)

They all live in the metastore's TBLS table.
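Joining TBLS with DBS shows which database each table belongs to (a sketch against the metastore schema above):

mysql> select d.NAME, t.TBL_NAME, t.TBL_TYPE
    -> from TBLS t join DBS d on t.DB_ID = d.DB_ID;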

 


Modifying a table column:

hive> alter table test change id id string;
OK
Time taken: 0.388 seconds
hive> desc test;
OK
id                      string                                      
Time taken: 0.179 seconds, Fetched: 1 row(s)
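A few other common alterations follow the same pattern (sketches; new_col and test_new are hypothetical names):

alter table test change id new_id string;        -- rename a column (and/or change its type)
alter table test add columns (new_col string);   -- append a column
alter table test rename to test_new;             -- rename the table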

Creating a table with LIKE and checking the table structure:

hive> create table t1 like emp_1;
OK
Time taken: 0.146 seconds
hive> desc formatted t1l;
FAILED: SemanticException [Error 10001]: Table not found t1l
hive> desc formatted emp_1;
OK
# col_name              data_type               comment             
                 
empno                   int                                         
ename                   string                                      
job                     string                                      
mgr                     int                                         
hirdate                 string                                      
sal                     double                                      
comm                    double                                      
deptno                  int                                         
                 
# Detailed Table Information             
Database:               default                  
Owner:                  hadoop                   
CreateTime:             Thu Oct 25 22:29:32 CST 2018     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hd1:9000/user/hive/warehouse/emp_1        
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   true                
        numFiles                1                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               700                 
        transient_lastDdlTime   1540477792          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        serialization.format    \t                  
Time taken: 0.214 seconds, Fetched: 39 row(s)

 

 

 

 

Reposted from: https://my.oschina.net/u/3862440/blog/2252546
