I. EXPORT / IMPORT: exporting and importing tables
---------------------------------------------------
$hive> EXPORT TABLE p1 TO '/user/ubuntu/data/ex';      // exports to HDFS; the target directory must be empty
$hive> IMPORT FROM '/user/ubuntu/data/ex';             // imports under the table name recorded at export time, so it fails if that table already exists

II. ORDER BY / SORT BY / DISTRIBUTE BY / CLUSTER BY -- global sort / map-side sort / partitioning / clustering
---------------------------------------------------------------
1. ORDER BY (ASC|DESC) -- global sort; uses a single reducer
    $hive> SELECT name FROM customer ORDER BY name DESC;
2. SORT BY (ASC|DESC) -- sorts the map output before the data reaches the reducer, i.e. a local, per-reducer sort; with a single reducer it is equivalent to ORDER BY
    $hive> SELECT name FROM customer SORT BY name DESC;
3. DISTRIBUTE BY -- sends rows with equal column values to the same reducer; must appear before SORT BY; used on its own it guarantees no ordering, and the column distributed on must appear in the SELECT list
    $hive> SELECT cid FROM orders DISTRIBUTE BY id;    // error: id is not in the SELECT list
4. DISTRIBUTE BY + SORT BY
    $hive> SELECT name, id FROM customer DISTRIBUTE BY id SORT BY name;
5. CLUSTER BY = DISTRIBUTE BY + SORT BY on the same column
    $hive> SELECT id FROM customer DISTRIBUTE BY id SORT BY id;    // both clauses use id, so CLUSTER BY works too
    $hive> SELECT id FROM customer CLUSTER BY id;

III. Job parameters
-----------------------------------------------------------------
$hive> set hive.exec.reducers.bytes.per.reducer=xxx    // bytes handled per reduce task
$hive> set hive.exec.reducers.max=0                    // maximum number of reduce tasks
$hive> set mapreduce.job.reduces=0                     // number of reduce tasks

IV. Functions (hive supports Tab completion for function names)
--------------------------------------------------------------------
$hive> select concat('100','tom');    // string concatenation
...
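To illustrate how DISTRIBUTE BY routes rows to reducers while SORT BY orders rows only within each reducer, here is a minimal Python sketch. The two-reducer setup, the sample rows, and the modulo hash partitioner are simplifying assumptions, not Hive's actual implementation:

```python
# Simulate: SELECT name, id FROM customer DISTRIBUTE BY id SORT BY name
# Rows are (id, name) pairs. Rows with the same id always land in the
# same partition, and each partition is sorted independently, so there
# is no global order across partitions -- exactly the SORT BY contract.
NUM_REDUCERS = 2  # assumed reducer count for the sketch

rows = [(3, "tom"), (1, "ann"), (3, "bob"), (2, "joe"), (1, "zed")]

# DISTRIBUTE BY id: route each row to a reducer by hashing the id.
partitions = [[] for _ in range(NUM_REDUCERS)]
for row in rows:
    partitions[row[0] % NUM_REDUCERS].append(row)

# SORT BY name: sort within each reducer only.
for part in partitions:
    part.sort(key=lambda r: r[1])

for i, part in enumerate(partitions):
    print(i, part)
# partition 0: [(2, 'joe')]
# partition 1: [(1, 'ann'), (3, 'bob'), (3, 'tom'), (1, 'zed')]
```

Note that reading the two partitions back to back does not yield a globally sorted result; that is why ORDER BY (one reducer) is still needed for a total order.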
V. Aggregate queries
---------------------------------------------------------------------
// query table orders: group by cid, drop groups with 1 record or fewer, and show cid, count, max_price
$hive> select cid, count(*) c, max(price) from orders group by cid having c > 1;

VI. WordCount in Hive
-----------------------------------------------------------------
1. Create a table doc
    $hive> CREATE TABLE doc(line string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
2. Load the document into the table
    $hive> LOAD DATA LOCAL INPATH '/mnt/hgfs/share/wc' OVERWRITE INTO TABLE doc;
    +------------------+--+
    |     doc.line     |
    +------------------+--+
    | hello tom ad     |
    | ads hello tom    |
    | hello ads        |
    | tom ads ads ads  |
    +------------------+--+
3. Use the split function to cut each line on spaces into an array
    $hive> select split(line,' ') from doc;
    +----------------------------+--+
    |             c0             |
    +----------------------------+--+
    | ["hello","tom","ad"]       |
    | ["ads","hello","tom"]      |
    | ["hello","ads"]            |
    | ["tom","ads","ads","ads"]  |
    +----------------------------+--+
4. Use the explode function to blow the arrays apart into one record per element
    $hive> select explode(split(line,' ')) from doc;
    +--------+--+
    |  col   |
    +--------+--+
    | hello  |
    | tom    |
    | ad     |
    | ads    |
    | hello  |
    | tom    |
    | hello  |
    | ads    |
    | tom    |
    | ads    |
    | ads    |
    | ads    |
    +--------+--+
5. Build up the word count
    $hive> select explode(split(line,' ')) as word from doc;
    $hive> select t.word from ((select explode(split(line,' ')) as word from doc) as t);
    $hive> select t.word, count(*) from ((select explode(split(line,' ')) as word from doc) as t) group by t.word;
    $hive> select t.word, count(*) from ((select explode(split(line,' ')) as word from doc where word = 'tom') as t) group by t.word;
    $hive> select t.word, count(*) c from ((select explode(split(line,' ')) as word from doc) as t) group by t.word order by c desc limit 2;
    $hive> select t.word, count(*) c from ((select explode(split(line,' ')) as word from doc) as t) where t.word != 'tom' group by t.word order by c desc limit 2;
    +---------+----+--+
    | t.word  | c  |
    +---------+----+--+
    | ads     | 5  |
    | hello   | 3  |
    +---------+----+--+
6. Store the result in a table
    $hive> create table wcresult as select t.word, count(*) c from ((select explode(split(line,' ')) as word from doc) as t) where t.word != 'tom' group by t.word order by c desc limit 2;
    +----------------+-------------+--+
    | wcresult.word  | wcresult.c  |
    +----------------+-------------+--+
    | ads            | 5           |
    | hello          | 3           |
    +----------------+-------------+--+

VII. Transactions
---------------------------------------------------------------------
1. Requirements
    a. Hive version > 0.13
    b. the table must be a bucketed table
    c. only the ORC format is supported
    d. all transactions are auto-commit
    e. the parameters below must be set
    f. the table must be created with tblproperties('transactional'='true')
2. Hive parameters, set in [hive-site.xml]
    SET hive.support.concurrency = true;
    SET hive.enforce.bucketing = true;                    // enforce bucketing
    SET hive.exec.dynamic.partition.mode = nonstrict;     // dynamic partitioning
    SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;    // transaction manager
    SET hive.compactor.initiator.on = true;
    SET hive.compactor.worker.threads = 1;
3. Create the table
    create table tx(id int, name string, age int) clustered by (id) into 3 buckets row format delimited fields terminated by ',' stored as orc tblproperties('transactional'='true');
4. Transactional insert / update / delete / select
    $hive> insert into tx (id,name,age) values(1,'tom',11);
    $hive> update tx set name = 'tom111' where id = 1;
    $hive> delete from tx where id = 1;
    $hive> select * from tx;

VIII. Views (virtual tables; they exist only logically)
---------------------------------------------------------------------
0. When a long query is run frequently, its result can be stored as a view, a virtual table; later the same result is obtained by simply querying the view
1. Create a view
    a. Syntax
        CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)] [COMMENT table_comment] AS SELECT ...
    b. Example: store a left outer join as a view, so that select * from v1 reproduces the join
        $hive> create view v1 as select a.id aid, a.name, b.id bid, b.orderno from customer a left outer join orders b on a.id = b.cid;
        $hive> select * from v1;
2. Drop a view
    DROP VIEW [IF EXISTS] view_name
3. Alter a view
    ALTER VIEW view_name

IX. Map join, a special join
-------------------------------------------------------------
1. The join happens entirely on the map side, with no reducer: the small table is loaded into memory and joined against the large table there
2. $hive> set hive.auto.convert.join = true;    // auto-convert to map join; enabled by default
3.
The join hint /*+mapjoin(customer)*/ forces a map join -- customer is the small table to load into memory
    $hive> select /*+mapjoin(customer)*/ a.id aid, a.name, b.id bid, b.orderno from customer a left outer join orders b on a.id = b.cid;

X. Hive tuning
-------------------------------------------------------------------
1. explain: show the query plan -- a detailed breakdown of the statement, used for analysis
    a. $hive> explain select count(*) from customer;
+------------------------------------------------------------------------------------------------------+--+
|                                               Explain                                                |
+------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES:                                                                                  |
|   Stage-1 is a root stage                                                                            |
|   Stage-0 depends on stages: Stage-1                                                                 |
|                                                                                                      |
| STAGE PLANS:                                                                                         |
|   Stage: Stage-1                                                                                     |
|     Map Reduce                                                                                       |
|       Map Operator Tree:                                                                             |
|           TableScan                                                                                  |
|             alias: customer                                                                          |
|             Statistics: Num rows: 1 Data size: 88 Basic stats: COMPLETE Column stats: COMPLETE       |
|             Select Operator                                                                          |
|               Statistics: Num rows: 1 Data size: 88 Basic stats: COMPLETE Column stats: COMPLETE     |
|               Group By Operator                                                                      |
|                 aggregations: count()                                                                |
|                 mode: hash                                                                           |
|                 outputColumnNames: _col0                                                             |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE    |
|                 Reduce Output Operator                                                               |
|                   sort order:                                                                        |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE  |
|                   value expressions: _col0 (type: bigint)                                            |
|       Reduce Operator Tree:                                                                          |
|         Group By Operator                                                                            |
|           aggregations: count(VALUE._col0)                                                           |
|           mode: mergepartial                                                                         |
|           outputColumnNames: _col0                                                                   |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE          |
|           File Output Operator                                                                       |
|             compressed: false                                                                        |
|             Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE        |
|             table:                                                                                   |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat                       |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat             |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                            |
|                                                                                                      |
|   Stage: Stage-0                                                                                     |
|     Fetch Operator                                                                                   |
|       limit: -1                                                                                      |
|       Processor Tree:                                                                                |
|         ListSink                                                                                     |
|                                                                                                      |
+------------------------------------------------------------------------------------------------------+--+
    b. explain extended shows even more detail
2. LIMIT tuning
    a. Avoids running a full-table query before emitting the LIMITed rows; LIMIT tuning instead works on a sampled subset of the data
    b. Enabled via a property
        $hive> set hive.limit.optimize.enable = true;
        <property>
            <name>hive.limit.optimize.enable</name>
            <value>true</value>
            <description>Whether to enable to optimization to try a smaller subset of data for simple LIMIT first.</description>
        </property>
    c. Sampling controls
        <property>
            <name>hive.limit.row.max.size</name>
            <value>100000</value>
            <description>When trying a smaller subset of data for simple LIMIT, how much size we need to guarantee each row to have at least.</description>
        </property>
        <property>
            <name>hive.limit.optimize.limit.file</name>
            <value>10</value>
            <description>When trying a smaller subset of data for simple LIMIT, maximum number of files we can sample.</description>
        </property>
3. Local mode (for development, debugging, and testing)
    a. For jobs over small amounts of data, run the MR job on the local machine only
    b. $hive> set mapred.job.tracker=local;
    c. Hive can also be left to apply this automatically (personally tested, works well):
        $hive> set hive.exec.mode.local.auto = true;
        [hive-site.xml]
        <property>
            <name>hive.exec.mode.local.auto</name>
            <value>true</value>
            <description>Let hive determine whether to run in local mode automatically</description>
        </property>
4. Parallel execution
    a. By default Hive runs its MR stages serially, one after another, yet some stages have no dependency between them and could run concurrently
    b. Enable parallel execution of independent stages:
        $hive> set hive.exec.parallel = true;
5. Strict mode: guards against queries over huge amounts of data
    a. Strict mode is a Hive setting that blocks queries likely to have unintended, undesirable effects
    b. $hive> set hive.mapred.mode = strict;
    c. Note that under strict mode:
        1) queries on partitioned tables must specify a partition (disabled: select * from t1) (allowed: select * from t1 where id = 3, where id is the partition column)
        2) order by must be combined with a limit clause (disabled: select * from t1 order by id) (allowed: select * from t1 order by id limit 2)
        3) Cartesian-product queries are rejected (disabled: select a.*, b.* from customer a, orders b) (allowed: select a.*, b.* from customer a, orders b where a.id = b.cid)
6. Number of reducers
    a. Set the number of bytes handled per reducer; the default is 1 GB
        $hive> set hive.exec.reducers.bytes.per.reducer = 750000000;
7. JVM reuse
    a. Lets a JVM be reused across tasks; the drawback is that with an unbalanced job the reserved slots can sit idle, since they cannot be reclaimed until the whole job finishes
    b. Best suited to jobs with many small files and many small MR tasks
    c. Hadoop's mapred-site.xml (in $HADOOP_HOME/conf):
        <property>
            <name>mapreduce.job.jvm.numtasks</name>
            <value>10</value>
            <description>How many tasks to run per jvm. If set to -1, there is no limit. JVM Reuse</description>
        </property>
    d. $hive> set mapreduce.job.jvm.numtasks = 10;    // default is 1; -1 means no limit
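Outside Hive, the core idea of the map join described above can be sketched in a few lines of Python: the small table is loaded into an in-memory hash map, and the large table is streamed against it, so no shuffle or reduce phase is needed. The customer and order rows below are made-up sample data, and the hash map plus single streaming pass is a simplification of Hive's actual map-join implementation:

```python
# Map-side (hash) join sketch: build a hash map from the small table,
# then stream the large table and probe the map. The join completes
# entirely "map side", with no shuffle or reduce step.
customers = [(1, "tom"), (2, "ann")]               # small table: (id, name)
orders = [(101, 1), (102, 1), (103, 2), (104, 9)]  # large table: (orderno, cid)

# Load the small table into memory, keyed by the join column.
name_by_id = {cid: name for cid, name in customers}

# Outer-join the streamed orders against the in-memory customers:
# unmatched cids (here 9) get None, mirroring an outer join's NULL.
joined = [(orderno, cid, name_by_id.get(cid)) for orderno, cid in orders]
print(joined)
# [(101, 1, 'tom'), (102, 1, 'tom'), (103, 2, 'ann'), (104, 9, None)]
```

The sketch also shows why the technique only works when one side is small: the whole `customers` table must fit in each mapper's memory.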
Big Data with Hive (III): export/import, sort and cluster queries, WordCount in Hive, transactions, views, map join, Hive tuning