Hive Basics Notes

盒马coding
Article link: https://blog.csdn.net/xfg0218/article/details/78014144
 
Garbled characters caused by the character set (change the LC_ALL=C setting): check the current locale, then clear LC_ALL:

locale

unset LC_ALL
 
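Variables: use hiveconf for parameters referenced from inside Hive and hivevar for parameters passed in from outside; referencing a hivevar internally raises an error.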
 
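Important: when overwriting a partitioned table, restrict every partition column (including the partitions of the source table). Otherwise Hive fills in the missing partition values itself and saves the result as a separate file on Hadoop; because the files differ, the overwrite then behaves like insert into.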
 
Hive's default field delimiter is \x01 (Ctrl-A). To replace it with tabs:

sed -e 's/\x01/\t/g' 000000_0 > 000000_1
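As a quick check of the replacement above, the following sketch builds a small sample file and converts it. The rows are made up for illustration, the work happens in a temporary directory, and the \x01 and \t escapes in the sed expression assume GNU sed.

```shell
#!/bin/sh
# Work in a throwaway directory so no real Hive output is touched.
workdir=$(mktemp -d)

# Hypothetical sample rows: fields separated by Ctrl-A (octal 001).
printf 'alice\001beijing\nbob\001shanghai\n' > "$workdir/000000_0"

# Replace every Ctrl-A with a tab (GNU sed escape syntax assumed).
sed -e 's/\x01/\t/g' "$workdir/000000_0" > "$workdir/000000_1"

# Print the converted, tab-separated file.
cat "$workdir/000000_1"
```

The converted file can then be opened in any tool that expects tab-separated text.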
 
  hive json 
 
   https://stackoverflow.com/questions/14705858/using-json-serde-in-hive-tables 
 
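hive location

1. Upload the data to a directory on HDFS, for example

aa.txt

张三

李四

hadoop fs -put aa.txt /embrace/source/data

2. Create an external table that points at /embrace/source/data

create external table test_location (name string) row format delimited fields terminated by '\t' location '/embrace/source/data';

Note: the location must be a directory, never a single file; pointing it at a file raises an error.

If the table is partitioned, the partition also has to be added:

alter table yt50 add partition(statist_day=20170709) location '/apps/hive/warehouse/cars.db/yt50/statist_day=20170709/';

Otherwise the data still does not show up.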
 
  http://blog.csdn.net/uckyk/article/details/50543483 
 
Entering the Hive shell:

#hive or hive --service cli

Ways to start Hive:

Command-line mode: run the /hive/bin/hive executable directly, or run hive --service cli

Web interface: hive --service hwi

Remote service (port 10000): hive --service hiveserver

Remote service in the background (the Hive service keeps running after the terminal is closed): nohup hive --service hiveserver &
 
List all functions:

hive> show functions;

Show how a function is used:

hive> describe function substr;

See how many MapReduce jobs Hive uses for a query:

hive> explain select a.id from tbname a;
 
  -------------------------------------------------------------------------- 
 
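Table structure operations:

Managed tables and external tables

A managed table moves its data into Hive's warehouse directory; an external table does not. As a rule of thumb, use a managed table if all processing is done by Hive, and use an external table if Hive and other tools need to work on the same dataset.

Create a table (usually stored as textfile):

hive> create table tbName (id int,name string) stored as textfile;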
 
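Create a table whose fields are split on a delimiter (the imported data must use this delimiter; otherwise the columns are null after import, and missing columns are null as well):

hive> create table tbName (id int,name string) row format delimited fields terminated by '\t';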
 
Create an external table:

hive> create external table extbName(id int, name string);
 
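Create a table with a single partition column ds (a partitioned table is one whose partition columns are declared with partitioned by at creation time):

hive> create table tbName2 (id int, name string) partitioned by (ds string) row format delimited fields terminated by '\t' stored as textfile;

Truncate a table:

truncate table aa;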
 
Create a table with two partition columns (day and hour):

hive> create table tbname3 (id int, content string) partitioned by (day string, hour string);
 
Add a column to a table:

hive> alter table tbName add columns (new_col int);
 
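Drop or replace columns:

For example, if table a has the columns sno and sname, you can use

alter table a replace columns (sno int);

This changes the column definitions. Columns are dropped strictly in order from the end; dropping a column by position is not supported. For example, with columns 1 2 3 4, if you try to drop columns 1 and 3, what is actually kept is 1 and 2, and columns 3 and 4 are dropped.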
 
Change a column:

alter table a change column id idd int comment 'hehehe' after severity;

(This renames id to idd and places it after the column severity.)
 
Add a column with a comment:

hive> alter table tbName add columns (new_col2 int comment 'a comment');

Change a column's name and position:

alter table student change sum sun string after id;

Rename a table:

hive> alter table tbName rename to tbName3;
 
Create an index:

hive> create index your_index on table your_table(your_column)

> as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'

> with deferred rebuild

> IN TABLE your_index_table;

Show indexes:

show formatted index on employees;

Drop an index:

drop index if exists employees_index on table employees;
 
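Drop a table (this deletes the table's metadata; for a managed table the data is deleted as well):

hive> drop table tbName;

Delete only the contents (to remove the table's data but keep the metadata, delete the data files):

hive> dfs -rmr 'warehouse/my-table';

Drop a partition; the partition's metadata and data are deleted together (a comparison such as hour>='09' can be used in place of hour='09'):

hive> alter table tbname2 drop partition (dt='2008-08-08', hour='09');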
 
Copy a table's structure:

create table a like aaa;
 
  -------------------------------------------------------------------------- 
 
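Metadata store (loading data from HDFS into a table is instantaneous):

Load data from a file into a table (the file should have an extension; missing columns default to null):

hive> load data local inpath 'myTest.txt' overwrite into table tbName;

Add a single partition to an existing table and point it at the data:

hive> alter table tbname2 add partition (ds='20120701') location '/user/hadoop/his_trans/record/20120701';

Add a two-column partition to an existing table and point it at the data:

hive> alter table tbname2 add partition (ds='2008-08-08', hour='08') location '/path/pv1.txt';

Add a partition from the given partition column values, ready for loading local data:

hive> alter table tbname2 add partition (ds='2013-12-12');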
 
Syntax for loading data into a partitioned table (no transformation is applied when data is loaded; the load operation only copies local files, or moves HDFS files, to the location of the Hive table) [not recommended]:

hive> load data local inpath 'part.txt' overwrite into table tbName2 partition(ds='2013-12-12');

hive> load data inpath '/user/hadoop/*' into table tbname3 partition(dt='2008-08-08', hour='08');
 
  -------------------------------------------------------------------------- 
 
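SQL operations:

Copy a partitioned table together with its data:

1. create table new_table like old_table; (copies the table structure)

2. Use hadoop fs -cp to copy everything under old_table's HDFS directory into new_table's HDFS directory.

3. Run msck repair table new_table to repair new_table's partition metadata.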
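The three copy steps can be sketched as a small wrapper script. This is only an illustration: the two warehouse paths are placeholders (the real ones can be read from show create table), and by default the script only prints the commands; setting DRY_RUN=0 is an assumed convention of this sketch for actually running them.

```shell
#!/bin/sh
# Placeholder HDFS directories; look up the real ones with "show create table".
old_dir="/apps/hive/warehouse/old_table"
new_dir="/apps/hive/warehouse/new_table"

plan=""
run() {
  # Record and print each command; execute it only when DRY_RUN=0.
  plan="${plan}$*
"
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# 1. Copy the table structure.
run hive -e "create table new_table like old_table;"
# 2. Copy the partition directories; the glob is quoted so hadoop expands it.
run hadoop fs -cp "${old_dir}/*" "${new_dir}/"
# 3. Register the copied partitions in the metastore.
run hive -e "msck repair table new_table;"
```

Printing the plan first makes it easy to confirm the source and target directories before any data is copied.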
 
Show a table's structure:

hive> describe tbname;

hive> desc tbname;

List all tables:

hive> show tables;

List tables matching a regular expression:

hive> show tables '.*s';

Selecting all columns does not launch a MapReduce job:

hive> select * from tbName;

Selecting a single column launches a MapReduce job:

hive> select a.id from tbname a;

Query by partition:

hive> select a.* from tbname2 a where a.ds='2013-12-12';

Show partitions:

hive> show partitions tbname2;
 
Aggregates and clauses avg/sum/count/group by/order by (desc)/limit:

select logdate, count(logdate) as count from access_1 group by logdate order by count limit 5;
 
Inner join:

hive> SELECT sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);

Outer joins:

hive> SELECT sales.*, things.* FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
(all rows of the left table are shown; the right table contributes only rows that match the left table)

hive> SELECT sales.*, things.* FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
(all rows of the right table are shown; the left table contributes only rows that match the right table)

hive> SELECT sales.*, things.* FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
(all rows of both tables are shown)
 
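IN subqueries: not supported by the Hive version these notes were written for, but LEFT SEMI JOIN can be used instead:

hive> SELECT * FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);

This is equivalent to the SQL statement: SELECT * FROM things WHERE things.id IN (SELECT id from sales);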
 
Map join: Hive can load the smaller table into the memory of every mapper to perform the join:

hive> SELECT /*+ MAPJOIN(things) */ sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
 
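To insert the data of one table into another table with the same structure (both tables are partitioned):

First run set hive.exec.dynamic.partition.mode=nonstrict;

Then use insert overwrite table yt partition(statist_day) select * from yt50 where statist_day=20170708;

Using insert overwrite table yt partition(statist_day=20170708) select * from yt50 where statist_day=20170708; instead fails with a column mismatch, because select * also returns the partition column, so the selected columns no longer match the target's columns.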
 
INSERT OVERWRITE TABLE ... SELECT: the target table must already exist
 
  hive> FROM records2 
 
  > INSERT OVERWRITE TABLE stations_by_year SELECT year, COUNT(DISTINCT station) GROUP BY year  
 
  > INSERT OVERWRITE TABLE records_by_year SELECT year, COUNT(1) GROUP BY year 
 
  > INSERT OVERWRITE TABLE good_records_by_year SELECT year, COUNT(1) WHERE temperature != 9999 AND  
 
  (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9) GROUP BY year;  
 
CREATE TABLE ... AS SELECT: the target table must not exist yet
 
  hive>CREATE TABLE target AS SELECT col1,col2 FROM source; 
 
Create a view:

hive> CREATE VIEW valid_records AS SELECT * FROM records2 WHERE temperature !=9999;

Show a view's details:

hive> DESCRIBE EXTENDED valid_records;
 
  -------------------------------------------------------------------------- 
 
Write query results to an HDFS directory:

hive> insert overwrite directory '/tmp/hdfs_out' select a.* from tbname2 a where a.ds='2013-12-12';

Write query results to a local directory:

hive> insert overwrite local directory '/tmp/local_out' select ds,count(1) from tbname group by ds;

hive> insert overwrite table events select a.* from tbname a where a.id < 100;

hive> insert overwrite local directory '/tmp/sum' select sum(a.pc) from tbpc a ;

Insert the aggregated results of one table into another table:

hive> from tbname a insert overwrite table events select a.bar,count(1) where a.foo > 0 group by a.bar;

hive> insert overwrite table events select a.bar,count(1) from tbname a where a.foo > 0 group by a.bar;
 
JOIN:

hive> from tbname t1 join tbname2 t2 on (t1.id = t2.id) insert overwrite table events select t1.id,t1.name,t2.ds;
 
Insert data from one source into several tables and directories in a single statement (multi-insert):
 
  FROM src 
 
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100 
 
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200 
 
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300 
 
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300; 
 
Stream data through a script and insert the result into a table:

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like Hadoop streaming). Similarly, streaming can be used on the reduce side (please see the Hive Tutorial or examples).
 
  -------------------------------------------------------------------------- 
 
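### Error messages ###

Problem: all loaded data is null

Cause: a delimiter problem. Deserialization fails because the data's delimiter does not match the table's, so the delimiter must be declared when the table is defined.

Fix: row format delimited fields terminated by ',' stored as textfile;

create table mytable(key int , value string ) row format delimited fields terminated by ',' escaped by '\\' stored as textfile;

[row format delimited] sets the column delimiter the table expects when data is loaded, for example ',': row format delimited fields terminated by ',';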
 
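[terminated by] the delimiter: which character separates the fields. Hive's own default is \x01 (Ctrl-A); the tab character (\t) is a common explicit choice.

[enclosed by] the character that encloses a field

[escaped by] the escape character

Use the "\" character for escaping, or write: ALTER TABLE splitchar SET SERDEPROPERTIES ('escape.delim' = '\\');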
 
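[stored as file_format] sets the file format of the table's data. The built-in file formats covered here are Text File and Sequence File.

If the data files are plain text, use [stored as textfile].

If the data needs to be compressed, use [stored as sequencefile]. In general, unless serialized objects need to be stored, [STORED AS TEXTFILE] is the default choice.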
 
Import CSV data into a table:

add jar /home/hadoop/csv-serde-1.1.2.jar; (every operation on a table that uses this SerDe requires the jar to be added first)

row format serde 'com.bizo.hive.serde.csv.CSVSerde'

For example:

create external table trans_data
 
  ( 
 
  id int, 
 
  name string 
 
  ) 
 
  partitioned by (pdate string)  
 
  row format serde 'com.bizo.hive.serde.csv.CSVSerde' stored as textfile; 
 
  alter table trans_data add partition (pdate='20120701') location '/user/hadoop/his_trans/record/20120701'; 
 
  -------------------------------------------------------------------------- 
 
### Error messages ###

Problem: java.lang.OutOfMemoryError: Java heap space

Fix: check whether the hiveserver service is running
 
  -------------------------------------------------------------------------- 
 
### Error messages ###

java.lang.NoSuchMethodError: com.facebook.fb303.FacebookService

Caused by incompatible Hadoop and Hive versions (hadoop-0.20.2+320)

Fix: mv $HADOOP_HOME/lib/libfb303.jar $HADOOP_HOME/lib/libfb303.jar_backup && ln -s $HIVE_HOME/lib/libfb303.jar $HADOOP_HOME/lib/libfb303.jar
 
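Lessons learned:

0. count(*) versus count(column)

It is hard to say which is faster. With an index on the column, count(column) can be fast; otherwise there is little difference.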
 
1.create table logs(ts int,line string)

partitioned by (dt String,country String)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t'

LINES TERMINATED BY '\n';
 
  2.load data local inpath '/jboss/ttest/aa.txt' into table logs partition (dt='20010101',country='GB'); 
 
  3.show partitions logs; 
 
  4.alter table aa add columns(im int comment 'aaa'); 
 
5.show create table aa; (shows where aa is stored on HDFS)
 
  —————————————————————————————————————————————————————————— 
 
  insert overwrite table aa select bb.id,dd.age,bb.name from beer.bb left join deer.dd on(bb.id=dd.id); 
 
  insert into (table) aa select bb.id,dd.age,bb.name from beer.bb left join deer.dd on(bb.id=dd.id); 
 
Difference: after into, the keyword table is optional; after overwrite, it is required.
 
  —————————————————————————————————————————————————————————— 
 
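6. Hive has three storage formats: TEXTFILE, SEQUENCEFILE and RCFILE

(1) TEXTFILE: high storage and parsing overhead; the data is not compressed

(2) SEQUENCEFILE: a binary file format provided by the Hadoop API; easy to use, splittable and compressible

(3) RCFILE: a storage format that combines row and column layout.

Compared with the first two, RCFILE costs more when loading data because of its columnar layout, but it offers a better compression ratio and query response. A data warehouse is written once and read many times, so overall RCFILE has a clear advantage over the other two formats.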
 
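7. Hive internal tables and external tables

If the data is used only by Hive, an internal table (also called a managed table) is fine; if the data has to be used by several databases, an external table is recommended.

8. Joining tables from different Hive databases

select * from user.student a join default.teacher b on (a.id=b.id);

9. Show a table's detailed information

desc extended aaa;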
 
Hive problems:

I. Extract the host from a URL

select parse_url(a.url, 'HOST') from social_time_2016 a limit 10;

Appendix: the URL parsing function parse_url

Syntax: parse_url(string urlString, string partToExtract [, string keyToExtract])

Return type: string

Description: returns the specified part of the URL. Valid values for partToExtract are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE and USERINFO.
 
II. Rename a partition of a partitioned table

alter table partition_biao1 partition(date='2016-12-06',province='beijing') rename to partition(date='20161206',province='beijing');

III. Drop a table partition

alter table d_moc.mocdb_gps_date_all drop partition(dt='20161207');

IV. Truncate a table

truncate table xxxx;

V. Add a column

alter table test add columns(age int);
 
VI. Using the collect_set() function

collect_set helps when a group by would otherwise have to cover several columns.

Suppose the data looks like this:

appid  app_name  app_url
1      应用汇    www.test1.com
1      阿拉工具  www.test2.com
2      小星星    www.test3.com
3      小生      www.test4.com
3      小明      www.test5.com

and the desired result is:

appid  app_name  app_url
1      应用汇    www.test1.com
2      小星星    www.test3.com
3      小生      www.test4.com

Because multi-distinct cannot be used, the result can be obtained as follows:

hive> select appid, collect_set(app_name)[0], collect_set(app_url)[0] from your_table group by appid;
 
  ------------------------------------------------------------ 
 
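Another approach: consider using min or max. Note that max is evaluated per column, so the two values may come from different rows.

select appid, max(app_name), max(app_url) from your_table group by appid;

Details:

Return type: array

collect_set(col)

Returns a set of objects with duplicate elements eliminated

collect_set returns an array of the distinct elements.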
 
VII. Implementing NOT IN with a left join (NOT IN does not support subqueries here, so it cannot be used). Note: this is not suitable for very large data volumes.
 
  select distinct a.lps_did,a.os_version as osversion, 
 
  a.device_model as model,a.manufacturer from (select distinct lps_did,os_version,device_model,manufacturer 
 
  from d_moc.rps__h_date_partition_log_4jd37oe7g8x9 where p_event_date='${dt}') a 
 
  left outer join(select distinct(lps_did) from d_moc.rps__h_date_partition_log_4jd37oe7g8x9 where p_event_date<'${dt}') b 
 
  on a.lps_did=b.lps_did where b.lps_did is null; 
 
VIII. Incremental data updates in Hive

1. Create a temporary copy of the main table, then left join against it
 
  2. 
 
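IX. Null-handling functions:

The first-non-null function: COALESCE

Syntax: COALESCE(T v1, T v2, ...)

Return type: T

Description: returns the first non-null value among the arguments; if all values are NULL, returns NULL

Example:

hive> select COALESCE(null,'100','50') from lxw_dual;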
 
X. Ways to filter out empty values in a Hive table

First: is not null

Second: length(column) > 0
 
  select * from 
 
  (select distinct case when param1_key = 'company_id' then param1_value when param1_key = 'p1_companyId' 
 
  then param1_value end as companyid, 
 
  lps_did as did,os_version as osversion,device_model as model,manufacturer as manufacturer 
 
  from d_moc.rps__h_date_partition_log_4jd37oe7g8x9 WHERE p_event_date='2016-12-12' and param1_value is not null) t 
 
  where length(t.companyid)>0 
 
For the problem I ran into with this statement, only the second approach worked.
 
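XI. Question: a table with n columns is partitioned by day, and the n-th column is a count. Each day's rows are compared with all earlier data; when all the other columns are equal, the n-th column is added to the n-th column of the earlier data. How can this be done in Hive? (With a join.)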
 
  select a.companyid,a.did,a.province,a.city, 
 
  case when b.num is null then a.num 
 
  when b.num is not null then a.num+b.num end as num 
 
  from 
 
  (select companyid,did,province,city,count(1) as num from d_moc.gps_route_data_all where dt='2016-12-27' 
 
  group by companyid,did,province,city)a  
 
  left join 
 
  (select companyid,did,province,city,count(1) as num from d_moc.gps_route_data_all where dt<'2016-12-27' 
 
  group by companyid,did,province,city)b on a.companyid=b.companyid and a.did=b.did and a.province=b.province and 
 
  a.city=b.city  
  where b.companyid is null or b.companyid is not null 
   order by a.companyid 
 
XII. Row/column conversion in Hive:

Part 1: rows to a column

1. Problem

How can Hive turn

a b 1
a b 2
a b 3
c d 4
c d 5
c d 6

into:

a b 1,2,3
c d 4,5,6

2. Data

test.txt

a b 1
a b 2
a b 3
c d 4
c d 5
c d 6

3. Answer

1. Create the table
 
  drop table tmp_jiangzl_test; 
 
  create table tmp_jiangzl_test 
 
  ( 
 
  col1 string, 
 
  col2 string, 
 
  col3 string 
 
  ) 
 
  row format delimited fields terminated by '\t' 
 
  stored as textfile; 
 
  load data local inpath '/home/jiangzl/shell/test.txt' into table tmp_jiangzl_test; 
 
2. Process
 
  select col1,col2,concat_ws(',',collect_set(col3))  
 
  from tmp_jiangzl_test  
 
  group by col1,col2; 
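To sanity-check the expected output without a cluster, the same grouping can be mimicked on the raw file with awk. This is only a local stand-in for concat_ws(',', collect_set(col3)), not Hive itself: the sample file is recreated in a temporary directory, and the sketch keeps values in input order, which Hive does not promise.

```shell
#!/bin/sh
workdir=$(mktemp -d)
# Recreate the tab-separated test.txt from the example.
printf 'a\tb\t1\na\tb\t2\na\tb\t3\nc\td\t4\nc\td\t5\nc\td\t6\n' > "$workdir/test.txt"

# Group by the first two columns and join distinct third-column values with commas.
result=$(awk -F'\t' '
{
  key = $1 FS $2
  if (!(key in joined)) {            # first row of this group
    order[++n] = key
    joined[key] = $3
    seen[key, $3] = 1
  } else if (!((key, $3) in seen)) { # skip duplicates, as collect_set would
    joined[key] = joined[key] "," $3
    seen[key, $3] = 1
  }
}
END { for (i = 1; i <= n; i++) print order[i] FS joined[order[i]] }
' "$workdir/test.txt")

echo "$result"
```

If the Hive query and this local check disagree, the usual suspect is a delimiter mismatch between the file and the table definition.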
 
Part 2: a column to rows

1. Problem

How can Hive turn

a b 1,2,3
c d 4,5,6

into:

a b 1
a b 2
a b 3
c d 4
c d 5
c d 6

2. Answer

1. Create the table
 
  drop table tmp_jiangzl_test; 
 
  create table tmp_jiangzl_test 
 
  ( 
 
  col1 string, 
 
  col2 string, 
 
  col3 string 
 
  ) 
 
  row format delimited fields terminated by '\t' 
 
  stored as textfile; 
 
2. Process:

select col1, col2, col5

from tmp_jiangzl_test a

lateral view explode(split(col3,',')) b AS col5;
 
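The teacher table is large and the student table is small.

The goal was to run select * from teacher t left join student s where s.idd != t.idd, but this triggers a Cartesian product and fails, so the following is used instead:

select * from teacher t left join student s on s.idd = t.idd where s.idd is null;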
 
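Two ways to use variables in Hive development

2013/09/13 by Crazyant

When writing data-analysis code with Hive, you often need to change run-time parameters, for example the value of a date column in a select statement, because you may want to look at different dates at different times. That requires being able to change the date dynamically. When there is a lot of code and many parameters, replacing literals with variables becomes essential. This article summarizes ways of passing parameters into Hive SQL to meet that need.

Preparing the test table and test data

First, prepare a test table and test data for the later tests: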
 
  hive> create database test; 
 
  OK 
 
  Time taken: 2.606 seconds 
 
      
Then run the SQL file that creates the table and loads the data:

[czt@www.crazyant.net testHivePara]$ hive -f student.sql
Hive history file=/tmp/crazyant.net/hive_job_log_czt_201309131615_1720869864.txt
OK
Time taken: 2.131 seconds
OK
Time taken: 0.878 seconds
Copying data from file:/home/users/czt/testdata_student
Copying file: file:/home/users/czt/testdata_student
Loading data to table test.student
OK
Time taken: 1.76 seconds
      
The contents of student.sql are:

use test;

--- student information table
create table IF NOT EXISTS student(
sno bigint comment 'student number' ,
sname string comment 'name' ,
sage bigint comment 'age' ,
pdate string comment 'enrollment date'
)
COMMENT 'student information table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/users/czt/testdata_student'
INTO TABLE student;
      
The test data file testdata_student contains:

1 name1 21 20130901
2 name2 22 20130901
3 name3 23 20130901
4 name4 24 20130901
5 name5 25 20130902
6 name6 26 20130902
7 name7 27 20130902
8 name8 28 20130902
9 name9 29 20130903
10 name10 30 20130903
11 name11 31 20130903
12 name12 32 20130904
13 name13 33 20130904
      
Method 1: set variables in the shell and use them directly in hive -e

The test shell script (shellhive.sh):

#!/bin/bash
tablename="student"
limitcount="8"

hive -S -e "use test; select * from ${tablename} limit ${limitcount};"
      
Output:

[czt@www.crazyant.net testHivePara]$ sh -x shellhive.sh
+ tablename=student
+ limitcount=8
+ hive -S -e 'use test; select * from student limit 8;'
1 name1 21 20130901
2 name2 22 20130901
3 name3 23 20130901
4 name4 24 20130901
5 name5 25 20130902
6 name6 26 20130902
7 name7 27 20130902
8 name8 28 20130902
      
Because Hive's language is SQL-like and lacks the flexibility and flow control of the shell, the shell + Hive development pattern is very common: define variables in the shell and reference them directly in the hive -e statement.

Note: variables defined with -hiveconf cannot be used this way in hive -e

Modify the shell script above to define the date parameters with -hiveconf:

#!/bin/bash
tablename="student"
limitcount="8"

hive -S \
-hiveconf enter_school_date="20130902" \
-hiveconf min_age="26" \
-e \
" use test; \
select * from ${tablename} \
where \
pdate='${hiveconf:enter_school_date}' \
and \
sage>'${hiveconf:min_age}' \
limit ${limitcount};"
      
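This fails. The script runs in the shell and the SQL sits inside double quotes, so the shell itself tries to expand ${hiveconf:enter_school_date} and ${hiveconf:min_age}. No such shell variables are defined, so empty strings are substituted in their place.

At run time the SQL statement is expanded into the following:

+ hive -S -hiveconf enter_school_date=20130902 -hiveconf min_age=26 -e 'use test; explain select * from student where pdate='\'''\'' and sage>'\'''\'' limit 8;'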
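One way around this, which the original article does not cover, is to stop the shell from expanding the Hive variable by escaping the dollar sign (single-quoting the SQL has the same effect). The sketch below only builds and prints the strings and never calls Hive. It assumes bash is installed, since it uses bash -c to reproduce the empty expansion described above.

```shell
#!/bin/sh
tablename="student"

# Unescaped: bash treats ${hiveconf:enter_school_date} as a substring
# expansion of the unset shell variable "hiveconf", which yields nothing.
broken=$(bash -c 'echo "pdate=[${hiveconf:enter_school_date}]"')

# Escaped: \$ keeps the reference literal, so Hive would receive it intact,
# while ${tablename} is still expanded by the shell.
fixed="select * from ${tablename} where pdate='\${hiveconf:enter_school_date}'"

echo "$broken"
echo "$fixed"
```

With the escaped form, shell variables and -hiveconf variables can be mixed in the same hive -e string.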
      
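Method 2: define variables with -hiveconf and use them in a SQL file

Because line breaks and the like are awkward, hive -e is only suitable for small amounts of SQL. In practice most code is written in hql files and invoked with hive -f; variables can then be defined with -hiveconf and used directly in the SQL.

First write the calling shell script:

#!/bin/bash

hive -hiveconf enter_school_date="20130902" -hiveconf min_ag="26" -f testvar.sql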
      
The contents of the called file testvar.sql:

use test;

select * from student
where
pdate='${hiveconf:enter_school_date}'
and
sage > '${hiveconf:min_ag}'
limit 8;
      
Execution:

[czt@www.crazyant.net testHivePara]$ sh -x shellhive.sh
+ hive -hiveconf enter_school_date=20130902 -hiveconf min_ag=26 -f testvar.sql
Hive history file=/tmp/czt/hive_job_log_czt_201309131651_2035045625.txt
OK
Time taken: 2.143 seconds
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Kill Command = hadoop job -kill job_20130911213659_42303
2013-09-13 16:52:00,300 Stage-1 map = 0%, reduce = 0%
2013-09-13 16:52:14,609 Stage-1 map = 28%, reduce = 0%
2013-09-13 16:52:24,642 Stage-1 map = 71%, reduce = 0%
2013-09-13 16:52:34,639 Stage-1 map = 98%, reduce = 0%
Ended Job = job_20130911213659_42303
OK
7 name7 27 20130902
8 name8 28 20130902
Time taken: 54.268 seconds
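If it is convenient to keep the SQL next to the wrapper, a quoted here-document can write testvar.sql without the shell touching the ${hiveconf:...} references. The following is a sketch under that assumption: the temporary directory and the guard around the hive call are additions for illustration, and hive is only invoked if it is found on the PATH.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# The quoted delimiter ('EOF') disables shell expansion inside the here-document,
# so the ${hiveconf:...} references reach the file unchanged.
cat > "$workdir/testvar.sql" <<'EOF'
use test;
select * from student
where pdate='${hiveconf:enter_school_date}'
and sage > '${hiveconf:min_ag}'
limit 8;
EOF

if command -v hive >/dev/null 2>&1; then
  hive -hiveconf enter_school_date="20130902" -hiveconf min_ag="26" -f "$workdir/testvar.sql"
else
  echo "hive not found; generated $workdir/testvar.sql"
fi
```

Keeping the SQL in a separate file, as the article does, remains the simpler choice once the queries grow.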
      
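Summary

This article described two ways to use variables in Hive. The first is to define variables in the shell and reference them as ${var_name} directly in the SQL passed to hive -e. The second is to set variables with -hiveconf, in the form hive -hiveconf key=value -f run.sql, and reference them as ${hiveconf:varname} in the SQL file. Both methods cover the need to pass parameters to Hive during development and can improve development efficiency and code quality.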