describe tablename;
describe extended tablename;
describe FORMATTED tablename;  -- shows more detailed information: whether the table is compressed, the field delimiter, and so on
Replacing EXTENDED with FORMATTED provides more readable but also more verbose
output.
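For instance, on a hypothetical table named logs, the three variants look like this (the FORMATTED output includes fields such as Compressed and field.delim; the exact layout varies by Hive version):

```sql
DESCRIBE logs;            -- column names and types only
DESCRIBE EXTENDED logs;   -- adds a dense, one-line Table(...) metadata dump
DESCRIBE FORMATTED logs;  -- readable sections: Location, SerDe, Compressed, field.delim, ...
```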
Creating a table
create table callbacks_part (
  ID INT,
  BOOKMARK_STEP_ID INT,
  CLIENT_ID INT,
  TXN_ID STRING,
  SESSION_ID STRING,
  IP_ADDRESS STRING,
  METHOD STRING,
  PAGE_ID INT,
  RULE_NAME_ID INT,
  CALLBACK_TYPE_ID INT,
  PAGE_ELEMENT_ID INT,
  CALLBACK_SENSE_ID INT,
  CALLBACK_INFO STRING,
  NUM_OCCURANCES INT,
  UNIQUE_ID STRING,
  EXEC_DURATION FLOAT
)
PARTITIONED BY (EXEC_DATE STRING)
CLUSTERED BY (PAGE_ID) SORTED BY (PAGE_ID) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
You don't need to specify "EXEC_DATE STRING," in your column definition, because it is already declared by "PARTITIONED BY(EXEC_DATE STRING)". Once a column appears in PARTITIONED BY, repeating it in the CREATE column list is an error.
The key point about partitions: the partition column is not part of the table's column data at all; it does not exist in the table structure. It exists only for partitioning, and it shows up as a directory named after it under the table's HDFS directory!
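Concretely, for the callbacks_part table above, the layout under the warehouse path (the default path is assumed here and may differ on your cluster) looks roughly like this, with one directory per partition value:

```
/user/hive/warehouse/callbacks_part/
    exec_date=20131108/
        <data files>
    exec_date=20131109/
        <data files>
```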
hive> LOAD DATA LOCAL INPATH '/home/hadoop/xiaoft/hivedata/' OVERWRITE INTO TABLE gprs_log_zip_bak PARTITION (day='20131108');
Watch the file encoding when loading: Hive expects UTF-8 (Unicode) text.
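If the source files are not UTF-8 (GBK is common for Chinese log data), convert them before LOAD DATA; a sketch with hypothetical file names:

```shell
# "中国" encoded in GBK is the byte sequence D6 D0 B9 FA;
# create a small GBK sample file to stand in for a real log file.
printf '\xd6\xd0\xb9\xfa\n' > sample_gbk.txt
# Convert GBK -> UTF-8 so Hive reads the text correctly.
iconv -f GBK -t UTF-8 sample_gbk.txt > sample_utf8.txt
cat sample_utf8.txt
```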
Loading data into a managed table
Because Hive has no row-level insert, update, or delete operations, the only way to put data into a table is a bulk load operation.
Alternatively, you can write data into the correct directory with some other tool.
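One way to do the "other tools" route is to copy files into an HDFS directory yourself (e.g. with hadoop fs -put) and then attach that directory as a partition; the path and partition value below are hypothetical:

```sql
ALTER TABLE gprs_log_zip_bak
  ADD PARTITION (day='20131109')
  LOCATION '/data/gprs/day=20131109';
```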
Workaround when the file format differs:
You can change this in your config files if you want these settings to persist outside the session (including other people's Hive and MapReduce jobs if you are sharing a cluster). I use SnappyCodec because it works with Hive out of the box, is splittable, and gives good compression/decompression for the CPU time spent. You might decide a different codec is more suitable to your needs.
Now how do you apply all these options if all your data is in a CSV format? The easiest way is to create a table on top of the CSV files, then create another table with the file format and SerDe you want, then insert the data from the CSV-backed table into the new table (making sure that you are compressing your Hive output with your codec of choice). Under the hood, Hive will take care of reading the data from one format (CSV) and writing to another (whatever you decided). After this you will have a duplicate of the data and you can drop the CSV files if you desire.
CREATE EXTERNAL TABLE csv_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
LOCATION '/user/muehlbau/yourData';

CREATE TABLE binary_table (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe'
STORED AS SEQUENCEFILE;

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE binary_table
SELECT * FROM csv_table;
The example above demonstrates how you could take advantage of all the options available to you, but do not think of it as a default, reasonable use case. Read up on the different file formats / SerDes / compression codecs and do some performance testing to settle on your approach.
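Once the insert has run, a quick sanity check is to compare row counts between the two tables; they should match:

```sql
SELECT COUNT(*) FROM csv_table;
SELECT COUNT(*) FROM binary_table;
```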
INSERT OVERWRITE TABLE gprs_log_zip PARTITION (day=20131208)
SELECT time, uid, site, label, content, flow, cost, net_type,
       gprs_type, keywords, user_agent, multi_dimension, content_name,
       bearer_type, service_type, log_time, url_stop_time, down_flow,
       url, brand_id, product_id, theme_product, validity_code,
       dev, mod, sys, ver, web, cli, act, topics