Hive Invocation Modes and Data Export

MusicDancing

Last modified 2023-12-01 16:58:55

First published 2020-08-12 19:47:41

Article link: https://blog.csdn.net/MusicDancing/article/details/107965108

1. Three ways to invoke the hive command

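1.1 Interactive mode

Statements are entered at the hive> prompt using SQL syntax.

hive                  # start the Hive CLI
hive> quit;           # leave the Hive CLI (exit; also works)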

1.2 hive -e 'SQL statement'

Suitable for short statements.

# Silent mode (-S): the MapReduce progress output is not shown
hive -S -e 'select * from t1;'

Invoking via beeline

run.sh

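#!/bin/bash
source /etc/profile.d/ecm_env.sh
# Partition date to process: two days before today
last_dt=$(date --date='2 day ago' +%Y-%m-%d)
echo "${last_dt}"

# Run fact.sql first; ext.sql runs only if fact.sql succeeds (&&)
beeline -nairflow -p'hello' -hivevar the_date="${last_dt}" -f fact.sql &&
beeline -nairflow -p'hello' -hivevar the_date="${last_dt}" -f ext.sql

# hive CLI alternative (-d is short for --define):
# hive -f ff.sql -d the_date=${last_dt}
echo "---all done!"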

Routine scheduling

cd /data/dm_tag/cron && nohup bash -x run.sh > logs/log_$(date +'%Y-%m-%d') 2>&1 &

1.3 hive -f hive-script.sql

Suitable for multi-statement scripts.

test.sql

use db;
select *
from test_11
where id='${hivevar:id}' and dt='${hivevar:the_date}';

# Passing parameters

hive \
  -hivevar id=1  \
  -hivevar the_date=2022-01-01  \
  -S -f test.sql

or
/app/hadoop/hive/bin/hive -f test.sql -hivevar the_date=${last_dt}
# hive -f test.sql -d the_date=${last_dt}
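
The parameter passing above can also be driven from Python. The following is a minimal sketch and not part of the original post: `build_hive_command` is a hypothetical helper that turns a dict into repeated `-hivevar name=value` arguments. It only builds the argument list; actually running it assumes that `hive` is on the PATH.

```python
import shlex


def build_hive_command(script, hivevars, silent=True):
    """Build the argv list for `hive -f <script>` with -hivevar arguments.

    Hypothetical helper for illustration; it does not execute anything.
    """
    cmd = ["hive"]
    for name, value in hivevars.items():
        # Each variable becomes: -hivevar name=value
        cmd += ["-hivevar", f"{name}={value}"]
    if silent:
        cmd.append("-S")
    cmd += ["-f", script]
    return cmd


cmd = build_hive_command("test.sql", {"id": 1, "the_date": "2022-01-01"})
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than one shell string avoids quoting problems when a value contains spaces.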

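A longer example: a wrapper script that accepts an optional date range, checks that the upstream partitions are complete, and then runs a Hive script with -hiveconf parameters. The helpers check_hdfs_exit and if_error_exit and the variables timeout and mp_queue are not defined in the script itself; they are presumably provided by util.sh.

#!/bin/bash
source /etc/profile
source util.sh

# Arguments: [start_dt [end_dt]]; with no arguments both default to yesterday
if [ $# -eq 1 ]; then
    start_dt=$1
    end_dt=$1
elif [ $# -eq 2 ]; then
    start_dt=$1
    end_dt=$2
elif [ $# -eq 0 ]; then
    # use default date, one day ago
    start_dt=$(date -d '-1 day' "+%Y-%m-%d %H:%M")
    end_dt=$(date -d '-1 day' "+%Y-%m-%d %H:%M")
fi
# Convert both bounds to epoch seconds
start_sec=$(date -d "$start_dt" +%s)
end_sec=$(date -d "$end_dt" +%s)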


function unit_active(){
    local runDt=$1
    # Day after and day before runDt ("-1 days ago" means one day ahead)
    local runNextDt=$(date -d "${runDt} -1 days ago" +%Y-%m-%d)
    local runPrevDt=$(date -d "${runDt} 1 days ago" +%Y-%m-%d)
    local monthAgo=$(date -d "${runDt} 30 days ago" +%Y-%m-%d)
    local start_ts=$(date -d "$monthAgo" +%s)
    local end_ts=$(date -d "$runDt" +%s)
    # Check the _SUCCESS marker of every daily partition from monthAgo through runDt
    for ((ts=start_ts; ts<=end_ts; ts+=86400))
    do
        local curr_dt=$(date -d "@${ts}" +%Y-%m-%d)
        local join_active_d="/user/hive/warehouse/ad/adst.db/ocpc_join_active_d/dt=${curr_dt}/_SUCCESS"
        check_hdfs_exit ${join_active_d} ${timeout}
    done
    hive -hiveconf mp_queue=${mp_queue} \
        -hiveconf tg_dt=${runDt} \
        -hiveconf tg_dt_m30=${monthAgo} \
        -f ocpc_new_unit_active_d.hql
    if_error_exit "run unit_active: ${runDt}"
    # Mark the output partition as complete
    ${HADOOP_HOME}/bin/hdfs dfs -touchz /user/hive/warehouse/ad/adst.db/ocpc_new_unit_active_d/dt=${runDt}/_SUCCESS
}


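# current_time is not set anywhere in this script; it is presumably provided by util.sh.
# If it is empty, GNU date resolves "-0 hours" to today.
dt=$(date +"%Y-%m-%d" -d "-0 hours ${current_time}")
unit_active "${dt}"
echo "---done!!!"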
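
The date arithmetic inside unit_active can be sketched in Python as follows. This is an illustration only: `partition_markers` is a hypothetical name, and the path template is copied from the shell script. It lists the `_SUCCESS` marker for every day from 30 days before the run date up to and including the run date, which is the set of partitions the shell loop checks.

```python
from datetime import date, timedelta

# Path template taken from the shell script above
MARKER = "/user/hive/warehouse/ad/adst.db/ocpc_join_active_d/dt={dt}/_SUCCESS"


def partition_markers(run_dt, days_back=30):
    """Return the _SUCCESS paths for run_dt - days_back .. run_dt, inclusive."""
    start = run_dt - timedelta(days=days_back)
    return [
        MARKER.format(dt=(start + timedelta(days=i)).isoformat())
        for i in range(days_back + 1)
    ]


markers = partition_markers(date(2022, 1, 31))
print(len(markers))
print(markers[0])
print(markers[-1])
```

Working with calendar dates instead of adding 86400 seconds per step also sidesteps days that are not exactly 24 hours long in time zones with daylight saving.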

2. Exporting Hive table data to a local file

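2.1 Run a SQL statement from the shell and write the result to a file

# hive.cli.print.header=false omits the column header row from the output
hive -e "
set hive.cli.print.header=false;
select id,
       name
from test.table_01
limit 100;
" > sample.txt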

2.2 Convert the result to an Excel file

1. In Excel, choose File --> Open.

2. Choose the file format.

3. Save the file.

4. Rename the file and choose the format to save it in.
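
As an alternative to the manual steps, the tab-separated output of hive -e can be rewritten as a CSV file that Excel opens directly. This is a minimal sketch using only the Python standard library; `tsv_to_csv` is a hypothetical helper, the demo data is made up, and the sketch assumes that field values contain no embedded tabs or newlines.

```python
import csv
import os
import tempfile


def tsv_to_csv(src, dst, header=None):
    """Rewrite a tab-separated file as comma-separated, optionally adding a header row."""
    rows_written = 0
    # utf-8-sig writes a byte order mark so that Excel detects UTF-8
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8-sig") as fout:
        # QUOTE_NONE: fields are read literally, quote characters included
        reader = csv.reader(fin, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(fout)
        if header:
            writer.writerow(header)
        for row in reader:
            writer.writerow(row)
            rows_written += 1
    return rows_written


# Demo on a throwaway file; with real data pass "sample.txt" as src
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.txt")
dst = os.path.join(workdir, "sample.csv")
with open(src, "w", encoding="utf-8") as f:
    f.write("1\tAlice\n2\tBob, Jr.\n")
count = tsv_to_csv(src, dst, header=["id", "name"])
print(count)
```

The csv writer quotes any value that contains a comma, so such values stay in a single column when the file is opened in Excel.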

2.3 Export table data to a local path (with a comma "," as the field delimiter)

insert overwrite local directory '/data/zz/aa'
row format delimited fields terminated by ','
select uid
from table_name;

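Here aa is a directory, not a file: the result is written as one or more files inside it, and existing contents of the directory are overwritten.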

2.4 Merge multiple part files on HDFS and export them to a local file

hadoop fs -getmerge hdfs://wxlx/user/hive/warehouse/zz.db/table_name/pt=2021-07-04 aa

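3. Exporting large volumes of data from Hue

3.1 Export in batches

Hue cannot export very many rows in one go; the limit is typically around 500,000 rows. To export 1,000,000 rows, export them in batches of 200,000 rows each and then merge the batches.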

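select *
from table1
order by uid           -- the sort key must be unique so that batches do not overlap
limit 200000;          -- rows [0, 200,000) -> part-1.csv

-- Run the same query once per batch, changing only the limit clause (limit offset, count):
-- limit 200000, 200000   rows [200,000, 400,000) -> part-2.csv
-- limit 400000, 200000   rows [400,000, 600,000) -> part-3.csv
-- limit 600000, 200000   rows [600,000, 800,000) -> part-4.csv
-- limit 800000, 200000   rows [800,000, 1,000,000) -> part-5.csv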
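
The offsets for the batches can be generated rather than typed by hand. This is a small sketch in which `batch_limits` and `QUERY` are hypothetical names; it only prints the query text for each batch, and each query still has to be run and downloaded separately in Hue.

```python
def batch_limits(total_rows, batch_size):
    """Yield (part_number, offset, row_count) tuples covering total_rows."""
    part = 1
    for offset in range(0, total_rows, batch_size):
        # The last batch may be shorter than batch_size
        yield part, offset, min(batch_size, total_rows - offset)
        part += 1


QUERY = "select * from table1 order by uid limit {offset}, {count}"

for part, offset, count in batch_limits(1_000_000, 200_000):
    print(f"part-{part}.csv: " + QUERY.format(offset=offset, count=count))
```

For 1,000,000 rows in batches of 200,000 this produces the same five offset and count pairs as the limit clauses above, with the first written as `limit 0, 200000`.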

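Then merge the five files with pandas.

import pandas as pd

parts = [pd.read_csv(f"part-{i}.csv") for i in range(1, 6)]  # the five Hue downloads
df = pd.concat(parts, ignore_index=True)
print(len(df))  # total number of merged rows
output_file = "merged.csv"  # assumed name for the merged file
df.to_csv(output_file, index=False)

output_file is the final merged file.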

3.2 Export with spark-sql on the cluster

spark-sql    # enter the spark-sql shell

insert overwrite local directory './data' 
row format delimited fields terminated by ','
select uid
from test.table_name;

Then merge the output files:

cat data/* > ./data.csv

Note: the delimiters in the resulting file may include escape characters, which have to be handled manually.
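
The merge step can also be sketched in Python, which makes the file order explicit and leaves out bookkeeping files. `merge_parts` is a hypothetical helper, the demo file names are made up, and the rule of skipping names that start with "." or "_" (such as checksum files and _SUCCESS markers) is an assumption about what the export directory contains.

```python
import os
import shutil
import tempfile


def merge_parts(src_dir, dst_file):
    """Concatenate the data files in src_dir into dst_file, in name order.

    Names starting with "." or "_" are skipped.
    """
    names = sorted(
        n for n in os.listdir(src_dir)
        if not n.startswith((".", "_")) and os.path.isfile(os.path.join(src_dir, n))
    )
    with open(dst_file, "wb") as out:
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as part:
                shutil.copyfileobj(part, out)
    return names


# Demo on a throwaway directory; with real data use merge_parts("data", "data.csv")
demo_dir = tempfile.mkdtemp()
for name, data in [("part-00001", b"u3\n"), ("part-00000", b"u1\nu2\n"), ("_SUCCESS", b"")]:
    with open(os.path.join(demo_dir, name), "wb") as f:
        f.write(data)
merged = os.path.join(tempfile.mkdtemp(), "data.csv")
print(merge_parts(demo_dir, merged))
```

Writing the merged file outside the source directory keeps it from being picked up as an input if the merge is run again.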
