curl 获取json数据导入hive外部双分区表解决科学计数法和特殊字符问题 ——首发

最新推荐文章于 2023-04-23 10:24:39 发布

jzy3711

最新推荐文章于 2023-04-23 10:24:39 发布

阅读量1.3w

点赞数

分类专栏：数据存储大数据 hdfs hive 文章标签： hive shell curl json hdfs

本文链接：https://blog.csdn.net/jzy3711/article/details/84246075

版权

大数据同时被 3 个专栏收录

40 篇文章 1 订阅

订阅专栏

hive

13 篇文章 0 订阅

订阅专栏

数据存储

12 篇文章 0 订阅

订阅专栏

shell 脚本

hadoop dfs -mkdir /jzytest/hivePublicOpinionList/

#创建hive外部双分区表
hive -e 'create external table if not exists jzyPublicOpinionListjson
(
channel string,
publishTime string,
source string,
title string,
url string
)
partitioned by (topic_id string,dt string)
row format serde "org.apache.hive.hcatalog.data.JsonSerDe"
LOCATION "hdfs:///jzytest/hivePublicOpinionList/";'
channel_array=("21" "104" "303")
dt=$(date "+%Y-%m-%d")
for topic_id in $(cat jzyhive.out)
do
   hadoop dfs -mkdir -p /jzytest/hivePublicOpinionList/topic_id=${topic_id}/dt=${dt}
   for channel in ${channel_array[*]}
   do
   curl -X POST -d '{
       "version":"1.0",
       "caller":"pxxx",
       "componentName":"pxxx",
       "password":"pxxx",
       "callee":"tencent_poa",
       "eventId":293194931,
       "seqId":"1481860426.8432320581878515392",
       "spanId":"pxxx",
       "timestamp":1481860426,
       "interface":{
           "interfaceName":"qcloud.supervision.getPublicOpinionList",
           "para":{
               "topic_id": "'$topic_id'",
               "channel": "'$channel'",
               "page": 1,
               "page_size": 30
               }
           }
       }' http://2xx.1xx.xx2.xx4:9090/interface | awk -v head=":[" -v tail="]}" '{print substr($0, index($0,head)+length(head),index($0,tail)-index($0,head)-length(head))}' > jzytopic_id${topic_id}channel${channel}.out
   sed -i "s/},/}\n/g" jzytopic_id${topic_id}channel${channel}.out
   sed -i "s/《/<</g" jzytopic_id${topic_id}channel${channel}.out
   sed -i "s/》/>>/g" jzytopic_id${topic_id}channel${channel}.out
   hadoop dfs -put jzytopic_id${topic_id}channel${channel}.out /jzytest/hivePublicOpinionList/topic_id=${topic_id}/dt=${dt}/
   rm -rf jzytopic_id${topic_id}channel${channel}.out
   done
   hive -e "alter table jzyPublicOpinionListjson add partition (topic_id='$topic_id',dt='$dt');"
done
第一次标红，是数据格式是{*“data”[{*}]}获取{*} *的内容

第一次sed（未红）是数据的分隔符是“，”号放在hive中会出问题，而hive中默认可以按行读取所以应该换成“\n”，建表事就可以不用考虑分隔符问题

第二次标红，是对书名号进行特殊处理否则会出现以下错误：

Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected

第三次标红，是防止加载外部分区表变成科学计数法。

总结：

1.获取的json数据要用awk过滤成需要的数据

2.数据中的特殊字符需要出，比如书名号

3.数据中的字段，需要与hive表字段对应，对特殊字段需要处理，如local,如对应不上数据会显示NULL

4.分区字符串如果是纯数字，需要在alter加载分区的时候加“''”单引号，不然会出现科学计数法，例如topic_id=14538744803109468122

最后附上成功后图片，有图有真相

jzy3711

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
curl 获取json数据导入hive外部双分区表解决科学计数法和特殊字符问题 ——首发

shell 脚本hadoop dfs -mkdir /jzytest/hivePublicOpinionList/#创建hive外部双分区表hive -e 'create external table if not exists jzyPublicOpinionListjson( channel string, publishTime string, source string...
复制链接

扫一扫