As a complete beginner, at first I didn't even know how to upload a file to the big data servers, and after my manager assigned the task I had to work everything out on my own, which wasted a lot of time. In this post I'll walk through the whole flow of moving data from the big data platform into ClickHouse and then down to a local machine.
1. Hive to HDFS
First, export the data from Hive into a directory on HDFS. The export SQL looks like this:
insert overwrite directory '/tmp/sqoop/export/server_name/rpt/table_name'
row format delimited fields terminated by '|'
stored as orc
select
pay_date
,coalesce(year_week_wm,0) as year_week
,coalesce(week_day_wm,0) as week_day
,coalesce(pay_year,0) as pay_year
,coalesce(month_of_year,0) as month_of_year
,to_utc_timestamp(current_timestamp(),'GMT-8') as upd_ts
from db_name.table_name where ts between '2020-05-01' and '2020-06-09';
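After the insert finishes, it's worth confirming that the export files actually landed before moving on. A quick check from any node with HDFS access (the path is the same placeholder used in the SQL above):
hadoop fs -ls /tmp/sqoop/export/server_name/rpt/table_name          # list the exported ORC files
hadoop fs -du -s -h /tmp/sqoop/export/server_name/rpt/table_name    # total size of the export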
2. HDFS to ClickHouse
This step is handled by DataX. Here is the DataX job configuration; fill in the HDFS path, ClickHouse host, database, table name, username, and password for your own environment. Note that the dfs.client.failover.proxy.provider.<nameservice> key must use the same nameservice name as dfs.nameservices (cnprod1ha here):
{
    "job": {
        "setting": {
            "speed": {
                "channel": 2
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "defaultFS": "hdfs://cnprod1ha",
                        "fileType": "orc",
                        "path": "/tmp/sqoop/export/server_name/rpt/table_name/*",
                        "hadoopConfig": {
                            "dfs.nameservices": "cnprod1ha",
                            "dfs.ha.namenodes.cnprod1ha": "namenode1,namenode2",
                            "dfs.namenode.rpc-address.cnprod1ha.namenode1": "oser406433.cn.wal-mart.com:8020",
                            "dfs.namenode.rpc-address.cnprod1ha.namenode2": "oser406435.cn.wal-mart.com:8020",
                            "dfs.client.failover.proxy.provider.cnprod1ha": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
                        },
                        "column": [
                            { "index": 0, "type": "string" },
                            { "index": 1, "type": "string" },
                            { "index": 2, "type": "string" },
                            { "index": 3, "type": "string" },
                            { "index": 4, "type": "string" },
                            { "index": 5, "type": "string" }
                        ],
                        "fieldDelimiter": "|"
                    }
                },
                "writer": {
                    "name": "rdbmswriter",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:clickhouse://clickhouse_host:8123/clickhouse_dbname",
                                "table": [
                                    "table_name"
                                ]
                            }
                        ],
                        "username": "",
                        "password": "",
                        "table": "table_name",
                        "column": [
                            "*"
                        ],
                        "preSql": [
                            ""
                        ]
                    }
                }
            }
        ]
    }
}
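Save the configuration above as a JSON file and submit it with DataX's standard launcher. A minimal sketch, assuming DataX is unpacked under $DATAX_HOME and the job file is named hdfs_to_clickhouse.json (both names are examples, adjust to your own install):
python $DATAX_HOME/bin/datax.py hdfs_to_clickhouse.json    # submit the DataX job and wait for the final summary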
3. Export the ClickHouse table to a file on the ClickHouse server
Run the following on the ClickHouse server; it dumps the table into a Parquet file in the current directory on the server's local disk:
clickhouse-client --query "select * from clickhouse_db_name.table_name INTO OUTFILE './table_name.parquet' FORMAT Parquet"
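Once the query returns, a quick sanity check on the dump (plain shell, run from the directory where the command above was executed):
ls -lh ./table_name.parquet    # the file should exist and have a reasonable size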
4. Download the file from the ClickHouse server to your local machine
Run the following on your local machine:
scp username@server_name:~/table_name.parquet ~/Desktop/    # copy the Parquet file down to the local machine
If you need to move data in the other direction, the same steps work in reverse.
5. Upload a local file to HDFS
scp table_name.parquet username@server_address:~/    # copy the file into the home directory on the server
hadoop fs -put division.csv /tmp    # run on the server: push a file (here a CSV) into /tmp on HDFS
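If the CSV is meant for a Hive table, it can then be loaded from /tmp into the table. A sketch using the hive CLI; the database and table names here just reuse names that appear elsewhere in this post, so adjust them to your own table:
hive -e "load data inpath '/tmp/division.csv' into table db_name.dim_promotion"    # moves the file from /tmp into the table's storage location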
6. Import the Parquet file into ClickHouse
On the server where the Parquet file sits, run:
clickhouse-client --query="insert into clickhouse_db_name.table_name FORMAT Parquet" < table_name.parquet
7. Query the table to confirm that the data was imported.
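A simple way to do that check is a count query from the command line, using the same clickhouse-client as above; compare the result against the row count of the source Hive table:
clickhouse-client --query "select count(*) from clickhouse_db_name.table_name"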
8. Fixing garbled Chinese characters when importing a CSV file into Hive
ALTER TABLE dim_promotion SET SERDEPROPERTIES ('serialization.encoding'='GBK');
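If you are not sure which encoding the CSV actually uses, check it before changing the SerDe properties. A quick look from a Linux shell, using the division.csv example file from step 5:
file -i division.csv                         # prints the detected character set
iconv -f GBK -t UTF-8 division.csv | head    # preview the file decoded as GBK; garbled output means GBK is the wrong guess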