As a complete beginner, at first I didn't even know how to upload a file to the big data servers, and after my manager assigned the task I had to work everything out on my own, which wasted a lot of time. In this post I'll walk through the whole flow of moving data from the big data platform into ClickHouse and then down to a local machine.
1. Hive to HDFS
First, export the data from Hive into a directory on HDFS. The export SQL looks like this:
insert overwrite directory '/tmp/sqoop/export/server_name/rpt/table_name'
row format delimited fields terminated by '|'
stored as orc
select
pay_date
,coalesce(year_week_wm,0) as year_week
,coalesce(week_day_wm,0) as week_day
,coalesce(pay_year,0) as pay_year
,coalesce(month_of_year,0) as month_of_year
,to_utc_timestamp(current_timestamp(),'GMT-8') as upd_ts
from db_name.table_name where ts between '2020-05-01' and '2020-06-09';
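After the insert finishes, it's worth confirming that the export files actually landed before moving on. A quick check from any node with HDFS access (the path is the same placeholder used in the SQL above):
hadoop fs -ls /tmp/sqoop/export/server_name/rpt/table_name          # list the exported ORC files
hadoop fs -du -s -h /tmp/sqoop/export/server_name/rpt/table_name    # total size of the export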
2. HDFS to ClickHouse
This step is handled by DataX. Here is the DataX job configuration; fill in the HDFS path, ClickHouse host, database, table name, username, and password for your own environment. Note that the dfs.client.failover.proxy.provider.<nameservice> key must use the same nameservice name as dfs.nameservices (cnprod1ha here):
{
    "job": {
        "setting": {
            "speed": {
                "channel": 2
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "defaultFS": "hdfs://cnprod1ha",
                        "fileType": "orc",
                        "path": "/tmp/sqoop/export/server_name/rpt/table_name/*",
                        "hadoopConfig": {
                            "dfs.nameservices": "cnprod1ha",
                            "dfs.ha.namenodes.cnprod1ha": "namenode1,namenode2",
                            "dfs.namenode.rpc-address.cnprod1ha.namenode1": "oser406433.cn.wal-mart.com:8020",
                            "dfs.namenode.rpc-address.cnprod1ha.namenode2": "oser406435.cn.wal-mart.com:8020",
                            "dfs.client.failover.proxy.provider.cnprod1ha": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
                        },
                        "column": [
                            { "index": 0, "type": "string" },
                            { "index": 1, "type": "string" },
                            { "index": 2, "type": "string" },
                            { "index": 3, "type": "string" },
                            { "index": 4, "type": "string" },
                            { "index": 5, "type": "string" }
                        ],
                        "fieldDelimiter": "|"
                    }
                },
                "writer": {
                    "name": "rdbmswriter",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:clickhouse://clickhouse_host:8123/clickhouse_dbname",
                                "table": [
                                    "table_name"
                                ]
                            }
                        ],
                        "username": "",
                        "password": "",
                        "table": "table_name",
                        "column": [
                            "*"
                        ],
                        "preSql": [
                            ""
                        ]
                    }
                }
            }
        ]
    }
}
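Save the configuration above as a JSON file and submit it with DataX's standard launcher. A minimal sketch, assuming DataX is unpacked under $DATAX_HOME and the job file is named hdfs_to_clickhouse.json (both names are examples, adjust to your own install):
python $DATAX_HOME/bin/datax.py hdfs_to_clickhouse.json    # submit the DataX job and wait for the final summary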
3. Export the ClickHouse table to a file on the ClickHouse server
Run the following on the ClickHouse server; it dumps the table into a Parquet file in the current directory on the server's local disk:
clickhouse-client --query "select * from clickhouse_db_name.table_name INTO OUTFILE './table_name.parquet' FORMAT Parquet"
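Once the query returns, a quick sanity check on the dump (plain shell, run from the directory where the command above was executed):
ls -lh ./table_name.parquet    # the file should exist and have a reasonable size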
4. Download the file from the ClickHouse server to your local machine
Run the following on your local machine:
scp username@server_name:~/table_name.parquet ~/Desktop/    # copy the Parquet file down to the local machine
If you need to move data in the other direction, the same steps work in reverse.
5. Upload a local file to HDFS
scp table_name.parquet username@server_address:~/    # copy the file into the home directory on the server
hadoop fs -put division.csv /tmp    # run on the server: push a file (here a CSV) into /tmp on HDFS
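If the CSV is meant for a Hive table, it can then be loaded from /tmp into the table. A sketch using the hive CLI; the database and table names here just reuse names that appear elsewhere in this post, so adjust them to your own table:
hive -e "load data inpath '/tmp/division.csv' into table db_name.dim_promotion"    # moves the file from /tmp into the table's storage location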
6. Import the Parquet file into ClickHouse
On the server where the Parquet file sits, run:
clickhouse-client --query="insert into clickhouse_db_name.table_name FORMAT Parquet" < table_name.parquet
7. Query the table to confirm that the data was imported.
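A simple way to do that check is a count query from the command line, using the same clickhouse-client as above; compare the result against the row count of the source Hive table:
clickhouse-client --query "select count(*) from clickhouse_db_name.table_name"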
8. Fixing garbled Chinese characters when importing a CSV file into Hive
ALTER TABLE dim_promotion SET SERDEPROPERTIES ('serialization.encoding'='GBK');
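If you are not sure which encoding the CSV actually uses, check it before changing the SerDe properties. A quick look from a Linux shell, using the division.csv example file from step 5:
file -i division.csv                         # prints the detected character set
iconv -f GBK -t UTF-8 division.csv | head    # preview the file decoded as GBK; garbled output means GBK is the wrong guess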