First, set up the Kerberos client environment on the machine where DataHub is deployed.
Install the Kerberos client:
yum -y install krb5-libs krb5-workstation
Copy the KDC configuration and the keytab from the cluster:
scp hadoop102:/etc/krb5.conf /etc/krb5.conf
scp hadoop102:/etc/security/keytab/ranger_all_publc.keytab /etc/security/keytab/
Verify that the client can authenticate against the KDC:
kinit -kt /etc/security/keytab/ranger_all_publc.keytab hadoop/hadoop102@ZHT.COM
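The same verification can be driven from Python, which is handy if you want the scheduled ingestion script to re-acquire a ticket before running. This is a minimal sketch; the keytab path and principal are the ones used above, and the actual kinit call is left commented out so the snippet is safe to run anywhere.

```python
import shutil
import subprocess

KEYTAB = "/etc/security/keytab/ranger_all_publc.keytab"
PRINCIPAL = "hadoop/hadoop102@ZHT.COM"

def kinit_cmd(keytab, principal):
    # Equivalent to the shell command: kinit -kt <keytab> <principal>
    return ["kinit", "-kt", keytab, principal]

# Run only on a machine that has the Kerberos client and the keytab in place:
if shutil.which("kinit"):
    # subprocess.run(kinit_cmd(KEYTAB, PRINCIPAL), check=True)
    pass
```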
The Hive data source is configured here from the command line rather than through the web UI: configuring it in the UI fails with an error that there is no corresponding authorization in the Kerberos database, presumably because DataHub's Docker environment does not have the required Kerberos credentials.
Install the SASL packages; a later step fails with a missing-package error without them:
yum install cyrus-sasl cyrus-sasl-lib cyrus-sasl-plain cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-md5
pip install sasl
Install the Hive plugin:
pip install 'acryl-datahub[hive]'
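To confirm the plugin installed correctly before writing any recipe, the DataHub CLI's `datahub check plugins` command lists the source/sink plugins it can load ("hive" should appear in the output). A small sketch wrapping that check, with the actual invocation commented out:

```python
import shutil
import subprocess

def plugin_check_cmd():
    # `datahub check plugins` lists the source/sink plugins the CLI can load;
    # "hive" should appear after installing acryl-datahub[hive]
    return ["python3", "-m", "datahub", "check", "plugins"]

if shutil.which("python3"):
    # Uncomment on the ingestion host:
    # subprocess.run(plugin_check_cmd(), check=True)
    pass
```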
Write the corresponding Hive recipe and save it as hive.yml:
source:
  type: hive
  config:
    host_port: 'xxxxxx:10000'
    database: hudi_dwd
    username: hive
    stateful_ingestion:
      enabled: false
    profiling:
      enabled: true
      profile_table_level_only: false
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive
    scheme: hive+http
Then run the ingestion: python -m datahub --debug ingest -c hive.yml
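If the ingestion fails, it helps to verify the connection settings directly with PyHive (the library the hive source uses) before another full run. A minimal sketch of the same settings as connect kwargs — the host "hadoop102" and database "hudi_dwd" are illustrative, and the actual connection (which needs a reachable, Kerberized HiveServer2 and a valid ticket) is left commented out:

```python
def hive_connect_kwargs(host, database):
    # Mirrors host_port, username, and connect_args from hive.yml
    return {
        "host": host,
        "port": 10000,
        "username": "hive",
        "database": database,
        "auth": "KERBEROS",
        "kerberos_service_name": "hive",
    }

# With PyHive (pulled in by acryl-datahub[hive]) and a valid Kerberos ticket:
# from pyhive import hive
# conn = hive.connect(**hive_connect_kwargs("hadoop102", "hudi_dwd"))
# cur = conn.cursor()
# cur.execute("show databases")
# print(cur.fetchall())
```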
If the datahub-gms port has been changed, specify the sink explicitly:
source:
  type: hive
  config:
    host_port: xxxx:10000
    database: test
    username: hive
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive
    scheme: 'hive+https'
sink:
  type: "datahub-rest"
  config:
    server: 'http://127.0.0.1:18080'
    token: fill in if you have one
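The same recipe can be built programmatically, which is what the batch script below effectively does when it rewrites the database field per database. A sketch assuming the remapped gms address above; the token is optional and only emitted when provided:

```python
def build_recipe(host_port, database, gms_server, token=None):
    # Mirrors the recipe layout above: hive source plus datahub-rest sink
    sink_config = {"server": gms_server}
    if token:
        sink_config["token"] = token
    return {
        "source": {
            "type": "hive",
            "config": {
                "host_port": host_port,
                "database": database,
                "username": "hive",
                "options": {
                    "connect_args": {
                        "auth": "KERBEROS",
                        "kerberos_service_name": "hive",
                    }
                },
                "scheme": "hive+https",
            },
        },
        "sink": {"type": "datahub-rest", "config": sink_config},
    }

# Example: build_recipe("xxxx:10000", "test", "http://127.0.0.1:18080")
```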
Script for scheduled ingestion of the Hive databases:
import os
import subprocess

import yaml

# Remove previously generated .yml files from the working directory
for file in os.listdir('.'):
    if file.endswith('.yml'):
        os.remove(file)

# Databases to ingest; one recipe file is generated per database
databases = ['hudi_ads', 'hudi_dict', 'test']

# Load the template recipe and substitute the database name
for database in databases:
    with open('/root/datalineage/sink.yml', 'r') as f:
        data = yaml.load(f, Loader=yaml.FullLoader)
    data['source']['config']['database'] = database
    with open('sink_{}.yml'.format(database), 'w') as f:
        yaml.dump(data, f)

# Run an ingestion for every recipe in the directory; pass the full path,
# since the script may not be run from that directory
yml_dir = '/root/datalineage/touchYml'
yml_files = [f for f in os.listdir(yml_dir) if f.endswith('.yml')]
for file in yml_files:
    subprocess.run(['python3', '-m', 'datahub', 'ingest', '-c',
                    os.path.join(yml_dir, file)], check=True)
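For a cron-driven batch, aborting on the first failing database (check=True) may be undesirable. A hedged variant of the ingestion loop that logs failures and continues, returning the failed recipe paths so the caller can alert on them:

```python
import logging
import os
import subprocess

logging.basicConfig(level=logging.INFO)

def run_recipes(yml_dir):
    """Run every .yml recipe in yml_dir; collect failures instead of aborting."""
    failed = []
    for name in sorted(os.listdir(yml_dir)):
        if not name.endswith('.yml'):
            continue
        path = os.path.join(yml_dir, name)
        result = subprocess.run(['python3', '-m', 'datahub', 'ingest', '-c', path])
        if result.returncode != 0:
            logging.error('ingestion failed for %s', path)
            failed.append(path)
    return failed

# Example: failed = run_recipes('/root/datalineage/touchYml')
```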