Step 1. 利用HIVE Hook LineageLogger 获取字段之间的依赖关系
注意:LineageLogger Hook 是Hive2.0版本
之后存在的,如果HIVE版本不够需要升级HIVE版本。PS.CDH有的HIVE版本添加此功能
下面以hive 2.5.3版本为例
添加参数
vim /usr/local/hive/conf/hive-site.xml
<property>
<name>hive.exec.post.hooks</name>
<value>org.apache.hadoop.hive.ql.hooks.LineageLogger</value>
</property>
配置hook输出
vim ${HIVE_HOME}/conf/hive-log4j2.properties
og4j.logger.org.apache.hadoop.hive.ql.hooks.LineageLogger=INFO
输出位置在 hive-log4j2.properties 的 property.hive.log.dir 参数
测试输出
hive> desc test_table;
OK
c1 string
c2 bigint
c3 int
hive> select c1, max(c2) as max_c2 from test_table group by c1;
hive.log 日志输出如下
JSON格式化如下
{
"edges": [{
"sources": [2],
"targets": [0],
"edgeType": "PROJECTION"
}, {
"sources": [3],
"targets": [1],
"expression": "max(ods.test_table.c2)",
"edgeType": "PROJECTION"
}],
"vertices": [{
"id": 0,
"vertexType": "COLUMN",
"vertexId": "c1"
}, {
"id": 1,
"vertexType": "COLUMN",
"vertexId": "max_c2"
}, {
"id": 2,
"vertexType": "COLUMN",
"vertexId": "ods.test_table.c1"
}, {
"id": 3,
"vertexType": "COLUMN",
"vertexId": "ods.test_table.c2"
}]
}
比如edges中第二条 sources [3], targets [1]
代表 ods.test_table.c2 到 max_c2
Step 2. 利用已有的血缘log 清洗后存入 Neo4j
from py2neo import Graph
class Neo4jUtil:
def __init__(self, url, username, password):
self.graph = Graph(url, auth=(username, password))
def empty(self):
self.graph.run("match (n) detach delete n")
# 创建节点
def create_column_node(self, node_name):
match_result = self.graph.run("MATCH (a:Column) WHERE a.name = '%s' RETURN a" % node_name).data()
if len(match_result) == 0:
self.graph.run("CREATE (n:Column { name: '%s' })" % node_name)
# 创建关系
def create_column_relation(self, origin_name, dest_name):
match_result = self.graph.run(
"MATCH (a:Column)-[r:beDepColumn]->(b:Column) WHERE a.name = '%s' AND b.name = '%s' RETURN r" %
(origin_name, dest_name)).data()
if len(match_result) == 0:
self