Compiling and Installing Griffin
Preface
Apache Griffin is an open-source data quality solution for distributed data systems such as Hadoop, Spark, and Storm. It provides a unified process for defining and measuring the quality of data sets and reporting problems in a timely manner. Griffin supports both batch and streaming modes of data quality checking and can measure data assets from multiple dimensions, improving the accuracy and trustworthiness of the data. For example, after an offline job finishes it can check whether the record counts on the source and target sides match, or whether the source table contains null values.
1. Introduction and Origins of Griffin
At eBay, checking data quality is a challenge when working with big data (Hadoop or other streaming systems). Different teams built their own tools to detect and analyze data quality problems within their own domains. We therefore wanted to build a generally applicable platform that provides shared infrastructure and common features for solving typical data quality problems, so as to deliver trustworthy data.
Once the data volume reaches a certain scale and spans multiple platforms (streaming and batch data), validating data counts becomes very time-consuming and labor-intensive. Take eBay's real-time personalization platform as an example: we process roughly 600M records of data every day, and in such a complex environment and at such a scale, data quality becomes a major challenge.
In eBay's data processing, the following problems were observed:
When data flows from different sources into different application systems, there is no end-to-end unified view for tracking data lineage. As a result, a lot of unnecessary time is spent identifying and resolving data quality issues.
There is no real-time data quality monitoring system. What is needed is a system that supports data asset registration, data quality model definition, visualization and monitoring of data quality results, and timely alerts when problems are detected.
There is no shared platform or API service that lets project teams solve common data quality problems without maintaining their own hardware and software environments.
To address these problems, we decided to build Griffin: an open-source data quality solution for distributed data systems such as Hadoop, Spark, and Storm, providing a unified process for defining and measuring the quality of data sets and reporting problems in a timely manner.
2. Compilation and Installation
- Environment Preparation
JDK (version 1.8 or later)
Maven (Apache Maven 3.6.3)
MySQL database (PostgreSQL also works; MySQL 5.7 is used here)
npm (version 6.14.6; version 6.0.0+ is required, used to build the ui module)
Scala (version 2.11.8)
Hadoop (version 3.0.0 or later)
Hive (version 2.1.1)
Spark (version 2.4.0)
Livy (version 0.5.0)
Elasticsearch (version 5.0 or later)
Zookeeper (version 3.4.5)
The environment used here is an HDP 3.1.0.0 platform built with Ambari 2.7.3.
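Before compiling, it can help to confirm what is actually on the PATH. A minimal check, assuming the standard command-line clients are installed (adjust the Elasticsearch address to your environment):

```bash
# Print the versions of the main prerequisites
java -version
mvn -v
mysql --version
node -v && npm -v
hadoop version
hive --version
spark-submit --version
# Elasticsearch reports its version over HTTP (host/port assumed to be localhost:9200)
curl -s http://localhost:9200
```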
- Download the Source Code
```bash
# The source is downloaded directly from GitHub; the latest version at the time of writing
# is 0.7.0, but the 0.6.0 release archive is used here
wget https://github.com/apache/griffin/archive/griffin-0.6.0.tar.gz
```
- Directory Overview
The griffin directory contains four modules: griffin-doc, measure, service, and ui. griffin-doc holds Griffin's documentation; measure interacts with Spark and runs the measurement jobs; service is implemented with Spring Boot and provides the RESTful APIs the ui module needs, stores the measurement jobs, and presents the measurement results. After importing and building the source, the configuration files need to be modified.
- Create the quartz Database
```sql
-- MySQL is used here; create the database and tables that Quartz needs
create database quartz;
-- then import the src\main\resources\Init_quartz_mysql_innodb.sql file from the service module
```
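As a sketch of that import step, assuming the source tree was unpacked to ~/griffin and MySQL runs on the local host (paths and credentials are placeholders):

```bash
# Create the quartz database and load the Quartz table definitions shipped with the service module
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS quartz;"
mysql -u root -p quartz < ~/griffin/service/src/main/resources/Init_quartz_mysql_innodb.sql
```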
- Modify the Configuration Files in the service Module (a quick connectivity check for the endpoints configured here is sketched after these files)
- src\main\resources\application.properties
```properties
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

# modified: service name, and MySQL is used for the metadata store
spring.application.name=griffin_service
# the default port is 8080, which may conflict with other services; changed to 8085
server.port=8085
spring.datasource.url=jdbc:mysql://xxx.xxx.xxx.xxx:3306/quartz?useSSL=false&autoReconnect=true&failOverReadOnly=false
spring.datasource.username=root
spring.datasource.password=passw0rd
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true

# Hive metastore
# use the value of the hive.metastore.uris property from hive-site.xml
hive.metastore.uris=thrift://node02:9083
hive.metastore.dbname=hive
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
# Hive jdbc
hive.jdbc.className=org.apache.hive.jdbc.HiveDriver
hive.jdbc.url=jdbc:hive2://localhost:10000/
hive.need.kerberos=false
hive.keytab.user=xxx@xx.com
hive.keytab.path=/path/to/keytab/file
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000

# Kafka schema registry
kafka.schema.registry.url=http://localhost:8081

# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expired time of job instance which is 7 days that is 604800000 milliseconds. Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
# schedule predicate job every 5 minutes and repeat 12 times at most
# interval time unit s:second m:minute h:hour d:day, only support these four units
predicate.job.interval=5m
predicate.job.repeat.count=12

# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=

# login strategy ("default" or "ldap")
login.strategy=default
# ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})

# hdfs default name
# change this to the NameNode address
fs.defaultFS=hdfs://node01:8020

# elasticsearch settings
elasticsearch.host=9.186.66.183
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password

# livy settings
livy.uri=http://node01:8999/batches
livy.need.queue=false
livy.task.max.concurrent.count=20
livy.task.submit.interval.second=3
livy.task.appId.retry.count=3
livy.need.kerberos=false
livy.server.auth.kerberos.principal=livy/kerberos.principal
livy.server.auth.kerberos.keytab=/path/to/livy/keytab/file

# yarn url
yarn.uri=http://node01:8088

# griffin event listener
internal.event.listeners=GriffinJobEventHook

logging.file=logs/griffin-service.log
```
- src\main\resources\quartz.properties
```properties
# MySQL is used
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
```
- src\main\resources\sparkProperties.json (this is where the measure jar uploaded to HDFS is referenced)
{ "file": "hdfs://node01:8020/griffin/griffin-measure.jar", "className": "org.apache.griffin.measure.Application", "queue": "default", "numExecutors": 2, "executorCores": 1, "driverMemory": "1g", "executorMemory": "1g", "conf": { "spark.yarn.dist.files": "hdfs://node01:8020/home/spark_conf/hive-site.xml" }, "files": [ ] }
- src\main\resources\env\env_batch.json
{ "spark": { "log.level": "WARN" }, "sinks": [ { "name": "console", "type": "CONSOLE", "config": { "max.log.lines": 10 } }, { "name": "hdfs", "type": "HDFS", "config": { "path": "hdfs://node01:8020/griffin/persist", "max.persist.lines": 10000, "max.lines.per.file": 10000 } }, { "name": "elasticsearch", "type": "ELASTICSEARCH", "config": { "method": "post", "api": "http://9.186.66.183:9200/griffin/accuracy", "connection.timeout": "1m", "retry": 10 } } ], "griffin.checkpoint": [] }
- src\main\resources\env\env_streaming.json
{ "spark": { "log.level": "WARN", "checkpoint.dir": "hdfs://node01:8020/griffin/checkpoint/${JOB_NAME}", "init.clear": true, "batch.interval": "1m", "process.interval": "5m", "config": { "spark.default.parallelism": 4, "spark.task.maxFailures": 5, "spark.streaming.kafkaMaxRatePerPartition": 1000, "spark.streaming.concurrentJobs": 4, "spark.yarn.maxAppAttempts": 5, "spark.yarn.am.attemptFailuresValidityInterval": "1h", "spark.yarn.max.executor.failures": 120, "spark.yarn.executor.failuresValidityInterval": "1h", "spark.hadoop.fs.hdfs.impl.disable.cache": true } }, "sinks": [ { "type": "CONSOLE", "config": { "max.log.lines": 100 } }, { "type": "HDFS", "config": { "path": "hdfs://node01:8020/griffin/persist", "max.persist.lines": 10000, "max.lines.per.file": 10000 } }, { "type": "ELASTICSEARCH", "config": { "method": "post", "api": "http://9.186.66.183:9200/griffin/accuracy" } } ], "griffin.checkpoint": [ { "type": "zk", "config": { "hosts": "node01:2181,node02:2181,node03:2181,node04:2181,", "namespace": "griffin/infocache", "lock.path": "lock", "mode": "persist", "init.clear": true, "close.clear": false } } ] }
- Elasticsearch Configuration
```bash
# Reason: Elasticsearch 7 no longer supports specifying an index type; the default type is _doc.
# To change this, set include_type_name: true (not tested here; the official docs describe it,
# but it is not recommended because the field is removed in Elasticsearch 8).
# Official docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html
# Elasticsearch 7.8 is used here, so the request from the Griffin site does not work as-is;
# compare it against the official documentation.
# The modified script is as follows:
curl -H "Content-Type: application/json" -XPUT http://localhost:9200/griffin -d '
{
  "aliases": {},
  "mappings": {
    "properties": {
      "name": {
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        },
        "type": "text"
      },
      "tmst": {
        "type": "date"
      }
    }
  },
  "settings": {
    "index": {
      "number_of_replicas": "2",
      "number_of_shards": "5"
    }
  }
}
'
```
If the index is created successfully, Elasticsearch returns an acknowledgement response.
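To confirm the index and mapping really exist, a couple of standard Elasticsearch queries can be run against the same host and port:

```bash
# Dump the settings and the name/tmst mapping created above
curl -s http://localhost:9200/griffin
# Or just check that the index shows up and is healthy
curl -s "http://localhost:9200/_cat/indices/griffin?v"
```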
- Modify the pom File of the service Module
```xml
<!-- PostgreSQL is the default; switch to MySQL here -->
<!--
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>${postgresql.version}</version>
</dependency>
-->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>${mysql.java.version}</version>
</dependency>

<!-- uncomment the mysql profile -->
<profiles>
    <!--if you need mysql, please uncomment mysql-connector-java -->
    <profile>
        <id>mysql</id>
        <activation>
            <property>
                <name>mysql</name>
            </property>
        </activation>
    </profile>
    <profile>
        <id>dev</id>
        <activation>
            <property>
                <name>dev</name>
            </property>
        </activation>
    </profile>
    <profile>
        <id>postgresql</id>
        <activation>
            <activeByDefault>true</activeByDefault>
            <property>
                <name>prod</name>
            </property>
        </activation>
    </profile>
</profiles>

<!-- by default spring-boot-maven-plugin uses the build-info goal when packaging;
     change it to repackage and configure mainClass -->
<plugin>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-maven-plugin</artifactId>
    <version>${spring-boot-maven-plugin.version}</version>
    <executions>
        <execution>
            <goals>
                <goal>repackage</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <mainClass>org.apache.griffin.core.GriffinWebApplication</mainClass>
        <executable>false</executable>
    </configuration>
</plugin>
```
- Configure the measure Module
- src\main\resources\env-batch.json
{ "spark": { "log.level": "WARN", "config": { "spark.master": "node01" } }, "sinks": [ { "name": "console", "type": "CONSOLE", "config": { "max.log.lines": 10 } }, { "name": "hdfs", "type": "HDFS", "config": { "path": "hdfs://node01:8020/griffin/batch/persist", "max.persist.lines": 10000, "max.lines.per.file": 10000 } }, { "name": "elasticsearch", "type": "ELASTICSEARCH", "config": { "method": "post", "api": "http://9.186.66.183:9200/griffin/accuracy", "connection.timeout": "1m", "retry": 10 } } ], "griffin.checkpoint": [] }
- src\main\resources\env-streaming.json
{ "spark": { "log.level": "WARN", "checkpoint.dir": "hdfs://node01:8020/test/griffin/cp", "batch.interval": "2s", "process.interval": "10s", "init.clear": true, "config": { "spark.master": "node01", "spark.task.maxFailures": 5, "spark.streaming.kafkaMaxRatePerPartition": 1000, "spark.streaming.concurrentJobs": 4, "spark.yarn.maxAppAttempts": 5, "spark.yarn.am.attemptFailuresValidityInterval": "1h", "spark.yarn.max.executor.failures": 120, "spark.yarn.executor.failuresValidityInterval": "1h", "spark.hadoop.fs.hdfs.impl.disable.cache": true } }, "sinks": [ { "name": "consoleSink", "type": "CONSOLE", "config": { "max.log.lines": 100 } }, { "name": "hdfsSink", "type": "HDFS", "config": { "path": "hdfs://node01:8020/griffin/streaming/persist", "max.persist.lines": 10000, "max.lines.per.file": 10000 } }, { "name": "elasticSink", "type": "ELASTICSEARCH", "config": { "method": "post", "api": "http://9.186.66.183:9200/griffin/accuracy" } } ], "griffin.checkpoint": [ { "type": "zk", "config": { "hosts": "node01:2181,node02:2181,node03:2181,node04:2181", "namespace": "griffin/infocache", "lock.path": "lock", "mode": "persist", "init.clear": true, "close.clear": false } } ] }
- src\main\resources\config-streaming.json
{ "name": "prof_streaming", "process.type": "streaming", "data.sources": [ { "name": "source", "baseline": true, "connector": { "type": "kafka", "version": "0.8", "dataframe.name": "kafka", "config": { "kafka.config": { "bootstrap.servers": "10.147.177.107:9092", "group.id": "group1", "auto.offset.reset": "smallest", "auto.commit.enable": "false" }, "topics": "sss", "key.type": "java.lang.String", "value.type": "java.lang.String" }, "pre.proc": [ { "dsl.type": "df-ops", "in.dataframe.name": "kafka", "out.dataframe.name": "out1", "rule": "from_json" }, { "dsl.type": "spark-sql", "in.dataframe.name": "out1", "out.datafrmae.name": "out3", "rule": "select name, age from out1" } ] }, "checkpoint": { "file.path": "hdfs://node01:8020/griffin/streaming/dump/source", "info.path": "source", "ready.time.interval": "10s", "ready.time.delay": "0", "time.range": [ "0", "0" ] } } ], "evaluate.rule": { "rules": [ { "dsl.type": "griffin-dsl", "dq.type": "profiling", "out.dataframe.name": "prof", "rule": "select count(name) as `cnt`, max(age) as `max`, min(age) as `min` from source", "out": [ { "type": "metric", "name": "prof" } ] }, { "dsl.type": "griffin-dsl", "dq.type": "profiling", "out.dataframe.name": "grp", "rule": "select name, count(*) as `cnt` from source group by name", "out": [ { "type": "metric", "name": "name_group", "flatten": "array" } ] } ] }, "sinks": [ "elasticSink" ] }
- With this, the configuration of both modules is complete.
- Create the Corresponding Directories on HDFS
```bash
# Upload the corresponding hive-site.xml. In the Ambari-built HDP environment used here,
# the file is located at /etc/hive/3.1.0.0-78/0/hive-site.xml
sudo -u hdfs hadoop fs -mkdir -p /home/spark_conf
sudo -u hdfs hadoop fs -put /opt/cloudera/parcels/CDH/lib/hive/conf/hive-site.xml /home/spark_conf
useradd griffin
sudo -u hdfs hadoop fs -chown -R griffin /home/spark_conf
# Location of griffin-measure.jar on HDFS; spark.yarn.dist.files points to the
# hive-site.xml uploaded above
sudo -u griffin hadoop fs -mkdir -p /griffin
# Directory used in Griffin's env_batch.json
sudo -u griffin hadoop fs -mkdir -p /griffin/persist
# Directory used in Griffin's env_streaming.json
sudo -u griffin hadoop fs -mkdir -p /griffin/checkpoint
# Directories used in the measure module's env/config files
su griffin
hadoop fs -mkdir -p /test/griffin/cp
hadoop fs -mkdir -p /griffin/streaming/dump/source
hadoop fs -mkdir -p /griffin/streaming/persist
```
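Before submitting any job, it is cheap to verify the layout that was just created with plain HDFS listings:

```bash
# hive-site.xml should be present and owned by the griffin user
hadoop fs -ls /home/spark_conf
# Griffin working directories for batch, checkpoint and streaming
sudo -u griffin hadoop fs -ls -R /griffin
hadoop fs -ls -R /test/griffin
```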
- Build the Two Modules
```bash
# Open a terminal in the IDE and change into the service directory
mvn -Dmaven.test.skip=true clean package
# Change into the measure directory and build it as well
mvn -Dmaven.test.skip=true clean package
# Copy the measure jar to the target node and rename it
mv measure-0.6.0.jar griffin-measure.jar
# Upload it to the corresponding HDFS directory. Spark loads griffin-measure.jar from the
# /griffin directory on HDFS when running jobs on YARN; without it the class
# org.apache.griffin.measure.Application cannot be found.
hdfs dfs -put griffin-measure.jar /griffin/
# Upload the service jar to the target Linux node and start it in the background
nohup java -jar griffin-service.jar > service.out 2>&1 &
```
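Once the service jar is started, a quick check that it is actually listening on the port configured earlier (8085); the log file is the first place to look if it is not:

```bash
# Follow the startup log and watch for Spring Boot errors (Ctrl+C to stop)
tail -f service.out
# Any HTTP status code here (200/302/401/...) means the process is up and listening
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8085/
```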
- UI Module
The UI module is built with Angular. Building it with the Node (v6.11.3) and npm (3.10.10) versions specified in the ui module's pom file fails with compile errors, so the Node and npm environment installed locally is used instead.
With administrator privileges, go into the UI's angular directory (it contains package.json), run npm install to install the dependencies, then run ng serve from ui\angular\node_modules\.bin to start the front end.
The default username and password are: user/test
The backend service address used by the front end is configured in angular\src\environments\environment.ts.
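A minimal way to locate and change that setting from the command line; only the value needs to change (point it at the service started above, port 8085 in this setup), and the exact property name should be taken from the existing file rather than assumed:

```bash
# Find where the backend address is defined, then edit it to point at the Griffin service
grep -rn "http://" ui/angular/src/environments/environment.ts
vi ui/angular/src/environments/environment.ts   # set the backend URL to http://<service-host>:8085
```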
Tips
Import griffin-doc\service\postman\griffin.json into Postman as a reference for creating and querying measures and jobs (a couple of example calls are sketched after these tips).
A docker-compose setup can be used to quickly bring up an HDP-style environment for testing Griffin.
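As a hedged illustration of what those Postman requests look like from the command line: the exact paths should be verified against griffin-doc\service\postman\griffin.json, and the /api/v1 prefix below is an assumption based on recent Griffin releases rather than something confirmed here.

```bash
# List registered measures (assumed endpoint -- confirm against the Postman collection)
curl -s http://localhost:8085/api/v1/measures
# List scheduled jobs (assumed endpoint)
curl -s http://localhost:8085/api/v1/jobs
```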
Reference Links
Official website
https://griffin.apache.org/docs/quickstart-cn.html
Notes on pitfalls encountered with griffin-0.5.0
https://blog.csdn.net/weixin_51485976/article/details/109848261?utm_medium=distribute.pc_relevant.none-task-blog-baidujs_baidulandingword-3&spm=1001.2101.3001.4242
Docker deployment guide
griffin-doc\docker\griffin-docker-guide.md