Apache Griffin Data Quality Monitoring Tool
Official site: http://griffin.apache.org/docs/quickstart-cn.html
GitHub: https://github.com/apache/griffin
Reference: https://cwiki.apache.org/confluence/display/GRIFFIN/1.+Overview
Streaming test data: http://griffin.apache.org/data/streaming/
Batch test data: http://griffin.apache.org/data/batch/
1. Overview
Data quality checking approach:
Griffin is a model-driven solution: based on a target dataset and a source (baseline) dataset, users choose data quality dimensions along which the target data is validated. Two types of data sources are supported:
Batch data: collected from the Hadoop platform through data connectors
Streaming data: connected to messaging systems such as Kafka for near-real-time data analysis
Execution flow
Batch quality monitoring principle: data sources are loaded from the Hive metastore.
The quality-check rules specified by the user are handed to Livy, which submits them to YARN via REST; YARN launches a Spark application, and Spark's compute power performs the quality checks and analysis.
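The hand-off described above (service → Livy REST → YARN) essentially amounts to POSTing a Spark batch definition to Livy's /batches endpoint. A hedged sketch of such a request body; the field values mirror the sparkProperties.json configured in section 2.5, and the <...> placeholders stand for the generated env/dq configs:

```json
{
  "file": "hdfs://cdh2:8020/griffin/griffin-measure.jar",
  "className": "org.apache.griffin.measure.Application",
  "queue": "default",
  "args": ["<generated env config>", "<generated dq config>"]
}
```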
Griffin features:
Measures: accuracy, completeness, timeliness, uniqueness, validity, consistency.
Anomaly detection: applies pre-defined rules to flag data that does not meet expectations, and lets you download the non-conforming records.
Alerting: reports data quality issues by email or through the portal.
Visual monitoring: dashboards that show the current data quality status.
Real-time: quality checks can run in real time, so problems surface quickly.
Extensibility: usable for data validation across multiple data system warehouses.
Scalability: works at big-data volumes; the deployment at eBay currently processes roughly 1.2 PB.
Self-service: a clean, easy-to-use UI for managing data assets and quality rules; users can also view quality results on the dashboard and customize what is displayed.
Workflow
- Register the data: register the data sources whose quality you want to check with Griffin.
- Configure a measure: define the model along quality dimensions such as accuracy, completeness, timeliness, uniqueness, and so on.
- Configure a scheduled job that submits to the Spark cluster and checks the data periodically.
- Inspect the metrics on the portal and analyze the validation results.
Architecture
Three main parts: define, measure, and analyze.
define: specifies the dimensions of the quality statistics, e.g. the time span covered and the statistics targeted (whether source and target row counts match; for a given field, the number of non-null values, distinct values, max, min, counts of the top-5 values, and so on)
measure: runs the statistics jobs and produces the results
analyze: stores and displays the results
Currently supported:
Apache Griffin currently supports HIVE, CUSTOM, AVRO, and KAFKA as data sources; MySQL and other relational databases require writing your own extension.
Caveats:
- The UI currently only supports creating accuracy measures; everything else must be run via configuration files.
- How the email notification feature is implemented is not yet apparent from the source code.
- For anomaly detection the project only provides an implementation idea; doing it for real requires deeper study.
2. Installing Griffin 0.5.0
Dependency requirements:
- JDK (1.8 or later versions)
- MySQL (version 5.6 or later)
- Hadoop (2.6.0 or later)
- Hive (version 2.x)
- Spark (version 2.2.1)
- Livy(livy-0.5.0-incubating)
- ElasticSearch (5.0 or later versions)
CDH 6.3.2 is already installed, so the environment already provides:
JDK 1.8
MySQL 5.7.30
Hadoop 3.0.0+cdh6.3.2
Hive 2.1.1+cdh6.3.2
Spark 2.4.0+cdh6.3.2
Kafka 2.2.1+cdh6.3.2
Livy 0.7.0
ElasticSearch 7.8.0
See the previous two posts for installing ES and Livy.
2.1 Download Griffin 0.5.0
Official docs: http://griffin.apache.org/docs/latest.html
Unpack:
[root@cdh3 package]# pwd
/opt/package
[root@cdh3 package]# ll
total 271988
-rw-r--r-- 1 root root 92791460 Jun 28 08:07 apache-livy-0.7.0-incubating-bin.zip
-rw-r--r-- 1 root root 4405489 Jun 29 11:34 griffin-0.5.0-source-release.zip
-rw-r--r--. 1 root root 181310701 Jun 2 21:17 jdk-8u73-linux-x64.tar.gz
[root@cdh3 package]# unzip griffin-0.5.0-source-release.zip -d /opt/soft/
In the downloaded source, find the SQL script at service/src/main/resources/Init_quartz_mysql_innodb.sql and upload it to the machine running the MySQL service; adjust its permissions if they are insufficient. Griffin's job scheduling relies on Quartz, and this script initializes the Quartz tables.
[root@cdh3 resources]# pwd
/opt/soft/griffin-0.5.0/service/src/main/resources
[root@cdh3 resources]# scp -r Init_quartz_mysql_innodb.sql root@cdh1:/opt/
#fix the permissions if they are insufficient
[root@cdh1 opt]# chmod 777 Init_quartz_mysql_innodb.sql
2.2 MySQL configuration
Create a quartz database in MySQL, then run the Init_quartz_mysql_innodb.sql script to initialize the tables. Since I had already set up a shared grant user, cdh, I did not create a separate user.
mysql> CREATE DATABASE quartz DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)
You can also create a dedicated user with:
CREATE USER 'quartz'@'%' IDENTIFIED BY '123456';
GRANT ALL PRIVILEGES ON *.* TO 'quartz'@'%';
Run the script and check the resulting tables:
[root@cdh1 opt]# ll
total 12
drwxr-xr-x 2 root root 77 Jun 20 16:37 cdh6
drwxr-xr-x 8 cloudera-scm cloudera-scm 97 Jun 20 23:42 cloudera
-rwxrwxrwx 1 root root 7462 Jun 29 11:43 Init_quartz_mysql_innodb.sql
drwxr-xr-x. 8 root root 4096 Jun 28 15:34 package
drwxr-xr-x. 2 root root 6 Mar 26 2015 rh
drwxr-xr-x. 5 root root 60 Jun 24 17:41 soft
[root@cdh1 opt]# mysql -u cdh -p quartz < Init_quartz_mysql_innodb.sql
Enter password:
//the import creates the following tables under quartz
mysql> use quartz;
mysql> show tables;
+--------------------------+
| Tables_in_quartz |
+--------------------------+
| QRTZ_BLOB_TRIGGERS |
| QRTZ_CALENDARS |
| QRTZ_CRON_TRIGGERS |
| QRTZ_FIRED_TRIGGERS |
| QRTZ_JOB_DETAILS |
| QRTZ_LOCKS |
| QRTZ_PAUSED_TRIGGER_GRPS |
| QRTZ_SCHEDULER_STATE |
| QRTZ_SIMPLE_TRIGGERS |
| QRTZ_SIMPROP_TRIGGERS |
| QRTZ_TRIGGERS |
+--------------------------+
11 rows in set (0.00 sec)
2.3 Component environment configuration
I skipped this step since the cluster is managed by Cloudera Manager; below is a configuration used by others.
Export the variables below, or create a griffin_env.sh file with the following content and source it from .bashrc:
#!/bin/bash
export JAVA_HOME=/usr/local/zulu8
export HADOOP_HOME=/opt/hadoop-3.1.2
export HADOOP_COMMON_HOME=/opt/hadoop-3.1.2
export HADOOP_COMMON_LIB_NATIVE_DIR=/opt/hadoop-3.1.2/lib/native
export HADOOP_HDFS_HOME=/opt/hadoop-3.1.2
export HADOOP_INSTALL=/opt/hadoop-3.1.2
export HADOOP_MAPRED_HOME=/opt/hadoop-3.1.2
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/spark-2.4.3-bin-hadoop2.7
export LIVY_HOME=/opt/apache-livy-0.6.0-incubating-bin
export HIVE_HOME=/opt/apache-hive-3.1.1-bin
export YARN_HOME=/opt/hadoop-3.1.2
export SCALA_HOME=/usr/share/scala
export PATH=$PATH:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$LIVY_HOME/bin:$SCALA_HOME/bin
Reference: https://blog.csdn.net/github_39577257/article/details/90607081
2.4 Hive configuration
Once Griffin is running, the Spark jobs it launches depend on the Hive configuration file, so place it in an HDFS directory where the Spark jobs can reference it directly at runtime.
[root@cdh3 resources]# hadoop fs -mkdir /home/spark_conf
[root@cdh3 package]# hadoop fs -put /etc/hive/conf/hive-site.xml /home/spark_conf
[root@cdh3 package]# hadoop fs -ls /home/spark_conf
Found 1 items
-rw-r--r-- 1 root supergroup 6708 2020-06-29 15:20 /home/spark_conf/hive-site.xml
[root@cdh3 package]#
2.5 Configuring Griffin
1. Edit service/src/main/resources/application.properties
[root@cdh1 resources]# vim /opt/soft/griffin-0.5.0/service/src/main/resources/application.properties
#apache griffin application name; it must match the name the griffin server jar is renamed to later
spring.application.name=griffin_service
#griffin server port (defaults to 8080)
server.port=8090
#mysql settings; the username is my shared grant user
spring.datasource.url=jdbc:mysql://cdh1:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username=cdh
spring.datasource.password=123456
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
# Hive metastore, taken from /etc/hive/conf.cloudera.hive/hive-site.xml
hive.metastore.uris=thrift://cdh2:9083
hive.metastore.dbname=hive
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
# Kafka schema registry
kafka.schema.registry.url=http://cdh4:9092
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
# schedule predicate job every 5 minutes and repeat 12 times at most
#interval time unit s:second m:minute h:hour d:day,only support these four units
predicate.job.interval=5m
predicate.job.repeat.count=12
# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=
# login strategy ("default" or "ldap")
login.strategy=default
# ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
# hdfs default name, taken from /etc/hadoop/conf/core-site.xml
fs.defaultFS=hdfs://cdh2:8020
# elasticsearch
elasticsearch.host=cdh3
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# livy
livy.uri=http://cdh3:8998/batches
#livy.need.queue=false
#livy.task.max.concurrent.count=20
#livy.task.submit.interval.second=3
#livy.task.appId.retry.count=3
# yarn resourcemanager url
yarn.uri=http://cdh2:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook
# compression
server.compression.enabled=true
server.compression.mime-types=application/json,application/xml,text/html,text/xml,text/plain,application/javascript,text/css
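A quick sanity check on the millisecond values used in the properties above (plain arithmetic, nothing Griffin-specific):

```shell
# verify the interval values set in application.properties
expired_ms=$((7 * 24 * 60 * 60 * 1000))   # jobInstance.expired.milliseconds: 7 days
cache_ms=$((15 * 60 * 1000))              # cache.evict.hive.fixedRate.in.milliseconds: 15 minutes
echo "$expired_ms $cache_ms"              # prints: 604800000 900000
```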
2. Edit service/src/main/resources/quartz.properties
Left at the defaults here. Note, though, that the file's own comment recommends org.quartz.impl.jdbcjobstore.StdJDBCDelegate when MySQL is the backing database, while the shipped default below is PostgreSQLDelegate.
[root@cdh1 resources]# vim /opt/soft/griffin-0.5.0/service/src/main/resources/quartz.properties
org.quartz.scheduler.instanceName=spring-boot-quartz
org.quartz.scheduler.instanceId=AUTO
org.quartz.threadPool.threadCount=5
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
# If you use postgresql as your database,set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
# If you use mysql as your database,set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate
# If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
org.quartz.jobStore.useProperties=true
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000
3. Edit service/src/main/resources/sparkProperties.json
[root@cdh1 resources]# vim /opt/soft/griffin-0.5.0/service/src/main/resources/sparkProperties.json
{
"file": "hdfs://cdh2:8020/griffin/griffin-measure.jar",
"className": "org.apache.griffin.measure.Application",
"queue": "default",
"numExecutors": 2,
"executorCores": 1,
"driverMemory": "512m",
"executorMemory": "512m",
"conf": {
"spark.yarn.dist.files": "hdfs://cdh2:8020/home/spark_conf/hive-site.xml"
},
"files": [
]
}
4. Edit service/src/main/resources/env/env_batch.json
The env file for batch data. The sinks are where Spark writes the results: HDFS acts as a backup, while the metric line charts shown later fetch their data from ES.
[root@cdh1 env]# vim /opt/soft/griffin-0.5.0/service/src/main/resources/env/env_batch.json
{
"spark": {
"log.level": "WARN"
},
"sinks": [
{
"type": "CONSOLE",
"config": {
"max.log.lines": 10
}
},
{
"type": "HDFS",
"config": {
"path": "hdfs://cdh2:8020/griffin/persist",
"max.persist.lines": 10000,
"max.lines.per.file": 10000
}
},
{
"type": "ELASTICSEARCH",
"config": {
"method": "post",
"api": "http://cdh3:9200/griffin/accuracy",
"connection.timeout": "1m",
"retry": 10
}
}
],
"griffin.checkpoint": []
}
5. Edit service/src/main/resources/env/env_streaming.json
Streaming data is not being tested yet, so this configuration does not matter for now; when testing streaming data later, custom env and dq config files are written anyway.
[root@cdh1 env]# vim /opt/soft/griffin-0.5.0/service/src/main/resources/env/env_streaming.json
{
"spark": {
"log.level": "WARN",
"checkpoint.dir": "hdfs://cdh2:8020/griffin/checkpoint/${JOB_NAME}",
"init.clear": true,
"batch.interval": "1m",
"process.interval": "5m",
"config": {
"spark.default.parallelism": 4,
"spark.task.maxFailures": 5,
"spark.streaming.kafkaMaxRatePerPartition": 1000,
"spark.streaming.concurrentJobs": 4,
"spark.yarn.maxAppAttempts": 5,
"spark.yarn.am.attemptFailuresValidityInterval": "1h",
"spark.yarn.max.executor.failures": 120,
"spark.yarn.executor.failuresValidityInterval": "1h",
"spark.hadoop.fs.hdfs.impl.disable.cache": true
}
},
"sinks": [
{
"type": "CONSOLE",
"config": {
"max.log.lines": 100
}
},
{
"type": "HDFS",
"config": {
"path": "hdfs://cdh2:8020/griffin/persist",
"max.persist.lines": 10000,
"max.lines.per.file": 10000
}
},
{
"type": "ELASTICSEARCH",
"config": {
"method": "post",
"api": "http://cdh3:9200/griffin/accuracy"
}
}
],
"griffin.checkpoint": [
{
"type": "zk",
"config": {
"hosts": "cdh2:2181,cdh3:2181,cdh4:3181",
"namespace": "griffin/infocache",
"lock.path": "lock",
"mode": "persist",
"init.clear": true,
"close.clear": false
}
}
]
}
ES configuration
Create the griffin index in ES up front; the result files are POSTed here after runs. My ES is 7.8, so include_type_name=true is used.
[root@cdh1 env]# curl -k -H "Content-Type: application/json" -X PUT http://cdh2:9200/griffin?include_type_name=true \
-d '{
"aliases": {},
"mappings": {
"accuracy": {
"properties": {
"name": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
},
"tmst": {
"type": "date"
}
}
}
},
"settings": {
"index": {
"number_of_replicas": "2",
"number_of_shards": "5"
}
}
}'
A successful index creation returns:
{"acknowledged":true,"shards_acknowledged":true,"index":"griffin"}
Basic ES operations, to verify ES is healthy.
List all indices:
[root@cdh3 parcels]# curl -X GET http://cdh3:9200/_cat/indices
yellow open commodity FBbDiSUFRXy0d-VYHUMn-w 1 1 0 0 208b 208b
View the contents of the griffin index:
curl -X GET http://cdh3:9200/griffin/_search?pretty
View node info:
curl -X GET "cdh3:9200/_cat/nodes?v"
View cluster health:
curl -X GET "cdh3:9200/_cat/health?v"
View index info:
curl -X GET "cdh3:9200/_cat/indices?v"
Delete an index:
curl -X DELETE "cdh:9200/commoditytest"
Create an index (?pretty pretty-prints the JSON response):
curl -X PUT "cdh3:9200/commoditytest?pretty"
Configure the MySQL driver
Without it the application fails to load the JDBC driver class at startup, so edit service/pom.xml (around line 113) and uncomment the mysql-connector-java dependency:
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.java.version}</version>
</dependency>
Make sure the JDBC driver is present on the nodes that use it.
2.6 Building Griffin
[root@cdh1 griffin-0.5.0]# mvn clean
[root@cdh1 griffin-0.5.0]# mvn -T2C install -DskipTests
-T2C: use two build threads per CPU core, which speeds up compilation.
Error 1: the build fails in the UI module
[ERROR]
[ERROR] ERROR in /opt/soft/griffin-0.5.0/ui/angular/node_modules/@types/jquery/JQuery.d.ts (4137,26): Cannot find name 'SVGElementTagNameMap'.
[ERROR] ERROR in /opt/soft/griffin-0.5.0/ui/angular/node_modules/@types/jquery/JQuery.d.ts (4137,89): Cannot find name 'SVGElementTagNameMap'.
[ERROR]
[ERROR] npm ERR! Linux 3.10.0-514.el7.x86_64
[ERROR] npm ERR! argv "/opt/soft/griffin-0.5.0/ui/.tmp/node/node" "/opt/soft/griffin-0.5.0/ui/.tmp/node/node_modules/npm/bin/npm-cli.js" "run" "build "
[ERROR] npm ERR! node v6.11.3
[ERROR] npm ERR! npm v3.10.10
[ERROR] npm ERR! code ELIFECYCLE
[ERROR] npm ERR! griffin@0.0.0 build: `ng build`
[ERROR] npm ERR! Exit status 1
[ERROR] npm ERR!
[ERROR] npm ERR! Failed at the griffin@0.0.0 build script 'ng build'.
[ERROR] npm ERR! Make sure you have the latest version of node.js and npm installed.
[ERROR] npm ERR! If you do, this is most likely a problem with the griffin package,
[ERROR] npm ERR! not with npm itself.
[ERROR] npm ERR! Tell the author that this fails on your system:
[ERROR] npm ERR! ng build
[ERROR] npm ERR! You can get information on how to open an issue for this project with:
[ERROR] npm ERR! npm bugs griffin
[ERROR] npm ERR! Or if that isn't available, you can get their info via:
[ERROR] npm ERR! npm owner ls griffin
[ERROR] npm ERR! There is likely additional logging output above.
[ERROR]
[ERROR] npm ERR! Please include the following file with any support request:
[ERROR] npm ERR! /opt/soft/griffin-0.5.0/ui/angular/npm-debug.log
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.5.0 0.5.0:
[INFO]
[INFO] Apache Griffin 0.5.0 ............................... SUCCESS [ 16.497 s]
[INFO] Apache Griffin :: UI :: Default UI ................. FAILURE [24:12 min]
[INFO] Apache Griffin :: Web Service ...................... SKIPPED
[INFO] Apache Griffin :: Measures ......................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 24:28 min
[INFO] Finished at: 2020-06-30T15:04:04+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.6:npm (npm build) on project ui: Failed to run task: 'npm run build' faile d. org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
vim /opt/soft/griffin-0.5.0/ui/angular/node_modules/@types/jquery/JQuery.d.ts
Delete line 4137:
find<K extends keyof SVGElementTagNameMap>(selector_element: K | JQuery<K>): JQuery<SVGElementTagNameMap[K]>;
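The same fix can be scripted with sed instead of editing by hand. This sketch runs on a scratch file; in practice point DTS at the real ui/angular/node_modules/@types/jquery/JQuery.d.ts path.

```shell
# delete every line mentioning SVGElementTagNameMap (demonstrated on a scratch file)
DTS=$(mktemp)
printf '%s\n' \
  'some other overload;' \
  'find<K extends keyof SVGElementTagNameMap>(selector_element: K | JQuery<K>): JQuery<SVGElementTagNameMap[K]>;' \
  > "$DTS"
sed -i '/SVGElementTagNameMap/d' "$DTS"
cat "$DTS"   # prints: some other overload;
```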
2.7 Deploying the jars
Create the directories:
[root@cdh3 soft]# mkdir /opt/soft/griffin-0.5.0
[root@cdh3 soft]# hadoop fs -mkdir -p /griffin/persist
[root@cdh3 soft]# hadoop fs -mkdir -p /griffin/checkpoint
[root@cdh3 soft]# hadoop fs -ls /griffin
Found 2 items
drwxr-xr-x - root supergroup 0 2020-06-30 15:55 /griffin/checkpoint
drwxr-xr-x - root supergroup 0 2020-06-30 15:55 /griffin/persist
Set environment variables (optional):
[root@cdh3 soft]# vim /etc/profile   # add the following
export GRIFFIN_HOME=/opt/soft/griffin-0.5.0
export PATH=$PATH:$GRIFFIN_HOME/bin
# save, then source it
[root@cdh3 soft]# source /etc/profile
# rename the measure and service jars; the service jar name must match application.properties above
mv measure/target/measure-0.5.0.jar $GRIFFIN_HOME/griffin-measure.jar
mv service/target/service-0.5.0.jar $GRIFFIN_HOME/griffin-service.jar
# upload the measure jar to HDFS
hadoop fs -put $GRIFFIN_HOME/griffin-measure.jar /griffin/
# griffin-service.jar stays in GRIFFIN_HOME
Start the server
# make sure the Hive metastore service is up before starting
nohup java -jar $GRIFFIN_HOME/griffin-service.jar>$GRIFFIN_HOME/service.out 2>&1 &
Watch the startup log with tail -f service.out to catch errors early; after a few seconds the web UI is reachable at http://cdh3:8090 (log in as, e.g., user/test).
With the default login strategy no real account is needed, so any username/password also works.
The screenshots below were taken after a few jobs had already been run.
Measures lists the defined measures; a measure can be understood as a rule or model defining a data quality check.
When creating a measure there are four data quality model types:
1. Accuracy: compares two datasets, source and target, under a specified comparison rule (greater than, less than, equals) over a specified interval; the result set is produced by the Spark job that the scheduler launches.
2. Data Profiling: profiles a single source dataset, computing e.g. max, min, and count values for n fields.
3. Publish: if a measure was created via configuration files rather than the UI, and Spark has run that quality model, the results are written to ES; defining a Publish measure with the same name makes the result set appear on the dashboard.
4. json/yaml Measure: user-defined measures; configuration files can also be defined here.
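For the Accuracy type in item 1, the metric that ends up on the dashboard boils down to matched/total, with matched = total - miss. A sketch with made-up counts:

```shell
# hypothetical counts reported by one accuracy run
total=125000
miss=92
matched=$((total - miss))
# accuracy percentage, rounded to two decimals
awk -v m="$matched" -v t="$total" 'BEGIN { printf "accuracy = %.2f%%\n", m * 100 / t }'
# prints: accuracy = 99.93%
```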
Jobs lists the user-defined jobs; a job periodically triggers a measure. Scheduling is implemented with Quartz, and the Spark computation is handed to Livy; with Spark on YARN, Livy passes the task on to YARN.
My Dashboard shows all result-set reports.
DataAssets lists the data assets available as data sources; here it shows Hive tables.
3. Batch data accuracy test
Two Hive external tables are compared under the rules defined in a measure, producing a result set.
3.1 Data preparation
Download the batch test data scripts: http://griffin.apache.org/data/batch/
[root@cdh3 griffin-0.5.0]# ll /opt/soft/griffin-0.5.0/testdata/
total 5420
-rw-r--r-- 1 root root 456 Dec 26 2018 create-table.hql
-rw-r--r-- 1 root root 12904 Jul 6 13:54 delta_src
-rw-r--r-- 1 root root 11785 Dec 26 2018 delta_tgt
-rw-r--r-- 1 root root 1351464 Dec 26 2018 demo_basic
-rw-r--r-- 1 root root 1364368 Jul 6 13:54 demo_src
-rw-r--r-- 1 root root 1363249 Jul 6 13:54 demo_tgt
-rwxr-xr-x 1 root root 142 Jul 6 14:11 gen_delta_src.sh
-rwxr-xr-x 1 root root 179 Dec 26 2018 gen_demo_data.sh
-rwxr-xr-x 1 root root 1704 Jul 1 09:22 gen-hive-data.sh
-rw-r--r-- 1 root root 190 Jul 1 09:11 insert-data.hql.template
Create the Hive external tables manually, based on the create-table.hql script:
CREATE EXTERNAL TABLE `demo_src`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs://cdh2:8020/griffin/data/batch/demo_src';
CREATE EXTERNAL TABLE `demo_tgt`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs://cdh2:8020/griffin/data/batch/demo_tgt';
Fix permissions on any scripts that lack them:
chmod 755 ./testdata/*.sh
Adapt gen-hive-data.sh to your environment as follows:
#!/bin/bash
#create table -- the hive external tables were created manually above, so this is skipped
#hive -f create-table.hql
echo "create table done"
#current hour
./gen_demo_data.sh
cur_date=`date +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"
#last hour
./gen_demo_data.sh
cur_date=`date -d '1 hour ago' +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"
#next hours
set +e
while true
do
./gen_demo_data.sh
cur_date=`date +%Y%m%d%H`
next_date=`date -d "+1hour" '+%Y%m%d%H'`
dt=${next_date:0:8}
hour=${next_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"
sleep 3600
done
set -e
Running the script populates three partitions at once in both demo_src and demo_tgt (previous hour, current hour, next hour), and then generates one new partition every hour.
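The partition bookkeeping in the script boils down to bash substring slicing plus a sed template substitution; the same logic in isolation, with a fixed timestamp:

```shell
cur_date=2020070916          # normally produced by `date +%Y%m%d%H`
dt=${cur_date:0:8}           # first 8 characters -> 20200709
hour=${cur_date:8:2}         # next 2 characters -> 16
partition_date="dt='$dt',hour='$hour'"
# insert-data.hql.template's PARTITION_DATE placeholder is replaced the same way
echo "PARTITION (PARTITION_DATE)" | sed "s/PARTITION_DATE/$partition_date/"
# prints: PARTITION (dt='20200709',hour='16')
```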
[root@cdh3 testdata]# ./gen-hive-data.sh
create table done
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hive-common-2.1.1-cdh6.3.2.jar!/hive-log4j2.properties Async: false
Loading data to table griffin_demo.demo_src partition (dt=20200709, hour=16)
OK
Time taken: 7.98 seconds
Loading data to table griffin_demo.demo_tgt partition (dt=20200709, hour=16)
OK
Time taken: 1.322 seconds
insert data [dt='20200709',hour='16'] done
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hive-common-2.1.1-cdh6.3.2.jar!/hive-log4j2.properties Async: false
Loading data to table griffin_demo.demo_src partition (dt=20200709, hour=15)
OK
Time taken: 8.104 seconds
Loading data to table griffin_demo.demo_tgt partition (dt=20200709, hour=15)
OK
Time taken: 1.613 seconds
insert data [dt='20200709',hour='15'] done
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
^Ctouchz: `/griffin/data/batch/demo_src/dt=20200709/hour=17/_DONE': No such file or directory: `hdfs://cdh2:8020/griffin/data/batch/demo_src/dt=20200709/hour=17/_DONE'
touchz: `/griffin/data/batch/demo_tgt:/dt=20200709/hour=17/_DONE': No such file or directory: `hdfs://cdh2:8020/griffin/data/batch/demo_tgt:/dt=20200709/hour=17/_DONE'
insert data [dt='20200709',hour='17'] done
3.2 Create an Accuracy measure
The griffin_demo database contains the demo_src and demo_tgt tables. Choose the source: the age field of demo_src.
Choose the target: the age field of demo_tgt.
Choose the comparison rule between the two fields.
Partitioning is by hour; time zone UTC+8.
Name the measure and add a description.
3.3 Create a job to run the measure
Schedule it every half hour with the cron expression 0 0/30 * * * ?
Clicking Run 2 runs it every half hour; clicking Run 1 runs it only once.
Eventually the metric can be seen.
Missed records can be downloaded via Download miss sample; the missed data is in JSON format:
{"id":124,"age":1090,"desc":"1090","dt":"20200706","hour":"12","__tmst":1594020363542}
{"id":124,"age":1159,"desc":"1159","dt":"20200706","hour":"12","__tmst":1594020363542}
{"id":124,"age":1159,"desc":"1159","dt":"20200706","hour":"14","__tmst":1594020363542}
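Since the downloaded miss sample is plain JSON lines, ordinary shell tools are enough for a first look; for example, tallying the misses per partition hour (using the three sample records above):

```shell
# write the sample miss records to a file, then count them by hour
cat > miss.json <<'EOF'
{"id":124,"age":1090,"desc":"1090","dt":"20200706","hour":"12","__tmst":1594020363542}
{"id":124,"age":1159,"desc":"1159","dt":"20200706","hour":"12","__tmst":1594020363542}
{"id":124,"age":1159,"desc":"1159","dt":"20200706","hour":"14","__tmst":1594020363542}
EOF
grep -o '"hour":"[0-9]*"' miss.json | sort | uniq -c
```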
4. Streaming data accuracy test
Create two topics in Kafka; the data in the two topics is compared under the rules defined in a measure, producing a result set.
4.1 Data preparation
Download the streaming test scripts to a Kafka node: http://griffin.apache.org/data/streaming/
[root@cdh4 griffin]# ll /opt/soft/griffin/testStream/
total 16
-rwxr-xr-x 1 root root 662 Jul 7 16:11 gen-data.sh
-rw-r--r-- 1 root root 616 Jul 7 15:52 source.temp
-rwxr-xr-x 1 root root 297 Jul 7 15:54 streaming-data.sh
-rw-r--r-- 1 root root 611 Jul 7 15:52 target.temp
Modify the scripts:
Since my Kafka is not an old version, the Kafka parts were adjusted accordingly.
vim gen-data.sh
#!/bin/bash
#current time
cur_time=`date +%Y-%m-%d_%H:%M:%S`
sed s/TIME/$cur_time/ ./source.temp > source.tp
sed s/TIME/$cur_time/ ./target.temp > target.tp
#create data
for row in 1 2 3 4 5 6 7 8 9 10
do
sed -n "${row}p" < source.tp > sline
cnt=`shuf -i1-2 -n1`
clr="red"
if [ $cnt == 2 ]; then clr="yellow"; fi
sed s/COLOR/$clr/ sline >> source.data
done
rm sline
cat target.tp > target.data
rm source.tp target.tp
#import data
kafka-console-producer --broker-list cdh4:9092 --topic source < source.data
kafka-console-producer --broker-list cdh4:9092 --topic target < target.data
rm source.data target.data
echo "insert data at ${cur_time}"
vim streaming-data.sh
#!/bin/bash
#create topics
kafka-topics --create --zookeeper cdh4:2181 --replication-factor 1 --partitions 1 --topic source
kafka-topics --create --zookeeper cdh4:2181 --replication-factor 1 --partitions 1 --topic target
#every minute
set +e
while true
do
./gen-data.sh
sleep 60
done
set -e
Running the script sends data to both topics every minute.
[root@cdh4 testStream]# ./streaming-data.sh
4.2 Writing the configuration files
[root@cdh3 testStream]# ll
total 8
-rw-r--r-- 1 root root 2686 Jul 8 15:21 dqStream.json
-rw-r--r-- 1 root root 1231 Jul 8 09:12 envStream.json
[root@cdh3 griffin-0.5.0]# vim ./testStream/dqStream.json
{
"name": "streaming_accu",
"process.type": "streaming",
"data.sources": [
{
"name": "src",
"baseline": true,
"connectors": [
{
"type": "kafka",
"version": "2.2.1+cdh6.3.2",
"config": {
"kafka.config": {
"bootstrap.servers": "cdh4:9092",
"group.id": "griffin",
"auto.offset.reset": "largest",
"auto.commit.enable": "false"
},
"topics": "source",
"key.type": "java.lang.String",
"value.type": "java.lang.String"
},
"pre.proc": [
{
"dsl.type": "df-opr",
"rule": "from_json"
}
]
}
],
"checkpoint": {
"type": "json",
"file.path": "hdfs://cdh2:8020/griffin/streaming/dump/source",
"info.path": "source",
"ready.time.interval": "10s",
"ready.time.delay": "0",
"time.range": ["-5m", "0"],
"updatable": true
}
}, {
"name": "tgt",
"connectors": [
{
"type": "kafka",
"version": "2.2.1+cdh6.3.2",
"config": {
"kafka.config": {
"bootstrap.servers": "cdh4:9092",
"group.id": "griffin",
"auto.offset.reset": "largest",
"auto.commit.enable": "false"
},
"topics": "target",
"key.type": "java.lang.String",
"value.type": "java.lang.String"
},
"pre.proc": [
{
"dsl.type": "df-opr",
"rule": "from_json"
}
]
}
],
"checkpoint": {
"type": "json",
"file.path": "hdfs://cdh2:8020/griffin/streaming/dump/target",
"info.path": "target",
"ready.time.interval": "10s",
"ready.time.delay": "0",
"time.range": ["-1m", "0"]
}
}
],
"evaluate.rule": {
"rules": [
{
"dsl.type": "griffin-dsl",
"dq.type": "accuracy",
"out.dataframe.name": "accu",
"rule": "src.id = tgt.id AND src.name = tgt.name AND src.color = tgt.color AND src.time = tgt.time",
"details": {
"source": "src",
"target": "tgt",
"miss": "miss_count",
"total": "total_count",
"matched": "matched_count"
},
"out":[
{
"type":"metric",
"name": "accu"
},
{
"type":"record",
"name": "missRecords"
}
]
}
]
},
"sinks": ["CONSOLE", "HDFS","ELASTICSEARCH"]
}
[root@cdh3 griffin-0.5.0]# vim ./testStream/envStream.json
{
"spark": {
"log.level": "WARN",
"checkpoint.dir": "hdfs://cdh2:8020/griffin/checkpoint",
"batch.interval": "20s",
"process.interval": "1m",
"init.clear": true,
"config": {
"spark.default.parallelism": 4,
"spark.task.maxFailures": 5,
"spark.streaming.kafkaMaxRatePerPartition": 1000,
"spark.streaming.concurrentJobs": 4,
"spark.yarn.maxAppAttempts": 5,
"spark.yarn.am.attemptFailuresValidityInterval": "1h",
"spark.yarn.max.executor.failures": 120,
"spark.yarn.executor.failuresValidityInterval": "1h",
"spark.hadoop.fs.hdfs.impl.disable.cache": true
}
},
"sinks": [
{
"type": "console"
},
{
"type": "hdfs",
"config": {
"path": "hdfs://cdh2:8020/griffin/persist"
}
},
{
"type": "elasticsearch",
"config": {
"method": "post",
"api": "http://cdh3:9200/griffin/accuracy"
}
}
],
"griffin.checkpoint": [
{
"type": "zk",
"config": {
"hosts": "cdh3:2181",
"namespace": "griffin/infocache",
"lock.path": "lock",
"mode": "persist",
"init.clear": true,
"close.clear": false
}
}
]
}
4.3 Running Spark
Submit the Spark job; the result set is written to the console, HDFS, and ES.
spark-submit \
--class org.apache.griffin.measure.Application \
--master yarn \
--deploy-mode client \
--queue default \
--driver-memory 512m \
--executor-memory 512m \
--num-executors 3 \
hdfs://cdh2:8020/griffin/griffin-measure.jar \
./testStream/envStream.json \
./testStream/dqStream.json
Check that the result sets appear in HDFS and ES.
The HDFS result set looks like this (screenshot added afterwards):
On the UI, publish your own quality result set via Publish. The MeasureName (see dqStream.json) and MetricName (see the script's log output) must match the configuration files for the results to be recognized. One problem: once a Publish measure is defined, the Jobs page no longer shows the earlier jobs.
View the streaming data quality results on My Dashboard.
The results are shown as a flat table rather than a line chart; I suspect the measure type definition is the cause, but I do not yet know how to fix it.
Error 2: running a job fails with:
2020-07-01 13:15:00.145 ERROR 55618 --- [ryBean_Worker-4] o.a.g.c.j.SparkSubmitJob : Post to livy ERROR.
400 Bad Request {"msg":"Duplicate session name: Some(griffin) for session 36"}
Fix:
Delete the name line in service/src/main/resources/sparkProperties.json, then rebuild.
Error 3: running a job fails because the JVM cannot be created:
YARN Diagnostics: ], appId=null, appInfo={driverLogUrl=null, sparkUiUrl=null}, name=null, proxyUser=null, id=45, state=starting}, batches id : 45
2020-07-01 14:10:00.836 INFO 60336 --- [ryBean_Worker-4] o.a.g.c.j.LivyTaskSubmitHelper : retry get livy resultMap: {owner=null, log=[stdout: , Inv alid maximum heap size: -Xmx512mb, Error: Could not create the Java Virtual Machine., Error: A fatal exception has occurred. Program will exit.,
stderr: ,
Fix: sparkProperties.json had a typo: 512mb should have been 512m.
vim service/src/main/resources/sparkProperties.json
"driverMemory": "512m",
"executorMemory": "512m",
Issue 4: the dashboard shows no metrics; the service log warns:
2020-07-02 15:23:13.308 WARN 56236 --- [/O dispatcher 1] o.e.c.RestClient : request [GET http://cdh3:9200/griffin/accuracy/_search?fil ter_path=hits.hits._source] returned 1 warnings: [299 Elasticsearch-7.8.0-757314695644ea9a1dc2fecd26d1a43856725e65 "[types removal] Specifying types in search requests is deprecated."]
Fix: the ES version is too new and the index had been created incorrectly; create it with include_type_name=true as shown above.