Apache Griffin Data Quality Monitoring Tool
Official site: http://griffin.apache.org/docs/quickstart-cn.html
GitHub: https://github.com/apache/griffin
Reference: https://cwiki.apache.org/confluence/display/GRIFFIN/1.+Overview
Streaming test data: http://griffin.apache.org/data/streaming/
Batch test data: http://griffin.apache.org/data/batch/
1. Overview
Data quality checking approach:
Griffin is a model-driven solution: based on a target dataset and a source (baseline) dataset, users choose data quality dimensions along which the target data is validated. Two types of data sources are supported:
Batch data: collected from the Hadoop platform through data connectors
Streaming data: connected to messaging systems such as Kafka for near-real-time data analysis
Execution flow
Batch quality monitoring principle: data sources are loaded from the Hive metastore.
The quality-check rules specified by the user are handed to Livy, which submits them to YARN via REST; YARN launches a Spark application, and Spark's compute power performs the quality checks and analysis.
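The hand-off described above (service → Livy REST → YARN) essentially amounts to POSTing a Spark batch definition to Livy's /batches endpoint. A hedged sketch of such a request body; the field values mirror the sparkProperties.json configured in section 2.5, and the <...> placeholders stand for the generated env/dq configs:

```json
{
  "file": "hdfs://cdh2:8020/griffin/griffin-measure.jar",
  "className": "org.apache.griffin.measure.Application",
  "queue": "default",
  "args": ["<generated env config>", "<generated dq config>"]
}
```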
Griffin features:
Measures: accuracy, completeness, timeliness, uniqueness, validity, consistency.
Anomaly detection: applies pre-defined rules to flag data that does not meet expectations, and lets you download the non-conforming records.
Alerting: reports data quality issues by email or through the portal.
Visual monitoring: dashboards that show the current data quality status.
Real-time: quality checks can run in real time, so problems surface quickly.
Extensibility: usable for data validation across multiple data system warehouses.
Scalability: works at big-data volumes; the deployment at eBay currently processes roughly 1.2 PB.
Self-service: a clean, easy-to-use UI for managing data assets and quality rules; users can also view quality results on the dashboard and customize what is displayed.
Workflow
- Register the data: register the data sources whose quality you want to check with Griffin.
- Configure a measure: define the model along quality dimensions such as accuracy, completeness, timeliness, uniqueness, and so on.
- Configure a scheduled job that submits to the Spark cluster and checks the data periodically.
- Inspect the metrics on the portal and analyze the validation results.
Architecture
Three main parts: define, measure, and analyze.
define: specifies the dimensions of the quality statistics, e.g. the time span covered and the statistics targeted (whether source and target row counts match; for a given field, the number of non-null values, distinct values, max, min, counts of the top-5 values, and so on)
measure: runs the statistics jobs and produces the results
analyze: stores and displays the results
Currently supported:
Apache Griffin currently supports HIVE, CUSTOM, AVRO, and KAFKA as data sources; MySQL and other relational databases require writing your own extension.
Caveats:
- The UI currently only supports creating accuracy measures; everything else must be run via configuration files.
- How the email notification feature is implemented is not yet apparent from the source code.
- For anomaly detection the project only provides an implementation idea; doing it for real requires deeper study.
2. Installing Griffin 0.5.0
Dependency requirements:
- JDK (1.8 or later versions)
- MySQL (version 5.6 or later)
- Hadoop (2.6.0 or later)
- Hive (version 2.x)
- Spark (version 2.2.1)
- Livy(livy-0.5.0-incubating)
- ElasticSearch (5.0 or later versions)
CDH 6.3.2 is already installed, so the environment already provides:
JDK 1.8
MySQL 5.7.30
Hadoop 3.0.0+cdh6.3.2
Hive 2.1.1+cdh6.3.2
Spark 2.4.0+cdh6.3.2
Kafka 2.2.1+cdh6.3.2
Livy 0.7.0
ElasticSearch 7.8.0
See the previous two posts for installing ES and Livy.
2.1 Download Griffin 0.5.0
Official docs: http://griffin.apache.org/docs/latest.html
Unpack:
[root@cdh3 package]# pwd
/opt/package
[root@cdh3 package]# ll
total 271988
-rw-r--r-- 1 root root 92791460 Jun 28 08:07 apache-livy-0.7.0-incubating-bin.zip
-rw-r--r-- 1 root root 4405489 Jun 29 11:34 griffin-0.5.0-source-release.zip
-rw-r--r--. 1 root root 181310701 Jun 2 21:17 jdk-8u73-linux-x64.tar.gz
[root@cdh3 package]# unzip griffin-0.5.0-source-release.zip -d /opt/soft/
In the downloaded source, find the SQL script at service/src/main/resources/Init_quartz_mysql_innodb.sql and upload it to the machine running the MySQL service; adjust its permissions if they are insufficient. Griffin's job scheduling relies on Quartz, and this script initializes the Quartz tables.
[root@cdh3 resources]# pwd
/opt/soft/griffin-0.5.0/service/src/main/resources
[root@cdh3 resources]# scp -r Init_quartz_mysql_innodb.sql root@cdh1:/opt/
#fix the permissions if they are insufficient
[root@cdh1 opt]# chmod 777 Init_quartz_mysql_innodb.sql
2.2 MySQL configuration
Create a quartz database in MySQL, then run the Init_quartz_mysql_innodb.sql script to initialize the tables. Since I had already set up a shared grant user, cdh, I did not create a separate user.
mysql> CREATE DATABASE quartz DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)
You can also create a dedicated user with:
CREATE USER 'quartz'@'%' IDENTIFIED BY '123456';
GRANT ALL PRIVILEGES ON *.* TO 'quartz'@'%';
Run the script and check the resulting tables:
[root@cdh1 opt]# ll
total 12
drwxr-xr-x 2 root root 77 Jun 20 16:37 cdh6
drwxr-xr-x 8 cloudera-scm cloudera-scm 97 Jun 20 23:42 cloudera
-rwxrwxrwx 1 root root 7462 Jun 29 11:43 Init_quartz_mysql_innodb.sql
drwxr-xr-x. 8 root root 4096 Jun 28 15:34 package
drwxr-xr-x. 2 root root 6 Mar 26 2015 rh
drwxr-xr-x. 5 root root 60 Jun 24 17:41 soft
[root@cdh1 opt]# mysql -u cdh -p quartz < Init_quartz_mysql_innodb.sql
Enter password:
//the import creates the following tables under quartz
mysql> use quartz;
mysql> show tables;
+--------------------------+
| Tables_in_quartz |
+--------------------------+
| QRTZ_BLOB_TRIGGERS |
| QRTZ_CALENDARS |
| QRTZ_CRON_TRIGGERS |
| QRTZ_FIRED_TRIGGERS |
| QRTZ_JOB_DETAILS |
| QRTZ_LOCKS |
| QRTZ_PAUSED_TRIGGER_GRPS |
| QRTZ_SCHEDULER_STATE |
| QRTZ_SIMPLE_TRIGGERS |
| QRTZ_SIMPROP_TRIGGERS |
| QRTZ_TRIGGERS |
+--------------------------+
11 rows in set (0.00 sec)
2.3 Component environment configuration
I skipped this step since the cluster is managed by Cloudera Manager; below is a configuration used by others.
Export the variables below, or create a griffin_env.sh file with the following content and source it from .bashrc:
#!/bin/bash
export JAVA_HOME=/usr/local/zulu8
export HADOOP_HOME=/opt/hadoop-3.1.2
export HADOOP_COMMON_HOME=/opt/hadoop-3.1.2
export HADOOP_COMMON_LIB_NATIVE_DIR=/opt/hadoop-3.1.2/lib/native
export HADOOP_HDFS_HOME=/opt/hadoop-3.1.2
export HADOOP_INSTALL=/opt/hadoop-3.1.2
export HADOOP_MAPRED_HOME=/opt/hadoop-3.1.2
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/spark-2.4.3-bin-hadoop2.7
export LIVY_HOME=/opt/apache-livy-0.6.0-incubating-bin
export HIVE_HOME=/opt/apache-hive-3.1.1-bin
export YARN_HOME=/opt/hadoop-3.1.2
export SCALA_HOME=/usr/share/scala
export PATH=$PATH:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$LIVY_HOME/bin:$SCALA_HOME/bin
Reference: https://blog.csdn.net/github_39577257/article/details/90607081
2.4 Hive configuration
Once Griffin is running, the Spark jobs it launches depend on the Hive configuration file, so place it in an HDFS directory where the Spark jobs can reference it directly at runtime.
[root@cdh3 resources]# hadoop fs -mkdir /home/spark_conf
[root@cdh3 package]# hadoop fs -put /etc/hive/conf/hive-site.xml /home/spark_conf
[root@cdh3 package]# hadoop fs -ls /home/spark_conf
Found 1 items
-rw-r--r-- 1 root supergroup 6708 2020-06-29 15:20 /home/spark_conf/hive-site.xml
[root@cdh3 package]#
2.5 Configuring Griffin
1. Edit service/src/main/resources/application.properties
[root@cdh1 resources]# vim /opt/soft/griffin-0.5.0/service/src/main/resources/application.properties
#apache griffin application name; it must match the name the griffin server jar is renamed to later
spring.application.name=griffin_service
#griffin server port (defaults to 8080)
server.port=8090
#mysql settings; the username is my shared grant user
spring.datasource.url=jdbc:mysql://cdh1:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username=cdh
spring.datasource.password=123456
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
# Hive metastore, taken from /etc/hive/conf.cloudera.hive/hive-site.xml
hive.metastore.uris=thrift://cdh2:9083
hive.metastore.dbname=hive
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
# Kafka schema registry
kafka.schema.registry.url=http://cdh4:9092
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
# schedule predicate job every 5 minutes and repeat 12 times at most
#interval time unit s:second m:minute h:hour d:day,only support these four units
predicate.job.interval=5m
predicate.job.repeat.count=12
# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=
# login strategy ("default" or "ldap")
login.strategy=default
# ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
# hdfs default name, taken from /etc/hadoop/conf/core-site.xml
fs.defaultFS=hdfs://cdh2:8020
# elasticsearch
elasticsearch.host=cdh3
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# livy
livy.uri=http://cdh3:8998/batches
#livy.need.queue=false
#livy.task.max.concurrent.count=20
#livy.task.submit.interval.second=3
#livy.task.appId.retry.count=3
# yarn resourcemanager url
yarn.uri=http://cdh2:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook
# compression
server.compression.enabled=true
server.compression.mime-types=application/json,application/xml,text/html,text/xml,text/plain,application/javascript,text/css
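A quick sanity check on the millisecond values used in the properties above (plain arithmetic, nothing Griffin-specific):

```shell
# verify the interval values set in application.properties
expired_ms=$((7 * 24 * 60 * 60 * 1000))   # jobInstance.expired.milliseconds: 7 days
cache_ms=$((15 * 60 * 1000))              # cache.evict.hive.fixedRate.in.milliseconds: 15 minutes
echo "$expired_ms $cache_ms"              # prints: 604800000 900000
```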
2. Edit service/src/main/resources/quartz.properties
Left at the defaults here. Note, though, that the file's own comment recommends org.quartz.impl.jdbcjobstore.StdJDBCDelegate when MySQL is the backing database, while the shipped default below is PostgreSQLDelegate.
[root@cdh1 resources]# vim /opt/soft/griffin-0.5.0/service/src/main/resources/quartz.properties
org.quartz.scheduler.instanceName=spring-boot-quartz
org.quartz.scheduler.instanceId=AUTO
org.quartz.threadPool.threadCount=5
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
# If you use postgresql as your database,set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
# If you use mysql as your database,set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate
# If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
org.quartz.jobStore.useProperties=true
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000
3. Edit service/src/main/resources/sparkProperties.json
[root@cdh1 resources]# vim /opt/soft/griffin-0.5.0/service/src/main/resources/sparkProperties.json
{
"file": "hdfs://cdh2:8020/griffin/griffin-measure.jar",
"className": "org.apache.griffin.measure.Application",
"queue": "default",
"numExecutors": 2,
"executorCores": 1,
"driverMemory": "512m",
"executorMemory": "512m",
"conf": {
"spark.yarn.dist.files": "hdfs://cdh2:8020/home/spark_conf/hive-site.xml"
},
"files": [
]
}
4. Edit service/src/main/resources/env/env_batch.json
The env file for batch data. The sinks are where Spark writes the results: HDFS acts as a backup, while the metric line charts shown later fetch their data from ES.
[root@cdh1 env]# vim /opt/soft/griffin-0.5.0/service/src/main/resources/env/env_batch.json
{
"spark": {
"log.level": "WARN"
},
"sinks": [
{
"type": "CONSOLE",
"config": {
"max.log.lines": 10
}
},
{
"type": "HDFS",
"config": {
"path": "hdfs://cdh2:8020/griffin/persist",
"max.persist.lines": 10000,
"max.lines.per.file": 10000
}
},
{
"type": "ELASTICSEARCH",
"config": {
"method": "post",
"api": "http://cdh3:9200/griffin/accuracy",
"connection.timeout": "1m",
"retry": 10
}
}
],
"griffin.checkpoint": []
}
5. Edit service/src/main/resources/env/env_streaming.json
Streaming data is not being tested yet, so this configuration does not matter for now; when testing streaming data later, custom env and dq config files are written anyway.
[root@cdh1 env]# vim /opt/soft/griffin-0.5.0/service/src/main/resources/env/env_streaming.json
{
"spark": {
"log.level": "WARN",
"checkpoint.dir": "hdfs://cdh2:8020/griffin/checkpoint/${JOB_NAME}",
"init.clear": true,
"batch.interval": "1m",
"process.interval": "5m",
"config": {
"spark.default.parallelism": 4,
"spark.task.maxFailures": 5,
"spark.streaming.kafkaMaxRatePerPartition": 1000,
"spark.streaming.concurrentJobs": 4,
"spark.yarn.maxAppAttempts": 5,
"spark.yarn.am.attemptFailuresValidityInterval": "1h",
"spark.yarn.max.executor.failures": 120,
"spark.yarn.executor.failuresValidityInterval": "1h",
"spark.hadoop.fs.hdfs.impl.disable.cache": true
}
},
"sinks": [
{
"type": "CONSOLE",
"config": {
"max.log.lines": 100
}
},
{
"type": "HDFS",
"config": {
"path": "hdfs://cdh2:8020/griffin/persist",
"max.persist.lines": 10000,
"max.lines.per.file": 10000
}
},
{
"type": "ELASTICSEARCH",
"config": {
"method": "post",
"api": "http://cdh3:9200/griffin/accuracy"
}
}
],
"griffin.checkpoint": [
{
"type": "zk",
"config": {
"hosts": "cdh2:2181,cdh3:2181,cdh4:3181",
"namespace": "griffin/infocache",
"lock.path": "lock",
"mode": "persist",
"init.clear": true,
"close.clear": false
}
}
]
}
ES configuration
Create the griffin index in ES up front; the result files are POSTed here after runs. My ES is 7.8, so include_type_name=true is used.
[root@cdh1 env]# curl -k -H "Content-Type: application/json" -X PUT http://cdh2:9200/griffin?include_type_name=true \
-d '{
"aliases": {},
"mappings": {
"accuracy": {
"properties": {
"name": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
},
"tmst": {
"type": "date"
}
}
}
},
"settings": {
"index": {
"number_of_replicas": "2",
"number_of_shards": "5"
}
}
}'
A successful index creation returns:
{"acknowledged":true,"shards_acknowledged":true,"index":"griffin"}
Basic ES operations, to verify ES is healthy.
List all indices:
[root@cdh3 parcels]# curl -X GET http://cdh3:9200/_cat/indices
yellow open commodity FBbDiSUFRXy0d-VYHUMn-w 1 1 0 0 208b 208b
View the contents of the griffin index:
curl -X GET http://cdh3:9200/griffin/_search?pretty
View node info:
curl -X GET "cdh3:9200/_cat/nodes?v"
View cluster health:
curl -X GET "cdh3:9200/_cat/health?v"
View index info:
curl -X GET "cdh3:9200/_cat/indices?v"
Delete an index:
curl -X DELETE "cdh:9200/commoditytest"
Create an index (?pretty pretty-prints the JSON response):
curl -X PUT "cdh3:9200/commoditytest?pretty"
Configure the MySQL driver
Without it the application fails to load the JDBC driver class at startup, so edit service/pom.xml (around line 113) and uncomment the mysql-connector-java dependency:
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.java.version}</version>
</dependency>
Make sure the JDBC driver is present on the nodes that use it.
2.6 Building Griffin
[root@cdh1 griffin-0.5.0]# mvn clean
[root@cdh1 griffin-0.5.0]# mvn -T2C install -DskipTests
-T2C: use two build threads per CPU core, which speeds up compilation.
Error 1: the build fails in the UI module
[ERROR]
[ERROR] ERROR in /opt/soft/griffin-0.5.0/ui/angular/node_modules/@types/jquery/JQuery.d.ts (4137,26): Cannot find name 'SVGElementTagNameMap'.
[ERROR] ERROR in /opt/soft/griffin-0.5.0/ui/angular/node_modules/@types/jquery/JQuery.d.ts (4137,89): Cannot find name 'SVGElementTagNameMap'.
[ERROR]
[ERROR] npm ERR! Linux 3.10.0-514.el7.x86_64
[ERROR] npm ERR! argv "/opt/soft/griffin-0.5.0/ui/.tmp/node/node" "/opt/soft/griffin-0.5.0/ui/.tmp/node/node_modules/npm/bin/npm-cli.js" "run" "build "
[ERROR] npm ERR! node v6.11.3
[ERROR] npm ERR! npm v3.10.10
[ERROR] npm ERR! code ELIFECYCLE
[ERROR] npm ERR! griffin@0.0.0 build: `ng build`
[ERROR] npm ERR! Exit status 1
[ERROR] npm ERR!
[ERROR] npm ERR! Failed at the griffin@0.0.0 build script 'ng build'.
[ERROR] npm ERR! Make sure you have the latest version of node.js and npm installed.
[ERROR] npm ERR! If you do, this is most likely a problem with the griffin package,
[ERROR] npm ERR! not with npm itself.
[ERROR] npm ERR! Tell the author that this fails on your system:
[ERROR] npm ERR! ng build
[ERROR] npm ERR! You can get information on how to open an issue for this project with:
[ERROR] npm ERR! npm bugs griffin
[ERROR] npm ERR! Or if that isn't available, you can get their info via:
[ERROR] npm ERR! npm owner ls griffin
[ERROR] npm ERR! There is likely additional logging output above.
[ERROR]
[ERROR] npm ERR! Please include the following file with any support request:
[ERROR] npm ERR! /opt/soft/griffin-0.5.0/ui/angular/npm-debug.log
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.5.0 0.5.0:
[INFO]
[INFO] Apache Griffin 0.5.0 ............................... SUCCESS [ 16.497 s]
[INFO] Apache Griffin :: UI :: Default UI ................. FAILURE [24:12 min]
[INFO] Apache Griffin :: Web Service ...................... SKIPPED
[INFO] Apache Griffin :: Measures ......................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 24:28 min
[INFO] Finished at: 2020-06-30T15:04:04+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.6:npm (npm build) on project ui: Failed to run task: 'npm run build' faile d. org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
vim /opt/soft/griffin-0.5.0/ui/angular/node_modules/@types/jquery/JQuery.d.ts
Delete line 4137:
find<K extends keyof SVGElementTagNameMap>(selector_element: K | JQuery<K>): JQuery<SVGElementTagNameMap[K]>;
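The same fix can be scripted with sed instead of editing by hand. This sketch runs on a scratch file; in practice point DTS at the real ui/angular/node_modules/@types/jquery/JQuery.d.ts path.

```shell
# delete every line mentioning SVGElementTagNameMap (demonstrated on a scratch file)
DTS=$(mktemp)
printf '%s\n' \
  'some other overload;' \
  'find<K extends keyof SVGElementTagNameMap>(selector_element: K | JQuery<K>): JQuery<SVGElementTagNameMap[K]>;' \
  > "$DTS"
sed -i '/SVGElementTagNameMap/d' "$DTS"
cat "$DTS"   # prints: some other overload;
```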
2.7 Deploying the jars
Create the directories:
[root@cdh3 soft]# mkdir /opt/soft/griffin-0.5.0
[root@cdh3 soft]# hadoop fs -mkdir -p /griffin/persist
[root@cdh3 soft]# hadoop fs -mkdir -p /griffin/checkpoint
[root@cdh3 soft]# hadoop fs -ls /griffin
Found 2 items
drwxr-xr-x - root supergroup 0 2020-06-30 15:55 /griffin/checkpoint
drwxr-xr-x - root supergroup 0 2020-06-30 15:55 /griffin/persist
Set environment variables (optional):
[root@cdh3 soft]# vim /etc/profile   # add the following
export GRIFFIN_HOME=/opt/soft/griffin-0.5.0
export PATH=$PATH:$GRIFFIN_HOME/bin
# save, then source it
[root@cdh3 soft]# source /etc/profile
# rename the measure and service jars; the service jar name must match application.properties above
mv measure/target/measure-0.5.0.jar $GRIFFIN_HOME/griffin-measure.jar
mv service/target/service-0.5.0.jar $GRIFFIN_HOME/griffin-service.jar
# upload the measure jar to HDFS
hadoop fs -put $GRIFFIN_HOME/griffin-measure.jar /griffin/
# griffin-service.jar stays in GRIFFIN_HOME
Start the server
# make sure the Hive metastore service is up before starting
nohup java -jar $GRIFFIN_HOME/griffin-service.jar>$GRIFFIN_HOME/service.out 2>&1 &
Watch the startup log with tail -f service.out to catch errors early; after a few seconds the web UI is reachable at http://cdh3:8090 (log in as, e.g., user/test).
With the default login strategy no real account is needed, so any username/password also works.
The screenshots below were taken after a few jobs had already been run.
Measures lists the defined measures; a measure can be understood as a rule or model defining a data quality check.
When creating a measure there are four data quality model types:
1. Accuracy: compares two datasets, source and target, under a specified comparison rule (greater than, less than, equals) over a specified interval; the result set is produced by the Spark job that the scheduler launches.
2. Data Profiling: profiles a single source dataset, computing e.g. max, min, and count values for n fields.
3. Publish: if a measure was created via configuration files rather than the UI, and Spark has run that quality model, the results are written to ES; defining a Publish measure with the same name makes the result set appear on the dashboard.
4. json/yaml Measure: user-defined measures; configuration files can also be defined here.
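For the Accuracy type in item 1, the metric that ends up on the dashboard boils down to matched/total, with matched = total - miss. A sketch with made-up counts:

```shell
# hypothetical counts reported by one accuracy run
total=125000
miss=92
matched=$((total - miss))
# accuracy percentage, rounded to two decimals
awk -v m="$matched" -v t="$total" 'BEGIN { printf "accuracy = %.2f%%\n", m * 100 / t }'
# prints: accuracy = 99.93%
```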
Jobs lists the user-defined jobs; a job periodically triggers a measure. Scheduling is implemented with Quartz, and the Spark computation is handed to Livy; with Spark on YARN, Livy passes the task on to YARN.
My Dashboard shows all result-set reports.
DataAssets lists the data assets available as data sources; here it shows Hive tables.
3. Batch data accuracy test
Two Hive external tables are compared under the rules defined in a measure, producing a result set.
3.1 Data preparation
Download the batch test data scripts: http://griffin.apache.org/data/batch/
[root@cdh3 griffin-0.5.0]# ll /opt/soft/griffin-0.5.0/testdata/
total 5420
-rw-r--r-- 1 root root 456 Dec 26 2018 create-table.hql
-rw-r--r-- 1 root root 12904 Jul 6 13:54 delta_src
-rw-r--r-- 1 root root 11785 Dec 26 2018 delta_tgt
-rw-r--r-- 1 root root 1351464 Dec 26 2018 demo_basic
-rw-r--r-- 1 root root 1364368 Jul 6 13:54 demo_src
-rw-r--r-- 1 root root 1363249 Jul 6 13:54 demo_tgt
-rwxr-xr-x 1 root root 142 Jul 6 14:11 gen_delta_src.sh
-rwxr-xr-x 1 root root 179 Dec 26 2018 gen_demo_data.sh
-rwxr-xr-x 1 root root 1704 Jul 1 09:22 gen-hive-data.sh
-rw-r--r-- 1 root root 190 Jul 1 09:11 insert-data.hql.template
Create the Hive external tables manually, based on the create-table.hql script:
CREATE EXTERNAL TABLE `demo_src`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs://cdh2:8020/griffin/data/batch/demo_src';
CREATE EXTERNAL TABLE `demo_tgt`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs://cdh2:8020/griffin/data/batch/demo_tgt';
Fix permissions on any scripts that lack them:
chmod 755 ./testdata/*.sh
Adapt gen-hive-data.sh to your environment as follows:
#!/bin/bash
#create table -- the hive external tables were created manually above, so this is skipped
#hive -f create-table.hql
echo "create table done"
#current hour
./gen_demo_data.sh
cur_date=`date +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"
#last hour
./gen_demo_data.sh
cur_date=`date -d '1 hour ago' +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"
#next hours
set +e
while true
do
./gen_demo_data.sh
cur_date=`date +%Y%m%d%H`
next_date=`date -d "+1hour" '+%Y%m%d%H'`
dt=${next_date:0:8}
hour=${next_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"
sleep 3600
done
set -e
Running the script populates three partitions at once in both demo_src and demo_tgt (previous hour, current hour, next hour), and then generates one new partition every hour.
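The partition bookkeeping in the script boils down to bash substring slicing plus a sed template substitution; the same logic in isolation, with a fixed timestamp:

```shell
cur_date=2020070916          # normally produced by `date +%Y%m%d%H`
dt=${cur_date:0:8}           # first 8 characters -> 20200709
hour=${cur_date:8:2}         # next 2 characters -> 16
partition_date="dt='$dt',hour='$hour'"
# insert-data.hql.template's PARTITION_DATE placeholder is replaced the same way
echo "PARTITION (PARTITION_DATE)" | sed "s/PARTITION_DATE/$partition_date/"
# prints: PARTITION (dt='20200709',hour='16')
```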
[root@cdh3 testdata]# ./gen-hive-data.sh
create table done
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hive-common-2.1.1-cdh6.3.2.jar!/hive-log4j2.properties Async: false
Loading data to table griffin_demo.demo_src partition (dt=20200709, hour=16)
OK
Time taken: 7.98 seconds
Loading data to table griffin_demo.demo_tgt partition (dt=20200709, hour=16)
OK
Time taken: 1.322 seconds
insert data [dt='20200709',hour='16'] done
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hive-common-2.1.1-cdh6.3.2.jar!/hive-log4j2.properties Async: false
Loading data to table griffin_demo.demo_src partition (dt=20200709, hour=15)
OK
Time taken: 8.104 seconds
Loading data to table griffin_demo.demo_tgt partition (dt=20200709, hour=15)
OK
Time taken: 1.613 seconds
insert data [dt='20200709',hour='15'] done
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
^Ctouchz: `/griffin/data/batch/demo_src/dt=20200709/hour=17/_DONE': No such file or directory: `hdfs://cdh2:8020/griffin/data/batch/demo_src/dt=20200709/hour=17/_DONE'
touchz: `/griffin/data/batch/demo_tgt:/dt=20200709/hour=17/_DONE': No such file or directory: `hdfs://cdh2:8020/griffin/data/batch/demo_tgt:/dt=20200709/hour=17/_DONE'
insert data [dt='20200709',hour='17'] done
3.2 Create an Accuracy measure
The griffin_demo database contains the demo_src and demo_tgt tables. Choose the source: the age field of demo_src.
Choose the target: the age field of demo_tgt.
Choose the comparison rule between the two fields.
Partitioning is by hour; time zone UTC+8.
Name the measure and add a description.
3.3 Create a job to run the measure
Schedule it every half hour with the cron expression 0 0/30 * * * ?
Clicking Run 2 runs it every half hour; clicking Run 1 runs it only once.
Eventually the metric can be seen.
Missed records can be downloaded via Download miss sample; the missed data is in JSON format:
{"id":124,"age":1090,"desc":"1090","dt":"20200706","hour":"12","__tmst":1594020363542}
{"id":124,"age":1159,"desc":"1159","dt":"20200706","hour":"12","__tmst":1594020363542}
{"id":124,"age":1159,"desc":"1159","dt":"20200706","hour":"14","__tmst":1594020363542}
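Since the downloaded miss sample is plain JSON lines, ordinary shell tools are enough for a first look; for example, tallying the misses per partition hour (using the three sample records above):

```shell
# write the sample miss records to a file, then count them by hour
cat > miss.json <<'EOF'
{"id":124,"age":1090,"desc":"1090","dt":"20200706","hour":"12","__tmst":1594020363542}
{"id":124,"age":1159,"desc":"1159","dt":"20200706","hour":"12","__tmst":1594020363542}
{"id":124,"age":1159,"desc":"1159","dt":"20200706","hour":"14","__tmst":1594020363542}
EOF
grep -o '"hour":"[0-9]*"' miss.json | sort | uniq -c
```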
4. Streaming data accuracy test
Create two topics in Kafka; the data in the two topics is compared under the rules defined in a measure, producing a result set.
4.1 Data preparation
Download the streaming test scripts to a Kafka node: http://griffin.apache.org/data/streaming/
[root@cdh4 griffin]# ll /opt/soft/griffin/testStream/
total 16
-rwxr-xr-x 1 root root 662 Jul 7 16:11 gen-data.sh
-rw-r--r-- 1 root root 616 Jul 7 15:52 source.temp
-rwxr-xr-x 1 root root 297 Jul 7 15:54 streaming-data.sh
-rw-r--r-- 1 root root 611 Jul 7 15:52 target.temp
Modify the scripts:
Since my Kafka is not an old version, the Kafka parts were adjusted accordingly.
vim gen-data.sh
#!/bin/bash
#current time
cur_time=`date +%Y-%m-%d_%H:%M:%S`
sed s/TIME/$cur_time/ ./source.temp > source.tp
sed s/TIME/$cur_time/ ./target.temp > target.tp
#create data
for row in 1 2 3 4 5 6 7 8 9 10
do
sed -n "${row}p" < source.tp > sline
cnt=`shuf -i1-2 -n1`
clr="red"
if [ $cnt == 2 ]; then clr="yellow"; fi
sed s/COLOR/$clr/ sline >> source.data
done
rm sline
cat target.tp > target.data
rm source.tp target.tp
#import data
kafka-console-producer --broker-list cdh4:9092 --topic source < source.data
kafka-console-producer --broker-list cdh4:9092 --topic target < target.data
rm source.data target.data
echo "insert data at ${cur_time}"
vim streaming-data.sh
#!/bin/bash
#create topics
kafka-topics --create --zookeeper cdh4:2181 --replication-factor 1 --partitions 1 --topic source
kafka-topics --create --zookeeper cdh4:2181 --replication-factor 1 --partitions 1 --topic target
#every minute
set +e
while true
do
./gen-data.sh
sleep 60
done
set -e
Running the script sends data to both topics every minute.
[root@cdh4 testStream]# ./streaming-data.sh
4.2 Writing the configuration files
[root@cdh3 testStream]# ll
total 8
-rw-r--r-- 1 root root 2686 Jul 8 15:21 dqStream.json
-rw-r--r-- 1 root root 1231 Jul 8 09:12 envStream.json
[root@cdh3 griffin-0.5.0]# vim ./testStream/dqStream.json
{
"name": "streaming_accu",
"process.type": "streaming",
"data.sources": [
{
"name": "src",
"baseline": true,
"connectors": [
{
"type": "kafka",
"version": "2.2.1+cdh6.3.2",
"config": {
"kafka.config": {
"bootstrap.servers": "cdh4:9092",
"group.id": "griffin",
"auto.offset.reset": "largest",
"auto.commit.enable": "false"
},
"topics": "source",
"key.type": "java.lang.String",
"value.type": "java.lang.String"
},
"pre.proc": [
{
"dsl.type": "df-opr",
"rule": "from_json"
}
]
}
],
"checkpoint": {
"type": "json",
"file.path": "hdfs://cdh2:8020/griffin/streaming/dump/source",
"info.path": "source",
"ready.time.interval": "10s",
"ready.time.delay": "0",
"time.range": ["-5m", "0"],
"updatable": true
}
}, {
"name": "tgt",
"connectors": [
{
"type": "kafka",
"version": "2.2.1+cdh6.3.2",
"config": {
"kafka.config": {
"bootstrap.servers": "cdh4:9092",
"group.id": "griffin",
"auto.offset.reset": "largest",
"auto.commit.enable": "false"
},
"topics": "target",
"key.type": "java.lang.String",
"value.type": "java.lang.String"
},
"pre.proc": [
{
"dsl.type": "df-opr",
"rule": "from_json"
}
]
}
],
"checkpoint": {
"type": "json",
"file.path": "hdfs://cdh2:8020/griffin/streaming/dump/target",
"info.path": "target",
"ready.time.interval": "10s",
"ready.time.delay": "0",
"time.range": ["-1m", "0"]
}
}
],
"evaluate.rule": {
"rules": [
{
"dsl.type": "griffin-dsl",
"dq.type": "accuracy",
"out.dataframe.name": "accu",
"rule": "src.id = tgt.id AND src.name = tgt.name AND src.color = tgt.color AND src.time = tgt.time",
"details": {
"source": "src",
"target": "tgt",
"miss": "miss_count",
"total": "total_count",
"matched": "matched_count"
},
"out":[
{
"type":"metric",
"name": "accu"
},
{
"type":"record",
"name": "missRecords"
}
]
}
]
},
"sinks": ["CONSOLE", "HDFS","ELASTICSEARCH"]
}
[root@cdh3 griffin-0.5.0]# vim ./testStream/envStream.json
{
"spark": {
"log.level": "WARN",
"checkpoint.dir": "hdfs://cdh2:8020/griffin/checkpoint",
"batch.interval": "20s",
"process.interval": "1m",
"init.clear": true,
"config": {
"spark.default.parallelism": 4,
"spark.task.maxFailures": 5,
"spark.streaming.kafkaMaxRatePerPartition": 1000,
"spark.streaming.concurrentJobs": 4,
"spark.yarn.maxAppAttempts": 5,
"spark.yarn.am.attemptFailuresValidityInterval": "1h",
"spark.yarn.max.executor.failures": 120,
"spark.yarn.executor.failuresValidityInterval": "1h",
"spark.hadoop.fs.hdfs.impl.disable.cache": true
}
},
"sinks": [
{
"type": "console"
},
{
"type": "hdfs",
"config": {
"path": "hdfs://cdh2:8020/griffin/persist"
}
},
{
"type": "elasticsearch",
"config": {
"method": "post",
"api": "http://cdh3:9200/griffin/accuracy"
}
}
],
"griffin.checkpoint": [
{
"type": "zk",
"config": {
"hosts": "cdh3:2181",
"namespace": "griffin/infocache",
"lock.path": "lock",
"mode": "persist",
"init.clear": true,
"close.clear": false
}
}
]
}
4.3 Running Spark
Submit the Spark job; the result set is written to the console, HDFS, and ES.
spark-submit \
--class org.apache.griffin.measure.Application \
--master yarn \
--deploy-mode client \
--queue default \
--driver-memory 512m \
--executor-memory 512m \
--num-executors 3 \
hdfs://cdh2:8020/griffin/griffin-measure.jar \
./testStream/envStream.json \
./testStream/dqStream.json
Check that the result sets appear in HDFS and ES.
The HDFS result set looks like this (screenshot added afterwards):
On the UI, publish your own quality result set via Publish. The MeasureName (see dqStream.json) and MetricName (see the script's log output) must match the configuration files for the results to be recognized. One problem: once a Publish measure is defined, the Jobs page no longer shows the earlier jobs.
View the streaming data quality results on My Dashboard.
The results are shown as a flat table rather than a line chart; I suspect the measure type definition is the cause, but I do not yet know how to fix it.
Error 2: running a job fails with:
2020-07-01 13:15:00.145 ERROR 55618 --- [ryBean_Worker-4] o.a.g.c.j.SparkSubmitJob : Post to livy ERROR.
400 Bad Request {"msg":"Duplicate session name: Some(griffin) for session 36"}
Fix:
Delete the name line in service/src/main/resources/sparkProperties.json, then rebuild.
Error 3: running a job fails because the JVM cannot be created:
YARN Diagnostics: ], appId=null, appInfo={driverLogUrl=null, sparkUiUrl=null}, name=null, proxyUser=null, id=45, state=starting}, batches id : 45
2020-07-01 14:10:00.836 INFO 60336 --- [ryBean_Worker-4] o.a.g.c.j.LivyTaskSubmitHelper : retry get livy resultMap: {owner=null, log=[stdout: , Inv alid maximum heap size: -Xmx512mb, Error: Could not create the Java Virtual Machine., Error: A fatal exception has occurred. Program will exit.,
stderr: ,
Fix: sparkProperties.json had a typo: 512mb should have been 512m.
vim service/src/main/resources/sparkProperties.json
"driverMemory": "512m",
"executorMemory": "512m",
Issue 4: the dashboard shows no metrics; the service log warns:
2020-07-02 15:23:13.308 WARN 56236 --- [/O dispatcher 1] o.e.c.RestClient : request [GET http://cdh3:9200/griffin/accuracy/_search?fil ter_path=hits.hits._source] returned 1 warnings: [299 Elasticsearch-7.8.0-757314695644ea9a1dc2fecd26d1a43856725e65 "[types removal] Specifying types in search requests is deprecated."]
Fix: the ES version is too new and the index had been created incorrectly; create it with include_type_name=true as shown above.