Atlas Type System
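The Atlas type system lets users define a model for the metadata objects they want to manage. The model is made up of definitions called "types". Instances of types, called "entities", represent the actual metadata objects being managed. The type system is the component that lets users define and manage types and entities. Every metadata object managed by Atlas (a Hive table, for example) is modeled with a type and represented as an entity. To store a new kind of metadata in Atlas, you need to understand the concepts of the type system component.
Atlas REST API reference: http://atlas.apache.org/api/v2/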
Types
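A "type" in Atlas defines how a particular kind of metadata object is stored and accessed. A type represents a collection of one or more attributes that describe the metadata object. Users with a development background can think of a type as a "class" definition in an object-oriented programming language, or as a "table schema" in a relational database.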
All types in Atlas can be retrieved through this API: http://atlas:21000/api/atlas/v2/types/typedefs
The definition of the hive_table type can be retrieved through this API: http://atlas:21000/api/atlas/v2/types/typedef/name/hive_table
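As a minimal sketch of how a client might address these two endpoints, the class below builds the URIs and a plain JSON GET request with the JDK's java.net.http API (Java 11 or later). The class and method names are hypothetical, the host atlas:21000 is taken from the URLs above, and authentication is deliberately left out because it depends on the deployment.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class TypeDefRequests {
    // Base path of the v2 types API, as used in the URLs above.
    static final String TYPES_PATH = "/api/atlas/v2/types";

    // URI that returns every type definition known to the server.
    static URI allTypeDefs(String baseUrl) {
        return URI.create(baseUrl + TYPES_PATH + "/typedefs");
    }

    // URI that returns a single type definition, looked up by name.
    static URI typeDefByName(String baseUrl, String typeName) {
        return URI.create(baseUrl + TYPES_PATH + "/typedef/name/" + typeName);
    }

    // Plain GET asking for JSON; credentials are intentionally not set here.
    static HttpRequest jsonGet(URI uri) {
        return HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = jsonGet(typeDefByName("http://atlas:21000", "hive_table"));
        // Only prints the request line; sending it needs an HttpClient and valid credentials.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would go through HttpClient.send with whatever credentials the Atlas server expects.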
Example of the hive_table type:
{
"category": "ENTITY",
"guid": "30a12b7c-faed-4ead-ad83-868893ebed93",
"createdBy": "cloudera-scm",
"updatedBy": "cloudera-scm",
"createTime": 1536203750750,
"updateTime": 1536203750750,
"version": 1,
"name": "hive_table",
"description": "hive_table",
"typeVersion": "1.1",
"options": {
"schemaElementsAttribute": "columns"
},
"attributeDefs": [
{
"name": "db",
"typeName": "hive_db",
"isOptional": false,
"cardinality": "SINGLE",
"valuesMinCount": 1,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "createTime",
"typeName": "date",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "lastAccessTime",
"typeName": "date",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "comment",
"typeName": "string",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "retention",
"typeName": "int",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "sd",
"typeName": "hive_storagedesc",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false,
"constraints": [
{
"type": "ownedRef"
}
]
},
{
"name": "partitionKeys",
"typeName": "array<hive_column>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false,
"constraints": [
{
"type": "ownedRef"
}
]
},
{
"name": "aliases",
"typeName": "array<string>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "columns",
"typeName": "array<hive_column>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false,
"constraints": [
{
"type": "ownedRef"
}
]
},
{
"name": "parameters",
"typeName": "map<string,string>",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "viewOriginalText",
"typeName": "string",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "viewExpandedText",
"typeName": "string",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "tableType",
"typeName": "string",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "temporary",
"typeName": "boolean",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": true,
"includeInNotification": false
}
],
"superTypes": [
"DataSet"
],
"subTypes": []
}
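From the example above, note the following:
- A type in Atlas is uniquely identified by its "name".
- attributeDefs holds the definitions of the attributes of the type.
- A type has a metatype. The metatype indicates what kind of type this is in the Atlas model. In the v2 JSON above it appears in the "category" field, where a Class type is reported as ENTITY. Atlas has the following metatypes:
  - Primitive metatypes: int, string, boolean, and so on.
  - Enum metatypes
  - Collection metatypes: for example array, map
  - Composite metatypes: Class, Struct, Trait
- A type can "extend" parent types, which are listed in "superTypes". It then has all the attributes defined in its supertypes. This lets a model define common attributes once across a set of related types, much like superclasses in object-oriented languages. A type in Atlas can also extend several supertypes. Conversely, subTypes lists the types that extend this type.
  - In this example, the hive_table type extends a predefined supertype called "DataSet". More details about this predefined type are given later.
- Types with the metatype Class, Struct, or Trait can have a collection of attributes. Each attribute has a name (for example "name") and some other associated properties. An attribute can be referred to with the expression type_name.attribute_name. Note that attributes are themselves defined using Atlas metatypes.
  - In this example, hive_table.name is a string, hive_table.aliases is an array of strings, hive_table.db refers to an instance of a type called hive_db, and so on.
- Attributes can hold type references (such as hive_table.db). With such attributes we can define arbitrary relationships between two types defined in Atlas and so build rich models. An attribute can also be a collection of references (for example hive_table.columns, which is a list of references from hive_table to the hive_column type).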
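To make the superTypes rule concrete, here is a small sketch that collects every attribute a type ends up with by walking its supertypes. It is not Atlas code: the TypeDef class and the in-memory registry are hypothetical, and the sample data only copies a few attribute names from the hive_table, DataSet, Asset, and Referenceable definitions shown in this section.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class InheritedAttributes {
    // Simplified stand-in for a type definition: own attribute names plus direct supertypes.
    static final class TypeDef {
        final List<String> attributes;
        final List<String> superTypes;

        TypeDef(List<String> attributes, List<String> superTypes) {
            this.attributes = attributes;
            this.superTypes = superTypes;
        }
    }

    // Collects the attributes of all supertypes first, then the type's own attributes.
    static Set<String> allAttributes(String typeName, Map<String, TypeDef> registry) {
        TypeDef def = registry.get(typeName);
        if (def == null) {
            throw new IllegalArgumentException("Unknown type: " + typeName);
        }
        Set<String> result = new LinkedHashSet<>();
        for (String superType : def.superTypes) {
            result.addAll(allAttributes(superType, registry));
        }
        result.addAll(def.attributes);
        return result;
    }

    // Subset of the definitions shown in this section, reduced to names only.
    static Map<String, TypeDef> sampleRegistry() {
        return Map.of(
                "Referenceable", new TypeDef(List.of("qualifiedName"), List.of()),
                "Asset", new TypeDef(List.of("name", "description", "owner"), List.of("Referenceable")),
                "DataSet", new TypeDef(List.of(), List.of("Asset")),
                "hive_table", new TypeDef(List.of("db", "columns"), List.of("DataSet")));
    }

    public static void main(String[] args) {
        // prints [qualifiedName, name, description, owner, db, columns]
        System.out.println(allAttributes("hive_table", sampleRegistry()));
    }
}
```

This is why an expression such as hive_table.name is valid even though "name" does not appear in hive_table's own attributeDefs.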
DataSet type definition:
{
"category": "ENTITY",
"guid": "d31c0a02-6999-4f81-a62a-07d7654aec84",
"createdBy": "cloudera-scm",
"updatedBy": "cloudera-scm",
"createTime": 1536203676149,
"updateTime": 1536203676149,
"version": 1,
"name": "DataSet",
"description": "DataSet",
"typeVersion": "1.1",
"attributeDefs": [],
"superTypes": [
"Asset"
],
"subTypes": [
"rdbms_foreign_key",
"rdbms_db",
"kafka_topic",
"hive_table",
"sqoop_dbdatastore",
"hbase_column",
"rdbms_instance",
"falcon_feed",
"jms_topic",
"hbase_table",
"rdbms_table",
"rdbms_column",
"rdbms_index",
"hbase_column_family",
"access_info",
"hive_column",
"avro_type",
"fs_path"
]
}
As shown above, DataSet has many subtypes; many of the types that ship with Atlas inherit from it.
DataSet in turn inherits from Asset, which defines a few common attributes.
Asset type definition:
{
"category": "ENTITY",
"guid": "349a5c61-47c3-4f4b-9a79-7fd59454a73a",
"createdBy": "cloudera-scm",
"updatedBy": "cloudera-scm",
"createTime": 1536203676083,
"updateTime": 1536203676083,
"version": 1,
"name": "Asset",
"description": "Asset",
"typeVersion": "1.1",
"attributeDefs": [
{
"name": "name",
"typeName": "string",
"isOptional": false,
"cardinality": "SINGLE",
"valuesMinCount": 1,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": true,
"includeInNotification": false
},
{
"name": "description",
"typeName": "string",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "owner",
"typeName": "string",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": 0,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": true,
"includeInNotification": false
}
],
"superTypes": [
"Referenceable"
],
"subTypes": [
"rdbms_foreign_key",
"rdbms_db",
"rdbms_instance",
"DataSet",
"rdbms_table",
"rdbms_column",
"rdbms_index",
"Infrastructure",
"Process",
"avro_type",
"hbase_namespace",
"hive_db"
]
}
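The Asset type defines three attributes:
- name: the name
- owner: the owner
- description: a description
It also has quite a few subtypes that represent assets, such as the Hive database type hive_db and the HBase namespace type hbase_namespace. Asset inherits from the Referenceable type.
Referenceable type definition: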
{
"category": "ENTITY",
"guid": "34c72533-2e80-4e5c-9226-e15b163f98d1",
"createdBy": "cloudera-scm",
"updatedBy": "cloudera-scm",
"createTime": 1536203673540,
"updateTime": 1536203673540,
"version": 1,
"name": "Referenceable",
"description": "Referenceable",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"isOptional": false,
"cardinality": "SINGLE",
"valuesMinCount": 1,
"valuesMaxCount": 1,
"isUnique": true,
"isIndexable": true,
"includeInNotification": false
}
],
"superTypes": [],
"subTypes": [
"hive_storagedesc",
"Asset"
]
}
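This type defines one very important attribute, qualifiedName, which is a unique qualified name within a type. Together with the type name, it can be used to look up the one matching entity in Atlas. Note the difference from guid, which is globally unique.
For example:
- qualifiedName of a Hive database: test@primary
- qualifiedName of the table test_table in that database: test.test_table@primary
- qualifiedName of the column name in that table: test.test_table.name@primary
@primary is the default cluster name. The cluster name is set with the configuration below; appending it to the name lets an entity be identified uniquely across clusters.
atlas.cluster.name=primary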
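The naming pattern in the examples above can be written down as a few helper methods. This is only a sketch of the pattern as it appears in this article; the class and method names are hypothetical, and the code that actually generates these names in Atlas's Hive integration may apply further rules that are not shown here.

```java
class QualifiedNames {
    // <db>@<cluster>
    static String database(String db, String cluster) {
        return db + "@" + cluster;
    }

    // <db>.<table>@<cluster>
    static String table(String db, String table, String cluster) {
        return db + "." + table + "@" + cluster;
    }

    // <db>.<table>.<column>@<cluster>
    static String column(String db, String table, String column, String cluster) {
        return db + "." + table + "." + column + "@" + cluster;
    }

    public static void main(String[] args) {
        String cluster = "primary"; // value of atlas.cluster.name
        System.out.println(database("test", cluster));
        System.out.println(table("test", "test_table", cluster));
        System.out.println(column("test", "test_table", "name", cluster));
    }
}
```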
Process type definition:
{
"category": "ENTITY",
"guid": "7c03ccad-29aa-4c5f-8a27-19b536068f69",
"createdBy": "cloudera-scm",
"updatedBy": "cloudera-scm",
"createTime": 1536203677547,
"updateTime": 1536203677547,
"version": 1,
"name": "Process",
"description": "Process",
"typeVersion": "1.1",
"attributeDefs": [
{
"name": "inputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "outputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
}
],
"superTypes": [
"Asset"
],
"subTypes": [
"falcon_feed_replication",
"falcon_process",
"falcon_feed_creation",
"sqoop_process",
"hive_column_lineage",
"storm_topology",
"hive_process"
]
}
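- Process inherits from Asset, so it carries four inherited attributes: name, owner, and description from Asset, plus qualifiedName, which Asset itself inherits from Referenceable.
- Its own attributes, inputs and outputs, represent the inputs and outputs of the process. Process is the supertype of all the types used in Atlas lineage management.
Conceptually, Process can represent any data transformation. For example, an ETL process that turns a Hive table of raw data into another Hive table holding some aggregate can be a specific type that extends Process. The Process type has two attributes of its own, inputs and outputs, and both are arrays of DataSet entities. An instance of a Process type can therefore use these inputs and outputs to capture how the lineage of a DataSet evolves.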
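The way inputs and outputs capture lineage can be illustrated with a short sketch that walks backwards from a dataset through the processes that produced it. This is not how Atlas stores or queries lineage internally; the ProcessEntity class, the process names, and the table names are all hypothetical, and datasets are identified simply by a qualifiedName string.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class LineageSketch {
    // Simplified stand-in for a Process entity: a name plus input and output datasets.
    static final class ProcessEntity {
        final String name;
        final List<String> inputs;
        final List<String> outputs;

        ProcessEntity(String name, List<String> inputs, List<String> outputs) {
            this.name = name;
            this.inputs = inputs;
            this.outputs = outputs;
        }
    }

    // Returns every dataset that directly or transitively feeds the given dataset.
    static Set<String> upstream(String dataset, List<ProcessEntity> processes) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(dataset);
        while (!pending.isEmpty()) {
            String current = pending.poll();
            for (ProcessEntity process : processes) {
                if (!process.outputs.contains(current)) {
                    continue; // this process did not produce the current dataset
                }
                for (String input : process.inputs) {
                    if (seen.add(input)) {
                        pending.add(input); // follow the chain one step further back
                    }
                }
            }
        }
        return seen;
    }

    // Two hypothetical ETL steps: raw -> clean -> daily aggregate.
    static List<ProcessEntity> sampleProcesses() {
        return List.of(
                new ProcessEntity("clean", List.of("test.raw_events@primary"),
                        List.of("test.clean_events@primary")),
                new ProcessEntity("aggregate", List.of("test.clean_events@primary"),
                        List.of("test.daily_counts@primary")));
    }

    public static void main(String[] args) {
        // prints [test.clean_events@primary, test.raw_events@primary]
        System.out.println(upstream("test.daily_counts@primary", sampleProcesses()));
    }
}
```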
Entities
An entity is an instance of a type. If a type is thought of as a Java class, an entity is an object of that class.
Example of an entity of the hive_table type:
{
"referredEntities" : {
"779734cc-9011-4066-9bb1-25df6f28ac72" : {
"typeName" : "hive_column",
"attributes" : {
"owner" : "wangjian5185",
"qualifiedName" : "test.student.age@primary",
"name" : "age",
"description" : null,
"comment" : null,
"position" : 1,
"type" : "int",
"table" : {
"guid" : "75a0c17a-dfdb-4532-9aae-c87b64be958d",
"typeName" : "hive_table"
}
},
"guid" : "779734cc-9011-4066-9bb1-25df6f28ac72",
"status" : "ACTIVE",
"createdBy" : "admin",
"updatedBy" : "admin",
"createTime" : 1536215508751,
"updateTime" : 1536215508751,
"version" : 0
},
"c47aed54-e4d2-4080-aa7a-5428075f5b20" : {
"typeName" : "hive_column",
"attributes" : {
"owner" : "wangjian5185",
"qualifiedName" : "test.student.phone@primary",
"name" : "phone",
"description" : null,
"comment" : null,
"position" : 2,
"type" : "int",
"table" : {
"guid" : "75a0c17a-dfdb-4532-9aae-c87b64be958d",
"typeName" : "hive_table"
}
},
"guid" : "c47aed54-e4d2-4080-aa7a-5428075f5b20",
"status" : "ACTIVE",
"createdBy" : "admin",
"updatedBy" : "admin",
"createTime" : 1536215508751,
"updateTime" : 1536215508751,
"version" : 0
},
"7c4b4cd5-841d-409b-b38a-77ec8779e252" : {
"typeName" : "hive_column",
"attributes" : {
"owner" : "wangjian5185",
"qualifiedName" : "test.student.name@primary",
"name" : "name",
"description" : null,
"comment" : null,
"position" : 0,
"type" : "string",
"table" : {
"guid" : "75a0c17a-dfdb-4532-9aae-c87b64be958d",
"typeName" : "hive_table"
}
},
"guid" : "7c4b4cd5-841d-409b-b38a-77ec8779e252",
"status" : "ACTIVE",
"createdBy" : "admin",
"updatedBy" : "admin",
"createTime" : 1536215508751,
"updateTime" : 1536215508751,
"version" : 0
},
"a6038b00-ce2d-4612-9436-d63092d09182" : {
"typeName" : "hive_storagedesc",
"attributes" : {
"bucketCols" : null,
"qualifiedName" : "test.student@primary_storage",
"sortCols" : null,
"storedAsSubDirectories" : false,
"location" : "hdfs://cdhtest/user/hive/warehouse/test.db/student",
"compressed" : false,
"inputFormat" : "org.apache.hadoop.mapred.TextInputFormat",
"outputFormat" : "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"parameters" : null,
"table" : {
"guid" : "75a0c17a-dfdb-4532-9aae-c87b64be958d",
"typeName" : "hive_table"
},
"serdeInfo" : {
"typeName" : "hive_serde",
"attributes" : {
"serializationLib" : "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
"name" : null,
"parameters" : null
}
},
"numBuckets" : 0
},
"guid" : "a6038b00-ce2d-4612-9436-d63092d09182",
"status" : "ACTIVE",
"createdBy" : "admin",
"updatedBy" : "admin",
"createTime" : 1536215508751,
"updateTime" : 1536215508751,
"version" : 0
}
},
"entity" : {
"typeName" : "hive_table",
"attributes" : {
"owner" : "wangjian5185",
"temporary" : false,
"lastAccessTime" : 1533807252000,
"aliases" : null,
"qualifiedName" : "test.student@primary",
"columns" : [ {
"guid" : "7c4b4cd5-841d-409b-b38a-77ec8779e252",
"typeName" : "hive_column"
}, {
"guid" : "779734cc-9011-4066-9bb1-25df6f28ac72",
"typeName" : "hive_column"
}, {
"guid" : "c47aed54-e4d2-4080-aa7a-5428075f5b20",
"typeName" : "hive_column"
} ],
"description" : null,
"viewExpandedText" : null,
"sd" : {
"guid" : "a6038b00-ce2d-4612-9436-d63092d09182",
"typeName" : "hive_storagedesc"
},
"tableType" : "MANAGED_TABLE",
"createTime" : 1533807252000,
"name" : "student",
"comment" : null,
"partitionKeys" : null,
"parameters" : {
"transient_lastDdlTime" : "1533807252"
},
"db" : {
"guid" : "a804165e-77ff-4c60-9ee7-956760577a1e",
"typeName" : "hive_db"
},
"retention" : 0,
"viewOriginalText" : null
},
"guid" : "75a0c17a-dfdb-4532-9aae-c87b64be958d",
"status" : "ACTIVE",
"createdBy" : "admin",
"updatedBy" : "admin",
"createTime" : 1536215508751,
"updateTime" : 1536281983181,
"version" : 0
}
}
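The table is named student and has three columns: name, age, and phone.
- referredEntities, at the top, contains the entities this table refers to: the three column entities and one hive_storagedesc entity (the table's storage information). The table itself stores only their guid values.
- entity holds the information about the student table itself: name, owner, guid, status, attributes, and so on.
- Every entity that is an instance of a Class type is identified by a unique identifier, its GUID. The GUID is generated by the Atlas server when the object is defined and stays the same for the whole lifetime of the entity. At any point in time, that particular entity can be accessed by its GUID.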
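Because the table's attributes hold only guid references, a client that wants column names has to look each guid up in referredEntities. The sketch below shows that lookup on plain maps. It is a simplified illustration rather than the Atlas client model: the class and method names are hypothetical, attributes are reduced to strings, and the sample data copies the guid and name values from the JSON above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ReferredEntities {
    // Resolves each referenced guid to the "name" attribute of the referred entity.
    static List<String> columnNames(List<String> columnGuids,
                                    Map<String, Map<String, String>> referredAttributes) {
        List<String> names = new ArrayList<>();
        for (String guid : columnGuids) {
            Map<String, String> attributes = referredAttributes.get(guid);
            if (attributes == null) {
                throw new IllegalStateException("No referred entity for guid " + guid);
            }
            names.add(attributes.get("name"));
        }
        return names;
    }

    // The guids listed under entity.attributes.columns, in the order shown above.
    static List<String> sampleColumnGuids() {
        return List.of(
                "7c4b4cd5-841d-409b-b38a-77ec8779e252",
                "779734cc-9011-4066-9bb1-25df6f28ac72",
                "c47aed54-e4d2-4080-aa7a-5428075f5b20");
    }

    // guid -> attributes, reduced to the name and type of each column.
    static Map<String, Map<String, String>> sampleReferred() {
        return Map.of(
                "779734cc-9011-4066-9bb1-25df6f28ac72", Map.of("name", "age", "type", "int"),
                "c47aed54-e4d2-4080-aa7a-5428075f5b20", Map.of("name", "phone", "type", "int"),
                "7c4b4cd5-841d-409b-b38a-77ec8779e252", Map.of("name", "name", "type", "string"));
    }

    public static void main(String[] args) {
        // prints [name, age, phone]
        System.out.println(columnNames(sampleColumnGuids(), sampleReferred()));
    }
}
```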
Attributes
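Attribute definitions are what you work with most often when extending the model. They describe the attributes of a type, and each definition has the fields listed below.
The db, createTime, and columns attributes of the hive_table type are used as examples here: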
"attributeDefs" : [ {
"name" : "db",
"typeName" : "hive_db",
"isOptional" : false,
"cardinality" : "SINGLE",
"valuesMinCount" : 1,
"valuesMaxCount" : 1,
"isUnique" : false,
"isIndexable" : false
}, {
"name" : "createTime",
"typeName" : "date",
"isOptional" : true,
"cardinality" : "SINGLE",
"valuesMinCount" : 0,
"valuesMaxCount" : 1,
"isUnique" : false,
"isIndexable" : false
}, {
"name" : "columns",
"typeName" : "array<hive_column>",
"isOptional" : true,
"cardinality" : "SET",
"valuesMinCount" : 0,
"valuesMaxCount" : 2147483647,
"isUnique" : false,
"isIndexable" : false,
"constraints" : [ {
"type" : "ownedRef"
} ]
} ]
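- name: the attribute name.
- typeName: the attribute's type. This can be a primitive type, date, any defined type, a collection type, and so on.
- isOptional: whether the attribute is optional; false means it must be specified.
- cardinality: one of three values: SINGLE (one value), LIST (multiple values, duplicates allowed), or SET (multiple values, no duplicates).
- valuesMinCount: the minimum number of values for the attribute.
- valuesMaxCount: the maximum number of values for the attribute.
- isUnique: whether the attribute is unique.
  - This flag is tied to indexing. When an attribute is marked unique, a special index is created for it in JanusGraph that allows equality-based lookups.
  - Any attribute with this flag set to true is treated like a primary key that distinguishes this entity from other entities. Care should therefore be taken that the attribute models something that really is unique in the real world.
  - For example, consider the name attribute of hive_table. On its own, the name is not unique for a hive_table, because tables with the same name can exist in several databases. If Atlas stores Hive table metadata for several clusters, even the pair (database name, table name) is not unique. Only the combination of cluster location, database name, and table name can be considered unique in the physical world.
- isIndexable: indicates whether the attribute should be indexed, so that lookups using the attribute value as a predicate can be executed efficiently.
- constraints: the constraints that apply to the attribute. Presumably this can be used to implement something similar to a foreign key in MySQL. There are three default values.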
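To show how isOptional, cardinality, valuesMinCount, and valuesMaxCount interact, the sketch below checks a list of values against a simplified attribute definition. It is an illustration of the field descriptions above, not Atlas's own validation logic; the AttributeCheck class and its nested types are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;

class AttributeCheck {
    enum Cardinality { SINGLE, LIST, SET }

    // Simplified stand-in for an attribute definition with the fields discussed above.
    static final class AttributeDef {
        final String name;
        final boolean isOptional;
        final Cardinality cardinality;
        final int valuesMinCount;
        final int valuesMaxCount;

        AttributeDef(String name, boolean isOptional, Cardinality cardinality,
                     int valuesMinCount, int valuesMaxCount) {
            this.name = name;
            this.isOptional = isOptional;
            this.cardinality = cardinality;
            this.valuesMinCount = valuesMinCount;
            this.valuesMaxCount = valuesMaxCount;
        }
    }

    // Returns a description of the first rule the values break, or empty if they are acceptable.
    static Optional<String> violation(AttributeDef def, List<?> values) {
        if (values.isEmpty()) {
            return def.isOptional ? Optional.empty() : Optional.of(def.name + " is required");
        }
        int count = values.size();
        if (count < def.valuesMinCount || count > def.valuesMaxCount) {
            return Optional.of(def.name + " has " + count + " values, allowed "
                    + def.valuesMinCount + ".." + def.valuesMaxCount);
        }
        if (def.cardinality == Cardinality.SET && new HashSet<Object>(values).size() != count) {
            return Optional.of(def.name + " contains duplicate values");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        AttributeDef db = new AttributeDef("db", false, Cardinality.SINGLE, 1, 1);
        AttributeDef columns = new AttributeDef("columns", true, Cardinality.SET, 0, Integer.MAX_VALUE);
        System.out.println(violation(db, List.of()));                          // required attribute missing
        System.out.println(violation(columns, List.of("name", "age", "name"))); // duplicate in a SET
        System.out.println(violation(columns, List.of("name", "age")));        // acceptable
    }
}
```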
Creating and Updating Types in Atlas
Creating the access_info type in Atlas
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.atlas.model.typedef.AtlasEntityDef;
import org.apache.atlas.model.typedef.AtlasStructDef;
import org.apache.atlas.model.typedef.AtlasTypesDef;

public static AtlasTypesDef getAtlasTypesDef() {
    // Container for type definitions; every kind of type can be defined in it
    AtlasTypesDef def = new AtlasTypesDef();
    // Create an entity type definition named access_info
    AtlasEntityDef entityDef = new AtlasEntityDef("access_info");
    // Inherit from the DataSet type
    entityDef.setSuperTypes(Collections.singleton("DataSet"));
    // Creator of the type
    entityDef.setCreatedBy("wangjian");
    // Version number of the type
    entityDef.setVersion(1L);
    // List that will hold the attribute definitions
    List<AtlasStructDef.AtlasAttributeDef> attributeDefs = new ArrayList<>();
    // Create the accessTime attribute
    AtlasStructDef.AtlasAttributeDef accessTime = new AtlasStructDef.AtlasAttributeDef();
    // Attribute name: accessTime, the time of access
    accessTime.setName("accessTime");
    // Its type is date
    accessTime.setTypeName("date");
    // Optional: an entity can be created without this attribute
    accessTime.setIsOptional(true);
    // Indexed, so entities can later be looked up efficiently by this attribute
    accessTime.setIsIndexable(true);
    // Not unique
    accessTime.setIsUnique(false);
    // Single-valued attribute
    accessTime.setCardinality(AtlasStructDef.AtlasAttributeDef.Cardinality.SINGLE);
    // Minimum 0 values, maximum 1; these must be set explicitly, otherwise they default to -1
    accessTime.setValuesMinCount(0);
    accessTime.setValuesMaxCount(1);
    // accessObject references the accessed object; its type is DataSet
    // (hive_table and kafka_topic both inherit from DataSet)
    AtlasStructDef.AtlasAttributeDef accessObject = new AtlasStructDef.AtlasAttributeDef();
    accessObject.setName("accessObject");
    accessObject.setTypeName("DataSet");
    accessObject.setIsOptional(true);
    accessObject.setIsIndexable(true);
    accessObject.setIsUnique(false);
    accessObject.setCardinality(AtlasStructDef.AtlasAttributeDef.Cardinality.SINGLE);
    accessObject.setValuesMinCount(0);
    accessObject.setValuesMaxCount(1);
    // Add both attributes to the list
    attributeDefs.add(accessTime);
    attributeDefs.add(accessObject);
    // Attach the attribute definitions to the entity type
    entityDef.setAttributeDefs(attributeDefs);
    // Put the entity type definition into the AtlasTypesDef
    def.setEntityDefs(Collections.singletonList(entityDef));
    return def;
}
Updating the kafka_topic type in Atlas: adding an attribute
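/**
 * Adds the sampleMessage attribute to the kafka_topic type.
 * clientV2 is assumed to be an AtlasClientV2 field initialized elsewhere.
 */
public void updateKafkaTopic() throws AtlasServiceException {
    // Fetch the entity type definition by its type name
    AtlasEntityDef kafkaTopic = clientV2.getEntityDefByName("kafka_topic");
    // Add a new attribute, sampleMessage, holding sample Kafka messages
    kafkaTopic.getAttributeDefs().add(new AtlasStructDef.AtlasAttributeDef("sampleMessage",
            "array<string>",
            true,                                              // isOptional
            AtlasStructDef.AtlasAttributeDef.Cardinality.LIST, // cardinality
            0,                                                 // valuesMinCount
            10,                                                // valuesMaxCount
            false,                                             // isUnique
            false,                                             // isIndexable
            null                                               // constraints
    ));
    // Put the entity type definition into a new AtlasTypesDef and submit the update
    AtlasTypesDef atlasTypesDef = new AtlasTypesDef();
    atlasTypesDef.getEntityDefs().add(kafkaTopic);
    clientV2.updateAtlasTypeDefs(atlasTypesDef);
}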
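Note: once an attribute has been added, it cannot be deleted or updated. Corresponding interfaces exist, but these operations are not supported yet.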