Atlas with Hive: Installation Summary

yoshubom

Tags: hive hbase hadoop atlas

Article link: https://blog.csdn.net/yoshubom/article/details/126106436

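This document records the installation steps for standalone Apache Atlas in detail, including setting up the embedded HBase and Solr and starting and stopping the related components. It also covers changing the HBase ZooKeeper port to avoid conflicts and configuring Atlas to use the embedded ZooKeeper and Solr. Finally, it explains how to configure Hive with the Atlas hook and import Hive metadata.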

Architecture components

+-- Atlas --+
| HBase(ZK) |
| Kafka     | <----- Hive (hive hook), real-time metadata import -----> MySQL
| Solr      |                                     |
| REST      | <----- import-hive.sh <----- MetaStore, batch metadata import
+-----------+

Atlas standalone

Basic installation

See Atlas系列/Atlas 源码编译.md for the Atlas build that bundles HBase and Solr
Reference: https://atlas.apache.org/#/Installation

# 1. Unpack and install Atlas
tar -xf apache-atlas-2.1.0-server.tar.gz -C /opt/modules/
cd /opt/modules/ && mv apache-atlas-2.1.0 atlas-2.1.0-alone

# Remove the unneeded Windows .cmd files
cd /opt/modules/atlas-2.1.0-alone
find . -name '*.cmd' -type f -exec rm -f {} +

# 2. Start Atlas
## To start the embedded HBase and Solr, set:
### export MANAGE_LOCAL_HBASE=true
### export MANAGE_LOCAL_SOLR=true
## conf/atlas-env.sh already contains both settings, so Atlas can be started directly
bin/atlas_start.py

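# 3. Start components individually
## 3.1 Start the embedded HBase on its own
hbase/bin/start-hbase.sh

## 3.2 Start the embedded Solr on its own
solr/bin/solr start -c -z localhost:2181 -p 8983
### -z gives the ZooKeeper address, -p gives the Solr port

## 3.3 Create the Solr indexes
#solr/bin/solr create -c vertex_index -d conf/solr/
#solr/bin/solr create -c edge_index -d conf/solr/
#solr/bin/solr create -c fulltext_index -d conf/solr/
### Atlas keeps the indexes for its graph store in Solr
### vertex_index and edge_index cover the graph's vertices and edges (links between vertices); fulltext_index serves full-text search
### In testing, standalone Atlas did not need its Solr collections created by hand; existing ones can be listed at
### http://hadoop112:9838/solr/admin/collections?action=list
### or in the Solr web UI; the default port is configured in `bin/atlas_config.py`
### http://hadoop112:9838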

## Stop the embedded HBase
hbase/bin/stop-hbase.sh
## Stop the embedded Solr
solr/bin/solr stop

## Follow the Atlas startup log; startup takes roughly 10 minutes
tail -f logs/application.log
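
Because startup can take around ten minutes, a small polling loop can stand in for watching the log by hand. The following is a minimal sketch and not part of the Atlas distribution: `wait_until` and `atlas_is_up` are hypothetical helper names, and the probe assumes the default REST port 21000 and the default admin/admin account.

```shell
# Retry a probe command until it succeeds or the time limit (seconds) is reached.
# Usage: wait_until LIMIT INTERVAL COMMAND [ARGS...]
wait_until() {
  limit=$1
  interval=$2
  shift 2
  elapsed=0
  until "$@"; do
    elapsed=$((elapsed + interval))
    if [ "$elapsed" -ge "$limit" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Hypothetical probe: assumes port 21000 and the default admin/admin account.
atlas_is_up() {
  curl -sf -u admin:admin http://localhost:21000/api/atlas/admin/version >/dev/null
}

# Example (not run here): wait up to 15 minutes, probing every 10 seconds.
# wait_until 900 10 atlas_is_up && echo "Atlas is up"
```

Passing the probe as a command keeps the loop reusable for other checks, such as a Solr or HBase port test.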

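Additional changes

Change the port of the ZooKeeper instance bundled with the standalone HBase to 2182
to avoid a conflict with the cluster's ZooKeeper (ZooKeeper's default port is 2181)
/opt/modules/atlas-2.1.0-alone/hbase/conf/hbase-site.xml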

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Change the bundled ZooKeeper's port to 2182 -->
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2182</value>
  </property>
  <!-- The settings below are created automatically the first time Atlas runs -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///opt/modules/atlas-2.1.0-alone/data/hbase-root</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/modules/atlas-2.1.0-alone/data/hbase-zookeeper-data</value>
  </property>
  <property>
    <name>hbase.master.info.port</name>
    <value>61510</value>
  </property>
  <property>
    <name>hbase.regionserver.info.port</name>
    <value>61530</value>
  </property>
  <property>
    <name>hbase.master.port</name>
    <value>61500</value>
  </property>
  <property>
    <name>hbase.regionserver.port</name>
    <value>61520</value>
  </property>
</configuration>
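
To confirm the port change was saved, the value can be read back from the file. This is a rough sketch only: `get_xml_property` is a hypothetical helper, and it assumes `<name>` and `<value>` each sit on their own line, as they do in the file above. It is not a general XML parser.

```shell
# Print the <value> that follows <name>$2</name> in the Hadoop-style XML file $1.
# Assumes <name> and <value> are on separate lines (true for the file above).
get_xml_property() {
  awk -v key="$2" '
    index($0, "<name>" key "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Example (not run here):
# get_xml_property /opt/modules/atlas-2.1.0-alone/hbase/conf/hbase-site.xml \
#   hbase.zookeeper.property.clientPort
```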

Update the matching ZooKeeper settings in Atlas; the complete configuration follows
/opt/modules/atlas-2.1.0-alone/conf/atlas-application.properties

atlas.graph.storage.backend=hbase2
atlas.graph.storage.hbase.table=apache_atlas_janus
# hostname must be localhost; otherwise atlas_start.py will not start HBase,
## which means no local ZooKeeper, so Solr and the embedded Kafka cannot start either
atlas.graph.storage.hostname=localhost
atlas.graph.storage.hbase.regions-per-server=1
atlas.graph.storage.lock.wait-time=10000
atlas.EntityAuditRepository.impl=org.apache.atlas.repository.audit.HBaseBasedAuditRepository
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
# solr.zookeeper-url changed to localhost:2182
atlas.graph.index.search.solr.zookeeper-url=localhost:2182
atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
atlas.graph.index.search.solr.zookeeper-session-timeout=60000
atlas.graph.index.search.solr.wait-searcher=true
atlas.graph.index.search.max-result-set-size=150
atlas.notification.embedded=true
atlas.kafka.data=${sys:atlas.home}/data/kafka
# The Kafka address can be set to the hostname so that Hive hooks on other nodes can reach it
atlas.kafka.zookeeper.connect=hadoop112:9026
atlas.kafka.bootstrap.servers=hadoop112:9027
atlas.kafka.zookeeper.session.timeout.ms=400
atlas.kafka.zookeeper.connection.timeout.ms=200
atlas.kafka.zookeeper.sync.time.ms=20
atlas.kafka.auto.commit.interval.ms=1000
atlas.kafka.hook.group.id=atlas
atlas.kafka.enable.auto.commit=false
atlas.kafka.auto.offset.reset=earliest
atlas.kafka.session.timeout.ms=30000
atlas.kafka.offsets.topic.replication.factor=1
atlas.kafka.poll.timeout.ms=1000
atlas.notification.create.topics=true
atlas.notification.replicas=1
atlas.notification.topics=ATLAS_HOOK,ATLAS_ENTITIES
atlas.notification.log.failed.messages=true
atlas.notification.consumer.retry.interval=500
atlas.notification.hook.retry.interval=1000
atlas.enableTLS=false
atlas.authentication.method.kerberos=false
atlas.authentication.method.file=true
atlas.authentication.method.ldap.type=none
atlas.authentication.method.file.filename=${sys:atlas.home}/conf/users-credentials.properties
# The Atlas REST address can be set to the hostname so that other nodes can call it
atlas.rest.address=http://hadoop112:21000
atlas.audit.hbase.tablename=apache_atlas_entity_audit
atlas.audit.zookeeper.session.timeout.ms=1000
# hbase.zookeeper.quorum changed to localhost:2182
atlas.audit.hbase.zookeeper.quorum=localhost:2182
atlas.server.ha.enabled=false
atlas.authorizer.impl=simple
atlas.authorizer.simple.authz.policy.file=atlas-simple-authz-policy.json
atlas.rest-csrf.enabled=true
atlas.rest-csrf.browser-useragents-regex=^Mozilla.*,^Opera.*,^Chrome.*
atlas.rest-csrf.methods-to-ignore=GET,OPTIONS,HEAD,TRACE
atlas.rest-csrf.custom-header=X-XSRF-HEADER
atlas.metric.query.cache.ttlInSecs=900
atlas.search.gremlin.enable=false
atlas.ui.default.version=v1
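
Two of the settings above have to name the same ZooKeeper address (`solr.zookeeper-url` and `audit.hbase.zookeeper.quorum`), and it is easy to change one and miss the other. The sketch below reads both back and compares them. `get_property` and `zk_settings_agree` are hypothetical helpers written for this illustration. They assume simple `key=value` lines with no line continuations.

```shell
# Print the value of key $2 from the Java-style properties file $1.
# Comment lines are skipped; if the key repeats, the last value wins.
get_property() {
  awk -F= -v key="$2" '
    /^[ \t]*#/ { next }
    $1 == key { sub(/^[^=]*=/, ""); value = $0 }
    END { if (value != "") print value }
  ' "$1"
}

# Succeed only if the Solr and audit settings name the same ZooKeeper address.
zk_settings_agree() {
  solr_zk=$(get_property "$1" atlas.graph.index.search.solr.zookeeper-url)
  audit_zk=$(get_property "$1" atlas.audit.hbase.zookeeper.quorum)
  [ -n "$solr_zk" ] && [ "$solr_zk" = "$audit_zk" ]
}

# Example (not run here):
# zk_settings_agree /opt/modules/atlas-2.1.0-alone/conf/atlas-application.properties \
#   || echo "ZooKeeper addresses differ"
```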

Atlas cluster

Omitted for now

Importing Hive tables

Hive must be running the MetaStore service
Check the Hive configuration and make sure conf/hive-site.xml contains the MetaStore setting

  <!-- metastore uris -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop114:9083</value>
    <!--
    <value/>
    -->
  </property>

Start the MetaStore service

cd /opt/modules/hive-3.1.2
# Start MetaStore in the background
nohup bin/hive --service metastore &> log/metastore.log &

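Note: if the metadata is stored in MySQL rather than the embedded Derby database, configuring and starting MetaStore is optional.
Atlas will connect to MySQL directly with the username and password from the Hive configuration file, but going through MetaStore is the safest option.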
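
Before running the import, it can help to confirm that the MetaStore port from `hive.metastore.uris` is reachable. The sketch below only splits the URI into host and port. `split_thrift_uri` is a hypothetical helper, and the commented probe assumes a netcat build that supports `-z`.

```shell
# Split a thrift://host:port URI into "host port" (both printed on one line).
split_thrift_uri() {
  hostport=${1#thrift://}
  printf '%s %s\n' "${hostport%%:*}" "${hostport##*:}"
}

# Example (not run here): probe the port with netcat, if it is installed.
# set -- $(split_thrift_uri thrift://hadoop114:9083)
# nc -z "$1" "$2" && echo "MetaStore is listening on $1:$2"
```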

Unpack the Hive hook

# Unpack and rename
tar -xf apache-atlas-2.1.0-hive-hook.tar.gz -C /opt/modules/
cd /opt/modules/ && mv apache-atlas-hive-hook-2.1.0/ atlas-hive-hook-2.1.0/

Configure Hive with the hook
Add the following to conf/hive-site.xml (reference: https://atlas.apache.org/#/HookHive)

  <!-- atlas hive hook -->
  <property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
  </property>

Add the Hive hook's jar dependencies in conf/hive-env.sh

export HADOOP_HOME=/opt/modules/hadoop-3.1.3
# Add TEZ_HOME and the TEZ_JARS path
export TEZ_HOME=/opt/modules/tez-0.10
export TEZ_CONF_DIR=$TEZ_HOME/conf
export TEZ_JARS=$(find $TEZ_HOME -name '*.jar' | xargs echo | tr ' ' ':')
export HIVE_AUX_JARS_PATH=$HADOOP_HOME/share/hadoop/common/hadoop-lzo-0.4.20.jar:$TEZ_JARS
# Append the Atlas Hive hook's dependency jars to HIVE_AUX_JARS_PATH
export ATLAS_HIVE_HOOK_HOME=/opt/modules/atlas-hive-hook-2.1.0/hook/hive
export ATLAS_HIVE_HOOK_JARS=$(find $ATLAS_HIVE_HOOK_HOME -name '*.jar' | xargs echo | tr ' ' ':')
export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH:$ATLAS_HIVE_HOOK_JARS
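
The `find | xargs echo | tr ' ' ':'` idiom works for the paths used here, but it turns any space inside a path into a colon, and with a very long jar list `xargs` may run `echo` more than once and leave a newline in the result. A slightly sturdier variant, offered as a sketch with a hypothetical helper name `join_jars`, joins the lines with `paste` instead:

```shell
# Print every *.jar under directory $1 as one colon-separated string.
# Sorting makes the resulting order stable between runs.
join_jars() {
  find "$1" -type f -name '*.jar' | sort | paste -sd: -
}

# Example (not run here):
# export ATLAS_HIVE_HOOK_JARS=$(join_jars /opt/modules/atlas-hive-hook-2.1.0/hook/hive)
```

Paths that contain newlines are still not handled; that case is assumed not to occur in an install directory.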

Copy atlas-application.properties to /opt/modules/hive-3.1.2/conf

# Fetch the Atlas application configuration file
cd /opt/modules/hive-3.1.2/conf &&\
  sftp hadoop112 <<< "get /opt/modules/atlas-2.1.0-alone/conf/atlas-application.properties"

Run the Atlas program that imports Hive metadata

# Simple one-off command: pass HIVE_HOME inline
cd /opt/modules/atlas-hive-hook-2.1.0 &&\
  HIVE_HOME=/opt/modules/hive-3.1.2 hook-bin/import-hive.sh
# Or write HIVE_HOME into hook-bin/import-hive.sh
## so that the script can be called directly next time
cd /opt/modules/atlas-hive-hook-2.1.0 && hook-bin/import-hive.sh
## Hive Meta Data imported successfully!!!
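
One way to check that the import produced entities is to query the Atlas v2 basic-search endpoint for the `hive_table` type. The sketch below only builds the URL. `atlas_search_url` is a hypothetical helper, and the commented `curl` call assumes the default admin/admin account, which may differ on your installation.

```shell
# Build the basic-search URL for entities of one type.
# $1 = Atlas base URL, $2 = type name, $3 = maximum number of results
atlas_search_url() {
  printf '%s/api/atlas/v2/search/basic?typeName=%s&limit=%s\n' "${1%/}" "$2" "$3"
}

# Example (not run here):
# curl -s -u admin:admin "$(atlas_search_url http://hadoop112:21000 hive_table 10)"
```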