Running TensorFlow-based machine learning on Apache Hadoop YARN

An introduction to the stages involved in developing and applying machine learning, and the method and workflow for running TensorFlow-based machine learning on Apache Hadoop YARN.


Developing and applying machine learning involves many stages, chiefly: data ingestion, data preparation, data storage/loading, model training, model deployment, and model serving. Data produced and collected in production and non-production environments is first cleaned, processed, and prepared after entering the ML system, then stored in different places, such as a distributed file system (HDFS) or a local file system. At training time the data is loaded into the model; loading consumes large amounts of memory, and training consumes large amounts of CPU/GPU. As data volume and model complexity grow, the hardware of a single node/host becomes a bottleneck, and a good solution is to run the training directly across multiple machines or on a distributed cluster. TensorFlow 2.x supports distributed training, but it does not integrate with a distributed cluster manager (Hadoop YARN) on its own; the TonY component fills that gap. TonY's responsibilities are:

  • Obtain the resources the training job needs, such as machine nodes, memory, and CPU/GPU, from the distributed cluster (Hadoop YARN);
  • Provision and initialize the environment the TensorFlow job needs.

The following walks through a case of running TensorFlow-based machine learning on Hadoop YARN. The case and its data come from the TensorFlow movie review text classification tutorial (reference 1), which classifies movie review texts as positive or negative.

1 Data preparation

Download the case data, unpack it to obtain the training and test sets, and upload them to HDFS:

-rw-r-----   3 hive hdfs    1641221 2021-11-16 15:57 hdfs://testcs/tmp/tensorflow/imdb/imdb_word_index.json
-rw-r-----   3 hive hdfs   13865830 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/x_test.npy
-rw-r-----   3 hive hdfs   14394714 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/x_train.npy
-rw-r-----   3 hive hdfs     200080 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/y_test.npy
-rw-r-----   3 hive hdfs     200080 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/y_train.npy

2 Environment preparation

Only a basic Python runtime is installed on the cluster hosts, while a real machine learning project may need a variety of third-party Python packages. To avoid conflicts between projects, we use virtualenv to create isolated environments.

tar -xvf virtualenv-16.0.0.tar.gz             # unpack
python virtualenv-16.0.0/virtualenv.py venv   # create the virtual environment
. venv/bin/activate                           # activate it
pip install tensorflow                        # install packages

After installation, package the virtual environment so it can be shipped to the YARN cluster at run time.

zip -r venv.zip venv

3 Creating the project

Create the machine learning project and put the code under the src directory. Because the training and test data live on HDFS, they cannot be read like ordinary local files; the code reads them through tf.io.gfile, which relies on the Hadoop client libraries for hdfs:// support:

import numpy as np
import tensorflow as tf

# tf.io.gfile understands hdfs:// paths; np.load accepts any binary file-like object.
with tf.io.gfile.GFile('hdfs://testcs/tmp/tensorflow/imdb/x_train.npy', mode='rb') as r:
    x_train = np.load(r, allow_pickle=True)

with tf.io.gfile.GFile('hdfs://testcs/tmp/tensorflow/imdb/y_train.npy', mode='rb') as r:
    labels_train = np.load(r, allow_pickle=True)
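The pattern above works because np.load accepts any binary file-like object, so a GFile handle over an hdfs:// path behaves just like a local file. A minimal local sketch of the same pattern, using io.BytesIO as a stand-in for the HDFS stream:

```python
import io
import numpy as np

# Serialize a small array into an in-memory buffer; this stands in for an
# .npy file stored on HDFS.
buf = io.BytesIO()
np.save(buf, np.array([1, 2, 3]))
buf.seek(0)

# Reading it back mirrors the GFile code above: np.load only needs a
# binary file-like object, not a filesystem path.
arr = np.load(buf, allow_pickle=True)
print(arr.tolist())  # [1, 2, 3]
```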

Finally, create the TonY configuration file, which declares how many resources the job needs, for example two worker instances with 4 GB of memory each:

vi tony-mypro01.xml

<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>2</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
</configuration>
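For the two worker instances configured above, TonY is expected to publish the cluster layout to each worker through the standard TF_CONFIG environment variable, which TensorFlow's distribution strategies read to coordinate multi-worker training. A sketch of inspecting it; the hostnames, ports, and exact JSON values below are hypothetical placeholders, not output from a real job:

```python
import json
import os

# Illustrative TF_CONFIG for worker 0 of a two-worker job
# (host names and ports are made up for this sketch).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:2222", "host2:2222"]},
    "task": {"type": "worker", "index": 0},
})

# Each worker can recover the cluster size and its own role from TF_CONFIG.
tf_config = json.loads(os.environ["TF_CONFIG"])
num_workers = len(tf_config["cluster"]["worker"])
print(num_workers)  # 2
```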

The finished project has the following structure:

/
├── MyPro01
│   ├── src
│   │   └── models
│   │       └── movie_comm02.py
│   └── tony-mypro01.xml
├── tony-cli-0.4.10-uber.jar
└── venv.zip

4 Submitting and running

java -cp `hadoop classpath`:tony-cli-0.4.10-uber.jar:MyPro01/*:MyPro01 com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=venv.zip \
--src_dir=MyPro01/src \
--executes=models/movie_comm02.py \
--task_params="--data_dir hdfs://testcs/tmp/tensorflow/input --output_dir hdfs://testcs/tmp/tensorflow/output " \
--conf_file=MyPro01/tony-mypro01.xml \
--python_binary_path=venv/bin/python
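The string passed via --task_params is handed to movie_comm02.py as ordinary command-line arguments. A minimal sketch of how the script might parse them; the argument names come from the submit command above, but the parsing code itself is an assumption, not the actual contents of movie_comm02.py:

```python
import argparse

# Declare the two arguments that appear in --task_params above.
parser = argparse.ArgumentParser()
parser.add_argument("--data_dir")
parser.add_argument("--output_dir")

# Simulate the argument list TonY forwards to each worker.
args = parser.parse_args([
    "--data_dir", "hdfs://testcs/tmp/tensorflow/input",
    "--output_dir", "hdfs://testcs/tmp/tensorflow/output",
])
print(args.data_dir)  # hdfs://testcs/tmp/tensorflow/input
```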

The parameters are described in reference 2. The client submits the job to the YARN ResourceManager, requests the required resources, and creates the job.
The job consists of one ApplicationMaster and two workers; after TonY initializes the TensorFlow environment, training runs on the two workers:

21/11/20 17:56:30 INFO tony.TonyClient: Starting client..
21/11/20 17:56:30 INFO client.RMProxy: Connecting to ResourceManager at 192.168.1.13/192.168.1.10:8050
21/11/20 17:56:30 INFO client.AHSProxy: Connecting to Application History server at 192.168.1.12/192.168.1.11:10200
21/11/20 17:56:30 INFO conf.Configuration: found resource resource-types.xml at file:/opt/hadoop/etc/hadoop/resource-types.xml
21/11/20 17:56:33 INFO tony.TonyClient: Running with secure cluster mode. Fetching delegation tokens..
21/11/20 17:56:33 INFO tony.TonyClient: Fetching RM delegation token..
21/11/20 17:56:33 INFO tony.TonyClient: RM delegation token fetched.
21/11/20 17:56:33 INFO tony.TonyClient: Fetching HDFS delegation tokens for default, history and other namenodes...
21/11/20 17:56:33 INFO hdfs.DFSClient: Created token for hive: HDFS_DELEGATION_TOKEN owner=hive/testcs@testcsKDC, renewer=yarn, realUser=, issueDate=1637402193823, maxDate=1638006993823, sequenceNumber=9827, masterKeyId=635 on ha-hdfs:testcs
21/11/20 17:56:33 INFO security.TokenCache: Got delegation token for hdfs://testcs; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:testcs, Ident: (token for hive: HDFS_DELEGATION_TOKEN owner=hive/testcs@testcsKDC, renewer=yarn, realUser=, issueDate=1637402193823, maxDate=1638006993823, sequenceNumber=9827, masterKeyId=635)
21/11/20 17:56:33 INFO tony.TonyClient: Fetched HDFS delegation token.
21/11/20 17:56:33 INFO tony.TonyClient: Successfully fetched tokens.
21/11/20 17:56:34 INFO tony.TonyClient: Completed setting up Application Master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.ApplicationMaster 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
21/11/20 17:56:34 INFO tony.TonyClient: Submitting YARN application
21/11/20 17:56:34 INFO impl.TimelineClientImpl: Timeline service address: null
21/11/20 17:56:34 INFO impl.YarnClientImpl: Submitted application application_1623855961871_1372
21/11/20 17:56:34 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://192.168.1.13:8088/proxy/application_1623855961871_1372/
21/11/20 17:56:34 INFO tony.TonyClient: ResourceManager web address for application: http://192.168.1.13:8088/cluster/app/application_1623855961871_1372
21/11/20 17:56:41 INFO tony.TonyClient: Driver (application master) log url: http://192.168.1.12:8042/node/containerlogs/container_e75_1623855961871_1372_01_000001/hive
21/11/20 17:56:41 INFO tony.TonyClient: AM host: 192.168.1.12
21/11/20 17:56:41 INFO tony.TonyClient: AM RPC port: 42458
21/11/20 17:56:41 WARN ipc.Client: Exception encountered while connecting to the server : java.io.IOException: Connection reset by peer
21/11/20 17:56:43 INFO tony.TonyClient: ------  Application (re)starts, status of ALL tasks ------
21/11/20 17:56:43 INFO tony.TonyClient: RUNNING, worker, 0, http://192.168.1.10:8042/node/containerlogs/container_e75_1623855961871_1372_01_000002/hive
21/11/20 17:56:43 INFO tony.TonyClient: RUNNING, worker, 1, http://192.168.1.10:8042/node/containerlogs/container_e75_1623855961871_1372_01_000003/hive
21/11/20 17:58:04 INFO tony.TonyClient: ------ Task Status Updated ------
21/11/20 17:58:04 INFO tony.TonyClient: SUCCEEDED: worker [1] 
21/11/20 17:58:04 INFO tony.TonyClient: RUNNING: worker [0] 
21/11/20 17:58:07 INFO tony.TonyClient: ------ Task Status Updated ------
21/11/20 17:58:07 INFO tony.TonyClient: SUCCEEDED: worker [0, 1] 
21/11/20 17:58:07 INFO tony.TonyClient: -----  Application finished, status of ALL tasks -----
21/11/20 17:58:07 INFO tony.TonyClient: SUCCEEDED, worker, 0, http://192.168.1.10:8042/node/containerlogs/container_e75_1623855961871_1372_01_000002/hive
21/11/20 17:58:07 INFO tony.TonyClient: SUCCEEDED, worker, 1, http://192.168.1.10:8042/node/containerlogs/container_e75_1623855961871_1372_01_000003/hive
21/11/20 17:58:07 INFO tony.TonyClient: Application 1372 finished with YarnState=FINISHED, DSFinalStatus=SUCCEEDED, breaking monitoring loop.
21/11/20 17:58:07 INFO tony.TonyClient: Link for application_1623855961871_1372's events/metrics: https://localhost:19886/jobs/application_1623855961871_1372
21/11/20 17:58:07 INFO tony.TonyClient: Sending message to AM to stop.
21/11/20 17:58:07 INFO tony.TonyClient: Application completed successfully
21/11/20 17:58:07 INFO impl.YarnClientImpl: Killed application application_1623855961871_1372

The information produced during training can be found in the worker logs.

5 Summary

To run TensorFlow-based machine learning on Hadoop YARN, first deploy a basic Python runtime on the YARN nodes. When developing the ML project, create an isolated Python environment with virtualenv, install the packages the project needs, and package the environment. At submission time the packaged environment is shipped to the YARN cluster together with the ML code, and the TonY component takes over: it requests resources from the YARN ResourceManager, initializes the environment, and runs the training.

References

  • 1 https://tensorflow.google.cn/tutorials/keras/text_classification - Movie review text classification
  • 2 https://github.com/tony-framework/TonY - TonY
