Running TensorFlow-based machine learning on Apache Hadoop YARN

An introduction to the stages involved in developing and applying machine learning, and the method and workflow for running TensorFlow-based machine learning on Apache Hadoop YARN.


Developing and applying machine learning involves many stages, chiefly: data ingestion, data preparation, data storage/loading, model training, model deployment, and model serving. Data produced and collected in production and non-production environments is first cleaned, processed, and prepared after entering the ML system, then stored in different places, such as a distributed file system (HDFS) or a local file system. At training time the data is loaded into the model; loading consumes large amounts of memory, and training consumes large amounts of CPU/GPU. As data volume and model complexity grow, the hardware of a single node/host becomes a bottleneck, and a good solution is to run the training directly across multiple machines or on a distributed cluster. TensorFlow 2.x supports distributed training, but it does not integrate with a distributed cluster manager (Hadoop YARN) on its own; the TonY component fills that gap. TonY's responsibilities are:

  • Obtain the resources the training job needs, such as machine nodes, memory, and CPU/GPU, from the distributed cluster (Hadoop YARN);
  • Provision and initialize the environment the TensorFlow job needs.

The following walks through a case of running TensorFlow-based machine learning on Hadoop YARN. The case and its data come from the TensorFlow movie review text classification tutorial (reference 1), which classifies movie review texts as positive or negative.

1 Data preparation

Download the case data, unpack it to obtain the training and test sets, and upload them to HDFS:

-rw-r-----   3 hive hdfs    1641221 2021-11-16 15:57 hdfs://testcs/tmp/tensorflow/imdb/imdb_word_index.json
-rw-r-----   3 hive hdfs   13865830 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/x_test.npy
-rw-r-----   3 hive hdfs   14394714 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/x_train.npy
-rw-r-----   3 hive hdfs     200080 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/y_test.npy
-rw-r-----   3 hive hdfs     200080 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/y_train.npy

2 Environment preparation

Only a basic Python runtime is installed on the cluster hosts, while a real machine learning project may need a variety of third-party Python packages. To avoid conflicts between projects, we use virtualenv to create isolated environments.

tar -xvf virtualenv-16.0.0.tar.gz             # unpack
python virtualenv-16.0.0/virtualenv.py venv   # create the virtual environment
. venv/bin/activate                           # activate it
pip install tensorflow                        # install packages

After installation, package the virtual environment so it can be shipped to the YARN cluster at run time.

zip -r venv.zip venv

3 Creating the project

Create the machine learning project and put the code under the src directory. Because the training and test data live on HDFS, they cannot be read like ordinary local files; the code reads them through tf.io.gfile, which relies on the Hadoop client libraries for hdfs:// support:

import numpy as np
import tensorflow as tf

# tf.io.gfile understands hdfs:// paths; np.load accepts any binary file-like object.
with tf.io.gfile.GFile('hdfs://testcs/tmp/tensorflow/imdb/x_train.npy', mode='rb') as r:
    x_train = np.load(r, allow_pickle=True)

with tf.io.gfile.GFile('hdfs://testcs/tmp/tensorflow/imdb/y_train.npy', mode='rb') as r:
    labels_train = np.load(r, allow_pickle=True)
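The pattern above works because np.load accepts any binary file-like object, so a GFile handle over an hdfs:// path behaves just like a local file. A minimal local sketch of the same pattern, using io.BytesIO as a stand-in for the HDFS stream:

```python
import io
import numpy as np

# Serialize a small array into an in-memory buffer; this stands in for an
# .npy file stored on HDFS.
buf = io.BytesIO()
np.save(buf, np.array([1, 2, 3]))
buf.seek(0)

# Reading it back mirrors the GFile code above: np.load only needs a
# binary file-like object, not a filesystem path.
arr = np.load(buf, allow_pickle=True)
print(arr.tolist())  # [1, 2, 3]
```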

Finally, create the TonY configuration file, which declares how many resources the job needs, for example two worker instances with 4 GB of memory each:

vi tony-mypro01.xml

<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>2</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
</configuration>
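For the two worker instances configured above, TonY is expected to publish the cluster layout to each worker through the standard TF_CONFIG environment variable, which TensorFlow's distribution strategies read to coordinate multi-worker training. A sketch of inspecting it; the hostnames, ports, and exact JSON values below are hypothetical placeholders, not output from a real job:

```python
import json
import os

# Illustrative TF_CONFIG for worker 0 of a two-worker job
# (host names and ports are made up for this sketch).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:2222", "host2:2222"]},
    "task": {"type": "worker", "index": 0},
})

# Each worker can recover the cluster size and its own role from TF_CONFIG.
tf_config = json.loads(os.environ["TF_CONFIG"])
num_workers = len(tf_config["cluster"]["worker"])
print(num_workers)  # 2
```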

The finished project has the following structure:

/
├── MyPro01
│   ├── src
│   │   └── models
│   │       └── movie_comm02.py
│   └── tony-mypro01.xml
├── tony-cli-0.4.10-uber.jar
└── venv.zip

4 Submitting and running

java -cp `hadoop classpath`:tony-cli-0.4.10-uber.jar:MyPro01/*:MyPro01 com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=venv.zip \
--src_dir=MyPro01/src \
--executes=models/movie_comm02.py \
--task_params="--data_dir hdfs://testcs/tmp/tensorflow/input --output_dir hdfs://testcs/tmp/tensorflow/output " \
--conf_file=MyPro01/tony-mypro01.xml \
--python_binary_path=venv/bin/python
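The string passed via --task_params is handed to movie_comm02.py as ordinary command-line arguments. A minimal sketch of how the script might parse them; the argument names come from the submit command above, but the parsing code itself is an assumption, not the actual contents of movie_comm02.py:

```python
import argparse

# Declare the two arguments that appear in --task_params above.
parser = argparse.ArgumentParser()
parser.add_argument("--data_dir")
parser.add_argument("--output_dir")

# Simulate the argument list TonY forwards to each worker.
args = parser.parse_args([
    "--data_dir", "hdfs://testcs/tmp/tensorflow/input",
    "--output_dir", "hdfs://testcs/tmp/tensorflow/output",
])
print(args.data_dir)  # hdfs://testcs/tmp/tensorflow/input
```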

The parameters are described in reference 2. The client submits the job to the YARN ResourceManager, requests the required resources, and creates the job.
The job consists of one ApplicationMaster and two workers; after TonY initializes the TensorFlow environment, training runs on the two workers:

21/11/20 17:56:30 INFO tony.TonyClient: Starting client..
21/11/20 17:56:30 INFO client.RMProxy: Connecting to ResourceManager at 192.168.1.13/192.168.1.10:8050
21/11/20 17:56:30 INFO client.AHSProxy: Connecting to Application History server at 192.168.1.12/192.168.1.11:10200
21/11/20 17:56:30 INFO conf.Configuration: found resource resource-types.xml at file:/opt/hadoop/etc/hadoop/resource-types.xml
21/11/20 17:56:33 INFO tony.TonyClient: Running with secure cluster mode. Fetching delegation tokens..
21/11/20 17:56:33 INFO tony.TonyClient: Fetching RM delegation token..
21/11/20 17:56:33 INFO tony.TonyClient: RM delegation token fetched.
21/11/20 17:56:33 INFO tony.TonyClient: Fetching HDFS delegation tokens for default, history and other namenodes...
21/11/20 17:56:33 INFO hdfs.DFSClient: Created token for hive: HDFS_DELEGATION_TOKEN owner=hive/testcs@testcsKDC, renewer=yarn, realUser=, issueDate=1637402193823, maxDate=1638006993823, sequenceNumber=9827, masterKeyId=635 on ha-hdfs:testcs
21/11/20 17:56:33 INFO security.TokenCache: Got delegation token for hdfs://testcs; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:testcs, Ident: (token for hive: HDFS_DELEGATION_TOKEN owner=hive/testcs@testcsKDC, renewer=yarn, realUser=, issueDate=1637402193823, maxDate=1638006993823, sequenceNumber=9827, masterKeyId=635)
21/11/20 17:56:33 INFO tony.TonyClient: Fetched HDFS delegation token.
21/11/20 17:56:33 INFO tony.TonyClient: Successfully fetched tokens.
21/11/20 17:56:34 INFO tony.TonyClient: Completed setting up Application Master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.ApplicationMaster 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
21/11/20 17:56:34 INFO tony.TonyClient: Submitting YARN application
21/11/20 17:56:34 INFO impl.TimelineClientImpl: Timeline service address: null
21/11/20 17:56:34 INFO impl.YarnClientImpl: Submitted application application_1623855961871_1372
21/11/20 17:56:34 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://192.168.1.13:8088/proxy/application_1623855961871_1372/
21/11/20 17:56:34 INFO tony.TonyClient: ResourceManager web address for application: http://192.168.1.13:8088/cluster/app/application_1623855961871_1372
21/11/20 17:56:41 INFO tony.TonyClient: Driver (application master) log url: http://192.168.1.12:8042/node/containerlogs/container_e75_1623855961871_1372_01_000001/hive
21/11/20 17:56:41 INFO tony.TonyClient: AM host: 192.168.1.12
21/11/20 17:56:41 INFO tony.TonyClient: AM RPC port: 42458
21/11/20 17:56:41 WARN ipc.Client: Exception encountered while connecting to the server : java.io.IOException: Connection reset by peer
21/11/20 17:56:43 INFO tony.TonyClient: ------  Application (re)starts, status of ALL tasks ------
21/11/20 17:56:43 INFO tony.TonyClient: RUNNING, worker, 0, http://192.168.1.10:8042/node/containerlogs/container_e75_1623855961871_1372_01_000002/hive
21/11/20 17:56:43 INFO tony.TonyClient: RUNNING, worker, 1, http://192.168.1.10:8042/node/containerlogs/container_e75_1623855961871_1372_01_000003/hive
21/11/20 17:58:04 INFO tony.TonyClient: ------ Task Status Updated ------
21/11/20 17:58:04 INFO tony.TonyClient: SUCCEEDED: worker [1] 
21/11/20 17:58:04 INFO tony.TonyClient: RUNNING: worker [0] 
21/11/20 17:58:07 INFO tony.TonyClient: ------ Task Status Updated ------
21/11/20 17:58:07 INFO tony.TonyClient: SUCCEEDED: worker [0, 1] 
21/11/20 17:58:07 INFO tony.TonyClient: -----  Application finished, status of ALL tasks -----
21/11/20 17:58:07 INFO tony.TonyClient: SUCCEEDED, worker, 0, http://192.168.1.10:8042/node/containerlogs/container_e75_1623855961871_1372_01_000002/hive
21/11/20 17:58:07 INFO tony.TonyClient: SUCCEEDED, worker, 1, http://192.168.1.10:8042/node/containerlogs/container_e75_1623855961871_1372_01_000003/hive
21/11/20 17:58:07 INFO tony.TonyClient: Application 1372 finished with YarnState=FINISHED, DSFinalStatus=SUCCEEDED, breaking monitoring loop.
21/11/20 17:58:07 INFO tony.TonyClient: Link for application_1623855961871_1372's events/metrics: https://localhost:19886/jobs/application_1623855961871_1372
21/11/20 17:58:07 INFO tony.TonyClient: Sending message to AM to stop.
21/11/20 17:58:07 INFO tony.TonyClient: Application completed successfully
21/11/20 17:58:07 INFO impl.YarnClientImpl: Killed application application_1623855961871_1372

The information produced during training can be found in the worker logs.

5 Summary

To run TensorFlow-based machine learning on Hadoop YARN, first deploy a basic Python runtime on the YARN nodes. When developing the ML project, create an isolated Python environment with virtualenv, install the packages the project needs, and package the environment. At submission time the packaged environment is shipped to the YARN cluster together with the ML code, and the TonY component takes over: it requests resources from the YARN ResourceManager, initializes the environment, and runs the training.

References

  • 1 https://tensorflow.google.cn/tutorials/keras/text_classification - Movie review text classification
  • 2 https://github.com/tony-framework/TonY - TonY
