[Python] Using the Ray library

Installation

Notes:
On Windows, Python 3.7 or later is required; see https://docs.ray.io/en/latest/ray-overview/installation.html
My Python version is 3.9.
Install directly with pip install ray;
Other libraries needed (a quick sanity check is sketched below):
pyarrow;
torch;
tensorflow;
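
After installing, the snippet below is a quick sanity check (my sketch, not from the Ray docs): it confirms that the packages import and that a local Ray instance starts.

import ray
import pyarrow
import torch
import tensorflow as tf

print(ray.__version__, pyarrow.__version__, torch.__version__, tf.__version__)

ray.init()                         # starts a local Ray instance
print(ray.cluster_resources())     # e.g. {'CPU': 8.0, ...}
ray.shutdown()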

Running the examples

Example code (torch)

import torch
import torch.nn as nn

import ray
from ray import train
from ray.air import session, Checkpoint
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig


# If using GPUs, set this to True.
use_gpu = False


input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)  # input layer -> hidden layer
        self.relu = nn.ReLU()                            # activation function
        self.layer2 = nn.Linear(layer_size, output_size) # hidden layer -> output layer

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input))) # forward pass


def train_loop_per_worker():
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model = train.torch.prepare_model(model)

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"]
            output = model(inputs)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"epoch: {epoch}, loss: {loss.item()}")

        session.report(
            {},
            checkpoint=Checkpoint.from_dict(
                dict(epoch=epoch, model=model.state_dict())
            ),
        )


train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)])
scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
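
As a follow-up, the sketch below (my addition, continuing the script above) shows one way to restore the trained weights from the returned result. It assumes the dict-based ray.air Checkpoint API used in the training loop (Checkpoint.from_dict/to_dict) and that a checkpoint was actually saved:

ckpt = result.checkpoint.to_dict()            # {"epoch": ..., "model": ...}
state_dict = ckpt["model"]

# prepare_model() wraps the model in DistributedDataParallel, so the saved keys
# may carry a "module." prefix; strip it before loading into a plain module.
state_dict = {k.replace("module.", "", 1): v for k, v in state_dict.items()}

restored = NeuralNetwork()
restored.load_state_dict(state_dict)
restored.eval()
with torch.no_grad():
    print(restored(torch.tensor([[3.0]])))    # roughly 2*3 + 1 = 7 if training converged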

Run output

2023-09-06 09:56:24,749	INFO worker.py:1621 -- Started a local Ray instance.
## Starts a local Ray instance

2023-09-06 09:56:34,561	INFO tune.py:666 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

2023-09-06 09:56:34,571	INFO tensorboardx.py:178 -- pip install "ray[tune]" to see TensorBoard files.

2023-09-06 09:56:34,572	WARNING callback.py:144 -- The TensorboardX logger cannot be instantiated because either TensorboardX or one of it's dependencies is not installed. Please make sure you have the latest version of TensorboardX installed: `pip install -U tensorboardx`

2023-09-06 09:56:34,614	INFO data_parallel_trainer.py:404 -- GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
View detailed results here: c://\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34

Training started without custom configuration.

(TrainTrainable pid=21644) GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
(TorchTrainer pid=21644) GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
(TorchTrainer pid=21644) Starting distributed worker processes: ['6896 (127.0.0.1)', '15156 (127.0.0.1)', '18360 (127.0.0.1)']
(RayTrainWorker pid=6896) Setting up process group for: env:// [rank=0, world_size=3]   ## sets up the process group
(RayTrainWorker pid=6896) Moving model to device: cuda:0   ## moves the model to the device
(SplitCoordinator pid=18160) Auto configuring locality_with_output=['675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5']
(RayTrainWorker pid=6896) Wrapping provided model in DistributedDataParallel.   ## wraps the provided model in DistributedDataParallel
(SplitCoordinator pid=18160) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(3, equal=True)]
(SplitCoordinator pid=18160) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=2000000000.0), locality_with_output=['675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=18160) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(pid=18160) Running 0:   0%|          | 0/200 [00:00<?, ?it/s]   ## progress of the run
(RayTrainWorker pid=15156) D:\ANACONDA\anaconda3\lib\site-packages\torch\nn\modules\loss.py:528: UserWarning: Using a target size (torch.Size([32])) that is different to the input size (torch.Size([32, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
(RayTrainWorker pid=15156)   return F.mse_loss(input, target, reduction=self.reduction)
(RayTrainWorker pid=15156) epoch: 0, loss: 13041.18359375   ## loss computed at epoch 0
(RayTrainWorker pid=15156) D:\ANACONDA\anaconda3\lib\site-packages\torch\nn\modules\loss.py:528: UserWarning: Using a target size (torch.Size([2])) that is different to the input size (torch.Size([2, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayTrainWorker pid=15156)   return F.mse_loss(input, target, reduction=self.reduction) [repeated 3x across cluster]
Training finished iteration 1 at 2023-09-06 09:58:54. Total running time: 2min 20s
+------------------------------+
| Training result              |
+------------------------------+
| time_this_iter_s     130.029 |
| time_total_s         130.029 |
| training_iteration         1 |
+------------------------------+
(TorchTrainer pid=21644) Could not upload checkpoint to c://\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34\TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34 even after 3 retries.Please check if the credentials expired and that the remote filesystem is supported. For large checkpoints or artifacts, consider increasing `SyncConfig(sync_timeout)` (current value: 1800 seconds).
(TorchTrainer pid=21644) Last sync command failed: Sync process failed: [WinError 32] Failed copying 'C:/Users/lucia/ray_results/TorchTrainer_2023-09-06_09-56-34/TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34/checkpoint_000000/.is_checkpoint' to '/Users/lucia/ray_results/TorchTrainer_2023-09-06_09-56-34/TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34/checkpoint_000000/.is_checkpoint'. Detail: [Windows error 32] The process cannot access the file because it is being used by another process.
## The copy failed because another program was using the file.
(TorchTrainer pid=21644) 
(TorchTrainer pid=21644) Could not upload checkpoint to c://\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34\TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34 even after 3 retries.Please check if the credentials expired and that the remote filesystem is supported. For large checkpoints or artifacts, consider increasing `SyncConfig(sync_timeout)` (current value: 1800 seconds).
2023-09-06 09:59:20,462	WARNING tune.py:1122 -- Trial Runner checkpointing failed: Sync process failed: GetFileInfo() yielded path 'C:/Users/lucia/ray_results/TorchTrainer_2023-09-06_09-56-34/TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34', which is outside base dir 'C:\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34'

Training completed after 3 iterations at 2023-09-06 09:59:20. Total running time: 2min 45s
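
Two environment variables are mentioned in the output above (RAY_AIR_NEW_OUTPUT and RAY_DEDUP_LOGS). As a small sketch (my addition), they are read when Ray starts, so set them before ray.init() or trainer.fit() is called:

import os
os.environ["RAY_AIR_NEW_OUTPUT"] = "0"   # switch back to the legacy console output
os.environ["RAY_DEDUP_LOGS"] = "0"       # disable log deduplication

import ray
ray.init()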

Explanation

1. To view the TensorBoard files, install:
pip install ray[tune]
Error: Could not install packages due to an OSError: [WinError 5] Access is denied: 'D:\ANACONDA\anaconda3\Lib\site-packages\google\~rotobuf\internal\_api_implementation.cp39-win_amd64.pyd'
Consider using the --user option or check the permissions.
Fix: use pip install --user ray[tune]
Result:

WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)
Installing collected packages: tensorboardX, pyarrow
  WARNING: The script plasma_store.exe is installed in 'C:\Users\lucia\AppData\Roaming\Python\Python39\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pyarrow-6.0.1 tensorboardX-2.6.2.2
WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)
WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)
WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)

2. Google's TensorBoard (shipped with TensorFlow) is a web server for visualizing the training of a neural network. It can display scalar values, images, text, and more, and is mainly used to record TensorFlow events such as the learning rate and the loss. Unfortunately, other deep learning frameworks lack such a tool, which is why tensorboardX appeared: the package gives researchers a simple interface for logging events from PyTorch (which can then be visualized in TensorBoard). tensorboardX currently supports logging scalars, images, audio, histograms, text, embeddings, and backpropagation paths.
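
As a minimal tensorboardX sketch (my addition, not part of the Ray example), the snippet below logs one scalar per step; the resulting event files can then be viewed with tensorboard --logdir runs:

from tensorboardX import SummaryWriter

writer = SummaryWriter("runs/demo")              # event files go to ./runs/demo
for step in range(100):
    loss = 1.0 / (step + 1)                      # dummy value standing in for a real loss
    writer.add_scalar("train/loss", loss, step)
writer.close()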

3.

Example code (tensorflow)

# -*- coding: utf-8 -*-
"""
Created on Wed Sep  6 11:50:48 2023

@author: lucia
"""
import ray
import tensorflow as tf

from ray.air import session
from ray.air.integrations.keras import ReportCheckpointCallback
from ray.train.tensorflow import TensorflowTrainer
from ray.air.config import ScalingConfig


# If using GPUs, set this to True.
use_gpu = False

a = 5
b = 10
size = 100


def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential(
        [
            tf.keras.layers.InputLayer(input_shape=()),
            # Add feature dimension, expanding (batch_size,) to (batch_size, 1).
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10),
            tf.keras.layers.Dense(1),
        ]
    )
    return model


def train_func(config: dict):
    batch_size = config.get("batch_size", 64)
    epochs = config.get("epochs", 3)

    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # Model building/compiling need to be within `strategy.scope()`.
        multi_worker_model = build_model()
        multi_worker_model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=config.get("lr", 1e-3)),
            loss=tf.keras.losses.mean_squared_error,
            metrics=[tf.keras.metrics.mean_squared_error],
        )

    dataset = session.get_dataset_shard("train")

    results = []
    for _ in range(epochs):
        tf_dataset = dataset.to_tf(
            feature_columns="x", label_columns="y", batch_size=batch_size
        )
        history = multi_worker_model.fit(
            tf_dataset, callbacks=[ReportCheckpointCallback()]
        )
        results.append(history.history)
    return results


config = {"lr": 1e-3, "batch_size": 32, "epochs": 4}

train_dataset = ray.data.from_items(
    [{"x": x / 200, "y": 2 * x / 200} for x in range(200)]
)
scaling_config = ScalingConfig(num_workers=2, use_gpu=use_gpu)
trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=config,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
print(result.metrics)

Run results

1. Error: A Message class can only inherit from Message
Fix:
In Spyder, open Tools -> Preferences -> Python interpreter and turn off the User Module Reloader (UMR):
uncheck the Enable UMR and Show reloaded modules list options,
then restart the kernel.

2.

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

Solution:
Uninstall the currently installed protobuf version:
pip uninstall protobuf
Install version 3.19.0:
pip install protobuf==3.19.0
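
As a quick check (my sketch), you can confirm the downgrade took effect in the current environment:

import google.protobuf
print(google.protobuf.__version__)   # expect 3.19.0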

3. Restart the kernel and run again; if it fails again, just rerun it.
