An Introduction to Ray

1、What is Ray

Official definition: Ray is a unified framework for scaling AI and Python applications.
My understanding: Ray is a distributed computing framework for AI and Python applications that exposes a unified interface.

2、The Problem

A single node can no longer meet performance requirements, so computation has to be scaled out to a cluster.

3、Ray's Components

  • Ray Core
  • Ray AI Runtime (AIR): Ray AI Runtime (AIR) is a scalable and unified toolkit for ML applications.
  • Ray Libraries: the libraries that AIR calls under the hood

4、Ray AI Runtime (AIR)

4.1、Goals

Ray AIR provides a unified, open, and seamless interface that simplifies machine-learning development.
Ray AIR aims to simplify the ecosystem of machine learning frameworks, platforms, and tools. It does this by leveraging Ray to provide a seamless, unified, and open experience for scalable ML.

  1. Seamless Dev to Prod: AIR reduces friction going from development to production. With Ray and AIR, the same Python code scales seamlessly from a laptop to a large cluster.
  2. Unified ML API: AIR’s unified ML API enables swapping between popular frameworks, such as XGBoost, PyTorch, and HuggingFace, with just a single class change in your code.
  3. Open and Extensible: AIR and Ray are fully open-source and can run on any cluster, cloud, or Kubernetes. Build custom components and integrations on top of scalable developer APIs.

4.2、Components

  1. Datasets
  2. Preprocessors
  3. Trainers
  4. Tuner
  5. Checkpoints
  6. Batch Predictor
  7. Deployments

5、Ray Core

Ray's foundational library. Ray Core provides a small number of core primitives (i.e., tasks, actors, objects) for building and scaling distributed applications.

5.1、Components

1. Tasks
Ray enables arbitrary functions to be executed asynchronously on separate Python workers. These asynchronous Ray functions are called “tasks”. Ray enables tasks to specify their resource requirements in terms of CPUs, GPUs, and custom resources. These resource requests are used by the cluster scheduler to distribute tasks across the cluster for parallelized execution.
2. Actors
Actors extend the Ray API from functions (tasks) to classes. An actor is essentially a stateful worker (or a service). When a new actor is instantiated, a new worker is created, and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker. Like tasks, actors support CPU, GPU, and custom resource requirements.
3. Objects
In Ray, tasks and actors create and compute on objects. We refer to these objects as remote objects because they can be stored anywhere in a Ray cluster, and we use object refs to refer to them. Remote objects are cached in Ray’s distributed shared-memory object store, and there is one object store per node in the cluster. In the cluster setting, a remote object can live on one or many nodes, independent of who holds the object ref(s).
4. Placement Groups
Placement groups allow users to atomically reserve groups of resources across multiple nodes (i.e., gang scheduling). They can be then used to schedule Ray tasks and actors packed as close as possible for locality (PACK), or spread apart (SPREAD). Placement groups are generally used for gang-scheduling actors, but also support tasks.
5. Environment Dependencies
When Ray executes tasks and actors on remote machines, their environment dependencies (e.g., Python packages, local files, environment variables) must be available for the code to run. To address this problem, you can (1) prepare your dependencies on the cluster in advance using the Ray Cluster Launcher, or (2) use Ray’s runtime environments to install them on the fly.

6、Ray Cluster

A Ray cluster consists of a single head node and any number of connected worker nodes:

6.1 Head Node

Every Ray cluster has one node which is designated as the head node of the cluster. The head node is identical to other worker nodes, except that it also runs singleton processes responsible for cluster management such as the autoscaler and the Ray driver processes which run Ray jobs. Ray may schedule tasks and actors on the head node just like any other worker node, unless configured otherwise.

6.2 Worker Node

Worker nodes do not run any head node management processes, and serve only to run user code in Ray tasks and actors. They participate in distributed scheduling, as well as the storage and distribution of Ray objects in cluster memory.

6.3 Autoscaling

The Ray autoscaler is a process that runs on the head node (or as a sidecar container in the head pod if using Kubernetes). When the resource demands of the Ray workload exceed the current capacity of the cluster, the autoscaler will try to increase the number of worker nodes. When worker nodes sit idle, the autoscaler will remove worker nodes from the cluster.

It is important to understand that the autoscaler only reacts to task and actor resource requests, and not application metrics or physical resource utilization. To learn more about autoscaling, refer to the user guides for Ray clusters on VMs and Kubernetes.

6.4 Ray Jobs

The main method for running a workload on a Ray cluster is to use Ray Jobs. Ray Jobs enable users to submit locally developed and tested applications to a remote Ray cluster. Ray Job Submission simplifies the experience of packaging, deploying, and managing a Ray application.


This article is a summary based on the official Ray documentation:
https://docs.ray.io/en/master/index.html
