An Introduction to Ray

1、What is Ray

Official definition: Ray is a unified framework for scaling AI and Python applications.
My understanding: Ray is a distributed computing framework for AI and Python applications that exposes a unified interface.

2、The Problem

A single node can no longer meet performance requirements, so computation has to be scaled out to a cluster.

3、Ray's Components

  • Ray Core
  • Ray AI Runtime (AIR): Ray AI Runtime (AIR) is a scalable and unified toolkit for ML applications.
  • Ray Libraries: the libraries that AIR calls under the hood

4、Ray AI Runtime (AIR)

4.1、Goals

Ray AIR provides a unified, open, and seamless interface that simplifies machine-learning development.
Ray AIR aims to simplify the ecosystem of machine learning frameworks, platforms, and tools. It does this by leveraging Ray to provide a seamless, unified, and open experience for scalable ML.

  1. Seamless Dev to Prod: AIR reduces friction going from development to production. With Ray and AIR, the same Python code scales seamlessly from a laptop to a large cluster.
  2. Unified ML API: AIR’s unified ML API enables swapping between popular frameworks, such as XGBoost, PyTorch, and HuggingFace, with just a single class change in your code.
  3. Open and Extensible: AIR and Ray are fully open-source and can run on any cluster, cloud, or Kubernetes. Build custom components and integrations on top of scalable developer APIs.

4.2、Components

  1. Datasets
  2. Preprocessors
  3. Trainers
  4. Tuner
  5. Checkpoints
  6. Batch Predictor
  7. Deployments

5、Ray Core

Ray's foundational library. Ray Core provides a small number of core primitives (i.e., tasks, actors, objects) for building and scaling distributed applications.

5.1、Components

1. Tasks
Ray enables arbitrary functions to be executed asynchronously on separate Python workers. These asynchronous Ray functions are called “tasks”. Ray enables tasks to specify their resource requirements in terms of CPUs, GPUs, and custom resources. These resource requests are used by the cluster scheduler to distribute tasks across the cluster for parallelized execution.
2. Actors
Actors extend the Ray API from functions (tasks) to classes. An actor is essentially a stateful worker (or a service). When a new actor is instantiated, a new worker is created, and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker. Like tasks, actors support CPU, GPU, and custom resource requirements.
3. Objects
In Ray, tasks and actors create and compute on objects. We refer to these objects as remote objects because they can be stored anywhere in a Ray cluster, and we use object refs to refer to them. Remote objects are cached in Ray’s distributed shared-memory object store, and there is one object store per node in the cluster. In the cluster setting, a remote object can live on one or many nodes, independent of who holds the object ref(s).
4. Placement Groups
Placement groups allow users to atomically reserve groups of resources across multiple nodes (i.e., gang scheduling). They can be then used to schedule Ray tasks and actors packed as close as possible for locality (PACK), or spread apart (SPREAD). Placement groups are generally used for gang-scheduling actors, but also support tasks.
5. Environment Dependencies
When Ray executes tasks and actors on remote machines, their environment dependencies (e.g., Python packages, local files, environment variables) must be available for the code to run. To address this problem, you can (1) prepare your dependencies on the cluster in advance using the Ray Cluster Launcher, or (2) use Ray’s runtime environments to install them on the fly.

6、Ray Cluster

A Ray cluster consists of a single head node and any number of connected worker nodes:

6.1 Head Node

Every Ray cluster has one node which is designated as the head node of the cluster. The head node is identical to other worker nodes, except that it also runs singleton processes responsible for cluster management such as the autoscaler and the Ray driver processes which run Ray jobs. Ray may schedule tasks and actors on the head node just like any other worker node, unless configured otherwise.

6.2 Worker Node

Worker nodes do not run any head node management processes, and serve only to run user code in Ray tasks and actors. They participate in distributed scheduling, as well as the storage and distribution of Ray objects in cluster memory.

6.3 Autoscaling

The Ray autoscaler is a process that runs on the head node (or as a sidecar container in the head pod if using Kubernetes). When the resource demands of the Ray workload exceed the current capacity of the cluster, the autoscaler will try to increase the number of worker nodes. When worker nodes sit idle, the autoscaler will remove worker nodes from the cluster.

It is important to understand that the autoscaler only reacts to task and actor resource requests, and not application metrics or physical resource utilization. To learn more about autoscaling, refer to the user guides for Ray clusters on VMs and Kubernetes.

6.4 Ray Jobs

The main method for running a workload on a Ray cluster is to use Ray Jobs. Ray Jobs enable users to submit locally developed and tested applications to a remote Ray cluster. Ray Job Submission simplifies the experience of packaging, deploying, and managing a Ray application.


This article is a summary based on the official Ray documentation:
https://docs.ray.io/en/master/index.html
