Overview 概述
Ray is an open-source unified framework for scaling AI and Python applications like machine learning. It provides the compute layer for parallel processing so that you don’t need to be a distributed systems expert. Ray minimizes the complexity of running your distributed individual and end-to-end machine learning workflows with these components:
Ray是一个开源的统一框架,用于扩展AI和Python应用程序,如机器学习。它为并行处理提供了计算层,因此您不需要成为分布式系统专家。Ray通过以下组件最大限度地降低了运行分布式个人和端到端机器学习工作流的复杂性:
- Scalable libraries for common machine learning tasks such as data preprocessing, distributed training, hyperparameter tuning, reinforcement learning, and model serving.可扩展的库,用于常见的机器学习任务,如数据预处理,分布式训练,超参数调整,强化学习和模型服务。
- Pythonic distributed computing primitives for parallelizing and scaling Python applications.用于并行化和扩展Python应用程序的Python分布式计算原语。
- Integrations and utilities for integrating and deploying a Ray cluster with existing tools and infrastructure such as Kubernetes, AWS, GCP, and Azure.集成和实用程序,用于将Ray集群与现有工具和基础设施(如Kubernetes,AWS,GCP和Azure)集成和部署。
For data scientists and machine learning practitioners, Ray lets you scale jobs without needing infrastructure expertise:
对于数据科学家和机器学习从业者来说,Ray可以让您在不需要基础设施专业知识的情况下扩展工作:
- Easily parallelize and distribute ML workloads across multiple nodes and GPUs.跨多个节点和GPU轻松并行化和分发ML工作负载。
- Leverage the ML ecosystem with native and extensible integrations.通过原生和可扩展的集成利用ML生态系统。
For ML platform builders and ML engineers, Ray:
对于ML平台构建者和ML工程师,Ray:
- Provides compute abstractions for creating a scalable and robust ML platform.提供计算抽象,用于创建可扩展且强大的ML平台。
- Provides a unified ML API that simplifies onboarding and integration with the broader ML ecosystem.提供统一的ML API,简化了与更广泛的ML生态系统的导入和集成。
- Reduces friction between development and production by enabling the same Python code to scale seamlessly from a laptop to a large cluster. 通过使相同的Python代码能够从笔记本电脑无缝扩展到大型集群,减少开发和生产之间的摩擦。
For distributed systems engineers, Ray automatically handles key processes:
对于分布式系统工程师,Ray自动处理关键流程:
- Orchestration–Managing the various components of a distributed system.分布式管理一个分布式系统的各种组件.
- Scheduling–Coordinating when and where tasks are executed.协调协调何时何地执行任务.
- Fault tolerance–Ensuring tasks complete regardless of inevitable points of failure.容错,确保任务完成,不管不可避免的故障点.
- Auto-scaling–Adjusting the number of resources allocated to dynamic demand.自动部署-调整分配给动态需求的资源数量。