Errors from the Ray distributed framework

Symptom

Running the code produces the following errors:

E0307 19:02:16.041276 227845 228570 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=6f53dca1f451ca9445b95b1c0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=45b95b1c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2024-03-07 19:02:16,043	WARNING worker.py:1134 -- A worker died or was killed while executing task ffffffffffffffff45b95b1c0100.
Traceback (most recent call last):
  File "/home/a13/XMJ/PRIMAL2/PRIMAL2-main/driver.py", line 232, in <module>
    main()
  File "/home/a13/XMJ/PRIMAL2/PRIMAL2-main/driver.py", line 173, in main
    jobResults, metrics, info = ray.get(done_id)[0]
  File "/home/a13/anaconda3/envs/PRIMAL2/lib/python3.6/site-packages/ray/worker.py", line 1540, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
(pid=228047) cannot allocate memory for thread-local data: ABORT
E0307 19:02:16.077193 227845 228570 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=cd8f5689d0aa5a39f66d17ba0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=f66d17ba0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2024-03-07 19:02:16,084	WARNING worker.py:1134 -- A worker died or was killed while executing task fffffffffffffffff66d17ba0100.
(pid=228026) cannot allocate memory for thread-local data: ABORT
E0307 19:02:16.199574 227845 228570 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=55c3b2b635949d8144ee453c0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=44ee453c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2024-03-07 19:02:16,206	WARNING worker.py:1134 -- A worker died or was killed while executing task ffffffffffffffff44ee453c0100.
(pid=228046) cannot allocate memory for thread-local data: ABORT
(pid=228037) cannot allocate memory for thread-local data: ABORT
E0307 19:02:16.219187 227845 228570 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=6170691ebdfaeef6ef0a6c220100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=ef0a6c220100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2024-03-07 19:02:16,226	WARNING worker.py:1134 -- A worker died or was killed while executing task ffffffffffffffffef0a6c220100.

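Analyzing the error messages

The first message to appear is Task failed: IOError: 14: Socket closed.
It is followed by ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

1. Task failed: IOError: 14: Socket closed

Searching for this error turns up the following links.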

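1.1. https://discuss.ray.io/t/raysgd-training-instability/1220/5

1.2. https://github.com/ray-project/ray/issues/9293

The guesses offered there are:

  1. A time limit that closes the IO socket after a certain period.
  2. A configuration problem on the machine, for example insufficient available memory.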
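
The second guess fits the "cannot allocate memory for thread-local data: ABORT" lines in the log, so it may be worth recording the machine's limits before launching the workers. The following is a minimal sketch using only the Python standard library on Linux; the function name snapshot_limits and the choice of values to report are assumptions made for illustration, not part of Ray or PRIMAL2.

```python
import os
import resource


def snapshot_limits():
    """Collect a few OS limits that can matter when many workers are spawned.

    Hypothetical helper for illustration; it only reads values and changes nothing.
    """
    soft_nproc, hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)
    soft_as, hard_as = resource.getrlimit(resource.RLIMIT_AS)

    # SC_AVPHYS_PAGES is not available on every platform, so check first.
    available_bytes = None
    if "SC_AVPHYS_PAGES" in os.sysconf_names and "SC_PAGE_SIZE" in os.sysconf_names:
        available_bytes = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")

    return {
        "cpu_count": os.cpu_count(),
        "max_processes_soft": soft_nproc,      # per-user process/thread limit
        "max_processes_hard": hard_nproc,
        "address_space_soft": soft_as,         # -1 means unlimited
        "address_space_hard": hard_as,
        "available_memory_bytes": available_bytes,
    }


limits = snapshot_limits()
for name, value in limits.items():
    print("{}: {}".format(name, value))
```

If the available memory or the process limit is small compared with the number of actors being started, that would support the memory guess, but this check alone does not prove it.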

1.3. https://github.com/ray-project/ray/issues/5820

1.4. https://github.com/marmotlab/PRIMAL2/issues/6

The line "max_time += time_limit" cannot be found in Env_Builder.py, so the solution suggested there does not apply.

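2. RayActorError

A search leads to the official documentation: https://docs.ray.io/en/latest/ray-core/api/doc/ray.exceptions.RayActorError.html
If the actor died because an exception was thrown in its creation task, the RayActorError contains creation_task_error.
The RayActorError here does not contain creation_task_error, so the actor did not die from an exception in its creation task.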
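
The check described above can be sketched as follows. The attribute name creation_task_error comes from the documentation quoted above; whether the older Ray version installed in this environment exposes it is not confirmed, so the sketch uses getattr with a default. The helper describe_actor_error and the FakeActorError class are hypothetical stand-ins so the example runs without Ray.

```python
def describe_actor_error(err):
    """Report whether an actor error carries a creation-task exception.

    Hypothetical helper: works on any exception object and falls back
    gracefully when the attribute does not exist.
    """
    creation_error = getattr(err, "creation_task_error", None)
    if creation_error is not None:
        return "actor failed during creation: {!r}".format(creation_error)
    return "actor died after creation (no creation_task_error): {}".format(err)


class FakeActorError(Exception):
    """Stand-in for ray.exceptions.RayActorError, for illustration only."""


message = describe_actor_error(FakeActorError("The actor died unexpectedly"))
print(message)
```

In the real program the same helper could be called from an except block around ray.get(done_id).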

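3. Reading the code

The traceback points to the get function in Ray's worker.py (line 1540 of that file), shown below. Within this listing, the error is raised by the raise value statement on line 63.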

def get(object_refs, timeout=None):
    """Get a remote object or a list of remote objects from the object store.

    This method blocks until the object corresponding to the object ref is
    available in the local object store. If this object is not in the local
    object store, it will be shipped from an object store that has it (once the
    object has been created). If object_refs is a list, then the objects
    corresponding to each object in the list will be returned.

    This method will issue a warning if it's running inside async context,
    you can use ``await object_ref`` instead of ``ray.get(object_ref)``. For
    a list of object refs, you can use ``await asyncio.gather(*object_refs)``.

    Args:
        object_refs: Object ref of the object to get or a list of object refs
            to get.
        timeout (Optional[float]): The maximum amount of time in seconds to
            wait before returning.

    Returns:
        A Python object or a list of Python objects.

    Raises:
        RayTimeoutError: A RayTimeoutError is raised if a timeout is set and
            the get takes longer than timeout to return.
        Exception: An exception is raised if the task that created the object
            or that created one of the objects raised an exception.
    """
    worker = global_worker
    worker.check_connected()

    if hasattr(
            worker,
            "core_worker") and worker.core_worker.current_actor_is_asyncio():
        global blocking_get_inside_async_warned
        if not blocking_get_inside_async_warned:
            logger.debug("Using blocking ray.get inside async actor. "
                         "This blocks the event loop. Please use `await` "
                         "on object ref with asyncio.gather if you want to "
                         "yield execution to the event loop instead.")
            blocking_get_inside_async_warned = True

    with profiling.profile("ray.get"):
        is_individual_id = isinstance(object_refs, ray.ObjectRef)
        if is_individual_id:
            object_refs = [object_refs]

        if not isinstance(object_refs, list):
            raise ValueError("'object_refs' must either be an object ref "
                             "or a list of object refs.")

        global last_task_error_raise_time
        # TODO(ujvl): Consider how to allow user to retrieve the ready objects.
        values = worker.get_objects(object_refs, timeout=timeout)
        for i, value in enumerate(values):
            if isinstance(value, RayError):
                last_task_error_raise_time = time.time()
                if isinstance(value, ray.exceptions.UnreconstructableError):
                    worker.core_worker.dump_object_store_memory_usage()
                if isinstance(value, RayTaskError):
                    raise value.as_instanceof_cause()
                else:
                    raise value

        # Run post processors.
        for post_processor in worker._post_get_hooks:
            values = post_processor(object_refs, values)

        if is_individual_id:
            values = values[0]
        return values
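
Function signature:

get takes two parameters:
object_refs: the object ref (ObjectRef) to fetch, or a list of object refs.
timeout: optional; the maximum time, in seconds, to wait for the object to become available.

What it does:

get fetches a remote object, or a list of remote objects, from the object store.
The call blocks until the requested object is available in the local object store.
If the object is not local, it is shipped from an object store that has it (once the object has been created).

Warning in async contexts:

If the function detects that it is being called from an async context, it issues a warning.
The official recommendation is not to use ray.get there, but to use await on an object ref, or await asyncio.gather(*object_refs) on a list of object refs, so that the event loop is not blocked.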
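
The await asyncio.gather(*object_refs) pattern can be illustrated without Ray. In the sketch below, plain coroutines stand in for object refs; fake_remote_result and gather_all are hypothetical names, and only the standard asyncio module is used.

```python
import asyncio


async def fake_remote_result(value, delay):
    """Stand-in for awaiting an object ref: finishes after a short delay."""
    await asyncio.sleep(delay)
    return value


async def gather_all():
    # Awaiting all pending results together yields to the event loop
    # instead of blocking it, which is what the docstring recommends.
    pending = [fake_remote_result(i * i, 0.01) for i in range(3)]
    return await asyncio.gather(*pending)


results = asyncio.run(gather_all())
print(results)
```

asyncio.gather returns the results in the same order as its arguments, regardless of which coroutine finishes first.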

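Error handling:

After retrieving the objects, the function checks each value for instances of RayError.
A RayTaskError is re-raised via value.as_instanceof_cause(). For an UnreconstructableError, the object store memory usage is dumped for debugging before the error is raised. Any other RayError is raised as is.

Return value:

If the original input was a single object ref, the return value is that object.
If the input was a list of object refs, the return value is the list of objects corresponding to those refs.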
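
The error-handling loop can be modelled in a few lines. This is a simplified sketch, not Ray's code: StandInRayError and StandInActorError are hypothetical stand-ins for ray.exceptions.RayError and RayActorError, and the RayTaskError and UnreconstructableError branches are left out.

```python
class StandInRayError(Exception):
    """Hypothetical stand-in for ray.exceptions.RayError."""


class StandInActorError(StandInRayError):
    """Hypothetical stand-in for ray.exceptions.RayActorError."""


def raise_first_error(values):
    """Raise the first error object found among the fetched values.

    Mirrors the idea that errors travel through the object store as
    ordinary values and are only raised on the caller's side.
    """
    for value in values:
        if isinstance(value, StandInRayError):
            raise value
    return values


try:
    raise_first_error([StandInActorError("actor died")])
except StandInActorError as err:
    caught = err

print(type(caught).__name__, caught)
```

This shows why the traceback ends at raise value: the failure happened in the worker, and get only surfaces the error object it received.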

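4. Debugging with breakpoints

When execution reaches the following statement in the get function of Ray's worker.py:
values = worker.get_objects(object_refs, timeout=timeout)
the debugger shows values = [RayActorError()].
So the error object is already present in the result of get_objects, and the next step is to look inside that function.

5. The get_objects function

The function is as follows: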

    def get_objects(self, object_refs, timeout=None):
        """Get the values in the object store associated with the IDs.

        Return the values from the local object store for object_refs. This
        will block until all the values for object_refs have been written to
        the local object store.

        Args:
            object_refs (List[object_ref.ObjectRef]): A list of the object refs
                whose values should be retrieved.
            timeout (float): timeout (float): The maximum amount of time in
                seconds to wait before returning.
        """
        # Make sure that the values are object refs.
        for object_ref in object_refs:
            if not isinstance(object_ref, ObjectRef):
                raise TypeError(
                    "Attempting to call `get` on the value {}, "
                    "which is not an ray.ObjectRef.".format(object_ref))

        timeout_ms = int(timeout * 1000) if timeout else -1
        data_metadata_pairs = self.core_worker.get_objects(
            object_refs, self.current_task_id, timeout_ms)
        return self.deserialize_objects(data_metadata_pairs, object_refs)

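get_objects returns the values in the local object store that are associated with object_refs.
It blocks until all values for object_refs have been written to the local object store, or until the timeout expires.

5.1. Parameters:
        object_refs: a list of object refs (ObjectRef instances) whose values should be retrieved.
        timeout: an optional float giving the maximum time, in seconds, to wait for all objects to be retrieved. If it is not specified, the call may wait indefinitely.
5.2. Error handling:

If any element of object_refs is not an instance of ObjectRef, a TypeError is raised.

5.3. Retrieving the objects:

The internal method self.core_worker.get_objects retrieves the objects from the store; it is passed object_refs, the current task's ID, and timeout_ms.

5.4. Deserialization:

The function returns the result of self.deserialize_objects, which takes the raw data from the object store (the result of self.core_worker.get_objects) and converts it back into Python objects the application can use.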
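
One detail in the listing is the conversion timeout_ms = int(timeout * 1000) if timeout else -1. The sketch below isolates that expression in a hypothetical helper, to_timeout_ms, to show how different inputs map. That -1 means "no time limit" for core_worker.get_objects is an assumption based on the surrounding description, not something the listing states.

```python
def to_timeout_ms(timeout=None):
    """Same expression as in get_objects, wrapped in a hypothetical helper."""
    return int(timeout * 1000) if timeout else -1


# None and 0 are both falsy, so both map to -1.
print(to_timeout_ms(None))
print(to_timeout_ms(0))
# A positive number of seconds is converted to whole milliseconds.
print(to_timeout_ms(1.5))
```

Because 0 is falsy in Python, passing timeout=0 takes the same branch as passing no timeout at all, which may be surprising if a non-blocking call was intended.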
