Symptom
Running the code produces the following error output:
E0307 19:02:16.041276 227845 228570 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=6f53dca1f451ca9445b95b1c0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=45b95b1c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2024-03-07 19:02:16,043 WARNING worker.py:1134 -- A worker died or was killed while executing task ffffffffffffffff45b95b1c0100.
Traceback (most recent call last):
File "/home/a13/XMJ/PRIMAL2/PRIMAL2-main/driver.py", line 232, in <module>
main()
File "/home/a13/XMJ/PRIMAL2/PRIMAL2-main/driver.py", line 173, in main
jobResults, metrics, info = ray.get(done_id)[0]
File "/home/a13/anaconda3/envs/PRIMAL2/lib/python3.6/site-packages/ray/worker.py", line 1540, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
(pid=228047) cannot allocate memory for thread-local data: ABORT
E0307 19:02:16.077193 227845 228570 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=cd8f5689d0aa5a39f66d17ba0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=f66d17ba0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2024-03-07 19:02:16,084 WARNING worker.py:1134 -- A worker died or was killed while executing task fffffffffffffffff66d17ba0100.
(pid=228026) cannot allocate memory for thread-local data: ABORT
E0307 19:02:16.199574 227845 228570 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=55c3b2b635949d8144ee453c0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=44ee453c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2024-03-07 19:02:16,206 WARNING worker.py:1134 -- A worker died or was killed while executing task ffffffffffffffff44ee453c0100.
(pid=228046) cannot allocate memory for thread-local data: ABORT
(pid=228037) cannot allocate memory for thread-local data: ABORT
E0307 19:02:16.219187 227845 228570 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=6170691ebdfaeef6ef0a6c220100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=ef0a6c220100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2024-03-07 19:02:16,226 WARNING worker.py:1134 -- A worker died or was killed while executing task ffffffffffffffffef0a6c220100.
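Analyzing the error output
The first message to appear is Task failed: IOError: 14: Socket closed.
It is followed by ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
1. Task failed: IOError: 14: Socket closed
Searching for this error turns up the following links:
1.1. https://discuss.ray.io/t/raysgd-training-instability/1220/5
1.2. https://github.com/ray-project/ray/issues/9293
The guesses offered there are:
- The IO socket is closed after a certain period because of a time limit.
- A configuration problem on the machine, such as insufficient available memory.
1.3. https://github.com/ray-project/ray/issues/5820
1.4. https://github.com/marmotlab/PRIMAL2/issues/6
The line "max_time += time_limit" mentioned there could not be found in Env_Builder.py, so that fix does not apply here.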
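The memory guess from 1.2 fits the "cannot allocate memory for thread-local data: ABORT" lines in the log. One way to check it before launching the workers is to look at available memory and the per-user resource limits. The sketch below is a minimal, Linux-only check using the standard library; the helper names parse_meminfo and report_limits are made up for illustration, and no particular threshold is implied.

```python
import resource


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: leading integer}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def report_limits():
    """Return (soft, hard) limits that often matter when many workers start."""
    limits = {}
    for label in ("RLIMIT_NPROC", "RLIMIT_AS", "RLIMIT_NOFILE"):
        # Not every platform defines every limit, so look it up defensively.
        res_id = getattr(resource, label, None)
        if res_id is not None:
            limits[label] = resource.getrlimit(res_id)
    return limits


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            mem = parse_meminfo(f.read())
        print("MemAvailable (kB):", mem.get("MemAvailable"))
    except FileNotFoundError:
        print("/proc/meminfo not found; this check is Linux-only")
    for label, (soft, hard) in report_limits().items():
        print(label, "soft =", soft, "hard =", hard)
```

A low MemAvailable value or a small RLIMIT_NPROC relative to the number of workers would support the memory guess; it would not prove it.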
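2. RayActorError
The official documentation is at https://docs.ray.io/en/latest/ray-core/api/doc/ray.exceptions.RayActorError.html
According to it, if the actor died because of an exception thrown in its creation task, the RayActorError contains the creation_task_error.
The RayActorError here contains no creation_task_error, so the actor did not die because of an exception thrown in its creation task.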
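3. Reading the code
The traceback points to the get function in Ray's worker.py, shown below. The exception is raised by the raise value statement at line 63 of the function (worker.py line 1540 in the traceback).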
def get(object_refs, timeout=None):
    """Get a remote object or a list of remote objects from the object store.

    This method blocks until the object corresponding to the object ref is
    available in the local object store. If this object is not in the local
    object store, it will be shipped from an object store that has it (once the
    object has been created). If object_refs is a list, then the objects
    corresponding to each object in the list will be returned.

    This method will issue a warning if it's running inside async context,
    you can use ``await object_ref`` instead of ``ray.get(object_ref)``. For
    a list of object refs, you can use ``await asyncio.gather(*object_refs)``.

    Args:
        object_refs: Object ref of the object to get or a list of object refs
            to get.
        timeout (Optional[float]): The maximum amount of time in seconds to
            wait before returning.

    Returns:
        A Python object or a list of Python objects.

    Raises:
        RayTimeoutError: A RayTimeoutError is raised if a timeout is set and
            the get takes longer than timeout to return.
        Exception: An exception is raised if the task that created the object
            or that created one of the objects raised an exception.
    """
    worker = global_worker
    worker.check_connected()

    if hasattr(
            worker,
            "core_worker") and worker.core_worker.current_actor_is_asyncio():
        global blocking_get_inside_async_warned
        if not blocking_get_inside_async_warned:
            logger.debug("Using blocking ray.get inside async actor. "
                         "This blocks the event loop. Please use `await` "
                         "on object ref with asyncio.gather if you want to "
                         "yield execution to the event loop instead.")
            blocking_get_inside_async_warned = True

    with profiling.profile("ray.get"):
        is_individual_id = isinstance(object_refs, ray.ObjectRef)
        if is_individual_id:
            object_refs = [object_refs]

        if not isinstance(object_refs, list):
            raise ValueError("'object_refs' must either be an object ref "
                             "or a list of object refs.")

        global last_task_error_raise_time
        # TODO(ujvl): Consider how to allow user to retrieve the ready objects.
        values = worker.get_objects(object_refs, timeout=timeout)
        for i, value in enumerate(values):
            if isinstance(value, RayError):
                last_task_error_raise_time = time.time()
                if isinstance(value, ray.exceptions.UnreconstructableError):
                    worker.core_worker.dump_object_store_memory_usage()
                if isinstance(value, RayTaskError):
                    raise value.as_instanceof_cause()
                else:
                    raise value

        # Run post processors.
        for post_processor in worker._post_get_hooks:
            values = post_processor(object_refs, values)

        if is_individual_id:
            values = values[0]
        return values
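Function definition:
The get function takes two parameters:
object_refs: the object ref (ObjectRef) to fetch, or a list of object refs.
timeout: optional; the maximum time, in seconds, to wait for the objects to become available.
What it does:
get fetches a remote object, or a list of remote objects, from the object store.
The call blocks until the requested object is available in the local object store.
If the object is not in the local store, it is shipped from an object store that has it (once the object has been created).
Warning in async contexts:
If the function detects that it is being called inside an asyncio actor, it logs a one-time message saying that the blocking call stalls the event loop.
The docstring recommends using await object_ref instead of ray.get(object_ref), or await asyncio.gather(*object_refs) for a list of object refs, so that the event loop is not blocked.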
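As a rough illustration of the non-blocking pattern, the sketch below uses only the standard library. Plain coroutines stand in for Ray object refs, an assumption made so the example runs without Ray, and fake_remote_task is a hypothetical name. It needs Python 3.7 or later for asyncio.run.

```python
import asyncio


async def fake_remote_task(x):
    """Stand-in for awaiting a Ray object ref; not a real Ray call."""
    await asyncio.sleep(0.01)  # yields control to the event loop
    return x * 2


async def gather_all(inputs):
    # Awaiting all tasks together leaves the event loop free to run
    # other coroutines, unlike a blocking call inside async code.
    return await asyncio.gather(*(fake_remote_task(x) for x in inputs))


results = asyncio.run(gather_all([1, 2, 3]))
print(results)  # results come back in input order
```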
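Error handling:
After the objects are retrieved, the function checks each value for an instance of RayError.
For an UnreconstructableError it first dumps the object store memory usage for debugging. A RayTaskError is then re-raised via as_instanceof_cause(); any other RayError is raised as-is by raise value. The traceback above ends at raise value, so the RayActorError took this last branch.
Return value:
If the original input was a single object ref, the return value is that object.
If the input was a list of object refs, the return value is the list of objects corresponding to those refs.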
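The control flow described above can be sketched as a toy model that runs without Ray. FakeRayError, FakeActorError, toy_get and the dict-based store are all hypothetical stand-ins; the real function also distinguishes RayTaskError and validates the argument type more strictly.

```python
class FakeRayError(Exception):
    """Stand-in for ray.exceptions.RayError (hypothetical)."""


class FakeActorError(FakeRayError):
    """Stand-in for an actor-death error (hypothetical)."""


def toy_get(refs, store):
    """Simplified model of get(): normalize, fetch, raise errors, unwrap.

    `store` is a plain dict mapping a ref to its value. A value may be an
    exception instance, mirroring how errors come back as values.
    """
    is_single = not isinstance(refs, list)
    if is_single:
        refs = [refs]
    values = [store[ref] for ref in refs]
    for value in values:
        if isinstance(value, FakeRayError):
            raise value  # the first error in the list is the one raised
    return values[0] if is_single else values


store = {"a": 1, "b": 2, "c": FakeActorError("actor died")}
print(toy_get("a", store))
print(toy_get(["a", "b"], store))
try:
    toy_get(["a", "c"], store)
except FakeActorError as err:
    print("raised:", err)
```

Note that one failed entry makes the whole call raise, even though the other values in the list were retrieved successfully.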
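4. Debugging with breakpoints
Execution was paused in the get function of Ray's worker.py at the following statement:
values = worker.get_objects(object_refs, timeout=timeout)
After it runs, the debugger shows values = [RayActorError()].
So the error is already present in the result of get_objects, and that function is the next place to look.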
5. The get_objects function
The function is as follows:
def get_objects(self, object_refs, timeout=None):
    """Get the values in the object store associated with the IDs.

    Return the values from the local object store for object_refs. This
    will block until all the values for object_refs have been written to
    the local object store.

    Args:
        object_refs (List[object_ref.ObjectRef]): A list of the object refs
            whose values should be retrieved.
        timeout (float): timeout (float): The maximum amount of time in
            seconds to wait before returning.
    """
    # Make sure that the values are object refs.
    for object_ref in object_refs:
        if not isinstance(object_ref, ObjectRef):
            raise TypeError(
                "Attempting to call `get` on the value {}, "
                "which is not an ray.ObjectRef.".format(object_ref))

    timeout_ms = int(timeout * 1000) if timeout else -1
    data_metadata_pairs = self.core_worker.get_objects(
        object_refs, self.current_task_id, timeout_ms)
    return self.deserialize_objects(data_metadata_pairs, object_refs)
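get_objects returns the values in the local object store associated with object_refs.
It blocks until all values for object_refs have been written to the local object store, or until the timeout expires.
5.1 Parameters:
object_refs: a list of object refs (ObjectRef instances) whose values should be retrieved.
timeout: an optional float giving the maximum time, in seconds, to wait for all objects. If it is not given, timeout_ms is set to -1 and the call may wait indefinitely.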
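The timeout conversion is a single expression in the listing, and its edge cases are easy to miss. The snippet below copies that expression into a standalone function (to_timeout_ms is a name chosen here for illustration) so the behavior can be checked without Ray.

```python
def to_timeout_ms(timeout=None):
    """Same expression as in get_objects: any falsy timeout maps to -1."""
    return int(timeout * 1000) if timeout else -1


print(to_timeout_ms())        # -1: no timeout given
print(to_timeout_ms(0))       # -1: 0 is falsy, so it is treated like None
print(to_timeout_ms(1.5))     # 1500
print(to_timeout_ms(0.0004))  # 0: int() truncates sub-millisecond values
```

In this version of the code, passing timeout=0 therefore does not mean "return immediately"; it takes the same path as passing no timeout at all.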
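5.2 Error handling:
If any element of object_refs is not an ObjectRef instance, a TypeError is raised.
5.3 Retrieving the objects:
The internal method self.core_worker.get_objects retrieves the objects from the store; it is passed object_refs, the current task's ID, and timeout_ms.
5.4 Deserialization:
The function returns the result of self.deserialize_objects, which takes the raw data fetched by self.core_worker.get_objects and converts it back into Python objects the application can use.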
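One consequence of this flow, consistent with the values = [RayActorError()] seen in the debugger, is that a failed task comes back as an exception object inside the returned list rather than being raised by get_objects itself. The sketch below is a toy model of that idea only: the metadata tags, FakeActorError, toy_deserialize and the use of pickle are all assumptions for illustration and do not reflect Ray's actual serialization format.

```python
import pickle


class FakeActorError(Exception):
    """Hypothetical stand-in for an actor-death error."""


ERROR_TAG = b"ACTOR_DIED"  # made-up metadata marker
OK_TAG = b"PICKLE"  # made-up metadata marker


def toy_deserialize(data_metadata_pairs):
    """Turn (data, metadata) pairs into Python values.

    Error markers become exception *instances* in the result list.
    Nothing is raised here; raising is left to the caller.
    """
    values = []
    for data, metadata in data_metadata_pairs:
        if metadata == ERROR_TAG:
            values.append(FakeActorError("actor died"))
        else:
            values.append(pickle.loads(data))
    return values


pairs = [(pickle.dumps({"reward": 1.0}), OK_TAG), (b"", ERROR_TAG)]
values = toy_deserialize(pairs)
print(values[0])
print(type(values[1]).__name__)
```

Under this model the Python-side functions only report the failure; the cause of the worker's death has to be sought in the worker process itself, for example in the "cannot allocate memory for thread-local data" lines of the log.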