I recently came across an interesting open-source project called Schlably: a Python framework based on deep reinforcement learning (DRL) for running scheduling experiments. It provides extensible gym environments and DRL agents, along with utilities for data generation, training, and testing.
1. Introduction
Production Scheduling (PS) is an important and complex problem in Operations Research (OR) and optimization. It involves allocating resources over time to complete production tasks, with the goal of minimizing metrics such as time, cost, and resource usage. PS has received broad attention as an application area for artificial intelligence, particularly through Deep Reinforcement Learning (DRL) techniques.
Despite the large body of experiments and studies in this field, experimental setups and solution approaches often differ only in minor details, so researchers end up repeatedly re-implementing very similar code. This significantly raises the initial cost and difficulty of research. To address this, the University of Wuppertal in Germany developed a Python framework called Schlably, which provides a complete set of tools to simplify the development and evaluation of PS solutions.
Source code: https://github.com/tmdt-buw/schlably
Documentation: https://schlably.readthedocs.io/en/latest/index.html
2. Background and Related Work
Schlably originated in a university research project that tackled a real-world production scheduling problem together with an industrial partner. The authors therefore fixed several design goals early on: provide out-of-the-box DRL methods and heuristics, cover different scheduling scenarios, support detailed evaluation, and keep the code easy to work with so that students and researchers can get started quickly.
The authors also evaluated and compared existing related frameworks. Schlably aims to be a flexible, modular, and easy-to-use experimentation framework, with particular strengths in generating scheduling problem instances and integrating third-party DRL libraries.
3. Software Architecture
Schlably's architecture puts strong emphasis on modularity and extensibility. Its core components are:
- Environment
- Scheduling problem generator
- DRL agent algorithms
- Logging and evaluation tools
This design not only simplifies experiment setup, but also makes it easy to swap out or extend individual components.
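To make this modularity concrete, here is a minimal sketch of how a gym-style scheduling environment interacts with an interchangeable agent. The environment, agent, and reward here are illustrative assumptions for a single-machine toy problem, not Schlably's actual API:

```python
import numpy as np

# Hypothetical toy environment mimicking the gym reset/step interface;
# this is NOT Schlably's Environment class, only an illustration of how
# the modular pieces (environment, agent) plug together.
class ToySchedulingEnv:
    def __init__(self, processing_times):
        self.processing_times = list(processing_times)

    def reset(self):
        self.remaining = list(range(len(self.processing_times)))  # unscheduled jobs
        self.makespan = 0
        return self._obs()

    def _obs(self):
        # observation: processing time per job, inf for already-scheduled jobs
        return np.array([self.processing_times[j] if j in self.remaining else np.inf
                         for j in range(len(self.processing_times))])

    def step(self, action):
        # schedule the chosen job on the single machine
        self.remaining.remove(action)
        self.makespan += self.processing_times[action]
        done = not self.remaining
        reward = -self.makespan if done else 0.0  # reward only at the end
        return self._obs(), reward, done, {}

def spt_agent(obs):
    # pick the unscheduled job with the shortest processing time
    return int(np.argmin(obs))

env = ToySchedulingEnv([3, 1, 2])
obs = env.reset()
done = False
schedule = []
while not done:
    action = spt_agent(obs)
    schedule.append(action)
    obs, reward, done, _ = env.step(action)

print(schedule)      # jobs ordered by shortest processing time: [1, 2, 0]
print(env.makespan)  # total processing time on the single machine: 6
```

Because the agent only sees observations and returns actions, it can be replaced by a heuristic, a DRL policy, or a solver without touching the environment, which is the spirit of Schlably's component design.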
The figure below shows Schlably's overall architecture:
Within Schlably's architecture, the authors use the Q-learning algorithm from reinforcement learning. Its update rule is:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
where:
- $Q(s_t, a_t)$ is the Q-value of taking action $a_t$ in state $s_t$.
- $\alpha$ is the learning rate.
- $r_t$ is the reward at the current time step.
- $\gamma$ is the discount factor.
- $\max_a Q(s_{t+1}, a)$ is the maximum Q-value over all possible actions in the next state $s_{t+1}$.
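As a concrete illustration of this update rule, here is a minimal tabular Q-learning step (a generic sketch of the formula, not code taken from Schlably):

```python
import numpy as np

# Tabular Q-learning update for a toy problem with 2 states and 2 actions;
# a generic sketch of the update rule above, not Schlably code.
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))

alpha, gamma = 0.5, 0.9  # learning rate and discount factor

def q_update(Q, s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# one transition: in state 0, action 1 yields reward 1.0 and leads to state 1
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # 0.5 * (1.0 + 0.9 * 0 - 0) = 0.5
```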
4. Algorithms
Schlably integrates multiple scheduling algorithms, including heuristics, reinforcement learning agents, and solvers.
4.1 Heuristics
The heuristics include:
- EDD: earliest due date
- SPT: shortest processing time first
- MTR: most tasks remaining
- LTR: least tasks remaining
- Random: random action
The implementation is as follows:
```python
"""
This module provides the following scheduling heuristics as function:

- EDD: earliest due date
- SPT: shortest processing time first
- MTR: most tasks remaining
- LTR: least tasks remaining
- Random: random action

You can implement additional heuristics in this file by specifying a function that takes a list of tasks and an action
mask and returns the index of the job to be scheduled next.

If you want to call your heuristic via the HeuristicSelectionAgent or edit an existing shortcut,
adapt/extend the task_selection dict attribute of the HeuristicSelectionAgent class.

:Example:

Add a heuristic that returns zeros (this is not a practical example!)

1. Define the according function

.. code-block:: python

    def return_0_heuristic(tasks: List[Task], action_mask: np.array) -> int:
        return 0

2. Add the function to the task_selection dict within the HeuristicSelectionAgent class:

.. code-block:: python

    self.task_selections = {
        'rand': random_task,
        'EDD': edd,
        'SPT': spt,
        'MTR': mtr,
        'LTR': ltr,
        'ZERO': return_0_heuristic
    }

"""
import numpy as np
from typing import List

from src.data_generator.task import Task


def get_active_task_dict(tasks: List[Task]) -> dict:
    """
    Helper function to determine the next unfinished task to be processed for each job

    :param tasks: List of task objects, so one instance
    :return: Dictionary containing the next tasks to be processed for each job.
        Would be an empty dictionary if all tasks were completed
    """
    active_job_task_dict = {}
    for task_i, task in enumerate(tasks):
        if not task.done and task.job_index not in active_job_task_dict.keys():
            active_job_task_dict[task.job_index] = task_i
    return active_job_task_dict


def edd(tasks: List[Task], action_mask: np.array) -> int:
    """
    EDD: earliest due date. Determines the job with the smallest deadline

    :param tasks: List of task objects, so one instance
    :param action_mask: Action mask from the environment that is to receive the action selected by this heuristic
    :return: Index of the job selected according to the heuristic
    """
    if np.sum(action_mask) == 1:
        chosen_job = np.argmax(action_mask)
    else:
        num_jobs = action_mask.shape[0] - 1
        num_tasks_per_job = len(tasks) / num_jobs
        deadlines = np.full(num_jobs + 1, np.inf)

        for job_i in range(num_jobs):
            idx = int(num_tasks_per_job * job_i)
            deadlines[job_i] = tasks[idx].deadline

        # mask out jobs that may not be scheduled, then pick the earliest deadline
        deadlines = np.where(action_mask == 1, deadlines, np.full(deadlines.shape, np.inf))
        chosen_job = np.argmin(deadlines)
    return chosen_job


def spt(tasks: List[Task], action_mask: np.array) -> int:
    """
    SPT: shortest processing time first. Determines the job of which the next unfinished task has the lowest runtime

    :param tasks: List of task objects, so one instance
    :param action_mask: Action mask from the environment that is to receive the action selected by this heuristic
    :return: Index of the job selected according to the heuristic
    """
    if np.sum(action_mask) == 1:
        chosen_job = np.argmax(action_mask)
    else:
        num_jobs = action_mask.shape[0] - 1
        runtimes = np.full(num_jobs + 1, np.inf)
        active_task_dict = get_active_task_dict(tasks)

        for i in range(num_jobs):
            if i in active_task_dict.keys():
                task_idx = active_task_dict[i]
                runtimes[i] = tasks[task_idx].runtime

        # mask out jobs that may not be scheduled, then pick the shortest runtime
        runtimes = np.where(action_mask == 1, runtimes, np.full(runtimes.shape, np.inf))
        chosen_job = np.argmin(runtimes)
    return chosen_job


def mtr(tasks: List[Task], action_mask: np.array) -> int:
    """
    MTR: most tasks remaining. Determines the job with the least completed tasks

    :param tasks: List of task objects, so one instance
    :param action_mask: Action mask from the environment that is to receive the action selected by this heuristic
    :return: Index of the job selected according to the heuristic
    """
    if np.sum(action_mask) == 1:
        chosen_job = np.argmax(action_mask)
    else:
        tasks_done = np.zeros(len(tasks) + 1)
        possible_tasks = get_active_task_dict(tasks)

        # count completed tasks per job, indexed by each job's next active task
        for _, task in enumerate(tasks):
            if task.done and task.job_index in possible_tasks.keys():
                tasks_done[possible_tasks[task.job_index]] += 1

        # mark which active tasks belong to jobs allowed by the action mask
        task_mask = np.zeros(len(tasks) + 1)
        for job_id, task_id in possible_tasks.items():
            if action_mask[job_id] == 1:
                task_mask[task_id] += 1

        tasks_done = np.where(task_mask == 1, tasks_done, np.full(tasks_done.shape, np.inf))
        tasks_done[-1] = np.inf
        chosen_task = np.argmin(tasks_done)
        # map the chosen task index back to its job index
        chosen_job = tasks[chosen_task].job_index
    return chosen_job
```
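To see how such a heuristic is applied, here is a self-contained sketch that mimics the `Task` interface (the `job_index`, `deadline`, `runtime`, and `done` attributes follow the code above; the mock class itself and the example data are illustrative, not part of Schlably):

```python
import numpy as np
from dataclasses import dataclass

# Minimal stand-in for Schlably's Task class; only the attributes used by
# the EDD heuristic above are modeled here.
@dataclass
class MockTask:
    job_index: int
    deadline: int
    runtime: int = 1
    done: bool = False

# two jobs with one task each; job 1 has the earlier due date
tasks = [MockTask(job_index=0, deadline=10),
         MockTask(job_index=1, deadline=5)]

# action mask: one entry per job plus a trailing no-op entry, as in the code above
action_mask = np.array([1, 1, 0])

# EDD selection, following the logic of edd() above: collect each job's deadline,
# mask out forbidden jobs with inf, then pick the minimum
num_jobs = action_mask.shape[0] - 1
deadlines = np.full(num_jobs + 1, np.inf)
for job_i in range(num_jobs):
    deadlines[job_i] = tasks[job_i].deadline
deadlines = np.where(action_mask == 1, deadlines, np.inf)
chosen_job = int(np.argmin(deadlines))
print(chosen_job)  # 1: job 1 has the earliest due date
```

The action mask is what lets the same heuristics double as action-selection policies inside the gym environment: the environment forbids invalid jobs, and the heuristic only chooses among the remaining ones.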