Paper Notes (1) SORT: Simple Online and Realtime Tracking

WAHAJA_1111


Article link: https://blog.csdn.net/qq_39213580/article/details/109536278


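This post explains how the SORT algorithm works and how it is applied to multi-object tracking. SORT combines a Kalman filter with the Hungarian algorithm to track objects quickly and accurately, which suits scenarios such as video surveillance. The post also covers how the Kalman filter and the Hungarian algorithm work and walks through the source code.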


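0. Preface

Related material
- arxiv
- github
- Zhihu discussions 1, 2 and 3
Paper information
- Field: object tracking
- Authors: Queensland University of Technology, University of Sydney
- Venue: IEEE International Conference on Image Processing (ICIP) 2016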

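1. What problem does it solve

Problem: multi-object tracking (MOT), that is, associating the same object across consecutive frames, as shown in the figure below. (Figure from: 多目标追踪论文分享SORT)
Speed: SORT combines a Kalman filter with the Hungarian algorithm. Because it matches objects between frames using only bounding-box IOU overlap and uses no appearance or content features, the tracker is very fast: it reaches 260 FPS, about 20 times faster than typical multi-object trackers.
The paper also points out that simply switching to a better detector can improve tracking performance by up to 18.9%.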

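2. How the algorithm works

2.1 SORT

SORT stands for Simple Online and Realtime Tracking. It is a multi-object tracking algorithm built on the tracking-by-detection strategy: it takes the output of an object detection network and uses the Hungarian algorithm and a Kalman filter to associate and track objects across frames. Unlike other multi-object trackers, SORT does not use the appearance features inside the detection box; it associates objects between frames using only the position and size of the bounding boxes. The main components are Faster R-CNN, a Kalman filter and the Hungarian algorithm. The figure below shows the core SORT pipeline (figure from Deep SORT多目标跟踪算法代码解析(上)):

[Figure: flowchart of the core SORT pipeline]

Detections are the bounding boxes produced by the object detector, and Tracks are trajectories. The core of the algorithm is the matching step together with the Kalman filter's predict and update steps.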

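The detector produces the Detections while the Kalman filter predicts the Tracks for the current frame. Detections and Tracks are then matched by IOU, and the result falls into three groups:
- Unmatched Tracks: the track has no matching detection. If a track stays unmatched for more than max_age frames (the max_age parameter of the Sort class in section 3.1), its ID is removed.
- Unmatched Detections: no track matches the detection, so a new track is created for this detection.
- Matched Tracks: the track and a detection were matched.
The Kalman predict step uses the state of each track to predict its bounding box in the next frame. The Kalman update step combines the observation (the detection matched to the track) with the predicted estimate to update the state of each matched track.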
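
To make the IOU matching criterion concrete, here is a minimal sketch of the IOU (intersection over union) of one predicted track box and one detection box. The function name `iou` and the example boxes are illustrative assumptions; boxes are taken to be in `(x1, y1, x2, y2)` format, as in the detections used later in the source code.

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # width/height are clamped at 0 so that disjoint boxes give no overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted_track = (0, 0, 10, 10)  # hypothetical box predicted by the Kalman filter
detection = (5, 0, 15, 10)        # hypothetical box from the detector
print(iou(predicted_track, detection))  # 50 / 150, about 0.333
```

With an IOU threshold of 0.3 this pair would still count as a match; two boxes that do not overlap at all score 0.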

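2.2 The classic Kalman filter (KF) in detail

This part mainly follows 目标跟踪初探（DeepSORT）. The KF formulas are not derived here; since only a simple application is needed, only the basic principle of the KF is covered.
Put simply, the KF uses sensor measurements to correct a predicted value and so obtain a more accurate estimate. It is widely used in drones, autonomous driving, satellite navigation and similar fields.
In object tracking, the KF has two stages:
- Predict: predict the position of the track at the next time step. A track consists of:
  - Mean: the position information of the target, made up of the bbox center (cx, cy), the aspect ratio r, the height h and the velocity of each of these. It is an 8-dim vector x = [cx, cy, r, h, vx, vy, vr, vh], with all velocities initialized to 0. (In SORT, x = [cx,cy,s,r,vx,vy,vs], where s is the bbox area and r is the aspect ratio.)
  - Covariance: the uncertainty of the position information, an 8x8 diagonal matrix (7x7 in SORT). The larger the values, the greater the uncertainty; it can be initialized with arbitrary values.
- Update: correct the predicted position using the detection.
A brief look at the formulas involved, written in the standard KF form that the numbered references below correspond to:
- Predict (formula 1: x' = F x; formula 2: P' = F P F^T + Q): predict the state of the track at time t from its state at time t-1.
- dt in the matrix F is the time difference between the current frame and the previous frame. Expanding the matrix product on the right-hand side of formula 1 gives cx' = cx + dt*vx, cy' = cy + dt*vy and so on, so this Kalman filter is a constant velocity model.
- In formula 2, P is the covariance of the track at time t-1 and Q is the process noise matrix, which expresses how reliable the motion model is and is usually initialized with very small values. The formula predicts P' at time t.
- Update: use the detection observed at time t to correct the state of the track associated with it, giving a more accurate result.
- In formula 3, y = z - H x', z is the mean vector of the detection without the velocity terms, i.e. z = [cx, cy, r, h]. H is the measurement matrix, which maps the track mean x' into detection space. The formula computes the error between the detection and the track mean.
- In formula 4, S = H P' H^T + R, R is the noise matrix of the detector, a 4x4 diagonal matrix whose diagonal holds the noise of the two center coordinates and of the width and height. It can be initialized with arbitrary values, and the width/height noise is usually set larger than the center noise. The formula first maps the covariance P' into detection space and then adds R.
- Formula 5, K = P' H^T S^-1, computes the Kalman gain K, which weights how much the error y is trusted.
- Formula 6, x = x' + K y, and formula 7, P = (I - K H) P', give the updated mean vector x and covariance matrix P.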
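
The predict and update steps described above fit in a few lines of NumPy. The following is a minimal sketch using the standard Kalman filter equations on a toy one-dimensional constant velocity state [position, velocity] with dt = 1, rather than SORT's 7-dim state. The function names `kf_predict` and `kf_update` and all numeric values are illustrative assumptions, not taken from the SORT source.

```python
import numpy as np


def kf_predict(x, P, F, Q):
    """Predict step: propagate mean and covariance one time step."""
    x_pred = F @ x                # x' = F x
    P_pred = F @ P @ F.T + Q      # P' = F P F^T + Q
    return x_pred, P_pred


def kf_update(x_pred, P_pred, z, H, R):
    """Update step: correct the prediction with measurement z."""
    y = z - H @ x_pred                      # error between detection and prediction
    S = H @ P_pred @ H.T + R                # covariance mapped to detection space, plus R
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x = x_pred + K @ y                      # corrected mean
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred  # corrected covariance
    return x, P


# Toy constant velocity model: state = [position, velocity], dt = 1
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])   # only the position is measured
Q = 0.01 * np.eye(2)         # small process noise
R = np.array([[1.0]])        # measurement noise
x = np.array([0.0, 1.0])     # start at position 0 with velocity 1
P = 10.0 * np.eye(2)         # large initial uncertainty

x_pred, P_pred = kf_predict(x, P, F, Q)
x_new, P_new = kf_update(x_pred, P_pred, np.array([1.5]), H, R)
print(x_pred)  # predicted position 1.0, velocity 1.0
print(x_new)   # position pulled from 1.0 towards the measurement 1.5
```

Because the predicted covariance is much larger than R here, the gain is close to 1 and the corrected position ends up near the measurement; with a larger R the filter would trust the prediction more.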

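2.3 The classic Hungarian algorithm in detail

This part mainly follows 目标跟踪初探（DeepSORT）.
The weighted Hungarian algorithm is also called the KM (Kuhn-Munkres) algorithm, so strictly speaking SORT uses the KM algorithm; both are simply called the Hungarian algorithm below.
The Hungarian algorithm tells us whether an object in the current frame is the same as an object in the previous frame. It solves the assignment problem: given N people and N tasks, where any task can be assigned to any person and the cost of each person doing each task is known and differs, how should the tasks be assigned so that the total cost is minimal? As an example, suppose 3 tasks must be assigned to 3 people, and the cost matrix (the cost can be money, time and so on) is:

            Task 1  Task 2  Task 3
Person 1      15      40      45
Person 2      20      60      35
Person 3      20      40      25

How can we find an optimal assignment that minimizes the total cost of all tasks? The Hungarian algorithm is one method for solving the assignment problem. It is based on the following theorem:

If the same number is added to or subtracted from every element of a row or a column of the cost matrix, an optimal assignment of the new cost matrix is still an optimal assignment of the original cost matrix.
Algorithm steps
1. For each row of the matrix, subtract the smallest element of that row
2. For each column of the matrix, subtract the smallest element of that column
3. Cover all zeros in the matrix with the fewest horizontal or vertical lines
4. If the number of lines equals N, an optimal assignment has been found and the algorithm ends; otherwise go to step 5
5. Find the smallest element not covered by any line, subtract it from every uncovered row, add it to every covered column, and return to step 3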
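Applying the steps to the example above:
1. The row minima are 15, 20 and 20. Subtracting them gives:
    0  25  30
    0  40  15
    0  20   5
2. The column minima are 0, 20 and 5. Subtracting them gives:
    0   5  25
    0  20  10
    0   0   0
3. Cover all zeros with the fewest horizontal or vertical lines: one line through the first column and one through the third row.
4. The number of lines is 2, which is less than 3, so go to the next step.
5. The smallest uncovered element is 5. Subtracting 5 from the uncovered rows (rows 1 and 2) gives:
   -5   0  20
   -5  15   5
    0   0   0
   Adding 5 to the covered column (column 1) gives:
    0   0  20
    0  15   5
    5   0   0

Back to step 3: cover all zeros with the fewest horizontal or vertical lines. This time three lines are needed.

Step 4: the number of lines is 3, so the condition holds and the algorithm ends. Assigning task 2 to person 1, task 1 to person 2 and task 3 to person 3 gives the minimum total cost in the reduced matrix (0+0+0=0).

In the original matrix, the minimum total cost is therefore 40+20+25=85.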
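
For a matrix this small, the result can be checked by brute force over all 3! = 6 assignments. The sketch below is a verification aid, not the Hungarian algorithm itself; the helper name `brute_force_assignment` is an illustrative assumption. It also checks the theorem: after the row and column reductions of steps 1 and 2, the optimal assignment is unchanged.

```python
from itertools import permutations

import numpy as np

cost = np.array([[15, 40, 45],
                 [20, 60, 35],
                 [20, 40, 25]])


def brute_force_assignment(c):
    """Try every permutation; p[i] is the task given to person i."""
    n = c.shape[0]

    def total(p):
        return int(sum(c[i, p[i]] for i in range(n)))

    best = min(permutations(range(n)), key=total)
    return best, total(best)


best, total_cost = brute_force_assignment(cost)
print(best, total_cost)  # (1, 0, 2) 85

# Steps 1 and 2: subtract row minima, then column minima
reduced = cost - cost.min(axis=1, keepdims=True)
reduced = reduced - reduced.min(axis=0, keepdims=True)
best_reduced, _ = brute_force_assignment(reduced)
print(best_reduced)  # (1, 0, 2): same assignment as for the original matrix
```

Brute force costs N! evaluations, so it is only practical for tiny matrices; the Hungarian algorithm reaches the same answer in polynomial time.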

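Code implementation: sklearn's linear_assignment() and scipy's linear_sum_assignment() both solve this assignment problem, but they return the result in different forms. sklearn.utils.linear_assignment_ was deprecated in scikit-learn 0.21 and removed in 0.23, so the snippet below calls the scipy function and rebuilds the sklearn-style result from it:

import numpy as np
from scipy.optimize import linear_sum_assignment

cost_matrix = np.array([
    [15,40,45],
    [20,60,35],
    [20,40,25]
])

# scipy returns a tuple (row_indices, col_indices)
row_ind, col_ind = linear_sum_assignment(cost_matrix)
# the old sklearn linear_assignment() returned the same pairs as an (N, 2) array of [row, col]
matches = np.stack([row_ind, col_ind], axis=1)
print('sklearn-style result:\n', matches)
print('scipy API result:\n', (row_ind, col_ind))
print('total cost:', cost_matrix[row_ind, col_ind].sum())

"""Outputs (whether a dtype is printed depends on the platform and numpy version)
sklearn-style result:
 [[0 1]
 [1 0]
 [2 2]]
scipy API result:
 (array([0, 1, 2]), array([1, 0, 2]))
total cost: 85
"""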

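The source code provided by the SORT authors uses scipy's linear_sum_assignment() function, called through the linear_assignment() helper that appears in section 3.3.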

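3. Source code walkthrough

3.1 SORT

In the Sort class, the update method does the tracking update: it takes the detection boxes of the current frame and returns the latest bbox + id information.
update_with_pose_label is a function I wrote myself. It records the body-pose label of each tracked bbox into pose, a list of lists: the index of each sub-list is the tracklet id, and the sub-list stores that tracklet's pose labels.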

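import numpy as np

class Sort(object):
	def __init__(self, max_age=1, min_hits=1, iou_threshold=0.3):
		"""
		Sets key parameters for SORT
		"""
		self.max_age = max_age # if a tracklet is unmatched over max_age times, remove it
		self.min_hits = min_hits # hit_streak a tracklet needs before it is reported (waived while frame_count <= min_hits)
		self.iou_threshold = iou_threshold # minimum IOU for a detection/tracker pair to count as a match
		self.trackers = []
		self.frame_count = 0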

	def update(self, dets=np.empty((0, 5))):
		"""
		Params:
		dets - a numpy array of detections in the format [[x1,y1,x2,y2,score],[x1,y1,x2,y2,score],...]
		Requires: this method must be called once for each frame even with empty detections (use np.empty((0, 5)) for frames without detections).
		Returns a similar array, where the last column is the object ID.

		NOTE: The number of objects returned may differ from the number of detections provided.
		"""
		self.frame_count += 1
		# get predicted locations from existing trackers.
		trks = np.zeros((len(self.trackers), 5))
		to_del = []
		ret = []
		for t, trk in enumerate(trks):
			pos = self.trackers[t].predict()[0]
			trk[:] = [pos[0], pos[1], pos[2], pos[3], 0]
			if np.any(np.isnan(pos)):
				to_del.append(t)
		trks = np.ma.compress_rows(np.ma.masked_invalid(trks))
		for t in reversed(to_del):
			self.trackers.pop(t)

		# get the matched, unmatched detections and unmatched trackers
		matched, unmatched_dets, unmatched_trks = associate_detections_to_trackers(dets, trks, self.iou_threshold)

		# update matched trackers with assigned detections
		for m in matched:
			self.trackers[m[1]].update(dets[m[0], :])

		# create and initialise new trackers for unmatched detections
		for i in unmatched_dets:
			trk = KalmanBoxTracker(dets[i,:])
			self.trackers.append(trk)
		i = len(self.trackers)


		for trk in reversed(self.trackers):
			d = trk.get_state()[0]
			if (trk.time_since_update < 1) and (trk.hit_streak >= self.min_hits or self.frame_count <= self.min_hits):
				ret.append(np.concatenate((d,[trk.id+1])).reshape(1,-1)) # +1 as MOT benchmark requires positive
			i -= 1
			# remove dead tracklet
			if(trk.time_since_update > self.max_age):
				self.trackers.pop(i)

		if(len(ret)>0):
			return np.concatenate(ret)
		return np.empty((0,5))

	# defined by wanghuijiao, to save person's pose label
	def update_with_pose_label(self, dets=np.empty((0, 6)), pose=[]):
		"""
		Params:
		dets - a numpy array of detections in the format [[x1,y1,x2,y2,score,category_id],[x1,y1,x2,y2,score,category_id],...]
		pose - a list of per-track pose label lists; the caller passes the returned list back in on the next frame
		Requires: this method must be called once for each frame even with empty detections (use np.empty((0, 6)) for frames without detections).
		Returns:
		ret - an array in the format [[x1,y1,x2,y2,object_id],[x1,y1,x2,y2,object_id],...] (score and category_id are not copied to the output)
		pose - the updated list of per-track pose label lists
		NOTE: The number of objects returned may differ from the number of detections provided.
		"""
		self.frame_count += 1
		# get predicted locations from existing trackers.
		trks = np.zeros((len(self.trackers), 5))
		to_del = []
		ret = []
		for t, trk in enumerate(trks):
			pos = self.trackers[t].predict()[0]
			trk[:] = [pos[0], pos[1], pos[2], pos[3], 0]
			if np.any(np.isnan(pos)):
				to_del.append(t)
		trks = np.ma.compress_rows(np.ma.masked_invalid(trks))
		for t in reversed(to_del):
			self.trackers.pop(t)

		if(len(dets)==0):
			return np.empty((0,5)), pose

		# get the matched, unmatched detections and unmatched trackers
		matched, unmatched_dets, unmatched_trks = associate_detections_to_trackers(dets[:, :5], trks, self.iou_threshold)

		# update matched trackers with assigned detections
		for m in matched:
			self.trackers[m[1]].update(dets[m[0], :])
			track_id = self.trackers[m[1]].id # renamed from id to avoid shadowing the builtin
			# update current pose to pose list
			# action_number_to_label (category id -> pose label) is assumed to be defined elsewhere
			new_label = action_number_to_label[int(dets[m[0], -1])]
			pose[track_id].append(new_label)

		# create and initialise new trackers for unmatched detections
		for i in unmatched_dets:
			trk = KalmanBoxTracker(dets[i,:])
			self.trackers.append(trk)
			# dets is a float array, so the category id is cast to int before the lookup
			newid_pose_list = [action_number_to_label[int(dets[i, -1])]]
			pose.append(newid_pose_list)

		i = len(self.trackers)

		for trk in reversed(self.trackers):
			d = trk.get_state()[0]
			if (trk.time_since_update < 1) and (trk.hit_streak >= self.min_hits or self.frame_count <= self.min_hits):
				ret.append(np.concatenate((d,[trk.id+1])).reshape(1,-1)) # +1 as MOT benchmark requires positive
			i -= 1
			# remove dead tracklet
			if(trk.time_since_update > self.max_age):
				self.trackers.pop(i)

		if(len(ret)>0):
			return np.concatenate(ret), pose

		return np.empty((0,5)), pose
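
The reporting and deletion conditions at the end of update are easier to follow when traced for a single track. The sketch below is a toy simulation that copies only the counter logic (time_since_update, hit_streak, min_hits, max_age) and leaves out the Kalman filter and the matching; the function `simulate_track` and its arguments are hypothetical names introduced for illustration.

```python
def simulate_track(later_matches, max_age=1, min_hits=1, birth_frame=1):
    """Trace one track: returns (frame, reported, status) per frame.

    later_matches[k] says whether the track is matched in the k-th frame
    after the frame in which it was created from an unmatched detection.
    """
    time_since_update, hit_streak = 0, 0
    frame = birth_frame
    # Creation frame: the new track has not been predicted or updated yet
    reported = time_since_update < 1 and (hit_streak >= min_hits or frame <= min_hits)
    events = [(frame, reported, "alive")]
    for matched in later_matches:
        frame += 1
        # predict(): a miss in the previous frame resets the streak
        if time_since_update > 0:
            hit_streak = 0
        time_since_update += 1
        # update(): only runs when a detection was matched
        if matched:
            time_since_update = 0
            hit_streak += 1
        reported = time_since_update < 1 and (hit_streak >= min_hits or frame <= min_hits)
        dead = time_since_update > max_age
        events.append((frame, reported, "deleted" if dead else "alive"))
        if dead:
            break
    return events


for event in simulate_track([True, True, False, False], birth_frame=5):
    print(event)
# (5, False, 'alive')    created, but hit_streak is still 0
# (6, True, 'alive')     first matched update, reported
# (7, True, 'alive')
# (8, False, 'alive')    one miss: not reported, kept alive
# (9, False, 'deleted')  time_since_update = 2 > max_age = 1
```

With max_age=1 a track survives exactly one missed frame, which is why a short occlusion is tolerated while a longer one ends the track and later produces a new ID.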

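3.2 Kalman filter

The focus of this post is the SORT algorithm itself; the Kalman filter code provided by the authors is shown below. Following the formulas discussed earlier, it first initializes the KF parameters F, H, R, P and Q, which are not explained again here.

self.time_since_update counts the frames since the tracklet was last updated: predict() adds 1 in every frame and update() resets it to 0 on a match. When the track boxes are output this value is checked, and the tracker is removed if it exceeds max_age;
self.id is the id of the tracker, assigned automatically each time a tracker is created;
self.history stores the boxes predicted since the last update; its last element is the most recent prediction;
self.hit_streak counts consecutive successful updates: +1 on each KF update, reset to 0 after a missed frame;
self.age counts how many times the tracklet has been predicted; SORT does not use it yet;
self.hits counts how many times the tracklet has been updated; SORT does not use it yet.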

import numpy as np
from filterpy.kalman import KalmanFilter

class KalmanBoxTracker(object):
	"""
	This class represents the internal state of individual tracked objects observed as bbox.
	"""
	count = 0
	def __init__(self,bbox):
		"""
		Initialises a tracker using initial bounding box.
		"""
		#define constant velocity model
		self.kf = KalmanFilter(dim_x=7, dim_z=4) 
		self.kf.F = np.array([[1,0,0,0,1,0,0],[0,1,0,0,0,1,0],[0,0,1,0,0,0,1],[0,0,0,1,0,0,0],[0,0,0,0,1,0,0],[0,0,0,0,0,1,0],[0,0,0,0,0,0,1]])
		# H is measurement matrix: It maps the mean vector x prime of the track to the detection space
		self.kf.H = np.array([[1,0,0,0,0,0,0],[0,1,0,0,0,0,0],[0,0,1,0,0,0,0],[0,0,0,1,0,0,0]]) 

		self.kf.R[2:,2:] *= 10. # R: 4x4 measurement (detector) noise; s and r are trusted less than the center
		self.kf.P[4:,4:] *= 1000. # give high uncertainty to the unobservable initial velocities, P is the covariance
		self.kf.P *= 10.
		self.kf.Q[-1,-1] *= 0.01 # Q: 7x7 process noise
		self.kf.Q[4:,4:] *= 0.01

		self.kf.x[:4] = convert_bbox_to_z(bbox) # x is the mean (state vector)
		self.time_since_update = 0 # frames since the last matched detection; Sort drops the tracker once this exceeds max_age
		self.id = KalmanBoxTracker.count # tracker's id
		KalmanBoxTracker.count += 1
		self.history = []
		self.hits = 0
		self.hit_streak = 0
		self.age = 0

	def update(self,bbox):
		"""
		Updates the state vector with observed bbox.
		"""
		self.time_since_update = 0 # when matched with the detections, reset to 0
		self.history = []
		self.hits += 1
		self.hit_streak += 1
		self.kf.update(convert_bbox_to_z(bbox)) # update the trackers with matched detections

	def predict(self):
		"""
		Advances the state vector and returns the predicted bounding box estimate.
		"""
		# x is [cx,cy,s,r,vx,vy,vs], x[6] is the speed of s (the scale/area).
		# when predicted scale < 0, set the vs=0
		if((self.kf.x[6]+self.kf.x[2])<=0):
			self.kf.x[6] *= 0.0
		self.kf.predict()
		self.age += 1
		if(self.time_since_update>0):
			self.hit_streak = 0
		self.time_since_update += 1
		self.history.append(convert_x_to_bbox(self.kf.x))
		return self.history[-1]

	def get_state(self):
		"""
		Returns the current bounding box estimate.
		"""
		return convert_x_to_bbox(self.kf.x)
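
The class above calls convert_bbox_to_z and convert_x_to_bbox, which are not shown in this post. Based on the state layout [cx,cy,s,r,...] described earlier (s is the area, r the aspect ratio w/h), a minimal sketch of the two conversions could look as follows. This is an assumption-based reconstruction for illustration and may differ in detail from the helpers in the SORT repository.

```python
import numpy as np


def convert_bbox_to_z(bbox):
    """[x1, y1, x2, y2, ...] -> measurement column vector [cx, cy, s, r]."""
    w = bbox[2] - bbox[0]
    h = bbox[3] - bbox[1]
    cx = bbox[0] + w / 2.0
    cy = bbox[1] + h / 2.0
    s = w * h            # scale: area of the box
    r = w / float(h)     # aspect ratio: width / height
    return np.array([cx, cy, s, r]).reshape((4, 1))


def convert_x_to_bbox(x):
    """State column vector [cx, cy, s, r, ...] -> array [[x1, y1, x2, y2]]."""
    w = np.sqrt(x[2] * x[3])  # s * r = (w * h) * (w / h) = w^2
    h = x[2] / w
    return np.array([x[0] - w / 2.0, x[1] - h / 2.0,
                     x[0] + w / 2.0, x[1] + h / 2.0]).reshape((1, 4))


box = np.array([10.0, 20.0, 50.0, 100.0, 0.9])  # hypothetical detection with score
z = convert_bbox_to_z(box)
print(z.ravel())             # cx=30, cy=60, s=3200, r=0.5
print(convert_x_to_bbox(z))  # [[ 10.  20.  50. 100.]]
```

The (1, 4) return shape is what makes calls such as predict()[0] and get_state()[0] in the Sort class yield a flat [x1, y1, x2, y2] row.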

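3.3 Associating frames with the Hungarian algorithm in SORT

First, the IOU score matrix between the detections and the boxes predicted by the trackers is computed and used as the weight matrix. Its negative is passed to the assignment solver, so minimizing the cost maximizes the total IOU. The resulting optimal assignment, matched_indices, gives the matched bboxes.
Next, the input detections and trackers are traversed, the bboxes that are already matched are excluded, and the remaining bboxes fall into two groups:
- unmatched_trackers: people who have left and disappeared from the video;
- unmatched_detections: people who have arrived and newly appear in the video, or people who were not tracked before.
Matched pairs whose IOU is below iou_threshold are also moved into the two unmatched groups. The final output therefore has three parts: matches, unmatched_detections and unmatched_trackers.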

def associate_detections_to_trackers(detections,trackers,iou_threshold = 0.3):
	"""
	Assigns detections to tracked object (both represented as bounding boxes)

	Returns 3 lists of matches, unmatched_detections and unmatched_trackers
	"""
	if(len(trackers)==0):
		return np.empty((0,2),dtype=int), np.arange(len(detections)), np.empty((0,5),dtype=int)

	iou_matrix = iou_batch(detections, trackers)

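	if min(iou_matrix.shape) > 0:
		a = (iou_matrix > iou_threshold).astype(np.int32)
		# shortcut: if every detection and every tracker has at most one candidate above the threshold,
		# the thresholded matrix already is the assignment
		if a.sum(1).max() == 1 and a.sum(0).max() == 1:
			matched_indices = np.stack(np.where(a), axis=1)
		else:
			# the solver minimizes cost, so the negated IOU is passed to maximize total IOU
			matched_indices = linear_assignment(-iou_matrix)
	else:
		matched_indices = np.empty(shape=(0,2))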

	unmatched_detections = []
	for d, det in enumerate(detections):
		if(d not in matched_indices[:,0]):
			unmatched_detections.append(d)
	unmatched_trackers = []
	for t, trk in enumerate(trackers):
		if(t not in matched_indices[:,1]):
			unmatched_trackers.append(t)

	# filter out matched with low IOU
	matches = []
	for m in matched_indices:
		if(iou_matrix[m[0], m[1]]<iou_threshold):
			unmatched_detections.append(m[0])
			unmatched_trackers.append(m[1])
		else:
			matches.append(m.reshape(1,2))
	if(len(matches)==0):
		matches = np.empty((0,2),dtype=int)
	else:
		matches = np.concatenate(matches,axis=0)

	return matches, np.array(unmatched_detections), np.array(unmatched_trackers)
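
The function above relies on two helpers that are not shown in this post, iou_batch and linear_assignment. A minimal sketch of both follows, assuming boxes in [x1, y1, x2, y2, ...] format and using scipy's linear_sum_assignment for the assignment; the bodies are illustrative and may differ from the helpers in the SORT repository.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou_batch(dets, trks):
    """Pairwise IOU; returns an array of shape (len(dets), len(trks))."""
    d = np.expand_dims(dets, 1)  # shape (D, 1, C), broadcast against trackers
    t = np.expand_dims(trks, 0)  # shape (1, T, C)
    xx1 = np.maximum(d[..., 0], t[..., 0])
    yy1 = np.maximum(d[..., 1], t[..., 1])
    xx2 = np.minimum(d[..., 2], t[..., 2])
    yy2 = np.minimum(d[..., 3], t[..., 3])
    inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    area_t = (t[..., 2] - t[..., 0]) * (t[..., 3] - t[..., 1])
    return inter / (area_d + area_t - inter)


def linear_assignment(cost_matrix):
    """Solve the assignment problem; returns an (N, 2) array of [row, col] pairs."""
    rows, cols = linear_sum_assignment(cost_matrix)
    return np.stack([rows, cols], axis=1)


# Two hypothetical detections and two predicted tracker boxes, listed in a different order
dets = np.array([[0, 0, 10, 10, 0.9],
                 [100, 100, 120, 120, 0.8]], dtype=float)
trks = np.array([[98, 101, 119, 121, 0],
                 [1, 0, 11, 10, 0]], dtype=float)

iou_matrix = iou_batch(dets, trks)
matches = linear_assignment(-iou_matrix)  # negate: the solver minimizes cost
print(np.round(iou_matrix, 3))
print(matches)  # detection 0 -> tracker 1, detection 1 -> tracker 0
```

Only the box geometry enters the cost, which is what keeps SORT fast and is also why two people crossing paths can swap IDs.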

4. Results, strengths and weaknesses

I combined sort with two body-pose detection networks, yolov4 and mmdet faster_rcnn_fpn_1x, to track human detection boxes. Tracking works well for a single person.

yolov4+sort demo: /hdd01/wanghuijiao/test_video/yolov4_tearoom/sort_yolov4_h1m_tearoom_lc1_fps10.mp4
mmdet+sort demo: /hdd01/wanghuijiao/test_video/mmdet_tearoom/sort_mmdet_h1m_tearoom_lc1_fps10.mp4
Strength: fast, 260 FPS
Weakness: when two people cross paths, ID switches occur easily