-
Dense Trajectories
-
Densely sample feature points in each frame.
-
Track the points through the video using optical flow.
-
Compute multiple descriptors along the trajectories of the feature points to capture shape, appearance and motion information.
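The tracking step can be illustrated with a minimal sketch. It assumes each dense flow field is given as a nested list where `flows[t][y][x]` holds the `(dx, dy)` displacement from frame t to frame t + 1. The function name `track_point`, the nearest-pixel lookup and the toy data are illustrative assumptions, not the reference implementation. The default `max_len=15` is inferred from the 30-dimensional (2 x 15) default trajectory descriptor.

```python
def track_point(start, flows, max_len=15):
    """Follow one point through successive dense optical flow fields.

    flows[t][y][x] is the (dx, dy) displacement of pixel (x, y) between
    frame t and frame t + 1. Returns the list of visited positions.
    """
    x, y = start
    trajectory = [(x, y)]
    for flow in flows[:max_len]:
        height, width = len(flow), len(flow[0])
        col, row = int(round(x)), int(round(y))
        if not (0 <= col < width and 0 <= row < height):
            break  # the point left the frame, so tracking stops
        dx, dy = flow[row][col]
        x, y = x + dx, y + dy
        trajectory.append((x, y))
    return trajectory


# Toy input: three 4x4 flow fields in which every pixel moves 1 px right.
flows = [[[(1.0, 0.0)] * 4 for _ in range(4)] for _ in range(3)]
trajectory = track_point((0.0, 2.0), flows)
```

Capping the trajectory length keeps trajectories short, so a point that drifts away from its starting location is dropped rather than followed indefinitely.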
-
Dense Sampling
-
Sampling step size W=5 pixels
-
Number of spatial scales: at most 8
-
Spatial scale increase: a factor of 1/√2 between consecutive scales
-
Removing points in homogeneous areas:
-
$T = 0.001 \times \max_{i \in I} \min(\lambda_i^1, \lambda_i^2)$
where $(\lambda_i^1, \lambda_i^2)$ are the eigenvalues of the auto-correlation matrix at point $i$ in image $I$.
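The sampling parameters and the homogeneity test can be sketched as follows. The sketch assumes the auto-correlation matrix at each sampled point is already available as a symmetric 2x2 matrix `[[a, b], [b, c]]`, passed as the tuple `(a, b, c)`. All function names, the `min_side` stopping rule and the toy values are hypothetical and chosen only for illustration.

```python
import math


def grid_points(width, height, step=5):
    """Sample points on a regular grid with a step of W pixels."""
    return [(x, y) for y in range(0, height, step) for x in range(0, width, step)]


def scale_sizes(width, height, max_scales=8, min_side=16):
    """Image sizes per scale, shrinking by 1/sqrt(2) each level (at most 8).

    min_side is a hypothetical cutoff that stops the pyramid early for
    small frames.
    """
    sizes = []
    for level in range(max_scales):
        factor = (1 / math.sqrt(2)) ** level
        w, h = round(width * factor), round(height * factor)
        if min(w, h) < min_side:
            break
        sizes.append((w, h))
    return sizes


def min_eigenvalue(a, b, c):
    """Smaller eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    return (a + c) / 2.0 - math.sqrt(((a - c) / 2.0) ** 2 + b * b)


def keep_textured(points, tensors, ratio=0.001):
    """Keep points whose smaller eigenvalue exceeds T = ratio * max over the image."""
    eigs = [min_eigenvalue(*t) for t in tensors]
    threshold = ratio * max(eigs)
    return [p for p, e in zip(points, eigs) if e > threshold]


points = grid_points(10, 5)                        # two grid points
tensors = [(4.0, 0.0, 2.0), (1e-6, 0.0, 1e-6)]     # textured vs. flat patch
kept = keep_textured(points, tensors)
```

Because the threshold is relative to the strongest response in the frame, the same 0.001 ratio can be used for videos of different contrast.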
-
Descriptors
-
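Trajectory shape descriptor (TR):
$S' = \frac{(\Delta P_t, \ldots, \Delta P_{t+L-1})}{\sum_{j=t}^{t+L-1} \lVert \Delta P_j \rVert}$
where $L$ is the length of the trajectory, and the displacement vectors are $\Delta P_t = P_{t+1} - P_t = (x_{t+1} - x_t,\ y_{t+1} - y_t)$.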
-
HOG – static appearance information
-
HOF – local motion information
-
MBH – motion boundary histograms: a motion descriptor for trajectories based on the gradients of the optical flow
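As an illustration of the trajectory shape descriptor, the sketch below takes the tracked positions of one point, forms the displacement vectors between consecutive positions, and normalizes them by the sum of their magnitudes. The function name `trajectory_shape` and the handling of a completely static point are assumptions made for this example.

```python
import math


def trajectory_shape(points):
    """Displacement vectors of a trajectory, normalized by their total length.

    points holds L + 1 tracked (x, y) positions; the result is a flat list
    of 2 * L values (dx_0, dy_0, dx_1, dy_1, ...).
    """
    displacements = [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]
    total = sum(math.hypot(dx, dy) for dx, dy in displacements)
    if total == 0:
        # Static point: return zeros instead of dividing by zero (assumption).
        return [0.0] * (2 * len(displacements))
    return [c / total for d in displacements for c in d]


points = [(0.0, 0.0), (3.0, 4.0), (3.0, 9.0)]
descriptor = trajectory_shape(points)
```

With 16 tracked positions (L = 15) the function returns 30 values, which matches the default dimension of the trajectory descriptor.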
-
-
Format of DTF features
Features are written one per line, in the following format:
frameNum mean_x mean_y var_x var_y length scale x_pos y_pos t_pos Trajectory HOG HOF MBHx MBHy
The first 10 elements are information about the trajectory:
-
frameNum: The frame on which the trajectory ends
-
mean_x: The mean value of the x coordinates of the trajectory
-
mean_y: The mean value of the y coordinates of the trajectory
-
var_x: The variance of the x coordinates of the trajectory
-
var_y: The variance of the y coordinates of the trajectory
-
length: The length of the trajectory
-
scale: The spatial scale on which the trajectory is computed
-
x_pos: The normalized x position w.r.t. the video (0 to 0.999), used for the spatio-temporal pyramid
-
y_pos: The normalized y position w.r.t. the video (0 to 0.999), used for the spatio-temporal pyramid
-
t_pos: The normalized t position w.r.t. the video (0 to 0.999), used for the spatio-temporal pyramid
The remaining elements are the five descriptors, concatenated in this order:
-
Trajectory: 2x[trajectory length] (30 dimensions by default)
-
HOG: 8x[spatial cells]x[spatial cells]x[temporal cells] (96 dimensions by default)
-
HOF: 9x[spatial cells]x[spatial cells]x[temporal cells] (108 dimensions by default)
-
MBHx: 8x[spatial cells]x[spatial cells]x[temporal cells] (96 dimensions by default)
-
MBHy: 8x[spatial cells]x[spatial cells]x[temporal cells] (96 dimensions by default)
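A feature line in this layout can be split into its named parts with a short parser. The sketch below is illustrative only. The defaults (trajectory length 15, 2 spatial cells, 3 temporal cells) are assumptions inferred from the default dimensions listed above (30 = 2x15, 96 = 8x2x2x3, 108 = 9x2x2x3), and the names `descriptor_sizes` and `parse_feature_line` are hypothetical. Fields are assumed to be separated by whitespace.

```python
INFO_FIELDS = ["frameNum", "mean_x", "mean_y", "var_x", "var_y",
               "length", "scale", "x_pos", "y_pos", "t_pos"]


def descriptor_sizes(traj_len=15, spatial_cells=2, temporal_cells=3):
    """Dimension of each descriptor, in the order they appear on a line."""
    cells = spatial_cells * spatial_cells * temporal_cells
    return [("Trajectory", 2 * traj_len), ("HOG", 8 * cells),
            ("HOF", 9 * cells), ("MBHx", 8 * cells), ("MBHy", 8 * cells)]


def parse_feature_line(line, sizes=None):
    """Split one whitespace-separated feature line into named parts."""
    sizes = sizes or descriptor_sizes()
    values = [float(v) for v in line.split()]
    expected = len(INFO_FIELDS) + sum(n for _, n in sizes)
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    feature = dict(zip(INFO_FIELDS, values))
    offset = len(INFO_FIELDS)
    for name, n in sizes:
        feature[name] = values[offset:offset + n]
        offset += n
    return feature


# Synthetic line with the default 10 + 30 + 96 + 108 + 96 + 96 = 436 values.
line = "\t".join(str(v) for v in range(436))
feature = parse_feature_line(line)
```

Checking the value count up front catches files that were produced with non-default trajectory length or cell settings, which would otherwise be sliced at the wrong offsets.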
-
Improved Dense Trajectories
-
Explicit camera motion estimation
-
Assumption: two consecutive frames are related by a homography.
-
Match feature points between frames using SURF descriptors and dense optical flow
-
Remove inconsistent matches caused by humans: use a human detector to discard matches that fall inside human regions (computationally expensive)
-
Estimate the homography from the remaining matches using RANSAC
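Two pieces of this pipeline are easy to sketch in isolation: discarding matches that start inside detected human boxes, and the reprojection test that RANSAC uses to decide which matches agree with a candidate homography. The sketch below covers only those two pieces. It does not estimate the homography itself, for which a library routine such as OpenCV's `cv2.findHomography` with the `cv2.RANSAC` method could be used. The function names, the box format `(x0, y0, x1, y1)`, the 1-pixel tolerance and the toy data are all assumptions for illustration.

```python
import math


def outside_boxes(match, boxes):
    """True if the source point of a match lies outside every human box."""
    (x, y), _ = match
    return not any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in boxes)


def apply_homography(H, point):
    """Map a point with a 3x3 homography given as nested lists."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def inliers(H, matches, tol=1.0):
    """Matches whose reprojection error under H is at most tol pixels."""
    kept = []
    for src, dst in matches:
        px, py = apply_homography(H, src)
        if math.hypot(px - dst[0], py - dst[1]) <= tol:
            kept.append((src, dst))
    return kept


# Toy data: the camera pans 2 px to the right between the two frames.
H = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matches = [
    ((0.0, 0.0), (2.0, 0.0)),      # background, follows the camera motion
    ((10.0, 10.0), (12.0, 10.0)),  # background, follows the camera motion
    ((50.0, 50.0), (58.0, 50.0)),  # on a person, moves independently
    ((5.0, 5.0), (9.0, 5.0)),      # mismatched background point
]
human_boxes = [(40.0, 40.0, 60.0, 60.0)]

background = [m for m in matches if outside_boxes(m, human_boxes)]
consistent = inliers(H, background)
```

Filtering human regions first matters because a person who fills much of the frame can contribute enough matches to pull the estimate toward the person's motion rather than the camera's.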
References:
- H Wang, C Schmid, Action recognition with improved trajectories, ICCV 2013
- H Wang, A Kläser, C Schmid, CL Liu, Dense trajectories and motion boundary descriptors for action recognition, International Journal of Computer Vision, May 2013, Volume 103, Issue 1, pp 60-79