YouTube demo video

Hi, everyone! In this video, we will talk about our new work on event stream-based visual object tracking. The title is "Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline". In this work, we propose the first large-scale, high-resolution event-based tracking dataset, named EventVOT. In addition, we design a novel baseline tracker trained with a hierarchical knowledge distillation strategy, called HDETrack. Next, we will present our work from two perspectives: the background and motivation, and the methodology.

Regarding the background and motivation, I would like to first introduce what visual object tracking is. A short video will give everyone the answer! From this video, we can conclude that visual object tracking is "finding and tracking a specific object of interest in successive video frames, combining visual cues such as color, texture, shape, and motion information to determine the location of the object". In real-life scenarios, we can see many applications of visual object tracking, such as video surveillance, medical diagnosis, intelligent transportation, autonomous driving, and so on.

Inspired by biology, researchers have proposed event cameras. Compared to traditional RGB cameras, event cameras capture the changes in light intensity caused by the movement of a target object and output a continuous stream of events asynchronously. In other words, an event camera only generates events when the target is moving and the lighting intensity changes; it produces no events when the scene is completely stationary relative to the camera. This gives event cameras unique advantages, such as high dynamic range, low latency, low energy consumption, privacy protection, and so on.
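To make this event generation mechanism concrete, here is a toy sketch: a pixel fires an event whenever its log intensity changes by more than a contrast threshold since its last event. The function name, threshold value, and data layout are illustrative assumptions, not the camera's actual pipeline.

```python
import numpy as np

def generate_events(frames, timestamps, C=0.2):
    """Toy event simulator: frames is a [T, H, W] stack of intensity images.
    Returns a list of (x, y, t, polarity) tuples."""
    # Per-pixel log-intensity reference, updated each time a pixel fires.
    log_ref = np.log(frames[0].astype(np.float64) + 1e-6)
    events = []
    for img, ts in zip(frames[1:], timestamps[1:]):
        log_i = np.log(img.astype(np.float64) + 1e-6)
        diff = log_i - log_ref
        ys, xs = np.nonzero(np.abs(diff) >= C)  # pixels crossing the threshold
        for yy, xx in zip(ys, xs):
            events.append((xx, yy, ts, 1 if diff[yy, xx] > 0 else -1))
            log_ref[yy, xx] = log_i[yy, xx]  # reset reference after firing
        # Static pixels never cross the threshold, so they emit no events,
        # matching the description above.
    return events
```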

So, we can naturally think of using the advantages of event cameras to compensate for the shortcomings of traditional RGB cameras. Recently, CEUTrack ("Revisiting Color-Event based Tracking: A Unified Network, Dataset, and Metric") was proposed. The idea of this work is to fuse the RGB and event modalities to achieve effective visual object tracking. It effectively combines the advantages of traditional RGB cameras and novel event cameras, achieving robust tracking results. However, dual-modality inputs bring a heavy computational burden to the network.

Based on the above work, we propose a Hierarchical Knowledge Distillation Framework for event stream-based tracking, called HDETrack. In the training phase, we first train a bimodal teacher network on RGB and event data, similar to CEUTrack. Then, we freeze all the parameters of the teacher network and, in the second stage, train a student network that takes only the event modality as input. At the same time, we design a hierarchical knowledge distillation strategy with three levels of distillation: similarity-based, feature-based, and response-based. Through this multi-level knowledge distillation, the robust teacher network can effectively transfer its knowledge to the student network, enabling it to learn efficiently. During inference, we only use the unimodal student network: given event frames converted from the raw events, it achieves fast inference. In this way, through our carefully designed knowledge distillation strategy, the student network balances tracking speed with strong performance. Specifically, when training on our proposed unimodal event dataset, EventVOT, the teacher network takes event data from multiple views, such as event images and event voxels.
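For intuition, here is a minimal PyTorch sketch of the three distillation levels, assuming hypothetical tensors: teacher/student token features f_t, f_s of shape [B, N, C] and response (score) maps r_t, r_s of shape [B, H, W]. The function names, temperature, and loss weights are placeholders, not the paper's exact formulation.

```python
import torch.nn.functional as F

def similarity_loss(f_s, f_t):
    # Similarity-based distillation: match pairwise token-similarity matrices.
    s = F.normalize(f_s, dim=-1)
    t = F.normalize(f_t.detach(), dim=-1)
    return F.mse_loss(s @ s.transpose(1, 2), t @ t.transpose(1, 2))

def feature_loss(f_s, f_t):
    # Feature-based distillation: directly align intermediate features.
    return F.mse_loss(f_s, f_t.detach())

def response_loss(r_s, r_t, tau=2.0):
    # Response-based distillation: KL divergence between softened score maps.
    p_t = F.softmax(r_t.detach().flatten(1) / tau, dim=-1)
    log_p_s = F.log_softmax(r_s.flatten(1) / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2

def hierarchical_kd_loss(f_s, f_t, r_s, r_t, w=(1.0, 1.0, 1.0)):
    # Weighted sum of the three levels; the weights here are placeholders.
    return (w[0] * similarity_loss(f_s, f_t)
            + w[1] * feature_loss(f_s, f_t)
            + w[2] * response_loss(r_s, r_t))
```

During student training, this distillation objective would be added to the usual tracking loss, while the teacher's parameters stay frozen.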

Next, let me introduce our proposed large-scale, high-resolution event dataset, called EventVOT. The dataset was captured with a high-definition Prophesee event camera and covers 19 categories in total, including balls, animals, drones, and so on. Specifically, we stack the raw events of each video into a fixed format of 499 event frames for easy annotation and processing. Regarding the collection and annotation standards, we follow these points: diversity of target categories, diversity of recording environments, specific recording of event camera characteristics, high-resolution and far-field event signals, annotation by a professional labeling company, and large data scale.
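As a rough sketch of this event-to-frame conversion, the snippet below splits a raw event stream into a fixed number of time windows and accumulates event polarity per pixel. The input layout (separate x, y, t, p arrays) and the accumulation scheme are assumptions for illustration; the paper's exact stacking procedure may differ.

```python
import numpy as np

def events_to_frames(x, y, t, p, height, width, num_frames=499):
    """Split events into num_frames equal time windows and accumulate
    signed polarity (+1 / -1) into one 2D frame per window."""
    frames = np.zeros((num_frames, height, width), dtype=np.float32)
    t0, t1 = t.min(), t.max()
    # Index of the time window each event falls into.
    idx = np.clip(((t - t0) / (t1 - t0 + 1e-9) * num_frames).astype(int),
                  0, num_frames - 1)
    # Scatter-add each event's polarity at its pixel in its window.
    np.add.at(frames, (idx, y, x), np.where(p > 0, 1.0, -1.0))
    return frames
```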

Then, please enjoy the video demo of our dataset. Here, the raw EventVOT events are played back continuously as fixed video frames…

In addition, we define 14 challenging attributes for this dataset, including low illumination, fast motion, small targets, and so on. Under these challenges, we compare our tracker with other SOTA methods. As shown in the figure on the right, our method performs better than the other trackers in scenarios such as target deformation and similar distractors.

In terms of quantitative analysis, we first conduct experiments on our proposed EventVOT dataset. We re-train and report multiple SOTA trackers on EventVOT. Our results are better than those of the other SOTA trackers, including Siamese and Transformer trackers (STARK, MixFormer, PrDiMP, and so on). From the figure on the right, it can be seen that our method achieves a good balance between speed and accuracy, surpassing many excellent tracking methods.

In addition to our newly proposed EventVOT dataset, we also compare our tracker with other SOTA visual trackers on existing event-based tracking datasets, including FE240hz, VisEvent, and COESOT. Our method also achieves good performance on these public datasets, which fully demonstrates the generalization and robustness of our tracking method.

In addition to the quantitative analysis above, we visualize the tracking results of our method and other SOTA trackers on the EventVOT dataset. We can see that tracking with an event camera is an interesting and challenging task: these trackers perform well in simple scenarios, but there is still significant room for improvement.

Besides, we also provide the response maps and similarity maps of the Transformer network for our baseline, student, and teacher networks, respectively. The target object regions are highlighted, which means the tracker focuses accurately on the real targets. Clearly, the regions attended to by our student network, guided by the teacher, are more accurate than those of the baseline approach, second only to the excellent teacher network.

Thanks for your attention! If you are interested in our work, please feel free to contact us!
