【CV】ICCV2015_Unsupervised Learning of Visual Representations using Videos

Unsupervised Learning of Visual Representations using Videos

 Note here: it's a learning note on Prof. Gupta's novel work published on ICCV2015. It's really exciting to know how unsupervised learning method can contribute to learn visual representations! Also, Feifei-Li's group published a paper on video representation using unsupervised method in ICCV2015 almost at the same time! I also wrote a review on it, check it here!

 Link: http://arxiv.org/pdf/1505.00687v2.pdf

 

Motivation:

- Supervised learning is popular for CNN to train an excellent model on various visual problems, while the application of unsupervised learning leaves blank.

- People learn concepts quickly without numerous instances for training, and we learning things in a dynamic, mostly unsupervised environment.

- We’re short of labeled video data to do supervised learning, but we can easily access to tons of unlabeled data through Internet, which can be made use of by unsupervised learning.

 

Proposed Model:

Target: learning visual representations from videos in an unsupervised way

Key idea: tracking of moving object provides supervision

Brief introduction:

- Objective function (constraint): capture the first patch p1 of a moving object, keep tracking of it and get another patch p2 after several frames, then randomly select a negative patch p- from other places. The idea of objective function constrains the distance of p1 and p2 in feature space should be shorter than distance of p1 and p-

 

- Selection of tracking patch: using IDT to obtain SURF interest points to find out which part of the frame moves most. Setting threshold on the ratio of SURF interest points to avoid noise and camera motion.

 

- Tracking: using KCF tracker to track the patch

 

- Overrall pipline:

Feed triplet into three identical CNN, put two fully-connected layers on the top of pooling-5 layer to project into feature space, then computing the ranking loss to back-propagate the network. (note that: these three CNN shares parameters)

 

 

Training strategy:

There’re many empirical details to train a more powerful CNN in this work, however I’m not going to dive into it, only give some brief reviews on some the trick.

- Choose of negative samples:

       - Random selection in the first 10 epochs of training

    - Hard negative mining in later epochs, we search for all the possible negative patches and choose the top K patches which give maximum loss

 

* Intuition on the result:

 

See from the table above, [unsup + fp(3 ensemble)] outperforms other methods on the detection task of bus, car, person and train, but falls far behind on detecting bird, cat, dog and sofa, which may give us some intuitions.

转载于:https://www.cnblogs.com/kanelim/p/5285906.html

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值