Andrej Karpathy's CVPR 2013 reading list



Unfortunately, I did not get a chance to attend this year's CVPR conference, for a few reasons: I didn't have a paper, I started a summer internship at Google and didn't want to put my awesome project (think: a ton of video data, knowledge graph, large neural nets) on hold, I had just finished 2 weeks of traveling and felt that taking yet another week off was pushing it, and there was some reasoning on the financial side too. Having said all that, in retrospect I still regret the decision. I missed out on a lot of fun catching up with friends, some great talks, and juicy gossip, but worst of all, I was left with the daunting task of having to go through all the papers in my own time and by myself.

It took a day and a half, but I managed to get through most of it. I did not get a chance to go through the details of the implementation in each paper, but my hope in what follows is to at least highlight the papers that piqued my curiosity because they seemed to address a good problem, appeared to have a clear exposition, and adopted an approach that feels reasonable at least on a high level. Let's get to it:

Theme: 3D scene understanding

First, it’s exciting to see more people working on models of scenes in 3D world coordinates as opposed to 2D image coordinates. A rather extreme and exceptionally notable example is a paper from Jon Barron that essentially strives to “un-render” an image:

Jonathan Barron, Jitendra Malik
Intrinsic Scene Properties from a Single RGB-D Image 

From the abstract: "Our model takes as input a single RGB-D image and produces as output an improved depth map, a set of surface normals, a reflectance image, a shading image, and a spatially varying model of illumination." Unfortunately, the algorithm assumes depth channel input (and it's not clear that a straightforward extension will make this work on RGB images, especially "in the wild") and some of the results (for example when you rotate the camera around the inferred structure) start to not look that great. However, I still think that this paper has a very high (how well it works) / (how difficult it is) ratio. I wonder if it's possible to incorporate something more non-parametric for the shape model, and I wonder if this could ever work without the assumed input depth channel (I'm sure Jon must be very tired of that question :) ). Maybe it's possible to use some non-parametric depth transfer work (SIFT Flow style) as the initialization for shape instead of the depth image? Also, a brief mention of a related paper that is a collaboration with Derek Hoiem's lab: Boundary Cues for 3D Object Shape Recovery.
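As a point of reference for what "reflectance image" and "shading image" mean here, the simplest Lambertian intrinsic-image model factors the observed image into per-pixel albedo times per-pixel shading. The toy numpy sketch below is my own illustration of that factorization, not anything from the paper:

```python
import numpy as np

# Toy illustration of the Lambertian intrinsic-image factorization (not the paper's model):
# observed image = reflectance (albedo) * shading, so the split is additive in log space.
rng = np.random.default_rng(0)
reflectance = rng.uniform(0.2, 1.0, size=(4, 4))  # per-pixel albedo
shading = rng.uniform(0.1, 1.0, size=(4, 4))      # per-pixel illumination/shading
image = reflectance * shading                      # "rendered" grayscale image

# Recovering reflectance and shading from `image` alone is the ill-posed inverse problem
# this line of work attacks, using priors over shape, illumination, and reflectance.
assert np.allclose(np.log(image), np.log(reflectance) + np.log(shading))
```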

Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese
Understanding Indoor Scenes using 3D Geometric Phrases

I was also excited about the above paper from Silvio Savarese's lab, where the authors attempt to jointly model detection, layout estimation and scene classification. In general, I like the idea of training vision models "above" the pixels: models that aren't necessarily concerned with the appearance of objects but with their spatial relationships in the world and their likelihoods of appearing in different contexts. In this work every scene is represented as a structured object, a scene parse graph. I think we will see more of this kind of work in the future as we get better and especially more plentiful detections of various objects in different parts of the scene. There will be a need for algorithms that take all the (noisy) detections and combine them into a coherent understanding. And I hope that understanding will take the form of some structured object, not just a set of huge feature vectors.

Luca Del Pero, Joshua Bowdish, Bonnie Kermgard, Emily Hartley, Kobus Barnard
Understanding Bayesian rooms using composite 3D object models
 

Above: Another paper that tries to reason about geometric relationships of objects and parts in 3D scenes. Furthermore, there were three papers that reason about cuboids, geometry, physics and structure of the world. I like all of this:

Zhaoyin Jia, Andrew Gallagher, Ashutosh Saxena, Tsuhan Chen
3D-Based Reasoning with Blocks, Support, and Stability
Hao Jiang, Jianxiong Xiao
A Linear Approach to Matching Cuboids in RGBD Images
Bo Zheng, Yibiao Zhao, Joey C. Yu, Katsushi Ikeuchi, Song-Chun Zhu
Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics

Just as an aside, I worked along this general direction myself in the Fall of 2012 during my rotation with Sebastian Thrun, but on meshes acquired from Kinect Fusion. My ICRA paper was on efficiently identifying blob-like mesh segments that likely constitute entire objects, and my followup project was on parsing the segments into a higher-level understanding in terms of shapes, physics, gravity, support relationships, etc. In the images below from left to right: an input mesh was first segmented, then I would hypothesize various cuboids, identify suspiciously coincidental relationships between them (for example geometric: one is almost parallel or perpendicular to another), and use this to clean up and refine all hypotheses in a joint optimization to produce a final, clean set of output cuboids (+ relationships between them). Unfortunately, I didn't end up continuing on this project after the Christmas break and never polished it into a paper, but I think there's something interesting in that direction and, judging from the papers above, several people had similar thoughts.
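To make the "suspiciously coincidental relationships" step concrete, here is a toy numpy sketch (my own illustration for this post, not the original project code) that flags pairs of cuboid axes that are nearly parallel or nearly perpendicular; such flags could then act as soft constraints when jointly refining the hypotheses.

```python
import numpy as np

def axis_relations(R1, R2, tol_deg=5.0):
    """Given two cuboid orientation matrices (columns = axis directions),
    return the (near_parallel, near_perpendicular) axis index pairs."""
    cos_tol = np.cos(np.radians(tol_deg))   # |cos| above this -> nearly parallel
    sin_tol = np.sin(np.radians(tol_deg))   # |cos| below this -> nearly perpendicular
    parallel, perpendicular = [], []
    for i in range(3):
        for j in range(3):
            c = abs(np.dot(R1[:, i], R2[:, j]))  # |cos(angle)| between the two axes
            if c > cos_tol:
                parallel.append((i, j))
            elif c < sin_tol:
                perpendicular.append((i, j))
    return parallel, perpendicular

# Example: an axis-aligned cuboid vs. a copy rotated ~3 degrees about the z-axis.
theta = np.radians(3.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
print(axis_relations(np.eye(3), Rz))
```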

Speaking of 3D/meshes, here is a paper that had a CRF set up on a mesh for classification: Mesh Based Semantic Modelling for Indoor and Outdoor Scenes. In general, I hope to see more papers in Computer Vision conferences that reason about 3D structure of the world and work on meshes and segments. However, it is still difficult to see how we will ever move these methods “into the wild”. I encourage you to do the following exercise for yourself: look at random internet images or images captured by your friends on Facebook and think about how those images and these methods could ever meet. Every time I go through this exercise myself I end up demotivated. And with a headache.

Random “Normal” images. Can you spot the cuboids and reason about support relationships? Can your algorithm?

 

Theme: pushing the Deformable Part Model forward and similar detection models

First, a paper everyone should know about is of course the Best Paper award winner from Google:

Thomas Dean, Mark Ruzon, Mark Segal, Jon Shlens, Sudheendra Vijayanarasimhan, Jay Yagnik
Fast, Accurate Detection of 100,000 Object Classes on a Single Machine

The paper is about a hashing trick for replacing the (relatively expensive) convolution in the DPM model, along with some associated complexity analysis. But even more interesting than the paper itself is extrapolating it into the future. In a few years we will have DPM models that can provide local likelihoods for the presence of tens of thousands of objects on a single machine in a few tens of seconds per image. Many of the detections will likely be noisy and false, but there might be a lot of very interesting work on cleaning them up and making sense of them.
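For intuition about the hashing trick, here is a minimal sketch of winner-take-all (WTA) hashing, the kind of rank-order hash the paper employs, written as my own toy illustration rather than the paper's implementation: each hash band records the argmax among a few randomly permuted dimensions, and windows whose dot product with a filter is large tend to agree with that filter in many bands, so exhaustive convolutions can be traded for cheap hash-table lookups.

```python
import numpy as np

rng = np.random.default_rng(0)

def wta_hash(x, perms, K=4):
    """One WTA code per permutation: the argmax among the first K permuted dimensions."""
    return np.array([int(np.argmax(x[p[:K]])) for p in perms])

d, n_bands = 128, 64
perms = [rng.permutation(d) for _ in range(n_bands)]

filters = rng.standard_normal((1000, d))              # stand-ins for many part filters
filter_codes = np.array([wta_hash(f, perms) for f in filters])

window = filters[42] + 0.1 * rng.standard_normal(d)   # a HOG window resembling filter 42
window_code = wta_hash(window, perms)

band_agreements = (filter_codes == window_code).sum(axis=1)
print(band_agreements.argmax())                        # the matching filter wins the most bands
```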

Also, can I brag at this point that I spent my (very fun) summer internship 2 years ago in Tom Dean's group? Unfortunately, they did not have me work on this project :( Moving on though, this paper from Deva Ramanan and Xiaofeng Ren is interesting, and I had been wondering which group would be the first to try going in this direction:

Xiaofeng Ren, Deva Ramanan
Histograms of Sparse Codes for Object Detection

Current DPM models are based on Histograms of Oriented Gradients (HOG) features. However, it also happens that if you train a dictionary of 8 elements on image patches, you get precisely 8 oriented edges. In other words, a HOG cell is simply a special case of a <puts a deep learning hat on> Normalize -> Convolve with 8 filters -> Non-linearity -> Average Pool in Local Neighborhood </takes hat off>. Well, why not go beyond 8? And how much does it help? I expect a lot more progress to be made in this area, and I think we'll soon be seeing more accurate DPM models that are based on more general filters than just 8 oriented edges (but I think we might need more training data for this to work well?). And in doing so, I think we will also see a lot more connections between DPM models and ConvNets, and there will be insights to learn both ways.
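To spell out the "normalize -> convolve -> non-linearity -> pool" view of a HOG cell, here is a rough numpy/scipy sketch of that pipeline; it is my own illustration, not the actual HOG or Histograms-of-Sparse-Codes implementation. Swapping the 8 hand-built oriented-edge filters for a larger learned sparse-coding dictionary is essentially the generalization the paper explores.

```python
import numpy as np
from scipy.ndimage import convolve

def oriented_edge_bank(n_filters=8, size=5):
    """Build n_filters derivative-of-Gaussian-like oriented edge filters."""
    ys, xs = np.mgrid[-(size // 2):size // 2 + 1, -(size // 2):size // 2 + 1]
    g = np.exp(-(xs**2 + ys**2) / (2.0 * (size / 3.0) ** 2))   # Gaussian envelope
    return [(np.cos(np.pi * i / n_filters) * xs +
             np.sin(np.pi * i / n_filters) * ys) * g
            for i in range(n_filters)]

def hog_like_features(img, cell=8):
    img = (img - img.mean()) / (img.std() + 1e-8)               # normalize
    maps = [np.maximum(convolve(img, f), 0)                     # convolve + nonlinearity
            for f in oriented_edge_bank()]
    H, W = img.shape
    feats = np.stack([                                          # average-pool over cells
        m[:H - H % cell, :W - W % cell]
         .reshape(H // cell, cell, W // cell, cell).mean(axis=(1, 3))
        for m in maps], axis=-1)
    return feats  # (cells_y, cells_x, n_filters): a crude HOG-like descriptor

print(hog_like_features(np.random.rand(64, 64)).shape)  # (8, 8, 8)
```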

Speaking of Deep Learning though, I will also briefly mention a paper from Yann LeCun’s group on training ConvNets to do (pedestrian) detection:

Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, Yann LeCun
Pedestrian Detection with Unsupervised Multi-Stage Feature Learning

I’m also reasonably certain that we’ll be seeing many more models that look like Deep Learning on the bottom near the pixels supplying raw likelihoods per region, which then feed into something more “loopy” that pieces it together — something that more resembles a Probabilistic Graphical Model. By the way, Yann gave a talk at Scene Understanding workshop, and here are the slides.

I should also mention that I know at least one experienced, capable person (who will go unnamed) who tried for a very long time to get Deep Learning to work on Pedestrian Detection (or detection in general) and failed. I think it's nice to look through slides that show Deep Learning models beating other methods, but Deep Learning is still not something you can just sprinkle on your problem and make it better, and I also wonder what did the trick in this work, and why my friend couldn't get it to work for so long. Anyway, in general this is definitely a line of work to keep an eye on!

Lastly, I will mention:

Ben Sapp, Ben Taskar
MODEC: Multimodal Decomposable Models for Human Pose Estimation

There have been a few recent papers that use HOG parts as atoms in interesting detection models, and this is a nice example of that trend. The paper also includes several nice pointers to this general line of work. I decided a while ago that I would go through all of them in detail and figure out the similarities/differences beyond just the high-level intuitions. Alas, that task is now somewhere in the middle of my giant to-read list, and meanwhile the list of these papers grows seemingly exponentially.

Theme: discovering object/scene parts

Lastly, as Tomasz has also noted on his own blog, there is a theme of papers that try to discover mid-level discriminative parts. I quite like all of this work, starting with Poselets, heading into Unsupervised Discovery of Mid-Level Discriminative Patches at ECCV 2012 and What Makes Paris Look Like Paris? at SIGGRAPH 2012, and at this year's CVPR we have:

Mayank Juneja, Andrea Vedaldi, C. V. Jawahar, Andrew Zisserman
Blocks that Shout: Distinctive Parts for Scene Classification
Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Representing Videos using Mid-level Discriminative Patches
Subhransu Maji, Gregory Shakhnarovich
Part Discovery from Partial Correspondence

One more paper/line of work that I think is worth being aware of, and that I'm going to fold under this category because it partly shares the high-level goal, is an object discovery paper based on SIFT Flow / co-segmentation:

Michael Rubinstein, Armand Joulin, Johannes Kopf, Ce Liu
Unsupervised Joint Object Discovery and Segmentation in Internet Images

These all look like fun and promising full reads and I'm looking forward to seeing more work in this general area. Of course, one would like to be ambitious and automatically discover entire DPM models in huge datasets completely unsupervised, and entirely deprecate the idea of bounding box training data. Images (or videos, more likely) go in, DPM models for objects/scenes come out. Anyone?

The unsorted, interesting miscellaneous

Here’s a final, unsorted list of a few more papers that have jumped at me as an interesting read based on a few minutes skim:

 

Yun Jiang, Hema Koppula and Ashutosh Saxena
Hallucinating Humans for Learning Object Affordances
[code available]

The basic argument behind this work is a sound one: our environments are designed for humans, so the human body should be an important component in reasoning about our scenes. I find the idea of sampling human poses on top of acquired geometry interesting and I’m happy to see someone working along these directions.
  ———————
Joseph J. Lim, C. Lawrence Zitnick, Piotr Dollar
Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection
[code available on Github]

Take little pieces of hand-drawn segmentations -> cluster -> train a Random Forest model for predicting them from local, surrounding evidence -> segmentation, contours, detection. It’s interesting that the human annotations for a segmentation task are treated as labels. I also had to include this paper just due to my academic crush on Piotr alone.
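As a rough sketch of that recipe, the toy pipeline below is my own illustration with random stand-in data (the real work trains on human-drawn contours, e.g. from BSDS-style annotations), not the authors' released code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Toy Sketch-Tokens-style pipeline with random stand-in data:
# 1) cluster small patches of hand-drawn contour maps into "sketch token" classes,
# 2) train a random forest to predict the token class from the surrounding image patch.
rng = np.random.default_rng(0)
n, patch = 2000, 15
contour_patches = (rng.random((n, patch * patch)) > 0.9).astype(float)  # binary contour patches
image_patches = rng.random((n, patch * patch * 3))                      # aligned color patches

# Step 1: sketch-token classes = k-means clusters of the contour patches.
tokens = KMeans(n_clusters=16, n_init=4, random_state=0).fit_predict(contour_patches)

# Step 2: random forest maps local image evidence -> token class.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(image_patches, tokens)

# At test time, per-pixel token probabilities can be aggregated into a contour strength map.
print(forest.predict_proba(image_patches[:5]).shape)  # (5, n_token_classes)
```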
  ———————
Joseph Tighe, Svetlana Lazebnik
Finding Things: Image parsing with Regions and Per-Exemplar Detectors
[code available]

This paper merges bottom-up segmentation+classification cues with top-down exemplar-SVM segmentations that are based on label transfer. Ideally, I think it would be much cleaner and more interesting in the future if the two streams somehow helped each other instead of being merely combined. That is, bottom-up segmentation triggers some exemplar object detectors, which could feed down and refine the next pass of bottom-up segmentation, which could lead to more detections? I'd just like to see a loopy process of "figuring out" an image, from the easy to the hard.
   ———————
James Steven Supancic, Deva Ramanan
Self-paced learning for long-term tracking
[code available on Github]

I'm bringing this paper up as an example of another trend to be aware of. It's about tracking moving objects in videos. But, seeing Deva Ramanan behind this paper and reading between the lines, I think it is fair to assume that they are interested in using these tracks for mining data that could be used to build better object detectors. Look out for a cool paper in the near future that takes in YouTube videos unsupervised (or with weak supervision where people seed an initial bounding box), tracks objects, and trains DPM-looking models for everything.
   ———————
C. Lawrence Zitnick, Devi Parikh
Bringing Semantics Into Focus Using Visual Abstraction

Here’s a fun, unique and controversial CVPR paper to be aware of!
   ———————
Jia Deng, Jonathan Krause, Li Fei-Fei
Fine-Grained Crowdsourcing for Fine-Grained Recognition

As many researchers who deal with a lot of visual classes are finding out, our confusion matrices have a lot of blocks and that’s where we lose the most accuracy. That is, we are at the stage where we can reliably tell the difference between an image of a car and that of a bird, but we are finding that differentiating within one or the other requires a whole different set of approaches. The core problem is that telling the difference between two species of birds (or cars, or people, or dogs, etc etc.) usually comes down to very local and minute differences, and the important features corresponding to these crucial regions can get washed out in the sea of irrelevance. In this paper, the authors collect these important regions through crowd sourcing and use them to construct a new “BubbleBank” feature vector. An interesting point is that the crowd is directly affecting the feature representation of the algorithm. With larger datasets, I wonder if it is possible to use the crowd-sourced bubbles as ground truth and attempt to discover these informative regions automatically through some cross-validation scheme?

   ———————
Haroon Idrees, Imran Saleemi, Cody Seibert, Mubarak Shah
Multi-Source Multi-Scale Counting in Extremely Dense Crowd Images

This is a fun paper that caught my attention: the input is an image of a crowd and the output is the number of people in the crowd. In general, I like the theme of super-human vision, and this is at least one example of it that I found. I'd hate to be the person collecting the ground truth for this one!

Additional notable code releases

Here is a list of a few code releases that could be of interest, in addition to the ones I linked to above (though I think the recall on this list is probably low):

- As part of "Efficient Large-Scale Structured Learning", Steve Branson released a toolbox for "multiclass object detection, deformable part models, pose mixture models, localized attribute and classification models, online structured learning, probabilistic user models, and interactive annotation tools for labeling parts and object classes." C++. LINK

- OpenGM is "a C++ template library for discrete factor graph models and distributive operations on these models. It includes state-of-the-art optimization and inference algorithms beyond message passing."

- Here is some tracking code that was released with "Visual Tracking via Locality Sensitive Histograms". Matlab.

- Mohammad Norouzi and David Fleet had a paper on "Cartesian K-means", which proposes a model similar to k-means but whose parameter count scales much more gracefully as the number of centers grows to billions or trillions. They claim that their code will be available soon on Github. A rough sketch of the general flavor of the idea is below.
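The sketch below is my own product-quantization-style toy example of how a compositional codebook gets a huge number of effective centers from very few parameters; Cartesian k-means generalizes this kind of construction (it additionally learns rotations and the details differ), so treat this as intuition only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Split each vector into m blocks and learn k sub-centers per block: any combination of
# sub-centers is a full center, so we get k**m effective centers from only m small codebooks.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 64))
m, k = 8, 256                      # 8 blocks of 8 dims, 256 sub-centers each -> 256**8 effective centers

blocks = np.split(X, m, axis=1)
codebooks = [KMeans(n_clusters=k, n_init=1, random_state=0).fit(b) for b in blocks]

codes = np.stack([cb.predict(b) for cb, b in zip(codebooks, blocks)], axis=1)   # (n, m) compact codes
recon = np.concatenate([cb.cluster_centers_[codes[:, i]]
                        for i, cb in enumerate(codebooks)], axis=1)
print(codes.shape, np.mean((X - recon) ** 2))    # code shape and reconstruction error
```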

Recap

So where does all this leave us? We’re seeing progress on several fronts:
- Good raw likelihoods per pixel. We’re starting to get good at segmentation and assignment of (not too many, flat) visual categories to pixels through various methods based on Features+Random Forests and also Deep Learning. For example, these can provide raw likelihoods at every pixel for various semantic categories (e.g. road-like stuff, porous stuff, car-like stuff, sky-like stuff etc).
- Detection is starting to work, and fast. We're seeing a lot of progress on object detection, especially in time complexity during inference (hashing, sparse coding for part filters, etc.). I'd like to think that we are on the cusp of getting likelihoods for tens of thousands of objects all around the image, noisy as they may be.
- Need to scale up data for training detectors. One current bottleneck to object detection is obtaining good training data for so many different categories. On this front, we’re seeing work that mines images using at most weak supervision to acquire object hypotheses: co-segmentation, part discovery, and we’re seeing work that paves the way for learning these models from video, taking advantage of temporal coherence.
- Fine-grained. We’re seeing work on disentangling broad categories of objects into finer-grained classes.
- Putting it all together. And finally, we’re seeing work on holistic scene understanding that will take all these raw likelihoods, produce a cleaned up version that makes more sense jointly, as well as work that will ground the result in 3D understanding.

The vision for the future

Overall, I eagerly anticipate the time when we start putting all of this together and stop dealing with algorithms that touch pixels. I want to see Computer Vision move to a stage where our datasets will not consist of .jpg files but of .json/.xml/.mat/.whatever files that contain a huge array of raw likelihoods for various objects, scenes, and things: class segmentations, object detections, scene category likelihoods, material classifications, objectness likelihoods, depth estimates, horizon lines, vanishing points, people detections, face detections, pose estimations, location landmark estimation results and product match likelihoods via more standard SIFT+Homography approaches, maybe even OCR results, etc., etc. This should all feed into some nice structured prediction model.
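Just to make this concrete, a pre-digested per-image record could look something like the toy Python structure below; every field name and number here is hypothetical, invented only to illustrate the kind of container I have in mind.

```python
# A purely hypothetical per-image record of pre-computed likelihoods (all fields invented
# for illustration); a structured prediction model would consume many of these at once.
image_record = {
    "image_id": "example_0001",
    "scene_category_likelihoods": {"kitchen": 0.61, "office": 0.22, "street": 0.02},
    "object_detections": [
        {"class": "person", "box": [40, 60, 180, 420], "score": 0.91},
        {"class": "chair",  "box": [120, 200, 260, 380], "score": 0.83},
    ],
    "per_pixel_class_likelihoods": "HxWxC array of segmentation scores (stored separately)",
    "depth_estimate": "HxW array of relative depth",
    "horizon_line_y": 212.5,
    "vanishing_points": [[310, 190], [-1200, 230]],
    "ocr_results": [{"text": "EXIT", "box": [500, 30, 560, 60], "score": 0.95}],
}
```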

I think we're still not quite ready to begin merging it all (and there are significant engineering and practical challenges to doing so as well), but this type of modeling will be an exciting area to be in in a few years. For now, I can at least see the puzzle pieces coming together, and they're starting to form… a picture.

Few additional random thoughts in closing

- Code. I'm very happy to see more people releasing their code and that this is becoming the norm.
- Detective Papers? As great as it is to propose new models of this or that and show that my number is higher than yours, there are not enough "detective papers" that try to do a more thorough analysis of where some subarea is, what the biggest challenges are, and what the most common sources of error are. I'm distinctly reminded of the CVPR 2011 paper "Unbiased Look at Dataset Bias" or the ECCV 2012 paper "Diagnosing Error in Object Detectors". I wish there was more diagnosing and less proposing and advertising.
- High mileage at low cost papers? A large portion of papers propose more complex models that perform a few percent better on some benchmark than the previous state of the art. Where are all the much simpler models that do comparably to the state of the art, or even slightly worse? (Perhaps "Seeking the strongest rigid detector" is one of the few examples I found.) Ideally, we should strive to optimize not for the best performance, but for the best ratio of performance to model complexity to space/time complexity. I do agree that it's hard to measure the latter factors, but we should at least try. I hope these papers aren't getting rejected.
- We still fail at sifting through our work. CVPR 2013 papers can be found on cvpapers, but a much better list is available on cv-foundation. However, in both cases the way the papers are displayed still looks as if it were from the 1990s. Unfortunately, I was personally too busy this time to put together a more reasonable way to explore CVPR papers (like I did for NIPS 2012 (and by the way, @benhammer followed up on it and created one for ICML, nice!)), but next time. Say NO to huge unstructured lists of accepted papers!

 

I'm eager to hear your own conclusions from this year's conference! I hope I didn't miss too many interesting papers, and I'd warmly appreciate pointers in the comments to other interesting papers/themes/ideas/models I may have missed. And now I'm off to read all of this. Cheers!
