CVPR 2013 reading list

http://karpathy.ca/myblog/

July 8, 2013

I did not attend

Unfortunately, I did not get a chance to attend this year’s CVPR conference, for a few reasons — I didn’t have a paper, I had started a summer internship at Google and didn’t want to put my awesome project (think: a ton of video data, knowledge graph, large neural nets) on hold, I had just finished 2 weeks of traveling and felt that taking yet another week off was pushing it, and there were financial considerations too. Having said all that, in retrospect I still regret the decision. I missed out on a lot of fun catching up with friends, some great talks, and juicy gossip, but worst of all, I was left with the daunting task of having to go through all the papers in my own time and by myself.

It took a day and a half, but I managed to get through most of it. I did not get a chance to go through the implementation details of each paper, but my hope in what follows is to at least highlight the papers that piqued my curiosity because they seemed to address a good problem, appeared to have a clear exposition, and adopted an approach that feels reasonable at least at a high level. Let’s get to it:

Theme: 3D scene understanding

First, it’s exciting to see more people working on models of scenes in 3D world coordinates as opposed to 2D image coordinates. A rather extreme and exceptionally notable example is a paper from Jon Barron that essentially strives to “un-render” an image:

Jonathan Barron, Jitendra Malik
Intrinsic Scene Properties from a Single RGB-D Image

From the abstract: “Our model takes as input a single RGB-D image and produces as output an improved depth map, a set of surface normals, a reflectance image, a shading image, and a spatially varying model of illumination.” Unfortunately, the algorithm assumes depth channel input (and it’s not clear that a straightforward extension will make this work on RGB images, especially “in the wild”), and some of the results (for example when you rotate the camera around the inferred structure) start to not look that great. However, I still think that this paper has a very high (how well it works) / (how difficult it is) ratio. I wonder if it’s possible to incorporate something more non-parametric for the shape model, and I wonder if this could ever work without the assumed input depth channel (I’m sure Jon must be very tired of that question :) ). Maybe it’s possible to use some non-parametric depth transfer work (SIFT Flow style) as the initialization for shape instead of the depth image? Also, a brief mention of a related paper that is a collaboration with Derek Hoiem’s lab: Boundary Cues for 3D Object Shape Recovery.

Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese
Understanding Indoor Scenes using 3D Geometric Phrases

I was also excited about the above paper from Silvio Savarese’s lab, where the authors attempt to jointly model detection, layout estimation and scene classification. In general, I like the idea of training vision models “above” the pixels – models that aren’t necessarily concerned with the appearance of objects but with their spatial relationships in the world and their likelihoods of appearing in different contexts. In this work every scene is represented as a structured object — a scene parse graph. I think we will see more of this kind of work in the future as we get better and especially more plentiful detections of various objects in different parts of the scene. There will be a need for algorithms that take all the (noisy) detections and combine them into a coherent understanding. And I hope that understanding will take the form of some structured object, not just a set of huge feature vectors.

Luca Del Pero, Joshua Bowdish, Bonnie Kermgard, Emily Hartley, Kobus Barnard
Understanding Bayesian rooms using composite 3D object models

Above: Another paper that tries to reason about geometric relationships of objects and parts in 3D scenes. Furthermore, there were three papers that reason about cuboids, geometry, physics and structure of the world. I like all of this:

Zhaoyin Jia, Andrew Gallagher, Ashutosh Saxena, Tsuhan Chen
3D-Based Reasoning with Blocks, Support, and Stability
Hao Jiang, Jianxiong Xiao
A Linear Approach to Matching Cuboids in RGBD Images
Bo Zheng, Yibiao Zhao, Joey C. Yu, Katsushi Ikeuchi, Song-Chun Zhu
Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics

Just as an aside, I worked along this general direction myself in the Fall of 2012 during my rotation with Sebastian Thrun, but on meshes acquired from Kinect Fusion. My ICRA paper was on efficiently identifying blob-like mesh segments that likely constitute entire objects, and my followup project was on parsing the segments into a higher-level understanding in terms of shapes, physics, gravity, support relationships, etc. In the images below from left to right: an input mesh was first segmented, then I would hypothesize various cuboids, identify suspiciously coincidental relationships between them (for example geometric: one is almost parallel or perpendicular to another), and use this to clean up and refine all hypotheses in a joint optimization to produce a final, clean set of output cuboids (+ relationships between them). Unfortunately, I didn’t end up continuing this project after the Christmas break and never polished it into a paper, but I think there’s something interesting in that direction and, judging from the papers above, several people had similar thoughts.

Speaking of 3D/meshes, here is a paper that had a CRF set up on a mesh for classification: Mesh Based Semantic Modelling for Indoor and Outdoor Scenes. In general, I hope to see more papers in Computer Vision conferences that reason about 3D structure of the world and work on meshes and segments. However, it is still difficult to see how we will ever move these methods “into the wild”. I encourage you to do the following exercise for yourself: look at random internet images or images captured by your friends on Facebook and think about how those images and these methods could ever meet. Every time I go through this exercise myself I end up demotivated. And with a headache.

Random “Normal” images. Can you spot the cuboids and reason about support relationships? Can your algorithm?

Theme: pushing the Deformable Part Model forward and similar detection models

First, a paper everyone should know about is of course the Best Paper award winner from Google:

Thomas Dean, Mark Ruzon, Mark Segal, Jon Shlens, Sudheendra Vijayanarasimhan, Jay Yagnik
Fast, Accurate Detection of 100,000 Object Classes on a Single Machine

The paper is about a hashing trick for replacing the (relatively expensive) convolution in the DPM model, and some associated complexity analysis. But even more interesting than the paper itself is extrapolating it into the future. In a few years we will have DPM models that can provide local likelihoods for the presence of tens of thousands of objects on a single machine in a few tens of seconds per image. Many of the detections will likely be noisy and false, but there might be a lot of very interesting work on cleaning them up and making sense of them.

Also, can I brag at this point that I spent my (very fun) summer internship 2 years ago in Tom Dean’s group? Unfortunately, they did not have me work on this project :( Moving on though, this paper from Deva Ramanan and Xiaofeng Ren is interesting, and I had been wondering which group would be the first to go in this direction:

Xiaofeng Ren, Deva Ramanan
Histograms of Sparse Codes for Object Detection

Current DPM models are based on Histograms of Oriented Gradients (HOG) features. However, it also happens that if you train a dictionary of 8 elements on image patches, you get precisely 8 oriented edges. In other words, a HOG cell is simply a special case of a <puts a deep learning hat on> Normalize -> Convolve with 8 filters -> Non-linearity -> Average Pool in Local Neighborhood </takes hat off>. Well, why not go beyond 8? And how much does it help? I expect a lot more progress to be made in this area, and I think we’ll soon be seeing more accurate DPM models that are based on more general filters than just 8 oriented edges (but I think we might need more training data for this to work well?). And in doing so, I think we will also see a lot more connections between DPM models and ConvNets, and there will be insights to learn both ways.
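
To make the analogy concrete, here is a minimal NumPy sketch of that pipeline, with the 8 hard-coded oriented-edge filters swapped for an arbitrary (possibly learned) filter bank. The filter sizes, the normalization scheme and the rectifying non-linearity are illustrative assumptions, not the exact recipe from the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def hog_like_features(img, filters, cell=8, eps=1e-5):
    """Normalize -> convolve with a filter bank -> non-linearity -> average-pool per cell.
    With ~8 oriented-edge filters this is roughly a HOG cell; with a larger dictionary
    learned from patches it becomes the sparse-code generalization discussed above."""
    img = (img - img.mean()) / (img.std() + eps)              # crude normalization
    resp = np.stack([convolve2d(img, f, mode='same') for f in filters])
    resp = np.maximum(resp, 0)                                # rectifying non-linearity
    K, H, W = resp.shape
    Hc, Wc = H // cell, W // cell
    pooled = resp[:, :Hc * cell, :Wc * cell] \
        .reshape(K, Hc, cell, Wc, cell).mean(axis=(2, 4))     # average pool per cell
    return pooled                                             # shape: (num_filters, Hc, Wc)

# 8 random filters standing in for a dictionary trained on image patches
bank = [np.random.randn(7, 7) for _ in range(8)]
features = hog_like_features(np.random.rand(128, 128), bank)
```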

Speaking of Deep Learning though, I will also briefly mention a paper from Yann LeCun’s group on training ConvNets to do (pedestrian) detection:

Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, Yann Lecun
Pedestrian Detection with Unsupervised and Multi-Stage Feature Learning

I’m also reasonably certain that we’ll be seeing many more models that look like Deep Learning on the bottom near the pixels supplying raw likelihoods per region, which then feed into something more “loopy” that pieces it together — something that more resembles a Probabilistic Graphical Model. By the way, Yann gave a talk at Scene Understanding workshop, and here are the slides.

I should also mention that I know at least one experienced, capable person (who will go unnamed) who tried for a very long time to get Deep Learning to work on pedestrian detection (or detection in general) and failed. It’s nice to look through slides that show Deep Learning models beating other methods, but Deep Learning is still not something you can just sprinkle on your problem to make it better, and I wonder what did the trick in this work, and why my friend couldn’t get it to work for so long. Anyway, in general this is definitely a line of work to keep an eye on!

Lastly, I will mention:

Ben Sapp, Ben Taskar
MODEC: Multimodal Decomposable Models for Human Pose Estimation

There have been a few recent papers that use HOG parts as atoms in interesting detection models, and this is a nice example of that trend. The paper also includes several nice pointers to this general line of work. I decided a while ago that I would go through all of them in detail and figure out the similarities/differences beyond just the high-level intuitions. Alas, that task is now somewhere in the middle of my giant to-read list and meanwhile, the list of these papers grows seemingly exponentially.

Theme: discovering object/scene parts

Lastly, as Tomasz has also noted on his own blog, there is a theme of papers that try to discover mid-level discriminative parts. I quite like all of this work, starting with Poselets, heading into Unsupervised Discovery of Mid-Level Discriminative Patches at ECCV 2012, What Makes Paris Look Like Paris? at SIGGRAPH 2012,  and at this year’s CVPR we have:

Mayank Juneja, Andrea Vedaldi, C. V. Jawahar, Andrew Zisserman
Blocks that Shout: Distinctive Parts for Scene Classification
Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Representing Videos using Mid-level Discriminative Patches
Subhransu Maji, Gregory Shakhnarovich
Part Discovery from Partial Correspondence

One more paper/line of work that I think is worth being aware of, and that I’m going to fold under this category because it partly shares the high-level goal, is an object discovery paper based on SIFT Flow / co-segmentation:

Michael Rubinstein, Armand Joulin, Johannes Kopf, Ce Liu
Unsupervised Joint Object Discovery and Segmentation in Internet Images

These all look like fun and promising full reads and I’m looking forward to seeing more work in this general area. Of course, one would like to be ambitious and automatically discover entire DPM models in huge datasets completely unsupervised and entirely deprecate the idea of bounding box training data. Images (or videos, more likely) go in, DPM models for objects/scenes come out. Anyone?

The unsorted, interesting miscellaneous

Here’s a final, unsorted list of a few more papers that jumped out at me as an interesting read based on a few minutes of skimming:

Yun Jiang, Hema Koppula and Ashutosh Saxena
Hallucinating Humans for Learning Object Affordances
[code available]

The basic argument behind this work is a sound one: our environments are designed for humans, so the human body should be an important component in reasoning about our scenes. I find the idea of sampling human poses on top of acquired geometry interesting and I’m happy to see someone working along these directions.
 ———————
Joseph J. Lim, C. Lawrence Zitnick, Piotr Dollar
Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection
[code available on Github]

Take little pieces of hand-drawn segmentations -> cluster -> train a Random Forest model for predicting them from local, surrounding evidence -> segmentation, contours, detection. It’s interesting that the human annotations for a segmentation task are treated as labels. I also had to include this paper just due to my academic crush on Piotr alone.
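
As a rough schematic of that pipeline (the actual paper uses structured random forests over specific channel features, so treat every shape and feature below as a made-up placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: pieces of human-drawn contours, plus image features for the same patches.
sketch_patches = (np.random.rand(5000, 15 * 15) > 0.8).astype(float)  # 15x15 hand-drawn bits
image_features = np.random.rand(5000, 200)                            # e.g. gradient/color channels

# 1) cluster the hand-drawn patches into "sketch token" classes
tokens = KMeans(n_clusters=16, n_init=10).fit_predict(sketch_patches)

# 2) learn to predict the token from local image evidence
clf = RandomForestClassifier(n_estimators=100).fit(image_features, tokens)

# 3) at test time every patch gets a distribution over tokens; summing the non-background
#    probabilities yields a contour-strength map usable for detection features
token_probs = clf.predict_proba(np.random.rand(10, 200))
```
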
 ———————
Joseph Tighe, Svetlana Lazebnik
Finding Things: Image parsing with Regions and Per-Exemplar Detectors
[code available]

This paper merges bottom-up segmentation+classification cues with top-down exemplar SVM segmentations that are based on label transfer. Ideally, I think it would be much cleaner and more interesting in the future if the two streams somehow helped each other instead of being merely combined. That is, bottom-up segmentation triggers some exemplar object detectors, which could feed down and refine the next pass of bottom-up segmentation, which could lead to more detections? I’d just like to see a loopy process of “figuring out” an image, from the easy to the hard.
   ———————
James Steven Supancic, Deva Ramanan
Self-paced learning for long-term tracking
[code available on Github]

I’m bringing this paper up as an example of another trend to be aware of. It’s about tracking moving objects in videos. But, seeing Deva Ramanan behind this paper and reading between the lines, I think it is fair to assume that they are interested in using these tracks for mining data that could be used to build better object detectors. Look out for a cool paper in the near future that takes in YouTube videos unsupervised (or with weak supervision where people seed an initial bounding box), tracks objects, and trains DPM-looking models for everything.
   ———————
C. Lawrence Zitnick, Devi Parikh
Bringing Semantics Into Focus Using Visual Abstraction

Here’s a fun, unique and controversial CVPR paper to be aware of!
   ———————
Jia Deng, Jonathan Krause, Li Fei-Fei
Fine-Grained Crowdsourcing for Fine-Grained Recognition

As many researchers who deal with a lot of visual classes are finding out, our confusion matrices have a lot of blocks, and that’s where we lose the most accuracy. That is, we are at the stage where we can reliably tell the difference between an image of a car and that of a bird, but we are finding that differentiating within one category or the other requires a whole different set of approaches. The core problem is that telling the difference between two species of birds (or cars, or people, or dogs, etc.) usually comes down to very local and minute differences, and the important features corresponding to these crucial regions can get washed out in a sea of irrelevance. In this paper, the authors collect these important regions through crowdsourcing and use them to construct a new “BubbleBank” feature vector. An interesting point is that the crowd is directly affecting the feature representation of the algorithm. With larger datasets, I wonder if it is possible to use the crowd-sourced bubbles as ground truth and attempt to discover these informative regions automatically through some cross-validation scheme?

   ———————
Haroon Idrees, Imran Saleemi, Cody Seibert, Mubarak Shah
Multi-Source Multi-Scale Counting in Extremely Dense Crowd Images

This is a fun paper that caught my attention: the input is an image of a crowd and the output is the number of people in the crowd. In general, I like the theme of super-human vision and this is at least one example of it that I found. I’d hate to be the person collecting the ground truth for this one!

Additional notable code releases

Here is a list of a few code releases that could be of interest, in addition to the ones I linked to above (though I think the recall of this list is probably low):

-  As part of “Efficient Large-Scale Structured Learning“, Steve Branson released a toolbox for “multiclass object detection, deformable part models, pose mixture models, localized attribute and classification models, online structured learning, probabilistic user models, and interactive annotation tools for labeling parts and object classes.”  C++. LINK

- OpenGM is “a C++ template library for discrete factor graph models and distributive operations on these models. It includes state-of-the-art optimization and inference algorithms beyond message passing.”

- Here is some tracking code that was released with “Visual Tracking via Locality Sensitive Histograms“. Matlab.

- Mohammad Norouzi and David Fleet had a paper on “Cartesian K-means”, which proposes a model similar to k-means whose parameter count scales much more gracefully as the number of centers goes to billions or trillions. They claim that their code will be available soon on Github.
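
The appeal is easy to see with a quick sketch of the compositional-codebook idea (this shows the product-quantization flavor of it; the actual Cartesian k-means model also learns a rotation, which I omit, and all sizes below are made up):

```python
import numpy as np

# Compositional codebook: split the d-dim space into m subspaces with h centers each,
# implicitly representing h**m composite centers while storing only m*h sub-centers.
d, m, h = 128, 8, 256                      # 256**8 ~ 1.8e19 implicit centers
subdim = d // m
codebooks = np.random.randn(m, h, subdim)  # parameters: m * h * subdim floats

def encode(x):
    """Quantize each subvector independently -> m small codes per point."""
    parts = x.reshape(m, subdim)
    return np.array([np.argmin(((C - p) ** 2).sum(1)) for C, p in zip(codebooks, parts)])

def decode(codes):
    """Reconstruct the (implicit) composite center from the m codes."""
    return np.concatenate([codebooks[i, c] for i, c in enumerate(codes)])

codes = encode(np.random.randn(d))
approx_center = decode(codes)
```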

Recap

So where does all this leave us? We’re seeing progress on several fronts:
- Good raw likelihoods per pixel. We’re starting to get good at segmentation and assignment of (not too many, flat) visual categories to pixels through various methods based on Features+Random Forests and also Deep Learning. For example, these can provide raw likelihoods at every pixel for various semantic categories (e.g. road-like stuff, porous stuff, car-like stuff, sky-like stuff etc).
- Detection is starting to work, and fast. We’re seeing a lot of progress on object detection, especially in time complexity during inference (hashing, sparse coding for part filters, etc.). I’d like to think that we are on the cusp of getting likelihoods for tens of thousands of objects all around the image, noisy as they may be.
- Need to scale up data for training detectors. One current bottleneck to object detection is obtaining good training data for so many different categories. On this front, we’re seeing work that mines images using at most weak supervision to acquire object hypotheses: co-segmentation, part discovery, and we’re seeing work that paves the way for learning these models from video, taking advantage of temporal coherence.
- Fine-grained. We’re seeing work on disentangling broad categories of objects into finer-grained classes.
- Putting it all together. And finally, we’re seeing work on holistic scene understanding that will take all these raw likelihoods, produce a cleaned up version that makes more sense jointly, as well as work that will ground the result in 3D understanding.

The vision for the future

Overall, I eagerly anticipate the time when we start putting all of this together and stop dealing with algorithms that touch pixels. I want to see Computer Vision move to a stage where our datasets will not consist of .jpg files but of .json/.xml/.mat/.whatever files that contain a huge array of raw likelihoods for various objects, scenes, and things – class segmentations, object detections, scene category likelihoods, material classifications, objectness likelihoods, depth estimates, horizon lines, vanishing points, people detections, face detections, pose estimations, location landmark estimation results and product match likelihoods via more standard SIFT+Homography approaches, maybe even OCR results, etc., etc. This should all feed into some nice structured prediction model.
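
To make that a little more tangible, here is a completely hypothetical example of what one record of such a dataset might look like; every field name and value below is invented for illustration:

```python
# A hypothetical per-image record: raw likelihoods from many independent vision
# modules, waiting to be reconciled by a downstream structured model.
image_record = {
    "image_id": "000123",
    "scene_likelihoods": {"kitchen": 0.61, "office": 0.22, "corridor": 0.17},
    "detections": [
        {"class": "person", "box": [34, 50, 120, 310], "score": 0.92},
        {"class": "chair",  "box": [200, 180, 290, 300], "score": 0.41},
    ],
    "pixel_label_maps": "path/to/per-class-probability-maps.npz",
    "depth_estimate": "path/to/depth.npz",
    "horizon_line_y": 212,
    "vanishing_points": [[640, 200], [-1500, 230]],
    "faces": [{"box": [60, 62, 98, 110], "identity_scores": {"unknown": 0.8}}],
    "ocr": [{"text": "EXIT", "box": [500, 20, 560, 50], "score": 0.77}],
}
```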

I think we’re still not quite ready to begin merging it all (and there are significant engineering and practical challenges to doing so as well), but this type of modeling will be an exciting area to be in in a few years, and for now I can at least see the puzzle pieces coming together, and they’re starting to form… a picture.

A few additional random thoughts in closing

- Code. I’m very happy to see more people releasing their code and that this is becoming the norm.
- Detective Papers? As great as it is to propose new models of this or that and show that my number is higher than yours, there are not enough “detective papers” that do a more thorough analysis of where some subarea is, what the biggest challenges are, and what the most common sources of error are. I’m distinctly reminded of the CVPR 2011 paper “Unbiased Look at Dataset Bias” or the ECCV 2012 paper “Diagnosing Error in Object Detectors“. I wish there was more diagnosing and less proposing and advertising.
- High mileage at low cost papers? A large portion of papers propose more complex models that perform a few % better on some benchmark than the previous state of the art. Where are all the much simpler models that do comparably to the state of the art, or even slightly worse? (Perhaps “Seeking the strongest rigid detector” is one of the few examples I found.) Ideally, we should strive to optimize not for the best performance, but for the best ratio of performance to model complexity to space/time complexity. I do agree that it’s hard to measure the latter factors, but we should at least try. I hope these papers aren’t getting rejected.
- We still fail at sifting through our work. CVPR 2013 papers can be found on cvpapers, but a much better list is available on cv-foundation. However, in both cases the way the papers are displayed still looks as if it were from the 1990s. Unfortunately, I was personally too busy this time to put together a more reasonable way to explore CVPR papers (like I did for NIPS 2012 (and by the way @benhammer followed up on it and created one for ICML, nice!)), but next time. Say NO to huge unstructured lists of accepted papers!

I’m eager to hear your own conclusions from this year’s conference! I hope I didn’t miss too many interesting papers and I’d warmly appreciate pointers in the comments to other interesting papers/themes/ideas/models I may have missed. And now I’m off to read all of this. Cheers!


Collection of Subjectively Interesting Papers via Pinterest

<shortrandomblogpost>

I understand that Pinterest is not a very popular service among academics, but I’ve found it to be useful for keeping track of papers that have made an impression on me. I also wish others took some time to curate their own lists, as it would help me build reading lists.

Find the board here: http://pinterest.com/karpathy/research/

Have I missed some awesome papers?

</shortrandomblogpost>


On Expediting the Discovery of Relevant Academic Literature

I wanted to share a few quick thoughts and some analysis about the NIPS 2012 papers visualization page I put together a few weeks ago, and maybe get a bit of discussion going about the future. For those not familiar, very briefly, the page displays a list of all papers accepted to the conference, but also shows small paper thumbnails and a list of the top 100 words in each paper, color-coded by topic, and it offers functionality to sort all accepted papers by a topic or by similarity to any paper. This allows one to more quickly sift through the huge number of papers and find the ones that are most relevant. The page went on to collect a few thousand hits over a period of a few weeks (details in the figures, for those interested).
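
Conceptually the ranking machinery is very simple; a minimal sketch of the “sort everything by similarity to this paper” feature could look like the following (the page itself uses LDA topics and top-word lists, so take tf-idf here as a stand-in, and the inputs as placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# papers: full-text strings, one per accepted paper (hypothetical inputs)
papers = ["deep learning for ...", "bayesian nonparametrics ...", "kernel methods ..."]

tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
X = tfidf.fit_transform(papers)                  # (num_papers, vocab) sparse matrix

query = 0                                        # "sort by similarity to paper 0"
scores = cosine_similarity(X[query], X).ravel()  # similarity of paper 0 to all papers
ranking = scores.argsort()[::-1]                 # most similar papers first
```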

The project started very innocently with my frustration at going through the accepted papers on the official NIPS page late one Friday. It took only a few minutes before I threw my hands up, closed the tab and spent the next few AM hours putting together version 0.1. I then thought I should release it because I found it personally useful, and I added the LDA and other bells and whistles on top as it somehow became popular over the weekend.

My takeaway from the popularity of the page is that there is demand for these kinds of visualizations and interactive ranking schemes, and I’ve become quite excited about possible future directions as a result. I already had a few suggestions from people at NIPS, ranging from personalized recommendations for particular authors to various other fancy visualizations and embeddings, or more social comment/voting features. I’m soliciting more thoughts.

But more generally, something I’m also excited about and have already started to build is extending this into a full-blown academic paper search engine. Because Google Scholar is… uh… okay, but I think one can go way beyond what’s done there in terms of presentation, and that presentation should not be underestimated. I’ve already written large chunks of the client/backend for it in Python, but I’m currently stuck on parsing papers from NIPS/ICML/etc. for previous years and creating a structured database out of the unstructured mess that exists out there, spread across several pages and papers that change their format every year. That’s the bottleneck that requires a lot of manual and tedious effort, and I’m not quite sure how to deal with it. I will probably end up spending the time to do a few conferences/years, and then, if it turns out that the service is at all interesting or useful, see if I can put it up on Github and get others to help crowdsource more data? (I already tried that once a year ago as it turns out, but it was mostly a failure for some good reasons.) The details are fuzzy, but at least the idea is that what you see on the page would be a special case of a search for NIPS 2012, which one could go on to refine interactively with additional queries that bias the ranking towards certain topics, authors, keywords, etc.

In the short term, I’d like to continue producing a similar page for conferences when I can spare the time. I also released the code on Github under my favorite license (the WTFPL — “Do What The Fuck You Want To Public License” :D ), and I welcome and encourage anyone to build on it and release their own versions. In the longer term, my eyes are set more generally on a visually nice, user-friendly (these two are very important in my mind) and competent academic search across conferences, perhaps starting with a particular niche first. As I mentioned, I am a little stuck on that one for now.

I welcome any thoughts on this page, its short-term use for upcoming conferences, and more generally on how we could go about building pages that help us expedite discovery and analysis of relevant academic literature.


Renewable Energy, Climate Change

I’ve recently become interested in sustainable energy, climate, green technologies, electric cars, etc. I’ve been reading random blogs and articles about these topics for a while, but only recently have I decided to investigate these issues more exhaustively, after discovering that David MacKay (yes, the awesome Machine Learning / Physicist one) wrote a (free PDF!) book about these topics in 2008, called “Sustainable Energy: Without the Hot Air”. A few months after he published the book, he was appointed Chief Scientific Advisor to the Department of Energy and Climate Change in the UK, where he now spends 80% of his time (and 20% back at Cambridge). The book is an interesting read and I thought it would be fun to dedicate a blog post to my own (shorter) notes, interpretations and conclusions. Who knows, maybe I’ll sway a few of my readers to become just as obsessed :)

Problem statement: Climate and Energy. So here’s the problem. As humans, we do a lot of stuff (travel, eat, heat, build, etc.) and doing stuff requires energy. Presently, about 80% of that energy can be traced back to burning fossil fuels (coal, oil, gas mined from the Earth). Now, the problem is that fossil fuels are a finite resource and we are consuming them at alarming rates. In fact, at predicted rates we may run out of these precious resources in 50-150 years. The problem gets worse: fossil fuels are a very useful form of matter that takes hundreds of millions of years to form underground and can be used in all kinds of interesting ways (to create plastics, for example) other than simply burning them for energy.

But the problem gets even worse: burning fossil fuels releases large amounts of carbon dioxide into the atmosphere, and this is very worrying. CO2 is a greenhouse gas, and a surplus of it in the atmosphere above what the Earth nominally produces leads to a warmer planet, which among other things melts ice and generally causes a whole cascade of events that upset the balance of the entire ecosystem. The problem is that nobody is really certain just how fragile our ecosystem is in the face of this sudden rise of CO2 levels over the ~200 years since the industrial revolution, and there are a lot of scary disaster scenarios involving feedback loops that end with irreversible damage done to the planet and its life. The bottom line is that Earth’s climate is a complex system that is infinitely precious, and by burning fossil fuels we are really stretching its limits and playing with fire. The conclusion is inescapable and clear: we need to significantly reduce the rate at which we burn fossil fuels, we need to do it very quickly, and we need to do it in the face of ever-increasing demand for energy from the fast-paced society we live in. So, what are our options?

Renewable energy options. Let’s first consider harvesting sustainable energy from the most preferable source: the sun. We are under constant bombardment by about 174 petawatts of FREE energy from the sun (this is a LOT, by the way; human energy consumption is at about 0.01% of this). About 30% is reflected back to space by the Earth, but the other 70% is pumped into clouds, oceans and the land mass in various ways.

- Our first chance to harvest this energy is most directly through solar panels.
- A part of this energy goes into warming air, which rises and causes convection, wind, cyclones, etc. This 2nd grade sun’s energy can be harvested with wind turbines.
- Wind blows over oceans and causes waves. Waves are 3rd grade sun’s energy and can be harvested on ocean surfaces using wave energy converters.
- Heated water also evaporates and rains back down on Earth. Flowing water in rivers can be harvested as 2nd grade sun’s energy through hydro plants.
- Earth’s biomass absorbs sunlight through photosynthesis to create plants, and animals that feed on plants. Plants, animals and their byproducts (for example, ethanol) can be harvested for energy, but this is not always considered renewable because a lot of other nutrients are consumed in the process.

All of the above options have their pros and cons that I will go into shortly. However, we are not done with Earth’s sources of energy! Earth has more energy stored in it that can, for all practical purposes, be considered renewable. Namely:

- Earth has a huge amount of molten hot material under the surface. This heat can be harvested by digging tunnels deep into the crust. That is, we can harvest geothermal energy.
- The Moon exerts its gravitational influence on our planet and causes tides. Energy from all this water rising and falling can be harvested through tidal pools and similar technologies. Exactly where does this energy come from, you may ask? :) What is being used up? It’s not obvious, but the energy in tides can be traced back to Earth’s rotational energy. Tides are “using up” Earth’s rotational energy, and Earth is actually slowing down its rotation as a result (see Tidal acceleration)! For example, it turns out that around 600 million years ago a day was about 22 hours.

Debatably renewable energy options. Two other contenders harness the energy stored (essentially) in the configurations of matter found on our planet. Mainly, I’m referring to the energy stored in certain heavy elements such as Uranium, Plutonium and Thorium that can be harvested through Nuclear Fission, and the binding energy due to the strong nuclear force that may one day (maybe) be harnessed by fusing light nuclei (such as the hydrogen in heavy water, or lithium) in Nuclear Fusion.

Pros/Cons, Economics. The short story is this. Photovoltaics is a young technology but shows the most long-term promise for clean, renewable energy, and is my personal favorite by far. The main limitation right now is the cost of the technology: we simply haven’t yet figured out how to build cheap solar panels and manufacture them at scale. Wind power is also a promising and clean resource that can be used alongside solar panels. However, both of these sources suffer from being intermittent because they depend on cloud cover and the amount of wind. This is a big problem because it is difficult, lossy and expensive to store energy for release at will. Ideally, we would be able to extract exactly as much energy as is needed at any point in time and no more. Our best options for more reliable and stable energy are hydro (as water’s potential energy can be cheaply stored in dams) and geothermal energy. However, both of these are not very scalable, so they should be used in limited quantities to supplement wind and solar when supply is not meeting demand.

Nuclear fission is a controversial source of power due to worries about the storage of radioactive byproducts that take on the order of thousands of years to dissipate and must be carefully stored deep underground. In addition, people are worried about nuclear material leaks through error/terrorism and the potential for these plants to be used as an excuse to build nuclear weapons (see this TED debate on whether or not we need nuclear, from people who actually know what they’re talking about). From what I understand, it is also not clear if we can go on using fission forever, because we are consuming minable Uranium that will run out on the scale of a hundred years or so. We may be able to use so-called fast breeder reactors while extracting Uranium from the ocean, which would allow the technology to yield a lot of energy over very long time-scales, but these are mostly conjectures at the moment. Similarly, Nuclear Fusion is presently the stuff of dreams and no one is certain if it will ever work; projections of working reactors currently range in the decades. However, if we were able to get nuclear fusion to work, it would completely and utterly dwarf all other renewable energy sources put together and provide clean energy for millions of years. Something to keep your eyes on! :)

Final opinion: My uneducated novice opinion on what government should do about this crisis, based on what I’ve read so far: Crank up production of mostly solar and a little more wind. Start reducing the contribution from fossil fuels and, more slowly, that of nuclear power. (I oppose nuclear power but I am also worried we may need it. For now, I choose to trust some research reports that suggest we don’t, and I also choose to believe that through research we can significantly improve solar technology and its scalability.) Next, build a few fewer 2-billion-dollar carrier ships and throw the first half of that money into photovoltaics research: incentivize the use of solar power through tax cuts to create additional demand, and support startups and technology companies that enter this sector. Throw the other half into programs that support purely electric and self-driving vehicles. I also don’t think popular opinion among ordinary people should be underestimated as a catalyst for change. Spend a last percent or two on making green technologies cool to ordinary people through propaganda programs – YouTube channels, viral videos, interactive sites, and getting popular media figures to endorse these technologies and educate their followers.

What can you do? Based on David MacKay’s analysis, the biggest energy sinks of an average person that can be influenced are transportation (your car) and heating/cooling in your house. So here’s what you should do: Buy a Tesla Model S all-electric vehicle (or one of its descendants in the near future. These cars can now also be charged on the Supercharger network for free, and the Supercharger network gets its power from solar, so you can be riding for free on pure sunlight!). Next, work on reducing the power drawn by the heating system in your house: consider getting Nest, the learning thermostat, and also consider upgrading the insulation in your house. Replace all your light bulbs with new, significantly more efficient LED lights. Finally, for extra cool points, cover your roof with solar panels using, for example, SolarCity.

Future plans. I dream of a future in which we consume 100% renewable energy (mostly solar, wind, some hydro) and ride around exclusively in self-driving, fully electric vehicles. I’ve read through a few reports (like this one from Stanford) that outline plans to transition to 100% renewable energy, usually by around 2050. Obama called for 80% renewable energy by 2035, and naturally some think even that is too ambitious. Meanwhile, I think Denmark is in the lead, as it has passed legislation that commits the country to 100% renewables by 2050. I hope to see more countries follow!

BONUS: some notes on my future home :)


The state of Computer Vision and AI: we are really, really far away.


The picture above is funny.

But for me it is also one of those examples that make me sad about the outlook for AI and for Computer Vision. What would it take for a computer to understand this image as you or I do? I challenge you to think explicitly of all the pieces of knowledge that have to fall in place for it to make sense. Here is my short attempt:

- You recognize it is an image of a bunch of people and you understand they are in a hallway
- You recognize that there are 3 mirrors in the scene so some of those people are “fake” replicas from different viewpoints.
- You recognize Obama from the few pixels that make up his face. It helps that he is in his suit and that he is surrounded by other people with suits.
- You recognize that there’s a person standing on a scale, even though the scale occupies only very few white pixels that blend with the background. But, you’ve used the person’s pose and knowledge of how people interact with objects to figure it out.
- You recognize that Obama has his foot positioned just slightly on top of the scale. Notice the language I’m using: It is in terms of the 3D structure of the scene, not the position of the leg in the 2D coordinate system of the image.
- You know how physics works: Obama is leaning in on the scale, which applies a force to it. A scale measures the force applied to it (that’s how it works), so it will over-estimate the weight of the person standing on it.
- The person measuring his weight is not aware of Obama doing this. You derive this because you know his pose, you understand that the field of view of a person is finite, and you understand that he is not very likely to sense the slight push of Obama’s foot.
- You understand that people are self-conscious about their weight. You also understand that he is reading off the scale measurement, and that shortly the over-estimated weight will confuse him because it will probably be much higher than what he expects. In other words, you reason about implications of the events that are about to unfold seconds after this photo was taken, and especially about the thoughts and how they will develop inside people’s heads. You also reason about what pieces of information are available to people.
- There are people in the back who find the person’s imminent confusion funny. In other words, you are reasoning about the states of mind of people, and their view of the state of mind of another person. That’s getting frighteningly meta.
- Finally, the fact that the perpetrator here is the president makes it maybe even a little funnier. You understand what actions are more or less likely to be undertaken by different people based on their status and identity.

I could go on, but the point here is that you’ve used a HUGE amount of information in that half second when you look at the picture and laugh. Information about the 3D structure of the scene, confounding visual elements like mirrors, identities of people, affordances and how people interact with objects, physics (how a particular instrument works,  leaning and what that does), people, their tendency to be insecure about weight, you’ve reasoned about the situation from the point of view of the person on the scale, what he is aware of, what his intents are and what information is available to him, and you’ve reasoned about people reasoning about people. You’ve also thought about the dynamics of the scene and made guesses about how the situation will unfold in the next few seconds visually, how it will unfold in the thoughts of people involved, and you reasoned about how likely or unlikely it is for people of particular identity/status to carry out some action. Somehow all these things come together to “make sense” of the scene.

It is mind-boggling that all of the above inferences unfold from a brief glance at a 2D array of R,G,B values. The core issue is that the pixel values are just the tip of a huge iceberg, and deriving the entire shape and size of the iceberg from prior knowledge is the most difficult task ahead of us. How can we even begin to write an algorithm that can reason about the scene like I did? Forget for a moment the inference algorithm that is capable of putting all of this together; how do we even begin to gather data that can support these inferences (for example, how a scale works)? How do we go about even giving the computer a chance?

Now consider that the state-of-the-art techniques in Computer Vision are tested on things like ImageNet (the task of assigning 1-of-k labels to entire images), or the PASCAL VOC detection challenge (+ include bounding boxes). There is also quite a bit of work on pose estimation, action recognition, etc., but it is all specific, disconnected, and only half works. I hate to say it, but the state of CV and AI is pathetic when we consider the task ahead, and when we think about how we can ever go from here to there. The road ahead is long, uncertain and unclear. I’ve seen some arguments that all we need is lots more data from images, video, maybe text, and to run some clever learning algorithm: maybe a better objective function, run SGD, maybe anneal the step size, use adagrad, or slap an L1 here and there and everything will just pop out. If we only had a few more tricks up our sleeves! But to me, examples like this illustrate that we are missing many crucial pieces of the puzzle and that a central problem will be as much about obtaining the right training data in the right form to support these inferences as it will be about making them. Thinking about the complexity and scale of the problem further, a seemingly inescapable conclusion for me is that we may also need embodiment, and that the only way to build computers that can interpret scenes like we do is to allow them to be exposed to all the years of (structured, temporally coherent) experience we have, the ability to interact with the world, and some magical active learning/inference architecture that I can barely even imagine when I think backwards about what it should be capable of.

In any case, we are very, very far and this depresses me. What is the way forward? :(  Maybe I should just do a startup. I have a really cool idea for a mobile social local iPhone app.

EDIT: A friend pointed me to an awesome, relevant presentation by Josh Tenenbaum from AAAI 2012, “How to Grow a Mind: Statistics, Structure and Abstraction“.  I think we’re on the same page, except he’s probably at least 100x ahead of me.


Khan Academy + Computer Science

Exciting developments – Khan Academy recently revealed a neat interactive, live programming sandbox running Javascript on their website. I like that they went with the Javascript + Processing library combo for this purpose. The idea is that the best way to get children interested in Computer Science is not to start by getting them to write Hello World and binary search, but to have them write cool interactive, visual demos and games. This has been my philosophy for a long time, and I’ve even tried to get my feet wet in this area by putting together a set of tutorials for making games in Python. Instead of me trying to motivate this, I recommend you read their blog post announcing the new initiative. If you’re interested in this topic, I would further recommend this neat lecture that inspired them to develop this in the first place (minutes 2-23 are the most interesting and related).

Go ahead and check out the demos and starter code they’ve put together to demonstrate the power of the sandbox. For example, here are some animation demos. You write code on the left, and it is immediately executed and results are shown on the right. You can also do nifty things such as hold down the button over any number, slide mouse left or right to change it, and see the results right away on the right. Awesome!

Wasting no time, I jumped to create a few cool programs. For example,

- Here is a Mandelbrot set solver I put together in a few minutes
- Here is an N-body physical simulation of gravity, though admittedly it has a bit of numerical issues. Maybe I’ll try to upgrade it to Runge-Kutta integration
- Here is a heart-drawing animation for fun :)

I also ported a few fun Canvas demos you can find on the internet into their API:

- Lorenz Attractor. Go ahead and change the parameters to see how the attractor behaves! (Original attractor code taken from a gist)
- Tetris!! With the actual tetris code taken from a canvas coding blog post.

Anyway, the idea is that this sandbox allows for rapid prototyping of cool visualizations and very easy sharing of code across people to make cool things. For example, within a few minutes someone took the Lorenz attractor and modified it so that it is animated. Awesome! I think it will be a great tool for younglings who want to learn how to think like a programmer. I am also slightly envious, as I had no such fun tools to draw on when I was young. Instead, I had to write PASCAL and program projection matrices in OpenGL to get things to move :(

I am looking forward to developments in this area! I hope they implement a way to explore all these cool programs, and that they provide a nicer, more comprehensive, hand-held, well-documented and well-explained introduction through these demos, not just a few comments. But I’m sure that’s all coming.


CVPR 2012 Highlights

CVPR 2012 just ended in Providence and I wanted to quickly summarize some of my personal highlights, lessons and thoughts.

FREAK: Fast Retina Keypoint
FREAK is a new orientation-invariant binary descriptor proposed by Alexandre Alahi et al. It can be extracted on a patch by comparing pairs of values in a Gaussian pyramid, one comparison per bit, similar to BRIEF. They show impressive results for discriminative power and speed, and also draw interesting connections to a model of early visual processing in the retina. Major bonus points are awarded for a beautiful C++, Open Source, OpenCV-compatible implementation on Github.
Philosophically and more generally, I have become a big fan of binary descriptors because they are not wasteful, in the sense that every single bit is utilized to its full potential and nothing is wasted describing the 10th decimal place. They also enable lightning-fast computation on some architectures. I’m looking forward to running a few experiments with this!
Tracking & SVMs for binary descriptors
Suppose you want to track a known object over time in an image stream. A standard way to do this would be to compute feature keypoints on the object image (SIFT-like keypoints, say), and use RANSAC with some distance metric to robustly estimate the homography to the keypoints in the scene. Simply doing this per frame can do a decent job of detecting the object, but in the tracking scenario you can do much better by training a discriminative model (an SVM, for example) for every keypoint, where you mine negative examples from patches everywhere around the keypoint in the scene. This has been established in the past, for example in the Predator system.
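
For reference, the per-frame detect-by-homography baseline is only a few lines with modern OpenCV. Here is a sketch using ORB as a stand-in binary descriptor; the specific detector choice, thresholds and parameters below are illustrative assumptions, not the Predator pipeline:

```python
import cv2
import numpy as np

def detect_object(obj_img, scene_img, min_matches=10):
    """Per-frame baseline: match binary descriptors, then RANSAC a homography."""
    orb = cv2.ORB_create(nfeatures=1000)          # stand-in binary descriptor
    kp1, des1 = orb.detectAndCompute(obj_img, None)
    kp2, des2 = orb.detectAndCompute(scene_img, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    if len(matches) < min_matches:
        return None

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # robust to bad matches
    return H
```
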
But now suppose you have binary descriptors, such as FREAK above. Normally it is lightning fast to compute Hamming distances on these, but suddenly you have an SVM with float weights, so we’re back to slow dot products, right? Well, not necessarily, thanks to a trick I noticed in the [Efficient Online Structured Output Learning for Keypoint-Based Object Tracking [pdf]] paper. The idea is to train the SVM as normal on the bit vectors, but then approximate the trained weights as a linear combination of a few binary basis vectors. In practice, around 2 or 3 bit vectors, appropriately combined with (fast) bitwise operations and a linear combination afterwards, produce a result that approximates the full dot product. The end result is that you can use a discriminative model with binary vectors and enjoy all the benefits of fast binary operations. All this comes at a small cost in accuracy due to the approximation, but in practice it looks like this works!
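
Here is a small NumPy sketch of the weight-approximation part of that trick: a greedy least-squares fit of a few {-1,+1} basis vectors to the trained weights (the paper’s exact procedure may differ in details; all sizes below are made up):

```python
import numpy as np

def binary_approx(w, num_bases=3):
    """Greedily approximate a real weight vector as a sum of scaled {-1,+1} vectors,
    so that w.x over a binary descriptor x can be evaluated with a few popcounts."""
    residual, bases, coeffs = w.copy(), [], []
    for _ in range(num_bases):
        b = np.sign(residual); b[b == 0] = 1      # binary basis vector in {-1,+1}^d
        beta = residual.dot(b) / len(w)           # least-squares scale for this basis
        bases.append(b); coeffs.append(beta)
        residual = residual - beta * b
    return np.array(bases), np.array(coeffs)

d = 512
w = np.random.randn(d)                            # trained (float) SVM weights
x = (np.random.rand(d) > 0.5).astype(float)       # binary descriptor, e.g. FREAK bits

bases, coeffs = binary_approx(w)
exact = w.dot(x)
# each term below reduces to popcount arithmetic in a bit-twiddling implementation
approx = sum(beta * b.dot(x) for b, beta in zip(bases, coeffs))
```
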
Hedging your bets: richer outputs
I’d like to see more papers such as “Optimizing Accuracy-Specificity Trade-offs in Large Scale Visual Recognition” from Jia Deng. Jia works a lot with ImageNet, and working with these large datasets demands a more interesting treatment of object categories than 1-of-K labels. The problem is that in almost all recognition tasks we work with rather arbitrarily chosen concepts that lazily slice through an entire rich, complex object hierarchy that contains a lot of compositional structure, attributes, etc. I’d like to see more work that acknowledges this aspect of the real world.
In this work, an image recognition system is described that can analyze an input image at various levels of confidence and layers of abstraction. For example if you provide an image of a car, it may tell you that it is 60% sure it’s a Smart Car, 90% sure it is a car, 95% that it is a vehicle, and 99% sure that it is an entity (the root node in the ImageNet hierarchy). I like this quite a lot philosophically, and I hope to see other algorithms that strive for richer outputs and predictions.
Speaking of rich outputs, I was pleased to see a few papers (such as the one above, from Pepik et al.) that try to go beyond bounding boxes, or even pixel-wise labelings. If we hope to build models of scenes in all their complexity, we will have to reason about all the contents of a scene and their spatial relationships in the true, 3D world. It should not be enough to stop at a bounding box. This particular paper improves only a tiny bit on the previous state of the art though, so my immediate reaction (since I didn’t fully read the paper) is that there is more room for improvement here. However, I still like the philosophy.

100Hz pedestrian detection
This paper [pdf] by Rodrigo Benenson presented a very fast pedestrian detection algorithm. The author claimed at the oral that they can run the detector at 170Hz today. The detector is based on a simple HOG model, and the reason they are able to run at such incredibly high speeds is a trick from Piotr Dollar’s paper [The fastest detector in the west [pdf]] that shows how you can closely approximate features between scales. This allows them to train only a small set of SVM models at different scales, but crucially they can get away with computing the HOG features on only a single scale.

Steerable Part Models
Here’s the problem: the DPM model has all these part filters that you have to slide over your images, and this can get expensive as you accumulate more and more parts for different objects. The idea presented in this paper by Hamed Pirsiavash is to express all parts as linear combinations of a few basis parts. At test time, simply slide the basis parts over the image and compute the outputs for all parts using the appropriately learned coefficients. The basis learning is very similar to sparse coding, where you iteratively solve convex problems holding some variables fixed. The authors are currently looking into using sparse coding as an alternative as well.
I liked this paper because it has strong connections to Deep Learning methods. In fact, I think I can express this model’s feed-forward computation as something like a Yann LeCun style convolutional network, which is rather interesting. The steps are always the same: filtering (AND), concatenation, normalizing & pooling (OR), alternating. For example, a single HOG cell is equivalent to filtering with Gabor filters at 9 different orientations, followed by normalization and average pooling.
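
The key computational point (that you only ever convolve with the few basis filters) follows directly from the linearity of convolution. Here is a toy single-channel NumPy check; the real model operates over multi-channel HOG features, and all the sizes below are made up:

```python
import numpy as np
from scipy.signal import convolve2d

np.random.seed(0)
feat = np.random.randn(40, 40)                     # stand-in for a 1-channel feature map
basis = [np.random.randn(6, 6) for _ in range(3)]  # B=3 shared basis "parts"
coeff = np.random.randn(3)                         # coefficients for one particular part

# response of the full part filter, computed directly
part_filter = sum(c * b for c, b in zip(coeff, basis))
direct = convolve2d(feat, part_filter, mode='valid')

# response reconstructed from the basis responses (these are shared across all parts)
basis_resp = [convolve2d(feat, b, mode='valid') for b in basis]
steered = sum(c * r for c, r in zip(coeff, basis_resp))

assert np.allclose(direct, steered)                # identical up to floating point
```
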
Neural Networks: denoising and misc thoughts
This paper [Image denoising: Can plain Neural Networks compete with BM3D? [pdf]] by Harold C. Burger shows that you can train a Multi-Layer Perceptron to do image denoising (including, more interestingly, JPEG artifact “noise” when using high compression) and it will work well if your MLP is large enough, if you have a LOT of data, and if you are willing to train for a month. What was interesting to me was not the denoising, but my brief meditation, after I saw the paper, on the strengths and weaknesses of MLPs in general. This might be obvious, but it seems to me that MLPs excel at tasks where N >> D (i.e. much, much more data than dimensions) and especially when you can afford the training time. In these scenarios, the MLP essentially parametrically encodes the right answer for every possible input. In other words, in this limit the MLP becomes almost like a nearest neighbor regressor, except it is parametric. I think. Purely a speculation :)
Neural Networks and Averaging
Here’s a fun paper: “Multi-column Deep Neural Networks for Image Classification”. What happens when you train a Yann LeCun style NN on CIFAR-10? You get about 16% error. If you retrain the network 8 times from different initializations, you consistently get about the same 16% result. But if you take these 8 networks and average their outputs, you get 11%. You have to love model averaging…
Basically, what I think is going on is that every network by itself covers the training data with the right labels, but also casts essentially random labels over the space outside the training instances. However, if you have 8 such networks that all cast different random labels outside of the data, averaging their outputs washes out this effect and regularizes the final prediction. I suppose this is all just a complicated way of thinking about overfitting. There must be some interesting theory surrounding this.
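
The averaging itself is trivial; for concreteness, here is a two-line sketch of the committee prediction (all shapes and numbers are placeholders):

```python
import numpy as np

# Each independently trained net outputs a distribution over the 10 CIFAR classes;
# the committee prediction is the argmax of the mean distribution.
num_nets, num_images, num_classes = 8, 10000, 10
probs = np.random.dirichlet(np.ones(num_classes), size=(num_nets, num_images))

avg = probs.mean(axis=0)              # (num_images, num_classes)
committee_pred = avg.argmax(axis=1)   # typically better than any single net
```
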
Conditional Regression Forests for Human Pose Estimation
This is just an obligatory mention of this new Random Forests paper, where Microsoft improves on their prior work on pose estimation from a depth image. The reason for the mention is that I’m currently in love with Random Forests philosophically, and I wish they were more popular, as they are beautiful, elegant, super efficient and flexible models. They are certainly very popular among data scientists (for example, they are now used as _the_ blackbox baseline for many competitions at Kaggle), but they don’t get mentioned very often in academia or courses and I’m trying to figure out why. Anyway, in this paper they look at how one can better account for variation in the height and size of people.
This paper shows that patch-level segmentation is basically a solved problem, and humans perform on par with our best algorithms. It is when you give people the context of the entire image that they start to outperform the algorithms. I thought this was rather obvious, but the paper offers some quantitative support and they had a cool demo at the poster.
This was an oral by Antonio Torralba that demonstrated a cool CSI-style image analysis. Basically, if you have a video of a scene and someone occludes the source of light, they can become an accidental anti-pinhole, which lets you reconstruct the image of the scene behind the camera. Okay, this description doesn’t make sense, but it’s a fun effect.
This paper has a lot of very interesting tips and tricks for dealing with large datasets. Definitely worth at least a skim. I plan to go through this myself in detail when I get time.
Misc notes
- There was a Nao robot that danced around in one of the demo rooms. I got the impression that they are targeting more of a K-12 education market with the robot, though.
- We had lobster for dinner. I felt bad eating it because it looked too close to alive.
- Sebastian Thrun gave a good talk on the self-driving car. I saw most of it before in previous talks on the same subject, and I can now reliably predict most of his jokes :)
- They kept running out of coffee throughout the day. I am of the opinion that coffee should never be sacrificed :(
I invite thoughts, and let me know if I missed something cool!

Musings on Intelligence: thought experiments

Isn’t intelligence just unbelievably annoying? How does it work? I spend many hours pondering this question. In this post I outline two of my more interesting thought experiments that aim to probe the answers. As I go through these in my head, I always think about how a robot could achieve these same “thoughts” or inferences. What kind of algorithms are required to at least approximately match my thinking process?

Thought Experiment #1. Try this: fully introspect your thinking process while doing a random, routine task. Suppose you’re sitting at your desk in the office and suddenly decide to get some coffee from the coffee shop across the street. Think about every detail of your thought process as you go along: You form a plan to go down to the street. The plan is hierarchical in nature: the overall goal, waypoints, immediate plans for getting from A to B, all of the muscle contractions that get executed to meet each tiny goal along the way… Just before you walk out of your office you slow down a bit in front of the door, because the hallway can be full of people who may be walking quickly and are unaware of you. In other words, you’re considering the possible dangers and planning ahead, minimizing the risk of undesirable outcomes. As you walk forward, a person is coming toward you. You immediately infer that person’s goal: they are most likely trying to pass you and continue on their way down the hallway. You steer slightly to the right and you anticipate them moving slightly to the left. You walk down the steps and you’re about to open the door, but suddenly you notice a person coming in from the outside. Again, you understand that they want to come into the building. You immediately infer that they are likely to open the door. You also notice that the other person is not looking at you but slightly down at their feet while walking, so you infer that they are probably unaware of you. You step aside and wait for them to open the door and pass. Finally, you get to the shop and you see a line. You understand how a line works: people line up and wait for their turn to order things. You line up at the end because that is the right thing to do. You don’t stand too far back and face elsewhere, because other people who want to line up would be confused about whether or not you’re waiting in line…

I feel like I’m doing an injustice to this exercise, but in general it is overwhelming to think about all the tiny inferences my brain is automatically making at any given time. Now, how could a robot match similar processes or inferences? How could it ever learn what a line at a coffee shop is? How is it represented as a data structure in its memory? Or the fact that the rule is to “line up at the end of the line”? How could it ever understand that the person on the other side of the glass door had his own goal, and that in that particular moment his goal was to get into the building? How could it ever understand that the other person also has their own knowledge base, and that since they were not looking at the robot they did not know it was there? And how could it ever decide that a particularly efficient way to handle the scenario was to step aside and wait for them to pass?

Thought Experiment #2. For my second thought experiment, consider a slightly different setup. It is so ordinary and so boring, and yet, of all my thought experiments, I believe it reveals a lot about intelligence. It is inspired by a real-world situation: I was talking to a friend of mine at a party when, after a brief pause in which we both took a casual sip of our beverages, my friend suddenly asked: “Did you see John?”. The inferences that unfolded during my tiny state of confusion, on the other hand, are extraordinary if you try to enumerate them explicitly:

- John is probably a person. It probably isn’t a movie, or a thousand other things.
- I can’t think of a John I know at the moment. I do know a few Johns, but I don’t think my friend knows them.
- My friend would not ask me the question if he thought I did not know John. So he thinks I know a John.
- What is the set of people that we both know? Maybe I know John but only from seeing him? Maybe my friend doesn’t know that I don’t know him by name.
- What were we talking about just seconds before? We were talking about an assignment for a class. Is John in the class as well? Is there a person in the class who we sometimes hang out with and whose name I don’t know, but should?
- Why is my friend asking this question? How does it fit with what we’ve just been talking about? How does it fit with what my friend would want to know at this moment?
- Is he merely thinking out loud, and does not really expect me to know John?
- Is he asking about the past? Did we ever talk about some John? Or is John a guest at the party that my friend is merely trying to find?
- Did I not hear my friend correctly? Maybe he meant Jen? We both know a Jen, but she doesn’t fit too well into the context of the conversation from moments ago. Is my friend trying to change the topic? Is there something interesting that happened with Jen in the last few days that maybe I don’t know about?
- Did my friend ever ask this question or a similar one before?

It feels like my brain went through hundreds of immediate hypotheses like the ones above, racing to make sense of the situation; striving to make it consistent. It felt like in a millisecond it tried to fit every hypothesis to the available data, and it felt like it retrieved vast amounts of past knowledge not only about the context of the situation at that time, but also the context of the entire history of my relationship with my friend, and the events that unfolded moments ago. It felt like it was trying to find a hypothesis that “clicked”. It considered not only my knowledge, but a model of my knowledge from my friend’s perspective, and even my guess at his immediate intentions. In other words, somehow I maintain a model of what every person I know knows about me and the world, their attitudes toward me and the world, and the experiences and contexts we share. I also have an understanding of their personalities, and the kinds of things they are likely to talk to me about. Interestingly, I would also argue that I maintain a degree of certainty on every such piece of knowledge, sometimes only as a summary, and sometimes with pointers to the events that led me to believe it.

It is quite amazing that our brains are capable of doing all this in fractions of a second, and they do it thousands and thousands of times a day. I believe that the process outlined above is at the heart of intelligence, in that it is just a single example of more general reasoning machinery that is used at any moment in time. The brain is, as best as I can describe it, a Hypothesis Generating Bayesian Scoring Machine. And don’t get excited, by Bayesian I only mean the very simple idea that we have priors and assign likelihoods for every possible hypothesis, and we combine them in some way to get a winner: the hypothesis that “clicks” the best. And as far as I can tell, the inference is most similar to a kind of hybrid Loopy BP / MCMC scheme, where proposals that are based on experience are used to initialize hypotheses, and where a belief propagation-like procedure derives their consequences before scoring them.
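To pin down what I mean by “priors and likelihoods combined to get a winner”, here is a cartoon of the scoring step in code (entirely my own toy illustration, with made-up hypotheses and numbers; certainly not a model of how the brain actually does it):

```python
# Score candidate interpretations of "Did you see John?" by prior * likelihood
# and pick the one that "clicks" the best.
hypotheses = {
    "John is a classmate whose name I never learned": (0.20, 0.6),
    "John is a guest at this party":                  (0.30, 0.5),
    "I misheard, and he actually said 'Jen'":         (0.10, 0.3),
    "He is thinking out loud, no answer expected":    (0.40, 0.1),
}  # hypothesis -> (prior from experience, likelihood of the question given it)

scores = {h: prior * lik for h, (prior, lik) in hypotheses.items()}
total = sum(scores.values())
posterior = {h: s / total for h, s in scores.items()}   # normalize

best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 2))
```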

In conclusion, these depressing thought experiments tell me that we are, indeed, very very (very!!) far from Artificial Intelligence. How can we write algorithms that can automatically explain data by generating and scoring hypotheses, while considering the full context? How do we write algorithms that understand and model intent, knowledge and goals of other agents? I don’t have the answers, but one thing I do know is that there is no single machine learning system that I’ve heard of that I consider to be on the right path. I’m being harsh and my expectations are high, but my main concern is that our algorithms for the most part don’t think, they compute boring feed-forward functions that depend on a fixed set of conveniently chosen parameters. An algorithm that attempts to model a mind must have a certain scent of meta… a scent that I have yet to feel.


My Last quarter: projects, courses, endeavors

First quarter at Stanford was extremely busy but a lot of fun. Here is the list of endeavors that kept me entertained:

1. I took two courses: Machine Learning with Andrew Ng, and Computer Vision with Fei Fei Li. Both courses were fun, even though they mostly contained information I had already learned at UBC. Regardless, it was nice to hear it all again and get more practice with it.

2. I rotated in Daphne Koller’s lab and worked on the Latent Structural Support Vector Machine. The optimization for LSSVMs is done in a coordinate-descent fashion: the latent variables h are inferred given the SVM weights w, and then w is inferred given h. I worked on an extension to the first step: instead of inferring a fixed value of h, one tries to maintain a probability distribution over h. When inferring w in the second step, an expectation is taken over h instead of simply using a fixed value. The intuition is that the algorithm should not be too hasty to commit to a bad h, or it can get stuck in a bad local minimum. Of course, one pays a computational price for this, but the question was: is it worth it? As far as my experiments went with my specific data, the answer seems to be no. This general meta-issue is one that keeps coming up over and over again: do you spend computational effort doing the right thing, or do you compute the wrong thing many times faster? In practice, the latter can be surprisingly effective.
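Schematically, the difference between the two latent steps looks something like this (my own paraphrase of the idea, not the lab’s code; `score(w, x, h)` is a hypothetical stand-in for the linear model score of input `x` under latent assignment `h`):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def latent_step_hard(w, x, H, score):
    """Standard LSSVM step: commit to the single best latent assignment."""
    return max(H, key=lambda h: score(w, x, h))

def latent_step_soft(w, x, H, score, temperature=1.0):
    """Explored extension: keep a distribution p(h | x, w), so the weight
    update can take an expectation over h instead of using a point estimate."""
    scores = np.array([score(w, x, h) for h in H])
    return softmax(scores / temperature)
```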

Most importantly, though, this rotation reaffirmed that this kind of work is not something I find personally appealing. I don’t get excited about mathematically reformulating a problem in some slightly different form and seeing it perform 1% better than the state of the art on my favorite dataset. What motivates me most are more tangible projects that address large conceptual challenges. Projects that have the goal of AI in mind, or the goal of getting robots to live among us. Projects that have meta in them. Projects that can make me say embarrassing things, such as “This must be how the brain works”.

3. For my course project for both Computer Vision and Machine Learning, I was advised by Gary Bradski from Willow Garage and worked on Object Detection. More specifically, I worked on extensions to the recently published (ICCV 2011) LINEMOD object detector by Stefan Hinterstoisser. Stefan’s work is essentially a super-fast, optimized implementation of template matching that can be applied to RGBD images (such as those coming from the Kinect) for object instance detection. I chose this project because it had all the tags necessary to get me excited: Kinect, Willow Garage, Object Detection, Super-fast, Vision, and Robots. In addition, I have this strange feeling that despite all the efforts that go into building clever systems for object detection, it will still be common in 20 years to solve practical problems with template matching, naive Bayes, and bag-of-words models. In fact, I’m not entirely convinced that this is unlike what the visual cortex does in humans, at least for a large portion of the low-level processing.

However, clearly it is not practical to have a separate template for every possible view and every possible object, so there must be mechanisms in place to scale the naive object-centric template matching strategy. I investigated two ways of scaling the algorithm. First, the simple intuition that not all parts of the image should receive the same amount of attention during matching, because boring regions of the image can be rejected as object candidates based on very coarse matching at low resolution. I was able to use this (trivial) intuition to speed up the algorithm 20x without any loss in recall. Second, it would be nice if we didn’t need a separate, large template for each entire object. Instead, I explored a Hough-voting approach where I detected little parts of objects and had them vote for the object center. The intuition is that, for example, if you detect a bottle cap with high certainty, then a bottle center should be somewhere below it. This turned out not to work too well, but I was so puzzled by it that I kept searching, and indeed, shortly after the report was due I uncovered a severe bug in the code base I was using as a black box for matching that would directly lead to bad performance in these part-based experiments. Unfortunate!
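For the second idea, the voting mechanics are simple enough to sketch in a few lines (a toy reconstruction of the general Hough-voting recipe, with made-up detections and offsets; not my actual project code):

```python
import numpy as np

# Accumulator over possible object-center locations in a 320x240 image.
H, W = 240, 320
accumulator = np.zeros((H, W))

# Hypothetical part detections: (x, y, score, learned offset to the object center).
detections = [
    (150, 60, 0.9, (0, 40)),   # e.g. a bottle cap votes for a center 40 px below
    (152, 62, 0.7, (0, 40)),
    (30, 200, 0.4, (10, -5)),  # an unrelated, low-confidence part detection
]

for x, y, score, (dx, dy) in detections:
    cx, cy = x + dx, y + dy
    if 0 <= cx < W and 0 <= cy < H:
        accumulator[cy, cx] += score   # soft vote weighted by detection confidence

# Peaks in the accumulator are object-center hypotheses.
cy, cx = np.unravel_index(accumulator.argmax(), accumulator.shape)
print("best object-center hypothesis:", (cx, cy))
```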

I liked working on this project a lot! You can read my final report here. [PDF]

4. Those of you who know me also know that I get very easily excited about anything related to Education. And since Andrew Ng’s Machine Learning class was offered to the public online for free last quarter, I did not hesitate and volunteered 10-15 hours a week helping to prepare the programming assignments for the class. Looking back, it was probably not the best choice for my career as a researcher, but I do not regret it. It was a lot of fun being involved in something I consider to be so ground-breaking, and I really hope that all the new initiatives that seek to revolutionize online education, such as Coursera, Udacity, and MITx, go on to become very successful. And I hope I earned some bragging rights, because I’ll be able to say that I was there, involved and at the heart of it, when it all began.

This quarter I am rotating with Andrew Ng’s group working with Adam Coates, and I am taking Convex Optimization with Stephen Boyd and Probabilistic Graphical Models with Daphne Koller and Kevin Murphy. More on this later! :)

My “Values and Assumptions about Teaching and Learning”

Those of you who know me well may also know that I get very passionate about education. I could write another whole 10-page post on some of my thoughts on Khan Academy, and more recently the MLclass, AIclass, DBclass, etc. offered at Stanford. (By the way, update: I’ve volunteered to help make assignments for the ML class, and I LOVE being a part of it). My name is on the “About us” page, and will go down in educational history! (ok, just kidding, but I’m proud of it anyway :p)

For now, however, I wanted to share this writeup that I just randomly discovered hidden deep inside my Dropbox. It is my “Values and Assumptions about Teaching and Learning”, which I submitted with my application for one of the top TA awards at the University of British Columbia last year. My application was rejected (which, by the way, I am bitter about, because I think my application was overall very strong and there is no other student I know who worked even close to as hard as I did on my TA duties, who volunteered to TA more courses than was required, who volunteered many more hours than he needed to, and who received near-perfect student evaluations every time… I am normally a fairly modest person, but here I refuse. Ah well, hard work not recognized, fine with me.) Regardless, the writeup has some of my thoughts on what I learned while teaching (most of my experience was in teaching tutorials, i.e. roughly 5-30 people per class with the mean around 20, and helping out students who worked on assignments in the learning center). Forgive the slight cheesiness of it at times :)

————————————————–

When I sometimes help a group of students along as they try to complete some problem, I wonder if they realize that I, as a teacher, am also in a process of solving an extremely difficult problem: that of teaching. It is very hard to over-estimate the difficulty of being an effective teacher. Even a simple question from a struggling student is often just a tip of an iceberg: a brief manifestation of a deeper misunderstanding. The task of the teacher is not to simply answer the question (that’s easy!), but to first infer the exact shape and size of this iceberg, and then to address the source of the confusion. Over the last few years, I came to realize that teaching is one of the most intellectually demanding problems that I can hope to work on, and solving it correctly for some students, in some cases, is a great source of satisfaction.

I have accumulated many tips and tricks of teaching over the last two years, during which I conducted a tutorial almost every other day. In an effort to make my essay concrete, I will attempt to justify from experience a few of my core teaching principles. One of the first surprising discoveries I made when I started out was that being very comfortable with the content of the course was, paradoxically, detrimental to my ability to teach it. As I was trying to explain the material, I would frequently catch myself skipping over details in a problem derivation, simply because certain leaps of logic were obvious to me. For this reason, I volunteered to undertake the universally most hated task that a TA can have: marking assignments. Students are generally bad at conveying their misunderstanding, and are often even reluctant to admit it. A commonly occurring situation is that they aren’t even aware of it in the first place. Overall, getting my hands dirty and poring over students’ work in detail enabled me to more clearly understand the kinds of problems that often come up, it reminded me of all the little pieces of knowledge that I now take for granted, and ultimately led me to become a more effective instructor.

One of my other core principles was also strongly reinforced through personal experience. When I first started teaching, I felt very comfortable with the course material. After all, the course I taught involved only simple mathematics that I had carried out many times since my first year. To my surprise, however, once I actually started teaching I realized that my understanding of these elementary concepts was only superficial, and often simply rule-driven. Forcing myself to make sense of the material as I explained it to others led me directly toward a deeper understanding of all the concepts and their relations. Similarly, as teachers we should encourage our students not only to passively absorb information, but to actively try to make sense of it through interaction, collaboration, and teaching.

My process of improvement as a teacher is not unlike the one my students go through. We gradually learn to become better through long periods of sustained practice. I don’t pretend to have anything figured out, but I eagerly look forward to learning more.
