Artificial Intelligence and Robotics blog
Computer Vision
MIT Researchers develop new image-recognition software
Aug 12th
One part of the human brain that has received relatively little attention is its visual-perception ability, which allows it to recognize, differentiate and quantify different objects. This is, of course, a vital aspect of the brain’s function, directly linking what we see through our eyes to the way we understand things in our minds. Now, scientists at MIT’s McGovern Institute for Brain Research have developed software that studies and learns from this ability, gathering information that could pave a new way forward for artificial intelligence.
The software, based on a mathematical model, studied a number of test subjects who were given different object-recognition tasks. In one experiment, a group of people was asked to look at a picture of a street and count the cars and pedestrians in it. The software’s eye-tracking system recorded the way their eyes moved, and the software then began predicting where they would look next in other tests. The information gathered confirmed that the human brain creates an outline of the image the eyes see, and initially recognizes objects that are more significant or stand out in some way. In a related set of tests, the subjects were shown two objects, a square and a star, and their attention was split equally between the two. When shown many stars but only one square, their eyes first focused on the square, the more distinctive shape. This whole process happens in a split second, but it suggests that we look first for whatever is special or unique in a scene.
When tested on its own, the software created a similar spatial map on which it recognized different objects. To do so, it ran down a list of features specific to the object it was asked to identify, focusing only on those that corresponded to that object and ignoring everything that did not match the description. To truly mimic the human brain, this process will also have to become much faster, but MIT’s researchers are confident they can get it close to that level.
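The pop-out effect in these experiments, where a lone square among many stars draws the eye first, can be illustrated with a toy saliency map. The sketch below is not the MIT model. It assumes a hypothetical scene given as a grid of shape labels, scores each cell by how rare its label is in the scene, and optionally boosts cells that match a target feature (the names `saliency_map`, `first_fixation`, `target` and `boost` are all made up for illustration).

```python
from collections import Counter

def saliency_map(scene, target=None, boost=2.0):
    """Score each cell of a 2D grid of shape labels.

    Bottom-up: a label is more salient the rarer it is in the scene.
    Top-down: cells matching `target` (if given) are multiplied by `boost`.
    """
    counts = Counter(label for row in scene for label in row)
    total = sum(counts.values())
    scores = []
    for row in scene:
        scored_row = []
        for label in row:
            score = 1.0 - counts[label] / total  # rarity of this label
            if target is not None and label == target:
                score *= boost  # task-driven emphasis on the target feature
            scored_row.append(score)
        scores.append(scored_row)
    return scores

def first_fixation(scores):
    """Return (row, col) of the most salient cell (first one wins on ties)."""
    cells = [(r, c) for r, row in enumerate(scores) for c in range(len(row))]
    return max(cells, key=lambda rc: scores[rc[0]][rc[1]])

scene = [
    ["star", "star", "star"],
    ["star", "square", "star"],
    ["star", "star", "star"],
]
print(first_fixation(saliency_map(scene)))  # (1, 1): the lone square
```

With one square and one star the two scores are equal, which loosely mirrors the evenly split attention reported in the two-object test.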
[source]
LuminAR to shine a light on the future
Jun 16th
You might think that some devices, such as the common desk lamp, have reached their maximum level of development in the modern age, but you would be wrong. Natan Linder, a student at the Massachusetts Institute of Technology (MIT), has created a robotic version that can not only light your room but also project internet pages onto your desk.
It is an upgrade on the AUR lamp from 2007, which tracks movements around a desk or table and can alter the color, focus, and strength of its light to suit the user’s needs. The LuminAR comes with those abilities, and much more.
The robotic arm can move about on its own, and combines a vision system with a pico projector, wireless computer and camera. When turned on, the projector will look for a flat space around your room on which to display images. Since it can project more than one internet window, you can check your email and browse another website at the same time.
A gestural system similar to touch-screen technology is used to interact with the projected interface, which makes it very easy and convenient to operate. Another really cool feature is that it can scan and recognize the things in your room, such as a Coke bottle. In one of the demonstrations, the LuminAR scanned the soft drink and opened up a Coca-Cola webpage. Basically, it knows what it’s doing – and it knows what you’re doing too.
One problem that had to be addressed was making sure the laser does not produce a weaker signal when the surface is farther away, but Linder found a projector that stays in focus regardless of the distance. It is a safe bet that every student from grade school to university will want one of these when they become available in the near future.
LuminAR video demonstration follows.
Bing augmented reality maps demo
Feb 13th
Microsoft Research, which brought us wonderful technologies such as the incredible Photosynth, continues to impress with a much-improved web mapping application integrated with the company’s new Bing search engine.
During the TED 2010 conference, Microsoft engineer Blaise Aguera y Arcas demoed the new Bing augmented reality maps, showing real-time registration of video taken with a smartphone against street-view-style maps. He showed how the live video can be overlaid on the static images and how additional information about the area can be accessed via a Web interface. Much of this is made possible by the advanced computer vision technology that has been developed over the past decade at Microsoft Research. The Seadragon technology is the back end that makes it possible to manipulate such vast amounts of data in real time. Microsoft has also integrated Photosynth and WorldWide Telescope into its maps product.
You are probably wondering what this has to do with robotics, other than the fact that it is a very impressive application. I can imagine robots using Bing maps to stay localized within a city. One of the most difficult and important problems in robotics is Simultaneous Localization and Mapping (SLAM). Bing maps solve the mapping problem, and the new vision techniques (with a bit of help from GPS) can be used to solve the localization problem. A robot could use the registered video to localize itself when it goes out to buy your weekly groceries.
You can watch the 10-minute demo below; I bet that it won’t be long before Microsoft makes these new features available to us all for free.
A biologically-inspired vision system
Dec 6th
Researchers at the Rowland Institute, Harvard, and the McGovern Institute for Brain Research, MIT, are developing new, biologically-inspired vision systems that take advantage of faster computers. Their goal is to create vision systems for image understanding that can be as accurate as biological systems, and more specifically the human visual system. The researchers have developed a new method that allows them to evaluate many different vision systems and quickly determine which are best suited for scene understanding. In a PLoS Computational Biology paper, the researchers show that their method performs better than current state-of-the-art computer vision systems when tested on standard data sets.
If you don’t want to read the paper, then you should at least watch the video below in its entirety. In the video, lead researcher David Cox explains at a high level how biological vision works and how their computational system mimics it to achieve the results presented in the paper.
The Visual Memex Model: Modeling object relationships for scene understanding
Nov 23rd
Computer vision researchers have for decades been trying to develop algorithms for scene understanding from images and/or video. No doubt, they have made huge progress towards this goal. For example, in the last decade new methods for feature-based object recognition have been developed with a large degree of robustness to scale, viewpoint and illumination changes. Such methods are what make products such as Photosynth possible today.
However, full scene understanding has continued to elude researchers. Alexei Efros’ group at CMU is now proposing a new method for scene understanding that looks at the individual objects in a scene and their spatial relationships. The Visual Memex Model, as they call it, is a new method for encoding information about specific objects, their visual similarity and their contextual relationships.
The insight behind the Visual Memex Model is that the traditional assumption in computer vision, that objects belong to well-defined categories, is not correct, and that an exemplar-based definition of categories is more suitable. Using evidence from psychology, cognitive neuroscience and other disciplines, Efros and his student Tomasz Malisiewicz argue that this new approach is better suited to scene understanding in computer vision.
In their proposed model, objects are represented by examples of their appearance in images. These exemplars form the vertices of a graph whose edges represent either visual similarity between exemplars or contextual relationships, e.g., a person is often seen next to a car. The relationships are learned automatically from data using state-of-the-art machine learning methods. Given this graph, a new object is first matched to an exemplar, and the contextual relationships are then used for scene understanding. An experimental evaluation on a large database of images shows that the proposed method performs better than category-based systems. I suspect that this work may actually cause a small shift in thinking within the computer vision community.
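To make the graph structure concrete, here is a minimal sketch of an exemplar graph. It is not the authors’ implementation. The class name `VisualMemex`, its methods, and the two-number feature vectors are hypothetical; visual similarity is computed on the fly as Euclidean distance rather than stored as learned edges, and the context edges are entered by hand instead of being learned from data.

```python
import math
from collections import defaultdict

class VisualMemex:
    """Toy exemplar graph: vertices are object exemplars, edges are context links."""

    def __init__(self):
        self.exemplars = {}               # exemplar id -> (label, feature vector)
        self.context = defaultdict(set)   # exemplar id -> ids seen near it

    def add_exemplar(self, ex_id, label, features):
        self.exemplars[ex_id] = (label, features)

    def add_context_edge(self, a, b):
        # Context is treated as symmetric here (an assumption of this sketch).
        self.context[a].add(b)
        self.context[b].add(a)

    def nearest_exemplar(self, features):
        # Step 1: match a new object to its visually closest exemplar.
        return min(
            self.exemplars,
            key=lambda e: math.dist(self.exemplars[e][1], features),
        )

    def expected_neighbors(self, features):
        # Step 2: follow context edges to predict what is likely nearby.
        ex = self.nearest_exemplar(features)
        return sorted({self.exemplars[n][0] for n in self.context[ex]})

memex = VisualMemex()
memex.add_exemplar("car_1", "car", (0.9, 0.1))
memex.add_exemplar("person_1", "person", (0.1, 0.8))
memex.add_exemplar("tree_1", "tree", (0.5, 0.5))
memex.add_context_edge("car_1", "person_1")

print(memex.expected_neighbors((0.85, 0.2)))  # ['person']
```

Note that the query never asks "is this a car?". The new object is tied to one specific exemplar, and the prediction comes from that exemplar’s own neighbors, which is the exemplar-based idea in miniature.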
The authors will present their work on the Visual Memex Model next month at the 23rd annual Neural Information Processing Systems (NIPS) conference. You can get a copy of the full paper here (pdf).
PhotoSketch: Taking image composition to the next level
Oct 6th
We don’t often write about work in Computer Vision, but when we do we always present something that is likely to blow your mind (see VideoTrace and Adobe’s Interactive Video Editing).
Enter PhotoSketch, work that was recently published in ACM’s Transactions on Graphics. The application takes as input a rough, hand-drawn sketch of a scene and then generates a realistic image using information found on the Internet. The results are magnificent, to say the least, considering the difficulty of filtering the vast amount of noise in online images. The application requires that the user supply a label for each object sketched; these labels are used to search for relevant images online. The object outline, along with automated segmentation algorithms, is used to extract specific objects from the photos found, e.g., a dog, an elephant, a car, a Frisbee, etc. Another mechanism blends the segmented images together and generates appropriate shadows. The results are all the more jaw-dropping given how little input is required from the user.
The video below explains the process in more detail.
AutoStitch: Panoramic image creation for the iPhone
Jun 23rd

The iPhone App Store may be home to some 50,000 applications, but few are actually worth the download, free of charge or not. One of the few that is a must-have for every iPhone owner is AutoStitch by Cloudburst Research Inc.
This nifty application can create panoramic or wide-angle images from any arrangement of photos and with minimal user guidance. These two features are what make AutoStitch such a great application. Any arrangement of photos means that you don’t have to take your photos in sequence for the stitching to work; the application can figure out the overlap between images on its own using state-of-the-art feature-based image matching techniques. And minimal user guidance means that you only have to point the application to the directory where all your photos are stored, and it will gladly stitch together all those photos that belong to the same scene. The photo at the top of this post is an example panorama created using this application (the image is copyright Cloudburst Research Inc.).
If you recall, I have written in the past about image stitching software, including the high-resolution Gigapan system, the desktop version of AutoStitch, and the open source autopano-sift. Microsoft’s Photosynth may be the related application best known to you. Similar software is available today in several commercial image processing packages. Interestingly, the founders of Cloudburst were also the first to publish the method that makes AutoStitch and all these other panorama creation programs work so well.
Learn more about the iPhone AutoStitch application at the company’s website here, or buy the application from the App Store for $1.99 here.
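The "any arrangement of photos" idea can be sketched as a grouping problem: link every pair of photos that share enough matched features, then treat each connected group as one scene to stitch. The code below is only an illustration, not AutoStitch’s actual algorithm. The function name `group_by_scene`, the `min_shared` threshold and the integer feature ids are assumptions; a real system would match local image descriptors and verify the geometry of each match.

```python
from itertools import combinations

def group_by_scene(features, min_shared=3):
    """Group photos into scenes, regardless of the order they were taken in.

    `features` maps a photo name to a set of feature ids (a hypothetical
    stand-in for descriptors that have already been matched across photos).
    Two photos overlap if they share at least `min_shared` features.
    """
    parent = {name: name for name in features}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(features, 2):
        if len(features[a] & features[b]) >= min_shared:
            parent[find(a)] = find(b)  # merge the two overlapping groups

    groups = {}
    for name in features:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())

photos = {
    "kitchen_2": {21, 22, 23, 24},
    "beach_3": {5, 6, 7, 8},
    "beach_1": {1, 2, 3, 4, 5},
    "kitchen_1": {20, 21, 22, 23},
    "beach_2": {3, 4, 5, 6, 7},
}
print(group_by_scene(photos))
# [['beach_1', 'beach_2', 'beach_3'], ['kitchen_1', 'kitchen_2']]
```

Here `beach_1` and `beach_3` share only one feature, yet they end up in the same scene because both overlap `beach_2`. That chaining is why the photos do not need to be supplied in sequence.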
New book: Computer Vision Algorithms and Applications
Jun 19th
There is a new computer vision book in the works by a well-known researcher, Richard Szeliski of Microsoft Research. If you don’t spend much time reading conference proceedings and journal articles, you may not have heard of the author before. However, you know him indirectly from some of his work that Microsoft has started publicizing in the last few years. Richard Szeliski is one of the main people behind Photosynth, the software that allows users to create stunning panoramic images from collections of digital photos without much hassle.
The new book is based on the lectures of a computer vision course that Szeliski has taught with some of his colleagues at the University of Washington. The book starts with a description of some basic concepts in image formation and processing. It then goes on to cover a large number of advanced topics in feature detection and matching, segmentation, calibration, structure from motion, image stitching, computational photography, stereo, recognition, and image-based rendering. The book presents recently published work by a variety of computer vision researchers from across the globe, so many of the chapters describe state-of-the-art methods.
That said, Szeliski’s new book is not finished yet, so don’t rush to the bookstore to buy it. While still working on the project, the author makes the most current draft available online for anyone to download and read. He seeks the community’s feedback in making this a book worth having. I have read parts of it, and it looks like it is going to be a great book when finished. You can download the latest draft here.
MITRE immersive spherical vision system
Mar 1st
Immersive vision systems for teleoperation are a valuable tool for many applications, including inspection of old buildings, pipelines and sewers, search and rescue, and military tasks such as the detection and neutralization of roadside bombs. Immersive systems work by presenting a virtual representation of the world seen via a camera that is situated away from the operator. (Note: I use the term virtual representation of the world liberally here, because what the operator sees is actually images of the real world captured by a remote camera; I call it virtual because, as explained later, these images arrive delayed, which means that the real world may have changed since the data was collected.)
The operator views slices of the virtual world using a head-mounted display while a sensor detects his movements and updates the view accordingly. A major problem with such methods is the latency between the operator moving his head and the system updating his view. The latency often arises because the remote camera has a limited field of view: every time the operator moves his head, a mechanical pan-tilt unit has to reposition the camera. This delays the relay of the images, makes the remote system difficult to operate, and often causes the operator a great deal of distress.
MITRE scientists have worked out a solution to this latency problem by replacing the limited field-of-view camera on the pan-tilt unit with a spherical vision camera that has no moving parts. The camera of choice is the Ladybug, sold commercially by Point Grey Research in Canada. Stanford’s Urban Challenge autonomous car also used the same camera for part of its perception system. This spherical vision system consists of six cameras that capture images simultaneously, covering a large portion of the view sphere around it. Software stitches the images together into a single view in real time. These spherical images are then available for the operator to view at any orientation of his head. The latency I mentioned earlier is thus eliminated, because the camera need not be repositioned every time the operator moves his head. Moreover, more than one operator can use the system at once, each looking in a different direction.
The promotional video below shows the capabilities of the immersive spherical vision system, including some of its potential applications. The true power of the system is clearly visible in the part of the video where a car driver is shown driving down a street while perceiving the world in real time via a head-mounted display.
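The reason head motion no longer needs a mechanical move can be sketched in a few lines: once a full panorama exists, serving a view is just a lookup into it. The function below is a rough illustration and not MITRE’s software. It assumes the stitched image is stored as an equirectangular panorama (columns span 360 degrees of yaw, rows span 180 degrees of pitch), and the name `view_window` and the default field-of-view values are made up. A real display would also reproject the pixels rather than simply crop them.

```python
def view_window(yaw_deg, pitch_deg, pano_w, pano_h,
                fov_h_deg=90.0, fov_v_deg=60.0):
    """Return ((left, right), (top, bottom)) pixel bounds for a head pose.

    Assumes an equirectangular panorama: yaw 0..360 maps to columns
    0..pano_w, and pitch +90..-90 maps to rows 0..pano_h.
    If left > right, the window wraps around the panorama seam.
    """
    cx = (yaw_deg % 360.0) / 360.0 * pano_w      # column of the view center
    cy = (90.0 - pitch_deg) / 180.0 * pano_h     # row of the view center
    half_w = fov_h_deg / 360.0 * pano_w / 2.0
    half_h = fov_v_deg / 180.0 * pano_h / 2.0
    left = int(round(cx - half_w)) % pano_w
    right = int(round(cx + half_w)) % pano_w
    top = max(0, int(round(cy - half_h)))        # clamp at the poles
    bottom = min(pano_h, int(round(cy + half_h)))
    return (left, right), (top, bottom)

# Two operators looking in different directions share the same panorama.
print(view_window(0, 0, 3600, 1800))    # ((3150, 450), (600, 1200))
print(view_window(180, 0, 3600, 1800))  # ((1350, 2250), (600, 1200))
```

Each call reads a different region of the same stitched frame, so neither operator waits on the other and nothing has to physically move.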
Shaving digitally
Feb 11th
Have you ever browsed old photographs and wondered how much better you would have looked had that bushy beard not been hiding your face at the time? Or are you contemplating shaving off your beard of many years, but would like to see the final result beforehand, just in case? It is now possible to satisfy your curiosity using a new image-based shaving technique developed at CMU’s Robotics Institute.
Virtual shaving is achieved by modeling human faces in images as a set of layers that can be separated and manipulated at will; in fact, the method is not only good for shaving but also for adding a beard or even transferring a beard from one photo to another. If you find that you are losing your hair, you could potentially use this method to virtually enhance your photograph with a full head of luscious hair.
The CMU researchers used a machine learning method to predict what a person’s face looks like under a beard, making it possible to reconstruct features not visible in the original image. Using a large database of faces with and without beards (properly labeled for supervised learning), the researchers learned two subspaces, one for each class of faces, and a model of the differences between the two subspaces. These are later used to transform a bearded face image into a non-bearded one and vice versa. Below is an example beard removal from the published paper.

The result shown above was obtained after training with a small number of images: about 1200 in total, which is very few for machine learning to be most effective. As the authors suggest in their Eurographics 2008 paper, a larger set of training images should considerably improve the final result. Finally, the method is general enough to be used for other digital image manipulation tasks, such as removing one’s glasses.
