Virtual bees help robots see in 3D (calling Michael Crichton)

Copying the humble honeybee's foraging methods could give robots better 3D vision, researchers say. Robot explorers could identify points of interest by mimicking the way bees alert others of promising foraging spots.

Explorer bees report the location of a new food source, like an inviting flowerbed, by dancing on a special area of honeycomb when they return to the hive (see How vibes from dancing honeybees create a buzz on the dance floor).

A new type of stereoscopic computer vision system takes inspiration from this trick. It was developed by Gustavo Olague and Cesar Puente, from the Center for Scientific Investigation and Higher Education of Ensenada (CICESE) in Mexico.

A computer can generate 3D information using two cameras by comparing the view captured from different angles. It is, however, computationally intensive to do this for large scenes. Complicated statistical techniques can be used to pick out important features of a scene for further analysis, but this is still time-consuming.

The system developed by Olague and Puente is far simpler, they claim. It uses virtual honeybees to home in on potential points of interest, which can then be rendered in 3D. Simulated "explorer" bees are programmed to seek out features of potential interest in a 2D picture, based on criteria such as texture and edges. This can, for example, lead them to focus on a person or a prominent object in an otherwise empty room.

The honey bee software starts by randomly assigning explorer bees to different parts of an image. After identifying features of potential interest, these explorers recruit other virtual bees, known as "harvesters", to investigate in more detail. The explorers recruit harvesters in proportion to their interest in an area, meaning the most promising areas get the most attention.

If the harvesters also find the area interesting, they focus on it too. The system can then render it in 3D, based on all the bees' movements. This could eventually help a robot navigate or interact with its surrounding more efficiently.

"This algorithm can save time," Olague told New Scientist. "The harvesters are targeted by the explorers to look only at promising areas."

In testing, Olague and Puente used up to 8000 virtual explorer bees and 32,000 virtual harvesters. Before the end of 2006 they hope to use the honeybee vision system to help a mobile robot avoid obstacles.

Toby Breckon, a computer vision researcher at Cranfield University in the UK, says the approach has promise. "One of the big problems for stereo vision is that you have to search through the features in front of you," he says. "Bees have this almost built-in search algorithm that has the potential to help."

Breckon adds that the number of virtual bees could be adjusted for different situations. "A robot could use a small number of bees if it just needed to know where the walls of a corridor are, and then put in more bees to collect more detailed information," he says.

The research was presented at the 8th European Workshop on Evolutionary Computation in Image Analysis and Signal Processing in Budapest, Hungary, in April 2006, where it won the award for best paper.

Hey, no problem. It's an interesting article either way. :) Having re-read it a couple times myself, I'm pretty convinced he's talking about algorithmic "bees".

As such, the analogy is tenuous at best. However, I think I see (no pun intended) what he's trying to do here. If I may try to translate into English... :)

Stereoscopic processing is an incredibly difficult computational challenge. "Everybody" knows that with two eyes reporting a 2D view of the world from two slightly different angles, you can infer the 3D structure of that world to some degree. However, that inference process is very nontrivial.

The problem is one of image correlation.

The following mental exercise will help highlight the difference between how stereoscopic vision is supposed to work in theory, and why it's so hard to get it to work in practice.

Imagine you're in a completely dark, featureless room, and the only thing in that room is a single tiny candle emitting a single point of light several feet away from you. Your human visual system - comprising the most formidable spatial processing engine in the entire animal kingdom - will be able to easily discern the exact distance to that candle. It will do so by taking the position of that single point source of light on each of your respective retinas, normalizing for the orientation of each eye, using this information to plot virtual rays in 3D space from each eye to the candle, and computing where those rays intersect. All of the trigonometry and algebra required to do this is built into the neural circuitry of your visual cortex, and it is fairly easy to replicate using modern computer technology. At least, that very simple scenario is.

Now let's make things a little more complicated. Imagine you're in the same dark room, but instead of just one candle in the room there are now two, both at eye level (i.e. separated horizontally but not vertically - for example, if they were both on an invisible black table). Now, on each of your retinas there is not simply one point of light but two, meaning that from each of your eyes you can draw two outgoing rays. Whereas in the previous case (with one candle) two rays total (one from each eye) have one point of intersection, you now have four co-planar rays (two from each eye) and they have four total points of intersection. This is a much more complicated situation now - since each point of intersection represents a location in 3D space for one of the lights, figuring out your spatial environment is now much trickier. It becomes difficult to focus your eyes right; you know there are only two candles, since each retina has two points of light on it, but there are four total possible locations for them. You focus for a second or two, perhaps enduring an optical illusion where you see three lights at a time, until finally your eyes adjust and you see two lights in two locations. The process is quick but noticeable.

Now imagine that, instead of two candles at eye level, you have five hundred candles at eye level. Five hundred points of light, all in a line relative to your vision, some a little closer and some a little further from you. Imagine trying to find the one that's closest to you. Not so easy.

This is, at its simplest, how a robotic visual system sees the world. It also is very much like how V1, the first layer of the visual cortex, sees the world as well. Millions upon millions of individual points, all unrelated to one another, all different, all viewed from two different views. And only by correlating which point in one view corresponds to which point in another view can the visual system begin to infer the 3D structure of its environment.

How does the human brain do it? By leveraging massive parallel processing to extract similar features from one eye and map them to features seen by another eye. By "features" I refer to colors, patterns, edges, areas of high contrast, and so on. The human brain has dozens, if not hundreds, of ways to determine that Retinal Pixel A in one eye is looking at the same point in 3D space as Retinal Pixel B in another eye. (Yes, I mean "cones" and "rods" rather than "pixels", but I'm focusing on the computational aspects here. :) ) And, even with all that processing, one need only stare intently at a flat clear snowfield or the matching slats of clean vertical blinds to realize that even the mighty human visual spatial perception system can still be fooled.

Correlating visual features from the two images, then, is a very computationally expensive problem. Simply comparing every single pixel in one image to every single pixel in the other image is not only ridiculously expensive but also not very productive in the long run. Modern computer algorithms perform edge detection, 2D pattern correlation, contrast comparisons, etc., to try to reduce the number of things they have to compare and the ways in which they should compare them. Even so, these pre-processing techniques themselves take quite a bit of time, and while they dramatically reduce the computational power needed to decipher the scene's 3D structure, they still require unfathomable amounts of computing power to do so in real time.

This is where this "bee" research comes into play. These "bees" described in the article, if my understanding is correct, have nothing to do with neither physical "bee" robots nor with insect visual systems. Rather, what the researchers are proposing is to intelligently break each scene from each "eye" (camera) apart into a large number of small, manageable areas of interest. What they call a "bee", if I read correctly, is actually nothing more than a pair of small circles - one on each image - that move quasi-randomly around both camera images. These small circles represent just barely enough pixels to make meaningful statistical assertions about the similarity between two areas on the two different images - big enough to matter, but small enough to be computationally manageable.

In other words, because it's computationally impossible to compare every pixel in one image to every pixel in the other image, these researchers basically said, "Well, how about we take a small random group of pixels in one image, and compare them to a small random group of pixels in the other image?" Naturally, the first random pair of areas you pick is likely to have absolutely no correlation, so you sample many, many image-area pairs (i.e. a "swarm"). And, when you DO happen to find an area-pair that has some correlation, you want to leverage your precious find in order to draw MORE areas of correlation - so, you pull the other area-pairs closer to the area-pair that had the high correlation, and see if any of them find their own areas of correlation in turn.

The respective position of the two corresponding circles on each of the respective camera images would define two rays in 3D space, and the intersection of the two rays would define the location in 3D space that the area-pair was focusing on. In other words, each "bee" corresponds to a hypothetical point in 3D space, but that doesn't mean that there's actually anything THERE at that point - each area-pair could represent a point in thin air, or a point deep inside a solid object, and so on. The degree of correlation of the two images at that point determines whether the point actually represents the 3D location of a visible surface of an object in the environment.

Hence, I can definitely see the "bee" analogy - thousands of virtual points in space, each moving about randomly but willing to stop if it finds a spot of high correlation from both "retinal" images, and drawing others to its location so that they can find high-correlation spots nearby in turn.

As I explain it, I actually come to understand that it's really quite a clever idea. It takes quite a bit of explanation, though, to understand exactly what it has to do with "bees", and why it's an interesting development in computer visual models.

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.