Alternative sensory representations of the visual world

Dan Morris and Neel Joshi
CS223b Final Project, Winter 2002


Abstract :

We have used two cameras and a Sensable Technologies "Phantom" force-feedback haptic display to visually and haptically render a three-dimensional surface that accurately represents key aspects of a visual scene. In particular, our system can be used to "feel" a depth map of the visual scene or to "feel" the contours defined by edges in the visual scene. The system runs in real time; the user presses a button to trigger the capture and rendering of either a depth map or an edge-detected image.

Once an image has been rendered, the user can use the Phantom to explore the entire surface. See Sensable's website for a more detailed description of the Phantom's capabilities; in short, the sensation of using a Phantom is similar to running one's finger over an object. The force-feedback is extremely convincing, although it is currently limited to a single point of contact.

In addition to rendering depth and edges with the Phantom, we capture optic flow information in real-time and present this to the user using sound cues: a tone pans from left to right according to the horizontal location of maximum optic flow and scales in volume according to the magnitude of optic flow (the system is silent for sub-threshold flow).

We propose that with further development, this system could be used as an assistive device that would allow blind or visually impaired individuals to explore the "visual" world.

Introduction :

Given the current trend toward miniaturization of electronics, it is feasible that in the near future a person will easily carry any number of powerful electronic devices on his or her body. The availability of powerful computers in increasingly small packages will allow algorithms in computer vision - a field that has traditionally been constrained to high-end desktops - to run on laptops and handhelds.

Consequently, advances in computer vision will soon be able to contribute to a traditionally "low-tech" field : assistive devices for the blind. The goal of this project is to develop a system that extracts information from a moving camera and presents important or especially salient visual features to the user's non-visual senses.

When we visually explore a scene, we quickly obtain a tremendous amount of information. We are able to identify people, navigational obstacles, textures, etc. We ask the question: "how much of this information can be rendered using alternative sensory representations?"

The aspects of the scene that we chose to represent were depth, global flow, and edges. We justify each of these as follows :

Depth information is critical both in avoiding potential obstacles and in object recognition. It is currently not possible to create a complete three-dimensional model of objects from a single view, but it is possible - using a stereo camera - to determine the depth of various points in the scene, which in many cases allows approximation of the three-dimensional shape of relevant objects.
Optic flow has been used successfully to guide robots along hallways; the basic principle is that texture on the left and right walls will appear to move at the same speed only from a viewpoint that is moving directly down the center of the hallway. It has also been shown that humans use optic flow as an important navigational cue. Hence it is possible that a user could learn to balance optic flow by balancing the pan of a tone, presented via headphones. In our system, optic flow information is also critical in that it suggests to the user that motion has occurred, possibly necessitating a re-rendering of the scene for further exploration.
The human visual system is particularly good at recognizing edges in images; in many cases, the nature of key objects in a scene is apparent based only on the location of strong edges in the scene. Edge-detection thus represents a significant reduction in the complexity of a scene, while maintaining a disproportionate amount of information, making it ideal for haptic rendering.
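
As an illustration of the hallway-balancing idea described above, the comparison of left and right optic flow might be sketched as follows. This is a minimal sketch and not the project's code: the `FlowSample` type, the `flowBalance` function, and the choice to split the image at its vertical midline are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical container for one local flow vector.
struct FlowSample {
    int x;     // image column at which the flow was measured
    float dx;  // horizontal flow component (pixels per frame)
    float dy;  // vertical flow component (pixels per frame)
};

// Returns a value in [-1, 1]: negative when more flow lies in the left half
// of the image, positive when more lies in the right half, and 0 when the
// two halves are balanced or there is no flow at all.
float flowBalance(const std::vector<FlowSample>& samples, int imageWidth) {
    assert(imageWidth > 0);
    double left = 0.0;
    double right = 0.0;
    for (const FlowSample& s : samples) {
        const double magnitude =
            std::sqrt(static_cast<double>(s.dx) * s.dx +
                      static_cast<double>(s.dy) * s.dy);
        if (s.x < imageWidth / 2) {
            left += magnitude;
        } else {
            right += magnitude;
        }
    }
    const double total = left + right;
    if (total == 0.0) {
        return 0.0f;
    }
    return static_cast<float>((right - left) / total);
}
```

Summing magnitudes assumes that the samples are spread evenly over the image (for example, on a regular grid); with uneven sampling, a per-half average would be the safer choice. A result near zero corresponds to the "centered in the hallway" condition.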

System Description :

Our system currently runs on a PC under Windows 2000; our test machine is a dual-processor 850MHz desktop with 512MB of RAM.

Images are acquired for optic flow and edge-detection via a low-cost USB webcam (a Logitech QuickCam Web), which provides 30fps at 352x288. Images are processed using Intel's OpenCV computer vision library, which performs video capture, local optic flow calculation, and edge-detection.
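
For readers unfamiliar with edge detection, the following sketch shows one common approach: thresholding the Sobel gradient magnitude of a grayscale image. It is an assumption made for illustration only; the text above does not say which OpenCV edge detector was used, and `sobelEdges` and its `threshold` parameter are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Marks a pixel as an edge (255) when the sum of the absolute horizontal and
// vertical Sobel responses reaches `threshold`. Border pixels are left at 0.
// `gray` is a row-major 8-bit image of size width * rows.
std::vector<unsigned char> sobelEdges(const std::vector<unsigned char>& gray,
                                      int width, int rows, int threshold) {
    assert(width >= 3 && rows >= 3);
    assert(gray.size() == static_cast<std::size_t>(width) * rows);
    std::vector<unsigned char> edges(gray.size(), 0);
    auto at = [&gray, width](int r, int c) {
        return static_cast<int>(gray[static_cast<std::size_t>(r) * width + c]);
    };
    for (int r = 1; r + 1 < rows; ++r) {
        for (int c = 1; c + 1 < width; ++c) {
            // Horizontal gradient: right column minus left column.
            const int gx =
                (at(r - 1, c + 1) + 2 * at(r, c + 1) + at(r + 1, c + 1)) -
                (at(r - 1, c - 1) + 2 * at(r, c - 1) + at(r + 1, c - 1));
            // Vertical gradient: bottom row minus top row.
            const int gy =
                (at(r + 1, c - 1) + 2 * at(r + 1, c) + at(r + 1, c + 1)) -
                (at(r - 1, c - 1) + 2 * at(r - 1, c) + at(r - 1, c + 1));
            if (std::abs(gx) + std::abs(gy) >= threshold) {
                edges[static_cast<std::size_t>(r) * width + c] = 255;
            }
        }
    }
    return edges;
}
```

The resulting binary image is the kind of edge map that could then be raised into a surface for haptic rendering.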

Images are acquired for depth-mapping via a Videre Design stereo camera (a Mega-D) at the same resolution. Disparity maps are generated using SRI International's Small Vision System.
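
A disparity value can be related to metric depth through the standard relation for a rectified stereo pair, Z = f * B / d, where f is the focal length in pixels, B is the baseline between the two cameras, and d is the disparity in pixels. The sketch below assumes that relation and those units; the function name and parameters are hypothetical, and the Small Vision System may report disparity in a different scale.

```cpp
#include <cassert>
#include <limits>

// Converts a disparity (in pixels) to depth (in the units of `baseline`)
// for a rectified stereo pair. Zero or negative disparity is treated as
// "no match / infinitely far" and returns +infinity.
float depthFromDisparity(float disparityPx, float focalLengthPx,
                         float baseline) {
    assert(focalLengthPx > 0.0f && baseline > 0.0f);
    if (disparityPx <= 0.0f) {
        return std::numeric_limits<float>::infinity();
    }
    return focalLengthPx * baseline / disparityPx;
}
```

Because depth is inversely proportional to disparity, nearby objects produce large disparities, which is why brighter pixels in a disparity image correspond to closer surfaces.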

Haptic rendering is performed using the Ghost API and the Phantom force-feedback haptic display, both available from Sensable Technologies. We have developed software in C++ to capture images from both camera systems and convert the results into a three-dimensional triangle mesh suitable for rendering using OpenGL and the Ghost API. Our software also computes global optic flow information from the local flow vectors provided by OpenCV, and generates tones accordingly.
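
The conversion from an image to a triangle mesh can be sketched as treating each pixel as a vertex whose height comes from the pixel value, then splitting every grid cell into two triangles. The code below is a minimal sketch under that assumption and is not the project's implementation; `Vertex`, `TriangleMesh`, `heightFieldToMesh`, and `zScale` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vertex {
    float x, y, z;
};

struct TriangleMesh {
    std::vector<Vertex> vertices;
    std::vector<unsigned> indices;  // three vertex indices per triangle
};

// Builds a regular-grid mesh from a row-major height image of size
// width * rows. Pixel (r, c) becomes the vertex (c, r, zScale * height).
TriangleMesh heightFieldToMesh(const std::vector<float>& height, int width,
                               int rows, float zScale) {
    assert(width >= 2 && rows >= 2);
    assert(height.size() == static_cast<std::size_t>(width) * rows);
    TriangleMesh mesh;
    mesh.vertices.reserve(height.size());
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < width; ++c) {
            const float h = height[static_cast<std::size_t>(r) * width + c];
            mesh.vertices.push_back(
                {static_cast<float>(c), static_cast<float>(r), zScale * h});
        }
    }
    for (int r = 0; r + 1 < rows; ++r) {
        for (int c = 0; c + 1 < width; ++c) {
            const unsigned i00 = static_cast<unsigned>(r * width + c);
            const unsigned i01 = i00 + 1;                              // right
            const unsigned i10 = i00 + static_cast<unsigned>(width);   // below
            const unsigned i11 = i10 + 1;                              // diagonal
            // Two triangles per grid cell, sharing the i01-i10 diagonal.
            mesh.indices.insert(mesh.indices.end(),
                                {i00, i10, i01, i01, i10, i11});
        }
    }
    return mesh;
}
```

A grid of width w and rows h yields w * h vertices and 2 * (w - 1) * (h - 1) triangles, so a full 352x288 image would produce roughly 200,000 triangles.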

The program displays the real-time video stream from both cameras, including disparity and optic flow information, and provides an OpenGL representation of the rendered scene, which is simultaneously displayed on the haptic device. The user can press a key at any time to capture either a depth map or an edge map; the user's selection is immediately displayed in the OpenGL window and rendered on the Phantom. The user can then explore the scene using the Phantom; a cursor in the OpenGL window represents the Phantom's current location. The user can press a button on the Phantom to enable translation and rotation of the scene.

Note that we intentionally chose not to update the scene in real-time, although this would be possible with only small changes to our architecture. It is very difficult to explore a moving scene using the Phantom; it is much more effective for the user to capture an image on-demand, and explore the static scene for an unlimited amount of time.

The audio output from the program is silent when optic flow is below a threshold, but increases in volume as optic flow increases in the scene, and pans from left to right with the location of maximum optic flow.
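
The mapping from optic flow to sound described above might be sketched as follows. This is an illustrative sketch, not the project's code: it assumes the flow has already been reduced to one magnitude per image column, and the `AudioCue` type, the `flowToAudioCue` function, and the `saturation` level at which the volume reaches its maximum are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct AudioCue {
    float pan;     // -1 = fully left, +1 = fully right
    float volume;  // 0 = silent, 1 = maximum
};

// columnFlow[i] is the flow magnitude in image column i. The tone is panned
// to the column with the largest flow; the volume rises linearly from 0 at
// `threshold` to 1 at `saturation`, and is 0 below the threshold.
AudioCue flowToAudioCue(const std::vector<float>& columnFlow, float threshold,
                        float saturation) {
    assert(!columnFlow.empty());
    assert(saturation > threshold);
    const auto peak = std::max_element(columnFlow.begin(), columnFlow.end());
    if (*peak < threshold) {
        return {0.0f, 0.0f};  // sub-threshold flow: silent
    }
    const std::size_t index =
        static_cast<std::size_t>(peak - columnFlow.begin());
    float pan = 0.0f;
    if (columnFlow.size() > 1) {
        pan = 2.0f * static_cast<float>(index) /
                  static_cast<float>(columnFlow.size() - 1) -
              1.0f;
    }
    const float volume =
        std::min(1.0f, (*peak - threshold) / (saturation - threshold));
    return {pan, volume};
}
```

With five columns, a threshold of 1, and a saturation of 5, a peak flow of 4 in the rightmost column gives a pan of +1 and a volume of 0.75.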

The system can also read images that are generated offline and render them using the Phantom (and in OpenGL).

Results :

We are able to successfully capture and render arbitrary edge maps and disparity maps, and to convey optic flow information via sound. Although the Web does not currently allow us to let you reach out to your monitor and feel our force-feedback, the OpenGL representations shown below are entirely consistent with the force-feedback that the user feels when operating the Phantom.

Click on any image below to see a larger version. Note that many of the images are 1280x1024 screen captures, and Internet Explorer tends to scale images to fit within its window. If you want to see a screen shot in full detail, download the larger image to your hard drive.

Figure 1 :

We first demonstrate a scene that was generated from a file, rather than from live video. This is a popular test image in the stereo vision community; we have used a stereo image pair (first row) to generate a disparity map (second row, brighter objects are closer to the camera) and render it in OpenGL (third row). As discussed above, the mesh was simultaneously rendered on the haptic feedback device. The red walls you see outside the image constrain the user to the surface of the image, which prevents the user from getting "lost" during haptic exploration.

Figure 2 :

This is a complete screenshot of the system in operation. It shows the GL representation of the rendered scene; the real-time stream from the webcam with optic flow vectors superimposed on the image (JPEG compression makes the flow vectors difficult to see); the edge-detected image; and the disparity map generated in real time from the stereo camera. The occluded window behind the disparity map contains the real-time video feed from the stereo camera. The GL window is currently displaying a rendering of the edge-detected image, which - as always - was simultaneously output on the haptic device. The sign that you can't read in a web browser says "I enjoy Mengkudu Juice", a reference to the original title for this project.

Figure 3 :

Here we display another pair of full screen shots; in this case, the user has chosen to render the depth image rather than the edge-detected image. The second screen shot displays the system's ability to manipulate the 3-d scene (translation and rotation are controlled from the Phantom), which can be useful for enhancing the visual representation and for allowing more flexible exploration with the haptic device.

Figure 4 :

...and a few more interesting screen shots. Once again, each one of the three-dimensional scenes could be "touched" by the user using the Phantom.

Figure 5 :

Several photos of the system in action... the stereo camera and the USB webcam (upper-left), the Phantom (upper-right), the authors using the Phantom to explore a scene (bottom row).

Future Work :

We consider this system to be a prototype; its primary shortcoming is that the haptic representation is not adequately detailed, and thus cannot provide a complete conception of the objects that are present in the visual field. We would like to experiment with the following :

Haptic devices are available or will soon be available that add a "grip" to the one-point force-feedback that is available using the Phantom. We feel that giving the user even a single force-feedback grip will provide a much more convincing representation of the world, since humans tend to explore a haptic environment with more than a single finger.
Similarly, the Ghost API allows for multiple Phantoms; we would be interested to experiment with two-handed exploration or with the use of a Phantom in one hand and a tactile display in the other.
The Phantom has an update rate of 1kHz, which is sufficient to convey texture with astounding accuracy. Our images currently have numerous flat surfaces that could be layered with texture; we expect that texture might be used to represent visual properties that are not available from shape alone; color would be a logical choice.
We are currently throwing out some of the information available in the image due to computational limitations; we are simply limited in our total polygon count. Our 850MHz PIII is fast, but it is certainly not state-of-the-art; we have experimented with a faster machine, and found that we could add significantly more detail to the mesh without any changes to our system.
Most importantly, we would like to present the device to a visually-impaired subject and consider their suggestions very seriously. It is very difficult for us to conceive of how the information that is provided by the system could really be used without visual feedback, since our mechanisms for exploring a scene are fundamentally visual.
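
To make the polygon-count limitation mentioned above concrete, the sketch below picks the smallest subsampling stride at which a regular-grid mesh fits within a triangle budget. It is a hypothetical illustration: the text does not state the actual budget or how the mesh was reduced, and `trianglesAtStride` and `strideForBudget` are names invented for this example.

```cpp
#include <cassert>

// Number of triangles in a regular-grid mesh built from every `stride`-th
// pixel of a width * rows image (two triangles per grid cell).
long long trianglesAtStride(int width, int rows, int stride) {
    assert(width >= 2 && rows >= 2 && stride >= 1);
    const long long cols = (width - 1) / stride + 1;  // samples kept per row
    const long long kept = (rows - 1) / stride + 1;   // samples kept per column
    return 2 * (cols - 1) * (kept - 1);
}

// Smallest stride whose mesh has at most `maxTriangles` triangles.
int strideForBudget(int width, int rows, long long maxTriangles) {
    assert(maxTriangles >= 0);
    int stride = 1;
    while (trianglesAtStride(width, rows, stride) > maxTriangles) {
        ++stride;
    }
    return stride;
}
```

For a 352x288 image, the full-resolution mesh has 201,474 triangles; an assumed budget of 20,000 triangles would require a stride of 4, which yields 12,354 triangles. A faster machine corresponds to a larger budget and therefore a smaller stride and a more detailed mesh.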

Original Proposal :

This project was originally proposed under the name "Indonesian Mengkudu Healthy Optic Assistance System"; the original proposal is available here.

The authors :

Dan Morris and Neel Joshi are both graduate students in Computer Science at Stanford. Dan worked primarily on the haptic-rendering and stereo vision aspects of the project (using the Ghost and SVS libraries); Neel worked primarily with the OpenCV library to capture and process the real-time video stream. All the equipment we used belongs to the Robotics Lab at Stanford.

If you have questions or suggestions, please feel free to contact Dan and Neel.