Very High-Resolution Image-Based Rendering with Motion Extraction

Steve Matuszek


Now includes March 30, 2000 updates, in green

Now includes April 20, 2000 updates, in brown


Michael Naimark, who has recently visited UNC, has taken very high-resolution (35mm ASA 50 film), very well-registered stereoscopic, panoramic footage of historical locations around the world.

Here are 15 images taken from a complete revolution in one of the sites (I forget which).
Here is a panorama I built out of them. Check out the interactive panorama viewer. (You may need to download the plug-in.)

I would like to apply image-based rendering techniques to this data (soon to be ours in the original film form, from which we can obtain digital copies), to result in a three-dimensional environment.

We don't have the physical film, in fact. What we have is a D2 digital tape of the film (just the Timbuktu and Dubrovnik reels), with frame numbers. What we do when we want a specific frame is tell the film lab (which is in San Francisco), and they do the conversion for us.

We then receive the film on DLT tape in Cineon format, which is a subset of DPX format. DPX is a format maintained by SMPTE for the motion picture industry, and contains much useful information in the header regarding color metrics, frame rates, shutter angles, and so forth. Mostly what we are interested in is the 10 bits per color channel of image information.

These files are probably going to have to be converted into something sane before I use them (no way I need 10 bits of color, and it's logarithmic at that). I have located the specifications and have begun writing code to do the conversion.
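A rough sketch of what such a conversion can look like (this is not the code referred to above). It assumes the common Cineon packing of three 10-bit samples per big-endian 32-bit word with the two low bits unused, and the conventional reference points for log film scans (white at code 685, black at code 95, 0.002 density per code value, negative gamma 0.6). The constants, the display gamma of 2.2, and the function names are all assumptions for illustration and should be checked against the header of each file.

```python
import struct

# Assumed conventional reference points for 10-bit log scans (not from this page).
REF_WHITE = 685
REF_BLACK = 95
DENSITY_PER_CODE = 0.002
NEGATIVE_GAMMA = 0.6
DISPLAY_GAMMA = 2.2


def unpack_pixel(word_bytes):
    """Split one big-endian 32-bit word into three 10-bit samples (R, G, B)."""
    (word,) = struct.unpack(">I", word_bytes)
    return (word >> 22) & 0x3FF, (word >> 12) & 0x3FF, (word >> 2) & 0x3FF


def _gain(code):
    # Relative exposure for a log code value; equals 1.0 at reference white.
    return 10.0 ** ((code - REF_WHITE) * DENSITY_PER_CODE / NEGATIVE_GAMMA)


def log10bit_to_linear(code):
    """Map a 10-bit log code value to linear light, clamped to [0, 1]."""
    black = _gain(REF_BLACK)
    linear = (_gain(code) - black) / (1.0 - black)
    return min(1.0, max(0.0, linear))


def log10bit_to_8bit(code):
    """Map a 10-bit log code value to a gamma-encoded 8-bit value."""
    linear = log10bit_to_linear(code)
    return int(round(255.0 * linear ** (1.0 / DISPLAY_GAMMA)))
```

Everything above reference white is clipped here, which throws away highlight detail; a softer roll-off would be a reasonable alternative if the skies matter.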

The code works okay but Adobe After Effects does a better color conversion.

The film lab did a good job on the film-to-tape transfer but this Unix tape they gave us is like a bad joke. Each of these files is tarred individually, with an absolute pathname, and there are many duplicate filenames on different files. Most of the ones I've been able to read claim to be from the right camera, but there are far more than the 72 there should be... and others are corrupted. Here is a full-size frame with lots of JPEG compression.

I wrote a lot of irritating tar scripts trying to make this work. The files are 50 meg each, too. This must be what it felt like to program with punched cards.
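For comparison, here is a minimal sketch of the same chore in Python rather than shell (not the scripts mentioned above). It assumes each archive is a separate file on disk, ignores the stored absolute path, gives every extracted file a unique numbered name so duplicate filenames do not overwrite each other, and skips archives that cannot be read. The function name `extract_flat` and the naming scheme are hypothetical.

```python
import os
import shutil
import tarfile


def extract_flat(tar_paths, out_dir):
    """Extract regular files from each archive into out_dir under unique names."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for tar_path in tar_paths:
        try:
            with tarfile.open(tar_path) as archive:
                for member in archive:
                    if not member.isfile():
                        continue
                    # Drop the stored (possibly absolute) directory part.
                    base = os.path.basename(member.name)
                    # A running counter keeps duplicate filenames apart.
                    target = os.path.join(out_dir, f"{len(written):04d}_{base}")
                    source = archive.extractfile(member)
                    with open(target, "wb") as out:
                        shutil.copyfileobj(source, out)
                    written.append(target)
        except (tarfile.TarError, EOFError, OSError) as err:
            # Corrupted or unreadable archive: report it and move on.
            print(f"skipping {tar_path}: {err}")
    return written
```

Copying through `extractfile` instead of calling `extractall` means nothing is ever written to the absolute path recorded in the archive.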

In the meantime, I have been using the data from DV digital video tapes. Unfortunately this introduces interlacing. Please read my discussion on interlacing issues.

Now I'm using the images from D2 tape, which is higher quality and which I think isn't interlaced. I have found a sequence I really want to use, of San Francisco -- in the foreground is a regular geometric tiling pattern, with a cool building in the midground with waterfalls on it. I think that this footage would lend itself well to depth extraction and my fake transforms. The waterfalls would be tagged as moving objects, so I would render them by cycling the pixel data on top of the image objects that result from the stationary scene.
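The idea of cycling pixel data over a stationary scene can be sketched in two dimensions as follows. This is only an illustration under simplifying assumptions: images are nested lists of pixel values, the moving region is given as a boolean mask of the same size, and the loop frames are already aligned with the background. The name `composite_frame` is hypothetical, and the real thing would composite onto rendered image objects rather than a flat image.

```python
def composite_frame(background, loop_frames, moving_mask, t):
    """Build frame t: static pixels from background, moving pixels from the loop.

    background   -- 2D list of pixel values for the stationary scene
    loop_frames  -- list of 2D lists, the footage to cycle through
    moving_mask  -- 2D list of booleans, True where the scene moves
    t            -- integer time step; the loop repeats every len(loop_frames)
    """
    cycle = loop_frames[t % len(loop_frames)]
    height, width = len(background), len(background[0])
    return [
        [cycle[y][x] if moving_mask[y][x] else background[y][x]
         for x in range(width)]
        for y in range(height)
    ]
```

For example, with a three-frame loop, `composite_frame(bg, loop, mask, 4)` shows the same waterfall pixels as time step 1.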

By the way, the other people using this data are Anselmo Lastra and Voicu Popescu. They are the ones working on extracting depth from the data. When it comes time for me to create image-based objects, I will either use their results or an existing depth-from-stereo library.

Here are two frames from Timbuktu, the high-resolution scans from the film lab. They are five degrees apart, which is more than I probably want to work with but which is instructive. These are shown at one-eighth resolution; click on them to see one-quarter.

My code attempts to find the horizontal offset between these images by trying different values, taking the difference between the offset images, and minimizing over those results. Here is the difference with no offset:

And here is the difference from the best match found (also cropped):

It's an improvement, but not a close enough match to use as a mask for stationary/moving discrimination. I need to
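The offset search described above (try each horizontal offset, difference the shifted images, keep the minimum) can be sketched roughly as follows. This is not the actual code: it assumes grayscale images stored as nested lists of equal size, scores each offset by the mean absolute difference over the overlapping columns only, and uses hypothetical function names.

```python
def mean_abs_difference(left, right, offset):
    """Mean of |left[y][x + offset] - right[y][x]| over the overlapping columns.

    Returns None when the two images do not overlap at this offset.
    """
    width = len(left[0])
    x_start = max(0, -offset)
    x_end = min(width, width - offset)
    if x_end <= x_start:
        return None
    total = 0
    count = 0
    for row_left, row_right in zip(left, right):
        for x in range(x_start, x_end):
            total += abs(row_left[x + offset] - row_right[x])
            count += 1
    return total / count


def best_horizontal_offset(left, right, max_offset):
    """Try every offset in [-max_offset, max_offset]; return the best-scoring one."""
    scored = []
    for offset in range(-max_offset, max_offset + 1):
        score = mean_abs_difference(left, right, offset)
        if score is not None:
            scored.append((score, offset))
    return min(scored)[1]
```

Dividing by the number of overlapping pixels matters: a raw sum would favor large offsets simply because fewer pixels overlap.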

What I hope to contribute on top of existing techniques is:


This is one of the best-registered and highest-resolution stereoscopic data sets extant, and since UNC has in evans one of the biggest number crunchers being applied in research, an environment of unmatched realism could ideally result. Furthermore, these are no "example" locations such as someone's graphics lab; these are UNESCO-designated "in danger" World Heritage Sites, such as Jerusalem, Dubrovnik (Croatia), Timbuktu (Mali), and Angkor (Cambodia).

Most of the features in these images appear to be nearly planar, and could actually be well-represented by large textured polygons. (The sandy ground on the market floor, for example, is unlikely to be looked at closely enough to require modeling of the footprints.)

This would certainly outperform the image-based objects that we display from laser data (such as the reading room), which are essentially enormous polygon soups. Perhaps more aggressive surface simplification is needed.


Working for Dr. Henry Fuchs, I am involved with applications that use image-based rendering, but I am not myself that conversant with the techniques. I'd like to get experience using them, and possibly extending them. Also in that project, I have tried with limited success to understand and modify our existing display code. Hopefully, I can start from first principles to create code that makes more sense (to myself, anyway).

If Andrei State et al. get the DPLEX working so that different processors can pump geometry through different pipes, I can take advantage of this, since I am writing the application from the ground up rather than having to modify existing code, which has problems such as using GLUT. I would find this satisfying personally, as only toy applications currently run correctly in this manner.

We have some Timbuktu images showing on a seamlessly combined multi-projector setup, about 90 horizontal degrees in all. If I can get animated (if not 3D immersive) images showing on that, it would be pretty cool.


Additional references:


The demonstration will be a run of the application that displays the resulting 3-D environment, possibly on the head-mounted display, but more likely just with mouse interaction.

Did you remember to read the page with all the angles and diagrams and stuff?

To sum up, I still have to