Very High-Resolution Image-Based Rendering with Motion Extraction

Steve Matuszek


Now includes March 30, 2000 updates, in green

Now includes April 20, 2000 updates, in brown

Now includes May 12, 2000 conclusions, in red


Michael Naimark, who has recently visited UNC, has taken very high-resolution (35mm ASA 50 film), very well-registered stereoscopic, panoramic footage of historical locations around the world.

Here are 15 images taken from a complete revolution in one of the sites (I forget which).
Here is a panorama I built out of them. Check out the interactive panorama viewer. (You may need to download the plug-in.)

I would like to apply image-based rendering techniques to this data (soon to be ours in the original film form, from which we can obtain digital copies) to produce a three-dimensional environment.

We don't have the physical film, in fact. What we have is a D2 digital tape of the film (just the Timbuktu and Dubrovnik reels), with frame numbers. What we do when we want a specific frame is tell the film lab (which is in San Francisco), and they do the conversion for us.

We then receive the film on DLT tape in Cineon format, the Kodak film-scanning format from which the DPX format was derived. DPX is maintained by SMPTE for the motion picture industry, and its header contains much useful information regarding color metrics, frame rates, shutter angles, and so forth. Mostly what we are interested in is the 10 bits per color channel of image information.

These files are probably going to have to be converted into something sane before I use them (no way I need 10 bits of color, and it's logarithmic at that). I have located the specifications and have begun writing code to do the conversion.
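As a rough illustration of what such a conversion involves, here is a minimal sketch that maps 10-bit logarithmic code values to 8-bit display values. It is not the code referred to above. The reference white and black points, the density step, and both gamma values are assumed defaults of the kind commonly quoted for Cineon-style data, and the function name is hypothetical; real scans may need different numbers.

```python
import numpy as np

# Assumed Cineon-style defaults; none of these values come from the
# project notes, and real files may call for different ones.
REF_WHITE = 685          # 10-bit code value treated as white
REF_BLACK = 95           # 10-bit code value treated as black
DENSITY_PER_CODE = 0.002 # printing density represented by one code step
NEGATIVE_GAMMA = 0.6     # gamma of the film negative
DISPLAY_GAMMA = 2.2      # hypothetical display gamma for the 8-bit output


def cineon_log_to_8bit(codes):
    """Map 10-bit logarithmic code values (0..1023) to 8-bit display values."""
    codes = np.asarray(codes, dtype=np.float64)
    scale = DENSITY_PER_CODE / NEGATIVE_GAMMA
    # Linear light relative to reference white, before black is removed.
    gain = 10.0 ** ((codes - REF_WHITE) * scale)
    black = 10.0 ** ((REF_BLACK - REF_WHITE) * scale)
    # Rescale so reference black maps to 0 and reference white to 1.
    linear = np.clip((gain - black) / (1.0 - black), 0.0, 1.0)
    # Gamma-encode for display and quantize to 8 bits.
    return np.round(255.0 * linear ** (1.0 / DISPLAY_GAMMA)).astype(np.uint8)
```

Code values above the assumed white point are simply clipped here, which throws away highlight detail; a more careful conversion would roll them off instead.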

The code works okay but Adobe After Effects does a better color conversion.

The film lab did a good job on the film-to-tape transfer but this Unix tape they gave us is like a bad joke. Each of these files is tarred individually, with an absolute pathname, and there are many duplicate filenames on different files. Most of the ones I've been able to read claim to be from the right camera, but there are far more than the 72 there should be... and others are corrupted. Here is a full-size frame with lots of JPEG compression.

I wrote a lot of irritating tar scripts trying to make this work. The files are 50 meg each, too. This must be what it felt like to program with punched cards.

It turned out that the files were okay, but they were on the tape in a very strange order. More importantly, however, the stereo pairs I thought were synced were about a second apart, making them useless for stereo.

In the meantime, I have been using the data from DV digital video tapes. Unfortunately this introduces interlacing. Please read my discussion on interlacing issues.

Now I'm using the images from D2 tape, which is higher quality and which I think doesn't interlace. I have found a sequence I really want to use, of San Francisco -- in the foreground is a regular geometric tiling pattern, with a cool building in the midground with waterfalls on it. I think that this footage would lend itself well to depth extraction and my fake transforms. The waterfalls would be tagged as moving objects, and I would therefore render them by cycling the pixel data, on top of the image objects that result from the stationary scene.

The D2 frames come out 720 x 480 and are very useful, but they don't compare to the 4k x 3k.

Here is a sample frame from Dubrovnik:

You can see in this stereogram that the stereo is preserved very nicely (with some careful matching of frames).

So once I had finally captured those, I could try to utilize them. Here is the conceptual pipeline for this data:
stereo footage
    |   register left-right, and record
stereo pairs
    |   depth-from-stereo
separate depth images
    |   combination of points, including tagging moving objects
colored point cloud
    |   removal or animation of moving objects
colored point cloud
    |   converted to triangles
triangle mesh
    |   custom display code

Previously, I had been thinking in terms of extracting motion directly from the image data. See below:

Here are two frames from Timbuktu, the high-resolution scans from the film lab. They are five degrees apart, which is more than I probably want to work with but which is instructive. These are shown at one-eighth resolution; click on them to see one-quarter.

My code attempts to find the horizontal offset between these images by trying different values, taking the difference between the offset images, and minimizing over those results. Here is the difference with no offset:

And here is the difference from the best match found (also cropped):
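The offset search described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual code: it assumes two grayscale frames of equal size held as NumPy arrays, uses the mean absolute difference over the overlapping columns as the score, and the function name and `max_shift` parameter are hypothetical.

```python
import numpy as np


def best_horizontal_offset(a, b, max_shift):
    """Return (shift, score) minimizing the mean of |a[:, x + shift] - b[:, x]|.

    a and b are 2-D grayscale arrays of equal shape; max_shift must be
    smaller than the image width.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    width = a.shape[1]
    best_shift, best_score = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        # Crop both images to the columns that overlap under this shift.
        if s >= 0:
            diff = a[:, s:] - b[:, :width - s]
        else:
            diff = a[:, :width + s] - b[:, -s:]
        # Use the mean rather than the sum so narrow overlaps are not favored.
        score = np.abs(diff).mean()
        if score < best_score:
            best_shift, best_score = s, score
    return best_shift, best_score
```

The cropped difference image at the winning shift is what would then serve as the candidate mask for moving pixels.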

It's an improvement, but not a close enough match to use as a mask for stationary/moving discrimination. I need to

I've now got images as close together as I want. Here is a poorly matching offset, just so you can see what is going on:

And here are the same two images, but at a much better offset.

Note that the man, who is moving around, shows up clearly. A combination of multiple frames would remove him quite effectively. Here is how I combine frames:

The difference between those two, with proper offset, looks like

That is brightened way up to be more visible. Note that nearer the edges the rotation is less well approximated by a translation, so the correspondence is worse. One of the advantages of having enormous angular resolution is that we can use just the part in the very middle if we like.
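To put a number on how the translation approximation degrades toward the edges, here is a small sketch. It assumes an ideal pinhole camera panning about its center of projection, with column positions measured in pixels from the image center and the focal length also in pixels; none of these values come from the actual footage, and the function name is hypothetical.

```python
import math


def rotation_shift(x, focal, theta):
    """Horizontal shift in pixels of image column x under a pan of theta radians.

    x is measured from the image center and focal is the focal length in
    pixels (both assumptions about the camera model, not measured values).
    """
    # A column at x views angle atan(x / focal); panning adds theta to it.
    return focal * math.tan(math.atan2(x, focal) + theta) - x
```

For a small pan the shift is close to `focal * theta` at the center and grows roughly as `focal * theta * (1 + (x / focal) ** 2)` away from it, so a single translation can only match one column exactly.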

Using that image as a mask gives us:

as the background and mover pixels respectively. We can then combine the two background images. We average them where they are both valid and use only the valid one where one is invalid. Where neither is valid, an inoffensive background color is chosen.
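The combination rule just described can be sketched like this, assuming the two frames are already registered grayscale arrays and that each comes with a boolean mask marking its stationary (valid) pixels. The function name and the default fill value are hypothetical.

```python
import numpy as np


def combine_backgrounds(img_a, img_b, valid_a, valid_b, fill=128.0):
    """Merge two registered grayscale frames using boolean 'stationary' masks."""
    img_a = np.asarray(img_a, dtype=np.float64)
    img_b = np.asarray(img_b, dtype=np.float64)
    valid_a = np.asarray(valid_a, dtype=bool)
    valid_b = np.asarray(valid_b, dtype=bool)
    # Start from the inoffensive background color (neither frame valid).
    out = np.full(img_a.shape, fill, dtype=np.float64)
    # Average the two frames where both are valid.
    both = valid_a & valid_b
    out[both] = 0.5 * (img_a[both] + img_b[both])
    # Where only one frame is valid, use that frame alone.
    only_a = valid_a & ~valid_b
    out[only_a] = img_a[only_a]
    only_b = valid_b & ~valid_a
    out[only_b] = img_b[only_b]
    return out
```

Extending this to more than two frames would amount to summing the valid pixels and dividing by a per-pixel count of valid frames.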

The "moving" pixels could then be shown on top of this, giving us the people and animals moving around but the background constant. It would look much better with more pictures averaged in, but unfortunately I designed my pipeline to pretty much only use two images at once.

I had abandoned this approach in favor of detecting movement at the depth image level...

By the way, the other people using this data are Anselmo Lastra and Voicu Popescu. They are the ones working on extracting depth from the data. When it comes time for me to create image-based objects, I will either use their results or an existing depth-from-stereo library.

Famous last words! I tried to write this code myself instead, and I am largely stuck. I understand that you can determine depth from disparity, and find disparity by offsetting the left and right images until they match at a feature.
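For reference, this is the usual relation behind depth from disparity. It assumes (the camera rig details are not given here) that the two cameras are parallel and the images rectified, with f the focal length in pixels, B the distance between the two cameras, and d the disparity in pixels:

```latex
Z = \frac{f\,B}{d}
```

Distant points therefore have small disparity, and a one-pixel error in d matters far more for distant points than for near ones.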

The first and second strips are the left and right images. The lower strips represent the differences between left and right for a single scanline, offset left (top) to right (bottom). Clearly, the best matches are at the centers of the Xs. Note on the right side how the centers go downwards; since those points are farther away, that's exactly what we want to see.
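A difference strip of this kind can be built for one scanline roughly as follows. This is a sketch under assumptions: the two rows are equal-length grayscale arrays, offsets run from `-max_offset` (top row) to `+max_offset` (bottom row), and positions with no overlap are marked NaN. The function name and parameter are hypothetical.

```python
import numpy as np


def scanline_difference_strip(left_row, right_row, max_offset):
    """Return a (2 * max_offset + 1, width) array of absolute differences.

    Row i corresponds to offset d = i - max_offset, and entry [i, x] is
    |left[x] - right[x - d]|. Entries where x - d falls outside the row
    are NaN.
    """
    left_row = np.asarray(left_row, dtype=np.float64)
    right_row = np.asarray(right_row, dtype=np.float64)
    n = left_row.size
    strip = np.full((2 * max_offset + 1, n), np.nan)
    for i, d in enumerate(range(-max_offset, max_offset + 1)):
        if d >= 0:
            strip[i, d:] = np.abs(left_row[d:] - right_row[:n - d])
        else:
            strip[i, :n + d] = np.abs(left_row[:n + d] - right_row[-d:])
    return strip
```

Taking `np.nanargmin(strip, axis=0) - max_offset` picks the best offset per column, but wherever the scanline is flat, many offsets give the same near-zero difference and the choice is arbitrary.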

Unfortunately, not every pixel is a feature. You can see that for most of the x values, a whole lot of no difference must somehow yield a preference. Here is the least bad correspondence I was able to get:

You can see the general structure. But the pixels with no difference around them get random quickly. I think the answer is some sort of feature pre-selecting:

and interpolating between those points. That should work great for big planar objects like buildings. So, I am stuck at depth from stereo. That makes it hard to move on to combining the depth views, while extracting motion.
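One way the feature pre-selection idea might look for a single scanline is sketched below. It is only an illustration of the approach, not working project code. It assumes that a "feature" is a pixel with a large horizontal intensity gradient, that matching uses a small window with a sum of absolute differences, and that the gaps are filled by plain linear interpolation. The function name, window size, and gradient threshold are all hypothetical.

```python
import numpy as np


def sparse_to_dense_disparity(left_row, right_row, max_disparity,
                              half_window=2, min_gradient=10.0):
    """Match only high-gradient pixels, then interpolate disparity between them."""
    left_row = np.asarray(left_row, dtype=np.float64)
    right_row = np.asarray(right_row, dtype=np.float64)
    n = left_row.size
    gradient = np.abs(np.gradient(left_row))
    xs, ds = [], []
    # Stay far enough from the borders that every window fits in both rows.
    for x in range(half_window + max_disparity, n - half_window):
        if gradient[x] < min_gradient:
            continue  # not a feature: skip textureless pixels
        patch = left_row[x - half_window:x + half_window + 1]
        costs = []
        for d in range(max_disparity + 1):
            lo = x - d - half_window
            candidate = right_row[lo:lo + 2 * half_window + 1]
            costs.append(np.abs(patch - candidate).sum())
        xs.append(x)
        ds.append(int(np.argmin(costs)))
    if not xs:
        return np.zeros(n)  # no features found; nothing to interpolate
    # Linear interpolation between features; end values are held constant.
    return np.interp(np.arange(n), xs, ds)
```

Linear interpolation fits the planar-surface assumption, but it will smear depth across any real discontinuity that lies between two features.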

What I hope to contribute on top of existing techniques is:


This is one of the best-registered and highest-resolution stereoscopic data sets extant, and since UNC has in evans one of the biggest number crunchers being applied in research, an environment of unmatched realism could ideally result. Furthermore, these are no "example" locations such as someone's graphics lab; these are UNESCO-designated "in danger" World Heritage Sites, such as Jerusalem, Dubrovnik (Croatia), Timbuktu (Mali), and Angkor (Cambodia).

Most of the features in these images appear to be nearly planar, and could actually be well-represented by large textured polygons. (The sandy ground on the market floor, for example, is unlikely to be looked at closely enough to require modeling of the footprints.)

This would certainly outperform the image-based objects that we display from laser data (such as the reading room), which are essentially enormous polygon soups. Perhaps more aggressive surface simplification is needed.


Working for Dr. Henry Fuchs, I am involved with applications that use image-based rendering, but I am not myself that conversant with the techniques. I'd like to get experience using them, and possibly extending them. Also in that project, I have tried with limited success to understand and modify our existing display code. Hopefully, I can start from first principles to create code that makes more sense (to myself, anyway).

If Andrei State et al. get the DPLEX working so that different processors can pump geometry through different pipes, I can take advantage of this, since I am writing the application from the ground up rather than having to modify existing code which has problems such as using GLUT. I would find this satisfying personally, as only toy applications currently run correctly in this manner.

We have some Timbuktu images showing on a seamlessly combined multi-projector setup, about 90 horizontal degrees in all. If I can get animated (if not 3D immersive) images showing on that it would be pretty cool.

I really think that if I could get some kind of depth images, the recombination and motion extraction part would work right on it. Depth from stereo is a fascinating problem but not, unfortunately, the one I originally wanted to address.


Additional references:


The demonstration will be a run of the application that displays the resulting 3-D environment, possibly on the head-mounted display, but more likely just with mouse interaction.

Did you remember to read the page with all the angles and diagrams and stuff?

To sum up, I still have to