Structure from Motion
Hey students! Welcome to one of the most exciting areas of computer vision - Structure from Motion (SfM)! This lesson will take you on a journey from flat 2D images to amazing 3D reconstructions. By the end of this lesson, you'll understand how your phone can create 3D models from photos, how autonomous cars "see" the world, and how movie studios create stunning visual effects. We'll explore multi-view geometry, feature tracking, camera pose estimation, and the complete pipeline that makes 3D reconstruction possible from just a collection of 2D images.
Understanding the Magic Behind Structure from Motion
Structure from Motion is like being a detective with a camera. Imagine you're walking around a statue, taking photos from different angles. Your brain naturally understands that all these photos show the same object from different viewpoints. SfM teaches computers to do the same thing - it analyzes multiple 2D images to figure out both the 3D structure of what's being photographed and where the camera was positioned when each photo was taken.
The fundamental principle behind SfM comes from the fact that when we see the same point in multiple images taken from different positions, we can use triangulation to determine where that point exists in 3D space. This is similar to how your two eyes work together to give you depth perception - each eye sees a slightly different view, and your brain combines them to understand distance.
Real-world applications are everywhere around us! Google Street View uses SfM to create its immersive 3D maps. Your smartphone's panorama mode applies closely related multi-view geometry (though stitching a pure rotation produces a panorama rather than 3D structure). Archaeologists use SfM to digitally preserve historical sites, and filmmakers use it to create realistic digital environments. Commercial interest in photogrammetry continues to grow as AR/VR technologies advance.
Multi-View Geometry: The Mathematical Foundation
Multi-view geometry is the mathematical backbone that makes SfM possible. When you have two or more images of the same scene taken from different positions, there are geometric relationships between corresponding points that we can exploit.
The most important concept here is the epipolar constraint. A 3D point and the centers of the two cameras observing it define a plane in space, called the epipolar plane. That plane intersects each image in an "epipolar line" - if you know where a point appears in one image, its match must lie somewhere along a specific line in the second image. This dramatically reduces the search space when trying to match features between images.
The essential matrix and fundamental matrix are key mathematical tools that encode these geometric relationships. The essential matrix relates corresponding points in two images when we know the camera's internal parameters (like focal length), while the fundamental matrix works even when we don't know these parameters. The fundamental matrix satisfies the constraint $x_2^T F x_1 = 0$, where $x_1$ and $x_2$ are corresponding points in the two images (in homogeneous pixel coordinates); the essential matrix satisfies the same relation in normalized, calibrated coordinates.
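To make these relations concrete, here is a minimal sketch using OpenCV and NumPy. It assumes `pts1` and `pts2` are Nx2 float arrays of matched pixel coordinates produced by a feature matcher (the subject of the next section); the function name and the printed diagnostic are illustrative choices, not part of any standard API.

```python
import numpy as np
import cv2

def epipolar_geometry(pts1, pts2):
    """Estimate F from matched pixel points (Nx2 float arrays) and
    check the epipolar constraint x2^T F x1 = 0 on the inliers."""
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                     ransacReprojThreshold=1.0, confidence=0.99)
    inliers = mask.ravel().astype(bool)

    # Homogeneous coordinates for the algebraic constraint check.
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    residuals = np.abs(np.sum((x2 @ F) * x1, axis=1))
    print("median |x2^T F x1| over inliers:", np.median(residuals[inliers]))

    # Epipolar line in image 2 for each point in image 1: l2 = F x1.
    lines2 = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 1, F)
    return F, lines2.reshape(-1, 3)
```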
For multiple views, we extend these concepts using trifocal tensors for three views, and more generally, we work with projective geometry. The camera projection can be modeled as $x = PX$ (in homogeneous coordinates), where $x$ is the 2D image point, $X$ is the 3D world point, and $P$ is the $3 \times 4$ camera projection matrix that combines the camera's internal parameters with its position and orientation.
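As a quick worked example of $x = PX$, the NumPy snippet below builds a projection matrix as $P = K[R \mid t]$ from hypothetical intrinsics (focal length 800 pixels, principal point at (320, 240)) and an identity pose, then projects one 3D point into pixel coordinates:

```python
import numpy as np

# Hypothetical intrinsics: focal length 800 px, principal point (320, 240).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                       # camera axes aligned with the world
t = np.zeros((3, 1))                # camera at the world origin
P = K @ np.hstack([R, t])           # 3x4 projection matrix P = K [R | t]

X = np.array([0.5, -0.2, 4.0, 1.0])  # homogeneous 3D world point
x = P @ X                            # project: x = P X
u, v = x[0] / x[2], x[1] / x[2]      # dehomogenize to pixel coordinates
print(u, v)                          # -> 420.0 200.0
```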
Feature Detection and Tracking: Finding Correspondences
Before we can apply multi-view geometry, we need to find the same points across multiple images - this is called feature matching. Think of features as distinctive landmarks that are easy to recognize from different viewpoints.
SIFT (Scale-Invariant Feature Transform) was one of the breakthrough algorithms in this area. SIFT features are designed to be invariant to changes in scale, rotation, and lighting conditions. When you take a photo of a building from far away and then up close, SIFT can still match the same corner points despite the dramatic difference in appearance.
ORB (Oriented FAST and Rotated BRIEF) is a more modern, faster alternative that's widely used in real-time applications. While SIFT might detect thousands of features per image, ORB can detect and describe features much faster, making it suitable for mobile devices and robotics applications.
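The sketch below shows how one might create both detectors with OpenCV and compare their output on a single image; the image path and the feature cap of 5,000 are placeholder choices:

```python
import cv2

# Hypothetical input image; any grayscale photo works.
img = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)

# SIFT: float descriptors, invariant to scale/rotation, slower but robust.
sift = cv2.SIFT_create(nfeatures=5000)
kp_sift, desc_sift = sift.detectAndCompute(img, None)

# ORB: binary descriptors, much faster, well suited to real-time use.
orb = cv2.ORB_create(nfeatures=5000)
kp_orb, desc_orb = orb.detectAndCompute(img, None)

print(f"SIFT: {len(kp_sift)} keypoints, {desc_sift.shape[1]}-dim float descriptors")
print(f"ORB:  {len(kp_orb)} keypoints, {desc_orb.shape[1]}-byte binary descriptors")
```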
The feature matching process involves several steps:
- Detection: Finding interesting points in each image
- Description: Creating a mathematical "fingerprint" for each point
- Matching: Comparing fingerprints across images to find correspondences
- Filtering: Removing incorrect matches using geometric constraints
A typical image might yield 1,000-5,000 good features, and modern algorithms can achieve matching accuracies of over 95% when images have sufficient overlap (typically 60-80% is ideal).
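Putting the four steps together, here is a hedged sketch of a SIFT-based matching pipeline in OpenCV, using Lowe's ratio test for the matching step and RANSAC fundamental-matrix estimation for the geometric filtering step. Swapping in `cv2.ORB_create()` with a Hamming-distance matcher (`cv2.NORM_HAMMING`) would give the faster ORB variant.

```python
import numpy as np
import cv2

def match_features(img1, img2):
    """Detection -> description -> matching -> geometric filtering."""
    sift = cv2.SIFT_create()
    kp1, desc1 = sift.detectAndCompute(img1, None)  # detection + description
    kp2, desc2 = sift.detectAndCompute(img2, None)

    # Matching: take the 2 nearest neighbors per descriptor, then apply
    # Lowe's ratio test to discard ambiguous matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(desc1, desc2, k=2)
            if m.distance < 0.75 * n.distance]

    # Geometric filtering: keep only matches consistent with a
    # fundamental matrix estimated by RANSAC.
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    _, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    return [m for m, keep in zip(good, mask.ravel()) if keep]
```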
Camera Pose Estimation: Where Am I?
Once we have feature correspondences, we need to figure out where the camera was positioned and oriented when each photo was taken. This is called camera pose estimation, and it's crucial for accurate 3D reconstruction.
The process typically starts with two-view pose estimation. Given matched features between two images, we can estimate the relative position and orientation between the two camera positions. This involves solving for the essential matrix and then decomposing it to get rotation and translation parameters.
However, there's a catch - we can only determine the relative scale, not the absolute scale. If you're photographing a toy car or a real car, the SfM algorithm can't tell the difference without additional information! This is why SfM reconstructions are determined only up to an arbitrary global scale.
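A minimal two-view pose sketch with OpenCV, assuming matched Nx2 pixel point arrays and a known intrinsic matrix `K`; note that the returned translation is a unit vector, which is exactly the scale ambiguity described above:

```python
import cv2

def two_view_pose(pts1, pts2, K):
    """Relative pose from matched pixel points (Nx2) and intrinsics K.
    The translation comes back as a unit vector: scale is unobservable."""
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # recoverPose tests the four (R, t) decompositions of E and keeps the
    # one that puts the triangulated points in front of both cameras.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t
```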
Bundle Adjustment is the gold standard for refining camera poses and 3D points simultaneously. It's called "bundle" because it adjusts the bundle of light rays connecting 3D points to their image projections. The algorithm minimizes the reprojection error - the difference between where a 3D point should appear in an image versus where it actually appears.
The optimization problem can be written as:
$$\min_{\{R_i, t_i\}, \{X_j\}} \sum_{i,j} \rho(||x_{ij} - \pi(R_i X_j + t_i)||^2)$$
where $R_i$ and $t_i$ are the rotation and translation for camera $i$, $X_j$ are the 3D points, $x_{ij}$ are the observed 2D points, $\pi$ is the camera projection function, and $\rho$ is a robust loss function (such as the Huber loss) that limits the influence of outlier observations.
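Production bundle adjusters are substantial pieces of software, but the core residual is easy to sketch. The function below evaluates $x_{ij} - \pi(R_i X_j + t_i)$ for a batch of observations using SciPy; the parameter packing (a rotation vector and translation per camera, followed by all 3D points) and all variable names are illustrative choices under stated assumptions, and the robust loss $\rho$ can be supplied through the `loss` argument of `least_squares`.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, n_cams, n_pts, K, cam_idx, pt_idx, observed):
    """Residuals x_ij - pi(R_i X_j + t_i) for all observations. `params`
    packs a 6-vector (rotation vector, translation) per camera, then all
    3D points; cam_idx/pt_idx give each observation's camera and point."""
    cams = params[:n_cams * 6].reshape(n_cams, 6)
    pts3d = params[n_cams * 6:].reshape(n_pts, 3)

    R = Rotation.from_rotvec(cams[cam_idx, :3])      # one rotation per observation
    Xc = R.apply(pts3d[pt_idx]) + cams[cam_idx, 3:]  # into each camera frame
    proj = (K @ Xc.T).T
    proj = proj[:, :2] / proj[:, 2:3]                # perspective division
    return (proj - observed).ravel()

# Hypothetical usage: params0 stacks initial poses and points, and the
# robust loss (the rho in the formula) is passed via loss="huber".
# result = least_squares(reprojection_residuals, params0, loss="huber",
#                        args=(n_cams, n_pts, K, cam_idx, pt_idx, observed))
```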
Sparse 3D Reconstruction Pipeline
The complete SfM pipeline brings everything together in a carefully orchestrated sequence. Modern SfM systems typically follow either an incremental or global approach.
Incremental SfM (like the popular COLMAP software) starts with two images, estimates their relative pose, triangulates 3D points, then gradually adds more images one by one. Each new image is positioned by finding its pose relative to the already-reconstructed 3D points. This approach is robust but can accumulate errors over long sequences.
Global SfM attempts to estimate all camera poses simultaneously using the complete set of image pairs. While this can be more accurate in theory, it's also more challenging to make robust against outliers and incorrect matches.
A typical pipeline includes these stages:
- Feature extraction and matching across all image pairs
- Image graph construction where images are nodes and matches are edges
- Initial pair selection to start reconstruction with the most reliable image pair
- Incremental pose estimation and 3D point triangulation (see the triangulation sketch after this list)
- Bundle adjustment after adding each new image
- Loop closure detection to handle cases where the camera returns to previously visited areas
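As a sketch of the triangulation stage, the helper below wraps OpenCV's `cv2.triangulatePoints`, which solves for each 3D point in homogeneous coordinates; the commented usage assumes the `R`, `t`, and `K` from the two-view pose example earlier, with the first camera fixed at the origin.

```python
import numpy as np
import cv2

def triangulate(P1, P2, pts1, pts2):
    """Triangulate matched 2D points (Nx2 float arrays) given two 3x4
    projection matrices; returns Nx3 points in an arbitrarily scaled frame."""
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # 4xN homogeneous
    return (X_h[:3] / X_h[3]).T                          # dehomogenize

# Hypothetical usage with R, t, K from the two-view pose example above:
# P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera at the origin
# P2 = K @ np.hstack([R, t])                         # second camera
# points3d = triangulate(P1, P2, inlier_pts1, inlier_pts2)
```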
Modern systems can process hundreds of images in minutes on standard hardware. For example, COLMAP can reconstruct a scene from 100 four-megapixel images in roughly 10-30 minutes on a modern laptop, producing thousands of sparse 3D points; because of scale ambiguity, metric accuracy is only meaningful once the reconstruction is registered to real-world units.
The quality of reconstruction depends heavily on the input images. Best practices include: maintaining 60-80% overlap between adjacent images, avoiding pure rotation (include translation), ensuring good lighting conditions, and including images from multiple viewpoints around objects of interest.
Conclusion
Structure from Motion represents one of the most elegant solutions in computer vision - transforming simple 2D photographs into rich 3D understanding. We've explored how multi-view geometry provides the mathematical foundation, how feature tracking creates correspondences across images, how camera pose estimation determines viewpoints, and how the complete reconstruction pipeline brings it all together. From archaeological preservation to autonomous vehicles, SfM continues to shape how we capture, understand, and interact with our 3D world. The field continues to evolve with deep learning integration and real-time implementations, making 3D reconstruction more accessible and accurate than ever before.
Study Notes
• Structure from Motion (SfM): Technique to reconstruct 3D scenes and camera poses from multiple 2D images
• Epipolar Constraint: Corresponding points in two images lie on epipolar lines, reducing search space for feature matching
• Essential Matrix: Encodes geometric relationship between corresponding points when camera parameters are known
• Fundamental Matrix: Works without knowing camera internal parameters; satisfies $x_2^T F x_1 = 0$
• Camera Projection: Modeled as $x = PX$ where $P$ is $3 \times 4$ projection matrix
• SIFT Features: Scale-invariant feature transform, robust to scale, rotation, and lighting changes
• ORB Features: Oriented FAST and Rotated BRIEF, faster alternative to SIFT for real-time applications
• Bundle Adjustment: Simultaneously optimizes camera poses and 3D points by minimizing reprojection error
• Reprojection Error: Difference between predicted and observed 2D point locations in images
• Incremental SfM: Starts with two images, gradually adds more images one by one
• Global SfM: Estimates all camera poses simultaneously using complete image set
• Triangulation: Process of determining 3D point location from multiple 2D observations
• Scale Ambiguity: SfM can only determine relative scale, not absolute scale without additional information
• Image Overlap: 60-80% overlap between adjacent images recommended for robust reconstruction
• Feature Matching Pipeline: Detection → Description → Matching → Geometric filtering
