4. Geometric Vision

Stereo Vision

Explain stereo correspondence, disparity computation, and depth map generation from paired images.

Hey students! 👋 Welcome to our exciting journey into the world of stereo vision! In this lesson, you'll discover how computers can see depth just like humans do with two eyes. We'll explore the fascinating process of stereo correspondence, learn how to compute disparity maps, and understand how paired images can create detailed 3D depth information. By the end of this lesson, you'll understand the fundamental principles that power everything from autonomous vehicles to virtual reality systems! 🚗🥽

Understanding Binocular Vision and Stereo Principles

Just like you have two eyes that work together to perceive depth, stereo vision in computer vision uses two cameras positioned at slightly different viewpoints to extract three-dimensional information from the world around us. This amazing technique mimics the natural process of human binocular vision! 👀

The fundamental principle behind stereo vision lies in binocular disparity - the difference in the apparent position of objects when viewed from two different locations. When you close one eye and then the other, you'll notice that nearby objects appear to shift more than distant objects. This shift is exactly what computers use to calculate depth!

In stereo vision systems, two cameras are typically mounted horizontally with a known distance between them, called the baseline. The baseline distance is crucial because it determines the accuracy and range of depth measurements. A larger baseline provides better depth resolution for distant objects but may struggle with close objects, while a smaller baseline works well for nearby objects but loses accuracy at greater distances.
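A quick numeric sketch makes the baseline trade-off concrete. It uses the pinhole stereo relation $Z = \frac{f \cdot B}{d}$ developed later in this lesson; the focal length and baseline values here are assumptions chosen purely for illustration:

```python
# Depth-quantization sketch using the pinhole stereo relation Z = f*B/d.
# The values of f and B below are illustrative assumptions, not measurements.
f = 700.0   # focal length in pixels
B = 0.10    # baseline in metres

def depth(d):
    """Depth (metres) for a given disparity (pixels)."""
    return f * B / d

# An object 5 m away projects with disparity d = f*B/Z = 14 px.
d = f * B / 5.0                  # 14.0 px
# A one-pixel disparity step near d = 14 changes the estimated depth
# by about a third of a metre -- that is the depth resolution here:
step = depth(14) - depth(15)     # ~0.333 m
print(d, step)
```

Doubling the baseline (or the focal length in pixels) doubles the disparity at the same depth, which halves this quantization step, illustrating why larger baselines resolve distant objects better.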

The process begins with camera calibration, where we determine the exact internal parameters of each camera (like focal length and lens distortion) and the precise geometric relationship between the two cameras. This calibration is essential because any errors here will propagate through the entire depth estimation process.

Epipolar Geometry: The Mathematical Foundation

Epipolar geometry is the mathematical framework that governs stereo vision systems. Think of it as the "rules of the game" that tell us where to look for corresponding points between the two images! 📐

When a point in 3D space is captured by both cameras, it creates what we call corresponding points in each image. The key insight is that for any point in the left image, its corresponding point in the right image must lie along a specific line called the epipolar line. This constraint dramatically reduces the search space from a 2D area to a 1D line!

Each camera's center projects onto the other camera's image plane at a point called the epipole, and all epipolar lines in that image pass through it. In practice, most stereo systems undergo a process called rectification, where both images are transformed so that epipolar lines become horizontal and parallel. This makes the correspondence search much simpler - we only need to search along horizontal lines!

The mathematical relationship can be expressed through the fundamental matrix $F$, which encodes the epipolar geometry. For corresponding points $p_l$ in the left image and $p_r$ in the right image, the epipolar constraint is: $$p_r^T F p_l = 0$$

This elegant equation ensures that corresponding points satisfy the geometric constraints imposed by the camera setup.
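We can check the epipolar constraint numerically. As a simplifying assumption, the fundamental matrix below is the one for an ideal rectified pair, where the constraint reduces to "corresponding points share the same image row":

```python
import numpy as np

# Fundamental matrix of an ideal rectified stereo pair (an illustrative
# assumption): it encodes the constraint y_l = y_r.
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])

def epipolar_residual(p_l, p_r, F):
    """Return p_r^T F p_l for points given as (x, y) pixel coordinates.
    The residual is 0 exactly when the epipolar constraint holds."""
    pl = np.array([p_l[0], p_l[1], 1.0])  # homogeneous coordinates
    pr = np.array([p_r[0], p_r[1], 1.0])
    return float(pr @ F @ pl)

# Same row: constraint satisfied, residual 0.
print(epipolar_residual((120, 64), (95, 64), F))   # 0.0
# Different rows: nonzero residual, not a valid correspondence.
print(epipolar_residual((120, 64), (95, 70), F))   # -6.0
```

For this rectified-pair $F$, the residual is simply $y_l - y_r$, which is why rectification turns the correspondence search into a scan along one horizontal line.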

Stereo Correspondence: Finding Matching Points

The stereo correspondence problem is often considered the heart of stereo vision - it's the challenge of finding which pixel in the left image corresponds to which pixel in the right image. This might sound simple, but it's actually one of the most computationally intensive parts of the entire process! 🔍

Several factors make correspondence matching challenging. Occlusions occur when objects visible in one camera are hidden in the other. Repetitive textures like brick walls or fabric patterns can create multiple potential matches. Lighting differences between the two cameras can make the same surface appear different in each image.

Modern stereo algorithms use various strategies to solve these challenges. Local methods compare small windows or patches around each pixel, using similarity measures like normalized cross-correlation or sum of squared differences. Global methods treat correspondence as an optimization problem, finding the disparity map that minimizes a global energy function while maintaining smoothness constraints.
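The local approach can be sketched in a few lines. This is a minimal sum-of-squared-differences matcher for a single pixel of a rectified pair; real systems add subpixel refinement, left-right consistency checks, and cost aggregation, and the function name here is just illustrative:

```python
import numpy as np

def match_pixel_ssd(left, right, y, x, window=3, max_disp=16):
    """Estimate the disparity at left-image pixel (y, x) by sliding a small
    window along the same row of the right image and keeping the candidate
    with the lowest sum of squared differences (SSD)."""
    h = window // 2
    patch_l = left[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    best_d, best_cost = 0, np.inf
    for d in range(0, max_disp + 1):
        xr = x - d                      # candidate column in the right image
        if xr - h < 0:
            break                       # window would leave the image
        patch_r = right[y - h:y + h + 1, xr - h:xr + h + 1].astype(float)
        cost = np.sum((patch_l - patch_r) ** 2)
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

# Synthetic check: shift a random image 4 px leftward to fake a right view,
# so every interior pixel should match with disparity 4.
rng = np.random.default_rng(0)
left = rng.integers(0, 255, size=(20, 40))
right = np.roll(left, -4, axis=1)
print(match_pixel_ssd(left, right, y=10, x=20))   # 4
```

Note how the rectification assumption shows up directly: the search loop only varies the column, never the row.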

Semi-global matching (SGM) has become particularly popular because it balances accuracy with computational efficiency. SGM aggregates matching costs along multiple 1D paths through the image, providing robust results while remaining fast enough for real-time applications.

The quality of correspondence matching directly affects the final depth map quality. Poor matches lead to noisy or incorrect depth estimates, while good matches produce smooth, accurate 3D reconstructions.

Disparity Computation and Depth Map Generation

Once we've solved the correspondence problem, computing disparity becomes straightforward! Disparity is simply the horizontal distance (in pixels) between corresponding points in the rectified stereo pair. For a point at coordinates $(x_l, y)$ in the left image and $(x_r, y)$ in the right image, the disparity is: $$d = x_l - x_r$$

The beautiful relationship between disparity and depth is given by the formula: $$Z = \frac{f \cdot B}{d}$$

Where $Z$ is the depth, $f$ is the focal length, $B$ is the baseline distance, and $d$ is the disparity. Notice that depth is inversely proportional to disparity: objects with large disparity are close, while objects with small disparity are far away! 📏

Creating a depth map involves computing disparity for every pixel in the image and then converting these disparity values to actual depth measurements. The resulting depth map is typically visualized as a grayscale image where brighter pixels represent closer objects and darker pixels represent more distant objects.
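The two steps above - converting disparities to depths, then rendering them as a grayscale image - can be sketched as follows. The function names and the inverse-depth normalization are illustrative assumptions; pixels with no valid disparity ($d \le 0$) are marked invalid rather than divided by zero:

```python
import numpy as np

def disparity_to_depth(disp, f, B):
    """Convert a disparity map (pixels) to a depth map (same units as B).
    Pixels with disparity <= 0 have no valid match and are set to inf."""
    depth = np.full(disp.shape, np.inf)
    valid = disp > 0
    depth[valid] = f * B / disp[valid]
    return depth

def depth_to_gray(depth):
    """Render depth as 0-255 grayscale with brighter = closer (the
    convention described above), via a simple inverse-depth normalization."""
    inv = np.where(np.isfinite(depth), 1.0 / depth, 0.0)
    if inv.max() > 0:
        inv = inv / inv.max()
    return (inv * 255).astype(np.uint8)

# Toy 2x2 disparity map; f and B are the same illustrative values as before.
disp = np.array([[14.0, 7.0],
                 [0.0, 28.0]])
depth = disparity_to_depth(disp, f=700.0, B=0.10)
print(depth)                 # depths: 5.0 m, 10.0 m, invalid (inf), 2.5 m
print(depth_to_gray(depth))  # the closest pixel (d = 28) is brightest
```

Invalid pixels render as black here; production pipelines typically fill such holes by interpolation from neighboring valid depths instead.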

Real-world applications of stereo vision are everywhere! Autonomous vehicles use stereo cameras to detect obstacles and estimate distances to other cars. Manufacturing systems use stereo vision for quality control and robotic guidance. Even smartphones now incorporate stereo vision for portrait mode photography and augmented reality applications.

The accuracy of depth maps depends on several factors: the baseline distance, image resolution, calibration quality, and the stereo matching algorithm used. Modern systems can achieve millimeter-level accuracy under controlled conditions, making them suitable for precise industrial applications.

Conclusion

Stereo vision represents one of the most elegant solutions in computer vision, mimicking nature's own approach to depth perception. Through the principles of binocular disparity, epipolar geometry, and stereo correspondence, we can extract rich 3D information from simple 2D images. The process involves careful camera calibration, sophisticated matching algorithms, and precise disparity computation to generate accurate depth maps. As technology continues advancing, stereo vision remains a cornerstone technique powering innovations in robotics, autonomous systems, and immersive technologies.

Study Notes

• Stereo Vision: Computer vision technique using two cameras at different viewpoints to extract depth information from 2D images

• Binocular Disparity: The difference in apparent position of objects when viewed from two different locations

• Baseline: The horizontal distance between two stereo cameras, affecting depth accuracy and measurement range

• Epipolar Geometry: Mathematical framework governing the geometric relationship between stereo camera pairs

• Epipolar Lines: Lines in each image along which corresponding points must lie, reducing 2D search to 1D

• Rectification: Process of transforming stereo images so epipolar lines become horizontal and parallel

• Fundamental Matrix: Mathematical representation of epipolar geometry: $p_r^T F p_l = 0$

• Stereo Correspondence: The process of finding matching pixels between left and right stereo images

• Disparity: Horizontal pixel distance between corresponding points: $d = x_l - x_r$

• Depth Formula: Relationship between disparity and depth: $Z = \frac{f \cdot B}{d}$

• Depth Map: Grayscale representation where pixel intensity corresponds to distance from cameras

• Semi-Global Matching (SGM): Popular algorithm balancing accuracy and computational efficiency

• Occlusions: Areas visible in one camera but hidden in the other, creating correspondence challenges

• Applications: Autonomous vehicles, robotics, manufacturing quality control, smartphone photography, AR/VR systems

