CV: Epipolar Geometry and Disparity

Resources:
  • Epipolar geometry: https://en.wikipedia.org/wiki/Epipolar_geometry
  • https://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/OWENS/LECT10/node3.html
  • a more light-hearted intro:

Introduction to Epipolar Geometry and Stereo Vision | LearnOpenCV

Figure 3 – Extending the triangulation concept to explain how a 3D point (X) captured in two images can be calculated if the camera positions (C1 and C2) and pixel coordinates (x1 and x2) are known.

Figure 3 shows how triangulation can be used to calculate the depth of a point (X) when it is captured (projected) in two different views (images). In this figure, C1 and C2 are the known 3D positions of the left and right cameras, respectively. x1 is the image of the 3D point X captured by the left camera, and x2 is the image of X captured by the right camera. x1 and x2 are called corresponding points because they are projections of the same 3D point. We use x1 and C1 to find L1, and x2 and C2 to find L2. Hence we can use triangulation to find X just like we did for figure 2. ==> intersecting the two rays yields a set of equations in the x- and y-directions.

From the above example, we learned that to triangulate a 3D point using two images capturing it from different views, the key requirements are:

  1. Position of the cameras – C1 and C2.
  2. Point correspondence – x1 and x2.

Note that stereo camera calibration is useful only when the images are captured by a pair (or a set) of cameras rigidly fixed with respect to each other. If a single camera captures the images from two different angles, then we can recover depth only up to a scale. The absolute depth is unknown unless we have some special geometric information about the captured scene that can be used to find the actual scale.
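To make the triangulation step concrete, here is a minimal sketch using OpenCV's cv2.triangulatePoints. It assumes the 3x4 projection matrices P1 and P2 (which encode the camera positions C1 and C2 plus intrinsics) and the corresponding pixel coordinates x1 and x2 are already known; the numbers below are illustrative placeholders, not values from the figures.

```python
import numpy as np
import cv2

# Illustrative 3x4 projection matrices for the left and right cameras.
# In practice these come from calibration (intrinsics + extrinsics).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                   # camera at C1 = (0, 0, 0)
P2 = np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])   # camera at C2 = (0.1, 0, 0)

# Corresponding points x1 and x2 (2 x N arrays, here N = 1, normalized coordinates).
x1 = np.array([[0.25], [0.10]], dtype=np.float64)
x2 = np.array([[0.15], [0.10]], dtype=np.float64)

# Triangulate: result is 4 x N in homogeneous coordinates.
X_h = cv2.triangulatePoints(P1, P2, x1, x2)
X = (X_h[:3] / X_h[3]).ravel()   # convert to 3D Euclidean coordinates
print("Triangulated 3D point X:", X)   # approximately (0.25, 0.10, 1.0)
```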

(In feature matching) the ratio of the number of pixels with known point correspondence to the total number of pixels is minimal. This means we will have a very sparsely reconstructed 3D scene. For dense reconstruction, we need to obtain point correspondence for the maximum number of pixels possible.  

                        Figure 7 – Multiple matched points using template matching

A simplified way to find the point correspondences is to find pixels with similar neighboring pixel information. In figure 7, we observe that this method of matching pixels based on similar neighborhoods results in a single pixel from one image having multiple matches in the other image, and it is challenging to write an algorithm that determines the true match.
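As a rough sketch of this neighborhood-based matching (not the exact procedure behind figure 7), one can slide a small patch from the left image over the right image with cv2.matchTemplate and keep every location whose similarity exceeds a threshold; with repetitive texture, several locations typically pass, which is exactly the ambiguity described above. The file names, pixel location, and threshold are assumptions.

```python
import cv2
import numpy as np

# Load a stereo pair (file names are placeholders).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Take a small neighborhood (template) around a pixel of interest in the left image.
y, x, half = 120, 200, 7
template = left[y - half:y + half + 1, x - half:x + half + 1]

# Normalized cross-correlation of the template against the whole right image.
scores = cv2.matchTemplate(right, template, cv2.TM_CCOEFF_NORMED)

# Every location above the threshold is a candidate match; repetitive texture
# often produces several of them, hence the ambiguity shown in figure 7.
candidates = np.argwhere(scores > 0.9)
print("Number of candidate matches:", len(candidates))
```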

Is there a way to reduce our search space? Some theorem which we can use to eliminate all the extra false matches that lead to inaccurate correspondence? We make use of epipolar geometry here.

Epipolar geometry and its use in point correspondence

                                       Figure 8 – Image explaining epipolar geometry.

In figure 8, we assume a setup similar to figure 3. A 3D point X is captured at x1 and x2 by cameras at C1 and C2, respectively. As x1 is the projection of X, if we extend a ray R1 from C1 through x1, it must also pass through X. This ray R1 is captured as the line L2 in image i2, and X is captured as x2 in i2. Since X lies on R1, x2 must lie on L2. This way, the possible location of x2 is constrained to a single line, and hence we can say that the search space in image i2 for the pixel corresponding to x1 is reduced to a single line L2. We use epipolar geometry to find L2.

Time to define some technical terms now! Along with X, we can also project the camera centers into the respective opposite images. e2 is the projection of camera center C1 in image i2, and e1 is the projection of camera center C2 in image i1. The technical term for e1 and e2 is epipole. Hence, in a two-view geometry setup, an epipole is the image of the camera center of one view in the other view.

The line joining the two camera centers is called the baseline. Hence an epipole can also be defined as the intersection of the baseline with the image plane.

Figure 8 shows that using R1 and the baseline, we can define a plane P. This plane also contains X, C1, x1, x2, and C2. We call this plane the epipolar plane. Furthermore, the line obtained from the intersection of the epipolar plane and the image plane is called the epipolar line. Hence in our example, L2 is an epipolar line. For different values of X, we get different epipolar planes and hence different epipolar lines. However, all the epipolar planes intersect along the baseline, and all the epipolar lines pass through the epipoles. All this together forms the epipolar geometry.

Let us revisit figure 8 with all the technical terms we have learned so far.

We have the epipolar plane P created using the baseline B and ray R1. e1 and e2 are the epipoles, and L2 is the epipolar line. Based on the epipolar geometry of the figure, the search space in image i2 for the pixel corresponding to x1 is constrained to a single 2D line, the epipolar line L2. This is called the epipolar constraint.

Is there a way to represent the entire epipolar geometry by a single matrix? Furthermore, can we calculate this matrix using just the two captured images? The good news is that there is such a matrix, and it is called the Fundamental matrix. 

Understanding projective geometry and homogeneous representation

How do we represent a line in a 2D plane? Equation of a line in a 2D plane is ax + by + c = 0. With different values of a, b, and c, we get different lines in a 2D plane. Hence a vector (a,b,c) can be used to represent a line.

Suppose we have line ln1 defined as 2x + 3y + 7 = 0 and line ln2 as 4x + 6y + 14 = 0. Based on the above discussion, ln1 can be represented by the vector (2,3,7) and ln2 by the vector (4,6,14). We can easily see that ln1 and ln2 represent the same line, and that the vector (4,6,14) is just the vector (2,3,7) scaled by a factor of 2.

Hence any two vectors (a,b,c) and k(a,b,c), where k is a non-zero scaling constant, represent the same line. Such equivalent vectors, which are related by just a scaling constant, form a class of homogeneous vectors. The vector (a,b,c) is the homogeneous representation of its respective equivalent vector class. 

The set of all such equivalence classes, represented by (a,b,c) for all possible real values of a, b, and c other than a=b=c=0, forms the projective space. We use this homogeneous representation (homogeneous coordinates) to define elements like points, lines, and planes in projective space, and we use the rules of projective geometry to perform transformations on these elements.
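As a small numerical illustration of these homogeneous representations (my own sketch, not from the original article), the snippet below checks that (2,3,7) and (4,6,14) describe the same line, and that a point (x, y, 1) lies on a line l exactly when the dot product l·x is zero.

```python
import numpy as np

ln1 = np.array([2.0, 3.0, 7.0])    # 2x + 3y + 7 = 0
ln2 = np.array([4.0, 6.0, 14.0])   # 4x + 6y + 14 = 0

# Two homogeneous line vectors represent the same line if one is a
# non-zero scalar multiple of the other (their cross product is zero).
print("Same line:", np.allclose(np.cross(ln1, ln2), 0))

# A point (x, y) in homogeneous coordinates is (x, y, 1); it lies on the
# line l = (a, b, c) exactly when a*x + b*y + c = l . point = 0.
point = np.array([-2.0, -1.0, 1.0])   # x = -2, y = -1 satisfies 2x + 3y + 7 = 0
print("Point on line:", np.isclose(ln1 @ point, 0))
```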

Fundamental matrix derivation

In figure 3, assume that we know the camera projection matrices for both cameras, say P1 for the camera at C1 and P2 for the camera at C2.

What is a projection matrix? The camera’s projection matrix defines the relation between the 3D world coordinates and their corresponding pixel coordinates when captured by the camera. To know more about the camera projection matrix, read this post on camera calibration.  

Just like P1 projects 3D world coordinates to image coordinates, we can define P1inv, the pseudo-inverse of P1, so that the ray R1 from C1 passing through x1 and X can be written as:

X(k) = P1inv * x1 + k * C1

Here k is a scaling parameter, since we do not know the actual distance of X from C1. We need to find the epipolar line Ln2 to reduce the search space for the pixel in i2 corresponding to pixel x1 in i1, because we know that Ln2 is the image of ray R1 captured in i2. Hence, to calculate Ln2, we first find two points on ray R1, project them into image i2 using P2, and use the projections of these two points to find Ln2.

The first point that we can consider on R1 is C1, as the ray starts from this point. The second point can be calculated by keeping k=0. Hence we get the points as C1 and (P1inv)(x1).

Using the projection matrix P2, we get the image coordinates of these points in image i2 as P2*C1 and P2*P1inv*x1, respectively. We also observe that P2*C1 is the epipole e2 in image i2.

In projective geometry, a line through two points p1 and p2 can be obtained simply as their cross product p1 x p2 ==> whereas the usual n-D representation of a line takes the difference of the two points and scales it, this cross-product representation, as we will see, has algebraic convenience. Hence

Ln2 = (P2*C1) x (P2*P1inv*x1) = e2 x (P2*P1inv*x1)

In projective geometry, if a point x lies on a line L, we can write this as the equation

x^T * L = 0

Hence, as x2 lies on the epipolar line Ln2, we get

x2^T * Ln2 = 0

Substituting the value of Ln2 from the equation above, we get

x2^T * [e2]x * P2 * P1inv * x1 = 0, i.e. x2^T * F * x1 = 0, where F = [e2]x * P2 * P1inv and [e2]x is the skew-symmetric matrix that implements the cross product with e2.

This is a necessary condition for the two points x1 and x2 to be corresponding points, and it is also a form of epipolar constraint. Thus F represents the overall epipolar geometry of the two-view system.
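To see that this recipe really produces a matrix satisfying the constraint, here is a small numerical check (my own sketch, with made-up projection matrices): it builds F = [e2]x * P2 * P1inv, projects a 3D point into both views, and verifies that x2^T F x1 is numerically zero.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]x such that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

# Two illustrative 3x4 projection matrices (identity intrinsics, baseline along x).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

C1 = np.array([0.0, 0.0, 0.0, 1.0])           # camera 1 center (homogeneous), P1 @ C1 = 0
e2 = P2 @ C1                                   # epipole in image 2
F = skew(e2) @ P2 @ np.linalg.pinv(P1)         # F = [e2]x * P2 * P1inv

# Project a 3D point into both images and check the epipolar constraint.
X = np.array([0.3, -0.2, 4.0, 1.0])            # homogeneous 3D point
x1 = P1 @ X
x2 = P2 @ X
print("x2^T F x1 =", x2 @ F @ x1)              # ~0 up to floating point error
```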

What else is so special about this equation? It can be used to find the epipolar lines!

Using Fundamental matrix to find epipolar lines

As x1 and x2 are corresponding points in the above equation, if we can find correspondences for enough points ==> see the Stanford lecture notes linked at the end for the 8-point algorithm ==> using feature matching methods like ORB or SIFT, we can use them to solve the above equation for F.

The findFundamentalMat() method of OpenCV provides implementations of various algorithms, such as the 7-point algorithm, the 8-point algorithm, RANSAC, and LMedS, to calculate the Fundamental matrix from matched feature points.
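A minimal usage sketch (file names and detector choice are assumptions; ORB keypoints plus a brute-force matcher stand in for whatever feature matching is actually used):

```python
import cv2
import numpy as np

img1 = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Detect and match ORB features in the two views.
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Estimate the Fundamental matrix; RANSAC rejects bad matches.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
print("F =\n", F)
```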

Once F is known, we can find the epipolar line Ln2 using the formula

Ln2 = F * x1

If we know Ln2, we can restrict our search for pixel x2 corresponding to pixel x1 using the epipolar constraint.
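In OpenCV, cv2.computeCorrespondEpilines does exactly this. A short sketch, continuing from the pts1, pts2, and F computed in the previous snippet:

```python
import cv2

# Epipolar lines in image 2 for points from image 1 (whichImage = 1),
# each returned as (a, b, c) with a*x + b*y + c = 0.
lines2 = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 1, F).reshape(-1, 3)

# The corresponding pixel x2 for a given x1 must lie (close to) on its line:
a, b, c = lines2[0]
x2, y2 = pts2[0]
print("Residual a*x2 + b*y2 + c =", a * x2 + b * y2 + c)
```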

A special case of two-view vision – parallel imaging planes

We have been trying to solve the correspondence problem. We started by using feature matching, but we observed that it leads to a sparse 3D structure, as the point correspondence for a tiny fraction of the total pixels is known. Then we saw how we could use a template-based search for pixel correspondence. We learned how epipolar geometry could be used to reduce the search space for point correspondence to a single line – the epipolar line.

Can we simplify this process of finding dense point correspondences even further? 

Figure 9. Upper pair of images showing results for feature matching; lower pair showing points in one image (left) and the corresponding points lying on their respective epipolar lines in the second image (right).

Figure 10. A special case of two-view geometry. Upper pair of images showing results for feature matching; lower pair showing points in one image (left) and the corresponding points lying on their respective epipolar lines in the second image (right). Source – 2005 Stereo Dataset

Figure 9 and Figure 10 show the feature matching results and epipolar line constraint for two different pairs of images. What is the most significant difference between the two figures in terms of feature matching and the epipolar lines?

Yes! You got it right! In Figure 10 the matched feature points have equal vertical coordinates: every pair of corresponding points lies on the same image row. Consequently, all the epipolar lines in Figure 10 are parallel and horizontal, at the same vertical coordinate as the respective point in the left image. Well, what is so great about that?

Exactly! Unlike the case of figure 9, there is no need to calculate each epipolar line explicitly. If the pixel in the left image is at (x1,y1), the equation of the respective epipolar line in the second image is y=y1.

For each pixel in the left image, we search for its corresponding pixel in the same row of the right image. This is a special case of two-view geometry where the imaging planes are parallel. Hence, the epipoles (the image of one camera center as seen by the other camera) are formed at infinity. Based on our understanding of epipolar geometry, epipolar lines pass through the epipoles. Hence in this case, as the epipoles are at infinity, the epipolar lines are parallel.
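General stereo pairs can be warped into this parallel-plane configuration (stereo rectification). One way to sketch this with OpenCV, reusing the img1, img2, pts1, pts2, and F from the earlier snippets, is cv2.stereoRectifyUncalibrated, which returns homographies H1 and H2 that make the epipolar lines horizontal:

```python
import cv2

# pts1, pts2 and F come from the feature matching / findFundamentalMat step above;
# img1 and img2 are the original images.
h, w = img1.shape[:2]
ok, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, (w, h))

if ok:
    rect1 = cv2.warpPerspective(img1, H1, (w, h))
    rect2 = cv2.warpPerspective(img2, H2, (w, h))
    # After warping, corresponding points share (approximately) the same row.
```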

Awesome! This significantly simplifies the problem of dense point correspondence. However, we still have to perform triangulation for each point. Can we simplify this as well? Once again, the special case of parallel imaging planes has good news for us: it lets us use stereo disparity. This is similar to stereopsis, or stereoscopic vision, the mechanism that helps humans perceive depth. Let's understand this in detail.

Disparity and Disparity Shift 

The following gif is generated using images from the Middlebury Stereo Datasets 2005. It demonstrates purely translational motion of the camera, which makes the imaging planes parallel. Can you tell which objects are closer to the camera?

We can clearly say that the toy cow at the bottom is closer to the camera than the toys in the topmost row. How do we know this? We see how much each object shifts between the two images. The larger the shift, the closer the object. This shift is what we call disparity.

How do we use it to avoid triangulating every point when calculating depth? We calculate the disparity (the shift of the pixel between the two images) for each pixel and then map disparity to depth, since depth is inversely proportional to disparity (as derived below).

from OpenCV: Depth Map from Stereo Images

Below is a diagram and some simple mathematical formulas that prove that intuition.

The above diagram contains similar triangles. Writing out their proportionality relations yields the following result:

disparity = x − x′ = B·f / Z <==> (x − x′) / B = f / Z (up to a constant factor that converts between pixel and metric units)

Here x and x′ are the horizontal distances, in the image plane, of the projected points from their respective camera centers. B is the distance between the two cameras (which we know) and f is the focal length of the camera (also known). In short, the equation says that the depth of a scene point is inversely proportional to the difference in distance of its image points from their camera centers, i.e. to the disparity. With this information, we can derive the depth of all pixels in an image.
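A minimal sketch of this mapping (B, f, and the disparity values are made-up numbers; real disparities would come from the stereo matcher shown below):

```python
import numpy as np

B = 0.06    # baseline between the two cameras, in metres (assumed)
f = 700.0   # focal length in pixels (assumed)

# A toy disparity map in pixels; 0 means "no match found".
disparity = np.array([[35.0, 70.0],
                      [ 7.0,  0.0]])

# Z = B * f / disparity, valid only where disparity > 0.
depth = np.full_like(disparity, np.inf)
valid = disparity > 0
depth[valid] = B * f / disparity[valid]
print(depth)   # larger disparity -> smaller depth (closer object)
```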

So the stereo algorithm finds corresponding matches between the two images. We have already seen how the epipolar constraint makes this operation faster and more accurate. Once the matches are found, the disparity is computed.
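In OpenCV this whole pipeline is wrapped in block-matching classes such as cv2.StereoBM or cv2.StereoSGBM. A minimal sketch on a rectified pair (file names and parameter values are assumptions):

```python
import cv2

# Rectified left/right images (corresponding pixels share the same row).
left = cv2.imread("rect_left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("rect_right.png", cv2.IMREAD_GRAYSCALE)

# numDisparities is the search range along the row (multiple of 16),
# blockSize is the matching window size (odd).
stereo = cv2.StereoBM_create(numDisparities=128, blockSize=15)

# compute() returns fixed-point disparities scaled by 16.
disparity = stereo.compute(left, right).astype("float32") / 16.0

# Normalize to 0-255 for visualization and save.
vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("disparity.png", vis)
```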

Disparity

disparity: https://en.wikipedia.org/wiki/Binocular_disparity

Binocular disparity


Binocular disparity refers to the difference in the image location of an object (or, in CV, of pixels) seen by the left and right eyes, resulting from the eyes' horizontal separation (parallax). The brain uses binocular disparity to extract depth information from the two-dimensional retinal images in stereopsis. In computer vision, binocular disparity refers to the difference in coordinates of similar features within two stereo images.

A similar disparity can be used in rangefinding by a coincidence rangefinder to determine distance and/or altitude to a target. In astronomy, the disparity between different locations on the Earth can be used to determine various celestial parallax, and Earth's orbit can be used for stellar parallax.

Definition

Human eyes are horizontally separated by about 50–75 mm (interpupillary distance) depending on each individual. Thus, each eye has a slightly different view of the world around. This can be easily seen when alternately closing one eye while looking at a vertical edge. The binocular disparity can be observed from apparent horizontal shift of the vertical edge between both views.

At any given moment, the lines of sight of the two eyes meet at a point in space. This point projects to the same location (i.e. the center) on the retinae of the two eyes. Because of the different viewpoints of the left and right eyes, however, many other points in space do not fall on corresponding retinal locations. Visual binocular disparity is defined as the difference between the points of projection in the two eyes and is usually expressed in degrees as a visual angle.[1]

Disparity shift

Disparity shift - Intel Communities

==>" As mentioned in the tuning guide, if disparity shift = 0 then a stereo camera can see infinitely far.

So how I like to think of disparity shift is like a person standing in front of the camera holding up a board. If disparity = 0 then they are standing so far away that the camera cannot see them at all. As disparity is increased, the person holding up the board gets closer and closer to the camera, restricting how far ahead it can read the detail of (MaxZ is reducing). Until finally MaxZ is low enough that the board is right in front of the camera and it can see very little except what is in front of the held-up board."

==>" My understanding of the disparity shift is basically the "search range" along epipolar lines. The way depth matching works, you are trying to match two pixels (one in the left view and one in the right view). Their distance apart (in pixel space) is related to how far they are in physical space. By changing the disparity shift value, you are effectively limiting this search range (search for pixels close to each other, you see near; search for pixels far apart from each other, you see far). This is exactly how human vision works, as Marty explained in a previous post."

(epipolar geometry: https://en.wikipedia.org/wiki/Epipolar_geometry)

Understanding disparity and disparity shift · Issue #3039 · IntelRealSense/librealsense · GitHub

I. Disparity and disparity shift are in terms of pixels

True

II. The disparity search range (126) cannot be changed

True. The search range is fixed but the starting value can be changed using disparity shift.

III. The disparity (in pixels) for a particular depth increases when resolution increases

True

IV. Consequently, the size of the range of depths that can be measured effectively decreases as resolution increases

As resolution increases, both MinZ and MaxZ increase, but MaxZ increases more, so the total range MaxZ − MinZ actually increases with higher resolution.
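Putting the forum explanation into numbers: with a fixed search range of 126 disparities starting at the disparity shift s, the searched disparities run roughly from s to s + 126, and Z = f·B / d maps them to a depth band. The sketch below is my own back-of-the-envelope illustration of that idea, with made-up f and B; it is not an official RealSense formula.

```python
def depth_band(disparity_shift, f_px=650.0, baseline_m=0.05, search_range=126):
    """Approximate MinZ/MaxZ in metres for a given disparity shift (illustrative only)."""
    d_min = disparity_shift                  # smallest searched disparity -> farthest depth
    d_max = disparity_shift + search_range   # largest searched disparity -> nearest depth
    max_z = float("inf") if d_min == 0 else f_px * baseline_m / d_min
    min_z = f_px * baseline_m / d_max
    return min_z, max_z

for shift in (0, 50, 150):
    print(shift, depth_band(shift))
# shift = 0    -> MaxZ is infinite (the camera "sees infinitely far")
# larger shift -> both MinZ and MaxZ move closer to the camera
```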

More details

https://web.stanford.edu/class/cs231a/course_notes/03-epipolar-geometry.pdf
