Camera calibration is generally understood to be the process of estimating the characteristics of a camera that affect the formation of images of real 3D scenes. Principal among these is the effective focal length of the lens, which gives images a wide-angle or telescopic character for a given camera position and scene being imaged. There is a wide range of models used to describe a camera’s image formation behavior. Some are relatively simple, with mostly linear equations of few components, and are only adequate to describe simple camera systems. Others are highly non-linear, featuring high-degree polynomials with many coefficients. Regardless of the camera model, you can use the techniques in this article to calibrate your camera.

Creating a calibration module is no small task; therefore, we'll be splitting this blog post into three parts. This first part focuses on the theory behind calibration. The second part will cover the mathematics and principles behind calibration and present a roadmap for creating your own approach. Finally, the third part will show the results from a self-built calibration module and describe how the module arrives at useful results. For code, you can follow along here, or access the full codebase for this tutorial at the Tangram Visions Blog repository. With that said, let's dive in.

There are many existing tools one can use to calibrate a camera. These exist both as standalone programs and SDK libraries. Given this, why would you want to write your own camera calibration module? There are some real reasons to do so: for example, the existing tools may not support the hardware platform or parametric camera models you want to use. Aside from any practical motivations, it’s useful to work through the process of camera calibration to gain a better understanding of what’s happening when you use a tool like OpenCV. This is especially useful when trying to understand why one of these tools may not be producing a suitable calibration.

Camera calibration is built on a foundation of linear algebra, 3D geometry and non-linear least squares optimization. One need not be an expert in all of these areas, but a basic understanding is necessary. To keep the blog post a reasonable length, some knowledge of linear algebra, the construction and use of 3D transformations, multivariable calculus, and optimization techniques (being able to understand a cost function should suffice) is assumed for the rest of the article.

The code snippets (largely in Part II) will be in Rust, and we’ll be using the nalgebra linear algebra crate and the argmin optimization crate. The entire example can be found here.

To begin, we need a mathematical model of the formation of images. This provides a compact formula which approximately describes the relationship of 3D points in the scene to their corresponding 2D pixels in the image, aka *image formation* or *projection.* Projection is generally thought of in the following way:

- Light in a 3D scene hits the various objects therein and scatters.
- Some of that scattered light enters the camera’s aperture.
- If a lens is present, it refracts (i.e. redirects) the light rays, directing them to various pixels, which measure the light incident on them during the exposure period.

One such model is called the *pinhole model* which describes image formation for pinhole cameras and camerae obscurae. We will use this model for this post due to its simplicity.

The pinhole model can work for cameras with simple lenses provided they’re highly rectilinear, but most cameras exhibit some sort of lens distortion which isn’t captured in a model like this. It's worth noting that most camera models ignore a lot of the real-life complexity of a lens (focus, refraction, etc.).

In the pinhole model, the camera is said to be at the origin of a camera coordinate system shown in the diagram below. To image a 3D point, you draw a line between the point in question and the camera and see where that line intersects the virtual image plane, which is a model stand-in for the camera's physical sensor. This *intersection point* is then shifted and scaled according to the sensor's resolution into a pixel location. In the model, the image plane is set in front of the camera (i.e. along the +Z axis) at a distance called the *focal length*.

The trick to deriving the formula for the intersection point from this diagram is to consider just two dimensions at a time. Doing this exposes a similar triangles relationship between points on the plane and points in the scene.
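As a brief worked step (writing the focal length as \\(f_c\\), and assuming a scene point \\((X, Y, Z)\\) with \\(Z > 0\\)), consider the X-Z plane alone. The triangle formed by the camera origin, the Z-axis and the scene point is similar to the smaller triangle formed by the origin, the Z-axis and the intersection point, so the ratios of their sides match:

$$\frac{u_{plane}}{f_c} = \frac{X}{Z} \quad \Rightarrow \quad u_{plane} = f_c \frac{X}{Z}$$

The same argument in the Y-Z plane gives:

$$\frac{v_{plane}}{f_c} = \frac{Y}{Z} \quad \Rightarrow \quad v_{plane} = f_c \frac{Y}{Z}$$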

So we end up with the formula:

$$\begin{bmatrix} u_{plane} \\\ v_{plane} \end{bmatrix} =\begin{bmatrix} f_c \frac{X}{Z} \\\ f_c \frac{Y}{Z} \end{bmatrix}$$

This formula maps points in 3D to the virtual image plane.
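To make the plane projection concrete, here is a minimal sketch in plain Rust. It uses arrays rather than nalgebra types to stay dependency-free, and the function name `project_to_plane` and the guard for points behind the camera are assumptions of this sketch, not part of the series' codebase.

```rust
/// Intersects the ray from the camera origin through `point` with the
/// image plane at Z = f_c. All lengths share one unit (e.g. meters).
/// Returns `None` for points at or behind the camera (Z <= 0); this guard
/// is an addition of the sketch, since the formula does not cover that case.
fn project_to_plane(f_c: f64, point: [f64; 3]) -> Option<[f64; 2]> {
    let [x, y, z] = point;
    if z <= 0.0 {
        return None;
    }
    // Similar triangles: u / f_c = X / Z and v / f_c = Y / Z.
    Some([f_c * x / z, f_c * y / z])
}
```

For example, with a 4 mm focal length (`f_c = 0.004`), the point `[0.5, -0.25, 2.0]` (in meters) lands at roughly `[0.001, -0.0005]` on the plane, i.e. 1 mm and -0.5 mm from the point where the Z-axis pierces the plane.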

The image plane is not an entirely abstract concept. It represents the actual CMOS sensor or piece of film of the camera. The intersection point is a point on this physical sensor; \\(u_{plane}\\) and \\(v_{plane}\\) are thus described in units of distance from a point on this sensor (e.g. meters). A further transformation which maps this plane location to actual pixels is required.

To do so, we need a few pieces of information:

- The *pixel pitch*: the edge length of a pixel, given in meters-per-pixel (often micrometers-per-pixel).
- The *principal point*: the pixel location that any point on the camera’s Z-axis maps to. The principal point \\((c_x, c_y)\\) is usually in the center of the image (e.g. at pixel 320, 240 for a 640x480 image) but often deviates slightly because of small imperfections.

The principal point accounts for the discrepancy between the pixel-space origin and the origin of the image plane. A 3D point on the Z-axis has zero X and Y components and thus projects to \\( (u_{plane}, v_{plane}) = (0,0)\\). Meanwhile, the origin in pixel-space is the upper left-hand corner of the sensor, so the principal point is added as an offset to move between the two.

After mapping to pixel space, we get our final pinhole camera model:

$$\begin{bmatrix}u_{pix} \\\ v_{pix}\end{bmatrix} =\begin{bmatrix}\frac{f_c}{pp} \frac{X}{Z} + c_x \\\ \frac{f_c}{pp} \frac{Y}{Z} + c_y\end{bmatrix} $$
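As a rough illustration of this formula, the sketch below projects a point using a metric focal length, a pixel pitch and a principal point. The function name `project_metric` and the example numbers are hypothetical, plain arrays stand in for nalgebra types, and the point is assumed to lie in front of the camera (\\(Z > 0\\)).

```rust
/// Pinhole projection with physical parameters.
/// `f_c` is the focal length in meters, `pixel_pitch` is meters per pixel,
/// and `principal_point` is (c_x, c_y) in pixels.
/// Assumes the point is in front of the camera (Z > 0).
fn project_metric(
    f_c: f64,
    pixel_pitch: f64,
    principal_point: [f64; 2],
    point: [f64; 3],
) -> [f64; 2] {
    // Dividing by the pixel pitch converts meters on the sensor to pixels.
    let focal_px = f_c / pixel_pitch;
    [
        focal_px * point[0] / point[2] + principal_point[0],
        focal_px * point[1] / point[2] + principal_point[1],
    ]
}
```

With a 4 mm lens and 2 micrometer pixels, `f_c / pixel_pitch` is about 2000 pixels. Projecting `[0.1, -0.05, 2.0]` with a principal point of `[320.0, 240.0]` then gives approximately `[420.0, 190.0]`.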

It’s common in computer vision parlance to group the focal length and pixel pitch terms into one pixel-unit focal length term, since they’re usually fixed for a given camera (assuming the lens cannot zoom). It’s also very common to see distinct focal length parameters for the X and Y dimensions \\((f_x, f_y)\\). Historically, this was to account for non-square pixels or for exotic lenses (e.g. anamorphic lenses), but in many scenarios having two focal length parameters isn’t well-motivated and can even be detrimental. With these pixel-unit focal lengths, the pinhole model takes its most common form:

$$\begin{bmatrix}u_{pix} \\\ v_{pix}\end{bmatrix} =P(\bar{X}; \{ f_x, f_y,c_x,c_y \} )=\begin{bmatrix}f_x \frac{X}{Z} + c_x \\\ f_y \frac{Y}{Z} + c_y\end{bmatrix} $$

It is the coefficients \\(f_x, f_y, c_x, c_y\\) that we hope to estimate using camera calibration, as we will show in the next two parts of this series.

```
use nalgebra as na;

/// Projects a 3D point in the camera frame to pixel coordinates
/// using the pinhole model.
fn project(
    params: &na::Vector4<f64>, // fx, fy, cx, cy
    pt: &na::Point3<f64>,
) -> na::Point2<f64> {
    na::Point2::<f64>::new(
        params[0] * pt.x / pt.z + params[2], // fx * x / z + cx
        params[1] * pt.y / pt.z + params[3], // fy * y / z + cy
    )
}
```
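Two properties of this model are easy to check numerically. The sketch below is a dependency-free mirror of `project` that uses arrays instead of nalgebra types; the name `project_plain` and the parameter values in the note that follows are hypothetical and chosen only for illustration.

```rust
/// Dependency-free mirror of `project`.
/// `params` holds [fx, fy, cx, cy]; `pt` is [x, y, z] in the camera frame.
/// Assumes the point is in front of the camera (z > 0).
fn project_plain(params: &[f64; 4], pt: &[f64; 3]) -> [f64; 2] {
    [
        params[0] * pt[0] / pt[2] + params[2], // fx * x / z + cx
        params[1] * pt[1] / pt[2] + params[3], // fy * y / z + cy
    ]
}
```

With `params = [600.0, 600.0, 320.0, 240.0]`, any point on the Z-axis such as `[0.0, 0.0, 3.0]` maps to the principal point `[320.0, 240.0]`. The points `[0.5, 0.25, 2.0]` and `[1.0, 0.5, 4.0]` lie on the same ray through the camera origin and both map to `[470.0, 315.0]`: because only the ratios X/Z and Y/Z enter the formula, a single image cannot tell how far along the ray a point sits.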

At this point, we've explained the theory behind camera calibration. We'll take a break here, and return with the second part where we dive into the principles, much of the math, and some of the code required to understand creating a calibration module.
