
Previously we covered how rotations, scales, and translations work for 2D coordinate frames. The most important takeaway was that we were able to think conceptually about transformations and express them both in plain English (I want the depth from color transform) and mathematically (I want \(\Gamma^{depth}_{color}\), or depth ← color).

While 2D coordinate frames are common in mathematics, we interact with our world in three dimensions (3D). In particular, many of the sensors that Tangram Vision deals with every day have to be related to each other in 3D space, which means that most of the coordinate systems we are interested in are expressed in three dimensions as well. No matter how complex your sensor system is, you will need to relate 3D coordinate systems to one another: anything from orienting two cameras into the same frame, to combining LiDAR with other sensors. This means that the 2D transforms we derived last time won't be enough.

In this article, we extend the equations we formulated for 2D transforms and derive their 3D equivalents. You can follow along here, or with the full codebase in the Tangram Vision Blog repository. Let's get started!

Recall that we try to express the entire transformation process as a function applied over a point. We started with the following equation:

$$p_B = f(p_A)$$

Fortunately, this model still works! In 3D, our points have three Cartesian axes: \(X\), \(Y\), and \(Z\). Ultimately, we are looking for one of the following forms:

$$p_B = \begin{bmatrix} x_B \\ y_B \\ z_B \end{bmatrix} = S^B_A \cdot R^B_A \cdot \begin{bmatrix} x_A \\ y_A \\ z_A \end{bmatrix} + T^B_A$$

or, alternatively with projective coordinates:

$$p_B = \begin{bmatrix} x_B \\ y_B \\ z_B \\ 1 \end{bmatrix} = \Gamma^B_A \cdot \begin{bmatrix} x_A \\ y_A \\ z_A \\ 1 \end{bmatrix}$$

As we did last time, let's look at each of these coordinate transformations independently, and then see how they combine together!

Just like last time, translations are the easiest transformations to understand. We add a third dimension to the translation vector, just as we did with our definition of points above.

$$T^B_A = \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix}^B_A$$

$$p_B = p_A + T^B_A$$

$$\begin{bmatrix} x_B \\ y_B \\ z_B \end{bmatrix} = \begin{bmatrix} x_A\\ y_A \\ z_A \end{bmatrix} + \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix}^B_A$$

Again, it comes out to be simple addition, which keeps things easy.
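To make this concrete, here is a minimal sketch in plain Python (no external libraries). The point and translation values are made-up examples, not taken from any real sensor rig:

```python
# A hypothetical point in the color frame and a depth <- color translation,
# both in meters (values are made up for illustration).
p_color = (0.5, -0.2, 1.8)          # (x, y, z) in the color frame
t_depth_color = (0.03, 0.0, -0.01)  # T^depth_color

# p_depth = p_color + T^depth_color: translation is element-wise addition.
p_depth = tuple(round(p + t, 9) for p, t in zip(p_color, t_depth_color))
print(p_depth)  # (0.53, -0.2, 1.79)
```

The rounding is only there to keep floating-point noise out of the printed result; the math itself is just three additions.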

Extending scale to a third dimension changes the equation much as it did for translation. We add an extra diagonal element to the scale matrix, producing:

$$S^B_A = \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & s_z \end{bmatrix}^B_A$$

$$p_B = S^B_A \cdot p_A$$

$$\begin{bmatrix} x_B \\ y_B \\ z_B \end{bmatrix} = \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & s_z \end{bmatrix}^B_A \cdot \begin{bmatrix} x_A \\ y_A \\ z_A \end{bmatrix}$$

Just like last time, we may have a uniform scale (\(s_x = s_y = s_z\)), or a non-uniform scale (\(s_x \neq s_y \neq s_z\)) that produces an affine transform of some form. Fortunately, our scale matrix remains simple to write and apply!
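As a quick sketch, applying a diagonal scale matrix reduces to element-wise multiplication, so we never need a full matrix product. The scale factors below are made-up example values:

```python
def apply_scale(s, p):
    """Multiply the diagonal scale matrix diag(s_x, s_y, s_z) by point p."""
    return [s[i] * p[i] for i in range(3)]

p_a = [1.0, 2.0, 3.0]
print(apply_scale([2.0, 2.0, 2.0], p_a))  # uniform scale:     [2.0, 4.0, 6.0]
print(apply_scale([1.0, 0.5, 2.0], p_a))  # non-uniform scale: [1.0, 1.0, 6.0]
```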

Rotation is again going to be the most complex out of our transformations. First, let's extend our rotation matrix from last time into three dimensions. Recall that we were rotating the \(X\) and \(Y\) dimensions (i.e. we were rotating about the \(Z\) axis):

$${R_z(\kappa)}^B_A = \begin{bmatrix} \cos(\kappa) & -\sin(\kappa) & 0 \\ \sin(\kappa) & \cos(\kappa) & 0 \\ 0 & 0 & 1 \end{bmatrix}^B_A$$

Notice we called this the \(R_z(\kappa)\) matrix above, as it encodes rotations of angle \(\kappa\) about the \(Z\) axis. The challenge we find ourselves with is that this matrix can only rotate about the \(Z\) axis! It isn't possible to perform any rotations of points in the \(XZ\) or \(YZ\) planes with this matrix. For these, we would need to rotate about the \(Y\) and \(X\) axes, respectively. To do so, we define two new rotation matrices, \(R_x(\omega)\) and \(R_y(\phi)\).

$${R_x(\omega)}^B_A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\omega) & -\sin(\omega) \\ 0 & \sin(\omega) & \cos(\omega) \end{bmatrix}^B_A$$

$${R_y(\phi)}^B_A = \begin{bmatrix} \cos(\phi) & 0 & \sin(\phi) \\ 0 & 1 & 0 \\ -\sin(\phi) & 0 & \cos(\phi) \end{bmatrix}^B_A$$

Each of these rotation matrices \(R_x\), \(R_y\), and \(R_z\) rotates about its respective axis by some angle \(\omega\), \(\phi\), or \(\kappa\). Colloquially, \(\omega\), \(\phi\), and \(\kappa\) are often referred to as the *roll*, *pitch*, and *yaw* angles; taken together, they form a set of *Euler angles*. With these matrices, we're able to encode any arbitrary rotation between two coordinate frames. The question is: how do we combine them?
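Before combining them, here are the three elemental rotation matrices from the text as a dependency-free Python sketch (the companion repository may implement this differently). Angles are in radians, and matrices are row-major lists of lists:

```python
import math

def rot_x(omega):
    """Rotation by omega (roll) about the X axis."""
    c, s = math.cos(omega), math.sin(omega)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(phi):
    """Rotation by phi (pitch) about the Y axis."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(kappa):
    """Rotation by kappa (yaw) about the Z axis."""
    c, s = math.cos(kappa), math.sin(kappa)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

# Sanity check: rotating the X unit vector 90 degrees about Z sends it to Y.
p = mat_vec(rot_z(math.pi / 2), [1.0, 0.0, 0.0])
print([round(c, 9) for c in p])  # [0.0, 1.0, 0.0]
```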

Unfortunately, the answer is not so clear. Different systems may use different conventions to combine these rotation matrices. While the end result may be the same, the underlying angles and matrix components will not be. Picking an order in which to apply these matrices is called choosing a parameterization, and which parameterization you pick is likely to be driven by your application.

One aspect that can drive which parameterization you pick is a concept known as gimbal lock. Gimbal lock is a singularity in our parameterization that makes it impossible to extract our original Euler Angles from a combined rotation matrix. That may not make a lot of sense, so here's an example:

Suppose we choose the \(XYZ\) parameterization. This would look like:

$${R_{xyz}(\omega, \phi, \kappa)}^B_A = {R_x(\omega)}^B_A \cdot {R_y(\phi)}^B_A \cdot {R_z(\kappa)}^B_A =$$

$$\begin{bmatrix} \cos(\phi)\cos(\kappa) & -\cos(\phi)\sin(\kappa) & \sin(\phi) \\ \cos(\omega)\sin(\kappa) + \cos(\kappa)\sin(\omega)\sin(\phi) & \cos(\omega)\cos(\kappa) - \sin(\omega)\sin(\phi)\sin(\kappa) & -\cos(\phi)\sin(\omega) \\ \sin(\omega)\sin(\kappa) - \cos(\omega)\cos(\kappa)\sin(\phi) & \cos(\kappa)\sin(\omega) + \cos(\omega)\sin(\phi)\sin(\kappa) & \cos(\omega)\cos(\phi) \end{bmatrix}^B_A$$

This is quite the equation! To make things easier, we might instead write the rotation matrix with some placeholders, so that we don't have to write out every term each time (and risk getting one wrong).

$${R_{xyz}(\omega, \phi, \kappa)}^B_A = \begin{bmatrix} r_{xx} & r_{xy} & r_{xz} \\ r_{yx} & r_{yy} & r_{yz} \\ r_{zx} & r_{zy} & r_{zz} \end{bmatrix}^B_A$$

For starters, let's extract the \(\phi\) angle from the matrix above:

$$r_{xz} = \sin(\phi)$$

$$\sin^{-1}(r_{xz}) = \phi$$

This is fine, except when \(\phi\) is \({\pi \over 2}\) or \(-{\pi \over 2}\) (90° or −90°). If we expand the matrix to actual values assuming that \(\phi = {\pi \over 2}\), we get:

$${R_{xyz}(\omega, \phi, \kappa)}^B_A = \begin{bmatrix} 0 & 0 & 1 \\ \sin(\omega + \kappa) & \cos(\omega + \kappa) & 0 \\ -\cos(\omega + \kappa) & \sin(\omega + \kappa) & 0 \end{bmatrix}^B_A$$

This means that changing \(\omega\) or changing \(\kappa\) has exactly the same effect. As a result, there is no way to extract these two values individually, because we cannot differentiate between them! This is what is meant when someone says the parameterization is in gimbal lock.
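We can verify this numerically. In the sketch below (plain Python, illustrative angle values), two different \((\omega, \kappa)\) pairs with the same sum produce the same rotation matrix when \(\phi = {\pi \over 2}\), so the original pair is unrecoverable:

```python
import math

def rot_x(w):
    c, s = math.cos(w), math.sin(w)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(p):
    c, s = math.cos(p), math.sin(p)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(k):
    c, s = math.cos(k), math.sin(k)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def mat_mul(a, b):
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rot_xyz(w, p, k):
    """XYZ parameterization: R_x(omega) . R_y(phi) . R_z(kappa)."""
    return mat_mul(rot_x(w), mat_mul(rot_y(p), rot_z(k)))

# Gimbal lock at phi = pi/2: only omega + kappa matters.
r1 = rot_xyz(0.1, math.pi / 2, 0.5)  # omega + kappa = 0.6
r2 = rot_xyz(0.4, math.pi / 2, 0.2)  # omega + kappa = 0.6, different split
same = all(abs(r1[i][j] - r2[i][j]) < 1e-12
           for i in range(3) for j in range(3))
print(same)  # True
```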

Gimbal lock motivates how one might choose a parameterization, to try to avoid it. There are 12 in all, as described in this paper from NASA. The most common choices of parameterization you might see are the \(XYZ\) and \(ZYX\) parameterizations, but others such as \(ZYZ\) or \(XYX\) are also quite common. When choosing a parameterization, remember that the resulting rotation matrix is just a combination of the individual \(X\), \(Y\), and \(Z\) rotation matrices that were defined above!

💡 There are other parameterizations for rotations (such as unit quaternions or rotors from geometric algebra) that don't suffer from gimbal lock at all. We haven't covered these, but they provide an alternative if you're finding this 3D representation of transformations problematic.

Like before, let's put this all together! Regardless of how we parameterize our rotations, we get the following relationship:

$$p_B = f(p_A)$$

$$p_B = S^B_A \cdot R^B_A \cdot p_A + T^B_A$$

$$\begin{bmatrix} x_B \\ y_B \\ z_B \end{bmatrix} = \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & s_z \end{bmatrix} \cdot \begin{bmatrix} r_{xx} & r_{xy} & r_{xz} \\ r_{yx} & r_{yy} & r_{yz} \\ r_{zx} & r_{zy} & r_{zz} \end{bmatrix} \cdot \begin{bmatrix} x_A \\ y_A \\ z_A \end{bmatrix} + \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix}$$

Just like last time, the order is:

- Rotation
- Scale
- Translation
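The full pipeline can be sketched as a single function that applies those three steps in order. The scale, rotation, and translation values below are made-up examples:

```python
import math

def rot_z(k):
    """Rotation by k radians about the Z axis."""
    c, s = math.cos(k), math.sin(k)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transform(scale, rotation, translation, p):
    """p_B = S . R . p_A + T: rotate, then scale, then translate."""
    rotated = mat_vec(rotation, p)
    scaled = [scale[i] * rotated[i] for i in range(3)]
    return [scaled[i] + translation[i] for i in range(3)]

p_a = [1.0, 0.0, 0.0]
p_b = transform([2.0, 2.0, 2.0],     # uniform scale
                rot_z(math.pi / 2),  # 90 degrees about Z
                [0.0, 0.0, 1.0],     # translation along Z
                p_a)
print([round(c, 9) for c in p_b])  # [0.0, 2.0, 1.0]
```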

And just like last time, we can use projective or homogeneous coordinates to convert this into a single \(\Gamma^B_A\) matrix:

$$p_i = \begin{bmatrix} x_i & y_i & z_i & 1 \end{bmatrix}^{T}$$

$$\Gamma^B_A = \begin{bmatrix} s_x r_{xx} & s_x r_{xy} & s_x r_{xz} & T_x \\ s_y r_{yx} & s_y r_{yy} & s_y r_{yz} & T_y \\ s_z r_{zx} & s_z r_{zy} & s_z r_{zz} & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix}^B_A$$

$$p_B = \Gamma^B_A \cdot p_A$$

$$\begin{bmatrix} x_B \\ y_B \\ z_B \\ 1 \end{bmatrix} = \begin{bmatrix} s_x r_{xx} & s_x r_{xy} & s_x r_{xz} & T_x \\ s_y r_{yx} & s_y r_{yy} & s_y r_{yz} & T_y \\ s_z r_{zx} & s_z r_{zy} & s_z r_{zz} & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix}^B_A \cdot \begin{bmatrix} x_A \\ y_A \\ z_A \\ 1 \end{bmatrix}$$

Just as in the 2D case, the final dimension on our points remains 1, even after multiplying by \(\Gamma^B_A\)!
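Here is a sketch of building \(\Gamma^B_A\) from the scale, rotation, and translation pieces and applying it to a homogeneous point. The numeric values are made-up examples, chosen so we can check that the final coordinate stays 1:

```python
import math

def rot_z(k):
    c, s = math.cos(k), math.sin(k)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def make_gamma(scale, rotation, translation):
    """Assemble the 4x4 homogeneous matrix Gamma^B_A."""
    gamma = [[0.0] * 4 for _ in range(4)]
    for i in range(3):
        for j in range(3):
            gamma[i][j] = scale[i] * rotation[i][j]  # s_i * r_ij block
        gamma[i][3] = translation[i]                 # T_i in the last column
    gamma[3][3] = 1.0                                # bottom row [0 0 0 1]
    return gamma

def mat_vec4(m, v):
    """Multiply a 4x4 matrix by a homogeneous 4-vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]

gamma = make_gamma([2.0, 2.0, 2.0], rot_z(math.pi / 2), [0.0, 0.0, 1.0])
p_b = mat_vec4(gamma, [1.0, 0.0, 0.0, 1.0])
print([round(c, 9) for c in p_b])  # [0.0, 2.0, 1.0, 1.0] -- last entry stays 1
```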

In this article we demonstrated how our language and mathematical models for understanding 2D coordinate frames can be readily extended into three dimensions with minimal effort. Most importantly, we managed to preserve our ability to talk about coordinate transformations at a high level in English (I want the B from A transform) as well as mathematically (I want \(\Gamma^B_A\)).

Relating 3D coordinate frames is a cornerstone of any multi-sensor system. Whether we are relating multiple cameras or stitching disparate sensing technologies together, it all comes down to where these sensors sit relative to one another. Unlike the 2D case, we discovered that 3D rotations need to consider the problem of parameterization to avoid gimbal lock. This is something that all sensing systems need to consider if they want to use the matrix representation that we provide here. In fact, Tangram Vision uses these same mathematics when reasoning about device calibration in systems with much more complexity.

The world of coordinates and mathematics can be both fascinating and complex when working in multi-sensor environments. However, if this all seems like complexity that you would rather leave to experts, the Tangram Vision SDK includes tools and abstractions that simplify optimizing multi-sensor systems for rapid integration and consistent performance.

We hope you found this article helpful—if you've got any feedback, comments or questions, be sure to tweet at us!
