This is a follow-up conversation with OpenCV, centered on the proliferation of sensors. See the first talk for how sensors are deployed today and how different modalities compensate for one another.
Find the slides from this latest talk below.
OpenCV was kind enough to host our own CEO and cofounder Brandon Minor (again) on their weekly webinar. The topic: Sensors for your sensors! Part Deux.
There has been an explosion of sensors in the market, which has increased capabilities and decreased prices. This has resulted in a calibration combinatorics problem that, as yet, has no solution. We dive into this problem today.
Many sensor systems engineers have become used to creating "pipelines" to control the "flow" of data. This makes them all plumbers of a sort. Whether your data is going through physical wires, over the air, or through the cloud, it's all still going through a pipeline, all being controlled for flow; all plumbing. It's an apt comparison, honestly, and not just because of the terminology. Plumbing is something that's taken for granted. Most people don't give it a second thought in their day-to-day lives.
Yet as soon as plumbing fails, everything goes wrong. What does one do when their pipes burst and the basement is flooding? Most people don't have the expertise to deal with it themselves, so they call in a professional plumber, someone who can take everything apart and fix the problem at its source.
What do we do when our perception pipeline explodes? Do we bring in a data piping expert? What does that even look like? There are so many varieties of data, and so many ways to pipe it around. The requirements for this job would seem so abstract as to be impossible to hire for.
Of course, this isn't the case. There are perception data experts: we call them calibration engineers.
Calibration tells you everything you need to know about your plumbing: how each sensor behaves (its intrinsics), how your sensors relate to one another (their extrinsics), and how much you can trust the data flowing through. We need all of this information to create an efficient and robust data pipeline.
Perceptive readers will notice that I still haven't answered my original question: what do we do when our perception pipeline explodes? If we were keeping to the plumber analogy, now would be the time to look through our Rolodex and find a top-notch calibration engineer to come fix our crumbling infrastructure. And yet... when was the last time you met a calibration engineer?
They don't exist.
The sad reality is that there is virtually no such thing as a calibration engineer in Industry, and very few in Academia. Calibration is considered a niche within a niche, and many of the tasks involved in structuring data pipelines seem like one-offs. As such, large and small companies alike optimize their technical hiring for generalists.
This lack of talent is a setup for disaster. The number of sensors has exploded in recent years; their capabilities have increased; and their cost has gone down. This means that there are more ways to approach data than ever, more ways to mix and match these sensor systems. This is a calibration combinatorics problem, and it's getting harder to solve every day.
Instead of investing in the tools needed, companies opt to make compromise after compromise. All of this either eats into their margin or affects the utility of the product.
To illustrate this, let's dive into how calibration is handled now. We'll examine a robot equipped with an IMU, two cameras, and a LiDAR puck; in my experience, this is one of the more common configurations out there.
By far the most common technique for calibrating cameras is "the checkerboard dance", also known as Zhang's method [1]. Everyone working with cameras has done or seen this process. It's available through the most popular computer vision platforms, like OpenCV and ROS, for multiple lens models.
Its popularity comes from its ease of use. Creating datasets for it is a breeze: just wave the camera in front of a checkerboard, detect the board's corners, and let the optimization run. If the errors are low enough, a camera is considered "good to go".
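For a sense of how little ceremony is involved, here's a minimal sketch of that loop using OpenCV's Python bindings. The board dimensions, square size, and image folder are all assumptions for illustration, not anything prescribed:

```python
import glob

import cv2
import numpy as np

PATTERN = (9, 6)      # inner-corner count of the board (an assumption)
SQUARE_SIZE = 0.025   # board square size in meters (also an assumption)

# 3D coordinates of the corners in the board's own frame (the z = 0 plane).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.png"):  # hypothetical dataset location
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if not found:
        continue  # silently dropping frames: a preview of the shortcuts to come
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_points.append(objp)
    img_points.append(corners)

# Zhang's method: solve for the intrinsic matrix K and distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print(f"RMS reprojection error: {rms:.3f} px")  # "low enough" -> good to go
```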
Now that we understand your camera's unique profile, called its intrinsics, we can use that to inform its extrinsics, i.e. the relationship between cameras. Again, take the error for every camera, and continue on if the error stays under a certain threshold.
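The extrinsic step chains directly off those results. A sketch, again with OpenCV; it assumes the intrinsic step above has already been run once per camera, so K1/d1, K2/d2, the matched detections img_points1/img_points2, and image_size are carried over rather than defined here:

```python
import cv2

# With each camera's intrinsics (K1, d1) and (K2, d2) held fixed, solve for
# the rotation R and translation T of camera 2 relative to camera 1.
# obj_points, img_points1, and img_points2 hold the same board views as seen
# by both cameras; image_size is (width, height) in pixels.
rms, _, _, _, _, R, T, E, F = cv2.stereoCalibrate(
    obj_points, img_points1, img_points2,
    K1, d1, K2, d2, image_size,
    flags=cv2.CALIB_FIX_INTRINSIC)
print(f"Stereo RMS: {rms:.3f} px")  # under threshold? then we "continue on"
```

Note that CALIB_FIX_INTRINSIC bakes the earlier intrinsic results in as gospel. Hold that thought; it's exactly the chained dependence we'll regret shortly.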
LiDAR calibration follows something similar to cameras, though we don't have to use a target; a flat plane will do [2]. We can use that information to refine the intrinsic calibration already on the LiDAR module, then align the structures derived from the LiDAR scans with the features found in the already-calibrated camera system, maybe using a dimpled checkerboard. Done.
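The building block underneath that plane-based approach is simple enough to sketch: fit a plane to a patch of returns, then judge (or refine) the per-beam corrections by the residuals. A minimal version, with synthetic data standing in for real returns off the puck:

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through an Nx3 patch of LiDAR returns via SVD.
    Returns the unit normal n and offset d of the plane n . x = d."""
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is normal
    # to the best-fit plane through the centered points.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return normal, normal @ centroid

# Synthetic stand-in for a patch of returns off a flat wall.
rng = np.random.default_rng(0)
patch = rng.normal(size=(500, 3)) * np.array([1.0, 1.0, 0.01])
normal, d = fit_plane(patch)
residuals = patch @ normal - d
print(f"plane RMS residual: {np.sqrt(np.mean(residuals**2)):.4f} m")
```

In a real intrinsic refinement, those residuals feed back into per-beam angle and range corrections; here they just tell us how flat the "wall" is.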
IMU calibration relies on periods of motion and stillness to get good readings. Most IMUs come pre-calibrated, but they drift over time. We can match the motion measurements of an IMU to the path that we derived for our cameras or LiDAR when we were calibrating those. Do this, and we find the bias and scale adjustments needed for the IMU to read true (its intrinsics), as well as an extrinsic measurement for it with respect to the rest of the system.
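The bias-and-scale fit at the core of that step is, at heart, a regression. A toy sketch with synthetic data; a real pipeline would first time-align the streams and rotate the rates into a common frame, which is glossed over here:

```python
import numpy as np

# Hypothetical time-aligned angular rates: 'ref' differentiated from the
# camera/LiDAR trajectory, 'meas' read straight off the IMU's gyro.
rng = np.random.default_rng(1)
ref = rng.normal(size=(1000, 3))  # rad/s, one column per axis
meas = 1.02 * ref + 0.005 + rng.normal(scale=0.01, size=ref.shape)

# Per-axis linear model meas = scale * ref + bias, solved in closed form.
A = np.stack([ref, np.ones_like(ref)], axis=-1)  # shape (1000, 3, 2)
for axis in range(3):
    (scale, bias), *_ = np.linalg.lstsq(A[:, axis, :], meas[:, axis], rcond=None)
    print(f"axis {axis}: scale = {scale:.4f}, bias = {bias:.4f} rad/s")
```

Notice the asymmetry: the camera-derived path is treated as truth, and the IMU is bent to match it. That assumption is where the trouble described below creeps in.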
So, with all this said, we have seemingly calibrated our system! It wasn't a horrible effort, though the code to get here wasn't simple. Now, let's take our robot out and see what we get.
Bad news: what we get is not great. Each of the processes we chained together had its own little sources of error. It could have been a bad motion set for the IMU, a blurry photo in the camera dataset, or a misclassified plane for the LiDAR. By making each sensor directly dependent on the precision of another process, that error was propagated throughout the system in a way we can't measure or understand.
In fact, it's worse than it seems: our calibration process presented its results as absolute truths! It told us "this is your camera, here's how it behaves." But what it told us was a half-truth at best. We'll see the results of this lie in real time as our system runs into a wall that it thought was still a meter away.
A first pass at remedying this situation is usually, paradoxically, to "fix" the inbound data used for the calibration process. This leads to all sorts of heuristics being used to threshold what counts as "good" and "bad" data. Is the error on this point too high? Throw it out! Does that IMU motion seem too erratic? Get rid of it. If it doesn't seem "right", some processes just ditch it altogether. This predictably makes the problem worse, not better; the data is being fit to the model, not the model to the data. We aren't ready for the real world.
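Here's a toy version of that culling loop to show why it flatters itself: each pass refits only to the points that agreed with the previous fit, so the reported error can only shrink while the systematic error simply vanishes from view. The threshold is arbitrary, as these heuristics tend to be:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=x.size)
y[::10] += 0.5  # a systematic error the line model can't explain

THRESHOLD = 0.1  # "is the error on this point too high? throw it out!"
keep = np.ones(x.size, dtype=bool)
for _ in range(10):
    slope, intercept = np.polyfit(x[keep], y[keep], 1)
    residuals = np.abs(y - (slope * x + intercept))
    keep = residuals < THRESHOLD  # cull whatever disagrees with the fit

kept_rms = np.sqrt(np.mean((y[keep] - (slope * x[keep] + intercept)) ** 2))
print(f"kept {keep.sum()}/{x.size} points, RMS on kept points: {kept_rms:.3f}")
```

The final RMS looks wonderful, because every point that could have raised it was thrown away. The data was fit to the model.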
And what about that GPS unit we just picked up? It could inform our calibration, but how do we add it on? The process we've stitched together doesn't allow for anything beyond just our camera, or just our LiDAR, or just our IMU. It's all one at a time.
To top it all off: as roboticists, we know that no matter how sturdily our sensing rig is bolted on, those sensors are going to change. Even the best calibration is just a snapshot in time, and we'll lose utility in a dynamic environment. Having a piecemeal calibration process like the one above is just asking to decommission a machine sooner or later.
All of what I've described above is common practice among industry experts. This is not the state of the art by any means; academia has come at this problem 20 different ways. But all of the novel calibration work is done by a handful of groups. That talent is hard to come by, and their tools aren't hardened for everyday, practical use.
So, we have no tools, and no experts, and a ton of in-house process and code to support if we want to keep our system running. There has to be a better way...
Our real problem is that the techniques that "worked" for individual sensors don't work for the complex sensor systems that are demanded by the newest conditions and applications. There needs to be a holistic and general framework for dealing with these data sources, one that doesn't have trouble handling the different data types, approaches, and supplementary info that we throw at it.
Luckily, that solution is already one of the oldest in the field: it's called bundle adjustment. Remember how each calibration process we ran above was independent of the others? With bundle adjustment, everything is solved for at the same time. All sensor data and ground truth points are thrown in and adjusted together, in a "bundle".
Note: The "bundle" here can also refer to the bundle of rays going between a camera center and object space. See [3] for a breakdown of that terminology and a survey of BA methods in computer vision.
If you've ever dived into calibration yourself, you might be groaning right now. Bundle adjustment is already common practice; how can this be the future if it's part of the problem in the present? I argue that the vision industry just hasn't been using it correctly.
Bundle adjustment has been deeply understood for a long time in fields like photogrammetry and metrology, with some landmark publications being printed all the way back in the 1970s [4]. In these works, researchers explored how to handle dozens of cameras, landmarks, and models simultaneously in one process. This gave calibration results, sure, but it also gave intuition on the utility of your data, the precision of your sensor readings for every sensor, and the trustworthiness of your ground truth.
All of this has been done before for both long-range and short-range vision. However, given the difference in application between photogrammetry (which is primarily static) and computer vision (which is primarily dynamic), these techniques were never brought over wholesale from one discipline to the other. Instead, the computer vision field has rediscovered the same statistical processes over the last 50 years. Constrained computational resources made it necessary for computer vision engineers to cut corners as this rediscovery occurred. These shortcuts have now been ingrained into the engineering culture of CV.
Now, I am picking on my own field here, but there are serious research challenges still open in both photogrammetry and computer vision. They lie mainly in bringing together all these different sensor modalities under any and all conditions, and still getting a good calibration for a strong data pipeline. I'm happy to say that the lines between the fields are starting to blur. There are a few new open-source computer vision projects taking this holistic approach to calibration in different ways.
LIO-SAM, for instance, optimizes the LiDAR and IMU as a part of normal structure-from-motion operation under a continuous time model, which shares much of the mathematics of pure bundle adjustment. There's no special process; the calibration refinement for these sensors is baked in.
mrcal, a newer camera calibration library from Caltech, deviates from the norm we saw in our calibration process above by treating the checkerboard as uncertain rather than ground truth. It optimizes the camera and the checkerboard at the same time. This gives the user a better sense of precision for both their data and their model, which proves invaluable.
Deep learning also holds some utility here, as learned metric data can be wrapped into a bundle adjustment just like any other data source. This is certainly an area ripe for development in the coming years.
The flexibility and power of pure bundle adjustment shouldn't be underestimated. It can ingest any and every sensor modality alongside supplementary data, and it hands back not just a calibration but a measure of how much that calibration can be trusted.
Many of the problems cited with the technique come from misguided implementations of the fundamentals, a consequence of the bifurcation in the literature between photogrammetry (where it started) and computer vision (where it got popular).
The future, the real next step, is making bundle adjustment for calibration as easy to do as the checkerboard dance in OpenCV. The math and the models are there; it will just take some dedication to reach. When this happens, Industry won't just have better results with their current systems. They'll have a repeatable, tried-and-true path to great systems no matter the form or function.
It might not surprise you to hear that we're working on this next step at Tangram Vision. Check out the Tangram Vision Platform if you're tired of fighting your data pipeline and are ready to start focusing on what makes your platform unique!
[1] Zhang, Z. (1998). A Flexible New Technique for Camera Calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr98-71.pdf

[2] Bergelt, R., Khan, O., & Hardt, W. (2017). Improving the Intrinsic Calibration of a Velodyne LiDAR Sensor. 2017 IEEE SENSORS. https://ieeexplore.ieee.org/document/8234357

[3] Triggs, B., McLauchlan, P. F., Hartley, R. I., & Fitzgibbon, A. W. (2000). Bundle Adjustment — A Modern Synthesis. Lecture Notes in Computer Science, 298–372. https://hal.inria.fr/inria-00548290/document

[4] Kenefick, J., Gyer, M., & Harp, B. (1972). Analytical Self-Calibration. Photogrammetric Engineering, 1117–1126. https://www.asprs.org/wp-content/uploads/pers/1972journal/nov/1972_nov_1117-1126.pdf