Building Robust Filesystem Interactions in Rust

Sometimes, unfortunately, you just have to read a file

Jul 15, 2025

Jeremy Steward
Senior Perception Engineer
An analog filesystem

Tangram Vision recently released our very own apt-repository for downloading and installing MetriCal. This gives us the ability to ship MetriCal without relying on Docker, making it easier for users to install and use our software without first needing to understand containers. The journey to create this repository prompted many questions about how MetriCal would behave in the broader context of the system compared to running inside a container. One thing is true about any local software, container or not: sometimes, you will just have to read a file.

There is some surprising complexity embedded in how MetriCal interacts with the filesystem layer in Linux; we’ve made (and fixed) plenty of mistakes over the years. After reviewing our own progress, we felt that these fixes presented a good introduction to a topic of intermediate difficulty in Rust: how does one build robust (and understandable) filesystem interactions? Moreover, are there any best practices we can offer as guidance to others building great software for robotics?

Note: We realize that some of the complexity of this topic is managed entirely differently across operating systems. As MetriCal operates primarily on Linux, keep in mind that most of the advice here is based on the interactions that can occur on a Linux or POSIX-like system.

For those readers TL;DRing their way through this article, here’s the punchline: Our short and sweet list of filesystem guidelines for any developer.

Filesystem Guidelines for any Developer

  1. Paths should be treated like pointers to resources, and not owned resources unto themselves.

  2. File descriptors or objects like File should almost exclusively be passed around in your software. Prevent future developers from re-using a path just because it “seems fine at the time” (hint: it is not fine).

  3. Outsource interactions with the filesystem to battle-tested software, rather than hand-rolling a recursive directory walk yourself.

  4. Aim to reduce or lint out the use of Path::join. While path-like objects are convenient, it is best to build more of our interfaces around the std::io::Read or std::io::Write traits, or if a File is absolutely necessary, the File type directly.

Still here? Great! Let’s dig into why these work.

Filesystem Grievances

Sometimes, computers hurt.

“Filesystem issues” covers a broad range of topics: anything from opening a file with incorrect permissions, to logging to a file such as our report.html, to reading ROS1 bag or MCAP data to optimize during the calibration process. To keep this focused, we are going to assume that you, the reader, are familiar with Rust and have used the std::fs APIs enough to know roughly how they work.

Based on our own interactions with the filesystem, our grievances can be broken down into one of the following categories:

  1. Error communication when things go wrong

  2. Heavy path use and re-use

  3. Complex interactions being made more difficult due to not having good default APIs in Rust’s standard library

Grievance #1: Error communication is difficult

Consider the following example Rust program:

fn main() {
    // `File::open` returns a `Result`; take the error side so it can be reported.
    let err = std::fs::File::open("/tmp/not-a-file").unwrap_err();

    let err = miette::Report::from_err(err);
    println!("{err:?}");
}

Running this program, we get an inscrutable error message:

When this gets printed out to your console during runtime, it is very hard to understand what exactly the issue is, where it originated from, or what the caller is supposed to do about it.

This illustrates a fundamental issue: calling a program from the command line often means that you are thinking in and providing paths to the program, but the program’s operation is entirely dependent on the abstractions provided by the host language (e.g. Rust’s std::fs).

Users calling the software won’t understand:

  • Which path did you try to use for this operation?

  • What operation did you perform that resulted in the error?

  • How can they fix it?

It is not always evident whether all of these questions need answers, or if answers can be provided by the program itself.

Consider a program that fails due to trying to open a file whose path points to a broken symbolic link (symlink). Is the program responsible for understanding that the symlink is broken, or is there something fundamentally more broken about the caller’s environment and machine state? We can’t provide fine-grained support for issues without at least some contextual understanding of the system and how the software is being executed.

As MetriCal grows and evolves, a growing challenge is how we report the error from the system itself (i.e. the error code and the message) and tie the surrounding context back to that error. Regrettably, this issue compounds with every other filesystem issue. When something goes wrong, the ability to debug the issue is just as valuable as the fix itself.
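To make the "which path, which operation" questions concrete, here is a minimal sketch of attaching that context by hand using only the standard library. The names `FsError` and `open_with_context` are hypothetical and exist only for illustration; this is not MetriCal's implementation.

```rust
use std::fmt;
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical error type that records what we were doing and to which path.
#[derive(Debug)]
pub struct FsError {
    operation: &'static str,
    path: PathBuf,
    source: io::Error,
}

impl fmt::Display for FsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to {} `{}`", self.operation, self.path.display())
    }
}

impl std::error::Error for FsError {
    // Expose the OS-level error as the cause so reporters can print the chain.
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        Some(&self.source)
    }
}

/// Open a file, attaching the operation and the path to any failure.
pub fn open_with_context(path: &Path) -> Result<File, FsError> {
    File::open(path).map_err(|source| FsError {
        operation: "open file",
        path: path.to_path_buf(),
        source,
    })
}
```

The cost of this approach is that every call site needs the same wrapping, which is easy to forget. That is a large part of why a crate that does it for you is attractive.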

Grievance #2: Heavy path use and re-use

Paths are a great abstraction to point to a specific resource on your system; after all, it is often said that in Unix, everything is a file. Unfortunately, heavy usage of Path, PathBuf, and path-like objects (i.e. anything that is AsRef<Path> in Rust) leads to a whole host of different error categories that are very difficult to reproduce at scale. The worst of these are time-of-check-time-of-use (TOCTOU) errors, which mostly arise from the fact that “files” themselves are a kind of global, mutable state on your system.
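As an illustration of TOCTOU, the sketch below contrasts a check-then-open pattern with simply attempting the operation and interpreting its error. The function names are hypothetical; the standard library calls (`Path::exists`, `File::open`, `OpenOptions::create_new`) are real.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, ErrorKind};
use std::path::Path;

/// Racy: the file can disappear (or be replaced) between `exists()` and `open()`.
pub fn open_racy(path: &Path) -> io::Result<Option<File>> {
    if path.exists() {
        // Another process may delete or swap the file right here.
        return File::open(path).map(Some);
    }
    Ok(None)
}

/// Better: attempt the operation and interpret the error from that single call.
pub fn open_if_present(path: &Path) -> io::Result<Option<File>> {
    match File::open(path) {
        Ok(file) => Ok(Some(file)),
        Err(err) if err.kind() == ErrorKind::NotFound => Ok(None),
        Err(err) => Err(err),
    }
}

/// Create a file only if it does not already exist, as one atomic step.
pub fn create_exclusive(path: &Path) -> io::Result<File> {
    OpenOptions::new().write(true).create_new(true).open(path)
}
```

In the second and third functions the check and the use are the same system call, so there is no window in which another process can change the answer.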

In most programs, the default “abstraction” over a file or resource is a file descriptor or file object (usually something like std::fs::File). MetriCal recently developed an error in how we generate our report.html reports from the console logs. The system had two different processes running on separate threads that were both trying to write to the same file simultaneously: a console-to-HTML converter and a logging system. Without proper thread synchronization, these competing writes created a race condition that sometimes corrupted the final report file. However, the fundamental reason this occurred is that we misused the abstraction of paths and path-like objects.
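One possible shape for avoiding this class of bug is to open the destination once and hand every writer a clone of the same synchronized handle, instead of handing each of them a path. The sketch below is an assumption about how that could look, not a description of the fix MetriCal shipped; `SharedSink` and `write_concurrently` are hypothetical names.

```rust
use std::io::{self, Write};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical handle shared by every component that writes to one report.
pub struct SharedSink<W: Write>(Arc<Mutex<W>>);

// Implemented by hand so that cloning does not require `W: Clone`
// (a `File`, for example, is not `Clone`).
impl<W: Write> Clone for SharedSink<W> {
    fn clone(&self) -> Self {
        Self(Arc::clone(&self.0))
    }
}

impl<W: Write> SharedSink<W> {
    pub fn new(writer: W) -> Self {
        Self(Arc::new(Mutex::new(writer)))
    }

    /// Write one complete line while holding the lock, so lines never interleave.
    pub fn write_line(&self, line: &str) -> io::Result<()> {
        let mut guard = self.0.lock().expect("sink mutex poisoned");
        writeln!(guard, "{line}")
    }
}

/// Two writer threads (stand-ins for a converter and a logger) share one sink.
pub fn write_concurrently<W: Write + Send + 'static>(sink: &SharedSink<W>) -> io::Result<()> {
    let handles: Vec<_> = ["converter", "logger"]
        .into_iter()
        .map(|name| {
            let sink = sink.clone();
            thread::spawn(move || -> io::Result<()> {
                for i in 0..100 {
                    sink.write_line(&format!("{name} line {i}"))?;
                }
                Ok(())
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("writer thread panicked")?;
    }
    Ok(())
}
```

With a real report, the sink would be built once from an opened file (for example `SharedSink::new(File::create(path)?)`) at the edge of the program, and the path would not travel any further.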

Grievance #3: Complex filesystem interactions

Similarly, complex filesystem interactions such as recursively walking a directory can easily result in a number of TOCTOU errors even when paths are not re-used:

  • Infinitely recursing the directory over and over again because of a failure to check for hard or symbolic links that point to a higher directory. It is imperative to check for e.g. repeated inode / vnode information when recursing.

  • Opening too many file descriptors at once and exhausting your operating system’s limits on the number of open file descriptors.

  • Crossing an unexpected filesystem boundary that may mean any inode / vnode tracking must be managed separately for each filesystem being traversed.

  • Permissions and ownership of items in the directory may be entirely different than the permissions and ownership of the parent directory itself, resulting in a range of hard-to-debug permissions errors. Coupled with the aforementioned difficulty in error reporting, this can sometimes mean errors that are difficult to pin down and describe effectively to users.

While Rust's std::fs::read_dir makes directory traversal look simple, recursively walking through all subdirectories involves many hidden system calls that can fail at any point. Rust's user-friendly API can mislead developers into thinking recursive directory operations are as straightforward as reading a single directory, when they're actually much more complex and error-prone.
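To give a sense of the hidden work, here is a deliberately small, Unix-only sketch of a hand-rolled walk that guards against link cycles by tracking (device, inode) pairs. `walk_once` is a hypothetical name, and the sketch is intentionally incomplete: it aborts on the first error (including a broken symlink) and says nothing about filesystem boundaries or permissions.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::{Path, PathBuf};

/// Collect regular files under `root`, following symlinks but refusing to
/// enter any directory (identified by device + inode) more than once.
pub fn walk_once(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut seen: HashSet<(u64, u64)> = HashSet::new();
    // An explicit stack keeps only one directory handle open at a time.
    let mut pending = vec![root.to_path_buf()];
    let mut files = Vec::new();

    while let Some(dir) = pending.pop() {
        // `fs::metadata` follows symlinks, so a link back to a parent
        // resolves to the parent's (dev, ino) pair and is skipped below.
        let meta = fs::metadata(&dir)?;
        if !seen.insert((meta.dev(), meta.ino())) {
            continue;
        }
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            // Each of these calls is a separate syscall that can fail, and
            // the entry may change between `read_dir` and `metadata`.
            let meta = fs::metadata(&path)?;
            if meta.is_dir() {
                pending.push(path);
            } else if meta.is_file() {
                files.push(path);
            }
        }
    }
    Ok(files)
}
```

Even this short function has three separate fallible calls per entry, and each `?` discards the path that caused the failure. A production-quality walk needs considerably more than this.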

This issue becomes particularly problematic when running software inside Docker, as we do with MetriCal. The software only recognizes paths within the Docker container (e.g., you'll see /volumes/data/my/path/to/files instead of my/path/to/files). Additionally, problems like broken symlinks and missing data from incorrectly mounted filesystems can manifest in new ways, stemming entirely from improper Docker volume configuration for a specific system.

Part of our impetus for providing apt/deb packages for MetriCal is that for a subset of users, running MetriCal locally without a container vastly simplifies their understanding of what happens when something complex and/or erroneous does occur. Docker makes a lot of the deployment part of the software easy, but can ramp up complexity in managing how it interacts with local parts of the filesystem.

Filesystems: Distributed Systems, Not Trees

Not pictured: A filesystem

A major contributor to problematic filesystem patterns is a fundamental misunderstanding of what the filesystem is. A common model taught to programmers about how to reason about the filesystem is that it is fundamentally a tree-like structure; however, this is only true of paths themselves, not the actual filesystem-as-an-abstraction.

Despite widespread belief that filesystems function like tree-like structures, they more closely resemble distributed systems. Hard links, symbolic links, and mounted directories (which can cross filesystem boundaries) all demonstrate ways in which filesystems aren't truly tree-like. The simplest way to understand this comparison is that filesystems sit beneath multiple processes and contain shared, mutable state that's accessed through the kernel's virtual filesystem (VFS) layer.
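A small, Unix-only sketch can show why the tree model breaks down: two paths in unrelated directories can name the same underlying file. `same_file` is a hypothetical helper that compares device and inode numbers.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Returns true when two paths resolve to the same underlying file
/// (same device and inode), regardless of how different the paths look.
pub fn same_file(a: &Path, b: &Path) -> io::Result<bool> {
    let (ma, mb) = (fs::metadata(a)?, fs::metadata(b)?);
    Ok(ma.dev() == mb.dev() && ma.ino() == mb.ino())
}
```

After `std::fs::hard_link(original, alias)`, this returns true for the pair, and a write through one path is visible through the other. Nothing in either path string hints at that relationship.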

Many problems with filesystem patterns stem from the same issue seen in non-Rust programs—specifically, a weak concept of ownership where other processes or the kernel itself might not honor ownership principles. Unlike memory management issues, however, we can't simply "switch to Rust" and expect the borrow-checker to magically resolve our filesystem challenges.

These errors primarily stem from developers' incomplete understanding of the distinction between paths and file objects, as well as a tendency to underestimate the complexity of filesystem operations. A good practice is to become familiar with how your operating system handles paths and file descriptors, along with the range of possible errors for each operation. For perspective, std::io::ErrorKind contains over 30 different variants. Understanding these errors and developing better ways to present them is essential for building robust software.
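As one way of "developing better ways to present" these errors, a program can map the kinds it expects to a short, actionable hint. The sketch below is illustrative only: `hint_for` is a hypothetical function and the wording of each hint is an assumption, not text from MetriCal.

```rust
use std::io::{self, ErrorKind};

/// Map a few common I/O error kinds to a hint for the person running the tool.
pub fn hint_for(err: &io::Error) -> &'static str {
    match err.kind() {
        ErrorKind::NotFound => {
            "check that the path exists and that any symlinks along it are intact"
        }
        ErrorKind::PermissionDenied => "check the permissions and ownership of the file",
        ErrorKind::AlreadyExists => "remove the existing file or choose another output path",
        // `ErrorKind` is non-exhaustive, so a fallback arm is always required.
        _ => "see the underlying OS error for details",
    }
}
```

The fallback arm matters: new variants can be added to `ErrorKind` over time, so a hint table like this is never complete and should always leave the original error visible.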

Filesystem Guidelines for any Developer (Again)

That understanding is how we got these common guidelines that developers should live by when writing software that interacts with the filesystem:

  1. Paths should be treated like pointers to resources, and not owned resources unto themselves.

  2. File descriptors or objects like File should almost exclusively be passed around in your software. Prevent future developers from re-using a path just because it “seems fine at the time” (hint: it is not fine).

  3. Outsource interactions with the filesystem to battle-tested software, rather than hand-rolling a recursive directory walk yourself.

  4. Aim to reduce or lint out the use of Path::join. While path-like objects are convenient, it is best to build more of our interfaces around the std::io::Read or std::io::Write traits, or if a File is absolutely necessary, the File type directly.

How Rust Makes This Better

Fortunately, the errors listed in the previous sections were something that Tangram could tackle and solve. While we will probably never root out every possible filesystem problem in MetriCal, we have made great strides in doing so, and continue to find ways to make this better every day.

Here are some of the patterns we have leveraged and some of the crates that we found help us write better code by default day to day:

Lint std::fs calls out of your project

It is a bit heavy-handed, but an easy way to avoid the pitfalls of the std::fs APIs in Rust is simply to create a clippy.toml file and lint these types and functions out of your code entirely. An example of such a configuration might be:

disallowed-types = [
  "std::fs::DirEntry",
  "std::fs::File",
  "std::fs::OpenOptions",
  "std::fs::ReadDir",
]

disallowed-methods = [
  "std::fs::canonicalize",
  "std::fs::copy",
  "std::fs::create_dir",
  "std::fs::create_dir_all",
  "std::fs::hard_link",
  "std::fs::metadata",
  "std::fs::read",
  "std::fs::read_dir",
  "std::fs::read_link",
  "std::fs::read_to_string",
  "std::fs::remove_dir",
  "std::fs::remove_dir_all",
  "std::fs::remove_file",
  "std::fs::rename",
  "std::fs::set_permissions",
  "std::fs::soft_link",
  "std::fs::symlink_metadata",
  "std::fs::write",
  "std::os::unix::fs::symlink",
  "std::os::windows::fs::symlink_dir",
  "std::os::windows::fs::symlink_file",
]

These APIs are generally feature complete, and the docs are transparent about their limitations. The problem is that they are easy to misuse, and when things go wrong their error reporting gives you very little to work with. However…

Use fs-err or cap-std as a replacement

Fortunately there are replacements for std::fs that make a lot of these issues either:

  1. Disappear entirely; or

  2. Become much more difficult to misinterpret

Two crates we love: fs-err and cap-std. It is beyond the scope of this article to demonstrate everything in these libraries, but consider just fs-err, which advertises itself as a drop-in replacement for std::fs. Using our previous sample program, errors from this crate look like the following:

fn main() {
    // Notice how we replaced std::fs::File with fs_err::File
    let err = fs_err::File::open("/tmp/not-a-file").unwrap_err();

    let err = miette::Report::from_err(err);
    println!("{err:?}");
}

And produce the following output:

…which is significantly clearer than our initial pass! This error message includes both the operation ("failed to open file") and the specific path (/tmp/not-a-file) that caused the failure. For this reason alone, replacing std::fs with an alternative like fs-err represents a clear improvement for end-users, all without requiring any additional code instrumentation.

Use walkdir for iterating over directory contents

Likewise, we heartily support the walkdir crate. Like fs-err in the previous section, walkdir handles most of the underlying complexity of recursively walking a directory and iterating over files. This crate is fantastic because it does not follow symlinks by default, and it can easily be configured to stay within a single filesystem.

Overall, the implementation avoids the vast majority of the obvious bugs for recursing through a directory and provides a compelling implementation that’s trusted by many Rust users, including the fantastic ignore crate and fd-find tool.

Avoid passing path-like objects around on their own

The simplest way to avoid path re-use is to eliminate path-like objects entirely. Passing around generic path types like P: AsRef<Path> or PathBuf, especially beyond initial command line argument parsing, often leads to the problematic patterns we've discussed. While we can't immediately open every file that MetriCal might use, we should never clone or pass around paths without carefully considering file descriptor access and ownership. As with all Rust code, proper ownership thinking is the path to victory.
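A minimal sketch of what this looks like at the function level: the functions below accept anything that implements `std::io::Read` or `std::io::Write` rather than a path, so the decision about where bytes come from stays at the edge of the program. `count_lines` and `write_summary` are hypothetical examples, not MetriCal functions.

```rust
use std::io::{self, BufRead, BufReader, Read, Write};

/// Count lines from any reader; the caller decides where the bytes come from.
pub fn count_lines(input: impl Read) -> io::Result<usize> {
    let mut count = 0;
    for line in BufReader::new(input).lines() {
        line?; // surface read errors instead of silently dropping them
        count += 1;
    }
    Ok(count)
}

/// Write a tiny summary to any writer: a file, a buffer, or stdout.
pub fn write_summary(mut out: impl Write, lines: usize) -> io::Result<()> {
    writeln!(out, "lines: {lines}")
}
```

At the command line boundary the program opens the file once and passes the resulting file object to `count_lines`; nothing downstream ever sees the path, so nothing downstream can re-open it. As a side benefit, the same functions can be exercised in tests with an in-memory byte slice and a `Vec<u8>`.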

We open file descriptors as late as possible for pragmatic reasons, but aim to discard paths and path-like objects as soon as we can. Crates like cap-std enhance this approach by using the openat family of calls to provide a capability-first API that only allows relative paths (and lets you control whether to accept symlinks).

Conclusion

Filesystems are tricky beasts, and some of the most difficult bugs can arise from less-than-obvious mismatches between how we think about our abstractions (paths and files-as-individually-owned-resources) and how our system treats our filesystem (closer to a distributed system of global mutable state). While many of the bugs explored here never resulted in extreme data corruption or the deletion of customer data, many did result in annoying error messages or less-than-graceful behaviour when reading or writing calibration data from MetriCal.

While Rust cannot always magically solve all filesystem interactions, the patterns and libraries outlined above continue to improve MetriCal's robustness. By adopting similar approaches in your own code, you can avoid many common pitfalls and build more reliable software. We hope you do!
