~/naiquevin

Understanding lifetimes in Rust

This article is meant for Rust beginners who are trying to wrap their heads around lifetimes. I don't claim to be an expert on this topic. I am also new to the language in the sense that there are many things about Rust that I still haven't internalized. But at this point, I feel comfortable reading and writing code with lifetimes and it doesn't hinder my productivity.

I got the idea for this article while working on tapestry. I encountered a situation where a small change to a function necessitated the use of explicit lifetimes. It seemed like a perfect example to demonstrate the concept. I'll be using a similar but much simpler problem statement in this article.

Let's say we are implementing some kind of static site generator in which the user can write different kinds of posts using different predefined templates [1]. There are two main entities in our code:

  1. Templates: Each template has two fields: id and supported_tags. Consider templates to be "static data", i.e. the list of templates is hard-coded. We'll assume that the predefined templates have unique ids.

  2. Posts: Each post has three fields: title, template and tags. Think of posts as user input. The template field refers to one of the templates in the static list.

In a real static site generator, templates and posts will likely have more fields such as body, date etc. but I'll skip them for conciseness.

Let's start by implementing structs to represent these two entities.

use std::collections::HashSet;

struct Template {
    id: String,
    supported_tags: HashSet<String>,
}

impl Template {
    fn new(id: &str, supported_tags: Vec<&str>) -> Self {
        Self {
            id: String::from(id),
            supported_tags: supported_tags.iter().map(|s| String::from(*s)).collect(),
        }
    }
}

#[derive(Debug)]
struct Post {
    title: String,
    template: String,
    tags: HashSet<String>,
}

impl Post {
    fn new(title: &str, template: &str, tags: Vec<&str>) -> Self {
        Self {
            title: String::from(title),
            template: String::from(template),
            tags: tags.iter().map(|s| String::from(*s)).collect(),
        }
    }
}

We've defined new methods for both structs, which will come in handy when instantiating them.

Since posts represent user input, they need to be validated. We'll verify two simple constraints:

  1. the template field of a post must match the id of one of the predefined templates

  2. tags field of a post must be a subset of the supported_tags field of the associated template

Let's represent the above validation errors using an enum. Because the term "error" is conventionally used to name error types in Rust, I am using "violation" instead to avoid confusion.

#[derive(Debug)]
enum Violation {
    TemplateRefNotFound {
        title: String,
        template_id: String,
    },
    UnsupportedTags {
        title: String,
        tags: HashSet<String>,
    },
}

Now let's define a validate method in the Post struct.

impl Post {
    fn validate(&self, templates: &Vec<Template>) -> Vec<Violation> {
        let mut violations = vec![];
        match templates.iter().find(|x| x.id == self.template) {
            Some(tmpl) => {
                if !self.tags.is_subset(&tmpl.supported_tags) {
                    let m = Violation::UnsupportedTags {
                        title: self.title.clone(),
                        tags: self.tags.clone(),
                    };
                    violations.push(m);
                }
            }
            None => {
                let m = Violation::TemplateRefNotFound {
                    title: self.title.clone(),
                    template_id: self.template.clone(),
                };
                violations.push(m);
            }
        }
        violations
    }
}

We can now try out a few examples in the main function. Let's also define a helper function to validate a post and print the violations i.e. validation errors to stdout.

fn validate_post(post: &Post, templates: &Vec<Template>) {
    // `templates` is already a reference, so it can be passed through as-is.
    let violations = post.validate(templates);
    println!("{} violations found for {post:?}", violations.len());
    // Iterating over an empty Vec does nothing, so no emptiness check is needed.
    for violation in violations {
        println!("  {violation:?}");
    }
}

fn main() {
    let templates = vec![
        Template::new("blog", vec!["opinion", "report", "personal"]),
        Template::new(
            "announcement",
            vec!["new project", "update", "security", "urgent"],
        ),
    ];

    let post = Post::new(
        "Major security update",
        "announcement",
        vec!["security", "update", "urgent"],
    );

    validate_post(&post, &templates);

    let post = Post::new("A day at the beach", "blog", vec!["personal", "song"]);
    validate_post(&post, &templates);
}

Running it prints the following to stdout.

$ cargo run
0 violations found for Post { title: "Major security update", template: "announcement", tags: {"urgent", "update", "security"} }
1 violations found for Post { title: "A day at the beach", template: "blog", tags: {"personal", "song"} }
  UnsupportedTags { title: "A day at the beach", tags: {"personal", "song"} }

Here is a Rust playground link if you wish to try it. I'll refer to this version as the first iteration.

Using references instead of cloning

The above code works but is not memory efficient. Notice that we're cloning the String objects (and the tags HashSet) in the Post.validate method. Rust is a low-level language that's designed for writing memory-efficient code. Instead of cloning data, it's possible to refer to the existing data in memory by using references.

What if we define the Violation enum in terms of the reference type &str instead of the owned type String?

Does not compile
#[derive(Debug)]
enum Violation {
    TemplateRefNotFound {
        title: &str,
        template_id: &str,
    },
    UnsupportedTags {
        title: &str,
        tags: HashSet<&str>,
    }
}

It fails to compile.

error[E0106]: missing lifetime specifier
  --> examples/itr2.rs:45:18
   |
45 |     item_id: &str,
   |              ^ expected named lifetime parameter

The error says missing lifetime specifier. Great! For the first time we've encountered the term lifetime.

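Let me show you the fix first, and then we will see why it works. The fix is to specify a lifetime parameter when defining the Violation enum, as follows (note that the tags field now holds a reference to the whole HashSet<String> rather than a HashSet<&str>):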

#[derive(Debug)]
enum Violation<'a> {
    TemplateRefNotFound {
        title: &'a str,
        template_id: &'a str,
    },
    UnsupportedTags {
        title: &'a str,
        tags: &'a HashSet<String>,
    },
}

Then modify the definition of the Post.validate method to explicitly specify the lifetime.

impl Post {
    fn validate<'a>(&'a self, templates: &Vec<Template>) -> Vec<Violation<'a>> {
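       // Body of the fn remains the same, except that each `.clone()` is
       // replaced by a borrow, e.g. `title: &self.title` and `tags: &self.tags`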
    }
}

Now it compiles. All we've done is redefine the enum and the method with the weird-looking syntax <'a> and &'a. What do these tokens mean and how do they work?

First, let's step back a bit and understand why the use of references makes the code memory efficient. If you're familiar with low-level languages such as C or C++, you may skip the next paragraph.

A reference is essentially a pointer to a memory location. Since the data that we want to store in the title, template_id and tags fields of a Violation instance already exists in memory (as fields of a Post instance), we can just store references to those same memory locations in a Violation instance instead of cloning the data. That way, an instance of the Violation enum returned by the Post.validate method needs no additional memory for the strings and tags themselves, only for the references.
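To make the difference concrete, here is a small sketch that compares where a borrowed &str and a cloned String point. The helper same_buffer and the variable names are made up for this illustration and are not part of the article's code.

```rust
/// Returns true if `a` and `b` start at the same memory address.
fn same_buffer(a: &str, b: &str) -> bool {
    a.as_ptr() == b.as_ptr()
}

fn main() {
    let title = String::from("A day at the beach");

    // Borrowing: `borrowed` points at the bytes owned by `title`.
    let borrowed: &str = &title;
    // Cloning: `cloned` owns a fresh heap allocation holding a copy of the bytes.
    let cloned: String = title.clone();

    println!("borrowed shares the buffer: {}", same_buffer(&title, borrowed));
    println!("cloned shares the buffer:   {}", same_buffer(&title, &cloned));
}
```

The first line reports true and the second false: the borrow reuses the existing bytes, while the clone pays for a second copy.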

But Rust has a concept of ownership. The data that we want to "reuse" in Violation is owned by a Post. In creating references to that data, we are "borrowing" from it. The compiler will allow this only if it can statically check that the owner outlives (or lives as long as) the borrower, i.e. the Post instance gets dropped only after the Violation instance is dropped. This is to prevent dangling references.

The specifier 'a used in the enum definition says that an instance of the enum holds references that are valid for the lifetime 'a, so the instance itself cannot outlive 'a. Note that the lifetime 'a is an abstract value. During compilation, the borrow checker will figure it out from the code that calls the function.
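Here is a minimal sketch of the same idea on a smaller type. TitleRef is a hypothetical struct invented for this example. It compiles because the owner is created before the borrower and dropped after it.

```rust
// Hypothetical type for illustration; it is not part of the article's code.
#[derive(Debug)]
struct TitleRef<'a> {
    title: &'a str,
}

fn main() {
    let owner = String::from("Major security update"); // the owner is created first
    {
        // `view` borrows from `owner`, so it must not outlive it.
        let view = TitleRef { title: &owner };
        println!("{view:?}");
    } // `view` is dropped here, before `owner`
    println!("owner is still usable: {owner}");
}
```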

Generics

This article is about lifetimes and not generics, so I won't spend much time on this topic. However, lifetimes are similar to generics and the syntax is also the same. Unlike lifetimes, the concept of generics is not unique to Rust [2]. Hence comparing the two concepts may help those who are familiar [3] with other languages that have generics.

Generic types allow us to define a single function in terms of abstract data types such that it can be compiled to work for multiple concrete data types. Thus, they help in avoiding code duplication. Consider the following function that's generic over type T.

fn max<T>(xs: &[T]) -> &T {
    // ...
}

At compile time, the compiler will look at the code that calls this function and substitute T with the actual data type that it's called with. Think of it like how a function argument gets substituted by the actual value at runtime.
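For the curious, here is one way the body of max could be filled in. The PartialOrd bound and the panic on an empty slice are my assumptions for this sketch; the signature above leaves them out.

```rust
// A possible body for `max`. The `PartialOrd` bound and the panic on an
// empty slice are assumptions made for this sketch.
fn max<T: PartialOrd>(xs: &[T]) -> &T {
    let mut largest = &xs[0]; // panics if `xs` is empty
    for x in &xs[1..] {
        if x > largest {
            largest = x;
        }
    }
    largest
}

fn main() {
    // Here T is substituted with i32 ...
    println!("{}", max(&[3, 7, 2]));
    // ... and here with &str.
    println!("{}", max(&["blog", "announcement"]));
}
```

The same function definition serves both calls, printing 7 and then blog (strings are compared lexicographically).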

Coming back to lifetimes, the idea is quite similar. Before we can use a generic type, its name has to be declared using the <T> syntax. Similarly, before we can use a lifetime parameter, its name has to be declared as <'a>. At compile time, 'a will be substituted by the actual lifetime of the object from which the &'a references are borrowed. Based on that, the borrow checker will check whether the owner outlives the borrower. If that condition is not satisfied, compilation will fail. I am simplifying a lot here, so all this may not accurately represent the actual implementation of the borrow checker.

Let's try to call validate such that the lifetime check is not satisfied. Add the following lines inside the main function.

Does not compile
let violations = {
    let post = Post::new("A day at the beach", "blog", vec!["personal", "song"]);
    post.validate(&templates)
};
println!("{} violations found", violations.len());

It doesn't compile. The error is:

error[E0597]: `post` does not live long enough
   --> examples/itr2.rs:147:9
    |
145 |     let violations = {
    |         -------- borrow later stored here
146 |         let post = Post::new("A day at the beach", "blog", vec!["personal", "song"]);
    |             ---- binding `post` declared here
147 |         post.validate(&templates)
    |         ^^^^ borrowed value does not live long enough
148 |     };
    |     - `post` dropped here while still borrowed

The error message itself does a great job of explaining what's happening. But the point is, the compiler can enforce this because of the lifetime parameter specified in the enum definition.

Before proceeding with the next example, what if I told you that the change we did to the Post.validate method definition was unnecessary? Try removing the lifetime specifiers from the method definition.

impl Post {
    fn validate(&self, templates: &Vec<Template>) -> Vec<Violation> {
       // Body of the fn remains the same
    }
}

It does compile. That's because of lifetime elision. Just like Rust's compiler can infer types, it can also infer lifetimes in certain situations. Here, because validate takes &self, the elided lifetime in the return type is taken from &self.
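As a rough sketch of what elision does for methods, the two methods below have equivalent signatures. This Post is a cut-down stand-in and the methods title and title_explicit are hypothetical, chosen only to show the rule that an elided output lifetime is tied to &self when a method has such a parameter.

```rust
// Cut-down stand-in for the article's Post; the methods are hypothetical.
struct Post {
    title: String,
}

impl Post {
    // Elided: because one parameter is `&self`, the returned reference is
    // assumed to borrow from `self`, not from `_prefix`.
    fn title(&self, _prefix: &str) -> &str {
        &self.title
    }

    // The same signature with the inferred lifetimes written out.
    fn title_explicit<'a, 'b>(&'a self, _prefix: &'b str) -> &'a str {
        &self.title
    }
}

fn main() {
    let post = Post { title: String::from("A day at the beach") };
    let title = {
        // `prefix` is dropped at the end of this block, yet `title` stays
        // valid because it only borrows from `post`.
        let prefix = String::from("draft");
        let t = post.title(&prefix);
        t
    };
    println!("{title}");
    println!("{}", post.title_explicit("draft"));
}
```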

Here is the Rust playground link for second iteration.
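In case the playground is not handy, below is a condensed, self-contained sketch of the second iteration. It is my own trimmed-down arrangement rather than the exact playground code: the constructors are dropped, the tag_set helper is hypothetical, and the elided lifetime is spelled as '_ in the return type.

```rust
use std::collections::HashSet;

struct Template {
    id: String,
    supported_tags: HashSet<String>,
}

struct Post {
    title: String,
    template: String,
    tags: HashSet<String>,
}

#[derive(Debug)]
enum Violation<'a> {
    TemplateRefNotFound { title: &'a str, template_id: &'a str },
    UnsupportedTags { title: &'a str, tags: &'a HashSet<String> },
}

impl Post {
    // The elided lifetime in the return type is tied to `&self`.
    fn validate(&self, templates: &Vec<Template>) -> Vec<Violation<'_>> {
        let mut violations = vec![];
        match templates.iter().find(|x| x.id == self.template) {
            Some(tmpl) => {
                if !self.tags.is_subset(&tmpl.supported_tags) {
                    // Borrow the fields instead of cloning them.
                    violations.push(Violation::UnsupportedTags {
                        title: &self.title,
                        tags: &self.tags,
                    });
                }
            }
            None => violations.push(Violation::TemplateRefNotFound {
                title: &self.title,
                template_id: &self.template,
            }),
        }
        violations
    }
}

// Hypothetical helper for building tag sets in this sketch.
fn tag_set(tags: &[&str]) -> HashSet<String> {
    tags.iter().map(|s| s.to_string()).collect()
}

fn main() {
    let templates = vec![Template {
        id: String::from("blog"),
        supported_tags: tag_set(&["opinion", "report", "personal"]),
    }];
    let post = Post {
        title: String::from("A day at the beach"),
        template: String::from("blog"),
        tags: tag_set(&["personal", "song"]),
    };
    for violation in post.validate(&templates) {
        println!("{violation:?}");
    }
}
```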

Borrowing from multiple owners

Now let's try to make a trivial improvement to the validation code. We can make the Violation::UnsupportedTags validation error/violation more helpful to the user by additionally mentioning which tags are indeed supported. For this, we'll need to define one more field for the UnsupportedTags enum variant and modify the validate method to populate it with a reference to the supported_tags field of the corresponding template.

Does not compile
#[derive(Debug)]
#[allow(unused)]
enum Violation<'a> {
    TemplateRefNotFound {
        title: &'a str,
        template_id: &'a str,
    },
    UnsupportedTags {
        title: &'a str,
        tags: &'a HashSet<String>,
        supported_tags: &'a HashSet<String>,
    },
}

// ...

impl Post {
    fn validate(&self, templates: &Vec<Template>) -> Vec<Violation> {
        let mut violations = vec![];
        match templates.iter().find(|x| x.id == self.template) {
            Some(tmpl) => {
                if !self.tags.is_subset(&tmpl.supported_tags) {
                    let m = Violation::UnsupportedTags {
                        title: &self.title,
                        tags: &self.tags,
                        supported_tags: &tmpl.supported_tags,
                    };
                    violations.push(m);
                }
            }
            None => {
                let m = Violation::TemplateRefNotFound {
                    title: &self.title,
                    template_id: &self.template,
                };
                violations.push(m);
            }
        }
        violations
    }
}

Our code doesn't compile any more.

error: lifetime may not live long enough
   --> examples/itr3.rs:110:9
    |
89  |     fn validate(&self, templates: &Vec<Template>) -> Vec<Violation> {
    |                 -                 - let's call the lifetime of this reference `'1`
    |                 |
    |                 let's call the lifetime of this reference `'2`
...
110 |         violations
    |         ^^^^^^^^ method was supposed to return data with lifetime `'2` but it is returning data with lifetime `'1`

The Violation::UnsupportedTags instance now borrows from two owners: the objects that the two arguments self and templates point to. With the lifetimes elided, the compiler ties the lifetime of the returned Vec<Violation> to &self alone (that is '2 in the error), so returning data borrowed from templates ('1) is rejected. The compilation error indicates the presence of these two lifetimes. To fix this, we can explicitly specify two lifetime parameters, but we have to establish a relationship between them.

impl Post {
    fn validate<'a, 'b>(&'a self, templates: &'b Vec<Template>) -> Vec<Violation<'a>>
    where 'b: 'a {
        // Body of the fn remains the same
    }
}

The where 'b: 'a clause is called a lifetime bound, and it is similar to a trait bound. It means that lifetime 'b lasts at least as long as 'a. We need to tell that to the borrow checker because the lifetime associated with the return type of the method is 'a (notice the <'a> in the return type), yet some of the returned data is borrowed from templates, which has lifetime 'b. The borrow checker will allow this only if 'a is the shorter of the two lifetimes (or they are equal). Using a lifetime bound, we're making that condition explicit.
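As a standalone illustration of a lifetime bound, separate from the validate method, consider the sketch below. The function longer and its parameter names are hypothetical. A &'b str can be returned as a &'a str only because the bound promises that 'b lasts at least as long as 'a.

```rust
// Hypothetical helper, not part of the article's code: returns whichever
// string is longer. The result is typed as `&'a str`, so a `&'b str` may only
// be returned because `'b: 'a` guarantees it is valid for at least `'a`.
fn longer<'a, 'b>(from_post: &'a str, from_template: &'b str) -> &'a str
where
    'b: 'a,
{
    if from_template.len() > from_post.len() {
        from_template
    } else {
        from_post
    }
}

fn main() {
    let template_id = String::from("announcement"); // lives longer
    {
        let title = String::from("Update"); // lives shorter
        let result = longer(&title, &template_id);
        println!("{result}");
    }
}
```

Removing the where clause from this sketch makes the compiler reject the branch that returns from_template, which mirrors the error we saw for validate.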

After making the above changes, the code does compile. On running it, we can see that the output is much more user-friendly.

$ cargo run
0 violations found for Post { title: "Major security update", template: "announcement", tags: {"urgent", "security", "update"} }
1 violations found for Post { title: "A day at the beach", template: "blog", tags: {"song", "personal"} }
  UnsupportedTags { title: "A day at the beach", tags: {"song", "personal"}, supported_tags: {"personal", "opinion", "report"} }

Actually, I lied again! It's possible to define the validate method using just one lifetime parameter. In fact, the compiler error we saw earlier hints at how to do it but I intentionally omitted that part. I wanted to show an explicit version first, which I believe is easier to reason about.

The following definition of validate also compiles.

impl Post {
    fn validate<'a>(&'a self, templates: &'a Vec<Template>) -> Vec<Violation<'a>> {
        // Body of the fn remains the same
    }
}

Why does it work? Remember that the lifetime specifier 'a in the function definition is abstract and the borrow checker will substitute it with the actual lifetimes of the objects whose references are passed as function arguments. The return value is constructed using these references and hence also borrows from the same objects. If the actual lifetimes of these two objects happen to be different, the borrow checker is smart enough to substitute 'a with the shorter one.

In our case, both arguments are references to objects that are initialized inside the main function so their actual lifetimes are the same. Let's see what happens if they are not the same.

fn main() {
    let templates = vec![
        Template::new("blog", vec!["opinion", "report", "personal"]),
        Template::new(
            "announcement",
            vec!["new project", "update", "security", "urgent"],
        ),
    ];

    {
        let post = Post::new("A day at the beach", "blog", vec!["personal", "song"]);
        let violations = post.validate(&templates);
        println!("{} violations found for {post:?}", violations.len());
    };
}

Here post gets dropped before templates. The borrow checker will substitute 'a with the lifetime of post because it's shorter. But during the lifetime of violations, both post and templates are alive, so references borrowed from them are valid. Hence it is allowed.
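The same substitution can be observed in isolation with a tiny function that uses a single lifetime parameter for both inputs and the output. The function first_non_empty is a hypothetical example written for this sketch.

```rust
// Hypothetical helper: both inputs and the output share one lifetime `'a`.
fn first_non_empty<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.is_empty() { y } else { x }
}

fn main() {
    let long_lived = String::from("blog");
    {
        let short_lived = String::new();
        // The borrow checker picks a lifetime for `'a` that both borrows can
        // satisfy, which cannot extend past the scope of `short_lived`.
        let picked = first_non_empty(&short_lived, &long_lived);
        println!("{picked}");
    }
}
```

Since picked is only used inside the inner block, where both strings are alive, the call is accepted even though the two owners are dropped at different times.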

But the following won't compile.

Does not compile
let violations = {
    let post = Post::new("A day at the beach", "blog", vec!["personal", "song"]);
    post.validate(&templates)
};
println!("{} violations found", violations.len());

Here, the return value of the validation function cannot be returned outside the block because when the block ends, post will be dropped. So there's no lifetime that can be substituted in place of 'a such that the "owner outlives the borrower" condition is satisfied.

Here is the Rust playground link for third iteration.

That's all!

Here's a recap of what we did:

  • We started with an easy but memory-inefficient implementation that resorts to cloning Strings (code)
  • In the second iteration, we used references by defining lifetime specifiers in the enum. We also found out that specifiers were not needed in the method definition in this case due to lifetime elision. (code)
  • The third iteration was a result of making the validation error message more user-friendly. In doing so, we had to bring back the lifetime specifiers in the function definition. We also found out that it was not necessary to use two lifetime parameters even though the data being returned was borrowed from two different args. The borrow checker is smart enough to consider the shorter of the two lifetimes. (code)

Summary

  • In Rust, a variable "owns" its value.
  • Whenever a data structure holds a reference, it "borrows" from some owner.
  • When a function returns a reference, the returned value borrows from one or more args (more accurately, it borrows from the objects that the args point to; args are references too in such cases).
  • Rust doesn't have a garbage collector. To ensure that memory is freed promptly, a value gets automatically dropped when it goes out of scope. So the compiler has to ensure that the owner of a value lives at least as long as the borrower.
  • When the compiler can't infer where a value is being borrowed from, it also can't infer its lifetime. In such cases, we need to explicitly specify lifetime parameters in our code.

Thanks to Samuel Chase and Pardeep Singh for reviewing this article.

Footnotes

1. I doubt that anyone would model an actual static site generator this way. I am using it just as an example that's close enough to the code of tapestry. I wanted to avoid using tapestry's code directly in this article so that I wouldn't need to explain the project first.

2. At least the popular languages of today and the ones that I know of don't have the concept of lifetimes.

3. If you're not familiar with generic types, the Rust book explains them very nicely.
