Skip to content

Yoyo Code
MatyΓ‘Ε‘ Racek's blog


Indirect ownership, shallow borrow and self-referential data structures

We want Rust to be able to express a pattern that roughly looks like this:

struct ParsedFile {
  contents: Vec<u8>,
  // items in this vec reference the `contents` field data
  words: Vec<&'self:contents:indirect [u8]>
}

This is a sort of "difficulty: easy" of self-referential data structures. The example is based on Niko Matsakis's talk about Polonius (new borrow checker implementation), where something like this was the most speculative example of what Polonius could enable. Relevant part is here, but I recommend watching the whole talk, it's pretty interesting.

There's been numerous requests for this feature. At this point it's not very controversial, and I have no doubt that it will land in Rust in following years in some form.

I decided to explore how this could look like from user's point of view if this was a first class language feature, to flesh out what exactly we want as users.

As far as I understand, the main problem with this feature is currently with technicalities - Stacked Borrows, implementation, Polonius integration, soundness concerns, LLVM... i.e. not what I am discussing in this post. Those problems are deep, so I think we're still pretty far from serious language feature proposal.

Nevertheless, let's explore this and start again with an easy mode - struct which owns a buffer and has a vector with shared references to this buffer:

struct ParsedFile {
  contents: Vec<u8>,
  // items in this vec reference the `contents` field
  words: Vec<&'self:contents:indirect [u8]>
}

Note: this is not a serious syntax proposal, I just picked something that doesn't break syntax highlighting

What exactly are our requirements for this struct?

  • ParsedFile has to be movable in a specific way

    • Moving the whole struct is ok - references point to the heap, not to the struct
      • this is the crucial part and the whole reason why this pattern is so useful. More on that later
    • When moving words, Rust has to make sure that contents outlives it
      • this should be already handled by the language pretty well - it's basically equivalent of destructuring the struct to separate variables and letting current Rust borrow checker go from there.
    • Moving only contents invalidates words - this should be covered by the language pretty well, too
  • Mutating contents invalidates words (or it "might", but more on that below)

  • contents should be well readable during the whole lifetime of the struct (all references to it are shared, so this should be no problem)

This is the easy mode because all references are shared, so we have no problems with aliasing, yet. It's a natural extension of what's already possible on stack today:

let contents = std::fs::read("file.txt").unwrap();
let words = contents.split(|c| *c == b' ').collect::<Vec<_>>();

You can do this, so the natural question is - why can't you "just" put these two variables in a struct? Both of them point to indirect data, so they should be happily movable together, but they are not.

Thinking of this kind of struct as something like an "imaginary stack frame" is helpful analogy. Then this becomes an exercise of how to transfer Rust borrowing rules from stack to structs. This is also how generators and async functions actually work and where this problem often arises, so it's not a stretch to think about it like that.

Let's increase the difficulty πŸ”—

Let's look at a bit more problematic case. This should also be possible:

struct ParsedFile {
  contents: Vec<u8>,
  // this reference IS mutable
  words: Vec<&'self:contents:indirect mut [u8]>
}

Requirements are similar as before, but there must be some greater restriction, because this can easily lead to mutable aliasing and that's a big no-no in Rust.

Main difference is that contents has to be inaccessible while words are borrowed mutably.

It's unclear to me if this needs to be even stronger though. Maybe contents have to be inaccessible while words exists, i.e. for the whole lifetime of the struct and to access contents, you have to drop words.

Also, what if you create shared reference to words? Would that allow you tu also read contents? There's a lot of edge cases to sort through, but I'm not going to go too deeply into that, because I think this pattern is less useful than the one with shared references (also, this is a "shiny future" post, so we don't care about details).

Indirect ownership πŸ”—

To make this all work, we need two changes to the language.

First, we need a way to mark indirect ownership of a data in a struct. In other words, we need to express that a struct owns data that can be borrowed, but moving the struct doesn't invalidate them.

// This marks indirect ownership (syntax is just an example)
// Vec owns T indirectly
struct Vec<T + 'indirect> { /*...*/ }

Rephrased again, we have to separate lifetime of this struct from lifetime of the data it points to. We want to express that reference to this struct doesn't survive move, but reference to the indirect data does.

The relationship is a bit more complicated, because reference to indirect data still doesn't survive some moves:

let numbers = vec![42];
let first_number = &data[0];

let moved_numbers = numbers; // first_number should survive this move 
println!("{}", *first_number); //ok

drop(moved_numbers); // first_number doesn't survive this move (obviously)
println!("{}", *first_number); // compile error

References either have to be moved together with the owner (in a self-referential data structure), or the owner has to be moved somewhere "close" to the reference - to a different field or variable. It might even make sense to separate two kinds of moves - shallow move or deep move. I'm not sure if it's necessary to express that in surface Rust, though.

Shallow borrows πŸ”—

Consequently, as a second change, Rust has to remove one of its core limitations. Borrows are currently transitive, meaning that if you borrow a struct, you transitively borrow everything that this struct contains or points to.

We need this to restrict mutable borrows of the owner struct, to prevent invalidating borrows of indirect data. With &mut ParsedFile, you should be able to change words but you're not allowed to change contents.

We need a way to specify which methods borrow the indirect data and which borrow only the base struct itself. A Vec is not a good example, because there doesn't seem to be a useful mutation that doesn't mutate the indirect data, but a better example could be the ParsedFile struct itself if we modify it a bit, because it also holds indirect data:

struct ParsedFile {
    contents: Vec<u8>,
    first_word: Option<&'self:contents:indirect [u8]>
}
impl ParsedFile {
    // this method uses shallow borrow, it only borrows the base struct, 
    // not the indirect data. This means that it doesn't invalidate 
    // mutable references to the indirect data
    pub fn clear(&'shallow mut self) {
        self.first_word = None; // ok
        self.contents.clear(); // compile error
    }

    // this method borrows everything
    pub fn correct_clear(&mut self) {
        self.first_word = None; // ok
        self.contents.clear(); // ok

        // Side Note: here's another interesting aspect: 
        // `self.contents.clear()` invalidates `words`, so we have to reassign it
        self.first_word = None; // removing this line should be a compiler error

        // It's unclear if this is the right semantics
        // The alternative is to invalidate whole 
        // `&mut self` and recreate the struct, instead:
        *self = ParsedFile { contents: self.contents, first_word: None }
    }
}

Shallow borrow looks close to the idea of view types that Niko Matsakis recently wrote about. There are more interesting generalizations of this idea. For example, you might want to distinguish data mutation from its extension. It'd be cool if Vec::push_within_capacity didn't invalidate mutable references to its data - because in fact, it doesn't touch them in any way, it only adds more.

It's not clear to me if all these features need special new annotations, or just relaxing the current rules would be enough. There is a lot of edge cases to think about. Possibly also breaking changes, because some unsafe code could rely on the current strict borrowing behaviour.

Alternatives πŸ”—

Here are some alternatives that are possible today.

Just use 'static πŸ”—

It's interesting that this pattern is already possible in safe code, with a caveat, though. You can leak the Vec and make the references static:

struct ParsedFile {
  contents: &'static [u8],
  words: Vec<&'static [u8]>
}

This is of course not ideal. It doesn't work with mutable references and you either have to leak the memory or use unsafe code to drop it (which is probably unsound, because the 'static becomes a lie, then), but it's a good hint that a feature like this shouldn't be that far from what we already have.

Just use raw pointers πŸ”—

Sound equivalent alternative to 'static is to use raw pointers for everything, and recreate the contents vec in Drop impl. That's far less ergonomic, though. To make words well usable, we have to provide methods that cast them to references. That's a lot of unsafe code to deal with.

struct ParsedFile {
  contents: *const u8,
  words: Vec<(*const u8, usize)>
}

It'd be a big improvement if we only had to make contents a raw pointer.

struct ParsedFile {
  contents: *const u8,
  words: Vec<&'self:contents [u8]>
}

This would require only single unsafe block in drop impl, which is way better than using unsafe on every access. The problem is that it's impossible to name the lifetime of slices in words Vec. Currently, it's impossible to name and contain lifetime in a struct. Once you have a lifetime, it's either static or it must be generic parameter.

Just use Rental πŸ”—

There's a rental crate that allows you to use some of these patterns with a bit more ergonomic API.

#[rental]
pub struct SimpleRef {
    head: Box<i32>,
    iref: &'head i32,
}

I think there are some other crates, too, but here I wanted to focus on a native solution.

Just go safe with indexes πŸ”—

A safe alternative is to use indexes instead of references.

struct ParsedFile {
  contents: Vec<u8>,
  words: Vec<(usize, usize)>
}

This works well, and you can make it ergonomic by wrapping the index pair in a new type, but it looses all Rust guarantees about validity of references. Nice thing about using references is that Rust makes sure those references are always valid and contents can't be modified while those references exist.

Indexes are safe, but it's basically a reimplementation of pointers with no guarantees attached to them. Indexes can now be used to index anything, they can alias, doesn't need to be valid at all and Rust will happily allow you to modify contents while words is still alive.

That said, it's probably my preferred alternative, just because I don't like to use unsafe or external crates if I really don't have to.

Why do we even need this? πŸ”—

At this point pretty much everybody wants this, so I don't think it's necessary to argue, but I can at least tell why I think it is useful. Apart from obvious async+generator usages, which can be hidden as implementation detail, I'd argue this feature would be a big ergonomic win.

The most important point is that it allows you to contain lifetimes. Rust lifetimes allow you to create efficient APIs that are impossible to misuse. This is a big deal compared to say, C++, where it's common to create much more wasteful implementation, just because it's safer.

The problem is that lifetimes in Rust are infectious. They need to be tied to something on the stack, which means that you can't contain them in a private implementation, you have to expose them in a public API and that makes it much less ergonomic to use. In the case of a struct from this post, users would have to manage contents themselves on the stack, and ParsedFile would have to be generic over a lifetime, containing only words.

This feels like an arbitrary restriction - for a newcomer, it's unclear why you can't "just" put the contents in a struct. In fact, this was one of the first limitations that I have struggled with when learning Rust, because it's common in other languages. I bumped into this when trying to port some parser from JavaScript. Lifting this restriction would yet again make learning Rust a bit easier.

For example, some Web frameworks will parse an HTTP request for you, but still allow you to access the raw data. Same thing with some file parsers (well, the one that I ported). It's also useful for a lazy approach, when you want to parse the file only when needed, like this:

struct ParsedFile {
  contents: Vec<u8>,
  words: Option<Vec<&'self:contents:indirect [u8]>>
}

Another similar use case is efficient string storage - instead of storing a bunch of individually allocated Strings, you only store one and then a list of references to it. You can do this today, but it's significantly less ergonomic, because it requires you to make the struct generic over lifetime and manage the original String separately.

One use case that is not easy to do safely right now is to use this pattern with foreign types. For example, storing a collection together with an in-progress iterator over that collection.

struct ParsedFile {
  contents: Vec<u8>,
  position: Iter<'self:contents:indirect, u8>
}

Conclusion πŸ”—

That's all what I wanted to share. Most of those are not really well-developed ideas, it's more of a wishlist of how something like this could work in an ideal world. There's definitely a lot of edge cases and soundness problems to solve. Nevertheless, I hope that one day we get something like this, hopefully even better than what I described here.

Discussion on Reddit or Internals or Hacker News



P.S.: Ok, but.... πŸ”—

Ok, but what about multiple borrows? πŸ”—

struct ParsedFile {
  contents: Vec<u8>,
  first_word: &'self:contents:indirect mut [u8],
  second_word: &'self:contents:indirect mut [u8]
}

Here we must make sure the borrows don't overlap, but this is really the same situation as happens on the stack.

let mut contents: Vec<u8>;
let first_word = &mut contents[0..3];
let second_word = &mut contents[4..];

This compiles, but Rust doesn't allow you to use first_word anymore. In other words, it wouldn't even allow you to create invalid struct in the first place. You have to use split_at_mut, and after that creating the struct should be ok again.

Ok, but what about stacked borrows? πŸ”—

This is more interesting case:

struct ParsedFile {
  contents: Vec<u8>,
  first_word: &'self:contents:indirect mut [u8],
  first_letter: &'self:first_word:indirect mut [u8]
}

This goes back to our stack analogy again.

let mut contents: Vec<u8>;
let first_word = &mut contents[0..3];
let first_letter = &mut first_word[0];

Here, Rust has to make sure that first_letter is not used after the first_word is used again. This means that touching first_word in our struct would have to have the same effect as touching contents - it invalidates all derived borrows and therefore invalidates the struct (in a similar way as partial move does today).

Ok, but what about REAL Stacked Borrows? πŸ”—

Not sure...

I'm curious how compatible this is with Stacked Borrows - the proposed aliasing model for Rust. I'd imagine that not very much. SB probably has to be somehow adjusted anyway, because even currently used techniques for self-referential structs are not compatible with it, as far as I understand.

Ok, but what is the hard mode? πŸ”—

A hard mode is direct self-referential struct - where shallow borrow doesn't help us.

struct MineField {
   num: u8,
   num_ref: &'self:num u8
}

This struct has to be pinned - it's immovable. We currently have Pin type for that, but it'd be nicer if Rust had first class support for this. One awkward thing is that it's kinda difficult to even create. On the stack it's fine with some little new syntax:

let m = MineField { num: 8, num_ref: &.num }

But on heap it's currently impossible to do safely, because rust doesn't have placement new and requires moving the struct from stack. You have to do this MaybeUninit dance (not sure if it's exactly correct honestly):

let m = unsafe {
  let m: Box<MaybeUninit<MineField>> = Box::new_uninit();
  addr_of_mut!(m.num).write(8);
  addr_of_mut!(m.num_ref).write(&m.num);
  m.assume_init()
};

Here, Rust would still need to make sure that struct on the heap is never moved, though. That's where a special language support would be cool.

Veteran mode πŸ”—

Veteran mode is a number of mutually referencing structs. Intrusive linked list and similar cursed patterns.

struct Doom {
  child: Doom2<'self>
}

struct Doom2<'parent> {
  parent: &'parent Doom
}

These are even more cumbersome to create, because it's a chicken and egg problem. It's also unclear which one should be dropped first. But this one is so niche that it's probably fine with raw pointers for now (and possibly some aliasing fixes).