[PATCH v10 6/8] rust: file: add `FileDescriptorReservation`

Al Viro viro at zeniv.linux.org.uk
Sun Sep 15 18:39:05 UTC 2024


On Sun, Sep 15, 2024 at 02:31:32PM +0000, Alice Ryhl wrote:

> +impl Drop for FileDescriptorReservation {
> +    fn drop(&mut self) {
> +        // SAFETY: By the type invariants of this type, `self.fd` was previously returned by
> +        // `get_unused_fd_flags`. We have not yet used the fd, so it is still valid, and `current`
> +        // still refers to the same task, as this type cannot be moved across task boundaries.
> +        unsafe { bindings::put_unused_fd(self.fd) };
> +    }
> +}

FWIW, it's a bit more delicate.  The real rules for API users are

	1) anything you get from get_unused_fd_flags() (well, alloc_fd(),
internally) must be passed either to put_unused_fd() or fd_install() before
you return from syscall.  That should be done by the same thread and
all calls of put_unused_fd() or fd_install() should be paired with
some get_unused_fd_flags() in that manner (i.e. by the same thread,
within the same syscall, etc.)

	2) calling thread MUST NOT unshare descriptor table while it has
any reserved descriptors.  I.e.
	fd = get_unused_fd();
	unshare_files();
	fd_install(fd, file);
is a bug.  Reservations are discarded by that.  Getting rid of that
constraint would require tracking the sets of reserved descriptors
separately for each thread that happens to share the descriptor table.
Conceptually they *are* per-thread - the same thread that has done
reservation must either discard it or use it.  However, it's easier to
keep the "it's reserved by some thread" represented in descriptor table
itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is
NULL) than try and keep track of who's reserved what.  The constraint is
basically "all reservations can stay with the old copy", i.e. "caller has
no reservations of its own to transfer into the new private copy it gets".
	It's not particularly onerous[*] and it simplifies things
quite a bit.  However, if we are documenting thing, it needs to be
put explicitly.  With respect to Rust, if you do e.g. binfmt-in-rust
support it will immediately become an issue - begin_new_exec() is calling
unshare_files(), so the example above can become an issue.

	Internally (in fs/file.c, that is) we have additional safety
rule - anything that might be given an arbitrary descriptor (e.g.
do_dup2() destination can come directly from dup2(2) argument,
file_close_fd_locked() victim can come directly from close(2) one,
etc.) must leave reserved descriptors alone.  Not an issue API users
need to watch out for, though.

[*] unsharing the descriptor table is done by
	+ close_range(2), which has no reason to allocate any descriptors
and is only called by userland.
	+ unshare(2), which has no reason to allocate any descriptors
and is only called by userland.
	+ a place in early init that call ksys_unshare() while arranging
the environment for /linuxrc from initrd image to be run.  Again, no
reserved descriptors there.
	+ coredumping thread in the beginning of do_coredump().
The caller is at the point of signal delivery, which means that it had
already left whatever syscall it might have been in.  Which means
that all reservations must have been undone by that point.
	+ execve() at the point of no return (in begin_new_exec()).
That's the only place where violation of that constraint on some later
changes is plausible.  That one needs to be watched out for.



More information about the Linux-security-module-archive mailing list