[PATCH v10 6/8] rust: file: add `FileDescriptorReservation`

Al Viro viro at zeniv.linux.org.uk
Sun Sep 15 19:34:43 UTC 2024


On Sun, Sep 15, 2024 at 07:39:05PM +0100, Al Viro wrote:
> 	2) calling thread MUST NOT unshare descriptor table while it has
> any reserved descriptors.  I.e.
> 	fd = get_unused_fd();
> 	unshare_files();
> 	fd_install(fd, file);
> is a bug.  Reservations are discarded by that.  Getting rid of that
> constraint would require tracking the sets of reserved descriptors
> separately for each thread that happens to share the descriptor table.
> Conceptually they *are* per-thread - the same thread that has done
> reservation must either discard it or use it.  However, it's easier to
> keep the "it's reserved by some thread" represented in descriptor table
> itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is
> NULL) than try and keep track of who's reserved what.  The constraint is
> basically "all reservations can stay with the old copy", i.e. "caller has
> no reservations of its own to transfer into the new private copy it gets".

FWIW, I toyed with the idea of having reservations kept per-thread;
it is possible and it simplifies some things, but I hadn't been able to
find a way to do that without buggering syscall latency for open() et.al.

It would keep track of the set of reservations in task_struct (count,
two-element array for the first two + page pointer for spillovers,
for the rare threads that need more than two reserved simultaneously).

Representation in fdtable:
state		open_fds bit	value in ->fd[] array
free		clear		0UL
reserved	set		0UL
uncommitted	set		1UL|(unsigned long)file
open		set		(unsigned long)file

with file lookup treating any odd value as 0 (i.e. as reserved)
fd_install() switching reserved to uncommitted *AND* separate
"commit" operation that does this:
	if current->reservation_count == 0
		return
	if failure
		for each descriptor in our reserved set
			v = fdtable->fd[descriptor]
			if (v) {
				fdtable->fd[descriptor] = 0;
				fput((struct file *)(v & ~1);
			}
			clear bit in fdtable->open_fds[]
	else
		for each descriptor in our reserved set
			v = fdtable->fd[descriptor]
			if (v)
				fdtable->fd[descriptor] = v & ~1;
			else
				BUG
	current->reservation_count = 0

That "commit" thing would be called on return from syscall
for userland threads and would be called explicitly in
kernel threads that have a descriptor table and work with
it.

The benefit would be that fd_install() would *NOT* have to be done
after the last possible failure exit - something that installs
a lot of files wouldn't have to play convoluted games on cleanup.
Simply returning an error would do the right thing.

Two stores into ->fd[] instead of one is not a big deal;
however, anything like task_work_add() to arrange the call
of "commit" ends up being bloody awful.

We could have it called from syscall glue directly, but
that means touching assembler for quite a few architectures,
and I hadn't gotten around to that.  Can be done, but it's
not a pleasant work...



More information about the Linux-security-module-archive mailing list