[PATCH bpf-next 3/4] bpf: Introduce path iterator

Fri May 30 00:42:16 UTC 2025

On Thu, May 29, 2025 at 4:10 PM Al Viro <viro at zeniv.linux.org.uk> wrote:
>
> On Thu, May 29, 2025 at 03:13:10PM -0700, Song Liu wrote:
>
> > Is it an issue if we only hold a reference to a MNT_LOCKED mount for
> > short period of time? "Short period" means it may get interrupted, page
> > faults, or wait for an IO (read xattr), but it won't hold a reference to the
> > mount and sleep indefinitely.
>
> MNT_LOCKED mount itself is not a problem.  What shouldn't be done is
> looking around in the mountpoint it covers.  It depends upon the things
> you are going to do with that, but it's very easy to get an infoleak
> that way.
>
> > > OTOH, there's a good cause for moving some of the flags, MNT_LOCKED
> > > included, out of ->mnt_flags and into a separate field in struct mount.
> > > However, that would conflict with any code using that to deal with
> > > your iterator safely.
> > >
> > > What's more, AFAICS in case of a stack of mounts each covering the root
> > > of parent mount, you stop in each of those.  The trouble is, umount(2)
> > > propagation logics assumes that intermediate mounts can be pulled out of
> > > such stack without causing trouble.  For pathname resolution that is
> > > true; it goes through the entire stack atomically wrt that stuff.
> > > For your API that's not the case; somebody who has no idea about an
> > > intermediate mount being there might get caught on it while it's getting
> > > pulled from the stack.
> > >
> > > What exactly do you need around the mountpoint crossing?
> >
> > I thought about skipping intermediate mounts (that are hidden by
> > other mounts). AFAICT, not skipping them will not cause any issue.
>
> It can.  Suppose e.g. that /mnt gets propagation from another namespace,
> but not the other way round and you mount something on /mnt.
>
> Later, in that another namespace, somebody mounts something on wherever
> your /mnt gets propagation to.  A copy will be propagated _between_
> your /mnt and whatever you've mounted on top of it; it will be entirely
> invisible until you umount your /mnt.  At that point the propagated
> copy will show up there, same as if it had appeared just after your
> umount.  Prior to that it's entirely invisible.  If its original
> counterpart in another namespace gets unmounted first, the copy will
> be quietly pulled out.

Thanks for sharing this information!

> Note that choose_mountpoint_rcu() callers (including choose_mountpoint())
> will have mount_lock seqcount sampled before the traversal _and_ recheck
> it after having reached the bottom of stack.  IOW, if you traverse ..
> on the way to root, you won't get caught on the sucker being pulled out.

In some of our internal discussions, we talked about using
choose_mountpoint() instead of follow_up(). I didn't go that direction in this
version because it requires holding "root". But if it makes more sense
to use, choose_mountpoint(), we sure can hold "root".

Alternatively, I think it is also OK to pass a zero'ed root to
choose_mountpoint().

> Your iterator, OTOH, would stop in that intermediate mount - and get
> an unpleasant surprise when it comes back to do the next step (towards
> /mnt on root filesystem, that is) and finds that path->mnt points
> to something that is detached from everything - no way to get from
> it any further.  That - despite the fact that location you've started
> from is still mounted, still has the same pathname, etc. and nothing
> had been disrupted for it.
>
> And yes, landlock has a narrow race in the matching place.  Needs to
> be fixed.  At least it does ignore those as far as any decisions are
> concerned...

If we update path_parent in this patchset with choose_mountpoint(),
and use it in Landlock, we will close this race condition, right?

>
> Note, BTW, that it might be better off by doing that similar to
> d_path.c - without arseloads of dget_parent/dput et.al.; not sure
> how feasible it is, but if everything in it can be done under
> rcu_read_lock(), that's something to look into.

I don't think we can do everything here inside rcu_read_lock().
But d_path.c does have some code we can probably reuse or
learn from. Also, we probably need two variations of iterators,
one walk until absolute root, while the other walk until root of
current->fs, just like d_path() vs. d_absolute_path(). Does this
sound reasonable?

Thanks,
Song