[PATCH v5 bpf-next 0/5] bpf path iterator

Wed Jul 9 16:06:38 UTC 2025

On Mon, Jul 07, 2025 at 06:50:12PM +0000, Song Liu wrote:
> Hi Christian, 
> 
> Thanks for your comments! 
> 
> > On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner at kernel.org> wrote:
> 
> [...]
> 
> >>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
> >> 
> >> I think that's fine.
> > 
> > Ok, sorry for the delay but there's a lot of different things going on
> > right now and this one isn't exactly an easy thing to solve.
> > 
> > I mentioned this before and so did Neil: the lookup implementation
> > supports two modes sleeping and non-sleeping. That api is abstracted
> > away as heavily as possible by the VFS so that non-core code will not be
> > exposed to it other than in exceptional circumstances and doesn't have
> > to care about it.
> > 
> > It is a conceptual dead-end to expose these two modes via separate APIs
> > and leak this implementation detail into non-core code. It will not
> > happen as far as I'm concerned.
> > 
> > I very much understand the urge to get the refcount step-by-step thing
> > merged asap. Everyone wants their APIs merged fast. And if it's
> > reasonable to move fast we will (see the kernfs xattr thing).
> > 
> > But here are two use-cases that ask for the same thing with different
> > constraints that closely mirror our unified approach. Merging one
> > quickly just to have something and then later bolting the other one on
> > top, augmenting, or replacing, possible having to deprecate the old API
> > is just objectively nuts. That's how we end up with a spaghetthi helper
> > collection. We want as little helper fragmentation as possible.
> > 
> > We need a unified API that serves both use-cases. I dislike
> > callback-based APIs generally but we have precedent in the VFS for this
> > for cases where the internal state handling is delicate enough that it
> > should not be exposed (see __iterate_supers() which does exactly work
> > like Neil suggested down to the flag argument itself I added).
> > 
> > So I'm open to the callback solution.
> > 
> > (Note for really absurd perf requirements you could even make it work
> > with static calls I'm pretty sure.)
> 
> I guess we will go with Mickaël’s idea:
> 
> > int vfs_walk_ancestors(struct path *path,
> >                       bool (*walk_cb)(const struct path *ancestor, void *data),

> >                       void *data, int flags)
> > 
> > The walk continue while walk_cb() returns true.  walk_cb() can then
> > check if @ancestor is equal to a @root, or other properties.  The
> > walk_cb() return value (if not bool) should not be returned by
> > vfs_walk_ancestors() because a walk stop doesn't mean an error.
> 
> If necessary, we hide “root" inside @data. This is good. 
> 
> > @path would be updated with latest ancestor path (e.g. @root).
> 
> Update @path to the last ancestor and hold proper references. 
> I missed this part earlier. With this feature, vfs_walk_ancestors 
> should work usable with open-codeed bpf path iterator. 
> 
> I have a question about this behavior with RCU walk. IIUC, RCU 
> walk does not hold reference to @ancestor when calling walk_cb().

I think a reference to the mount should be held, but not necessarily to
the dentry if we are still in the same mount as the original path.

> If walk_cb() returns false, shall vfs_walk_ancestors() then
> grab a reference on @ancestor? This feels a bit weird to me. 

If walk_cb() checks for a root, it will return false when the path will
match, and the caller would expect to get this root path, right?

In general, it's safer to always have the same behavior when holding or
releasing a reference.  I think the caller should then always call
path_put() after vfs_walk_ancestors() whatever the return code is.

> Maybe “updating @path to the last ancestor” should only apply to
> LOOKUP_RCU==false case? 
> 
> > @flags could contain LOOKUP_RCU or not, which enables us to have
> > walk_cb() not-RCU compatible.
> > 
> > When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> > failed with -ECHILD, the caller can restart the walk by calling
> > vfs_walk_ancestors() again but without LOOKUP_RCU.
> 
> 
> Given we want callers to handle -ECHILD and call vfs_walk_ancestors
> again without LOOKUP_RCU, I think we should keep @path not changed
> With LOOKUP_RCU==true, and only update it to the last ancestor 
> when LOOKUP_RCU==false. 

As Neil said, we don't want to explicitly pass LOOKUP_RCU as a public
flag.  Instead, walk_cb() should never sleep (and then potentially be
called under RCU by the vfs_walk_ancestors() implementation).

> 
> With this behavior, landlock code will be like:
> 
> 
> /* Assume we hold reference on “path”. 
>  * With LOOKUP_RCU, path will not change, we don’t need 
>  * extra reference on “path”.
>  */
> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
> /* 
>  * At this point, whether err is 0 or not, path is not 
>  * changed.
>  */
> 
> if (err == -ECHILD) {
> 	struct path walk_path = *path;
> 
> 	/* reset any data changed by the walk */
> 	reset_data(data);
> 
> 	/* get a reference on walk_path. */
> 	path_get(&walk_path);
> 
> 	err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
> 	/* Now, walk_path might be updated */
> 
> 	/* Always release reference on walk_path */
> 	path_put(&walk_path);
> }
> 
> 
> BPF path iterator sode will look like:
> 
> static bool bpf_cb(const struct path *ancestor, void *data)
> {
> 	return false;
> }

Instead of this callback, we could just always return if walk_cb is
NULL.

> 
> struct path *bpf_iter_path_next(struct bpf_iter_path *it)
> {
> 	struct bpf_iter_path_kern *kit = (void *)it;
> 
> 	if (vfs_walk_ancestors(&kit->path, bpf_cb, NULL))
> 		return NULL;
> 	return &kit->path;
> }
> 
> 
> Does this sound reasonable to every body?
> 
> Thanks,
> Song
>