[PATCH v5 bpf-next 0/5] bpf path iterator

Mon Jul 7 18:50:12 UTC 2025

Hi Christian, 

Thanks for your comments! 

> On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner at kernel.org> wrote:

[...]

>>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
>> 
>> I think that's fine.
> 
> Ok, sorry for the delay but there's a lot of different things going on
> right now and this one isn't exactly an easy thing to solve.
> 
> I mentioned this before and so did Neil: the lookup implementation
> supports two modes sleeping and non-sleeping. That api is abstracted
> away as heavily as possible by the VFS so that non-core code will not be
> exposed to it other than in exceptional circumstances and doesn't have
> to care about it.
> 
> It is a conceptual dead-end to expose these two modes via separate APIs
> and leak this implementation detail into non-core code. It will not
> happen as far as I'm concerned.
> 
> I very much understand the urge to get the refcount step-by-step thing
> merged asap. Everyone wants their APIs merged fast. And if it's
> reasonable to move fast we will (see the kernfs xattr thing).
> 
> But here are two use-cases that ask for the same thing with different
> constraints that closely mirror our unified approach. Merging one
> quickly just to have something and then later bolting the other one on
> top, augmenting, or replacing, possible having to deprecate the old API
> is just objectively nuts. That's how we end up with a spaghetthi helper
> collection. We want as little helper fragmentation as possible.
> 
> We need a unified API that serves both use-cases. I dislike
> callback-based APIs generally but we have precedent in the VFS for this
> for cases where the internal state handling is delicate enough that it
> should not be exposed (see __iterate_supers() which does exactly work
> like Neil suggested down to the flag argument itself I added).
> 
> So I'm open to the callback solution.
> 
> (Note for really absurd perf requirements you could even make it work
> with static calls I'm pretty sure.)

I guess we will go with Mickaël’s idea:

> int vfs_walk_ancestors(struct path *path,
>                       bool (*walk_cb)(const struct path *ancestor, void *data),
>                       void *data, int flags)
> 
> The walk continue while walk_cb() returns true.  walk_cb() can then
> check if @ancestor is equal to a @root, or other properties.  The
> walk_cb() return value (if not bool) should not be returned by
> vfs_walk_ancestors() because a walk stop doesn't mean an error.

If necessary, we hide “root" inside @data. This is good. 

> @path would be updated with latest ancestor path (e.g. @root).

Update @path to the last ancestor and hold proper references. 
I missed this part earlier. With this feature, vfs_walk_ancestors 
should work usable with open-codeed bpf path iterator. 

I have a question about this behavior with RCU walk. IIUC, RCU 
walk does not hold reference to @ancestor when calling walk_cb().
If walk_cb() returns false, shall vfs_walk_ancestors() then
grab a reference on @ancestor? This feels a bit weird to me. 
Maybe “updating @path to the last ancestor” should only apply to
LOOKUP_RCU==false case? 

> @flags could contain LOOKUP_RCU or not, which enables us to have
> walk_cb() not-RCU compatible.
> 
> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> failed with -ECHILD, the caller can restart the walk by calling
> vfs_walk_ancestors() again but without LOOKUP_RCU.

Given we want callers to handle -ECHILD and call vfs_walk_ancestors
again without LOOKUP_RCU, I think we should keep @path not changed
With LOOKUP_RCU==true, and only update it to the last ancestor 
when LOOKUP_RCU==false. 

With this behavior, landlock code will be like:

/* Assume we hold reference on “path”. 
 * With LOOKUP_RCU, path will not change, we don’t need 
 * extra reference on “path”.
 */
err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
/* 
 * At this point, whether err is 0 or not, path is not 
 * changed.
 */

if (err == -ECHILD) {
	struct path walk_path = *path;

	/* reset any data changed by the walk */
	reset_data(data);

	/* get a reference on walk_path. */
	path_get(&walk_path);

	err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
	/* Now, walk_path might be updated */

	/* Always release reference on walk_path */
	path_put(&walk_path);
}

BPF path iterator sode will look like:

static bool bpf_cb(const struct path *ancestor, void *data)
{
	return false;
}

struct path *bpf_iter_path_next(struct bpf_iter_path *it)
{
	struct bpf_iter_path_kern *kit = (void *)it;

	if (vfs_walk_ancestors(&kit->path, bpf_cb, NULL))
		return NULL;
	return &kit->path;
}

Does this sound reasonable to every body?

Thanks,
Song