[PATCH v5 bpf-next 0/5] bpf path iterator

Christian Brauner brauner at kernel.org
Mon Jul 7 11:17:09 UTC 2025


On Mon, Jul 07, 2025 at 12:46:41PM +0200, Christian Brauner wrote:
> On Fri, Jun 27, 2025 at 08:51:21AM +1000, NeilBrown wrote:
> > On Fri, 27 Jun 2025, Song Liu wrote:
> > > 
> > > 
> > > > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil at brown.name> wrote:
> > > 
> > > [...]
> > > 
> > > >> I guess I misunderstood the proposal of vfs_walk_ancestors() 
> > > >> initially, so some clarification:
> > > >> 
> > > >> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
> > > >> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
> > > >> A reliable step-by-step ref-walk, like this set, works well with 
> > > >> BPF, and we want to keep it.
> > > > 
> > > > The distinction between rcu-walk and ref-walk is an internal
> > > > implementation detail.  You as a caller shouldn't need to think about
> > > > the difference.  You just want to walk.  Note that LOOKUP_RCU is
> > > > documented in namei.h as "semi-internal".  The only uses outside of
> > > > core-VFS code is in individual filesystem's d_revalidate handler - they
> > > > are checking if they are allowed to sleep or not.  You should never
> > > > expect to pass LOOKUP_RCU to an VFS API - no other code does.
> > > > 
> > > > It might be reasonable for you as a caller to have some control over
> > > > whether the call can sleep or not.  LOOKUP_CACHED is a bit like that.
> > > > But for dotdot lookup the code will never sleep - so that is not
> > > > relevant.
> > > 
> > > Unfortunately, the BPF use case is more complicated. In some cases, 
> > > the callback function cannot be call in rcu critical sections. For 
> > > example, the callback may need to read xatter. For these cases, we
> > > we cannot use RCU walk at all. 
> > 
> > I really think you should stop using the terms RCU walk and ref-walk.  I
> > think they might be focusing your thinking in an unhelpful direction.
> 
> Thank you! I really appreciate you helping to shape this API and it
> aligns a lot with my thinking.
> 
> > The key issue about reading xattrs is that it might need to sleep.
> > Focusing on what might need to sleep and what will never need to sleep
> > is a useful approach - the distinction is wide spread in the kernel and
> > several function take a flag indicating if they are permitted to sleep,
> > or if failure when sleeping would be required.
> > 
> > So your above observation is better described as 
> > 
> >    The vfs_walk_ancestors() API has an (implicit) requirement that the
> >    callback mustn't sleep.  This is a problem for some use-cases
> >    where the call back might need to sleep - e.g. for accessing xattrs.
> > 
> > That is a good and useful observation.  I can see three possibly
> > responses:
> > 
> > 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is
> >    always allowed to sleep.  I don't particularly like this approach.
> 
> Agreed.
> 
> > 
> > 2/ Use repeated calls to vfs_walk_parent() when the handling of each
> >    ancestor might need to sleep.  I see no problem with supporting both
> >    vfs_walk_ancestors() and vfs_walk_parent().  There is plenty of
> >    precedent for having different  interfaces for different use cases.
> 
> Meh.
> 
> > 
> > 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
> 
> I think that's fine.

Ok, sorry for the delay but there's a lot of different things going on
right now and this one isn't exactly an easy thing to solve.

I mentioned this before and so did Neil: the lookup implementation
supports two modes sleeping and non-sleeping. That api is abstracted
away as heavily as possible by the VFS so that non-core code will not be
exposed to it other than in exceptional circumstances and doesn't have
to care about it.

It is a conceptual dead-end to expose these two modes via separate APIs
and leak this implementation detail into non-core code. It will not
happen as far as I'm concerned.

I very much understand the urge to get the refcount step-by-step thing
merged asap. Everyone wants their APIs merged fast. And if it's
reasonable to move fast we will (see the kernfs xattr thing).

But here are two use-cases that ask for the same thing with different
constraints that closely mirror our unified approach. Merging one
quickly just to have something and then later bolting the other one on
top, augmenting, or replacing, possible having to deprecate the old API
is just objectively nuts. That's how we end up with a spaghetthi helper
collection. We want as little helper fragmentation as possible.

We need a unified API that serves both use-cases. I dislike
callback-based APIs generally but we have precedent in the VFS for this
for cases where the internal state handling is delicate enough that it
should not be exposed (see __iterate_supers() which does exactly work
like Neil suggested down to the flag argument itself I added).

So I'm open to the callback solution.

(Note for really absurd perf requirements you could even make it work
with static calls I'm pretty sure.)



More information about the Linux-security-module-archive mailing list