[PATCH] x86/bpf: use bpf_capable() instead of capable(CAP_SYS_ADMIN)

Tue Oct 21 12:32:06 UTC 2025

On Tue, Sep 2, 2025 at 7:37 PM Alexei Starovoitov
<alexei.starovoitov at gmail.com> wrote:
>
> On Wed, Aug 13, 2025 at 2:49 AM Ondrej Mosnacek <omosnace at redhat.com> wrote:
> >
> > On Sat, Aug 9, 2025 at 2:46 AM Alexei Starovoitov
> > <alexei.starovoitov at gmail.com> wrote:
> > >
> > > On Fri, Aug 8, 2025 at 4:59 PM Serge E. Hallyn <serge at hallyn.com> wrote:
> > > >
> > > > On Wed, Aug 06, 2025 at 08:18:55PM -0500, Serge E. Hallyn wrote:
> > > > > On Wed, Aug 06, 2025 at 04:31:05PM +0200, Ondrej Mosnacek wrote:
> > > > > > Don't check against the overloaded CAP_SYS_ADMINin do_jit(), but instead
> > > > > > use bpf_capable(), which checks against the more granular CAP_BPF first.
> > > > > > Going straight to CAP_SYS_ADMIN may cause unnecessary audit log spam
> > > > > > under SELinux, as privileged domains using BPF would usually only be
> > > > > > allowed CAP_BPF and not CAP_SYS_ADMIN.
> > > > > >
> > > > > > Link: https://bugzilla.redhat.com/show_bug.cgi?id=2369326
> > > > > > Fixes: d4e89d212d40 ("x86/bpf: Call branch history clearing sequence on exit")
> > > > > > Signed-off-by: Ondrej Mosnacek <omosnace at redhat.com>
> > > > >
> > > > > So this seems correct, *provided* that we consider it within the purview of
> > > > > CAP_BPF to be able to avoid clearing the branch history buffer.
> > >
> > > true, but...
> > >
> > > > >
> > > > > I suspect that's the case, but it might warrant discussion.
> > > > >
> > > > > Reviewed-by: Serge Hallyn <serge at hallyn.com>
> > > >
> > > > (BTW, I'm assuming this will get pulled into a BPF tree or something, and
> > > > doesn't need to go into the capabilities tree.  Let me know if that's wrong)
> > >
> > > Right.
> > > scripts/get_maintainer.pl arch/x86/net/bpf_jit_comp.c
> > > is your friend.
> > >
> > > Pls cc author-s of the commit in question in the future.
> > > Adding them now.
> > >
> > > > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > > > > index 15672cb926fc1..2a825e5745ca1 100644
> > > > > > --- a/arch/x86/net/bpf_jit_comp.c
> > > > > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > > > > @@ -2591,8 +2591,7 @@ emit_jmp:
> > > > > >                     seen_exit = true;
> > > > > >                     /* Update cleanup_addr */
> > > > > >                     ctx->cleanup_addr = proglen;
> > > > > > -                   if (bpf_prog_was_classic(bpf_prog) &&
> > > > > > -                       !capable(CAP_SYS_ADMIN)) {
> > > > > > +                   if (bpf_prog_was_classic(bpf_prog) && !bpf_capable()) {
> > >
> > > This looks wrong for several reasons.
> > >
> > > 1.
> > > bpf_capable() and CAP_BPF in general applies to eBPF only.
> > > There is no precedent so far to do anything differently
> > > for cBPF when CAP_BPF is present.
> >
> > That's not entirely true, see below.
> >
> > > 2.
> > > commit log states that
> > > "privileged domains using BPF would usually only be allowed CAP_BPF
> > > and not CAP_SYS_ADMIN"
> > > which is true for eBPF only, since cBPF is always allowed for
> > > all unpriv users.
> > > Start chrome browser and you get cBPF loaded.
> >
> > Processes using cBPF (via SO_ATTACH_FILTER) already can trigger a
> > CAP_BPF check - when the net.core.bpf_jit_harden sysctl is set to 1,
> > then the sequence sk_attach_filter() -> __get_filter() ->
> > bpf_prog_alloc() -> bpf_prog_alloc_no_stats() ->
> > bpf_jit_blinding_enabled() -> bpf_token_capable() happens for the same
> > iio-sensor-proxy syscall as the one that hits the CAP_SYS_ADMIN check.
>
> Agree that it's a similar case and we should standardize on
> the way to check whether mitigation is necessary.
> But I think in both cases it should be "_noaudit" version.
> In other words, everywhere where jit, verifier or anything else
> is checking caps to apply a mitigation or not we should use "_noaudit".
>
> > Because of this we have already granted the BPF capability in
> > Fedora/RHEL SELinux policy to many domains that would usually run as
> > root and that use SO_ATTACH_FILTER. The logic being that they are
> > legitimately using BPF + without SELinux they would be fully
> > privileged (root) and they would pass that check + it seemed they
> > could otherwise lose some performance due to the hardening (though I'm
> > not sure now if it applies to cBPF, so this point could be moot) +
> > CAP_BPF doesn't grant any excess privileges beyond this (as opposed to
> > e.g. CAP_SYS_ADMIN). This is what I meant behind that commit log
> > statement, though I didn't remember the details, so I didn't state it
> > as clearly as I could have (my apologies).
> >
> > Now this same usage started triggering the new plain CAP_SYS_ADMIN
> > check so I naturally assumed that changing it to bpf_capable() would
> > be the most logical solution (as it would let us keep the services
> > excluded from the hardening via CAP_BPF without granting the broad
> > CAP_SYS_ADMIN).
>
> I see. Sounds like you identified bpf_jit_blinding_enabled() case
> and special cased it selinux. So doing bpf_capable() in JIT
> would fit the existing selinux handling.
>
> >
> > Is the fact that CAP_BPF check is reachable via cBPF use unexpected
> > behavior? If both cBPF and eBPF can be JIT'd and CAP_BPF is already
> > being used for the "exempt from JIT hardening" semantics in one place,
> > why should cBPF and eBPF be treated differently? In fact, shouldn't
> > the decision to apply the Spectre mitigation also take into account
> > the net.core.bpf_jit_harden sysctl even when the program is not cBPF?
> >
> > > 3.
> > > glancing over bugzilla it seems that the issue is
> > > excessive audit spam and not related to CAP_BPF and privileges.
> > > If so then the fix is to use
> > > ns_capable_noaudit(&init_user_ns, CAP_SYS_ADMIN)
> > >
> > > 4.
> > > I don't understand how the patch is supposed to fix the issue.
> > > iio-sensor-proxy is probably unpriv. Why would it use CAP_BPF?
> > > It's using cBPF, so there is no reason for it to have CAP_BPF.
> > > So capable(CAP_BPF) will fail just like capable(CAP_SYS_ADMIN),
> > > but since CAP_BPF check was done first, the audit won't
> > > be printed, because it's some undocumented internal selinux behavior ?
> > > None of it is in the commit log :(
> >
> > It is not unprivileged. It runs as root and without SELinux it would
> > have all capabilities allowed. If it were running without any
> > capabilities, then indeed there would be no SELinux checks.
> >
> > > 5.
> > > And finally all that looks like a selinux bug.
> > > Just because something in the kernel is asking capable(CAP_SYS_ADMIN)
> > > there is no need to spam users with the wrong message:
> > > "SELinux is preventing iio-sensor-prox from using the 'sys_admin' capabilities."
> > > iio-sensor-prox is not trying to use 'sys_admin' capabilities.
> > > cBPF prog will be loaded anyway, with or without BHB clearing.
> >
> > Well, it depends... In this case the AVC denial informs us that the
> > kernel is making some decision depending on the capability and that a
> > decision should be made in the policy to allow or silence the access
> > vector. Even when the consequence is not a failure of the syscall, it
> > still may be useful to have the denial reported, since there is a
> > potential performance impact. OTOH, with CAP_SYS_ADMIN if the decision
> > is to not allow it, then silencing it via a dontaudit rule would
> > potentially hide other more critical CAP_SYS_ADMIN denials, so it's
> > hard to decide what is better - to silence this specific case in the
> > kernel vs. to let the user allow/silence the specific AV in the
> > policy...
>
> imo we should apply a general rule for using "_noaudit"
> for all checks that don't end up as a denial.
> There are various bpf_allow_ptr_leaks(), bpf_bypass_spec_v[14]()
> checks in the verifier that should be converted to "_noaudit".
> It's true that the check affects performance and users could be
> interested in the end result, but if they enable mitigations they
> expect the performance hit across the board, so skipping a mitigation
> for a privileged process is a bonus. A prog will be tiny bit faster.
> So not worth flagging anywhere.

Ok, so I have sent a v2 that switches to ns_capable_noaudit() instead:
https://lore.kernel.org/selinux/20251021122758.2659513-1-omosnace@redhat.com/

--
Ondrej Mosnacek
Senior Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.