BPF LSM and fexit [was: [PATCH bpf-next v3 04/10] bpf: lsm: Add mutable hooks list for the BPF LSM]

Tue Feb 11 21:38:20 UTC 2020

On Tue, Feb 11, 2020 at 09:33:49PM +0100, Jann Horn wrote:
> >
> > Got it. Then let's whitelist them ?
> > All error injection points are marked with ALLOW_ERROR_INJECTION().
> > We can do something similar here, but let's do it via BTF and avoid
> > abusing yet another elf section for this mark.
> > I think BTF_TYPE_EMIT() should work. Just need to pick explicit enough
> > name and extensive comment about what is going on.
> 
> Sounds reasonable to me. :)

awesome :)

> > Locking rules and cleanup around security_blah() shouldn't change though.
> > Like security_task_alloc() should be paired with security_task_free().
> > And so on. With bpf_sk_storage like logic the alloc/free of scratch
> > space will be similar to the way socket and bpf progs deal with it.
> >
> > Some of the lsm hooks are in critical path. Like security_socket_sendmsg().
> > retpoline hurts. If we go with indirect calls right now it will be harder to
> > optimize later. It took us long time to come up with bpf trampoline and build
> > bpf dispatcher on top of it to remove single indirect call from XDP runtime.
> > For bpf+lsm would be good to avoid it from the start.
> 
> Just out of curiosity: Are fexit hooks really much cheaper than indirect calls?
> 
> AFAIK ftrace on x86-64 replaces the return pointer for fexit
> instrumentation (see prepare_ftrace_return()). So when the function
> returns, there is one return misprediction for branching into
> return_to_handler(), and then the processor's internal return stack
> will probably be misaligned so that after ftrace_return_to_handler()
> is done running, all the following returns will also be mispredicted.
> 
> So I would've thought that fexit hooks would have at least roughly the
> same impact as indirect calls - indirect calls via retpoline do one
> mispredicted branch, fexit hooks do at least two AFAICS. But I guess
> indirect calls could still be slower if fexit benefits from having all
> the mispredicted pointers stored on the cache-hot stack while the
> indirect branch target is too infrequently accessed to be in L1D, or
> something like that?

For fexit I've tried ftrace style of overwriting return address in the stack,
but it was slower than "leave; add rsp, 8; ret". So I went with that.
Looks like skipping one frame makes intel return stack prediction work better
than overwriting. I tested on broadwell only though. Later I realized that I
can do "jmp bpf_trampoline" instead of "call bpf_trampoline", so this extra
'add rsp, 8' can be removed and both direct jump and return stack predictors
will be happy. Will be posting this patch soon.

Old perf numbers of fexit vs original vs kprobe:
https://lore.kernel.org/bpf/20191122011515.255371-1-ast@kernel.org/