[RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring

Wed May 26 18:38:35 UTC 2021

On Wed, May 26, 2021 at 1:54 PM Jens Axboe <axboe at kernel.dk> wrote:
> On 5/26/21 11:31 AM, Jens Axboe wrote:
> > On 5/26/21 11:15 AM, Jens Axboe wrote:
> >> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe at kernel.dk> wrote:
> >>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>> That said, audit is not for everyone, and we have build time and
> >>>>> runtime options to help make life easier.  Beyond simply disabling
> >>>>> audit at compile time a number of Linux distributions effectively
> >>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>> filter, for example:
> >>>>>
> >>>>>  % auditctl -a task,never
> >>>>
> >>>> As has been brought up, the issue we're facing is that distros have
> >>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>> of people doing custom kernels. My question would then be how much
> >>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>
> >>> I commented on that case in my last email to Pavel, but I'll try to go
> >>> over it again in a little more detail.
> >>>
> >>> As we discussed earlier in this thread, we can skip the req->opcode
> >>> check before both the _entry and _exit calls, so we are left with just
> >>> the bare audit calls in the io_uring code.  As the _entry and _exit
> >>> functions are small, I've copied them and their supporting functions
> >>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>> "task,never" case.
> >>>
> >>> +  static inline struct audit_context *audit_context(void)
> >>> +  {
> >>> +    return current->audit_context;
> >>> +  }
> >>>
> >>> +  static inline bool audit_dummy_context(void)
> >>> +  {
> >>> +    void *p = audit_context();
> >>> +    return !p || *(int *)p;
> >>> +  }
> >>>
> >>> +  static inline void audit_uring_entry(u8 op)
> >>> +  {
> >>> +    if (unlikely(audit_enabled && audit_context()))
> >>> +      __audit_uring_entry(op);
> >>> +  }
> >>>
> >>> We have one if statement where the conditional checks on two
> >>> individual conditions.  The first (audit_enabled) is simply a check to
> >>> see if anyone has "turned on" auditing at runtime; historically this
> >>> worked rather well, and still does in a number of places, but ever
> >>> since systemd has taken to forcing audit on regardless of the admin's
> >>> audit configuration it is less useful.  The second (audit_context())
> >>> is a check to see if an audit_context has been allocated for the
> >>> current task.  In the case of "task,never" current->audit_context will
> >>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>> will never be called.
> >>>
> >>> Worst case here is checking the value of audit_enabled and
> >>> current->audit_context.  Depending on which you think is more likely
> >>> we can change the order of the check so that the
> >>> current->audit_context check is first if you feel that is more likely
> >>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>
> >>> +  static inline void audit_uring_exit(int success, long code)
> >>> +  {
> >>> +    if (unlikely(!audit_dummy_context()))
> >>> +      __audit_uring_exit(success, code);
> >>> +  }
> >>>
> >>> The exit call is very similar to the entry call, but in the
> >>> "task,never" case it is very simple as the first check to be performed
> >>> is the current->audit_context check which we know to be NULL.  The
> >>> __audit_uring_exit() slowpath will never be called.
> >>
> >> I actually ran some numbers this morning. The test base is 5.13+, and
> >> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >> test and the test with this series applied. I used your git branch as of
> >> this morning.
> >>
> >> The test case is my usual peak perf test, which is random reads at
> >> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >> two different tests - one was having a thread just do the IO, the other
> >> is using SQPOLL to do the IO for us. The device is capable than more
> >> IOPS than a single core can deliver, so we're CPU limited in this test.
> >> Hence it's a good test case as it does actual work, and shows software
> >> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >> between runs on the same base), yet I did average 4 runs.
> >>
> >> Kernel               SQPOLL          IOPS            Perf diff
> >> ---------------------------------------------------------
> >> 5.13         0               3029872         0.0%
> >> 5.13         1               3031056         0.0%
> >> 5.13 + audit 0               2894160         -4.5%
> >> 5.13 + audit 1               2886168         -4.8%
> >>
> >> That's an immediate drop in perf of almost 5%. Looking at a quick
> >> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >> shows this:
> >>
> >> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>
> >> Note that this is with _no_ rules!
> >
> > io_uring also supports a NOP command, which basically just measures
> > reqs/sec through the interface. Ran that as well:
> >
> > Kernel                SQPOLL          IOPS            Perf diff
> > ---------------------------------------------------------
> > 5.13          0               31.05M          0.0%
> > 5.13 + audit  0               25.31M          -18.5%
> >
> > and profile for the latter includes:
> >
> > +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> > +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>
> As Pavel correctly pointed it, looks like auditing is enabled. And
> indeed it was! Hence the above numbers is without having turned off
> auditing. Running the NOPs after having turned off audit, we get 30.6M
> IOPS, which is down about 1.5% from the baseline. The results for the
> polled random read test above did _not_ change from this, they are still
> down the same amount.
>
> Note, and I should have included this in the first email, this is not
> any kind of argument for or against audit logging. It's purely meant to
> be a set of numbers that show how the current series impacts
> performance.

Thanks for running some tests Jens, unfortunately the git tree hadn't
been updated to reflect the improved audit_uring_entry() and
audit_uring_exit() functions as we were still discussing things and I
thought there still might be some changes.  I just now updated the
branch with the latest code so if you have the cycles (ha!) to run
another perf test I think those numbers would be more interesting.
For example, I don't believe you should see __audit_uring_entry() or
__audit_uring_exit() at all if you have the audit "task,never" rule
loaded; you will see audit_uring_entry() and audit_uring_exit() but as
we discussed previously those should just be a single
"current->audit_context != NULL" check.

Also, I want to pull back a bit as I think there is confusion about
how audit works and why these changes are necessary.  As everyone
likely knows, there are audit calls sprinkled throughout the kernel
code, e.g. audit_log_format() and similar.  In addition to these calls
that most are aware of, there are setup/teardown audit calls that run
at task creation and destruction as well as at syscall entry and exit.
It is these lesser known calls that are responsible for the filtering,
setup, and general management of the audit context state in the
system; they also handle generation of some audit records such as
SYSCALL, PATH, etc. based on information that is recorded by audit
calls inserted into other places in the kernel.  The
audit_alloc_kernel() and audit_free() calls we are adding to the
io_uring/io-wq code are intended to do similar things to the existing
audit task creation/destruction hooks, but for the io_uring/io-wq
kernel threads.  Similarly the audit_uring_entry() and
audit_uring_exit() calls are intended to act as replacements for the
syscall entry and exit code.  If we want the existing audit calls in
the kernel to work properly, we need these setup/teardown functions.

Hopefully that makes things a bit more clear regarding these calls in
the io_uring/io-wq code.

Another point I wanted to address is the "double audit" (!!!) that has
been mentioned recently in this thread.  Don't get too excited, this
isn't quite what you think it is, it is a side effect of how io_uring
can dispatch certain operations.  As I think the io_uring folks
already know, io_uring can process I/O ops in several different
contexts; one of which is after a syscall completes but before the
kernel returns to userspace.  In this particular case things can get
rather interesting from an audit perspective as we need to handle both
the syscall audit records *and* the io_uring operation records.  If
you look closer at the audit code in this patchset you'll see some of
the fun stuff we need to do to make sure this case is handled
correctly.  If you happen to see a code path that takes you through
audit_syscall_entry() + <syscall_stuff> + audit_uring_entry() +
<io_uring_stuff> + audit_uring_exit() + audit_syscall_exit() rest
assured that is correct :)

Of course there is the normal audit_uring_entry() and
audit_uring_exit() without the audit syscall hooks in other code
paths, e.g. SQPOLL, but those are less dramatic than the "OMG, double
audit!!!" ;)

-- 
paul moore
www.paul-moore.com