new seccomp mode aims to improve performance

Paul Moore paul at paul-moore.com
Mon Jun 1 12:32:41 UTC 2020


On Mon, Jun 1, 2020 at 6:17 AM Lennart Poettering
<lennart at poettering.net> wrote:
> On Fr, 29.05.20 12:27, Kees Cook (keescook at chromium.org) wrote:
> > # grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
> > Seccomp_filters:        32
> >
> > # grep SystemCall /lib/systemd/system/systemd-resolved.service
> > SystemCallArchitectures=native
> > SystemCallErrorNumber=EPERM
> > SystemCallFilter=@system-service
> >
> > I'd like to better understand what they're doing, but haven't had time
> > to dig in. (The systemd devel mailing list requires subscription, so
> > I've directly CCed some systemd folks that have touched seccomp there
> > recently. Hi! The start of this thread is here[4].)
>
> Hmm, so on x86-64 we try to install our seccomp filters three times:
> for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32
> syscall ABI. Not all of the filters we apply work on all ABIs though,
> because syscalls are available on some but not others, or cannot
> sensibly be matched on some (because of socketcall, ipc and such
> multiplexed syscalls).
>
> When we first added support for seccomp filters to systemd we compiled
> everything into a single filter, and let libseccomp apply it to
> different archs. But that didn't work out, since libseccomp doesn't
> tell us when it manages to apply a filter and when not, i.e. to which
> arch it worked and to which arch it didn't. And since we have some
> whitelist and some blacklist filters the internal fallback logic of
> libseccomp doesn't work for us either, since you never know what you
> end up with. So we ended up breaking the different settings up into
> individual filters, and apply them individually and separately for
> each arch, so that we know exactly what we managed to install and what
> not, and what we can then know will properly filter and can check in
> our test suite.
>
> Keeping the filters separate made things a lot easier and simpler to
> debug, and our log output and testing became much less of a black
> box. We know exactly what worked and what didn't, and our tests
> validate each filter.

In situations where the calling application creates multiple per-ABI
filters, the seccomp_merge(3) function can be used to merge the
filters into one.  There are some limitations (same byte ordering,
filter attributes, etc.) but in general it should work without problem
when merging x86_64, x32, and x86.

For what it is worth, libseccomp does handle things like the
multiplexed socket syscalls[*] across multiple ABIs, just not quite in
the way Lennart and systemd wanted.  It is also possible, although I
would be a bit surprised, that some of systemd's concerns have
been resolved in modern libseccomp.  For better or worse, systemd was
one of the first adopters of libseccomp and they had to deal with more
than a few bumps as the library was developed.

[*] Handling the multiplexed syscalls is tricky, especially when one
combines multiple ABIs and the presence of both the multiplexed and
direct-wired syscalls on some kernel versions.  Recent libseccomp
versions do handle all these cases; creating multiplexed filters,
direct-wired filters, or both depending on the particular ABI.  The
problem comes when you try to wrap all of that up in a single library
API that works regardless of the ABI and kernel version across
different build and runtime environments.  This is why we don't
support the "exact" variants of the libseccomp API on filters which
contain multiple ABIs; we simply can't guarantee that we will always
be able to filter on the third argument of socket() in a filter that
consists of the x86_64 and x86 ABIs.  The non-exact API variants
create the rules as best they can in this case, creating three rules
in the filter: an x86_64 rule which filters on the third argument of
socket(), an x86 rule which filters on the third argument of the
direct-wired socket(), and an x86 rule which filters on the multiplexed
socketcall(socket) syscall (impossible to filter on the syscall
argument here).
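
The three-rule expansion described above can be sketched as a toy model
in C. This is not libseccomp code and says nothing about its internals;
the enum, the struct, and both function names are hypothetical. It only
encodes the behaviour stated in this footnote for a request to filter
socket() on its third argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is a model, not the library. */
enum abi { ABI_X86_64, ABI_X86 };

struct rule {
    enum abi abi;
    const char *syscall;  /* what the rule matches */
    bool filters_arg2;    /* can the third argument (index 2) be checked? */
};

/* Models the "exact" variants as described above: refused whenever the
 * filter holds more than one ABI.  Single-ABI behaviour is not modelled. */
bool exact_variant_refused(size_t n_abis)
{
    return n_abis > 1;
}

/* Models the non-exact variant: expand the request as best it can for
 * each ABI.  Returns the number of rules written, or -1 if out is full. */
int expand_socket_rule(const enum abi *abis, size_t n_abis,
                       struct rule *out, size_t out_cap)
{
    size_t n = 0;

    assert(abis != NULL && out != NULL);
    for (size_t i = 0; i < n_abis; i++) {
        /* Direct-wired socket(): the argument is visible to the filter. */
        if (n >= out_cap)
            return -1;
        out[n++] = (struct rule){ abis[i], "socket", true };

        /* x86 also has the multiplexed entry point, where the argument
         * cannot be filtered. */
        if (abis[i] == ABI_X86) {
            if (n >= out_cap)
                return -1;
            out[n++] = (struct rule){ ABI_X86, "socketcall(socket)", false };
        }
    }
    return (int)n;
}
```

Calling expand_socket_rule() with { ABI_X86_64, ABI_X86 } yields three
rules in this model, the last of which cannot check the argument, which
is exactly the case an "exact" request could not honour.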

> For systemd-resolved we apply a bunch more filters than just those
> that are the result of SystemCallFilter= and SystemCallArchitectures=
> (SystemCallFilter= itself synthesizes one filter per syscall ABI).

...

> So yeah, if one turns on many of these options in services (and we
> generally turn on everything we can for the services we ship) and then
> multiply that by the archs you end up with quite a bunch.

I'm not sure how systemd is architected with respect to seccomp
filtering, but once again it would seem like seccomp_merge() could be
useful here.

> If we wanted to optimize that in userspace, then libseccomp would have
> to be improved quite substantially to let us know exactly what works
> and what doesn't, and to have sane fallback both when building
> whitelists and blacklists.

It has been quite a while since we last talked about systemd's use of
libseccomp, but the upcoming v2.5.0 release (no date set yet, but
think weeks not months) finally takes a first step towards defining
proper return values on error for the API, no more "negative values on
error".  I'm sure there are other things, but I recall this as being
one of the bigger systemd wants.

As an aside, it is always going to be difficult to allow fine-grained
control when you have a single libseccomp filter that includes
multiple ABIs; the different ABI oddities are just too great (see
comments above).  If you need exacting control of the filter, or ABI
specific handling, then the recommended way is to create those filters
independently and merge them together before loading them into the
kernel or applying any common rules.

> An easy improvement is probably if libseccomp would now start refusing
> to install x32 seccomp filters altogether now that x32 is entirely
> dead? Or are the entrypoints for x32 syscalls still available in the
> kernel? How could userspace figure out if they are available? If
> libseccomp doesn't want to add code for that, we probably could have
> that in systemd itself too...

You can eliminate x32 syscalls today using libseccomp through either
the "BADARCH" filter attribute or through an x32-specific filter that
defaults to KILL/ERRNO/etc. and has no rules (of course you could
merge this x32 filter with your x86_64 filter).
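
As a rough illustration of the second option, here is a toy C model of
a merged filter whose x32 part has no rules. It does not use libseccomp
or install anything; the enum, the function names, and the *_VAL macro
names are made up for this sketch. It relies on two kernel facts: x32
syscalls report the same audit arch as x86_64 (AUDIT_ARCH_X86_64), and
they are told apart by __X32_SYSCALL_BIT in the syscall number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Numeric values of AUDIT_ARCH_X86_64 and __X32_SYSCALL_BIT, copied
 * here so the sketch needs no kernel headers. */
#define AUDIT_ARCH_X86_64_VAL 0xC000003EU
#define X32_SYSCALL_BIT_VAL   0x40000000U

/* Hypothetical verdicts for the model. */
enum verdict { V_ALLOW, V_ERRNO, V_KILL };

/* x32 shares the x86_64 audit arch; only the syscall number bit differs. */
bool is_x32_call(uint32_t arch, uint32_t nr)
{
    return arch == AUDIT_ARCH_X86_64_VAL && (nr & X32_SYSCALL_BIT_VAL) != 0;
}

/* Toy model of an x86_64 filter merged with a rule-less x32 filter:
 *  - any other arch gets the bad-arch action,
 *  - every x32 call falls through to the x32 default action,
 *  - x86_64 calls get whatever the x86_64 rules decided (passed in). */
enum verdict merged_verdict(uint32_t arch, uint32_t nr,
                            enum verdict x86_64_verdict,
                            enum verdict x32_default,
                            enum verdict badarch)
{
    if (arch != AUDIT_ARCH_X86_64_VAL)
        return badarch;
    if (is_x32_call(arch, nr))
        return x32_default;
    return x86_64_verdict;
}
```

With x32_default set to V_ERRNO, a native x86_64 call keeps its normal
verdict in this model while the same syscall number with the x32 bit
set is always rejected.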

While I don't see us removing the ability to create x32 filters from
libseccomp any time soon (need to support older kernels), I can say
that I would be very happy to see x32 removed from systems.
Regardless of what one may think of the wisdom in creating this ABI, I
think we can agree the implementation was a bit of a hack.

-- 
paul moore
www.paul-moore.com


