new seccomp mode aims to improve performance

Mon Jun 1 10:11:37 UTC 2020

On Fr, 29.05.20 12:27, Kees Cook (keescook at chromium.org) wrote:

> # grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
> Seccomp_filters:        32
>
> # grep SystemCall /lib/systemd/system/systemd-resolved.service
> SystemCallArchitectures=native
> SystemCallErrorNumber=EPERM
> SystemCallFilter=@system-service
>
> I'd like to better understand what they're doing, but haven't had time
> to dig in. (The systemd devel mailing list requires subscription, so
> I've directly CCed some systemd folks that have touched seccomp there
> recently. Hi! The starts of this thread is here[4].)

Hmm, so on x86-64 we try to install our seccomp filters three times:
for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32
syscall ABI. Not all of the filters we apply work on all ABIs though,
because syscalls are available on some but not others, or cannot
sensibly be matched on some (because of socketcall, ipc and such
multiplexed syscalls).

When we fist added support for seccomp filters to systemd we compiled
everything into a single filter, and let libseccomp apply it to
different archs. But that didn't work out, since libseccomp doesn't
tell use when it manages to apply a filter and when not, i.e. to which
arch it worked and to which arch it didn't. And since we have some
whitelist and some blacklist filters the internal fallback logic of
libsecccomp doesn't work for us either, since you never know what you
end up with. So we ended up breaking the different settings up into
individual filters, and apply them individually and separately for
each arch, so that we know exactly what we managed to install and what
not, and what we can then know will properly filter and can check in
our test suite.

Keeping the filters separate made things a lot easier and simpler to
debug, and our log output and testing became much less of a black
box. We know exactly what worked and what didn't, and our test
validate each filter.

For systemd-resolved we apply a bunch more filters than just those
that are result of SystemCallFilter= and SystemCallArchitectures=
(SystemCallFilter= itself synthesizes one filter per syscall ABI).

1. RestrictSUIDSGID= generates a seccomp filter to generated suid/sgid
   binaries, i.e. filters chmod() and related calls and their
   arguments

2. LockPersonality= blocks personality() for most arguments

3. MemoryDenyWriteExecute= blocks mmap() and similar calls if the
   selected map has X and W set at the same time

4. RestrictRealtime= blocks sched_setscheulder() for most parameters

5. RestrictAddressFamilies= blocks socket() and related calls for
   various address families

6. ProtectKernelLogs= blocks the syslog() syscall for most parameters

7. ProtectKernelTunables= blocks the old _sysctl() syscall among some
   other things

8. RestrictNamespaces= blocks various unshare() and clone() bits

So yeah, if one turns on many of these options in services (and we
generally turn on everything we can for the services we ship) and then
multiply that by the archs you end up with quite a bunch.

If we wanted to optimize that in userspace, then libseccomp would have
to be improved quite substantially to let us know exactly what works
and what doesn't, and to have sane fallback both when building
whitelists and blacklists.

An easy improvement is probably if libseccomp would now start refusing
to install x32 seccomp filters altogether now that x32 is entirely
dead? Or are the entrypoints for x32 syscalls still available in the
kernel? How could userspace figure out if they are available? If
libseccomp doesn't want to add code for that, we probably could have
that in systemd itself too...

Lennart

--
Lennart Poettering, Berlin