[RFC PATCH v19 1/5] exec: Add a new AT_CHECK flag to execveat(2)

Thu Jul 18 14:46:54 UTC 2024

On Wed, Jul 17, 2024 at 10:08 PM Jeff Xu <jeffxu at google.com> wrote:
>
> On Wed, Jul 17, 2024 at 3:01 AM Mickaël Salaün <mic at digikod.net> wrote:
> >
> > On Tue, Jul 16, 2024 at 11:33:55PM -0700, Jeff Xu wrote:
> > > On Thu, Jul 4, 2024 at 12:02 PM Mickaël Salaün <mic at digikod.net> wrote:
> > > >
> > > > Add a new AT_CHECK flag to execveat(2) to check if a file would be
> > > > allowed for execution.  The main use case is for script interpreters and
> > > > dynamic linkers to check execution permission according to the kernel's
> > > > security policy. Another use case is to add context to access logs e.g.,
> > > > which script (instead of interpreter) accessed a file.  As any
> > > > executable code, scripts could also use this check [1].
> > > >
> > > > This is different than faccessat(2) which only checks file access
> > > > rights, but not the full context e.g. mount point's noexec, stack limit,
> > > > and all potential LSM extra checks (e.g. argv, envp, credentials).
> > > > Since the use of AT_CHECK follows the exact kernel semantic as for a
> > > > real execution, user space gets the same error codes.
> > > >
> > > So we concluded that execveat(AT_CHECK) will be used to check the
> > > exec, shared object, script and config file (such as seccomp config),
> >
> > "config file" that contains executable code.
> >
> Is seccomp config  considered as "contains executable code", seccomp
> config is translated into bpf, so maybe yes ? but bpf is running in
> the kernel.
>
> > > I'm still thinking  execveat(AT_CHECK) vs faccessat(AT_CHECK) in
> > > different use cases:
> > >
> > > execveat clearly has less code change, but that also means: we can't
> > > add logic specific to exec (i.e. logic that can't be applied to
> > > config) for this part (from do_execveat_common to
> > > security_bprm_creds_for_exec) in future.  This would require some
> > > agreement/sign-off, I'm not sure from whom.
> >
> > I'm not sure to follow. We could still add new flags, but for now I
> > don't see use cases.  This patch series is not meant to handle all
> > possible "trust checks", only executable code, which makes sense for the
> > kernel.
> >
> I guess the "configfile" discussion is where I get confused, at one
> point, I think this would become a generic "trust checks" api for
> everything related to "generating executable code", e.g. javascript,
> java code, and more.
> We will want to clearly define the scope of execveat(AT_CHECK)
>
> > If we want other checks, we'll need to clearly define their semantic and
> > align with the kernel.  faccessat2(2) might be used to check other file
> > properties, but the executable property is not only defined by the file
> > attributes.
> >
> Agreed.
>
> > >
> > > --------------------------
> > > now looked at user cases (focus on elf for now)
> > >
> > > 1> ld.so /tmp/a.out, /tmp/a.out is on non-exec mount
> > > dynamic linker will first call execveat(fd, AT_CHECK) then execveat(fd)
> > >
> > > 2> execve(/usr/bin/some.out) and some.out has dependency on /tmp/a.so
> > > /usr/bin/some.out will pass AT_CHECK
> > >
> > > 3> execve(usr/bin/some.out) and some.out uses custom /tmp/ld.so
> > > /usr/bin/some.out will pass AT_CHECK, however, it uses a custom
> > > /tmp/ld.so (I assume this is possible  for elf header will set the
> > > path for ld.so because kernel has no knowledge of that, and
> > > binfmt_elf.c allocate memory for ld.so during execveat call)
> > >
> > > 4> dlopen(/tmp/a.so)
> > > I assume dynamic linker will call execveat(AT_CHECK), before map a.so
> > > into memory.
> > >
> > > For case 1>
> > > Alternative solution: Because AT_CHECK is always called, I think we
> > > can avoid the first AT_CHECK call, and check during execveat(fd),
> >
> > There is no need to use AT_CHECK if we're going to call execveat(2) on
> > the same file descriptor.  By design, AT_CHECK is implicit for any
> > execve(2).
> >
> Yes. I realized I was wrong to say that ld.so will call execve() for
> /tmp/a.out, there is no execve() call, otherwise it would have been
> blocked already today.
> The ld.so will  mmap the /tmp/a.out directly.  So case 1 is no
> different than case 2 and 4.  ( the elf objects are mapped to memory
> by dynamic linker.)
> I'm not familiar with dynamic linker, Florian is on this thread, and
> can help to correct me if my guess is wrong.

for Android, this has been the nail in the coffin of previous attempts
to disallow running code from non-trusted filesystems --- instead of
execing /tmp/a.out, the attacker just execs the linker with /tmp/a.out
as an argument. people are doing this already in some cases, because
we already have ineffectual "barriers" in place. [the usual argument
for doing such things anyway is "it makes it harder to be doing this
by _accident_".]

the other workaround for the attacker is to copy and paste the entire
dynamic linker source and change the bits they don't like :-) (if
you're thinking "is that a thing?", yes, so much so that the idea has
been independently reinvented multiple times by several major legit
apps and by basically every piece of DRM middleware. which is why --
although i'm excited by mseal(2) -- i expect to face significant
challenges rolling it out in Android _especially_ in places like
"dynamic linker internal data structures" where i've wanted it for
years!)

this proposal feels like it _ought_ to let a defender tighten their
seccomp filter to require a "safe" fd if i'm using mmap() with an fd,
but in practice as long as JITs exist i can always just copy code into
a non-fd-backed mmap() region. and -- from the perspective of Android,
where all "apps" are code loaded into a Java runtime -- there's not
much getting away from JITs. (and last i looked, ART -- Android's Java
runtime -- uses memfd() for the JIT cache.)

> > > this means the kernel will enforce SECBIT_EXEC_RESTRICT_FILE = 1, the
> > > benefit is that there is no TOCTOU and save one round trip of syscall
> > > for a succesful execveat() case.
> >
> > As long as user space uses the same file descriptor, there is no TOCTOU.
> >
> > SECBIT_EXEC_RESTRICT_FILE only makes sense for user space: it defines
> > the user space security policy.  The kernel already enforces the same
> > security policy for any execve(2), whatever are the calling process's
> > securebits.
> >
> > >
> > > For case 2>
> > > dynamic linker will call execve(AT_CHECK), then mmap(fd) into memory.
> > > However,  the process can all open then mmap() directly, it seems
> > > minimal effort for an attacker to walk around such a defence from
> > > dynamic linker.
> >
> > Which process?  What do you mean by "can all open then mmap() directly"?
> >
> > In this context the dynamic linker (like its parent processes) is
> > trusted (guaranteed by the system).
> >
> > For case 2, the dynamic linker must check with AT_CHECK all files that
> > will be mapped, which include /usr/bin/some.out and /tmp/a.so
> >
> My point is that the process can work around this by mmap() the file directly.
>
> > >
> > > Alternative solution:
> > > dynamic linker call AT_CHECK for each .so, kernel will save the state
> > > (associated with fd)
> > > kernel will check fd state at the time of mmap(fd, executable memory)
> > > and enforce SECBIT_EXEC_RESTRICT_FILE = 1
> >
> > The idea with AT_CHECK is that there is no kernel side effect, no extra
> > kernel state, and the semantic is the same as with execve(2).
> >
> > This also enables us to check file's executable permission and ignore
> > it, which is useful in a "permissive mode" when preparing for a
> > migration without breaking a system, or to do extra integrity checks.
> For preparing a migration (detect all violations), this is useful.
> But as a defense mechanism (SECBIT_EXEC_RESTRICT_FILE = 1) , this
> seems to be weak, at least for elf loading case.
>
> > BTW, this use case would also be more complex with a new openat2(2) flag
> > like the original O_MAYEXEC.
> >
> > >
> > > Alternative solution 2:
> > > a new syscall to load the .so and enforce the AT_CHECK in kernel
> >
> > A new syscall would be overkill for this feature.  Please see Linus's
> > comment.
> >
> maybe, I was thinking on how to prevent "/tmp/a.o" from getting mmap()
> to executable memory.
>
> > >
> > > This also means, for the solution to be complete, we might want to
> > > block creation of executable anonymous memory (e.g. by seccomp, ),
> >
> > How seccomp could create anonymous memory in user space?
> > seccomp filters should be treated (and checked with AT_CHECK) as
> > executable code anyway.
> >
> > > unless the user space can harden the creation of  executable anonymous
> > > memory in some way.
> >
> > User space is already in charge of mmapping its own memory.  I don't see
> > what is missing.
> >
> > >
> > > For case 3>
> > > I think binfmt_elf.c in the kernel needs to check the ld.so to make
> > > sure it passes AT_CHECK, before loading it into memory.
> >
> > All ELF dependencies are opened and checked with open_exec(), which
> > perform the main executability checks (with the __FMODE_EXEC flag).
> > Did I miss something?
> >
> I mean the ld-linux-x86-64.so.2 which is loaded by binfmt in the kernel.
> The app can choose its own dynamic linker path during build, (maybe
> even statically link one ?)  This is another reason that relying on a
> userspace only is not enough.
>
> > However, we must be careful with programs using the (deprecated)
> > uselib(2). They should also check with AT_CHECK because this syscall
> > opens the shared library without __FMODE_EXEC (similar to a simple file
> > open). See
> > https://lore.kernel.org/all/CAHk-=wiUwRG7LuR=z5sbkFVGQh+7qVB6_1NM0Ny9SVNL1Un4Sw@mail.gmail.com/
> >
> > >
> > > For case 4>
> > > same as case 2.
> > >
> > > Consider those cases: I think:
> > > a> relying purely on userspace for enforcement does't seem to be
> > > effective,  e.g. it is trivial  to call open(), then mmap() it into
> > > executable memory.
> >
> > As Steve explained (and is also explained in the patches), it is trivial
> > if the attacker can already execute its own code, which is too late to
> > enforce any script execution control.
> >
> > > b> if both user space and kernel need to call AT_CHECK, the faccessat
> > > seems to be a better place for AT_CHECK, e.g. kernel can call
> > > do_faccessat(AT_CHECK) and userspace can call faccessat(). This will
> > > avoid complicating the execveat() code path.
> >
> > A previous version of this patches series already patched faccessat(2),
> > but this is not the right place.  faccessat2(2) is dedicated to check
> > file permissions, not executability (e.g. with mount's noexec).
> >
> > >
> > > What do you think ?
> >
> > I think there are some misunderstandings.  Please let me know if it's
> > clearer now.
> >
> I'm still not sure about the user case for dynamic linker (elf
> loading) case. Maybe this patch is more suitable for scripts?
> A detailed user case will help demonstrate the use case for dynamic
> linker, e.g. what kind of app will benefit from
> SECBIT_EXEC_RESTRICT_FILE = 1, what kind of threat model are we
> dealing with , what kind of attack chain we blocked as a result.
>
> > >
> > > Thanks
> > > -Jeff
> > >
> > > > With the information that a script interpreter is about to interpret a
> > > > script, an LSM security policy can adjust caller's access rights or log
> > > > execution request as for native script execution (e.g. role transition).
> > > > This is possible thanks to the call to security_bprm_creds_for_exec().
> > > >
> > > > Because LSMs may only change bprm's credentials, use of AT_CHECK with
> > > > current kernel code should not be a security issue (e.g. unexpected role
> > > > transition).  LSMs willing to update the caller's credential could now
> > > > do so when bprm->is_check is set.  Of course, such policy change should
> > > > be in line with the new user space code.
> > > >
> > > > Because AT_CHECK is dedicated to user space interpreters, it doesn't
> > > > make sense for the kernel to parse the checked files, look for
> > > > interpreters known to the kernel (e.g. ELF, shebang), and return ENOEXEC
> > > > if the format is unknown.  Because of that, security_bprm_check() is
> > > > never called when AT_CHECK is used.
> > > >
> > > > It should be noted that script interpreters cannot directly use
> > > > execveat(2) (without this new AT_CHECK flag) because this could lead to
> > > > unexpected behaviors e.g., `python script.sh` could lead to Bash being
> > > > executed to interpret the script.  Unlike the kernel, script
> > > > interpreters may just interpret the shebang as a simple comment, which
> > > > should not change for backward compatibility reasons.
> > > >
> > > > Because scripts or libraries files might not currently have the
> > > > executable permission set, or because we might want specific users to be
> > > > allowed to run arbitrary scripts, the following patch provides a dynamic
> > > > configuration mechanism with the SECBIT_SHOULD_EXEC_CHECK and
> > > > SECBIT_SHOULD_EXEC_RESTRICT securebits.
> > > >
> > > > This is a redesign of the CLIP OS 4's O_MAYEXEC:
> > > > https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch
> > > > This patch has been used for more than a decade with customized script
> > > > interpreters.  Some examples can be found here:
> > > > https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC
> > > >
> > > > Cc: Al Viro <viro at zeniv.linux.org.uk>
> > > > Cc: Christian Brauner <brauner at kernel.org>
> > > > Cc: Kees Cook <keescook at chromium.org>
> > > > Cc: Paul Moore <paul at paul-moore.com>
> > > > Link: https://docs.python.org/3/library/io.html#io.open_code [1]
> > > > Signed-off-by: Mickaël Salaün <mic at digikod.net>
> > > > Link: https://lore.kernel.org/r/20240704190137.696169-2-mic@digikod.net
> > > > ---
> > > >
> > > > New design since v18:
> > > > https://lore.kernel.org/r/20220104155024.48023-3-mic@digikod.net
> > > > ---
> > > >  fs/exec.c                  |  5 +++--
> > > >  include/linux/binfmts.h    |  7 ++++++-
> > > >  include/uapi/linux/fcntl.h | 30 ++++++++++++++++++++++++++++++
> > > >  kernel/audit.h             |  1 +
> > > >  kernel/auditsc.c           |  1 +
> > > >  5 files changed, 41 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/fs/exec.c b/fs/exec.c
> > > > index 40073142288f..ea2a1867afdc 100644
> > > > --- a/fs/exec.c
> > > > +++ b/fs/exec.c
> > > > @@ -931,7 +931,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
> > > >                 .lookup_flags = LOOKUP_FOLLOW,
> > > >         };
> > > >
> > > > -       if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
> > > > +       if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH | AT_CHECK)) != 0)
> > > >                 return ERR_PTR(-EINVAL);
> > > >         if (flags & AT_SYMLINK_NOFOLLOW)
> > > >                 open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
> > > > @@ -1595,6 +1595,7 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename, int fl
> > > >                 bprm->filename = bprm->fdpath;
> > > >         }
> > > >         bprm->interp = bprm->filename;
> > > > +       bprm->is_check = !!(flags & AT_CHECK);
> > > >
> > > >         retval = bprm_mm_init(bprm);
> > > >         if (!retval)
> > > > @@ -1885,7 +1886,7 @@ static int bprm_execve(struct linux_binprm *bprm)
> > > >
> > > >         /* Set the unchanging part of bprm->cred */
> > > >         retval = security_bprm_creds_for_exec(bprm);
> > > > -       if (retval)
> > > > +       if (retval || bprm->is_check)
> > > >                 goto out;
> > > >
> > > >         retval = exec_binprm(bprm);
> > > > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> > > > index 70f97f685bff..8ff9c9e33aed 100644
> > > > --- a/include/linux/binfmts.h
> > > > +++ b/include/linux/binfmts.h
> > > > @@ -42,7 +42,12 @@ struct linux_binprm {
> > > >                  * Set when errors can no longer be returned to the
> > > >                  * original userspace.
> > > >                  */
> > > > -               point_of_no_return:1;
> > > > +               point_of_no_return:1,
> > > > +               /*
> > > > +                * Set by user space to check executability according to the
> > > > +                * caller's environment.
> > > > +                */
> > > > +               is_check:1;
> > > >         struct file *executable; /* Executable to pass to the interpreter */
> > > >         struct file *interpreter;
> > > >         struct file *file;
> > > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > > > index c0bcc185fa48..bcd05c59b7df 100644
> > > > --- a/include/uapi/linux/fcntl.h
> > > > +++ b/include/uapi/linux/fcntl.h
> > > > @@ -118,6 +118,36 @@
> > > >  #define AT_HANDLE_FID          AT_REMOVEDIR    /* file handle is needed to
> > > >                                         compare object identity and may not
> > > >                                         be usable to open_by_handle_at(2) */
> > > > +
> > > > +/*
> > > > + * AT_CHECK only performs a check on a regular file and returns 0 if execution
> > > > + * of this file would be allowed, ignoring the file format and then the related
> > > > + * interpreter dependencies (e.g. ELF libraries, script's shebang).  AT_CHECK
> > > > + * should only be used if SECBIT_SHOULD_EXEC_CHECK is set for the calling
> > > > + * thread.  See securebits.h documentation.
> > > > + *
> > > > + * Programs should use this check to apply kernel-level checks against files
> > > > + * that are not directly executed by the kernel but directly passed to a user
> > > > + * space interpreter instead.  All files that contain executable code, from the
> > > > + * point of view of the interpreter, should be checked.  The main purpose of
> > > > + * this flag is to improve the security and consistency of an execution
> > > > + * environment to ensure that direct file execution (e.g. ./script.sh) and
> > > > + * indirect file execution (e.g. sh script.sh) lead to the same result.  For
> > > > + * instance, this can be used to check if a file is trustworthy according to
> > > > + * the caller's environment.
> > > > + *
> > > > + * In a secure environment, libraries and any executable dependencies should
> > > > + * also be checked.  For instance dynamic linking should make sure that all
> > > > + * libraries are allowed for execution to avoid trivial bypass (e.g. using
> > > > + * LD_PRELOAD).  For such secure execution environment to make sense, only
> > > > + * trusted code should be executable, which also requires integrity guarantees.
> > > > + *
> > > > + * To avoid race conditions leading to time-of-check to time-of-use issues,
> > > > + * AT_CHECK should be used with AT_EMPTY_PATH to check against a file
> > > > + * descriptor instead of a path.
> > > > + */
> > > > +#define AT_CHECK               0x10000
> > > > +
> > > >  #if defined(__KERNEL__)
> > > >  #define AT_GETATTR_NOSEC       0x80000000
> > > >  #endif
> > > > diff --git a/kernel/audit.h b/kernel/audit.h
> > > > index a60d2840559e..8ebdabd2ab81 100644
> > > > --- a/kernel/audit.h
> > > > +++ b/kernel/audit.h
> > > > @@ -197,6 +197,7 @@ struct audit_context {
> > > >                 struct open_how openat2;
> > > >                 struct {
> > > >                         int                     argc;
> > > > +                       bool                    is_check;
> > > >                 } execve;
> > > >                 struct {
> > > >                         char                    *name;
> > > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > > > index 6f0d6fb6523f..b6316e284342 100644
> > > > --- a/kernel/auditsc.c
> > > > +++ b/kernel/auditsc.c
> > > > @@ -2662,6 +2662,7 @@ void __audit_bprm(struct linux_binprm *bprm)
> > > >
> > > >         context->type = AUDIT_EXECVE;
> > > >         context->execve.argc = bprm->argc;
> > > > +       context->execve.is_check = bprm->is_check;
> > > >  }
> > > >
> > > >
> > > > --
> > > > 2.45.2
> > > >
> > >