[PATCH v4 bpf-next 02/12] bpf: introduce BPF token object
Andrii Nakryiko
andrii.nakryiko at gmail.com
Tue Sep 12 21:46:27 UTC 2023
On Tue, Sep 12, 2023 at 2:30 PM Andrii Nakryiko <andrii at kernel.org> wrote:
>
> Add new kind of BPF kernel object, BPF token. BPF token is meant to
> allow delegating privileged BPF functionality, like loading a BPF
> program or creating a BPF map, from privileged process to a *trusted*
> unprivileged process, all while have a good amount of control over which
> privileged operations could be performed using provided BPF token.
>
> This is achieved through mounting BPF FS instance with extra delegation
> mount options, which determine what operations are delegatable, and also
> constraining it to the owning user namespace (as mentioned in the
> previous patch).
>
> BPF token itself is just a derivative from BPF FS and can be created
> through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts
> a path specification (using the usual fd + string path combo) to a BPF
> FS mount. Currently, BPF token "inherits" delegated command, map types,
> prog type, and attach type bit sets from BPF FS as is. In the future,
> having an BPF token as a separate object with its own FD, we can allow
> to further restrict BPF token's allowable set of things either at the creation
> time or after the fact, allowing the process to guard itself further
> from, e.g., unintentionally trying to load undesired kind of BPF
> programs. But for now we keep things simple and just copy bit sets as is.
>
> When BPF token is created from BPF FS mount, we take reference to the
> BPF super block's owning user namespace, and then use that namespace for
> checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
> capabilities that are normally only checked against init userns (using
> capable()), but now we check them using ns_capable() instead (if BPF
> token is provided). See bpf_token_capable() for details.
>
> Such setup means that BPF token in itself is not sufficient to grant BPF
> functionality. User namespaced process has to *also* have necessary
> combination of capabilities inside that user namespace. So while
> previously CAP_BPF was useless when granted within user namespace, now
> it gains a meaning and allows container managers and sys admins to have
> a flexible control over which processes can and need to use BPF
> functionality within the user namespace (i.e., container in practice).
> And BPF FS delegation mount options and derived BPF tokens serve as
> a per-container "flag" to grant overall ability to use bpf() (plus further
> restrict on which parts of bpf() syscalls are treated as namespaced).
>
> The alternative to creating BPF token object was:
> a) not having any extra object and just pasing BPF FS path to each
> relevant bpf() command. This seems suboptimal as it's racy (mount
> under the same path might change in between checking it and using it
> for bpf() command). And also less flexible if we'd like to further
> restrict ourselves compared to all the delegated functionality
> allowed on BPF FS.
> b) use non-bpf() interface, e.g., ioctl(), but otherwise also create
> a dedicated FD that would represent a token-like functionality. This
> doesn't seem superior to having a proper bpf() command, so
> BPF_TOKEN_CREATE was chosen.
>
> Signed-off-by: Andrii Nakryiko <andrii at kernel.org>
> ---
> include/linux/bpf.h | 36 +++++++
> include/uapi/linux/bpf.h | 39 +++++++
> kernel/bpf/Makefile | 2 +-
> kernel/bpf/inode.c | 4 +-
> kernel/bpf/syscall.c | 17 +++
> kernel/bpf/token.c | 189 +++++++++++++++++++++++++++++++++
> tools/include/uapi/linux/bpf.h | 39 +++++++
> 7 files changed, 324 insertions(+), 2 deletions(-)
> create mode 100644 kernel/bpf/token.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e9a3ab390844..6abd2b96e096 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -51,6 +51,8 @@ struct module;
> struct bpf_func_state;
> struct ftrace_ops;
> struct cgroup;
> +struct bpf_token;
> +struct user_namespace;
>
> extern struct idr btf_idr;
> extern spinlock_t btf_idr_lock;
> @@ -1568,6 +1570,13 @@ struct bpf_mount_opts {
> u64 delegate_attachs;
> };
>
> +struct bpf_token {
> + struct work_struct work;
> + atomic64_t refcnt;
> + struct user_namespace *userns;
> + u64 allowed_cmds;
> +};
> +
> struct bpf_struct_ops_value;
> struct btf_member;
>
> @@ -2192,6 +2201,15 @@ int bpf_link_new_fd(struct bpf_link *link);
> struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
>
> +void bpf_token_inc(struct bpf_token *token);
> +void bpf_token_put(struct bpf_token *token);
> +int bpf_token_create(union bpf_attr *attr);
> +int bpf_token_new_fd(struct bpf_token *token);
> +struct bpf_token *bpf_token_get_from_fd(u32 ufd);
> +
> +bool bpf_token_capable(const struct bpf_token *token, int cap);
> +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
> +
> int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
>
> @@ -2551,6 +2569,24 @@ static inline int bpf_obj_get_user(const char __user *pathname, int flags)
> return -EOPNOTSUPP;
> }
>
> +static inline void bpf_token_inc(struct bpf_token *token)
> +{
> +}
> +
> +static inline void bpf_token_put(struct bpf_token *token)
> +{
> +}
> +
> +static inline int bpf_token_new_fd(struct bpf_token *token)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd)
> +{
> + return ERR_PTR(-EOPNOTSUPP);
> +}
> +
> static inline void __dev_flush(void)
> {
> }
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 73b155e52204..36e98c6f8944 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -847,6 +847,37 @@ union bpf_iter_link_info {
> * Returns zero on success. On error, -1 is returned and *errno*
> * is set appropriately.
> *
> + * BPF_TOKEN_CREATE
> + * Description
> + * Create BPF token with embedded information about what
> + * BPF-related functionality it allows:
> + * - a set of allowed bpf() syscall commands;
> + * - a set of allowed BPF map types to be created with
> + * BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed;
> + * - a set of allowed BPF program types and BPF program attach
> + * types to be loaded with BPF_PROG_LOAD command, if
> + * BPF_PROG_LOAD itself is allowed.
> + *
> + * BPF token is created (derived) from an instance of BPF FS,
> + * assuming it has necessary delegation mount options specified.
> + * BPF FS mount is specified with openat()-style path FD + string.
> + * This BPF token can be passed as an extra parameter to various
> + * bpf() syscall commands to grant BPF subsystem functionality to
> + * unprivileged processes.
> + *
> + * When created, BPF token is "associated" with the owning
> + * user namespace of BPF FS instance (super block) that it was
> + * derived from, and subsequent BPF operations performed with
> + * BPF token would be performing capabilities checks (i.e.,
> + * CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within
> + * that user namespace. Without BPF token, such capabilities
> + * have to be granted in init user namespace, making bpf()
> + * syscall incompatible with user namespace, for the most part.
> + *
> + * Return
> + * A new file descriptor (a nonnegative integer), or -1 if an
> + * error occurred (in which case, *errno* is set appropriately).
> + *
> * NOTES
> * eBPF objects (maps and programs) can be shared between processes.
> *
> @@ -901,6 +932,8 @@ enum bpf_cmd {
> BPF_ITER_CREATE,
> BPF_LINK_DETACH,
> BPF_PROG_BIND_MAP,
> + BPF_TOKEN_CREATE,
> + __MAX_BPF_CMD,
> };
>
> enum bpf_map_type {
> @@ -1694,6 +1727,12 @@ union bpf_attr {
> __u32 flags; /* extra flags */
> } prog_bind_map;
>
> + struct { /* struct used by BPF_TOKEN_CREATE command */
> + __u32 flags;
> + __u32 bpffs_path_fd;
> + __u64 bpffs_pathname;
> + } token_create;
> +
> } __attribute__((aligned(8)));
>
> /* The description below is an attempt at providing documentation to eBPF
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index f526b7573e97..4ce95acfcaa7 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
> endif
> CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
>
> -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o
> +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o
> obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
> obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
> obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 8f66b57d3546..82f11fbffd3e 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -603,11 +603,13 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root)
> {
> struct bpf_mount_opts *opts = root->d_sb->s_fs_info;
> umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX;
> + u64 mask;
>
> if (mode != S_IRWXUGO)
> seq_printf(m, ",mode=%o", mode);
>
> - if (opts->delegate_cmds == ~0ULL)
> + mask = (1ULL << __MAX_BPF_CMD) - 1;
> + if ((opts->delegate_cmds & mask) == mask)
> seq_printf(m, ",delegate_cmds=any");
> else if (opts->delegate_cmds)
> seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds);
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 6a692f3bea15..4fae678c1f48 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -5297,6 +5297,20 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
> return ret;
> }
>
> +#define BPF_TOKEN_CREATE_LAST_FIELD token_create.bpffs_pathname
> +
> +static int token_create(union bpf_attr *attr)
> +{
> + if (CHECK_ATTR(BPF_TOKEN_CREATE))
> + return -EINVAL;
> +
> + /* no flags are supported yet */
> + if (attr->token_create.flags)
> + return -EINVAL;
A question to people looking at this: should BPF_TOKEN_CREATE be
guarded with ns_capable(CAP_BPF) or it's fine to rely on FS
permissions alone for this? It can't be capable(CAP_BPF), obviously,
but having it guarded by ns_capable(CAP_BPF) would make it impossible
to even construct a BPF token without having namespaced CAP_BPF inside
the container.
> +
> + return bpf_token_create(attr);
> +}
> +
> static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
> {
> union bpf_attr attr;
> @@ -5430,6 +5444,9 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
> case BPF_PROG_BIND_MAP:
> err = bpf_prog_bind_map(&attr);
> break;
> + case BPF_TOKEN_CREATE:
> + err = token_create(&attr);
> + break;
> default:
> err = -EINVAL;
> break;
> diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> new file mode 100644
> index 000000000000..f6ea3eddbee6
> --- /dev/null
> +++ b/kernel/bpf/token.c
> @@ -0,0 +1,189 @@
> +#include <linux/bpf.h>
> +#include <linux/vmalloc.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/fdtable.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/kernel.h>
> +#include <linux/idr.h>
> +#include <linux/namei.h>
> +#include <linux/user_namespace.h>
> +
> +bool bpf_token_capable(const struct bpf_token *token, int cap)
> +{
> + /* BPF token allows ns_capable() level of capabilities */
> + if (token) {
> + if (ns_capable(token->userns, cap))
> + return true;
> + if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> + return true;
> + }
> + /* otherwise fallback to capable() checks */
> + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> +}
> +
> +void bpf_token_inc(struct bpf_token *token)
> +{
> + atomic64_inc(&token->refcnt);
> +}
> +
> +static void bpf_token_free(struct bpf_token *token)
> +{
> + put_user_ns(token->userns);
> + kvfree(token);
> +}
> +
> +static void bpf_token_put_deferred(struct work_struct *work)
> +{
> + struct bpf_token *token = container_of(work, struct bpf_token, work);
> +
> + bpf_token_free(token);
> +}
> +
> +void bpf_token_put(struct bpf_token *token)
> +{
> + if (!token)
> + return;
> +
> + if (!atomic64_dec_and_test(&token->refcnt))
> + return;
> +
> + INIT_WORK(&token->work, bpf_token_put_deferred);
> + schedule_work(&token->work);
> +}
> +
> +static int bpf_token_release(struct inode *inode, struct file *filp)
> +{
> + struct bpf_token *token = filp->private_data;
> +
> + bpf_token_put(token);
> + return 0;
> +}
> +
> +static ssize_t bpf_dummy_read(struct file *filp, char __user *buf, size_t siz,
> + loff_t *ppos)
> +{
> + /* We need this handler such that alloc_file() enables
> + * f_mode with FMODE_CAN_READ.
> + */
> + return -EINVAL;
> +}
> +
> +static ssize_t bpf_dummy_write(struct file *filp, const char __user *buf,
> + size_t siz, loff_t *ppos)
> +{
> + /* We need this handler such that alloc_file() enables
> + * f_mode with FMODE_CAN_WRITE.
> + */
> + return -EINVAL;
> +}
> +
> +static void bpf_token_show_fdinfo(struct seq_file *m, struct file *filp)
> +{
> + struct bpf_token *token = filp->private_data;
> + u64 mask;
> +
> + mask = (1ULL << __MAX_BPF_CMD) - 1;
> + if ((token->allowed_cmds & mask) == mask)
> + seq_printf(m, "allowed_cmds:\tany\n");
> + else
> + seq_printf(m, "allowed_cmds:\t0x%llx\n", token->allowed_cmds);
> +}
> +
> +static const struct file_operations bpf_token_fops = {
> + .release = bpf_token_release,
> + .read = bpf_dummy_read,
> + .write = bpf_dummy_write,
> + .show_fdinfo = bpf_token_show_fdinfo,
> +};
> +
> +static struct bpf_token *bpf_token_alloc(void)
> +{
> + struct bpf_token *token;
> +
> + token = kvzalloc(sizeof(*token), GFP_USER);
> + if (!token)
> + return NULL;
> +
> + atomic64_set(&token->refcnt, 1);
> +
> + return token;
> +}
> +
> +int bpf_token_create(union bpf_attr *attr)
> +{
> + struct path path;
> + struct bpf_mount_opts *mnt_opts;
> + struct bpf_token *token;
> + int ret;
> +
> + ret = user_path_at(attr->token_create.bpffs_path_fd,
> + u64_to_user_ptr(attr->token_create.bpffs_pathname),
> + LOOKUP_FOLLOW | LOOKUP_EMPTY, &path);
> + if (ret)
> + return ret;
> +
> + if (path.mnt->mnt_root != path.dentry) {
> + ret = -EINVAL;
> + goto out;
> + }
> + ret = path_permission(&path, MAY_ACCESS);
> + if (ret)
> + goto out;
> +
> + token = bpf_token_alloc();
> + if (!token) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + /* remember bpffs owning userns for future ns_capable() checks */
> + token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);
> +
> + mnt_opts = path.dentry->d_sb->s_fs_info;
> + token->allowed_cmds = mnt_opts->delegate_cmds;
> +
> + ret = bpf_token_new_fd(token);
> + if (ret < 0)
> + bpf_token_free(token);
> +out:
> + path_put(&path);
> + return ret;
> +}
> +
> +#define BPF_TOKEN_INODE_NAME "bpf-token"
> +
> +/* Alloc anon_inode and FD for prepared token.
> + * Returns fd >= 0 on success; negative error, otherwise.
> + */
> +int bpf_token_new_fd(struct bpf_token *token)
> +{
> + return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC);
> +}
> +
> +struct bpf_token *bpf_token_get_from_fd(u32 ufd)
> +{
> + struct fd f = fdget(ufd);
> + struct bpf_token *token;
> +
> + if (!f.file)
> + return ERR_PTR(-EBADF);
> + if (f.file->f_op != &bpf_token_fops) {
> + fdput(f);
> + return ERR_PTR(-EINVAL);
> + }
> +
> + token = f.file->private_data;
> + bpf_token_inc(token);
> + fdput(f);
> +
> + return token;
> +}
> +
> +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
> +{
> + if (!token)
> + return false;
> +
> + return token->allowed_cmds & (1ULL << cmd);
> +}
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 73b155e52204..36e98c6f8944 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -847,6 +847,37 @@ union bpf_iter_link_info {
> * Returns zero on success. On error, -1 is returned and *errno*
> * is set appropriately.
> *
> + * BPF_TOKEN_CREATE
> + * Description
> + * Create BPF token with embedded information about what
> + * BPF-related functionality it allows:
> + * - a set of allowed bpf() syscall commands;
> + * - a set of allowed BPF map types to be created with
> + * BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed;
> + * - a set of allowed BPF program types and BPF program attach
> + * types to be loaded with BPF_PROG_LOAD command, if
> + * BPF_PROG_LOAD itself is allowed.
> + *
> + * BPF token is created (derived) from an instance of BPF FS,
> + * assuming it has necessary delegation mount options specified.
> + * BPF FS mount is specified with openat()-style path FD + string.
> + * This BPF token can be passed as an extra parameter to various
> + * bpf() syscall commands to grant BPF subsystem functionality to
> + * unprivileged processes.
> + *
> + * When created, BPF token is "associated" with the owning
> + * user namespace of BPF FS instance (super block) that it was
> + * derived from, and subsequent BPF operations performed with
> + * BPF token would be performing capabilities checks (i.e.,
> + * CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within
> + * that user namespace. Without BPF token, such capabilities
> + * have to be granted in init user namespace, making bpf()
> + * syscall incompatible with user namespace, for the most part.
> + *
> + * Return
> + * A new file descriptor (a nonnegative integer), or -1 if an
> + * error occurred (in which case, *errno* is set appropriately).
> + *
> * NOTES
> * eBPF objects (maps and programs) can be shared between processes.
> *
> @@ -901,6 +932,8 @@ enum bpf_cmd {
> BPF_ITER_CREATE,
> BPF_LINK_DETACH,
> BPF_PROG_BIND_MAP,
> + BPF_TOKEN_CREATE,
> + __MAX_BPF_CMD,
> };
>
> enum bpf_map_type {
> @@ -1694,6 +1727,12 @@ union bpf_attr {
> __u32 flags; /* extra flags */
> } prog_bind_map;
>
> + struct { /* struct used by BPF_TOKEN_CREATE command */
> + __u32 flags;
> + __u32 bpffs_path_fd;
> + __u64 bpffs_pathname;
> + } token_create;
> +
> } __attribute__((aligned(8)));
>
> /* The description below is an attempt at providing documentation to eBPF
> --
> 2.34.1
>
>
More information about the Linux-security-module-archive
mailing list