[RFC][PATCH 00/10] Mount, FS, Block and Keyrings notifications [ver #3]
Stephen Smalley
sds at tycho.nsa.gov
Thu Jun 6 12:32:29 UTC 2019
On 6/6/19 5:41 AM, David Howells wrote:
>
> Hi Al,
>
> Here's a set of patches to add a general variable-length notification queue
> concept and to add sources of events for:
>
> (1) Mount topology events, such as mounting, unmounting, mount expiry,
> mount reconfiguration.
>
> (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
> errors (not complete yet).
>
> (3) Key/keyring events, such as creating, linking and removal of keys.
>
> (4) General device events (single common queue) including:
>
> - Block layer events, such as device errors
>
> - USB subsystem events, such as device/bus attach/remove, device
> reset, device errors.
>
> One of the reasons for this is so that we can remove the issue of processes
> having to repeatedly and regularly scan /proc/mounts, which has proven to
> be a system performance problem. To further aid this, the fsinfo() syscall
> on which this patch series depends, provides a way to access superblock and
> mount information in binary form without the need to parse /proc/mounts.
>
>
> LSM support is included, but controversial:
>
> (1) The creds of the process that did the fput() that reduced the refcount
> to zero are cached in the file struct.
>
> (2) __fput() overrides the current creds with the creds from (1) whilst
> doing the cleanup, thereby making sure that the creds seen by the
> destruction notification generated by mntput() appears to come from
> the last fputter.
>
> (3) security_post_notification() is called for each queue that we might
> want to post a notification into, thereby allowing the LSM to prevent
> covert communications.
>
> (?) Do I need to add security_set_watch(), say, to rule on whether a watch
> may be set in the first place? I might need to add a variant per
> watch-type.
>
> (?) Do I really need to keep track of the process creds in which an
> implicit object destruction happened? For example, imagine you create
> an fd with fsopen()/fsmount(). It is marked to dissolve the mount it
> refers to on close unless move_mount() clears that flag. Now, imagine
> someone looking at that fd through procfs at the same time as you exit
> due to an error. The LSM sees the destruction notification come from
> the looker if they happen to do their fput() after yours.
I'm not in favor of this approach. Can we check permission to the object
being watched when a watch is set (read-like access), make sure every
access that can trigger a notification requires a (write-like)
permission to the accessed object, and make sure there is some sane way
to control the relationship between the accessed object and the watched
object (write-like)? For cases where we have no object per se or at
least no security structure/label associated with it, we may have to
fall back to a coarse-grained "Can the watcher get this kind of
notification in general?".
>
>
> Design decisions:
>
> (1) A misc chardev is used to create and open a ring buffer:
>
> fd = open("/dev/watch_queue", O_RDWR);
>
> which is then configured and mmap'd into userspace:
>
> ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
> ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
> buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
> MAP_SHARED, fd, 0);
>
> The fd cannot be read or written (though there is a facility to use
> write to inject records for debugging) and userspace just pulls data
> directly out of the buffer.
>
> (2) The ring index pointers are stored inside the ring and are thus
> accessible to userspace. Userspace should only update the tail
> pointer and never the head pointer or risk breaking the buffer. The
> kernel checks that the pointers appear valid before trying to use
> them. A 'skip' record is maintained around the pointers.
>
> (3) poll() can be used to wait for data to appear in the buffer.
>
> (4) Records in the buffer are binary, typed and have a length so that they
> can be of varying size.
>
> This means that multiple heterogeneous sources can share a common
> buffer. Tags may be specified when a watchpoint is created to help
> distinguish the sources.
>
> (5) The queue is reusable as there are 16 million types available, of
> which I've used 4, so there is scope for others to be used.
>
> (6) Records are filterable as types have up to 256 subtypes that can be
> individually filtered. Other filtration is also available.
>
> (7) Each time the buffer is opened, a new buffer is created - this means
> that there's no interference between watchers.
>
> (8) When recording a notification, the kernel will not sleep, but will
> rather mark a queue as overrun if there's insufficient space, thereby
> avoiding userspace causing the kernel to hang.
>
> (9) The 'watchpoint' should be specific where possible, meaning that you
> specify the object that you want to watch.
>
> (10) The buffer is created and then watchpoints are attached to it, using
> one of:
>
> keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
> mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
> sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
>
> where in all three cases, fd indicates the queue and the number after
> is a tag between 0 and 255.
>
> (11) The watch must be removed if either the watch buffer is destroyed or
> the watched object is destroyed.
>
>
> Things I want to avoid:
>
> (1) Introducing features that make the core VFS dependent on the network
> stack or networking namespaces (ie. usage of netlink).
>
> (2) Dumping all this stuff into dmesg and having a daemon that sits there
> parsing the output and distributing it as this then puts the
> responsibility for security into userspace and makes handling
> namespaces tricky. Further, dmesg might not exist or might be
> inaccessible inside a container.
>
> (3) Letting users see events they shouldn't be able to see.
>
>
> Further things that could be considered:
>
> (1) Adding a keyctl call to allow a watch on a keyring to be extended to
> "children" of that keyring, such that the watch is removed from the
> child if it is unlinked from the keyring.
>
> (2) Adding global superblock event queue.
>
> (3) Propagating watches to child superblock over automounts.
>
>
> The patches can be found here also:
>
> http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
>
> Changes:
>
> v3: I've added a USB notification source and reformulated the block
> notification source so that there's now a common watch list, for which
> the system call is now device_notify().
>
> I've assigned a pair of unused ioctl numbers in the 'W' series to the
> ioctls added by this series.
>
> I've also added a description of the kernel API to the documentation.
>
> v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
> krefs for refcounting. I've added some security features to try and
> give Casey Schaufler the LSM control he wants.
>
> David
> ---
> David Howells (10):
> security: Override creds in __fput() with last fputter's creds
> General notification queue with user mmap()'able ring buffer
> keys: Add a notification facility
> vfs: Add a mount-notification facility
> vfs: Add superblock notifications
> fsinfo: Export superblock notification counter
> Add a general, global device notification watch list
> block: Add block layer notifications
> usb: Add USB subsystem notifications
> Add sample notification program
>
>
> Documentation/ioctl/ioctl-number.txt | 1
> Documentation/security/keys/core.rst | 58 ++
> Documentation/watch_queue.rst | 492 ++++++++++++++++++
> arch/x86/entry/syscalls/syscall_32.tbl | 3
> arch/x86/entry/syscalls/syscall_64.tbl | 3
> block/Kconfig | 9
> block/blk-core.c | 29 +
> drivers/base/Kconfig | 9
> drivers/base/Makefile | 1
> drivers/base/notify.c | 82 +++
> drivers/misc/Kconfig | 13
> drivers/misc/Makefile | 1
> drivers/misc/watch_queue.c | 889 ++++++++++++++++++++++++++++++++
> drivers/usb/core/Kconfig | 10
> drivers/usb/core/devio.c | 55 ++
> drivers/usb/core/hub.c | 3
> fs/Kconfig | 21 +
> fs/Makefile | 1
> fs/file_table.c | 12
> fs/fsinfo.c | 12
> fs/mount.h | 33 +
> fs/mount_notify.c | 180 ++++++
> fs/namespace.c | 9
> fs/super.c | 116 ++++
> include/linux/blkdev.h | 15 +
> include/linux/dcache.h | 1
> include/linux/device.h | 7
> include/linux/fs.h | 79 +++
> include/linux/key.h | 4
> include/linux/lsm_hooks.h | 15 +
> include/linux/security.h | 14 +
> include/linux/syscalls.h | 5
> include/linux/usb.h | 19 +
> include/linux/watch_queue.h | 87 +++
> include/uapi/linux/fsinfo.h | 10
> include/uapi/linux/keyctl.h | 1
> include/uapi/linux/watch_queue.h | 213 ++++++++
> kernel/sys_ni.c | 7
> mm/interval_tree.c | 2
> mm/memory.c | 1
> samples/Kconfig | 6
> samples/Makefile | 1
> samples/vfs/test-fsinfo.c | 13
> samples/watch_queue/Makefile | 9
> samples/watch_queue/watch_test.c | 310 +++++++++++
> security/keys/Kconfig | 10
> security/keys/compat.c | 2
> security/keys/gc.c | 5
> security/keys/internal.h | 30 +
> security/keys/key.c | 37 +
> security/keys/keyctl.c | 88 +++
> security/keys/keyring.c | 17 -
> security/keys/request_key.c | 4
> security/security.c | 9
> 54 files changed, 3025 insertions(+), 38 deletions(-)
> create mode 100644 Documentation/watch_queue.rst
> create mode 100644 drivers/base/notify.c
> create mode 100644 drivers/misc/watch_queue.c
> create mode 100644 fs/mount_notify.c
> create mode 100644 include/linux/watch_queue.h
> create mode 100644 include/uapi/linux/watch_queue.h
> create mode 100644 samples/watch_queue/Makefile
> create mode 100644 samples/watch_queue/watch_test.c
>
More information about the Linux-security-module-archive
mailing list