[PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Arnaldo Carvalho de Melo arnaldo.melo at gmail.com
Tue Apr 7 17:23:40 UTC 2020

On Tue, Apr 07, 2020 at 01:56:43PM -0300, Arnaldo Carvalho de Melo wrote:
> But then, even with that attr.exclude_kernel set to 1 we _still_ get
> kernel samples, which looks like another bug, now trying with strace,
> which leads us to another rabbit hole:
> [perf@five ~]$ strace -e perf_event_open -o /tmp/out.put perf top --stdio
> Error:
> You may not have permission to collect system-wide stats.
> Consider tweaking /proc/sys/kernel/perf_event_paranoid,
> which controls use of the performance events system by
> unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).
> The current value is 2:
>   -1: Allow use of (almost) all events by all users
>       Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
> >= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN
>       Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN
> >= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN
> >= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN
> To make this setting permanent, edit /etc/sysctl.conf too, e.g.:
> 	kernel.perf_event_paranoid = -1
> [perf@five ~]$
> If I remove that strace -e ... from the front, 'perf top' is back
> working as a non-cap_sys_admin user, just with cap_perfmon.

So far I couldn't figure out why exclude_kernel is being set to 1.
When no event is passed, perf-top falls back to this to pick a
default event:

	struct perf_event_attr attr = {
		.type		= PERF_TYPE_HARDWARE,
		.config		= PERF_COUNT_HW_CPU_CYCLES,
		.exclude_kernel	= !perf_event_can_profile_kernel(),
	};

where perf_event_can_profile_kernel() boils down to:

	return perf_cap__capable(CAP_SYS_ADMIN) ||
	       perf_cap__capable(CAP_PERFMON) ||
	       perf_event_paranoid() <= max_level;

That second condition should hold, so the check returns true and
.exclude_kernel should be set to !true, i.e. zero.
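
To make the expectation concrete, here is a minimal standalone sketch of
that three-way check. It is not the perf code: struct perf_env_sketch and
the *_sketch helpers are hypothetical stand-ins for perf_cap__capable()
and perf_event_paranoid(), and max_level = 1 in the example below is an
assumption taken from the ">= 2: Disallow kernel profiling" line in the
message above.

```c
#include <stdbool.h>

/* Hypothetical snapshot of what the real helpers would report at runtime. */
struct perf_env_sketch {
	bool has_cap_sys_admin;	/* stand-in for perf_cap__capable(CAP_SYS_ADMIN) */
	bool has_cap_perfmon;	/* stand-in for perf_cap__capable(CAP_PERFMON) */
	int  paranoid;		/* stand-in for perf_event_paranoid() */
};

/* Same shape as the quoted return statement: any one condition suffices. */
static bool paranoid_check_sketch(const struct perf_env_sketch *env,
				  int max_level)
{
	return env->has_cap_sys_admin ||
	       env->has_cap_perfmon ||
	       env->paranoid <= max_level;
}

/* The value that would end up in attr.exclude_kernel. */
static unsigned int exclude_kernel_sketch(const struct perf_env_sketch *env,
					  int max_level)
{
	return !paranoid_check_sketch(env, max_level);
}
```

With { false, true, 2 } and max_level 1 (no CAP_SYS_ADMIN, CAP_PERFMON
present, paranoid at 2) the sketch yields 0. At paranoid level 2 it only
yields 1 when both capability checks come back false, which is what the
observed exclude_kernel = 1 would correspond to in this model.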

Now the wallclock says I need to stop being a programmer and turn into a
daycare provider for Pedro, cya!

- Arnaldo

