[PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

Mon Aug 5 07:36:50 UTC 2019

Hi Andy, 

> On Aug 4, 2019, at 10:47 PM, Andy Lutomirski <luto at kernel.org> wrote:
> 
> On Sun, Aug 4, 2019 at 5:08 PM Andy Lutomirski <luto at kernel.org> wrote:
>> 
>> On Sun, Aug 4, 2019 at 3:16 PM Andy Lutomirski <luto at kernel.org> wrote:
>>> 
>>> On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving at fb.com> wrote:
>>>> 
>>>> Hi Andy,
>>>> 
>>>>> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
>>>>>> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
>>>>> 
>>>>> It's been done before -- it's not that hard.  IMO the main tricky bit
>>>>> would be try be somewhat careful about defining exactly what
>>>>> CAP_BPF_ADMIN does.
>>>> 
>>>> Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
>>>> Plumbers conference.
>>>> 
>>>> OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
>>>> like systemd to do sys_bpf() without root.
>>> 
>>> I don't understand the use case here.  Are you talking about systemd
>>> --user?  As far as I know, a user is expected to be able to fully
>>> control their systemd --user process, so giving it unrestricted bpf
>>> access is very close to giving it superuser access, and this doesn't
>>> sound like a good idea.  I think that, if systemd --user needs bpf(),
>>> it either needs real unprivileged bpf() or it needs a privileged
>>> helper (SUID or a daemon) to intermediate this access.
>>> 
>>>> 
>>>>> 
>>>>>>> I don't see why you need to invent a whole new mechanism for this.
>>>>>>> The entire cgroup ecosystem outside bpf() does just fine using the
>>>>>>> write permission on files in cgroupfs to control access.  Why can't
>>>>>>> bpf() do the same thing?
>>>>>> 
>>>>>> It is easier to use write permission for BPF_PROG_ATTACH. But it is
>>>>>> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
>>>>>> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
>>>>>> we should have target concept for all these commands. But that is a
>>>>>> much bigger project. OTOH, "all or nothing" model allows all these
>>>>>> commands at once.
>>>>> 
>>>>> For BPF_PROG_LOAD, I admit I've never understood why permission is
>>>>> required at all.  I think that CAP_SYS_ADMIN or similar should be
>>>>> needed to get is_priv in the verifier, but I think that should mainly
>>>>> be useful for tracing, and that requires lots of privilege anyway.
>>>>> BPF_MAP_* is probably the trickiest part.  One solution would be some
>>>>> kind of bpffs, but I'm sure other solutions are possible.
>>>> 
>>>> Improving permission management of cgroup_bpf is another good topic to
>>>> discuss. However, it is also an overkill for current use case.
>>>> 
>>> 
>>> I looked at the code some more, and I don't think this is so hard
>>> after all.  As I understand it, all of the map..by_id stuff is, to
>>> some extent, deprecated in favor of persistent maps.  As I see it, the
>>> map..by_id calls should require privilege forever, although I can
>>> imagine ways to scope that privilege to a namespace if the maps
>>> themselves were to be scoped to a namespace.
>>> 
>>> Instead, unprivileged tools would use the persistent map interface
>>> roughly like this:
>>> 
>>> $ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
>>> 8 entries 64 name mapname
>>> 
>>> This would require that the caller have either CAP_DAC_OVERRIDE or
>>> that the caller have permission to create files in /sys/fs/bpf/my_dir
>>> (using the same rules as for any filesystem), and the resulting map
>>> would end up owned by the creating user and have mode 0600 (or maybe
>>> 0666, or maybe a new bpf_attr parameter) modified by umask.  Then all
>>> the various capable() checks that are currently involved in accessing
>>> a persistent map would instead check FMODE_READ or FMODE_WRITE on the
>>> map file as appropriate.
>>> 
>>> Half of this stuff already works.  I just set my system up like this:
>>> 
>>> $ ls -l /sys/fs/bpf
>>> total 0
>>> drwxr-xr-x. 3 luto luto 0 Aug  4 15:10 luto
>>> 
>>> $ mkdir /sys/fs/bpf/luto/test
>>> 
>>> $ ls -l /sys/fs/bpf/luto
>>> total 0
>>> drwxrwxr-x. 2 luto luto 0 Aug  4 15:10 test
>>> 
>>> I bet that making the bpf() syscalls work appropriately in this
>>> context without privilege would only be a couple of hours of work.
>>> The hard work, creating bpffs and making it function, is already done
>>> :)
>>> 
>>> P.S. The docs for bpftool create are less than fantastic.  The
>>> complete lack of any error message at all when the syscall returns
>>> -EACCES is also not fantastic.
>> 
>> This isn't remotely finished, but I spent a bit of time fiddling with this:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=bpf/perms
>> 
>> What do you think?  (It's obviously not done.  It doesn't compile, and
>> I haven't gotten to the permissions needed to do map operations.  I
>> also haven't touched the capable() checks.)
> 
> I updated the branch.  It compiles, and basic map functionality works!

Thanks a lot for trying this out. This is a very interesting direction
that we will explore. 

> 
> # mount -t bpf bpf /sys/fs/bpf
> # cd /sys/fs/bpf
> # mkdir luto
> # chown luto: luto
> # setpriv --euid=1000 --ruid=1000 bash
> $ pwd
> /sys/fs/bpf
> bash-5.0$ ls -l
> total 0
> drwxr-xr-x 2 luto luto 0 Aug  4 22:41 luto
> bash-5.0$ bpftool map create /sys/fs/bpf/luto/filename type hash key 8
> value 8 entries 64 name mapname
> bash-5.0$ bpftool map dump pinned /sys/fs/bpf/luto/filename
> Found 0 elements
> 
> # chown root: /sys/fs/bpf/luto/filename
> 
> $ bpftool map dump pinned /sys/fs/bpf/luto/filename
> Error: bpf obj get (/sys/fs/bpf/luto): Permission denied
> 
> So I think it's possible to get a respectable subset of bpf()
> functionality working without privilege in short order :)

I think we have two key questions to answer: 
  1. What subset of bpf() functionality will the users need?
  2. Who are the users? 

Different answers to these two questions lead to different directions.

In our use case, the answers are 
  1) almost all bpf() functionality
  2) highly trusted users (sudoers)

So our initial approach of /dev/bpf allows all bpf() functionality
in one bit in task_struct. (Yes, we can just sudo. But, we would 
rather not use sudo when possible.)

"cgroup management" use case may have answers like:
  1) cgroup_bpf only
  2) users in their own containers

For this case, getting cgroup_bpf related features (cgroup_bpf progs; 
some map types, etc.) work with unprivileged users would be the right 
direction. 

"USDT tracing" use case may have answers like:
  1) uprobe, stockmap, histogram, etc.
  2) unprivileged user, w/ or w/o containers

For this case, the first step is likely hacking sys_perf_event_open(). 

I guess we will need more discussions to decide how to make bpf() 
work better for all these (and more) use cases. 

Thanks,
Song