[RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

Fri Feb 9 17:33:17 UTC 2024

On 2/6/24 20:40, Neeraj Upadhyay wrote:
> Gentle ping.
>   
> John,
> 
> Could you please confirm that:
> 
> a. The AppArmor refcount usage described in the RFC is correct?
> b. Approach taken to fix the scalability issue is valid/correct?
> 

Hi Neeraj,

I know your patchset has been waiting on review for a long time.
Unfortunately I have been very, very busy lately. I will try to
get to it this weekend, but I can't promise that I will be able
to get the review fully done.

john

> 
> On 1/10/2024 4:41 PM, Neeraj Upadhyay wrote:
>> Problem Statement
>> =================
>>
>> Nginx performance testing with Apparmor enabled (with nginx
>> running in unconfined profile), on kernel versions 6.1 and 6.5
>> show significant drop in throughput scalability, when Nginx
>> workers are scaled to higher number of CPUs across various
>> L3 cache domains.
>>
>> Below is one sample data on the throughput scalability loss,
>> based on results on AMD Zen4 system with 96 CPUs with SMT
>> core count 2; so, overall, 192 CPUs:
>>
>> Config      Cache Domains     apparmor=off        apparmor=on
>>                               scaling eff (%)      scaling eff (%)
>> 8C16T          1                  100%             100%
>> 16C32T         2                   95%              94%
>> 24C48T         3                   94%              93%
>> 48C96T         6                   92%              88%
>> 96C192T        12                  85%              68%
>>
>> If we look at above data, there is a significant drop in
>> scaling efficiency, when we move to 96 CPUs/192 SMT threads.
>>
>> Perf tool shows most of the contention coming from below
>> 6.56%     nginx  [kernel.vmlinux]      [k] apparmor_current_getsecid_subj
>> 6.22%     nginx  [kernel.vmlinux]      [k] apparmor_file_open
>>
>> The majority of the CPU cycles is found to be due to memory contention
>> in atomic_fetch_add and atomic_fetch_sub operations from kref_get() and
>> kref_put() operations on label.
>>
>> Commit 2516fde1fa00 ("apparmor: Optimize retrieving current task secid"),
>> from 6.7 alleviates the issue to an extent, but not completely:
>>
>> Config      Cache Domains     apparmor=on        apparmor=on (patched)
>>                               scaling eff (%)      scaling eff (%)
>> 8C16T          1                  100%             100%
>> 16C32T         2                   97%              93%
>> 24C48T         3                   94%              92%
>> 48C96T         6                   88%              88%
>> 96C192T        12                  65%              79%
>>
>> This adverse impact gets more pronounced when we move to >192 CPUs.
>> The memory contention and impact increases with high frequency label
>> update operations and labels are marked stale more frequently.
>>
>>
>> Label Refcount Management
>> =========================
>>
>> Apparmor uses label objects (struct aa_label) to manage refcounts for
>> below set of objects:
>>
>> - Applicable profiles
>> - Namespaces (unconfined profile)
>> - Other non-profile references
>>
>> These label references are acquired on various apparmor lsm hooks,
>> on operations such as file open, task kill operations, socket bind,
>> and other file, socket, misc operations which use current task's cred,
>> when the label for the current cred, has been marked stale. This is
>> done to check these operations against the set of allowed operations
>> for the task performing them.
>>
>> Use Percpu refcount for ref management?
>> =======================================
>>
>> The ref put operations (percpu_ref_put()) in percpu refcount,
>> in active mode, do not check whether ref count has dropped to
>> 0. The users of the percpu_ref need to explicitly invoke
>> a percpu_ref_kill() operation, to drop the initial reference,
>> at shutdown paths. After the percpu_ref_kill() operation, ref
>> switches to atomic mode and any new percpu_ref_put() operation
>> checks for the drop to 0 case and invokes the release operation
>> on that label.
>>
>> Labels are marked stale is situations like profile removal,
>> profile updates. For a namespace, the unconfined label reference
>> is dropped when the namespace is destroyed. These points
>> are potential shutdown points for labels. However, killing
>> the percpu ref from these points has few issues:
>>
>> - The label could still be referenced by tasks, which are
>>    still holding the reference to the now stale label.
>>    Killing the label ref while these operations are in progress
>>    will make all subsequent ref-put operations on the stale label
>>    to be atomic, which would still result in memory contention.
>>    Also, any new reference to the stale label, which is acquired
>>    with the elevated refcount will have atomic op contention.
>>
>> - The label is marked stale using a non-atomic write operation.
>>    It is possible that new operations do not observe this flag
>>    and still reference it for quite some time.
>>
>> - Explicitly tracking the shutdown points might not be maintainable
>>    at a per label granularity, as there can be various paths where
>>    label reference could get dropped, such as, before the label has
>>    gone live - object initialization error paths. Also, tracking
>>    the shutdown points for labels which reference other labels -
>>    subprofiles, merged labels requires careful analysis, and adds
>>    heavy burden on ensuring the memory contention is not introduced
>>    by these ref kill points.
>>
>>
>> Proposed Solution
>> =================
>>
>> One potential solution to the refcount scalability problem is to
>> convert the label refcount to a percpu refcount, and manage
>> the initial reference from kworker context. The kworker
>> keeps an extra reference to the label and periodically scans
>> labels and release them if their refcount drops to 0.
>>
>> Below is the sequence of operations, which shows the refcount
>> management with this approach:
>>
>> 1. During label initialization, the percpu ref is initialized in
>>     atomic mode. This is done to ensure that, for cases where the
>>     label hasn't gone live (->ns isn't assigned), mostly during
>>     initialization error paths.
>>
>> 2. Labels are switched to percpu mode at various points -
>>     when a label is added to labelset tree, when a unconfined profile
>>     has been assigned a namespace.
>>
>> 3. As part of the initial prototype, only the in tree labels
>>     are managed by the kworker. These labels are added to a lockless
>>     list. The unconfined labels invoke a percpu_ref_kill() operation
>>     when the namespace is destroyed.
>>
>> 4. The kworker does a periodic scan of all the labels in the
>>     llist. It does below sequence of operations:
>>
>>     a. Enqueue a dummy node to mark the start of scan. This dummy
>>        node is used as start point of scan and ensures that we
>>        there is no additional synchronization required with new
>>        label node additions to the llist. Any new labels will
>>        be processed in next run of the kworker.
>>
>>                        SCAN START PTR
>>                            |
>>                            v
>>        +----------+     +------+    +------+    +------+
>>        |          |     |      |    |      |    |      |
>>        |   head   ------> dummy|--->|label |--->| label|--->NULL
>>        |          |     | node |    |      |    |      |
>>        +----------+     +------+    +------+    +------+
>>
>>
>>        New label addition:
>>
>>                              SCAN START PTR
>>                                   |
>>                                   v
>>        +----------+  +------+  +------+    +------+    +------+
>>        |          |  |      |  |      |    |      |    |      |
>>        |   head   |--> label|--> dummy|--->|label |--->| label|--->NULL
>>        |          |  |      |  | node |    |      |    |      |
>>        +----------+  +------+  +------+    +------+    +------+
>>
>>      b. Traverse through the llist, starting from dummy->next.
>>         If the node is a dummy node, mark it free.
>>         If the node is a label node, do,
>>
>>         i) Switch the label ref to atomic mode. The ref switch wait
>>            for the existing percpu_ref_get() and percpu_ref_put()
>>            operations to complete, by waiting for a RCU grace period.
>>
>>            Once the switch is complete, from this point onwards, any
>>            percpu_ref_get(), percpu_ref_put() operations use
>>            atomic operations.
>>
>>        ii) Drop the initial reference, which was taken while adding
>>            the label node to the llist.
>>
>>       iii) Use a percpu_ref_tryget() increment operation on the
>>            ref, to see if we dropped the last ref count. if we
>>            dropped the last count, we remove the node from the llist.
>>
>>            All of these operations are done inside a RCU critical
>>            section, to avoid race with the release operations,
>>            which can potentially trigger, as soon as we drop
>>            the initial ref count.
>>
>>        iv) If we didn't drop the last ref, switch back the counter
>>            to percpu mode.
>>
>> Using this approach, to move the atomic refcount manipulation out of the
>> contended paths, there is a significant scalability improvement seen on
>> nginx test, and scalability efficiency is close to apparmor-off case.
>>
>> Config      Cache Domains     apparmor=on (percpuref)
>>                                 scaling eff (%)
>> 8C16T          1                  100%
>> 16C32T         2                   96%
>> 24C48T         3                   94%
>> 48C96T         6                   93%
>> 96C192T        12                  90%
>>
>> Limitations
>> ===========
>>
>> 1. Switching to percpu refcount increases memory size overhead, as
>>     percpu memory is allocated for all labels.
>>
>> 2. Deferring labels reclaim could potentially result in memory
>>     pressure, when there are high frequency of label update operations.
>>
>> 3. Percpu refcount uses call_rcu_hurry() to complete switch operations.
>>     These can impact energy efficiency, due to back to back hurry
>>     callbacks. Using deferrable workqueue partly mitigates this.
>>     However, deferring kworker can delay reclaims.
>>
>> 4. Back to back label switches can delay other percpu users, as
>>     there is a single global switch spinlock used by percpu refcount
>>     lib.
>>
>> 5. Long running kworker can delay other use cases like system suspend.
>>     This is mitigated using freezable workqueue and litming node
>>     scans to a max count.
>>
>> 6. There is a window where label operates is atomic mode, when its
>>     counter is being checked for zero. This can potentially result
>>     in high memory contention, during this window which spans RCU
>>     grace period (plus callback execution). For example, when
>>     scanning label corresponding to unconfined profile, all
>>     applications which use unconfined profile would be using
>>     atomic ref increment and decrement operations.
>>
>>     There are a few apparoaches which were tried to mitigate this issue:
>>
>>     a. At a lower time interval, check if scanned label's counter
>>        has changed since the start of label scan. If there is a change
>>        in count, terminate the switch to atomic mode. Below shows the
>>        apparoch using rcuwait.
>>
>>        static void aa_label_switch_atomic_confirm(struct percpu_ref *label_ref)
>>        {
>>           WRITE_ONCE(aa_atomic_switch_complete, true);
>>           rcuwait_wake_up(&aa_reclaim_rcuwait);
>>        }
>>
>>        rcuwait_init(&aa_reclaim_rcuwait);
>>        percpu_ref_switch_to_atomic(&label->count, aa_label_switch_atomic_confirm);
>>
>>        atomic_count = percpu_ref_count_read(&label->count);
>>        do {
>>          rcuwait_wait_event_timeout(&aa_reclaim_rcuwait,
>>                             (switch_complete = READ_ONCE(aa_atomic_switch_complete)),
>>                             TASK_IDLE,
>>                             msecs_to_jiffies(5));
>>          if (percpu_ref_count_read(&label->count) != atomic_count)
>>                  break;
>>         } while (!READ_ONCE(switch_complete));
>>
>>         However, this approach does not work, as percpu refcount lib does not
>>         allow termination of an ongoing switch operation. Also, the counter
>>         can return to the original value with set of get() and put() operations
>>         before we check the current value.
>>
>>     b. Approaches to notify the reclaim kworker from ref get and put operations
>>        can potentially disturb cache line state between the various CPU
>>        contexts, which are referncing the label, and can potentially impact
>>        scalability again.
>>
>>     c. Swith the label to an immortal percpu ref, while the scan operates
>>        on the current counter.
>>
>>        Below is the sequence of operations to do this:
>>
>>        1. Ensure that both immortal ref and label ref are in percpu mode.
>>           Reinit the immortal ref in percpu mode.
>>
>>           Swap percpu and atomic counters of label refcount and immortal ref
>> 	                          percpu-ref
>>        	                  +-------------------+
>>        +-------+           |  percpu-ctr-addr1 |
>>        | label | --------->|-------------------|    +----------------+
>>        +-------+           |   data            |--->| Atomic counter1|
>>                            +-------------------+    +----------------+
>>        +-------+           +-------------------+
>>        |ImmLbl |---------->|  percpu-ctr-addr2 |    +----------------+
>>        +-------+           |-------------------|--->| Atomic counter2|
>>                            |    data           |    +----------------+
>>                            +-------------------+
>>
>>            label ->percpu-ctr-addr  = percpu-ctr-addr2
>>            ImmLbl ->percpu-ctr-addr = percpu-ctr-addr1
>>            label ->data->count      = Atomic counter2
>>            ImmLbl ->data->count     = Atomic counter1
>>    
>>    
>>        2. Check the counters collected in immortal label, by switch it
>>           to atomic mode.
>>
>>        3. If the count is 0, do,
>>           a. Switch immortal counter to percpu again, giving it an
>>              initial count of 1.
>>           b. Swap the label and immortal counters again. The immortal
>>              ref now has the counter values from new percpu ref get
>>              and get operations on the label ref, from the point
>>              when we did the initial swap operation.
>>           c. Transfer the percpu counts in immortal ref to atomic
>>              counter of label percpu refcount.
>>           d. Kill immortal ref, for reinit on next iteration.
>>           e. Switch label percpu ref to atomic mode.
>>           f. If the counter is 1, drop the initial ref.
>>
>>         4. If the count is not 0, re-swap the counters.
>>            a. Switch immortal counter to percpu again, giving it an
>>               initial count of 1.
>>            b. Swap the label and immortal counters again. The immortal
>>               ref now has the counter values from new percpu ref get
>>               and get operations on the label ref, from the point
>>               when we did the initial swap operation.
>>            c. Transfer the percpu counts in immortal ref to atomic
>>               counter of label percpu refcount.
>>            d. Kill immortal ref, for reinit on next iteration.
>>
>>
>>            Using this approach, we ensure that, label ref users do not switch
>>            to atomic mode, while there are active references on the label.
>>            However, this approach requires multiple percpu ref mode switches
>>            and adds high overhead and complexity to the scanning code.
>>
>> Extended/Future Work
>> ====================
>>
>> 1. Look for ways to fix the limitations, as described in the "Limitations"
>>     section.
>>
>> 2. Generalize the approach to percpu rcuref, which is used for contexts
>>     where release path uses RCU grace period for release operations. Patch
>>     7 creates an initial prototype for this.
>>
>> 3. Explore hazard pointers for scalable refcounting of labels.
>>
>> Highly appreciate any feedback/suggestions on the design approach.
>>
>> The patches of this patchset introduce following changes:
>>
>> 1.      Documentation of Apparmor Refcount management.
>>
>> 2.      Switch labels to percpu refcount in atomic mode.
>>
>>          Use percpu refcount for apparmor labels. Initial patch to init
>>          the percpu ref in atomic mode, to evaluate the potential
>>          impact of percpuref on top of kref based implementation.
>>
>> 3.      Switch unconfined namespaces refcount to percpu mode.
>>
>>          Switch unconfined ns labels to percpu mode, and kill the
>>          initial refcount from namespace destroy path.
>>
>> 4.      Add infrastructure to reclaim percpu labels.
>>
>>          Add a label reclaim infrastructure for labels which are
>>          in percpu mode, for managing their inital refcount.
>>
>> 5.      Switch intree labels to percpu mode.
>>
>>          Use label reclaim infrastruture to manage intree labels.
>>
>> 6.      Initial prototype for optimizing ref switch.
>>
>>          Prototype for reducing the time window when a label
>>          scan switches the label ref to atomic mode.
>>
>> 7.      percpu-rcuref: Add basic infrastructure.
>>
>>          Prototype for Percpu refcounts for objects, which protect
>>          their object reclaims using RCU grace period.
>>
>> 8.      Switch labels to percpu rcurefcount in unmanaged mode.
>>
>>          Use percpu rcuref for labels. Start with unmanaged/atomic
>>          mode.
>>
>> 9.      Switch unconfined and in tree labels to managed ref mode.
>>
>>          Use percpu mode with manager worker for unconfined and intree
>>          labels.
>>
>>
>> ------------------------------------------------------------------------
>>
>>   b/Documentation/admin-guide/LSM/ApparmorRefcount.rst |  351 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>   b/Documentation/admin-guide/LSM/index.rst            |    1
>>   b/Documentation/admin-guide/kernel-parameters.txt    |    8 +
>>   b/include/linux/percpu-rcurefcount.h                 |  115 ++++++++++++++++
>>   b/include/linux/percpu-refcount.h                    |    2
>>   b/lib/Makefile                                       |    2
>>   b/lib/percpu-rcurefcount.c                           |  336 +++++++++++++++++++++++++++++++++++++++++++++++
>>   b/lib/percpu-refcount.c                              |   93 +++++++++++++
>>   b/security/apparmor/include/label.h                  |   16 +-
>>   b/security/apparmor/include/policy.h                 |    8 -
>>   b/security/apparmor/include/policy_ns.h              |   24 +++
>>   b/security/apparmor/label.c                          |   11 +
>>   b/security/apparmor/lsm.c                            |  145 ++++++++++++++++++++
>>   b/security/apparmor/policy_ns.c                      |    6
>>   include/linux/percpu-refcount.h                      |    2
>>   lib/percpu-refcount.c                                |   93 -------------
>>   security/apparmor/include/label.h                    |   17 +-
>>   security/apparmor/include/policy.h                   |   56 +++----
>>   security/apparmor/include/policy_ns.h                |   24 ---
>>   security/apparmor/label.c                            |   11 -
>>   security/apparmor/lsm.c                              |  325 ++++++++++++----------------------------------
>>   security/apparmor/policy_ns.c                        |    8 -
>>   22 files changed, 1237 insertions(+), 417 deletions(-)
>>
>> base-commit: ab27740f7665