[RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

Wed Feb 7 04:40:15 UTC 2024

Gentle ping.

John,

Could you please confirm that:

a. The AppArmor refcount usage described in the RFC is correct?
b. Approach taken to fix the scalability issue is valid/correct?

Thanks
Neeraj

On 1/10/2024 4:41 PM, Neeraj Upadhyay wrote:
> Problem Statement
> =================
> 
> Nginx performance testing with Apparmor enabled (with nginx
> running in unconfined profile), on kernel versions 6.1 and 6.5
> show significant drop in throughput scalability, when Nginx
> workers are scaled to higher number of CPUs across various
> L3 cache domains.
> 
> Below is one sample data on the throughput scalability loss,
> based on results on AMD Zen4 system with 96 CPUs with SMT
> core count 2; so, overall, 192 CPUs:
> 
> Config      Cache Domains     apparmor=off        apparmor=on
>                              scaling eff (%)      scaling eff (%)
> 8C16T          1                  100%             100%
> 16C32T         2                   95%              94%
> 24C48T         3                   94%              93%
> 48C96T         6                   92%              88%
> 96C192T        12                  85%              68%
> 
> If we look at above data, there is a significant drop in
> scaling efficiency, when we move to 96 CPUs/192 SMT threads.
> 
> Perf tool shows most of the contention coming from below
> 6.56%     nginx  [kernel.vmlinux]      [k] apparmor_current_getsecid_subj 
> 6.22%     nginx  [kernel.vmlinux]      [k] apparmor_file_open
> 
> The majority of the CPU cycles is found to be due to memory contention
> in atomic_fetch_add and atomic_fetch_sub operations from kref_get() and
> kref_put() operations on label.
> 
> Commit 2516fde1fa00 ("apparmor: Optimize retrieving current task secid"),
> from 6.7 alleviates the issue to an extent, but not completely:
> 
> Config      Cache Domains     apparmor=on        apparmor=on (patched)
>                              scaling eff (%)      scaling eff (%)
> 8C16T          1                  100%             100%
> 16C32T         2                   97%              93%
> 24C48T         3                   94%              92%
> 48C96T         6                   88%              88%
> 96C192T        12                  65%              79%
> 
> This adverse impact gets more pronounced when we move to >192 CPUs.
> The memory contention and impact increases with high frequency label
> update operations and labels are marked stale more frequently.
> 
> 
> Label Refcount Management
> =========================
> 
> Apparmor uses label objects (struct aa_label) to manage refcounts for
> below set of objects:
> 
> - Applicable profiles
> - Namespaces (unconfined profile)
> - Other non-profile references
> 
> These label references are acquired on various apparmor lsm hooks,
> on operations such as file open, task kill operations, socket bind,
> and other file, socket, misc operations which use current task's cred,
> when the label for the current cred, has been marked stale. This is
> done to check these operations against the set of allowed operations
> for the task performing them.
> 
> Use Percpu refcount for ref management?
> =======================================
> 
> The ref put operations (percpu_ref_put()) in percpu refcount,
> in active mode, do not check whether ref count has dropped to
> 0. The users of the percpu_ref need to explicitly invoke
> a percpu_ref_kill() operation, to drop the initial reference,
> at shutdown paths. After the percpu_ref_kill() operation, ref
> switches to atomic mode and any new percpu_ref_put() operation
> checks for the drop to 0 case and invokes the release operation
> on that label.
> 
> Labels are marked stale is situations like profile removal,
> profile updates. For a namespace, the unconfined label reference
> is dropped when the namespace is destroyed. These points
> are potential shutdown points for labels. However, killing
> the percpu ref from these points has few issues:
> 
> - The label could still be referenced by tasks, which are
>   still holding the reference to the now stale label.
>   Killing the label ref while these operations are in progress
>   will make all subsequent ref-put operations on the stale label
>   to be atomic, which would still result in memory contention.
>   Also, any new reference to the stale label, which is acquired
>   with the elevated refcount will have atomic op contention.
> 
> - The label is marked stale using a non-atomic write operation.
>   It is possible that new operations do not observe this flag
>   and still reference it for quite some time.
> 
> - Explicitly tracking the shutdown points might not be maintainable
>   at a per label granularity, as there can be various paths where
>   label reference could get dropped, such as, before the label has
>   gone live - object initialization error paths. Also, tracking
>   the shutdown points for labels which reference other labels -
>   subprofiles, merged labels requires careful analysis, and adds
>   heavy burden on ensuring the memory contention is not introduced
>   by these ref kill points.
> 
> 
> Proposed Solution
> =================
> 
> One potential solution to the refcount scalability problem is to
> convert the label refcount to a percpu refcount, and manage
> the initial reference from kworker context. The kworker
> keeps an extra reference to the label and periodically scans
> labels and release them if their refcount drops to 0.
> 
> Below is the sequence of operations, which shows the refcount
> management with this approach:
> 
> 1. During label initialization, the percpu ref is initialized in
>    atomic mode. This is done to ensure that, for cases where the
>    label hasn't gone live (->ns isn't assigned), mostly during
>    initialization error paths.
> 
> 2. Labels are switched to percpu mode at various points -
>    when a label is added to labelset tree, when a unconfined profile
>    has been assigned a namespace.
> 
> 3. As part of the initial prototype, only the in tree labels
>    are managed by the kworker. These labels are added to a lockless
>    list. The unconfined labels invoke a percpu_ref_kill() operation
>    when the namespace is destroyed.
> 
> 4. The kworker does a periodic scan of all the labels in the
>    llist. It does below sequence of operations:
> 
>    a. Enqueue a dummy node to mark the start of scan. This dummy
>       node is used as start point of scan and ensures that we
>       there is no additional synchronization required with new
>       label node additions to the llist. Any new labels will
>       be processed in next run of the kworker.
> 
>                       SCAN START PTR
>                           |
>                           v
>       +----------+     +------+    +------+    +------+
>       |          |     |      |    |      |    |      |
>       |   head   ------> dummy|--->|label |--->| label|--->NULL
>       |          |     | node |    |      |    |      |
>       +----------+     +------+    +------+    +------+
> 
> 
>       New label addition:
> 
>                             SCAN START PTR
>                                  |
>                                  v
>       +----------+  +------+  +------+    +------+    +------+
>       |          |  |      |  |      |    |      |    |      |
>       |   head   |--> label|--> dummy|--->|label |--->| label|--->NULL
>       |          |  |      |  | node |    |      |    |      |
>       +----------+  +------+  +------+    +------+    +------+
> 
>     b. Traverse through the llist, starting from dummy->next.
>        If the node is a dummy node, mark it free.
>        If the node is a label node, do,
> 
>        i) Switch the label ref to atomic mode. The ref switch wait
>           for the existing percpu_ref_get() and percpu_ref_put()
>           operations to complete, by waiting for a RCU grace period.
> 
>           Once the switch is complete, from this point onwards, any
>           percpu_ref_get(), percpu_ref_put() operations use
>           atomic operations.
> 
>       ii) Drop the initial reference, which was taken while adding
>           the label node to the llist.
> 
>      iii) Use a percpu_ref_tryget() increment operation on the
>           ref, to see if we dropped the last ref count. if we
>           dropped the last count, we remove the node from the llist.
> 
>           All of these operations are done inside a RCU critical
>           section, to avoid race with the release operations,
>           which can potentially trigger, as soon as we drop
>           the initial ref count.
> 
>       iv) If we didn't drop the last ref, switch back the counter
>           to percpu mode.
> 
> Using this approach, to move the atomic refcount manipulation out of the
> contended paths, there is a significant scalability improvement seen on
> nginx test, and scalability efficiency is close to apparmor-off case.
> 
> Config      Cache Domains     apparmor=on (percpuref)
>                                scaling eff (%)
> 8C16T          1                  100%
> 16C32T         2                   96%
> 24C48T         3                   94%
> 48C96T         6                   93%
> 96C192T        12                  90%
> 
> Limitations
> ===========
> 
> 1. Switching to percpu refcount increases memory size overhead, as
>    percpu memory is allocated for all labels.
> 
> 2. Deferring labels reclaim could potentially result in memory
>    pressure, when there are high frequency of label update operations.
> 
> 3. Percpu refcount uses call_rcu_hurry() to complete switch operations.
>    These can impact energy efficiency, due to back to back hurry
>    callbacks. Using deferrable workqueue partly mitigates this.
>    However, deferring kworker can delay reclaims.
> 
> 4. Back to back label switches can delay other percpu users, as
>    there is a single global switch spinlock used by percpu refcount
>    lib.
> 
> 5. Long running kworker can delay other use cases like system suspend.
>    This is mitigated using freezable workqueue and litming node
>    scans to a max count.
> 
> 6. There is a window where label operates is atomic mode, when its
>    counter is being checked for zero. This can potentially result
>    in high memory contention, during this window which spans RCU
>    grace period (plus callback execution). For example, when
>    scanning label corresponding to unconfined profile, all
>    applications which use unconfined profile would be using
>    atomic ref increment and decrement operations.
> 
>    There are a few apparoaches which were tried to mitigate this issue:
> 
>    a. At a lower time interval, check if scanned label's counter
>       has changed since the start of label scan. If there is a change
>       in count, terminate the switch to atomic mode. Below shows the
>       apparoch using rcuwait.
> 
>       static void aa_label_switch_atomic_confirm(struct percpu_ref *label_ref)
>       {
>          WRITE_ONCE(aa_atomic_switch_complete, true);
>          rcuwait_wake_up(&aa_reclaim_rcuwait);
>       }
> 
>       rcuwait_init(&aa_reclaim_rcuwait);
>       percpu_ref_switch_to_atomic(&label->count, aa_label_switch_atomic_confirm);
> 
>       atomic_count = percpu_ref_count_read(&label->count);
>       do {
>         rcuwait_wait_event_timeout(&aa_reclaim_rcuwait,
>                            (switch_complete = READ_ONCE(aa_atomic_switch_complete)),
>                            TASK_IDLE,
>                            msecs_to_jiffies(5));
>         if (percpu_ref_count_read(&label->count) != atomic_count)
>                 break;
>        } while (!READ_ONCE(switch_complete));
> 
>        However, this approach does not work, as percpu refcount lib does not
>        allow termination of an ongoing switch operation. Also, the counter
>        can return to the original value with set of get() and put() operations
>        before we check the current value.
> 
>    b. Approaches to notify the reclaim kworker from ref get and put operations
>       can potentially disturb cache line state between the various CPU
>       contexts, which are referncing the label, and can potentially impact
>       scalability again.
> 
>    c. Swith the label to an immortal percpu ref, while the scan operates
>       on the current counter. 
> 
>       Below is the sequence of operations to do this:
> 
>       1. Ensure that both immortal ref and label ref are in percpu mode.
>          Reinit the immortal ref in percpu mode.
> 
>          Swap percpu and atomic counters of label refcount and immortal ref
> 	                          percpu-ref
>       	                  +-------------------+                
>       +-------+           |  percpu-ctr-addr1 |                
>       | label | --------->|-------------------|    +----------------+ 
>       +-------+           |   data            |--->| Atomic counter1| 
>                           +-------------------+    +----------------+ 
>       +-------+           +-------------------+                
>       |ImmLbl |---------->|  percpu-ctr-addr2 |    +----------------+
>       +-------+           |-------------------|--->| Atomic counter2|
>                           |    data           |    +----------------+
>                           +-------------------+                
> 
>           label ->percpu-ctr-addr  = percpu-ctr-addr2
>           ImmLbl ->percpu-ctr-addr = percpu-ctr-addr1
>           label ->data->count      = Atomic counter2
>           ImmLbl ->data->count     = Atomic counter1
>   
>   
>       2. Check the counters collected in immortal label, by switch it
>          to atomic mode.
> 
>       3. If the count is 0, do,
>          a. Switch immortal counter to percpu again, giving it an
>             initial count of 1.
>          b. Swap the label and immortal counters again. The immortal
>             ref now has the counter values from new percpu ref get
>             and get operations on the label ref, from the point
>             when we did the initial swap operation.
>          c. Transfer the percpu counts in immortal ref to atomic
>             counter of label percpu refcount.
>          d. Kill immortal ref, for reinit on next iteration.
>          e. Switch label percpu ref to atomic mode.
>          f. If the counter is 1, drop the initial ref.
> 
>        4. If the count is not 0, re-swap the counters.
>           a. Switch immortal counter to percpu again, giving it an
>              initial count of 1.
>           b. Swap the label and immortal counters again. The immortal
>              ref now has the counter values from new percpu ref get
>              and get operations on the label ref, from the point
>              when we did the initial swap operation.
>           c. Transfer the percpu counts in immortal ref to atomic
>              counter of label percpu refcount.
>           d. Kill immortal ref, for reinit on next iteration.
> 
> 
>           Using this approach, we ensure that, label ref users do not switch
>           to atomic mode, while there are active references on the label.
>           However, this approach requires multiple percpu ref mode switches
>           and adds high overhead and complexity to the scanning code.
> 
> Extended/Future Work
> ====================
> 
> 1. Look for ways to fix the limitations, as described in the "Limitations"
>    section.
> 
> 2. Generalize the approach to percpu rcuref, which is used for contexts
>    where release path uses RCU grace period for release operations. Patch
>    7 creates an initial prototype for this.
> 
> 3. Explore hazard pointers for scalable refcounting of labels.
> 
> Highly appreciate any feedback/suggestions on the design approach.
> 
> The patches of this patchset introduce following changes:
> 
> 1.      Documentation of Apparmor Refcount management.
> 
> 2.      Switch labels to percpu refcount in atomic mode.
> 
>         Use percpu refcount for apparmor labels. Initial patch to init
>         the percpu ref in atomic mode, to evaluate the potential
>         impact of percpuref on top of kref based implementation.
> 
> 3.      Switch unconfined namespaces refcount to percpu mode.
> 
>         Switch unconfined ns labels to percpu mode, and kill the
>         initial refcount from namespace destroy path.
> 
> 4.      Add infrastructure to reclaim percpu labels.
> 
>         Add a label reclaim infrastructure for labels which are
>         in percpu mode, for managing their inital refcount.
> 
> 5.      Switch intree labels to percpu mode.
> 
>         Use label reclaim infrastruture to manage intree labels.
> 
> 6.      Initial prototype for optimizing ref switch.
> 
>         Prototype for reducing the time window when a label
>         scan switches the label ref to atomic mode.
> 
> 7.      percpu-rcuref: Add basic infrastructure.
> 
>         Prototype for Percpu refcounts for objects, which protect
>         their object reclaims using RCU grace period.
> 
> 8.      Switch labels to percpu rcurefcount in unmanaged mode.
> 
>         Use percpu rcuref for labels. Start with unmanaged/atomic
>         mode.
> 
> 9.      Switch unconfined and in tree labels to managed ref mode.
> 
>         Use percpu mode with manager worker for unconfined and intree
>         labels.
> 
> 
> ------------------------------------------------------------------------
> 
>  b/Documentation/admin-guide/LSM/ApparmorRefcount.rst |  351 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  b/Documentation/admin-guide/LSM/index.rst            |    1
>  b/Documentation/admin-guide/kernel-parameters.txt    |    8 +
>  b/include/linux/percpu-rcurefcount.h                 |  115 ++++++++++++++++
>  b/include/linux/percpu-refcount.h                    |    2
>  b/lib/Makefile                                       |    2
>  b/lib/percpu-rcurefcount.c                           |  336 +++++++++++++++++++++++++++++++++++++++++++++++
>  b/lib/percpu-refcount.c                              |   93 +++++++++++++
>  b/security/apparmor/include/label.h                  |   16 +-
>  b/security/apparmor/include/policy.h                 |    8 -
>  b/security/apparmor/include/policy_ns.h              |   24 +++
>  b/security/apparmor/label.c                          |   11 +
>  b/security/apparmor/lsm.c                            |  145 ++++++++++++++++++++
>  b/security/apparmor/policy_ns.c                      |    6
>  include/linux/percpu-refcount.h                      |    2
>  lib/percpu-refcount.c                                |   93 -------------
>  security/apparmor/include/label.h                    |   17 +-
>  security/apparmor/include/policy.h                   |   56 +++----
>  security/apparmor/include/policy_ns.h                |   24 ---
>  security/apparmor/label.c                            |   11 -
>  security/apparmor/lsm.c                              |  325 ++++++++++++----------------------------------
>  security/apparmor/policy_ns.c                        |    8 -
>  22 files changed, 1237 insertions(+), 417 deletions(-)
> 
> base-commit: ab27740f7665