[RFC 0/6] Managed Percpu Refcount
Neeraj Upadhyay
Neeraj.Upadhyay at amd.com
Mon Sep 16 05:08:05 UTC 2024
Introduction
------------
This patch series adds a new "managed mode" to percpu-refcounts for
managing references for objects that are released after an RCU grace
period has passed since their last reference drop.
Typical usage patterns look like below:

	// Called with elevated refcount
	get()
		p = get_ptr();
		kref_get(&p->count);
		return p;

	get()
		rcu_read_lock();
		p = get_ptr();
		if (p && !kref_get_unless_zero(&p->count))
			p = NULL;
		rcu_read_unlock();
		return p;

	release()
		remove_ptr(p);
		call_rcu(&p->rcu, freep);

	release()
		remove_ptr(p);
		kfree_rcu(p, rcu);
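The difference between the two get() variants above can be illustrated with a small userspace sketch using C11 atomics. This is only a model: struct model_ref and the model_ref_*() helpers are hypothetical names, not kernel APIs, and the RCU protection that keeps the object's memory valid during the lookup is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a kref-style counter. */
struct model_ref {
	atomic_uint count;
};

void model_ref_init(struct model_ref *r)
{
	atomic_init(&r->count, 1);	/* the initial reference */
}

/* First get() variant: the caller already holds a reference. */
void model_ref_get(struct model_ref *r)
{
	unsigned int old = atomic_fetch_add(&r->count, 1);

	assert(old != 0);	/* count cannot be zero while we hold a ref */
}

/* Second get() variant: take a reference only if the object is live. */
bool model_ref_get_unless_zero(struct model_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Returns true when the caller dropped the last reference. */
bool model_ref_put(struct model_ref *r)
{
	return atomic_fetch_sub(&r->count, 1) == 1;
}
```

Once model_ref_put() has returned true, model_ref_get_unless_zero() can only fail, which is what allows the release path to defer freeing to an RCU callback.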
Requirement and Use Case
------------------------
Percpu refcount requires an explicit percpu_ref_kill() operation at the
object's usage site where the initial ref count is being dropped. For
optimal performance, the object's usage should reach a teardown point,
after which the references shouldn't be acquired or released frequently
before the final reference is dropped. Following the percpu_ref_kill(),
any refcount operations on the object are carried out on the
centralized atomic counter. The performance and scalability of those
usages decrease if the references are still being added or removed
after the percpu_ref_kill() operation because of the atomic counter's
cache line ping-pong between CPUs.
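The effect of percpu_ref_kill() described above can be sketched with a deliberately simplified, single-threaded userspace model. All names here (struct model_percpu_ref, MODEL_NR_CPUS and the model_*() helpers) are hypothetical; the real implementation additionally uses RCU to confirm the mode switch and biases the atomic counter, neither of which is modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Hypothetical, single-threaded model; not the kernel's struct percpu_ref. */
struct model_percpu_ref {
	long percpu_count[MODEL_NR_CPUS];	/* fast path: one slot per CPU */
	long atomic_count;			/* shared counter used after the switch */
	bool atomic_mode;
};

void model_init(struct model_percpu_ref *ref)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		ref->percpu_count[cpu] = 0;
	ref->atomic_count = 1;	/* the initial reference */
	ref->atomic_mode = false;
}

void model_get(struct model_percpu_ref *ref, int cpu)
{
	if (ref->atomic_mode)
		ref->atomic_count++;		/* all CPUs share this counter */
	else
		ref->percpu_count[cpu]++;	/* CPU-local, no sharing */
}

/* Returns true if this put dropped the last reference. */
bool model_put(struct model_percpu_ref *ref, int cpu)
{
	if (ref->atomic_mode)
		return --ref->atomic_count == 0;
	ref->percpu_count[cpu]--;	/* percpu mode cannot detect zero */
	return false;
}

/* Fold the per-CPU deltas into the shared counter. */
void model_switch_to_atomic(struct model_percpu_ref *ref)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		ref->atomic_count += ref->percpu_count[cpu];
		ref->percpu_count[cpu] = 0;
	}
	ref->atomic_mode = true;
}

/* Model of kill: switch to atomic mode, then drop the initial ref. */
bool model_kill(struct model_percpu_ref *ref)
{
	model_switch_to_atomic(ref);
	return --ref->atomic_count == 0;
}
```

In this model every get/put after model_kill() touches atomic_count, which corresponds to the shared cache line that the text above identifies as the source of contention.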
The throughput scalability issue that is seen when Nginx runs with the
AppArmor linux security module enabled is the primary motivation for
this change. Performance profiling shows that memory contention in the
atomic_fetch_add and atomic_fetch_sub operations carried out in
kref_get() and kref_put() operations on AppArmor labels accounts for
the majority of CPU cycles. Further information regarding the
performance impact on Nginx throughput scalability and the enhancements
achieved through percpu references can be found in [1].
However, because of the way references are used in AppArmor, switching
from kref usage to per-cpu refcount was found to be non-trivial.
Although the specifics of AppArmor refcount management have already
been covered at [1], the explanation that follows updates it with more
detailed (and hopefully more accurate) information that supports the
requirement for a managed percpu ref.
Within the AppArmor framework, label (struct aa_label) manages
references for different kinds of objects. Labels are associated with:
- Profiles for applications.
- Namespaces, via their unconfined profile.
- Audit, secmark rules and compound labels.
Labels are referenced by file contexts, security contexts, secids, and
sockets.
The diagram below illustrates the relationship between different
AppArmor objects via their label references.
                        ----------------
                       | Root Namespace |
                        ----------------
                       /  ^        |  ^
                     (a)  |       (c) |
                     /   (b)       | (d)
                    v   /          v  |
            ------------          -----------------
           | Profile 1  |        | Child Namespace |
            ------------          -----------------
              |   ^                  |    ^
             (e)  |                 (g)   |
              |  (f)                 |   (h)
              v   |                  v    |
         ---------------          -----------
        | Child Profile |        | Profile 2 |
         ---------------          -----------
                  ^                 ^
                   \               /
                    \             /
                     \           /
                          (i)
                           |
                    ----------------
                   | Compound Label |
                    ----------------
(a) The root namespace keeps track of every profile that exists in it.
A reference to the profile is taken for this when the profile is
loaded and unpacked. This reference also serves as the profile's
**init reference**.
(b) Root namespace is referenced by a profile that is part of it.
(c) To control confinement within a certain domain, such as a chroot
environment, a root namespace may include child namespaces. Through
each child namespace's unconfined label, the subnamespaces list in
the root namespace maintains a (init) reference to child
namespaces.
(d) A child namespace maintains a reference to its parent namespace.
(e) A profile can have child subprofiles, which are called hat profiles.
Certain program segments can be run with permissions differing
from the base permissions using these profiles. For instance,
executing user-supplied CGI programs in a different Apache profile,
or running authorized and unauthenticated traffic in several
OpenSSH profiles. By use of its policy profiles list, the parent
profile maintains a reference to the child subprofiles. This serves
as the child profile's init reference.
(f) Child profiles keep a reference to their parent profile.
(g) Child namespace keeps a reference to all profiles in it.
(h) A reference to the parent non-root namespace is maintained by child
profiles.
(i) Context-specific application confinement is applied using
compound/stack labels. When ls is started from bash, for
instance, the confinement rules for the profile /bin/bash///bin/ls
may differ from the system-level rules for ls execution. Compound
labels are vectors of profiles and maintain a reference to every
profile in their vector.
Label references
----------------
- Tasks are linked to labels via the security field of their cred. The
cred label is copied from the parent task during the bprm exec's cred
preparation, and the bprm is transitioned to the new label using the
parent task's profile transition rules. A compound/stack label or the
label of a single profile may be used in the transition depending on
the perms rule for the bprm's path.
When performing policy checks in AppArmor's security hooks for
operations like file permissions, mkdir, rmdir, mount, and so on, the
label linked to the task's cred is used. When the associated label is
marked as stale, the cred label of a task can change (from within its
context) while it is being executed.
A task maintains references to previous labels for hat transitions,
onexec labels, and nnp (no new privilege) labels for exec domain
transition checks.
Labels are cached in file context for file permissions checks on open
files. As a result of task profile updates, this label is updated
with new profiles from the task's current label during revalidations
of cached file permissions.
- Socket contexts store the labels of the current task and peer.
- Profile fs maintains references to the label proxy and namespace in
the inode->i_private fields.
- The label parsed from the rule string is referenced by Secmark rule
objects.
- The label parsed from the rule string is referenced by audit rule
objects.
Label's Initial Ref Teardown
----------------------------
- When a profile is deleted, the initial reference on its label is
dropped and it is no longer a part of the parent namespace or
parent profile. Furthermore, every one of its child profiles is
deleted recursively. As a result, all profiles that are reachable
from the base profile have their initial reference removed in a
cascaded manner.
- When a namespace is destroyed, the initial reference to its
unconfined label is dropped and it is removed from the parent
namespace view. Furthermore, all profiles in that namespace,
all sub namespaces, and all profiles inside those sub namespaces
are recursively removed and their initial label reference is dropped.
- References held on parent objects are dropped when a label is
released after its last reference drop. A profile's parent profile
and namespace references are dropped upon ref release. On the
namespace ref release path, a namespace drops its reference to its
parent namespace. As part of the label release, references to
profiles in the compound label's vector are dropped.
Stale Labels and Label Redirection
----------------------------------
- The label associated with profile/namespace that is deleted is marked
as stale. When any profile of a compound label is stale, the compound
label is also marked stale.
- A label's proxy is used to redirect stale labels to the most recent or
active version of the object. For example, when a profile is deleted,
its proxy is redirected to the unconfined label of the namespace. This
indicates that every application that the profile confined has been
moved to an unconfined profile. In the same manner, the proxy is
redirected to the new profile's label when a profile is replaced. The
proxy of a namespace's unconfined label is redirected to the unconfined
label of its parent namespace on namespace deletion.
Redirection to the new label is done during the reference get operation:

	struct aa_label *aa_get_newest_label(struct aa_label *l)
	{
		/* Slot in the proxy that points to the current label. */
		struct aa_label __rcu **lp = &l->proxy->label;
		struct aa_label *c;

		rcu_read_lock();
		do {
			/* Retry if the label just read is already being released. */
			c = rcu_dereference(*lp);
		} while (c && !kref_get_unless_zero(&c->count));
		rcu_read_unlock();

		return c;
	}
Label reclaims
--------------
A label is completely initialized when it is linked to a namespace.
Label destruction is deferred until the end of an RCU grace period which
starts after the last reference drop. Enqueuing an RCU callback for
label and associated object destruction is done from the ref release
callback.

	void aa_label_kref(struct kref *kref)
	{
		struct aa_label *label = container_of(kref, struct aa_label, count);
		struct aa_ns *ns = labels_ns(label);

		if (!ns) {
			/* Not linked to a namespace: free without a grace period. */
			label_free_switch(label);
			return;
		}
		/* Defer destruction until after an RCU grace period. */
		call_rcu(&label->rcu, label_free_rcu);
	}
Using Label Stale operation for percpu_ref_kill()?
--------------------------------------------------
Marking a label as stale can serve as a reference termination point
since stale labels are redirected to the current label linked to its
objects. There are other labels, though, that are not associated with
namespaces or profiles. These are compound labels linked to audit and
secmark rules, or to running tasks that hold those label references in
their cred structure. They include:
- The label that is created from rule string is referenced by audit
rules. It is possible that a multi element vector audit rule label
already exists in the root labelset or that a new label is created
during audit rule init. The reference is removed upon audit rule
free. It's possible that the created label is actively referenced
from other contexts, causing atomic contention on the label's ref
operations if percpu_ref_kill() is called on audit rule free.
- The stacked labels which are created on profile exec/domain
transitions are stored in task's cred structure. These labels are
released when all tasks drop their cred reference to those labels.
- Transition labels which are created during change hat or change
profile transitions could be referenced by multiple tasks. These
labels are released when all tasks drop their cred reference to
those labels.
- Tasks' most recent label is combined with and cached in open file
contexts. These cached labels don't have a defined termination point
and can be actively referenced from multiple contexts.
- Other compound labels with similar ref lifetimes include pivotroot
and secmark rules.
There exist further scenarios in which stale labels may still be
referenced:
- Stale flags on labels are set using plain writes, and until a CPU
observes the stale flag, references may still be taken or dropped
on the stale label.
- A task may reference a namespace which is marked stale.
- For a stale cred label whose proxy points to its namespace's stale
unconfined label, the stale unconfined label can be referenced until
the cred label is updated.
In summary, though percpu_ref_kill() can be used for labels when they
are marked stale, compound labels are not guaranteed to be marked stale
during their lifetime and they do not have a context where
percpu_ref_kill() can be done.
Proposed Solution
-----------------
The solution proposed here attempts to address the issue of
identifying the init reference drop context. A percpu ref manager
thread keeps an extra reference to the ref. This additional reference
is used as a (pseudo) init reference to the object. A percpu managed
ref instance offloads its ref's release work to the ref manager thread.
The ref manager thread uses the following sequence to periodically scan
the list of managed refs and determine whether a ref is active:
	scan_ref() {
		bool active;

		percpu_ref_switch_to_atomic_sync(&ref);
		rcu_read_lock();
		percpu_ref_put(&ref);
		active = percpu_ref_tryget(&ref);
		rcu_read_unlock();
		if (active)
			percpu_ref_switch_to_percpu(&ref);
	}

The sequence above converts the ref to atomic mode, drops the
pseudo-init reference, and checks (within RCU read-side protection)
whether all references have been dropped. If there are still active
references, the try operation re-acquires the pseudo-init reference and
the ref is switched back to percpu mode.
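The counting logic of the scan can be followed in a single-threaded userspace sketch. It is a model only: struct model_managed_ref and the model_*() helpers are hypothetical, the mode switch is reduced to a flag, and the RCU read-side section that keeps the object valid between the put and the tryget is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a managed ref; not the kernel's struct percpu_ref. */
struct model_managed_ref {
	long count;		/* user refs + 1 pseudo-init ref held by the manager */
	bool atomic_mode;
	bool released;		/* set when the release path has run */
};

void model_managed_init(struct model_managed_ref *r)
{
	r->count = 1;		/* only the manager's pseudo-init reference */
	r->atomic_mode = false;
	r->released = false;
}

void model_user_get(struct model_managed_ref *r)
{
	assert(!r->released);
	r->count++;
}

void model_user_put(struct model_managed_ref *r)
{
	r->count--;
	assert(r->count >= 1);	/* the manager's ref keeps the object alive */
}

/* One manager scan; returns true if the ref is still in use. */
bool model_scan_ref(struct model_managed_ref *r)
{
	bool active;

	r->atomic_mode = true;		/* stands in for the switch to atomic mode */
	if (--r->count == 0)		/* drop the pseudo-init reference */
		r->released = true;	/* no users left: release the object */
	active = r->count != 0;		/* tryget succeeds only on a non-zero count */
	if (active) {
		r->count++;		/* pseudo-init reference re-acquired */
		r->atomic_mode = false;	/* back to percpu mode */
	}
	return active;
}
```

In this model a ref with users survives the scan with its count unchanged, while a ref held only by the manager is released on the first scan after its last user put.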
The two approaches used in this patch series, with slightly differing
permitted ref mode switches and semantics, are listed below.
Approach 1
----------
Approach 1 is implemented in patch 1 and has the following semantics
for ref init and switch.
a. Init
A ref can be set to managed mode at initialization time in
percpu_ref_init(), by passing the PERCPU_REF_REL_MANAGED flag, or by
calling percpu_ref_switch_to_managed() post init to switch a
reinitable ref to managed mode. Deferred switches are used in
situations like a module initialization error, when an initialized
ref is released before the object is used. One example of this is
the release of AppArmor labels which are not associated with a
namespace, which is done without waiting for an RCU grace period.
Below are the allowed initialization modes for a managed ref:

              Atomic  Percpu  Dead  Reinit  Managed
Managed-ref     Y       N      Y      Y       Y
b. Switching modes and operations
Below are the allowed transitions for a managed ref:

To -->    A   P   P(RI)  M   D   D(RI)  D(RI/M)  KLL  REI  RES
A         y   n   y      y   n   y      y        y    y    y
P         n   n   n      n   y   n      n        y    n    n
M         n   n   n      y   n   n      y        n    y    y
P(RI)     y   n   y      y   n   y      y        y    y    y
D(RI)     y   n   y      y   n   y      y        -    y    y
D(RI/M)   n   n   n      y   n   n      y        -    y    y

Modes:
A     - Atomic               P       - PerCPU
M     - Managed              P(RI)   - PerCPU with ReInit
D(RI) - Dead with ReInit     D(RI/M) - Dead with ReInit and Managed

PerCPU Ref Ops:
KLL - Kill    REI - Reinit    RES - Resurrect
A percpu reference that has been switched to managed mode cannot be
switched back to any other active mode. Managed ref is reinitialized
to managed mode upon reinit/resurrect.
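For reference, the Approach 1 transition table can be encoded as a small lookup, which makes it easy to check individual transitions. This is an illustrative userspace sketch; the enum names and approach1_allowed() are hypothetical and not part of the series. Treating the "-" entries as "not allowed" is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Source modes: the rows of the table. */
enum ref_from {
	FROM_A, FROM_P, FROM_M, FROM_P_RI, FROM_D_RI, FROM_D_RI_M, FROM_COUNT
};

/* Target modes and ops: the columns of the table. */
enum ref_to {
	TO_A, TO_P, TO_P_RI, TO_M, TO_D, TO_D_RI, TO_D_RI_M,
	TO_KLL, TO_REI, TO_RES, TO_COUNT
};

/* One string per row; one character per column, copied from the table. */
static const char approach1_table[FROM_COUNT][TO_COUNT + 1] = {
	[FROM_A]      = "ynyynyyyyy",
	[FROM_P]      = "nnnnynnynn",
	[FROM_M]      = "nnnynnynyy",
	[FROM_P_RI]   = "ynyynyyyyy",
	[FROM_D_RI]   = "ynyynyy-yy",
	[FROM_D_RI_M] = "nnnynny-yy",
};

/* Assumption: '-' entries are treated as not allowed. */
bool approach1_allowed(enum ref_from from, enum ref_to to)
{
	assert(from < FROM_COUNT && to < TO_COUNT);
	return approach1_table[from][to] == 'y';
}
```

For example, approach1_allowed(FROM_M, TO_A) is false, matching the statement that a managed ref cannot be switched back to another active mode, while approach1_allowed(FROM_M, TO_REI) is true.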
Approach 2
----------
The second approach gives a managed reference greater runtime mode
switching flexibility. This may be helpful in situations where the object
of a managed reference can enter a shutdown phase in some scenarios. For
example, for stale singular/compound labels, a user can directly call
percpu_ref_kill() on the ref rather than waiting for the manager
thread to process the ref.
The init modes are the same as in the previous approach. Runtime mode
switching provides the ability to convert from managed mode to
unmanaged mode, hence enabling transitions to all reinitable modes.
To -->    A   P   P(RI)  M   D   D(RI)  D(RI/M)  KLL  REI  RES
A         y   n   y      y   n   y      y        y    y    y
P         n   n   n      n   y   n      n        y    n    n
M         y*  n   y*     y   n   y*     y        y*   y    y
P(RI)     y   n   y      y   n   y      y        y    y    y
D(RI)     y   n   y      y   n   y      y        -    y    y
D(RI/M)   y*  n   y*     y   n   y*     y        -    y    y
(RI) refers to modes whose initialization was done using
PERCPU_REF_ALLOW_REINIT. The aforementioned transitions are permitted
and may be indirect transitions. For example, a managed ref switches to
P(RI) mode when percpu_ref_switch_to_unmanaged() is invoked on it;
percpu_ref_switch_to_atomic() can then be used to switch from P(RI)
mode to A mode.
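The indirect transition from the example (managed to P(RI), then P(RI) to atomic) can be pictured as a tiny state machine. This is a userspace sketch with hypothetical names; in particular, refusing a direct switch from managed to atomic mode is an assumption of the model, not a statement about how the patches behave.

```c
#include <assert.h>
#include <stdbool.h>

/* Subset of the modes from the table above. */
enum model_mode { MODE_A, MODE_P_RI, MODE_M };

struct model_mode_ref {
	enum model_mode mode;
};

/* Stands in for percpu_ref_switch_to_unmanaged(): M -> P(RI). */
bool model_to_unmanaged(struct model_mode_ref *r)
{
	if (r->mode != MODE_M)
		return false;
	r->mode = MODE_P_RI;
	return true;
}

/*
 * Stands in for percpu_ref_switch_to_atomic(): P(RI) -> A.
 * Model assumption: a managed ref must be made unmanaged first.
 */
bool model_to_atomic(struct model_mode_ref *r)
{
	if (r->mode == MODE_M)
		return false;
	r->mode = MODE_A;
	return true;
}

/* Indirect M -> A transition, composed of the two direct steps. */
bool model_managed_to_atomic(struct model_mode_ref *r)
{
	if (!model_to_unmanaged(r))
		return false;
	assert(r->mode == MODE_P_RI);	/* intermediate mode */
	return model_to_atomic(r);
}
```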
Design Implications
-------------------
1. Deferring the release of a referenced object to the manager thread
may delay its memory release. This can result in memory pressure.
By turning a managed ref into an unmanaged ref and then
executing percpu_ref_kill() on it at known shutdown points in
the execution, this issue can be partially resolved using the
second approach.
Flushing the scanning work under memory pressure is another strategy
that can be used.
2. call_rcu_hurry() is used by the percpu refcount lib to perform mode
switch operations. Back-to-back hurry callbacks can impact energy
efficiency. The current implementation allows moving the execution
to housekeeping cores by using an unbound workqueue. A deferrable
timer can be used to prevent these invocations when the core is
idle by delaying the worker execution. Deferring, though, may cause
ref reclaims to be delayed.
3. Since the percpu refcount lib uses a single global switch spinlock,
back-to-back label switches can delay other percpu users.
4. Long running kworkers may cause other use cases, such as system
suspend, to be delayed. By using a freezable work queue and limiting
node scans to a maximum count, this is mitigated.
5. Because all managed refs undergo the switch-to-atomic mode operation
serially, an inactive ref must wait for all prior grace periods to
complete before it can be assessed. Ref release may be greatly
delayed as a result of this. Batching ref switches can be one
method to deal with this, ensuring that all of those RCU callbacks
are covered by a single grace period.
6. A label's refcount can operate in atomic mode within the window
while its counter is being checked for zero. This could lead to
high memory contention within the RCU grace period (together with
callback execution) duration. In AppArmor, all applications that use
unconfined profiles will execute atomic ref increment and decrement
operations on the ref during that window if the currently scanned
label belongs to an unconfined profile. In order to handle this,
a prototype is described and implemented in [1], which replaces the
atomic and percpu counters of the scanned ref with a temporary
percpu ref. Given that the grace period window is of small duration
(compared to the scan interval), overall impact of this might not be
significant enough to consider the massive complexity of that
prototype implementation. This problem requires more investigation
in order to find a simpler solution.
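The mitigation mentioned in item 4, limiting node scans to a maximum count, could look roughly like the following userspace sketch. The structure, the MODEL_MAX_SCAN limit and the cursor-based resume are assumptions made for illustration and are not taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_SCAN 3	/* hypothetical per-pass limit */

/* Hypothetical scan state for the manager's worker. */
struct model_scan_state {
	size_t nr_refs;			/* number of managed refs on the list */
	size_t cursor;			/* index of the next ref to scan */
	unsigned int *scan_count;	/* per-ref scan counters, for illustration */
};

/* One worker pass: scan at most MODEL_MAX_SCAN refs, then yield. */
size_t model_scan_pass(struct model_scan_state *s)
{
	size_t scanned = 0;

	while (scanned < MODEL_MAX_SCAN && scanned < s->nr_refs) {
		assert(s->cursor < s->nr_refs);
		s->scan_count[s->cursor]++;	/* stands in for scan_ref() */
		s->cursor = (s->cursor + 1) % s->nr_refs;
		scanned++;
	}
	return scanned;
}
```

Because the cursor persists across passes, each pass does a bounded amount of work while every ref is still visited in turn.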
Extended/Future Work
--------------------
1. Another design approach that was considered is to define a new
percpu rcuref type for RCU managed percpu refcounts. This approach
is prototyped in [1]. Although this approach provides cleaner
semantics w.r.t. mode switches and allowed operations, its current
implementation, using composition of percpu ref, could be suboptimal
in terms of the struct's cacheline space requirement and feature
extensibility. An independent implementation would require
refactoring of the common logic out of the percpu refcount
implementation. Additionally, the users of the new API could require
the modes (e.g. ref kill/reinit) supported by percpu refcount.
Extending percpu rcuref to support this can result in duplication
of functionality/semantics between the two percpu ref types.
2. Explore hazard pointers for scalable refcounting of objects, which
provides a more generic solution and has more efficient memory
space requirements.
Below is the organization of the patches in this series:
1. Implementation of the first approach described in the "Proposed
Solution" section.
2. Torture test for managed ref to validate early ref release and
imbalanced refcount.
The test is verified on an AMD 4th Generation EPYC Processor with
96C/192T with the following test parameters:
nusers = 300
nrefs = 50
niterations = 50000
onoff_holdoff = 5
onoff_interval = 10
3. Implementation of the second approach described in the "Proposed
Solution" section.
4. Updates to torture test to test runtime mode switches from managed
to unmanaged modes.
5. Switch Label refcount management to percpu ref in atomic mode.
6. Switch Label refcount management to managed mode.
Any feedback/suggestions on the design approach would be highly appreciated.
[1] https://lore.kernel.org/lkml/20240110111856.87370-7-Neeraj.Upadhyay@amd.com/T/
- Neeraj
Neeraj Upadhyay (6):
  percpu-refcount: Add managed mode for RCU released objects
  percpu-refcount: Add torture test for percpu refcount
  percpu-refcount: Extend managed mode to allow runtime switching
  percpu-refcount-torture: Extend test with runtime mode switches
  apparmor: Switch labels to percpu refcount in atomic mode
  apparmor: Switch labels to percpu ref managed mode

 .../admin-guide/kernel-parameters.txt |  69 +++
 include/linux/percpu-refcount.h       |  14 +
 lib/Kconfig.debug                     |   9 +
 lib/Makefile                          |   1 +
 lib/percpu-refcount-torture.c         | 404 ++++++++++++++++++
 lib/percpu-refcount.c                 | 329 +++++++++++++-
 lib/percpu-refcount.h                 |   6 +
 security/apparmor/include/label.h     |  16 +-
 security/apparmor/include/policy.h    |   8 +-
 security/apparmor/label.c             |  12 +-
 security/apparmor/policy_ns.c         |   2 +
 11 files changed, 836 insertions(+), 34 deletions(-)
 create mode 100644 lib/percpu-refcount-torture.c
 create mode 100644 lib/percpu-refcount.h
--
2.34.1