[PATCH RFC 0/9] sk_buff: optimize layout for GRO

Thu Jul 22 18:41:30 UTC 2021

On Thu, Jul 22, 2021 at 12:59 PM Paolo Abeni <pabeni at redhat.com> wrote:
> On Thu, 2021-07-22 at 09:04 -0700, Casey Schaufler wrote:
> > On 7/22/2021 12:10 AM, Paolo Abeni wrote:
> > > On Wed, 2021-07-21 at 11:15 -0700, Casey Schaufler wrote:
> > > > On 7/21/2021 9:44 AM, Paolo Abeni wrote:
> > > > > This is a very early draft - in a different world it would be
> > > > > replaced by a hallway discussion at an in-person conference - aimed
> > > > > at outlining some ideas and collecting feedback on the overall outlook.
> > > > > There are still bugs to be fixed, more tests and benchmarks needed, etc.
> > > > >
> > > > > There are 3 main goals:
> > > > > - [try to] avoid the overhead for uncommon conditions at GRO time
> > > > >   (patches 1-4)
> > > > > - enable backpressure for the veth GRO path (patches 5-6)
> > > > > - reduce the number of cachelines used by the sk_buff lifecycle
> > > > >   from 4 to 3, at least in some common scenarios (patches 1,7-9).
> > > > >   The idea here is to avoid the initialization of some fields and
> > > > >   control their validity with a bitmask, as presented by at least
> > > > >   Florian and Jesper in the past.
> > > > If I understand correctly, you're creating an optimized case
> > > > which excludes ct, secmark, vlan and UDP tunnel. Is this correct,
> > > > and if so, why those particular fields? What impact will this have
> > > > in the non-optimal (with any of the excluded fields) case?
> > > Thank you for the feedback.
> >
> > You're most welcome. You did request comments.
> >
> > > There are 2 different relevant points:
> > >
> > > - the GRO stage.
> > >   packets carrying any of CT, dst, sk or skb_ext will do 2 additional
> > > conditionals per gro_receive WRT the current code. My understanding is
> > > that having any such field set at GRO receive time is quite
> > > exceptional for real NICs. All other packets will do 4 or 5 fewer
> > > conditionals, and will traverse a little less code.
> > >
> > > - sk_buff lifecycle
> > >   * packets carrying vlan and UDP will not see any differences: sk_buff
> > > lifecycle will still use 4 cachelines, as it currently does, and no
> > > additional conditional is introduced.
> > >   * packets carrying nfct or secmark will see an additional conditional
> > > every time such a field is accessed. The number of cachelines used will
> > > still be 4, as in the current code. My understanding is that when such
> > > access happens, there is already a relevant amount of "additional" code
> > > to be executed, so the conditional overhead should not be measurable.
> >
> > I'm responsible for some of that "additional" code. If the secmark
> > is considered to be outside the performance critical data there are
> > changes I would like to make that will substantially improve the
> > performance of that "additional" code that would include a u64
> > secmark. If use of a secmark is considered indicative of a "slow"
> > path, the rationale for restricting it to u32, that it might impact
> > the "usual" case performance, seems specious. I can't say that I
> > understand all the nuances and implications involved. It does
> > appear that the changes you've suggested could negate the classic
> > argument that requires the u32 secmark.
>
> I see now I did not reply to one of your questions - why I picked the
> vlan, tunnel and secmark fields to move to the sk_buff tail.
>
> Two main drivers on my side:
> - there are use cases/deployments that do not use them.
> - moving them around was doable in terms of required changes.
>
> There are no "slow-path" implications on my side. For example, vlan_*
> fields are very critical performance-wise, if the traffic is tagged.
> But surely there are busy servers not using tagged traffic which will
> enjoy the reduced cacheline footprint, and this changeset will not
> negatively impact the first case.
>
> Compared with the vlan example, secmark and nfct require an extra
> conditional to fetch the data. My understanding is that such an additional
> conditional is not measurable performance-wise when benchmarking the
> security modules (or conntrack) because they have to do much more
> interesting things after fetching a few bytes from an already hot cacheline.
>
> Not sure if the above somehow clarifies my statements.
>
> As for expanding secmark to 64 bits, I guess that could be an
> interesting follow-up discussion :)

The intersection between netdev and the LSM has a long and somewhat
tortured past with each party making sacrifices along the way to get
where we are at today.  It is far from perfect, at least from an LSM
perspective, but it is what we've got and since performance is usually
used as a club to beat back any changes proposed by the LSM side, I
would like to object to these changes that negatively impact the LSM
performance without some concession in return.  It has been a while
since Casey and I have spoken about this, but I think the preferred
option would be to exchange the current __u32 "sk_buff.secmark" field
with a void* "sk_buff.security" field, like so many other kernel level
objects.  Previous objections have eventually boiled down to the
additional space in the sk_buff for the extra bits (there is some
additional editorializing that could be done here, but I'll refrain),
but based on the comments thus far in this thread it sounds like
perhaps we can now make a deal here: move the LSM field down to a
"colder" cacheline in exchange for converting the LSM field to a
proper pointer.
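
To make the shape of that trade concrete, here is a rough user-space
sketch, not kernel code and not taken from the patch set.  It assumes a
validity bitmask in the always-initialized part of the buffer, roughly
along the lines Paolo describes, guarding a pointer-sized LSM field that
lives in the "cold" part.  Every name here (mock_skb, MOCK_SKB_F_SECURITY,
the helpers) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical validity bit; not a name from the kernel or the patches. */
#define MOCK_SKB_F_SECURITY (1u << 0)

/* User-space stand-in for sk_buff, reduced to what the sketch needs. */
struct mock_skb {
	/* "hot" part: always initialized at allocation time */
	uint32_t len;
	uint32_t valid;		/* which "cold" fields hold real data */

	/* "cold" part: left untouched unless the matching bit is set */
	void *security;		/* stand-in for a void* sk_buff.security */
};

/* Initialize only the hot part; the cold field is deliberately skipped. */
static inline void mock_skb_init(struct mock_skb *skb, uint32_t len)
{
	skb->len = len;
	skb->valid = 0;
}

/* Attach an LSM blob and mark the cold field as valid. */
static inline void mock_skb_set_security(struct mock_skb *skb, void *sec)
{
	assert(sec != NULL);
	skb->security = sec;
	skb->valid |= MOCK_SKB_F_SECURITY;
}

/* The one extra conditional: consult the bitmask before the cold field. */
static inline void *mock_skb_security(const struct mock_skb *skb)
{
	return (skb->valid & MOCK_SKB_F_SECURITY) ? skb->security : NULL;
}
```

A buffer that never has an LSM blob attached only pays for clearing the
bitmask; a reader that does want the blob pays for the one test in
mock_skb_security() before touching the cold field.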

Thoughts?

-- 
paul moore
www.paul-moore.com