Datapath overhaul: zero-copy metadata with ingot, 'Compiled' UFTs #585

Open · wants to merge 95 commits into master
Conversation


@FelixMcFelix (Collaborator) commented Aug 21, 2024

This PR rewrites the core of OPTE's packet model to use zero-copy packet parsing/modification via the ingot library. This enables a few changes which get us just shy of the 3Gbps mark.

  • [2.36 -> 2.7] The use of ingot for modifying packets in both the slowpath (UFT miss) and existing fastpath (UFT hit).
    • Parsing is faster -- we no longer copy out all packet header bytes onto the stack, and we do not allocate a vector to decompose an mblk_t into individual links.
    • Packet field modifications are applied directly to the mblk_t as they happen, and field reads are made from the same source.
    • Non-encap layers are not copied back out.
  • [2.7 -> 2.75] Packet L4 hashes are cached as part of the UFT, speeding up multipath route selection over the underlay.
  • [2.75 -> 2.8] Incremental Internet checksum recalculation is performed only when applicable fields change on inner flow headers (e.g., NAT'd packets); a minimal sketch of the update arithmetic follows this list.
    • VM-to-VM / intra-VPC traffic is the main use case here.
  • [2.8 -> 3.05] NetworkParsers now have the concept of inbound & outbound LightweightMeta formats. These support the key operations needed to execute all our UFT flows today (FlowId lookup, inner header modification, encap push/pop, cksum update).
    • This also allows us to pre-serialize any bytes to be pushed in front of a packet, speeding up EmitSpec.
    • This is crucial for outbound traffic in particular, which has far smaller (in struct-size) metadata.
    • UFT misses or incompatible flows fall back to using the full metadata.
  • [3.05 -> 2.95] TCP state tracking uses a separate per-flow lock and does not require any lookup from a UFT.
    • I do not have numbers on how large the performance loss would be if we held the Port lock for the whole time.
  • (Not measured) Packet/UFT L4 Hashes are used as the Geneve source port, spreading inbound packets over NIC Rx queues based on the inner flow.
    • This is now possible because of #504 (Move underlay NICs back into H/W Classification) -- software classification would have limited us to the default inbound queue/group.
    • I feel bad for removing one more FF7 reference, but that is the way of these things. RIP port 7777.
    • Previously, Rx queue affinity was derived solely from (Src Sled, Dst Sled).
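
As an illustration of the incremental checksum point above: RFC 1624 lets a ones'-complement checksum be patched when a single 16-bit field changes, without re-summing the rest of the packet. The sketch below is just that arithmetic in isolation -- the function name and test are illustrative, not OPTE's actual code.

/// Incrementally update a ones'-complement checksum when one 16-bit
/// field changes from `old` to `new` (RFC 1624, eqn. 3):
///   HC' = ~(~HC + ~m + m')
fn incremental_csum_update(csum: u16, old: u16, new: u16) -> u16 {
    let mut sum = (!csum as u32) + (!old as u32) + (new as u32);
    // Fold carries back into the low 16 bits.
    while sum > 0xffff {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn patch_matches_full_recompute() {
        // Full ones'-complement checksum over 16-bit words.
        fn full(ws: &[u16]) -> u16 {
            let mut s: u32 = ws.iter().map(|&w| w as u32).sum();
            while s > 0xffff {
                s = (s & 0xffff) + (s >> 16);
            }
            !(s as u16)
        }

        // Two-word "header": patching word 1 incrementally must agree
        // with recomputing the checksum from scratch.
        let before = full(&[0x1234, 0xabcd]);
        let patched = incremental_csum_update(before, 0xabcd, 0x4321);
        assert_eq!(patched, full(&[0x1234, 0x4321]));
    }
}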

There are several other changes here made to how OPTE functions which are needed to support the zero-copy model.

  • Each collected set of header transforms is Arc<>'d, such that we can apply them outside of the Port lock.
  • FlowTable<S>s now store Arc<FlowEntry<S>>, rather than FlowEntry<S>.
    • This enables the UFT entry for any flow to store its matched TCP flow, update its hit count and timestamp, and then update the TCP state without reacquiring the Port lock.
    • This also drastically simplifies TCP state handling in fast path cases to not rely on post-transformation packets for lookup.
  • Opte::process returns an EmitSpec which is needed to finalise a packet before it can be used.
    • I'm not too happy about the ergonomics, but we have this problem because otherwise we'd need Packet to have some self-referential fields when supporting other key parts of XDE (e.g., parse -> use fields -> select port -> process). A rough sketch of the new call shape is shown below.
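
To make that concrete, the call pattern now looks roughly like this (adapted from the ubench harness quoted later in this PR; names are approximate and unwraps stand in for real error handling):

// Borrow the mblk's segments and do a zero-copy lightweight parse.
let pkt = Packet::new(pkt_m.iter_mut());
let pkt = pkt.parse_outbound(VpcParser {}).unwrap();

// Transforms are applied to the mblk in place during processing.
let res = port.process(Out, pkt).unwrap();

// Only once all borrows of pkt_m have ended can the EmitSpec prepend
// its pre-serialised encap bytes and finalise the packet.
if let Modified(spec) = res {
    spec.apply(pkt_m);
}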

Closes #571, closes #481, closes #460.

Slightly alleviates #435.

Original testing notes.

This is not exactly a transformative increase, according to testing on glasgow, but it is an improvement of around 15--20% zone-to-zone vs #504:

root@a:~# iperf -c 10.0.0.1 -P8
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 39797 connected to 10.0.0.1 port 5201
[  6] local 10.0.0.2 port 55568 connected to 10.0.0.1 port 5201
[  8] local 10.0.0.2 port 55351 connected to 10.0.0.1 port 5201
[ 10] local 10.0.0.2 port 49474 connected to 10.0.0.1 port 5201
[ 12] local 10.0.0.2 port 61952 connected to 10.0.0.1 port 5201
[ 14] local 10.0.0.2 port 47930 connected to 10.0.0.1 port 5201
[ 16] local 10.0.0.2 port 53057 connected to 10.0.0.1 port 5201
[ 18] local 10.0.0.2 port 63541 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  38.2 MBytes   320 Mbits/sec
[  6]   0.00-1.00   sec  38.3 MBytes   321 Mbits/sec
[  8]   0.00-1.00   sec  38.3 MBytes   321 Mbits/sec
[ 10]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[ 12]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[ 14]   0.00-1.00   sec  38.0 MBytes   318 Mbits/sec
[ 16]   0.00-1.00   sec  38.1 MBytes   319 Mbits/sec
[ 18]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[SUM]   0.00-1.00   sec   305 MBytes  2.56 Gbits/sec

...

- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  43.0 MBytes   361 Mbits/sec
[  6]   9.00-10.00  sec  42.9 MBytes   359 Mbits/sec
[  8]   9.00-10.00  sec  42.8 MBytes   359 Mbits/sec
[ 10]   9.00-10.00  sec  42.7 MBytes   358 Mbits/sec
[ 12]   9.00-10.00  sec  42.9 MBytes   360 Mbits/sec
[ 14]   9.00-10.00  sec  42.9 MBytes   359 Mbits/sec
[ 16]   9.00-10.00  sec  43.0 MBytes   360 Mbits/sec
[ 18]   9.00-10.00  sec  42.8 MBytes   359 Mbits/sec
[SUM]   9.00-10.00  sec   343 MBytes  2.88 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  4]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[  6]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  6]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[  8]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  8]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[ 10]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  sender
[ 10]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  receiver
[ 12]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[ 12]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[ 14]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  sender
[ 14]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  receiver
[ 16]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[ 16]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[ 18]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  sender
[ 18]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  3.32 GBytes  2.85 Gbits/sec                  sender
[SUM]   0.00-10.00  sec  3.32 GBytes  2.85 Gbits/sec                  receiver

The only real caveat is that we have now cut the time spent doing non-MAC work down to the bone and, per lockstat, we are no longer the most-contended lock holder.
[attached image: tx]

Zooming in a little on a representative call (percentages here of CPU time across examined stacks):
[attached image]
For context, xde_mc_tx is listed as taking 39.92% on this path, and str_mdata_fastpath_put 21.50%. Packet parsing (3.36%) and processing (1.86%) are nice and low! So we're now spending less time on each packet than MAC and the device driver do.

@FelixMcFelix added this to the 11 milestone Aug 22, 2024
@FelixMcFelix (Collaborator, Author) commented:

A quick look in dtrace suggests that using an L4-hash-derived source port is indeed driving Rx traffic onto separate cores -- in a one-port scenario this puts us back to being the second most-contended lock during a -P 8 iperf run. (A single-threaded run remains fairly uncontended.)

-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller                  
260502  12%  41% 0.00      637 0xfffffcfa34386be8     _ZN4opte6engine4port13Port$LT$N$GT$12thin_process17h316b1c1b8ce14471E+0x7c

      nsec ------ Time Distribution ------ count                             
       256 |@@@@@@@@@@                     90486     
       512 |@@@@@@@@@@@                    97481     
      1024 |@@@@                           37487     
      2048 |@@                             20513     
      4096 |@                              11065     
      8192 |                               2786      
     16384 |                               533       
     32768 |                               116       
     65536 |                               20        
    131072 |                               13        
    262144 |                               1         
    524288 |                               1         

This doesn't really affect speed, but I expect it should mean that traffic for different ports will at least be able to avoid being processed on the same CPU in many cases. E.g., when sled $A$ hosts ports $A_1, A_2$ and sled $B$ hosts ports $B_1, B_2$, all $A \leftrightarrow B$ combinations previously had the same outer 5-tuple $(A_{\mathit{IP6}},B_{\mathit{IP6}},\mathit{UDP},7777,6081)$ -- and so an identical Rx queue mapping. There's work to be done to get contention down further, but that's beyond this PR's scope.
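
To make the queue-spreading concrete, here's a minimal sketch of the kind of mapping involved: hash the inner flow's identity and fold it into the outer Geneve UDP source port, so that the underlay NIC's RSS sees a distinct outer 5-tuple per inner flow. The hasher, port range, and type names below are illustrative assumptions, not OPTE's actual code.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::Ipv6Addr;

/// Inner flow identity (fields are illustrative).
#[derive(Hash)]
struct InnerFlowId {
    proto: u8,
    src_ip: Ipv6Addr,
    dst_ip: Ipv6Addr,
    src_port: u16,
    dst_port: u16,
}

/// Derive the outer (Geneve) UDP source port from an L4 hash of the
/// inner flow, keeping it in the ephemeral range. The outer 5-tuple
/// (src_ip6, dst_ip6, UDP, src_port, 6081) then differs per inner flow,
/// letting NIC RSS spread Rx across queues.
fn geneve_src_port(flow: &InnerFlowId) -> u16 {
    let mut h = DefaultHasher::new();
    flow.hash(&mut h);
    // Map into the dynamic/ephemeral port range 49152..=65535.
    49152 + (h.finish() % 16384) as u16
}

fn main() {
    let flow = InnerFlowId {
        proto: 6, // TCP
        src_ip: "fd00::1".parse().unwrap(),
        dst_ip: "fd00::2".parse().unwrap(),
        src_port: 39797,
        dst_port: 5201,
    };
    println!("outer UDP source port: {}", geneve_src_port(&flow));
}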

@twinfees added the customer label (for any bug reports or feature requests tied to customer requests) Aug 27, 2024
@FelixMcFelix self-assigned this Sep 6, 2024
Packet Rx is apparently 180% more costly now on `glasgow`.
TODO: find where the missing 250 Mbps has gone.
Notes from rough turning-off-and-on of the Old Way:

* Thin process is slower than it was before. I suspect this is due to
  the larger amount of state which has been shoved back into the full
  Packet<Parsed> type once again. We're at 2.8--2.9 rather than 2.9--3.
* Thin process has a bigger performance impact on the Rx pathway than
  Tx:
   - Rx-only: 2.8--2.9
   - Tx-only: 2.74
   - None:    2.7
   - Old:   <=2.5

There might be value in first-classing an extra parse state for the
cases where we know we don't need to do arbitrary, full-on transforms.
@FelixMcFelix marked this pull request as ready for review October 29, 2024 19:22
@FelixMcFelix (Collaborator, Author) commented:

Clippy clean, and the test suite is happy once again. Code is mostly reorganised how I'd like.

I've taken the time to test with some small instances locally in omicron, so external networking seems to be behaving. I saw around 3.5--4 Gbps local VM-to-VM over viona (iPerf3, Alpine Linux) in that setup.

One more round of self-review tomorrow.

zone = { git = "https://github.com/oxidecomputer/zone" }
ztest = { git = "https://github.com/oxidecomputer/falcon", branch = "main" }
poptrie = { git = "https://github.com/oxidecomputer/poptrie", branch = "multipath" }

[profile.release]
debug = 2
lto = true
@FelixMcFelix (Collaborator, Author):

Worth noting that lto has historically had no effect on my benchmarking, but in ingot's microbenchmarks it was fairly crucial.

I don't know if we want to define an alternate profile so that it is scoped to only be used on ubench and the kmodule -- opteadm takes quite a bit longer to build now.

Comment on lines +173 to +191
ParserKind::OxideVpc => {
    let pkt = Packet::new(pkt_m.iter_mut());
    let res = match dir {
        In => {
            let pkt = pkt.parse_inbound(VpcParser {}).unwrap();
            port.port.process(dir, black_box(pkt)).unwrap()
        }
        Out => {
            let pkt = pkt.parse_outbound(VpcParser {}).unwrap();
            port.port.process(dir, black_box(pkt)).unwrap()
        }
    };
    assert!(!matches!(res, ProcessResult::Drop { .. }));
    if let Modified(spec) = res {
        black_box(spec.apply(pkt_m));
    }
}
@FelixMcFelix (Collaborator, Author):

There are a few differences in what we're measuring here:

  • process is now effectively parse + process, because a packet requires a mutable reference over an mblk. If we want to figure out actual relative improvement, we need to subtract parse time.
  • The old benchmarks did not include the cost associated in decomposing an mblk chain into a vec of individual mblks.
  • I think we can't elide the drop cost of mblks here -- we need LargeInput as our iterator.

E.g., we have a benchmark sample like:

parse/ULP-FastPath/wallclock/V4-Tcp-OUT-0B
                        time:   [64.845 ns 65.433 ns 65.976 ns]
                        change: [-37.370% -36.515% -35.702%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

process/ULP-FastPath/wallclock/V4-Tcp-OUT-0B
                        time:   [189.99 ns 190.45 ns 190.99 ns]
                        change: [-19.914% -19.531% -19.088%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

So, pure process time is ~125.0 ns (the 190.45 ns process median minus the 65.433 ns parse median), which is roughly a 47.6% reduction versus the old process-only figure. We see a similar reduction in parse and process time in all fastpath and most slowpath cases (e.g., new TCP & UDP flows). The exceptions are hairpin packets (DHCP, ICMP, ...), where I might be doing something silly which I've yet to spot. Total allocation size looks to be up in some of those, while it's down universally elsewhere.

Full log below.

2024-10-30-bmark.log

@@ -316,6 +316,8 @@ pub type offset_t = c_longlong;
pub type pid_t = c_int;
pub type zoneid_t = id_t;

/// A standard boolean in illumos.
@FelixMcFelix (Collaborator, Author):

Recent nightly clippy runs have started complaining about doc-comment structure. There are a few such changes throughout.

Comment on lines -20 to -27
pub struct Vni {
    // A VNI is 24-bit. By storing it this way we don't have to check
    // the value on the opte-core side to know if it's a valid VNI, we
    // just decode the bytes.
    //
    // The bytes are in network order.
    inner: [u8; 3],
}
@FelixMcFelix (Collaborator, Author):

VNI (and its serialisation logic) has been pulled up into ingot.
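
For anyone who hasn't seen it, the idea being moved is "valid by construction": a 24-bit VNI is stored as three network-order bytes, so any stored value is a legal VNI. A toy restatement of that pattern (illustrative only; ingot's actual type and API differ):

/// Store a 24-bit VNI as 3 network-order bytes so that every stored
/// value is valid by construction -- no range check needed at read time.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Vni {
    inner: [u8; 3],
}

impl Vni {
    /// Reject values that don't fit in 24 bits.
    fn new(val: u32) -> Option<Self> {
        if val > 0x00ff_ffff {
            return None;
        }
        let b = val.to_be_bytes(); // [0, hi, mid, lo]
        Some(Self { inner: [b[1], b[2], b[3]] })
    }

    fn as_u32(&self) -> u32 {
        u32::from_be_bytes([0, self.inner[0], self.inner[1], self.inner[2]])
    }
}

fn main() {
    let vni = Vni::new(1701).unwrap();
    assert_eq!(vni.as_u32(), 1701);
    assert!(Vni::new(1 << 24).is_none()); // 25-bit value rejected
}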

) -> Result<LiteInPkt<MsgBlkIterMut<'_>, NP>, ParseError> {
    let pkt = Packet::new(pkt.iter_mut());
    pkt.parse_inbound(parser)
}
@FelixMcFelix (Collaborator, Author):

Packet<Initialized> feels a bit superfluous looking at this pattern, especially now that len is computed when we move into LiteParsed. We might want to do away with it.

Comment on lines -1078 to -1082
assert_eq!(
    off.inner.ip.unwrap(),
    HdrOffset { pkt_pos: pos, seg_idx: 0, seg_pos: pos, hdr_len },
);
pos += hdr_len;
@FelixMcFelix (Collaborator, Author):

Comment on this and similar removals: OPTE no longer stores the internal offsets of each header, given that all headers are backed by slices into the mblk.
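
To spell out why the offsets go away: when each header is just a view over the mblk's own bytes, reads and writes land directly in the buffer, so there is no separate offset bookkeeping and nothing to copy back. A toy, userland-only illustration of that pattern (not ingot's actual API):

/// A toy zero-copy view over an IPv4 header somewhere in a larger buffer.
struct Ipv4View<'a> {
    bytes: &'a mut [u8], // at least the 20-byte fixed header
}

impl<'a> Ipv4View<'a> {
    fn new(bytes: &'a mut [u8]) -> Option<Self> {
        (bytes.len() >= 20).then_some(Self { bytes })
    }

    fn ttl(&self) -> u8 {
        self.bytes[8]
    }

    fn set_ttl(&mut self, ttl: u8) {
        self.bytes[8] = ttl; // lands directly in the packet buffer
    }
}

fn main() {
    // Pretend this is one segment of an mblk chain.
    let mut segment = [0u8; 64];
    segment[8] = 64; // initial TTL

    let mut view = Ipv4View::new(&mut segment[..20]).unwrap();
    view.set_ttl(view.ttl() - 1);

    // The mutation is already visible in the underlying buffer.
    assert_eq!(segment[8], 63);
}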

// we remove and re-add the mblks to work on them.
// We might also want to return either a chain/mblk_t in an enum, but
// practically XDE will always assume it has a chain from MAC.
pub struct MsgBlkChain {
@FelixMcFelix (Collaborator, Author):

PacketChain -> MsgBlkChain. This type is basically unchanged, except that we now dequeue MsgBlks rather than Packets.

/// an Ethernet _frame_, but we prefer to use the colloquial
/// nomenclature of "packet".
#[derive(Debug)]
pub struct MsgBlk {
@FelixMcFelix (Collaborator, Author):

The mblk_t-relevant logic from Packet has been recreated here. The treatment of an mblk_t as an iterator of byte slices is new, and I think some extra scrutiny is warranted here.

There is an open design question around MsgBlks vs their child MsgBlkNodes. Currently, the nodes received from a MsgBlkIter(Mut) allow fewer operations than the top-level MsgBlk -- this needn't be the case, apart from operations that would invalidate the iterator.
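
For anyone less familiar with mblk_t: the new model treats a message block chain as an iterator that yields one byte slice per segment, rather than flattening or copying the chain. A rough, std-only analogue of that shape (not the real MsgBlkIter, which walks b_cont pointers in the kernel):

/// Stand-in for a segmented message: each Vec<u8> plays the role of one
/// mblk_t segment in a b_cont-linked chain.
struct FakeMsgBlk {
    segments: Vec<Vec<u8>>,
}

impl FakeMsgBlk {
    /// Present the chain as byte slices, the way a parser would consume
    /// it -- no flattening, no copying.
    fn iter(&self) -> impl Iterator<Item = &[u8]> + '_ {
        self.segments.iter().map(|s| s.as_slice())
    }

    fn byte_len(&self) -> usize {
        self.segments.iter().map(|s| s.len()).sum()
    }
}

fn main() {
    // E.g. headers in one segment, payload in another.
    let msg = FakeMsgBlk {
        segments: vec![vec![0xAA; 14], vec![0xBB; 1400]],
    };

    assert_eq!(msg.byte_len(), 1414);
    for (i, seg) in msg.iter().enumerate() {
        println!("segment {i}: {} bytes", seg.len());
    }
}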

Comment on lines +785 to +804
/// For the `no_std`/illumos kernel environment, we want the `mblk_t`
/// drop to occur at the packet level, where we can make use of
/// `freemsg(9F)`.
impl Drop for MsgBlk {
    fn drop(&mut self) {
        // Drop the segment chain if there is one. Consumers of MsgBlk
        // will never own a packet with no segments.
        // This guarantees that we only free the segment chain once.
        cfg_if! {
            if #[cfg(all(not(feature = "std"), not(test)))] {
                // Safety: This is safe as long as the original
                // `mblk_t` came from a call to `allocb(9F)` (or
                // similar API).
                unsafe { ddi::freemsg(self.inner.as_ptr()) };
            } else {
                mock_freemsg(self.inner.as_ptr());
            }
        }
    }
}
@FelixMcFelix (Collaborator, Author):

Drop logic and std emulation of mblks are copied from the previous packet module.

Labels: customer, perf