Datapath overhaul: zero-copy metadata with ingot, 'Compiled' UFTs #585

Open · wants to merge 95 commits into master
Conversation


@FelixMcFelix (Collaborator) commented Aug 21, 2024

This PR rewrites the core of OPTE's packet model to use zero-copy packet parsing/modification via the ingot library. This enables a few changes which get us just shy of the 3Gbps mark.

  • [2.36 -> 2.7] The use of ingot for modifying packets in both the slowpath (UFT miss) and existing fastpath (UFT hit).
    • Parsing is faster -- we no longer copy out all packet header bytes onto the stack, and we do not allocate a vector to decompose an mblk_t into individual links.
    • Packet field modifications are applied directly to the mblk_t as they happen, and field reads are made from the same source.
    • Non-encap layers are not copied back out.
  • [2.7 -> 2.75] Packet L4 hashes are cached as part of the UFT, speeding up multipath route selection over the underlay.
  • [2.75 -> 2.8] Incremental Internet checksum recalculation is performed only when applicable fields change on inner flow headers (e.g., NAT'd packets); a minimal sketch of the update arithmetic follows this list.
    • VM-to-VM / intra-VPC traffic is the main use case here.
  • [2.8 -> 3.05] NetworkParsers now have the concept of inbound & outbound LightweightMeta formats. These support the key operations needed to execute all our UFT flows today (FlowId lookup, inner header modification, encap push/pop, cksum update).
    • This also allows us to pre-serialize any bytes to be pushed in front of a packet, speeding up EmitSpec.
    • This is crucial for outbound traffic in particular, which has far smaller (in struct-size) metadata.
    • UFT misses or incompatible flows fall back to using the full metadata.
  • [3.05 -> 2.95] TCP state tracking uses a separate per-flow lock and does not require any lookup from a UFT.
    • I do not have numbers on how large the performance loss would be if we held the Port lock for the whole time.
  • (Not measured) Packet/UFT L4 Hashes are used as the Geneve source port, spreading inbound packets over NIC Rx queues based on the inner flow.
    • This is now possible because of #504 (Move underlay NICs back into H/W Classification) -- software classification would have limited us to the default inbound queue/group.
    • I feel bad for removing one more FF7 reference, but that is the way of these things. RIP port 7777.
    • Previously, Rx queue affinity was derived solely from (Src Sled, Dst Sled).
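
As an illustration of the incremental checksum point above: RFC 1624 lets a ones'-complement checksum be patched when a single 16-bit field changes, without re-summing the rest of the packet. The sketch below is just that arithmetic in isolation -- the function name and test are illustrative, not OPTE's actual code.

/// Incrementally update a ones'-complement checksum when one 16-bit
/// field changes from `old` to `new` (RFC 1624, eqn. 3):
///   HC' = ~(~HC + ~m + m')
fn incremental_csum_update(csum: u16, old: u16, new: u16) -> u16 {
    let mut sum = (!csum as u32) + (!old as u32) + (new as u32);
    // Fold carries back into the low 16 bits.
    while sum > 0xffff {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn patch_matches_full_recompute() {
        // Full ones'-complement checksum over 16-bit words.
        fn full(ws: &[u16]) -> u16 {
            let mut s: u32 = ws.iter().map(|&w| w as u32).sum();
            while s > 0xffff {
                s = (s & 0xffff) + (s >> 16);
            }
            !(s as u16)
        }

        // Two-word "header": patching word 1 incrementally must agree
        // with recomputing the checksum from scratch.
        let before = full(&[0x1234, 0xabcd]);
        let patched = incremental_csum_update(before, 0xabcd, 0x4321);
        assert_eq!(patched, full(&[0x1234, 0x4321]));
    }
}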

There are several other changes here made to how OPTE functions which are needed to support the zero-copy model.

  • Each collected set of header transforms is Arc<>'d, such that we can apply them outside of the Port lock.
  • FlowTable<S>s now store Arc<FlowEntry<S>>, rather than FlowEntry<S>.
    • This enables the UFT entry for any flow to store its matched TCP flow, update its hit count and timestamp, and then update the TCP state without reacquiring the Port lock.
    • This also drastically simplifies TCP state handling in fast path cases to not rely on post-transformation packets for lookup.
  • Opte::process returns an EmitSpec which is needed to finalise a packet before it can be used.
    • I'm not too happy about the ergonomics, but we have this problem because otherwise we'd need Packet to have some self-referential fields when supporting other key parts of XDE (e.g., parse -> use fields -> select port -> process). A rough sketch of the new call shape is shown below.
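
To make that concrete, the call pattern now looks roughly like this (adapted from the ubench harness quoted later in this PR; names are approximate and unwraps stand in for real error handling):

// Borrow the mblk's segments and do a zero-copy lightweight parse.
let pkt = Packet::new(pkt_m.iter_mut());
let pkt = pkt.parse_outbound(VpcParser {}).unwrap();

// Transforms are applied to the mblk in place during processing.
let res = port.process(Out, pkt).unwrap();

// Only once all borrows of pkt_m have ended can the EmitSpec prepend
// its pre-serialised encap bytes and finalise the packet.
if let Modified(spec) = res {
    spec.apply(pkt_m);
}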

Closes #571, closes #481, closes #460.

Slightly alleviates #435.

Original testing notes.

This is not exactly a transformative increase, according to testing on glasgow, but it is an improvement of around 15--20% zone-to-zone vs #504:

root@a:~# iperf -c 10.0.0.1 -P8
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 39797 connected to 10.0.0.1 port 5201
[  6] local 10.0.0.2 port 55568 connected to 10.0.0.1 port 5201
[  8] local 10.0.0.2 port 55351 connected to 10.0.0.1 port 5201
[ 10] local 10.0.0.2 port 49474 connected to 10.0.0.1 port 5201
[ 12] local 10.0.0.2 port 61952 connected to 10.0.0.1 port 5201
[ 14] local 10.0.0.2 port 47930 connected to 10.0.0.1 port 5201
[ 16] local 10.0.0.2 port 53057 connected to 10.0.0.1 port 5201
[ 18] local 10.0.0.2 port 63541 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  38.2 MBytes   320 Mbits/sec
[  6]   0.00-1.00   sec  38.3 MBytes   321 Mbits/sec
[  8]   0.00-1.00   sec  38.3 MBytes   321 Mbits/sec
[ 10]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[ 12]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[ 14]   0.00-1.00   sec  38.0 MBytes   318 Mbits/sec
[ 16]   0.00-1.00   sec  38.1 MBytes   319 Mbits/sec
[ 18]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[SUM]   0.00-1.00   sec   305 MBytes  2.56 Gbits/sec

...

- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  43.0 MBytes   361 Mbits/sec
[  6]   9.00-10.00  sec  42.9 MBytes   359 Mbits/sec
[  8]   9.00-10.00  sec  42.8 MBytes   359 Mbits/sec
[ 10]   9.00-10.00  sec  42.7 MBytes   358 Mbits/sec
[ 12]   9.00-10.00  sec  42.9 MBytes   360 Mbits/sec
[ 14]   9.00-10.00  sec  42.9 MBytes   359 Mbits/sec
[ 16]   9.00-10.00  sec  43.0 MBytes   360 Mbits/sec
[ 18]   9.00-10.00  sec  42.8 MBytes   359 Mbits/sec
[SUM]   9.00-10.00  sec   343 MBytes  2.88 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  4]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[  6]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  6]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[  8]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  8]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[ 10]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  sender
[ 10]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  receiver
[ 12]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[ 12]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[ 14]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  sender
[ 14]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  receiver
[ 16]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[ 16]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[ 18]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  sender
[ 18]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  3.32 GBytes  2.85 Gbits/sec                  sender
[SUM]   0.00-10.00  sec  3.32 GBytes  2.85 Gbits/sec                  receiver

The only real caveat is that we have now cut the time spent doing non-MAC work down to the bone and, per lockstat, we are no longer the most-contended lock holder.
[attached image: tx]

Zooming in a little on a representative call (percentages here of CPU time across examined stacks):
[attached image]
For context, xde_mc_tx is listed as taking 39.92% on this path, and str_mdata_fastpath_put 21.50%. Packet parsing (3.36%) and processing (1.86%) are nice and low! So we're now spending less time on each packet than MAC and the device driver do.

@FelixMcFelix added this to the 11 milestone Aug 22, 2024
@FelixMcFelix (Collaborator, Author) commented:

A quick look in dtrace suggests that using an L4-hash-derived source port is indeed driving Rx traffic onto separate cores -- in a one-port scenario this puts us back to being the second most-contended lock during a -P 8 iperf run. (A single-threaded run remains fairly uncontended.)

-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller                  
260502  12%  41% 0.00      637 0xfffffcfa34386be8     _ZN4opte6engine4port13Port$LT$N$GT$12thin_process17h316b1c1b8ce14471E+0x7c

      nsec ------ Time Distribution ------ count                             
       256 |@@@@@@@@@@                     90486     
       512 |@@@@@@@@@@@                    97481     
      1024 |@@@@                           37487     
      2048 |@@                             20513     
      4096 |@                              11065     
      8192 |                               2786      
     16384 |                               533       
     32768 |                               116       
     65536 |                               20        
    131072 |                               13        
    262144 |                               1         
    524288 |                               1         

This doesn't really affect speed, but I expect it should mean that traffic for different ports will at least be able to avoid being processed on the same CPU in many cases. E.g., when sled $A$ hosts ports $A_1, A_2$ and sled $B$ hosts ports $B_1, B_2$, all $A \leftrightarrow B$ combinations previously had the same outer 5-tuple $(A_{\mathit{IP6}},B_{\mathit{IP6}},\mathit{UDP},7777,6081)$ -- and so an identical Rx queue mapping. There's work to be done to get contention down further, but that's beyond this PR's scope.
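
To make the queue-spreading concrete, here's a minimal sketch of the kind of mapping involved: hash the inner flow's identity and fold it into the outer Geneve UDP source port, so that the underlay NIC's RSS sees a distinct outer 5-tuple per inner flow. The hasher, port range, and type names below are illustrative assumptions, not OPTE's actual code.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::Ipv6Addr;

/// Inner flow identity (fields are illustrative).
#[derive(Hash)]
struct InnerFlowId {
    proto: u8,
    src_ip: Ipv6Addr,
    dst_ip: Ipv6Addr,
    src_port: u16,
    dst_port: u16,
}

/// Derive the outer (Geneve) UDP source port from an L4 hash of the
/// inner flow, keeping it in the ephemeral range. The outer 5-tuple
/// (src_ip6, dst_ip6, UDP, src_port, 6081) then differs per inner flow,
/// letting NIC RSS spread Rx across queues.
fn geneve_src_port(flow: &InnerFlowId) -> u16 {
    let mut h = DefaultHasher::new();
    flow.hash(&mut h);
    // Map into the dynamic/ephemeral port range 49152..=65535.
    49152 + (h.finish() % 16384) as u16
}

fn main() {
    let flow = InnerFlowId {
        proto: 6, // TCP
        src_ip: "fd00::1".parse().unwrap(),
        dst_ip: "fd00::2".parse().unwrap(),
        src_port: 39797,
        dst_port: 5201,
    };
    println!("outer UDP source port: {}", geneve_src_port(&flow));
}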

@twinfees added the customer label (for any bug reports or feature requests tied to customer requests) Aug 27, 2024
@FelixMcFelix self-assigned this Sep 6, 2024
Packet Rx is apparently 180% more costly now on `glasgow`.
TODO: find where the missing 250 Mbps has gone.
Notes from rough turning-off-and-on of the Old Way:

* Thin process is slower than it was before. I suspect this is due to
  the larger amount of state which has been shoved back into the full
  Packet<Parsed> type once again. We're at 2.8--2.9 rather than 2.9--3.
* Thin process has a bigger performance impact on the Rx pathway than
  Tx:
   - Rx-only: 2.8--2.9
   - Tx-only: 2.74
   - None:    2.7
   - Old:   <=2.5

There might be value in first-classing an extra parse state for the
cases where we know we don't need to do arbitrary, full-on transforms.
@FelixMcFelix marked this pull request as ready for review October 29, 2024 19:22
@FelixMcFelix (Collaborator, Author) commented:

Clippy clean, and the test suite is happy once again. Code is mostly reorganised how I'd like.

I've taken the time to test with some small instances locally in omicron, so external networking seems to be behaving. I saw around 3.5--4 Gbps local VM-to-VM over viona (iPerf3, Alpine Linux) in that setup.

One more round of self-review tomorrow.

zone = { git = "https://github.com/oxidecomputer/zone" }
ztest = { git = "https://github.com/oxidecomputer/falcon", branch = "main" }
poptrie = { git = "https://github.com/oxidecomputer/poptrie", branch = "multipath" }

[profile.release]
debug = 2
lto = true
@FelixMcFelix (Collaborator, Author):

Worth noting that lto has historically had no effect on my benchmarking, but in ingot's microbenchmarks it was fairly crucial.

I don't know if we want to define an alternate profile so that it is scoped to only be used on ubench and the kmodule -- opteadm takes quite a bit longer to build now.

Comment on lines +173 to +191
ParserKind::OxideVpc => {
    let pkt = Packet::new(pkt_m.iter_mut());
    let res = match dir {
        In => {
            let pkt = pkt.parse_inbound(VpcParser {}).unwrap();
            port.port.process(dir, black_box(pkt)).unwrap()
        }
        Out => {
            let pkt = pkt.parse_outbound(VpcParser {}).unwrap();
            port.port.process(dir, black_box(pkt)).unwrap()
        }
    };
    assert!(!matches!(res, ProcessResult::Drop { .. }));
    if let Modified(spec) = res {
        black_box(spec.apply(pkt_m));
    }
}
@FelixMcFelix (Collaborator, Author):

There are a few differences in what we're measuring here:

  • process is now effectively parse + process, because a packet requires a mutable reference over an mblk. If we want to figure out actual relative improvement, we need to subtract parse time.
  • The old benchmarks did not include the cost associated in decomposing an mblk chain into a vec of individual mblks.
  • I think we can't elide the drop cost of mblks here -- we need LargeInput as our iterator.

E.g., we have a benchmark sample like:

parse/ULP-FastPath/wallclock/V4-Tcp-OUT-0B
                        time:   [64.845 ns 65.433 ns 65.976 ns]
                        change: [-37.370% -36.515% -35.702%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

process/ULP-FastPath/wallclock/V4-Tcp-OUT-0B
                        time:   [189.99 ns 190.45 ns 190.99 ns]
                        change: [-19.914% -19.531% -19.088%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

So, pure process time is ~125.0 ns (the 190.45 ns process median minus the 65.433 ns parse median), which is roughly a 47.6% reduction versus the old process-only figure. We see a similar reduction in parse and process time in all fastpath and most slowpath cases (e.g., new TCP & UDP flows). The exceptions are hairpin packets (DHCP, ICMP, ...), where I might be doing something silly which I've yet to spot. Total allocation size looks to be up in some of those, while it's down universally elsewhere.

Full log below.

2024-10-30-bmark.log

@@ -316,6 +316,8 @@ pub type offset_t = c_longlong;
pub type pid_t = c_int;
pub type zoneid_t = id_t;

/// A standard boolean in illumos.
@FelixMcFelix (Collaborator, Author):

Recent nightly clippy runs have started complaining about doc-comment structure. There are a few such changes throughout.

Comment on lines -20 to -27
pub struct Vni {
    // A VNI is 24-bit. By storing it this way we don't have to check
    // the value on the opte-core side to know if it's a valid VNI, we
    // just decode the bytes.
    //
    // The bytes are in network order.
    inner: [u8; 3],
}
@FelixMcFelix (Collaborator, Author):

VNI (and its serialisation logic) has been pulled up into ingot.
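
For anyone who hasn't seen it, the idea being moved is "valid by construction": a 24-bit VNI is stored as three network-order bytes, so any stored value is a legal VNI. A toy restatement of that pattern (illustrative only; ingot's actual type and API differ):

/// Store a 24-bit VNI as 3 network-order bytes so that every stored
/// value is valid by construction -- no range check needed at read time.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Vni {
    inner: [u8; 3],
}

impl Vni {
    /// Reject values that don't fit in 24 bits.
    fn new(val: u32) -> Option<Self> {
        if val > 0x00ff_ffff {
            return None;
        }
        let b = val.to_be_bytes(); // [0, hi, mid, lo]
        Some(Self { inner: [b[1], b[2], b[3]] })
    }

    fn as_u32(&self) -> u32 {
        u32::from_be_bytes([0, self.inner[0], self.inner[1], self.inner[2]])
    }
}

fn main() {
    let vni = Vni::new(1701).unwrap();
    assert_eq!(vni.as_u32(), 1701);
    assert!(Vni::new(1 << 24).is_none()); // 25-bit value rejected
}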

) -> Result<LiteInPkt<MsgBlkIterMut<'_>, NP>, ParseError> {
    let pkt = Packet::new(pkt.iter_mut());
    pkt.parse_inbound(parser)
}
@FelixMcFelix (Collaborator, Author):

Packet<Initialized> feels a bit superfluous looking at this pattern, especially now that len is computed when we move into LiteParsed. We might want to do away with it.

Comment on lines -1078 to -1082
assert_eq!(
    off.inner.ip.unwrap(),
    HdrOffset { pkt_pos: pos, seg_idx: 0, seg_pos: pos, hdr_len },
);
pos += hdr_len;
@FelixMcFelix (Collaborator, Author):

Comment on this and similar removals: OPTE no longer stores the internal offsets of each header, given that all headers are backed by slices into the mblk.
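
To spell out why the offsets go away: when each header is just a view over the mblk's own bytes, reads and writes land directly in the buffer, so there is no separate offset bookkeeping and nothing to copy back. A toy, userland-only illustration of that pattern (not ingot's actual API):

/// A toy zero-copy view over an IPv4 header somewhere in a larger buffer.
struct Ipv4View<'a> {
    bytes: &'a mut [u8], // at least the 20-byte fixed header
}

impl<'a> Ipv4View<'a> {
    fn new(bytes: &'a mut [u8]) -> Option<Self> {
        (bytes.len() >= 20).then_some(Self { bytes })
    }

    fn ttl(&self) -> u8 {
        self.bytes[8]
    }

    fn set_ttl(&mut self, ttl: u8) {
        self.bytes[8] = ttl; // lands directly in the packet buffer
    }
}

fn main() {
    // Pretend this is one segment of an mblk chain.
    let mut segment = [0u8; 64];
    segment[8] = 64; // initial TTL

    let mut view = Ipv4View::new(&mut segment[..20]).unwrap();
    view.set_ttl(view.ttl() - 1);

    // The mutation is already visible in the underlying buffer.
    assert_eq!(segment[8], 63);
}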

// we remove and re-add the mblks to work on them.
// We might also want to return either a chain/mblk_t in an enum, but
// practically XDE will always assume it has a chain from MAC.
pub struct MsgBlkChain {
@FelixMcFelix (Collaborator, Author):

PacketChain -> MsgBlkChain. This type is basically unchanged, except that we now dequeue MsgBlks rather than Packets.

/// an Ethernet _frame_, but we prefer to use the colloquial
/// nomenclature of "packet".
#[derive(Debug)]
pub struct MsgBlk {
@FelixMcFelix (Collaborator, Author):

The mblk_t-relevant logic from Packet has been recreated here. The treatment of an mblk_t as an iterator of byte slices is new, and I think some extra scrutiny is warranted here.

There is an open design question around MsgBlks vs their child MsgBlkNodes. Currently, the nodes received from a MsgBlkIter(Mut) allow fewer operations than the top-level MsgBlk -- this needn't be the case, apart from operations that would invalidate the iterator.
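
For anyone less familiar with mblk_t: the new model treats a message block chain as an iterator that yields one byte slice per segment, rather than flattening or copying the chain. A rough, std-only analogue of that shape (not the real MsgBlkIter, which walks b_cont pointers in the kernel):

/// Stand-in for a segmented message: each Vec<u8> plays the role of one
/// mblk_t segment in a b_cont-linked chain.
struct FakeMsgBlk {
    segments: Vec<Vec<u8>>,
}

impl FakeMsgBlk {
    /// Present the chain as byte slices, the way a parser would consume
    /// it -- no flattening, no copying.
    fn iter(&self) -> impl Iterator<Item = &[u8]> + '_ {
        self.segments.iter().map(|s| s.as_slice())
    }

    fn byte_len(&self) -> usize {
        self.segments.iter().map(|s| s.len()).sum()
    }
}

fn main() {
    // E.g. headers in one segment, payload in another.
    let msg = FakeMsgBlk {
        segments: vec![vec![0xAA; 14], vec![0xBB; 1400]],
    };

    assert_eq!(msg.byte_len(), 1414);
    for (i, seg) in msg.iter().enumerate() {
        println!("segment {i}: {} bytes", seg.len());
    }
}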

Comment on lines +785 to +804
/// For the `no_std`/illumos kernel environment, we want the `mblk_t`
/// drop to occur at the packet level, where we can make use of
/// `freemsg(9F)`.
impl Drop for MsgBlk {
    fn drop(&mut self) {
        // Drop the segment chain if there is one. Consumers of MsgBlk
        // will never own a packet with no segments.
        // This guarantees that we only free the segment chain once.
        cfg_if! {
            if #[cfg(all(not(feature = "std"), not(test)))] {
                // Safety: This is safe as long as the original
                // `mblk_t` came from a call to `allocb(9F)` (or
                // similar API).
                unsafe { ddi::freemsg(self.inner.as_ptr()) };
            } else {
                mock_freemsg(self.inner.as_ptr());
            }
        }
    }
}
@FelixMcFelix (Collaborator, Author):

Drop logic and std emulation of mblks are copied from the previous packet module.

Labels: customer, perf