Datapath overhaul: zero-copy metadata with ingot, 'Compiled' UFTs #585
base: master
Using an L4-hash-derived source port looks like it is driving Rx traffic onto separate cores from a quick look in dtrace -- in a one port scenario this puts us back at being the second most-contended lock during a
This doesn't really affect speed, but I expect this should mean that different port traffic will at least be able to avoid processing on the same CPU in many cases. E.g., when sled
We don't actually lose any real-terms perf, go us.
Packet Rx is apparently 180% more costly now on `glasgow`.
TODO: find where the missing 250 Mbps has gone.
Notes from rough turning-off-and-on of the Old Way:
* Thin process is slower than it was before. I suspect this is due to the larger amount of things which have been shoved into the full Packet<Parsed> type once again. We're at 2.8--2.9 rather than 2.9--3.
* Thin process has a bigger performance impact on the Rx pathway than Tx:
  - Rx-only: 2.8--2.9
  - Tx-only: 2.74
  - None: 2.7
  - Old: <=2.5

There might be value in first-classing an extra parse state for the cases that we know we don't need to do arbitrary full-on transforms.
Clippy clean, and the test suite is happy once again. Code is mostly reorganised how I'd like. I've taken the time to test in some small instances locally in omicron, so external networking seems to be behaving. I saw around 3.5--4Gbps local VM-to-VM over viona (iPerf3, alpine linux) in that setup. One more round of self-review tomorrow.
zone = { git = "https://github.com/oxidecomputer/zone" }
ztest = { git = "https://github.com/oxidecomputer/falcon", branch = "main" }
poptrie = { git = "https://github.com/oxidecomputer/poptrie", branch = "multipath" }

[profile.release]
debug = 2
lto = true
Worth noting that lto has historically had no effect on my benchmarking, but in ingot's microbenchmarks it was fairly crucial.
I don't know if we want to define an alternate profile so that it is scoped to only be used on `ubench` and the kmodule -- `opteadm` takes quite a bit longer to build now.
ParserKind::OxideVpc => {
    let pkt = Packet::new(pkt_m.iter_mut());
    let res = match dir {
        In => {
            let pkt =
                pkt.parse_inbound(VpcParser {}).unwrap();
            port.port.process(dir, black_box(pkt)).unwrap()
        }
        Out => {
            let pkt =
                pkt.parse_outbound(VpcParser {}).unwrap();
            port.port.process(dir, black_box(pkt)).unwrap()
        }
    };
    assert!(!matches!(res, ProcessResult::Drop { .. }));
    if let Modified(spec) = res {
        black_box(spec.apply(pkt_m));
    }
}
There are a few differences in what we're measuring here:
- `process` is now effectively `parse + process`, because a packet requires a mutable reference over an mblk. If we want to figure out actual relative improvement, we need to subtract parse time.
- The old benchmarks did not include the cost associated with decomposing an mblk chain into a vec of individual mblks.
- I think we can't elide the drop cost of mblks here -- we need `LargeInput` as our iterator.
E.g., we have a benchmark sample like:
parse/ULP-FastPath/wallclock/V4-Tcp-OUT-0B
time: [64.845 ns 65.433 ns 65.976 ns]
change: [-37.370% -36.515% -35.702%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
process/ULP-FastPath/wallclock/V4-Tcp-OUT-0B
time: [189.99 ns 190.45 ns 190.99 ns]
change: [-19.914% -19.531% -19.088%] (p = 0.00 < 0.05)
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe
So, pure process time is 125.017ns, which is ~-47.600%. We have a similar reduction in parse and process time in all fast path and most slowpath cases (e.g., new TCP & UDP flows). The exceptions are hairpin packets (DHCP, ICMP, ...), where I might be doing something silly which I've yet to spot. Total allocation size looks to be up in some of those, while it's down universally elsewhere.
Full log below.
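The subtraction behind the 125.017 ns figure can be sketched as below. The function names are illustrative only, and the inputs are the median estimates from the two Criterion samples quoted above. The baseline needed to reproduce the relative change is not part of this comment, so it is left as a parameter rather than guessed.

```rust
/// Estimates pure `process` time when the `process` benchmark also
/// includes parsing, by subtracting the separately measured parse time.
/// (Hypothetical helper, not part of the benchmark suite.)
pub fn pure_process_ns(parse_plus_process_ns: f64, parse_ns: f64) -> f64 {
    parse_plus_process_ns - parse_ns
}

/// Relative change of `new_ns` against `old_ns`, as a percentage.
/// A negative result means the new measurement is faster.
pub fn relative_change_pct(old_ns: f64, new_ns: f64) -> f64 {
    (new_ns - old_ns) / old_ns * 100.0
}
```

With the medians above, `pure_process_ns(190.45, 65.433)` gives the 125.017 ns quoted in the comment.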
@@ -316,6 +316,8 @@ pub type offset_t = c_longlong;
pub type pid_t = c_int;
pub type zoneid_t = id_t;

/// A standard boolean in illumos.
Recent nightly `clippy` runs have started complaining about doc-comment structure. There are a few such changes throughout.
pub struct Vni {
    // A VNI is 24-bit. By storing it this way we don't have to check
    // the value on the opte-core side to know if it's a valid VNI, we
    // just decode the bytes.
    //
    // The bytes are in network order.
    inner: [u8; 3],
}
VNI (and serialisation logic) have been pulled up into ingot.
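As a rough illustration of the storage scheme described in the removed struct (and not ingot's actual API, which is not shown here), a 24-bit VNI kept as three network-order bytes might look like this. `Vni24`, `new`, and `as_u32` are hypothetical names chosen for the example.

```rust
/// Hypothetical stand-in for a 24-bit VNI stored in network byte order.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Vni24 {
    inner: [u8; 3],
}

impl Vni24 {
    /// Largest value representable in 24 bits.
    pub const MAX: u32 = (1 << 24) - 1;

    /// Validates once at construction; later reads need no range check
    /// because three bytes cannot hold an out-of-range value.
    pub fn new(val: u32) -> Option<Self> {
        if val > Self::MAX {
            return None;
        }
        let b = val.to_be_bytes(); // [0, high, mid, low]
        Some(Self { inner: [b[1], b[2], b[3]] })
    }

    /// Decodes the three big-endian bytes back into a `u32`.
    pub fn as_u32(&self) -> u32 {
        u32::from_be_bytes([0, self.inner[0], self.inner[1], self.inner[2]])
    }
}
```

The point of the layout is that validity is a property of the type: any `[u8; 3]` decodes to a legal VNI, so the check lives only in the constructor.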
) -> Result<LiteInPkt<MsgBlkIterMut<'_>, NP>, ParseError> {
    let pkt = Packet::new(pkt.iter_mut());
    pkt.parse_inbound(parser)
}
`Packet<Initialized>` feels a bit superfluous looking at this pattern, especially now that len is computed when we move into `LiteParsed`. We might want to do away with it.
assert_eq!(
    off.inner.ip.unwrap(),
    HdrOffset { pkt_pos: pos, seg_idx: 0, seg_pos: pos, hdr_len },
);
pos += hdr_len;
Comment on this and similar removals: OPTE no longer stores the internal offsets of each header, given that all headers are backed by slices into the mblk.
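To illustrate why stored offsets become unnecessary, here is a minimal sketch of a header view that borrows its bytes directly from the packet buffer. `UdpView` is a hypothetical type written for this example; it is not ingot's generated code, and a real parser would handle more than one fixed-size header.

```rust
/// Hypothetical zero-copy view over the 8 bytes of a UDP header.
pub struct UdpView<'a> {
    bytes: &'a mut [u8],
}

impl<'a> UdpView<'a> {
    /// Fixed length of a UDP header in bytes.
    pub const LEN: usize = 8;

    /// Splits `buf` into a header view and the remaining payload bytes.
    /// The view remembers no offset: it simply holds the slice itself.
    pub fn parse(buf: &'a mut [u8]) -> Option<(Self, &'a mut [u8])> {
        if buf.len() < Self::LEN {
            return None;
        }
        let (hdr, rest) = buf.split_at_mut(Self::LEN);
        Some((Self { bytes: hdr }, rest))
    }

    /// Reads the source port straight from the underlying buffer.
    pub fn src_port(&self) -> u16 {
        u16::from_be_bytes([self.bytes[0], self.bytes[1]])
    }

    /// Writes the source port in place; there is no separate copy of
    /// the header that must later be flushed back to the buffer.
    pub fn set_src_port(&mut self, port: u16) {
        self.bytes[..2].copy_from_slice(&port.to_be_bytes());
    }
}
```

Because reads and writes go through the same slice, a modification is visible in the buffer as soon as the setter returns.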
// we remove and re-add the mblks to work on them.
// We might want also want to return either a chain/mblk_t in an enum, but
// practically XDE will always assume it has a chain from MAC.
pub struct MsgBlkChain {
`PacketChain` -> `MsgBlkChain`. This type is basically unchanged, only that we dequeue `MsgBlk`s rather than `Packet`s.
/// an Ethernet _frame_, but we prefer to use the colloquial
/// nomenclature of "packet".
#[derive(Debug)]
pub struct MsgBlk {
The `mblk_t`-relevant logic from `Packet` has been recreated here. The treatment of an `mblk_t` as an iterator of byteslices is new and I think some extra scrutiny is warranted here.
There is an open design question around `MsgBlk`s vs their child `MsgBlkNode`s. Currently, these nodes received from a `MsgBlkIter(Mut)` allow fewer operations than the top-level `MsgBlk` -- this needn't be the case, apart from operations that would invalidate the iterator.
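The "iterator of byteslices" idea can be sketched in safe Rust as follows. This is only an analogy: `Seg` and `SegIterMut` are hypothetical stand-ins built on `Vec` and `Box`, whereas the real `MsgBlk` wraps raw `mblk_t` pointers.

```rust
/// Hypothetical stand-in for one `mblk_t` segment: a buffer plus a link
/// to the next segment (the role `b_cont` plays in a real chain).
pub struct Seg {
    pub data: Vec<u8>,
    pub next: Option<Box<Seg>>,
}

/// Walks a chain, lending out each segment's bytes exactly once.
pub struct SegIterMut<'a> {
    cur: Option<&'a mut Seg>,
}

impl Seg {
    /// Starts a mutable walk over this segment and everything after it.
    pub fn iter_mut(&mut self) -> SegIterMut<'_> {
        SegIterMut { cur: Some(self) }
    }
}

impl<'a> Iterator for SegIterMut<'a> {
    type Item = &'a mut [u8];

    fn next(&mut self) -> Option<Self::Item> {
        // Take the current segment out of the iterator, advance to its
        // successor, then hand back only the byte slice. The `data` and
        // `next` fields are borrowed disjointly, so no aliasing occurs.
        self.cur.take().map(|seg| {
            self.cur = seg.next.as_deref_mut();
            seg.data.as_mut_slice()
        })
    }
}
```

Note that the items are plain slices, so a consumer can edit bytes but cannot relink or resize a segment. That mirrors the restriction mentioned above: operations that would invalidate the iterator are simply not reachable through it.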
/// For the `no_std`/illumos kernel environment, we want the `mblk_t`
/// drop to occur at the packet level, where we can make use of
/// `freemsg(9F)`.
impl Drop for MsgBlk {
    fn drop(&mut self) {
        // Drop the segment chain if there is one. Consumers of MsgBlk
        // will never own a packet with no segments.
        // This guarantees that we only free the segment chain once.
        cfg_if! {
            if #[cfg(all(not(feature = "std"), not(test)))] {
                // Safety: This is safe as long as the original
                // `mblk_t` came from a call to `allocb(9F)` (or
                // similar API).
                unsafe { ddi::freemsg(self.inner.as_ptr()) };
            } else {
                mock_freemsg(self.inner.as_ptr());
            }
        }
    }
}
Drop logic and `std` emulation of mblks are copied from the previous `packet` module.
Will see if I can clean up PktBodyWalker further.
Pretty helpful for showing off operation.
This PR rewrites the core of OPTE's packet model to use zero-copy packet parsing/modification via the `ingot` library. This enables a few changes which get us just shy of the 3Gbps mark.
- [...] `mblk_t` into individual links.
- [...] `mblk_t` as they happen, and field reads are made from the same source.
- `NetworkParser`s now have the concept of inbound & outbound `LightweightMeta` formats. These support the key operations needed to execute all our UFT flows today (`FlowId` lookup, inner headers modification, encap push/pop, cksum update).
- [...] `EmitSpec`.
- [...] `struct`-size) metadata.
- [...] `Port` lock for the whole time.
- [...] `7777`.
- [...] `(Src Sled, Dst Sled)`.

There are several other changes here made to how OPTE functions which are needed to support the zero-copy model.
- [...] `Arc<>`'d, such that we can apply them outside of the `Port` lock.
- `FlowTable<S>`s now store `Arc<FlowEntry<S>>`, rather than `FlowEntry<S>`.
- [...] `Port` lock.
- `Opte::process` returns an `EmitSpec` which is needed to finalise a packet before it can be used.
- [...] `Packet` to have some self-referential fields when supporting other key parts of XDE (e.g., parse -> use fields -> select port -> process).

Closes #571, closes #481, closes #460.
Slightly alleviates #435.
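The `Arc<FlowEntry<S>>` change can be pictured with the sketch below: the lookup clones a reference-counted handle while the lock is held, and the per-packet work happens after the guard is dropped. `Port`, `FlowEntry`, the `u64` flow key, and the two-byte rewrite are simplified stand-ins invented for this example, and `std::sync::Mutex` stands in for the kernel lock OPTE actually uses.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Simplified stand-in for a flow table entry.
pub struct FlowEntry {
    /// Value written into the packet by this flow's transform.
    pub new_src_port: u16,
    /// Hit counter; atomic so it can be bumped without the table lock.
    pub hits: AtomicU64,
}

/// Simplified stand-in for a port whose flow table sits behind a lock.
pub struct Port {
    flows: Mutex<HashMap<u64, Arc<FlowEntry>>>,
}

impl Port {
    pub fn new() -> Self {
        Port { flows: Mutex::new(HashMap::new()) }
    }

    pub fn insert(&self, flow_id: u64, new_src_port: u16) {
        let entry = FlowEntry { new_src_port, hits: AtomicU64::new(0) };
        self.flows.lock().unwrap().insert(flow_id, Arc::new(entry));
    }

    /// Holds the lock only long enough to clone the `Arc`.
    pub fn lookup(&self, flow_id: u64) -> Option<Arc<FlowEntry>> {
        self.flows.lock().unwrap().get(&flow_id).cloned()
    }

    /// Rewrites the first two bytes of `pkt` using the matching flow.
    /// Returns `false` on a table miss or a too-short packet.
    pub fn process(&self, flow_id: u64, pkt: &mut [u8]) -> bool {
        let entry = match self.lookup(flow_id) {
            Some(e) => e,
            None => return false,
        };
        // The guard was dropped inside `lookup`, so the lock is free
        // while the packet is modified.
        if pkt.len() < 2 {
            return false;
        }
        pkt[..2].copy_from_slice(&entry.new_src_port.to_be_bytes());
        entry.hits.fetch_add(1, Ordering::Relaxed);
        true
    }
}
```

The trade-off in this arrangement is one atomic reference-count increment per lookup in exchange for a much shorter critical section.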
Original testing notes.
This is not exactly a transformative increase, according to testing on `glasgow`. But it is an increase by around 15--20% zone-to-zone vs #504:
The only thing is that we have basically cut the time we're spending doing non-MAC things down to the bone, and we are no longer the most contended lock-haver, courtesy of lockstat.
Zooming in a little on a representative call (percentages here of CPU time across examined stacks):
For context, `xde_mc_tx` is listed as taking 39.92% on this path, and `str_mdata_fastpath_put` as 21.50%. Packet parsing (3.36%) and processing times (1.86%) are nice and low! So we're now spending less time on each packet than MAC and the device driver do.