Introduction to vectorization with Vector128 and Vector256

TL;DR: Go to Summary

Introduction to vectorization with Vector128 and Vector256

Vectorization is the art of converting an algorithm from operating on a single value per iteration to operating on a set of values (vector) per iteration. It can greatly improve performance at a cost of increased code complexity.

In recent releases, .NET has introduced many new APIs for vectorization. The vast majority of them are hardware specific, so they require users to provide an implementation per processor architecture (such as x86, x64, Arm64, WASM, or other platforms), with the option of using the most optimal instructions for hardware that is executing the code.

.NET 7 introduced a set of new APIs for Vector64<T>, Vector128<T> and Vector256<T> for writing hardware-agnostic, cross platform vectorized code. Similarly, .NET 8 introduced Vector512<T>. The purpose of this document is to introduce you to the new APIs and provide a set of best practices.

Code structure

Vector128<T> is the "common denominator" across all platforms that support vectorization (and this is expected to always be the case). It represents a 128-bit vector containing elements of type T.

T is constrained to specific primitive types:

byte and sbyte (8 bits).
short and ushort (16 bits).
int, uint and float (32 bits).
long, ulong and double (64 bits).
nint and nuint (32 or 64 bits, depending on the architecture, available in .NET 7+)

.NET 8 introduced a Vector128<T>.IsSupported that indicates whether a given T will throw to help identify what works per runtime, including from generic contexts.

A single Vector128 operation allows you to operate on: 16 (s)bytes, 8 (u)shorts, 4 (u)ints/floats, or 2 (u)longs/double(s).

------------------------------128-bits---------------------------
|             64                |               64              |
-----------------------------------------------------------------
|      32       |      32       |      32       |      32       |
----------------------------------------------------------------|
|  16   |  16   |  16   |  16   |  16   |  16   |  16   |  16   |
-----------------------------------------------------------------
| 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
-----------------------------------------------------------------

Vector256<T> is twice as big as Vector128<T>, so when it is hardware accelerated, the data is large enough, and the benchmarks prove that it offers better performance, you should consider using it instead of Vector128<T>. Benchmarking your code can be important as not all platforms treat larger vectors the same.

For example, Vector256<T> on x86/x64 is mostly treated as 2x Vector128<T> rather than 1x Vector256<T>, where each Vector128<T> is considered a "lane". For most operations, this doesn't present any additional considerations they only operate on individual elements of the vector. However, some operations could "cross lanes" such as shuffling or pairwise operations and that may require additional overhead to handle.

As an example, consider Add(Vector128<float> lhs, Vector128<float> rhs) where you end up effectively doing (pseudo-code):

result[0] = lhs[0] + rhs[0];
result[1] = lhs[1] + rhs[1];
result[2] = lhs[2] + rhs[2];
result[3] = lhs[3] + rhs[3];

With this algorithm it doesn't matter what size vector we have as we're accessing the same index of the input vectors and only one at a time. So regardless of whether we have Vector128<T> or Vector256<T> or Vector512<T>, it all operates the same.

However, if you then consider AddPairwise(Vector128<float> lhs, Vector128<float> rhs) (sometimes called HorizontalAdd) where you instead end up effectively doing:

// process left
result[0] = lhs[0] + lhs[1];
result[1] = lhs[2] + lhs[3];
// process right
result[2] = rhs[0] + rhs[1];
result[3] = rhs[2] + rhs[3];

You may notice that this algorithm would change behavior if expanded up to operate on a single 256-bit vector (note result[2] is now lhs[4] + lhs[6] and not rhs[0] + rhs[1]):

// process left
result[0] = lhs[0] + lhs[1];
result[1] = lhs[2] + lhs[3];
result[2] = lhs[4] + lhs[5];
result[3] = lhs[6] + lhs[7];
// process right
result[4] = rhs[0] + rhs[1];
result[5] = rhs[2] + rhs[3];
result[6] = rhs[4] + rhs[5];
result[7] = rhs[6] + rhs[7];

Because this behavior would change, the x86/x64 platform opted to treat the operation as 2x Vector128<float> inputs giving you instead:

// process lower left
result[0] = lhs[0] + lhs[1];
result[1] = lhs[2] + lhs[3];
// process lower right
result[2] = rhs[0] + rhs[1];
result[3] = rhs[2] + rhs[3];
// process upper left
result[4] = lhs[4] + lhs[5];
result[5] = lhs[6] + lhs[7];
// process upper right
result[6] = rhs[4] + rhs[5];
result[7] = rhs[6] + rhs[7];

This ends up preserving behavior and making it much easier to transition from 128-bit to 256-bit or higher as you're effectively just unrolling the loop again. It does, however, mean that some algorithms may need additional handling if you need to truly do anything involving the upper and lower lanes together. The exact additional expense here depends on what is being done, what the underlying hardware supports, and several other factors covered in more detail later.

Checking for Hardware Acceleration

To check if a given vector size is hardware accelerated, use the IsHardwareAccelerated property on the relevant non-generic vector class. For example, Vector128.IsHardwareAccelerated or Vector256.IsHardwareAccelerated. Note that even when a vector size is accelerated, there may still be some operations that are not hardware-accelerated; e.g. floating-point division can be accelerated on some hardware while integer division is not.

The size of the input also matters. It needs to be at least of the size of a single vector to be able to execute the vectorized code path (there are some advanced tricks that can allow you to operate on smaller inputs, but we won't describe them here). The Count properties (for example Vector128<T>.Count or Vector256<T>.Count) return the number of elements of the given type T in a single vector.

When Vector256 is accelerated, Vector128 generally will be as well, but there's no guarantee of that. The best practice is to always check IsHardwareAccelerated explicitly. You may be tempted to cache the values from the IsHardwareAccelerated and Count properties, but this is not needed or recommended. Both IsHardwareAccelerated and Count are turned into constants by the Just-In-Time compiler and no method call is required to retrieve the information.

Example Code Structure

void CodeStructure(ReadOnlySpan<byte> buffer)
{
    if (Vector256.IsHardwareAccelerated && buffer.Length >= Vector256<byte>.Count)
    {
        // Vector256 code path
    }
    else if (Vector128.IsHardwareAccelerated && buffer.Length >= Vector128<byte>.Count)
    {
        // Vector128 code path
    }
    else
    {
        // non-vectorized && small inputs code path
    }
}

To reduce the number of comparisons for small inputs, we can re-arrange it in the following way:

void OptimalCodeStructure(ReadOnlySpan<byte> buffer)
{
    if (!Vector128.IsHardwareAccelerated || buffer.Length < Vector128<byte>.Count)
    { 
        // scalar code path
    } 
    else if (!Vector256.IsHardwareAccelerated || buffer.Length < Vector256<byte>.Count)
    { 
        // Vector128 code path
    } 
    else
    { 
        // Vector256 code path
    }
}

Both vector types provide the same functionality, but arm64 hardware does not support Vector256, so for the sake of simplicity we will be using Vector128 in all examples. All examples shown also assume little endian architecture and/or do not need to deal with endianness. BitConverter.IsLittleEndian is available (and turned into a constant by the JIT) for algorithms that need to consider endianness.

With these assumptions, all examples shown in the document assume that they are being executed as part of the following if block:

else if (Vector128.IsHardwareAccelerated && buffer.Length >= Vector128<byte>.Count)
{
    // Vector128 code path
}

Testing

Such a code structure requires us to test all possible code paths:

Vector256 is accelerated:
- The input is large enough to benefit from vectorization with Vector256.
- The input is not large enough to benefit from vectorization with Vector256, but it can benefit from vectorization with Vector128.
- The input is too small to benefit from any kind of vectorization.
Vector128 is accelerated
- The input is large enough to benefit from vectorization with Vector128.
- The input is too small to benefit from any kind of vectorization.
Neither Vector128 or Vector256 are accelerated.

It's possible to implement tests that cover some of the scenarios based on the size, but it's impossible to toggle hardware acceleration at the unit test level. It can be controlled with environment variables before .NET process is started:

When DOTNET_EnableAVX2 is set to 0, Vector256.IsHardwareAccelerated returns false.
When DOTNET_EnableHWIntrinsic is set to 0, not only do both mentioned APIs return false, but so also do Vector64.IsHardwareAccelerated and Vector.IsHardwareAccelerated.

Assuming that we run the tests on an x64 machine that supports Vector256, we need to write tests that cover all size scenarios and run them with:

no custom settings
DOTNET_EnableAVX2=0
DOTNET_EnableHWIntrinsic=0

The alternative is running tests on enough variation of hardware to cover all the paths.

Benchmarking

All that complexity needs to pay off. We need to benchmark the code to verify that the investment is beneficial. We can do that with BenchmarkDotNet.

Custom config

It's possible to define a config that instructs the harness to run the benchmarks for all three scenarios:

static void Main(string[] args)
{
    Job enough = Job.Default
        .WithWarmupCount(1)
        .WithIterationTime(TimeInterval.FromSeconds(0.25))
        .WithMaxIterationCount(20);

    IConfig config = DefaultConfig.Instance
        .HideColumns(Column.EnvironmentVariables, Column.RatioSD, Column.Error)
        .AddDiagnoser(new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig
            (exportGithubMarkdown: true, printInstructionAddresses: false)))
        .AddJob(enough.WithEnvironmentVariable("DOTNET_EnableHWIntrinsic", "0").WithId("Scalar").AsBaseline());

    if (Vector256.IsHardwareAccelerated)
    {
        config = config
            .AddJob(enough.WithId("Vector256"))
            .AddJob(enough.WithEnvironmentVariable("DOTNET_EnableAVX2", "0").WithId("Vector128"));

    }
    else if (Vector128.IsHardwareAccelerated)
    {
        config = config.AddJob(enough.WithId("Vector128"));
    }

    BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly)
        .Run(args, config);
}

Note: the config defines a disassembler, which exports a disassembly in GitHub markdown format (supported on both x64 and arm64, Windows and Linux). It is very often an invaluable tool when working with high-performance code where inspecting generated assembly code is required.

Memory alignment

BenchmarkDotNet does a lot of heavy lifting for the end users, but it cannot protect us from the random memory alignment which can be different per each benchmark run and can affect the stability of the benchmarks.

We have three possibilities:

We can enforce the alignment ourselves and have very stable results.
We can ask the harness to try to randomize the memory and observe the entire possible distribution with each run.
We can do nothing and wonder why the results have additional noise across many runs.

Enforcing memory alignment

We can allocate aligned unmanaged memory by using the NativeMemory.AlignedAlloc.

public unsafe class Benchmarks
{
    private void* _pointer;

    [Params(6, 32, 1024)] // test various sizes
    public uint Size;

    [GlobalSetup]
    public void Setup()
    {
        _pointer = NativeMemory.AlignedAlloc(byteCount: Size * sizeof(int), alignment: 32);
        NativeMemory.Clear(_pointer, byteCount: Size * sizeof(int)); // ensure it's all zeros, so 1 is never found
    }

    [Benchmark]
    public bool Contains()
    {
        ReadOnlySpan<int> buffer = new (_pointer, (int)Size);
        return buffer.Contains(1);
    }

    [GlobalCleanup]
    public void Cleanup() => NativeMemory.AlignedFree(_pointer);
}

Sample results (please mind the AVX2, AVX and SSE4.2 information printed in the summary):

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1413/22H2/2022Update/SunValley2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=8.0.100-alpha.1.22558.1
  [Host]    : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX2
  Scalar    : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT
  Vector128 : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX
  Vector256 : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX2

|   Method |       Job | Size |       Mean |    StdDev | Ratio | Code Size |
|--------- |---------- |----- |-----------:|----------:|------:|----------:|
| Contains |    Scalar | 1024 | 143.844 ns | 0.6234 ns |  1.00 |     206 B |
| Contains | Vector128 | 1024 | 104.544 ns | 1.2792 ns |  0.73 |     335 B |
| Contains | Vector256 | 1024 |  55.769 ns | 0.6720 ns |  0.39 |     391 B |

Note: as you can see, even such simple method like Contains did not observe a perfect performance boost: x8 for Vector256 (256/32) and x4 for Vector128 (128/32). To understand why, we would need to use a profiler that provides information on CPU instruction level, which depending on the hardware could be Intel VTune or amd uprof.

The results should be very stable (flat distributions), but on the other hand we are measuring the performance of the best case scenario (the input is large, aligned and its entire contents are searched through, as the value is never found).

Explaining benchmark design guidelines is outside of the scope of this document, but we have a dedicated document about it. To make a long story short, you should benchmark all scenarios that are realistic for your production environment, so your customers can actually benefit from your improvements.

Memory randomization

The alternative is to enable memory randomization. Before every iteration, the harness is going to allocate random-size objects, keep them alive and re-run the setup that should allocate the actual memory.

You can read more about it here. It requires an understanding of what distribution is and how to read it. It's also out of scope of this document, but a book on statistics, such as Pro .NET Benchmarking can help you get a very good understanding of the subject.

No matter how you are going to benchmark your code, you need to keep in mind that the larger the input, the more you can benefit from vectorization. If your code uses small buffers, performance might even get worse.

Loops

To work with inputs that are bigger than a single vector, you typically need to loop over the entire input. This should be split into two parts:

vectorized loop that operates on multiple values at a time
handling of the remainder

Example: our input is a buffer of ten integers, assuming that Vector128 is accelerated, we handle the first four values in the first loop iteration, the next four in the second iteration and then we stop, as only two are left. Depending on how we can handle the remainder, we distinguish two approaches.

Scalar remainder handling

Imagine that we want to calculate the sum of all the numbers in given buffer. We definitely want to add every element just once, without repetitions. That is why in the first loop, we add four (128 bits / 32 bits) integers in one iteration. In the second loop, we handle the remaining values.

int Sum(Span<int> buffer)
{
    Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128<int>.Count);

    // The initial sum is zero, so we need a vector with all elements initialized to zero.
    Vector128<int> sum = Vector128<int>.Zero;

    // We need to obtain the reference to first value in the buffer, it's used later for loading vectors from memory.
    ref int searchSpace = ref MemoryMarshal.GetReference(buffer);
    // And an offset, that is going to be used by vectorized and scalar loops.
    nuint elementOffset = 0;
    // And the last valid offset from which we can load the values
    nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128<int>.Count);
    for (; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128<int>.Count)
    {
        // We load a vector from given offset.
        Vector128<int> loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset);
        // We add 4 integers at a time:
        sum += loaded;
    }

    // We sum all 4 integers from the vector to one
    int result = Vector128.Sum(sum);

    // And handle the remaining elements, in a non-vectorized way:
    while (elementOffset < (nuint)buffer.Length)
    {
        result += buffer[(int)elementOffset];
        elementOffset++;
    }

    return result;
}

Note: Use ref MemoryMarshal.GetReference(span) instead of ref span[0] and ref MemoryMarshal.GetArrayDataReference(array) instead of ref array[0] to handle empty buffer scenarios (which would throw IndexOutOfRangeException). If the buffer is empty, these methods return a reference to the location where the 0th element would have been stored. Such a reference may or may not be null. You can use it for pinning but you must never de-reference it.

Note: The GetReference method has an overload that accepts a ReadOnlySpan and returns mutable reference. Please use it with caution! To get a readonly reference, you can use ReadOnlySpan.GetPinnableReference or just do the following:

ref readonly T searchSpace = ref MemoryMarshal.GetReference(buffer);

Note: Please keep in mind that Vector128.Sum is a static method. Vectior128<T> and Vector256<T> provide both instance and static methods (operators like + are just static methods in C#). Vector128 and Vector256 are non-generic static classes with static methods only. It's important to know about their existence when searching for methods.

Vectorized remainder handling

There are scenarios and advanced techniques that can allow for vectorized remainder handling instead of resorting to the non-vectorized approach illustrated above. Some algorithms could use an approach of backtracking to load one more vector's worth of elements and masking off elements that have already been processed. For idempotent algorithms, it is preferable to simply backtrack and process one last vector, repeating the operation for elements as needed.

In the example below, we need to check whether the given buffer contains a specific number; processing values more than once is completely acceptable. The buffer contains six 32-bit integers, Vector128 is accelerated, and it can work with four integers at a time. In the first loop iteration, we handle the first four elements. In the second (and last) iteration we need to handle the remaining two elements. Since the remainder is smaller than one Vector128 and we are not mutating the input, we perform a vectorized operation on a Vector128 containing the last four elements.

bool Contains(Span<int> buffer, int searched)
{
    Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128<int>.Count);

    Vector128<int> loaded;
    // We need a vector for storing the searched value.
    Vector128<int> values = Vector128.Create(searched);

    ref int searchSpace = ref MemoryMarshal.GetReference(buffer);
    nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128<int>.Count);
    nuint elementOffset = 0;
    for (; elementOffset <= oneVectorAwayFromEnd; elementOffset += (nuint)Vector128<int>.Count)
    {
        loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset);
        // compare the loaded vector with searched value vector
        if (Vector128.Equals(loaded, values) != Vector128<int>.Zero)
        {
            return true; // return true if a difference was found
        }
    }

    // If any elements remain, process the last vector in the search space.
    if (elementOffset != (uint)buffer.Length)
    {
        loaded = Vector128.LoadUnsafe(ref searchSpace, oneVectorAwayFromEnd);
        if (Vector128.Equals(loaded, values) != Vector128<int>.Zero)
        {
            return true;
        }
    }

    return false;
}

Vector128.Create(value) creates a new vector with all elements initialized to the specified value. So Vector128<int>.Zero is equivalent to Vector128.Create(0).

Vector128.Equals(Vector128 left, Vector128 right) compares two vectors and returns a vector where each element is either all-bits-set or zero, depending on if the corresponding elements in left and right were equal. If the result of comparison is non zero, it means that there was at least one match.

Access violation (AV) testing

Handling the remainder in an invalid way may lead to non-deterministic and hard to diagnose issues.

Let's look at the following code:

nuint elementOffset = 0;
while (elementOffset < (nuint)buffer.Length)
{
    loaded = Vector128.LoadUnsafe(ref searchSpace, elementOffset); // BUG!

    elementOffset += (nuint)Vector128<int>.Count;
}

How many times will the loop execute for a buffer of six integers? Twice! The first time it will load the first four elements, but the second time it will load the random content of the memory following the buffer!

Writing tests that detect that issue is hard, but not impossible. The .NET Team uses a helper utility called BoundedMemory that allocates a memory region which is immediately preceded by or immediately followed by a poison (MEM_NOACCESS) page. Attempting to read the memory immediately before or after it results in AccessViolationException.

Loading and storing vectors

Loading

Both Vector128 and Vector256 provide at least five ways of loading them from memory:

public static class Vector128
{
    public static Vector128<T> Load<T>(T* source) where T : unmanaged
    public static Vector128<T> LoadAligned<T>(T* source) where T : unmanaged
    public static Vector128<T> LoadAlignedNonTemporal<T>(T* source) where T : unmanaged
    public static Vector128<T> LoadUnsafe<T>(ref T source) where T : struct
    public static Vector128<T> LoadUnsafe<T>(ref T source, nuint elementOffset) where T : struct
}

The first three overloads require a pointer to the source. To be able to use a pointer to a managed buffer in a safe way, the buffer needs to be pinned first. This is because the GC cannot track unmanaged pointers. It needs help to ensure that it doesn't move the memory while you're using it, as the pointers would silently become invalid. The tricky part here is doing the pointer arithmetic right:

unsafe int UnmanagedPointersSum(Span<int> buffer)
{
    fixed (int* pBuffer = buffer)
    {
        int* pEnd = pBuffer + buffer.Length;
        int* pOneVectorFromEnd = pEnd - Vector128<int>.Count;
        int* pCurrent = pBuffer;

        Vector128<int> sum = Vector128<int>.Zero;

        while (pCurrent <= pOneVectorFromEnd)
        {
            sum += Vector128.Load(pCurrent);

            pCurrent += Vector128<int>.Count;
        }

        int result = Vector128.Sum(sum);

        while (pCurrent < pEnd)
        {
            result += *pCurrent;

            pCurrent++;
        }

        return result;
    }
}

LoadAligned and LoadAlignedNonTemporal require the input to be aligned. Aligned reads and writes should be slightly faster but using them comes at a price of increased complexity. "NonTemporal" means that the hardware is allowed (but not required) to bypass the cache. Non-temporal reads provide a speedup when working with very large amounts of data as it avoids repeatedly filling the cache with values that will never be used again.

Currently .NET exposes only one API for allocating unmanaged aligned memory: NativeMemory.AlignedAlloc. In the future, we might provide a dedicated API for allocating managed, aligned and hence pinned memory buffers.

The alternative to creating aligned buffers (we don't always have the control over input) is to pin the buffer, find first aligned address, handle non-aligned elements, then start aligned loop and afterwards handle the remainder. Adding such complexity to our code may not always be worth it and needs to be proved with proper benchmarking on various hardware.

The fourth method expects only a managed reference (ref T source). We don't need to pin the buffer (GC is tracking managed references and updates them if memory gets moved), but it still requires us to properly handle managed pointer arithmetic:

int ManagedReferencesSum(int[] buffer)
{
    Debug.Assert(Vector128.IsHardwareAccelerated && buffer.Length >= Vector128<int>.Count);

    ref int current = ref MemoryMarshal.GetArrayDataReference(buffer);
    ref int end = ref Unsafe.Add(ref current, buffer.Length);
    ref int oneVectorAwayFromEnd = ref Unsafe.Subtract(ref end, Vector128<int>.Count);

    Vector128<int> sum = Vector128<int>.Zero;

    while (!Unsafe.IsAddressGreaterThan(ref current, ref oneVectorAwayFromEnd))
    {
        sum += Vector128.LoadUnsafe(ref current);

        current = ref Unsafe.Add(ref current, Vector128<int>.Count);
    }

    int result = Vector128.Sum(sum);

    while (Unsafe.IsAddressLessThan(ref current, ref end))
    {
        result += current;

        current = ref Unsafe.Add(ref current, 1);
    }

    return result;
}

Note: Unsafe does not expose a method called IsLessThanOrEqualTo, so we are using a negation of Unsafe.IsAddressGreaterThan to achieve desired effect.

Pointer arithmetic can always go wrong, even if you are an experienced engineer and get a very detailed code review from .NET architects. In #73768 a GC hole was introduced. The code looked simple:

ref TValue currentSearchSpace = ref Unsafe.Add(ref searchSpace, length - Vector128<TValue>.Count);

do
{
    equals = Vector128.Equals(values, Vector128.LoadUnsafe(ref currentSearchSpace));
    if (equals == Vector128<TValue>.Zero)
    {
        currentSearchSpace = ref Unsafe.Subtract(ref currentSearchSpace, Vector128<TValue>.Count);
        continue;
    }

    return ...;
}
while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace));

It was part of LastIndexOf implementation, where we were iterating from the end to the beginning of the buffer. In the last iteration of the loop, currentSearchSpace could become a pointer to unknown memory that lied before the beginning of the buffer:

currentSearchSpace = ref Unsafe.Subtract(ref currentSearchSpace, Vector128<TValue>.Count);

And it was fine until GC kicked right after that, moved objects in memory, updated all valid managed references and resumed the execution, which run following condition:

while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace));

Which could return true because currentSearchSpace was invalid and not updated. If you are interested in more details, you can check the issue and the fix.

That is why we recommend using the overload that takes a managed reference and an element offset. It does not require pinning or doing any pointer arithmetic. It still requires care as passing an incorrect offset results in a GC hole.

public static Vector128<T> LoadUnsafe<T>(ref T source, nuint elementOffset) where T : struct

The only thing we need to keep in mind is potential nuint overflow when doing unsigned integer arithmetic.

Span<int> buffer = new int[2] { 1, 2 };
nuint oneVectorAwayFromEnd = (nuint)(buffer.Length - Vector128<int>.Count);
Console.WriteLine(oneVectorAwayFromEnd);

Can you guess the result? For a 64 bit process it's FFFFFFFFFFFFFFFE (a hex representation of 18446744073709551614)! That is why the length of the buffer needs to be always checked before doing similar computations!

Storing

Similarly to loading, both Vector128 and Vector256 provide at least five ways of storing them in memory:

public static class Vector128
{
    public static void Store<T>(this Vector128<T> source, T* destination) where T : unmanaged
    public static void StoreAligned<T>(this Vector128<T> source, T* destination) where T : unmanaged
    public static void StoreAlignedNonTemporal<T>(this Vector128<T> source, T* destination) where T : unmanaged
    public static void StoreUnsafe<T>(this Vector128<T> source, ref T destination) where T : struct
    public static void StoreUnsafe<T>(this Vector128<T> source, ref T destination, nuint elementOffset) where T : struct
}

For the reasons described for loading, we recommend using the overload that takes managed reference and element offset:

public static void StoreUnsafe<T>(this Vector128<T> source, ref T destination, nuint elementOffset) where T : struct

Note: when loading values from one buffer and storing them into another, we need to consider whether they overlap or not. MemoryExtensions.Overlap is an API for doing that.

Casting

As mentioned before, Vector128<T> and Vector256<T> are constrained to a specific set of primitive types. Currently, char is not one of them, but it does not mean that we can't implement vectorized text operations with the new APIs. For primitive types of the same size (and value types that don't contain references), casting is the solution.

Unsafe.As<TFrom, TTo> can be used to get a reference to supported type:

void CastingReferences(Span<char> buffer)
{
    ref char charSearchSpace = ref MemoryMarshal.GetReference(buffer);
    ref short searchSpace = ref Unsafe.As<char, short>(ref charSearchSpace);
    // from now on we can use Vector128<short> or Vector256<short>
}

Or MemoryMarshal.Cast<TFrom, TTo>, which casts a span of one primitive type to a span of another primitive type:

void CastingSpans(Span<char> chars)
{
    Span<short> shorts = MemoryMarshal.Cast<char, short>(chars);
}

It's also possible to get managed references from unmanaged pointers:

void PointerToReference(char* pUtf16Buffer, byte* pAsciiBuffer)
{
    // of the same type:
    ref byte asciiBuffer = ref *pAsciiBuffer;
    // of different types:
    ref ushort utf16Buffer = ref *(ushort*)pUtf16Buffer;
}

It's only safe to convert a managed reference to a pointer if it's known that the reference is already pinned. If it's not, the moment after you get the pointer it could be invalid.

Mindset

Vectorizing real-world algorithms seems complex at the beginning. And what do software engineers do with complex problems? We break them down into sub-problems until these become simple enough to be solved directly.

Let's implement a vectorized method for checking whether a given byte buffer consists only from valid ASCII characters to see how similar problems can be solved.

Edge cases

Before we start working on the implementation, let's list all edge cases for our IsAcii(ReadOnlySpan<byte> buffer) method (and ideally write tests):

It does not need to throw any argument exceptions, as ReadOnlySpan is struct and it can never be null or invalid.
It should return true for an empty buffer.
It should detect invalid characters in the entire buffer, regardless of the buffer's length or whether its length is an even multiple of a vector width.
It should not read any bytes that don't belong to the provided buffer.

Scalar solution

Once we know all edge cases, we need to understand our problem and find a scalar solution.

ASCII characters are values in the range from 0 to 127 (inclusive). It means that we can find invalid ASCII bytes by just searching for values that are larger than 127. If we treat byte (unsigned, range from 0 to 255) as sbyte (signed, range from -128 to 127), it's a matter of performing "is less than zero" check.

The binary representation of 0-127 range is following:

00000000
01111111
^
most significant bit

When we look at it, we can realize that another way is checking whether the most significant bit is equal 1. For the scalar version, we could perform a logical AND:

bool IsValidAscii(byte c) => (c & 0b1000_0000) == 0;

Vectorized solution

Another step is vectorizing our scalar solution and choosing the best way of doing that based on data.

If we reuse one of the loops presented in the previous sections, all we need to implement is a method that accepts Vector128<byte> and returns bool and does exactly the same thing that our scalar method did, but for a vector rather than single value:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
bool IsValidAscii(Vector128<byte> vector)
{
    // to perform "> 127" check we can use GreaterThanAny method:
    return !Vector128.GreaterThanAny(vector, Vector128.Create((byte)127))
    // to perform "< 0" check, we need to use AsSByte and LessThanAny methods:
    return !Vector128.LessThanAny(vector.AsSByte(), Vector128<sbyte>.Zero)
    // to perform an AND operation, we need to use & operator
    return (vector & Vector128.Create((byte)0b_1000_0000)) == Vector128<byte>.Zero;
    // we can also just use ExtractMostSignificantBits method:
    return vector.ExtractMostSignificantBits() == 0;
}

We can also use the hardware-specific instructions if they are available:

if (Sse41.IsSupported)
{
    return Sse41.TestZ(vector, Vector128.Create((byte)0b_1000_0000));
}
else if (AdvSimd.Arm64.IsSupported)
{
    Vector128<byte> maxBytes = AdvSimd.Arm64.MaxPairwise(vector, vector);
    return (maxBytes.AsUInt64().ToScalar() & 0x8080808080808080) == 0;
}

Benchmark all available solutions, and choose the one that is the best for us.

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1413/22H2/2022Update/SunValley2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=8.0.100-alpha.1.22558.1
  [Host]   : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX2

|                     Method | Size |      Mean | Ratio | Code Size |
|--------------------------- |----- |----------:|------:|----------:|
|                     Scalar | 1024 | 252.13 ns |  1.00 |      69 B |
|             GreaterThanAny | 1024 |  32.49 ns |  0.13 |     178 B |
|                LessThanAny | 1024 |  29.33 ns |  0.12 |     146 B |
|                        And | 1024 |  26.13 ns |  0.10 |     138 B |
|                      TestZ | 1024 |  27.26 ns |  0.11 |     129 B |
| ExtractMostSignificantBits | 1024 |  27.33 ns |  0.11 |     141 B |

Even such a simple problem can be solved in at least 5 different ways and each of them can perform significantly different on different hardware. Using sophisticated hardware-specific instructions does not always provide the best performance, so with the new Vector128 and Vector256 APIs we don't need to become assembly language experts to write fast, vectorized code.

Tool-Chain

Vector128, Vector128<T>, Vector256 and Vector256<T> expose a LOT of APIs. We are constrained by time, so we won't describe all of them with examples. Instead, we have grouped them into categories to give you an overview of their capabilities. It's not required to remember what each of these methods is doing, but it's important to remember what kind of operations they allow for and check the details when needed.

Note: all of these methods have "software fallbacks", which are executed when they cannot be vectorized on given platform.

Creation

Each of the vector types provides a Create method that accepts a single value and returns a vector with all elements initialized to this value.

public static Vector128<T> Create<T>(T value) where T : struct

CreateScalar initializes first element to the specified value, and the remaining elements to zero.

public static Vector128<int> CreateScalar(int value)

CreateScalarUnsafe is similar, but the remaining elements are left uninitialized. It's dangerous!

We also have an overload that allows for specifying every value in given vector:

public static Vector128<short> Create(short e0, short e1, short e2, short e3, short e4, short e5, short e6, short e7)

And last but not least we have a Create overload which accepts a buffer. It creates a vector with its elements set to the first VectorXYZ<T>.Count elements of the buffer. It's not recommended to use it in a loop, where Load methods should be used instead (for performance).

public static Vector128<T> Create<T>(ReadOnlySpan<T> values) where T : struct

to perform a copy in the other direction, we can use one of the CopyTo extension methods:

public static void CopyTo<T>(this Vector128<T> vector, Span<T> destination) where T : struct

Bit operations

All size-specific vector types provide a set of APIs for common bit operations.

BitwiseAnd computes the bitwise-and of two vectors, BitwiseOr computes the bitwise-or of two vectors. They can both be expressed by using the corresponding operators (& and |). The same goes for Xor which can be expressed with ^ operator and Negate (~).

Note: The operators should be preferred where possible, as it helps avoid bugs around operator precedence and can improve readability.

public static Vector128<T> BitwiseAnd<T>(Vector128<T> left, Vector128<T> right) where T : struct => left & right;
public static Vector128<T> BitwiseOr<T>(Vector128<T> left, Vector128<T> right) where T : struct => left | right;
public static Vector128<T> Xor<T>(Vector128<T> left, Vector128<T> right) => left ^ right;
public static Vector128<T> Negate<T>(Vector128<T> vector) => ~vector;

AndNot computes the bitwise-and of a given vector and the ones' complement of another vector.

public static Vector128<T> AndNot<T>(Vector128<T> left, Vector128<T> right) => left & ~right;

ShiftLeft shifts each element of a vector left by the specified number of bits. ShiftRightArithmetic performs a signed shift right and ShiftRightLogical performs an unsigned shift:

public static Vector128<sbyte> ShiftLeft(Vector128<sbyte> vector, int shiftCount) => vector << shiftCount;
public static Vector128<sbyte> ShiftRightArithmetic(Vector128<sbyte> vector, int shiftCount) => vector >> shiftCount;
public static Vector128<byte> ShiftRightLogical(Vector128<byte> vector, int shiftCount) => vector >>> shiftCount;

Equality

EqualsAll compares two vectors to determine if all elements are equal. EqualsAny compares two vectors to determine if any elements are equal.

public static bool EqualsAll<T>(Vector128<T> left, Vector128<T> right) where T : struct => left == right;
public static bool EqualsAny<T>(Vector128<T> left, Vector128<T> right) where T : struct

Equals compares two vectors to determine if they are equal on a per-element basis. It returns a vector whose elements are all-bits-set or zero, depending on whether the corresponding elements in the left and right arguments were equal.

public static Vector128<T> Equals<T>(Vector128<T> left, Vector128<T> right) where T : struct

How do we calculate the index of the first match? Let's take a closer look at the result of following equality check:

Vector128<int> left = Vector128.Create(1, 2, 3, 4);
Vector128<int> right = Vector128.Create(0, 0, 3, 0);
Vector128<int> equals = Vector128.Equals(left, right);
Console.WriteLine(equals);

<0, 0, -1, 0>

-1 is just 0xFFFFFFFF (all-bits-set). We could use GetElement to get the first non-zero element.

public static T GetElement<T>(this Vector128<T> vector, int index) where T : struct

But it would not be an optimal solution. We should instead extract the most significant bits:

uint mostSignificantBits = equals.ExtractMostSignificantBits();
Console.WriteLine(Convert.ToString(mostSignificantBits, 2).PadLeft(32, '0'));

00000000000000000000000000000100

and use BitOperations.TrailingZeroCount or uint.TrailingZeroCount (introduced in .NET 7) to get the trailing zero count.

To calculate the last index, we should use BitOperations.LeadingZeroCount or uint.LeadingZeroCount (introduced in .NET 7). But the returned value needs to be subtracted from 31 (32 bits in an unit, indexed from 0).

If we were working with a buffer loaded from memory (example: searching for the last index of a given character in the buffer) both results would be relative to the elementOffset provided to the Load method that was used to load the vector from the buffer.

int ComputeLastIndex<T>(nint elementOffset, Vector128<T> equals) where T : struct
{
    uint mostSignificantBits = equals.ExtractMostSignificantBits();

    int index = 31 - BitOperations.LeadingZeroCount(mostSignificantBits); // 31 = 32 (bits in UInt32) - 1 (indexing from zero)

    return (int)elementOffset + index;
}

If we were using the Load overload that takes only the managed reference, we could use Unsafe.ByteOffset(ref T, ref T) to calculate the element offset.

unsafe int ComputeFirstIndex<T>(ref T searchSpace, ref T current, Vector128<T> equals) where T : struct
{
    int elementOffset = (int)Unsafe.ByteOffset(ref searchSpace, ref current) / sizeof(T);

    uint mostSignificantBits = equals.ExtractMostSignificantBits();
    int index = BitOperations.TrailingZeroCount(mostSignificantBits);
    
    return elementOffset + index;
}

Comparison

Beside equality checks, vector APIs allow for comparison. The bool-returning overloads return true when the given condition is true:

public static bool GreaterThanAll<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static bool GreaterThanAny<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static bool GreaterThanOrEqualAll<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static bool GreaterThanOrEqualAny<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static bool LessThanAll<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static bool LessThanAny<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static bool LessThanOrEqualAll<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static bool LessThanOrEqualAny<T>(Vector128<T> left, Vector128<T> right) where T : struct

Similarly to Equals, vector-returning overloads return a vector whose elements are all-bits-set or zero, depending on whether the corresponding elements in left and right meet the given condition.

public static Vector128<T> GreaterThan<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static Vector128<T> GreaterThanOrEqual<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static Vector128<T> LessThan<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static Vector128<T> LessThanOrEqual<T>(Vector128<T> left, Vector128<T> right) where T : struct

ConditionalSelect Conditionally selects a value from two vectors on a bitwise basis.

public static Vector128<T> ConditionalSelect<T>(Vector128<T> condition, Vector128<T> left, Vector128<T> right)
    => (left & condition) | (right & ~condition);

This method deserves a self-describing example:

Vector128<float> left = Vector128.Create(1.0f, 2, 3, 4);
Vector128<float> right = Vector128.Create(4.0f, 3, 2, 1);

Vector128<float> result = Vector128.ConditionalSelect(Vector128.GreaterThan(left, right), left, right);

Assert.Equal(Vector128.Create(4.0f, 3, 3, 4), result);

Math

Very simple math operations can be also expressed by using the operators. The operators should be preferred where possible, as it helps avoid bugs around operator precedence and can improve readability.

public static Vector128<T> Add<T>(Vector128<T> left, Vector128<T> right) where T : struct => left + right;
public static Vector128<T> Divide<T>(Vector128<T> left, Vector128<T> right) => left / right;
public static Vector128<T> Divide<T>(Vector128<T> left, T right) => left / right;
public static Vector128<T> Multiply<T>(Vector128<T> left, Vector128<T> right) => left * right;
public static Vector128<T> Multiply<T>(Vector128<T> left, T right) => left * right;
public static Vector128<T> Subtract<T>(Vector128<T> left, Vector128<T> right) => left - right;

Note: Some of the methods accept a single value as the second argument.

Abs, Ceiling, Floor, Max, Min, Sqrt and Sum are also provided:

public static Vector128<T> Abs<T>(Vector128<T> vector) where T : struct
public static Vector128<double> Ceiling(Vector128<double> vector)
public static Vector128<float> Ceiling(Vector128<float> vector)
public static Vector128<double> Floor(Vector128<double> vector)
public static Vector128<float> Floor(Vector128<float> vector)
public static Vector128<T> Max<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static Vector128<T> Min<T>(Vector128<T> left, Vector128<T> right) where T : struct
public static Vector128<T> Sqrt<T>(Vector128<T> vector) where T : struct
public static T Sum<T>(Vector128<T> vector) where T : struct

Conversion

Vector types provide a set of methods dedicated to number conversions:

public static unsafe Vector128<double> ConvertToDouble(Vector128<long> vector)
public static unsafe Vector128<double> ConvertToDouble(Vector128<ulong> vector)
public static unsafe Vector128<int> ConvertToInt32(Vector128<float> vector)
public static unsafe Vector128<long> ConvertToInt64(Vector128<double> vector)
public static unsafe Vector128<float> ConvertToSingle(Vector128<int> vector)
public static unsafe Vector128<float> ConvertToSingle(Vector128<uint> vector)
public static unsafe Vector128<uint> ConvertToUInt32(Vector128<float> vector)
public static unsafe Vector128<ulong> ConvertToUInt64(Vector128<double> vector)

And for reinterpretation (no values are being changed, they can be just used as if they were of a different type):

public static Vector128<TTo> As<TFrom, TTo>(this Vector128<TFrom> vector)
public static Vector128<byte> AsByte<T>(this Vector128<T> vector)
public static Vector128<double> AsDouble<T>(this Vector128<T> vector)
public static Vector128<short> AsInt16<T>(this Vector128<T> vector)
public static Vector128<int> AsInt32<T>(this Vector128<T> vector)
public static Vector128<long> AsInt64<T>(this Vector128<T> vector)
public static Vector128<nint> AsNInt<T>(this Vector128<T> vector)
public static Vector128<nuint> AsNUInt<T>(this Vector128<T> vector)
public static Vector128<sbyte> AsSByte<T>(this Vector128<T> vector)
public static Vector128<float> AsSingle<T>(this Vector128<T> vector)
public static Vector128<ushort> AsUInt16<T>(this Vector128<T> vector)
public static Vector128<uint> AsUInt32<T>(this Vector128<T> vector)
public static Vector128<ulong> AsUInt64<T>(this Vector128<T> vector)

Widening and Narrowing

The first half of every vector is called "lower", the second is "upper".

------------------------------128-bits---------------------------
|           LOWER               |             UPPER             |
-----------------------------------------------------------------
|      32       |      32       |      32       |      32       |
----------------------------------------------------------------|
|  16   |  16   |  16   |  16   |  16   |  16   |  16   |  16   |
-----------------------------------------------------------------
| 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
-----------------------------------------------------------------

In case of Vector128, GetLower gets the value of the lower 64-bits as a new Vector64<T> and GetUpper gets the upper 64-bits.

public static Vector64<T> GetLower<T>(this Vector128<T> vector)
public static Vector64<T> GetUpper<T>(this Vector128<T> vector)

Each vector type provides a Create method that allows for the creation from lower and upper:

public static unsafe Vector128<byte> Create(Vector64<byte> lower, Vector64<byte> upper)
public static Vector256<byte> Create(Vector128<byte> lower, Vector128<byte> upper)

Lower and Upper are also used by Widen. This method widens a Vector128<T1> into two Vector128<T2> where sizeof(T2) == 2 * sizeof(T1).

public static unsafe (Vector128<ushort> Lower, Vector128<ushort> Upper) Widen(Vector128<byte> source)
public static unsafe (Vector128<int> Lower, Vector128<int> Upper) Widen(Vector128<short> source)
public static unsafe (Vector128<long> Lower, Vector128<long> Upper) Widen(Vector128<int> source)
public static unsafe (Vector128<short> Lower, Vector128<short> Upper) Widen(Vector128<sbyte> source)
public static unsafe (Vector128<double> Lower, Vector128<double> Upper) Widen(Vector128<float> source)
public static unsafe (Vector128<uint> Lower, Vector128<uint> Upper) Widen(Vector128<ushort> source)
public static unsafe (Vector128<ulong> Lower, Vector128<ulong> Upper) Widen(Vector128<uint> source)

It's also possible to widen only the lower or upper part:

public static Vector128<ushort> WidenLower(Vector128<byte> source)
public static Vector128<ushort> WidenUpper(Vector128<byte> source)

An example of widening is converting a buffer of ASCII bytes into characters:

byte[] byteBuffer = Enumerable.Range('A', 128 / 8).Select(i => (byte)i).ToArray();
Vector128<byte> byteVector = Vector128.Create(byteBuffer);
Console.WriteLine(byteVector);
(Vector128<ushort> Lower, Vector128<ushort> Upper) = Vector128.Widen(byteVector);
Console.Write(Lower.AsByte());
Console.WriteLine(Upper.AsByte());

Vector256<ushort> ushortVector = Vector256.Create(Lower, Upper);
Span<ushort> ushortBuffer = stackalloc ushort[256 / 16];
ushortVector.CopyTo(ushortBuffer);
Span<char> charBuffer = MemoryMarshal.Cast<ushort, char>(ushortBuffer);
Console.WriteLine(new string(charBuffer));

<65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80>
<65, 0, 66, 0, 67, 0, 68, 0, 69, 0, 70, 0, 71, 0, 72, 0><73, 0, 74, 0, 75, 0, 76, 0, 77, 0, 78, 0, 79, 0, 80, 0>
ABCDEFGHIJKLMNOP

Narrow is the opposite of Widen.

public static unsafe Vector128<float> Narrow(Vector128<double> lower, Vector128<double> upper)
public static unsafe Vector128<sbyte> Narrow(Vector128<short> lower, Vector128<short> upper)
public static unsafe Vector128<short> Narrow(Vector128<int> lower, Vector128<int> upper)
public static unsafe Vector128<int> Narrow(Vector128<long> lower, Vector128<long> upper)
public static unsafe Vector128<byte> Narrow(Vector128<ushort> lower, Vector128<ushort> upper)
public static unsafe Vector128<ushort> Narrow(Vector128<uint> lower, Vector128<uint> upper)
public static unsafe Vector128<uint> Narrow(Vector128<ulong> lower, Vector128<ulong> upper)

In contrast to Sse2.PackUnsignedSaturate and AdvSimd.Arm64.UnzipEven, Narrow applies a mask via AND to cut anything above the max value of returned vector:

Vector256<ushort> ushortVector = Vector256.Create((ushort)300);
Console.WriteLine(ushortVector);
unchecked { Console.WriteLine((byte)300); }
Console.WriteLine(300 & byte.MaxValue);
Console.WriteLine(Vector128.Narrow(ushortVector.GetLower(), ushortVector.GetUpper()));

if (Sse2.IsSupported)
{
    Console.WriteLine(Sse2.PackUnsignedSaturate(ushortVector.GetLower().AsInt16(), ushortVector.GetUpper().AsInt16()));
}

<300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300>
44
44
<44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44>
<255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255>

Shuffle

Shuffle creates a new vector by selecting values from an input vector using a set of indices (values that represent indexes of the input vector).

public static Vector128<int> Shuffle(Vector128<int> vector, Vector128<int> indices)
public static Vector128<uint> Shuffle(Vector128<uint> vector, Vector128<uint> indices)
public static Vector128<float> Shuffle(Vector128<float> vector, Vector128<int> indices)
public static Vector128<long> Shuffle(Vector128<long> vector, Vector128<long> indices)
public static Vector128<ulong> Shuffle(Vector128<ulong> vector, Vector128<ulong> indices)
public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long> indices)

It can be used for many things, including reversing the input:

Vector128<int> intVector = Vector128.Create(100, 200, 300, 400);
Console.WriteLine(intVector);
Console.WriteLine(Vector128.Shuffle(intVector, Vector128.Create(3, 2, 1, 0)));

<100, 200, 300, 400>
<400, 300, 200, 100>

Vector256.Shuffle vs Avx2.Shuffle

Vector256.Shuffle and Avx2.Shuffle are not identical.

Avx2.Shuffle is effectively 2x128-bit ops while Vector256.Shuffle treats it as a "single 256-bit vector" (rather than "2x128-bit vectors"). This was done for consistency and to better map to a cross-platform mentality where AVX-512 and SVE all operate on "full width".

Summary

The main goal of the new Vector128 and Vector256 APIs is to make writing fast, vectorized code possible without becoming familiar with hardware-specific instructions and becoming an assembly language expert. Our recommendations depend on your current expertise level, software you maintain and the one you need to create:

If you are already an expert and you have vectorized your code for both x64/x86 and arm64/arm code you can use the new APIs to simplify your code, but you most likely won't observe any performance gains. #64451 lists the places where it was/can be done in dotnet/runtime. You can use links to the merged PRs to see real-life examples.
If you have already vectorized your code, but only for x64/x86 or arm64/arm, you can use the new APIs to have a single, cross-platform implementation.
If you have already vectorized your code with Vector<T> you can use the new APIs to check if they can produce better code-gen.
If you are not familiar with hardware specific instructions or you are about to vectorize a scalar algorithm, you should start with the new Vector128 and Vector256 APIs. Get a solid and working implementation and eventually consider using hardware-specific methods for performance critical code paths.
Both managed references and unsafe pointers are dangerous to use incorrectly and each comes with their own tradeoff.

Best practices

Implement tests that cover all code paths, including Access Violations.
Run tests for all hardware acceleration scenarios, use the existing environment variables to do that.
Implement benchmarks that mimic real life scenarios, do not increase the complexity of your code when it's not beneficial for your end users.
Use ref MemoryMarshal.GetReference(span) instead ref span[0] and ref MemoryMarshal.GetArrayDataReference(array) instead ref array[0] to handle empty buffers correctly.
Prefer LoadUnsafe(ref T, nuint elementOffset) and StoreUnsafe(this Vector128<T> source, ref T destination, nuint elementOffset) over other methods for loading and storing vectors as they avoid pinning and the need of doing pointer arithmetic. Be aware of unsigned integer overflow!
Always handle the vectorized loop remainder.
When storing values in memory, be aware of a potential buffer overlap.
When writing a vectorized algorithm, start with writing the tests for edge cases, then implement a scalar solution and afterwards try to express what the scalar code is doing with Vector128/256 APIs.
Vector types provide APIs for creating, loading, storing, comparing, converting, reinterpreting, widening, narrowing and shuffling vectors. It's also possible to perform equality checks, various bit and math operations. Don't try to memorize all the details, treat these APIs as a cookbook that you come back to when needed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vectorization-guidelines.md

vectorization-guidelines.md

Introduction to vectorization with Vector128 and Vector256

Code structure

Checking for Hardware Acceleration

Example Code Structure

Testing

Benchmarking

Custom config

Memory alignment

Enforcing memory alignment

Memory randomization

Loops

Scalar remainder handling

Vectorized remainder handling

Access violation (AV) testing

Loading and storing vectors

Loading

Storing

Casting

Mindset

Edge cases

Scalar solution

Vectorized solution

Tool-Chain

Creation

Bit operations

Equality

Comparison

Math

Conversion

Widening and Narrowing

Shuffle

Vector256.Shuffle vs Avx2.Shuffle

Summary

Best practices

Files

vectorization-guidelines.md

Latest commit

History

vectorization-guidelines.md

File metadata and controls

Introduction to vectorization with Vector128 and Vector256

Code structure

Checking for Hardware Acceleration

Example Code Structure

Testing

Benchmarking

Custom config

Memory alignment

Enforcing memory alignment

Memory randomization

Loops

Scalar remainder handling

Vectorized remainder handling

Access violation (AV) testing

Loading and storing vectors

Loading

Storing

Casting

Mindset

Edge cases

Scalar solution

Vectorized solution

Tool-Chain

Creation

Bit operations

Equality

Comparison

Math

Conversion

Widening and Narrowing

Shuffle

Vector256.Shuffle vs Avx2.Shuffle

Summary

Best practices