MDAnalysis · hmacdope · Nov 21, 2022 · Nov 27, 2022 · Nov 28, 2022 · Nov 28, 2022
diff --git a/_posts/2023-03-03-distopia.md b/_posts/2023-03-03-distopia.md
@@ -0,0 +1,358 @@
+---
+layout: post
+title: Introducing Distopia
+---
+
+Calculating Euclidian distances under periodic boundary conditions is a core component in both running and analysing molecular simulations. For example, the calculation of radial distribution functions, hydrogen bonds, structure factors, and many other common analyses require calculation of one or more pairwise Euclidian distances.
+
+The Euclidian distance formula under periodic boundary conditions can be expressed as:
+
+$$
+R = \sqrt{A}
+$$
+
+Many molecular mechanics engines have highly optimised code for calculating distances that often make use of specialised code paths and/or accelerators to achieve maximal performance. However this functionality is often heavily "baked-in" and is not able to be used as a standalone component.
+
+We at MDAnalysis wanted to bring some of this blazing speed to our distance calculations without sacrificing the modularity of MDAnalysis by introducing heavyweight dependencies on a simulation engine.
+
+## Enter Distopia ##
+
+Instead we developed a lightweight C++ / Python package **distopia** that uses explicit x86 SIMD vectorisation to accelerate calculation of Euclidian distances under periodic boundary conditions. The package is small and aims to provide both a Python and C++ layer to interface with if you would like to quickly calculate some distances! It is available on [PyPi][Distopia PyPi] and [conda forge][Distopia conda-forge] for x86 architectures if you would like to try it out. 
+
+Try one of:
+
+```bash
+pip install distopia
+```
+
+or,
+
+```bash
+conda install -c conda-forge distopia
+```
+
+You can also see distopia under development at its home on [github][Distopia github].
+
+## What advantages does Distopia provide? ##
+
+### Distopia is fast! ###
+
+A comparison of distopia with several different approaches used in different simulation analysis packages is provided in the graph below. We compare with [MDAnalysis's][MDAnalysis] current distance backend, [MDTraj's][MDTraj] distance backend and [freud's][freud] distance backend.
+
+
+
+
+We can see that distopia is faster than the approaches taken in the other packages (especially MDAnalysis!). This is due to the extensive efforts taken to improve vectorisation, data alignment and use low-latency CPU operations. More details about how this is achieved can be found below.
+
+### Distopia is easy to use! ###
+
+Distopia is easy to use, with a simple, [well documented python layer][Distopia docs]. Calculating the pairwise distances between two NumPy arrays is as simple as.
+
+ ```python
+import distopia
+
+dists = distopia.calc_bonds_ortho_float(coords1, coords2, box) 
+ ```
+
+Distopia provides both `float` and `double` precision versions of functions to calculate distances and currently supports orthorhombic periodic boundary conditions, or no boundary conditions.
+
+### Distopia is open to the community! ###
+
+We don't intend the benchmarks in this blog post to draw attention to relative performance of one package versus another. Instead we aim to discuss the merits of different approaches and provide an easy to use layer that other packages can adopt if they so wish.
+
+We also welcome all kinds of feedback about distopia on our [github][Distopia github] or on the [MDAnalysis Discord][MDAnalysis discord].
+
+## How does distopia compare to existing approaches? ##
+
+Currently, many simulation analysis packages offer some kind of compiled layer for the numerically intensive job of calculating distances. MDAnalysis currently achieves this with a [C header wrapped in Cython][mda backend].
+
+Distopia uses **explicit SIMD vectorisation** along with a set of "array of structs" -> "struct of arrays" transforms to get the maximum speed out of the available CPU hardware.
+
+Lets break down what each of these mean:
+
+### SIMD vectorisation ###
+
+Optimising compilers aim to use the best possible set of CPU instructions, data pipelines and branch predictions to run code as fast as possible. Part of this involves the automatic compilation of code to use  **Single Instruction Multiple Data** instructions that broadcast an operation over multiple data elements using specialised registers. For example, an AVX SIMD enabled **multiply** may be able to work on a 8 32 bit floats at a time using a x86 `YMM` register introduced with the AVX SIMD instruction set. 
+
+The following C++ snippet compiled with `gcc 12` using the `-O3` and `-mavx2` flags shows this nicely. It computes the scalar product  $c = a * b$
+
+```c++
+void sax(float *c, float *a, float *b, int n) {
+    for (size_t i=0; i<n; i++) {
+        c[i] = a[i] * b[i];
+    }
+}
+
+```
+Resulting in the following assembly with only the core of the unrolled loop
+shown:
+
+```assembly
+        vmovups ymm1, YMMWORD PTR [rsi+rax] ; load data from a
+        vmulps  ymm0, ymm1, YMMWORD PTR [rdx+rax] ; load data from b
+        vmovups YMMWORD PTR [rdi+rax], ymm0 ; multiple
+        ... ; store the result
+```
+
+
+However the compiler cannot see all ends and is often incapable of identifying large scale data transformations that may result in better SIMD utilisation down the line. The MDAnalysis distance library is a good example of this, showing limited used of the SIMD registers that would result in better performance when compiled with SIMD compiler flags.
+
+
+```c++
+#include <cmath>
+void VanillaCalcDistances(const float *r_i, const float *r_j,
+                      unsigned int nvals, float *output)
+{
+  for (unsigned int i = 0; i < nvals; ++i)
+  {
+    float r2 = 0.0;
+    for (unsigned char j = 0; j < 3; ++j)
+    {
+      float rij = r_i[i * 3 + j] - r_j[i * 3 + j];
+      r2 += rij * rij;
+    }
+    *output++ = std::sqrt(r2);
+  }
+}
+
+```
+
+Which results in the following assembly with only one iteration of the core of the inner loop shown:
+
+```assembly
+            vmovss  xmm0, DWORD PTR [rdi+rax*4]  ; load data from r_i
+            vsubss  xmm0, xmm0, DWORD PTR [rsi+rax*4] ; subtract data from r_j
+            vmulss  xmm0, xmm0, xmm0 ; multiply by self to get the square
+            vaddss  xmm0, xmm0, xmm1 ; add to the running total
+            ; repeat twice more (loop unrolled 3 times total)
+
+            vsqrtss xmm0, xmm0, xmm0 ; take the square root of r2
+            vmovss  DWORD PTR [rcx-4], xmm0 ; store the result
+```
+
+Notably this failed to use the YMM registers at all. Additionally the compiler was forced to use the `vsqrtss` instruction which is a scalar instruction that only works on a single float at a time as it was unable to identify that the data **could** be aligned in a way that would allow it to use the vectorised sqrt instruction `vsqrtps`.`
+
+So what can we do to remedy this situation? There are two main possible paths that vary in difficulty.
+
+1. We can try and help the compiler by organising the data in a manner that might help it 
+
+2. We can use **explicit vectorisation** where we use C++ functions that directly embed the assembly instructions we want (or very close to) in the generated assembly. This allows us to have much more control over when and how data is transformed into SIMD constructs.
+
+Packages like `freud` use option 1, while `MDTraj` and `MDAnalysis` use option 2 but in different ways. 
+
+Lets explore each of these in slightly more depth:
+
+
+#### Option 1: Helping the compiler ####
+
+Our first strategy is to try and make the compiler's job easier by laying out the data in a way that it can more readily identify opportunities for vectorisation.
+The easiest way to show this is with an example. Consider the following code for computing distances taken from the `freud` simulation analysis package.
+
+```c++
+
+inline float computeDistance(vec3<float>& r_i, vec3<float>& r_j) {
+    const vec3<float> r_ij = wrap(r_j - r_i);
+    return std::sqrt(dot(r_ij, r_ij));
+}
+
+// where the vec3 datastructure is roughly
+
+template<class Real> struct vec3
+{
+    Real x {0};
+    Real y {0};
+    Real z {0};
+};
+
+template<class Real> inline vec3<Real> operator*(const vec3<Real>& a, const vec3<Real>& b) {
+
+    return vec3<Real>(a.x * b.x, a.y * b.y, a.z * b.z);
+}
+
+```
+
+Using the `vec3` datastructure enables the compiler to **automatically** generate more optimally vectorised code, as the operations on the `x`, `y` and `z` elements of the vectors can be separated into registers as operations on coordinates are paired.
+
+For example, `x` coordinates can easily be multiplied with `x` coordinates and `y` with `y` (in this case using operator overloading) resulting in more opportunities for vectorisation.
+
+
+This is sometimes called a Struct Of Arrays (SOA) data layout rather than an Array of Structs (AOS) layout and is best demonstrated by comparing the two example classes. 
+
+```c++
+
+class SOA {
+    // an array for each dimensions
+    float* x; // xxxxx...
+    float* y; // yyyyy...
+    float* z; // zzzzz...
+};
+
+class AOS {
+    // an array containing repeated coordinates
+    float* coordinates; //xyzxyzxyz.....
+};
+```
+
+This approach is a fantastic first step for optimising compiler vectorisation.  However, we can do more.
+
+
+#### Option 2: Explicit vectorisation ####
+
+Rather than relying on the compiler to do our vectorisation for us, we can use **SIMD intrinsics** to have a much finer grained control over vectorisation in our program. The general way SIMD intrinsics work is that a single C++ intrinsic function roughly corresponds to a single CPU instruction but does not require writing of assembly. For example we could write our `sax()` function from  above using AVX2 intrinsics  as such:
+
+```c++
+#include <immintrin.h>
+
+void sax(float *c, float *a, float *b, int n) {
+    for (size_t i=0; i<n; i+=8) {
+        __m256 a_ = _mm256_loadu_ps(a+i); // load 8 values into a 256 bit YMM register
+        __m256 b_ = _mm256_loadu_ps(b+i); // load 8 values into a 256 bit YMM register
+        __m256 c_ = _mm256_mul_ps(a_, b_); // multiply 2x YMM register
+        _mm256_store_ps(c + i,  c_); // store the result
+    }
+}
+
+```
+
+This results in the very compact assembly (only core loop shown)
+
+
+```assembly
+        vmovups ymm1, YMMWORD PTR [rsi+rax*4]
+        vmulps  ymm0, ymm1, YMMWORD PTR [rdx+rax*4]
+        vmovups YMMWORD PTR [rdi+rax*4], ymm0
+```
+
+
+You might notice that this is almost same assembly as the compiler generated for the `sax()` function
+above. This is because the compiler is able to recognise the pattern of operations and generate highly optimised assembly without help from us.
+
+So what about the case where the compiler is unable to generate the optimal assembly? In particular how can we apply explicit vectorisation in our code for calculating distances? We saw above that the compiler was unable to generate optimal assembly for the `VanillaCalcDistances()` function above.
+
+One way is to use SIMD intrinsics to perform our calculation using an **AOS** data layout.  This is the approach taken in the `MDTraj` library.
+
+Examine this annotated code snippet from the `MDTraj` library.
+
+```c++
+void dist_mic(const float* xyz, const int* pairs, const float* box_matrix,
+              float* distance_out, float* displacement_out,
+              const int n_frames, const int n_atoms, const int n_pairs) {
+
+    for (int i = 0; i < n_frames; i++) {
+        // Load the periodic box vectors etc ...
+        for (int j = 0; j < n_pairs; j++) {
+            // Compute the displacement.
+            int offset1 = 3*pairs[2*j + 0];
+             // load 3 values from an AOS xyz buffer
+            fvec4 pos1(xyz[offset1], xyz[offset1+1], xyz[offset1+2], 0);
+            int offset2 = 3*pairs[2*j + 1];
+            // load 3 values from an AOS xyz buffer.
+            fvec4 pos2(xyz[offset2], xyz[offset2+1], xyz[offset2+2], 0); 
+            fvec4 r12 = pos2-pos1;
+            r12 -= round(r12*inv_box_size)*box_size;
+            // Store results....
+            }
+        }
+    }
+```
+
+The `fvec4` class handles the explicit SIMD vectorisation and holds the coordinate values in an **AOS format**.
+
+```c++
+class fvec4 {
+public:
+    __m128 val; // holds xyz and one blank coordinate. (x, y, z, something)
+    // define all operators and functions
+
+};
+```
+
+This also provides some speed improvements over the compiler generated code. However, it is still suboptimal as a slot for a blank coordinate is wasted. Additionally the operations over matched dimensions
+are not separated into registers as they are in the **SOA** data layout. Additionally due to the loading of only three coordinate elements at a time (x,y,z) larger SIMD registers above SSE sizes (4 `float` coordinates) cannot be used to accelerate performance.
+
+## So how does distopia work? ##
+
+Distopia is a combination of the two approaches above. It uses the **SOA** data layout and explicitly
+vectorised operations to generate highly optimised assembly. This is based on packing the coordinates in each dimension into SIMD registers. This is best demonstrated by the `SOASimd` class.
+
+```c++
+template <typename simdT>
+class SOASimd {
+    // an array for each dimensions
+    simdT x; // xxxxx...
+    simdT y; // yyyyy...
+    simdT z; // zzzzz...
+
+
+// define operations and operator overloads
+operator*(const SOASimd<simdT>& a, const SOASimd<simdT>& b) {
+
+    return SOASimd<simdT>(a.x * b.x, a.y * b.y, a.z * b.z);
+}
+
+};
+
+```
+
+The `simdT` type is a template parameter that can be a SIMD register of any width.
+The downside to this approach is the overhead required to pack the coordinates into the SIMD registers
+in the correct order. This however is a one time cost and can be amortised by performing multiple
+vectorised operations on the coordinates. 
+
+This comines the strengths of the two approaches above allowing vectorised SIMD operations to be applied accross each dimension and allows the Distopia system to benefit from wider SIMD registers. 
+
+In this way the Distopia distance calculation can be written as such:
+
+```c++
+
+template <typename simdT>
+void DistopiaCalcDistances(const SOASimd<simdT> r_i, const SOASimd<simdT> r_j,
+                      unsigned int nvals, float *output) {
+    // calculate the displacement
+    SOASimd<simdT> r_ij = r_j - r_i;
+    // calculate the squared distance
+    SOASimd<simdT> r2 = r_ij * r_ij;
+    // sum the squared distance
+    simdT r2_sum = r2.x + r2.y + r2.z;
+    // take the square root
+    simdT r = simd_sqrt(r2_sum);
+    // store the result
+    r.store(output);
+```
+
+Notice that the subtraction, multiplication and addition operations are all applied to each dimension of the coordinates. The sqrt and store operations are also applied to the entire SIMD register, resulting in 
+fantastic performance. 
+
+
+## Conclusion
+
+In this post we have seen how we can use the `distopia` library to accelerate distance calculations. Distopia is available as a backend to some MDAnalysis distiance functions right now, with support for more to be added in the future.
+
+
+```python
+import MDAnalysis as mda
+from MDAnalysis.tests.datafiles import PSF, DCD
+
+u = mda.Universe(PSF, DCD)
+# calculate pairwise distances distance between two atom groups
+atomgroup1 = u.select_atoms("selection 1")
+atomgroup2 = u.select_atoms("selection 2")
+
+# calculate the distances using the distopia backend
+distopia_distances = mda.lib.distances.calc_bonds(atomgroup1, atomgroup2, backend="distopia")
+
+```
+
+Come try it out on the develop branch of `MDAnalysis`! We welcome community feedback so if you have any questions or comments, please join us on the [MDAnalysis discord] or open an issue on the [Distopia github].
+
+[Distopia github]: https://github.com/MDAnalysis/distopia
+[Distopia PyPi]: https://pypi.org/project/distopia/
+[Distopia conda-forge]: https://github.com/conda-forge/distopia-feedstock
+[Distopia docs]: https://www.mdanalysis.org/distopia
+[MDAnalysis]: https://www.mdanalysis.org
+[MDAnalysis discord]: https://www.mdanalysis.org/2021/03/20/discord/
+[MDTraj]: https://mdtraj.org/
+[freud]: https://freud.readthedocs.io/en/stable/
+[mda backend]: https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/lib/include/calc_distances.h
+
+