nrposner

When Vectorized Arrays Aren't Enough

Or, Zipped Arrays Rule Everything Around Me

I've been contributing on and off to the McFACTS simulation project for a few months now, and am currently in the process of testing and merging some pretty substantial optimizations to the core 'business logic' of the simulation, which mainly consists of long series of NumPy array operations.

Pretty much all the optimizations in the pipeline right now are basically identical, to the point that I've considered just writing a macro to write the code for me. I figured it would be worth condensing what I've learned in the last few months about how these optimizations work concretely, as well as the intuition behind them.

None of this is particularly arcane or hardware-specific, and while we will be getting into some x86 assembly at the end, it's mainly illustrative and helpful for validating the intuitions we're building.

This may be especially valuable if you're an author of scientific Python code, and you're struggling with slow-running programs despite being told in undergrad that NumPy arrays are really fast.

But first, a tortured metaphor:


Alice, Bob, and Carol are logging engineers in the forests of British Columbia. Day in and day out, they operate the TreeChopper 30000, an industrial miracle that renders raw material into construction-ready timber on the spot. However, there is one downside: the TreeChopper is large, heavy, and very difficult to move.

Alice presses a button, watches the TreeChopper rip apart a tree in five seconds flat, and then spends half an hour moving it to the next felled trunk. Ratio of effective work to transit: 1:360.

Bob presses the same button, and then hops into a truck to drag the next trunk over to the TreeChopper, taking just 5 minutes. Ratio: 1:60.

Carol has set up a parallel trucking operation that brings newly felled trunks continuously to the TreeChopper, piling up a backlog nearby, with just ten seconds of downtime between trunks. Ratio: 1:2.

In a laboratory in Toronto, Dave sketches the first designs for the TreeChopper 30000.0.1, with an extra-wide feeder lane that can process up to 8 tree trunks simultaneously.

Alice, Bob, and Carol will all use this new machine in exactly the same way they used the single-feed variant, with no change to their workflow.


What happens in the following code?

import numpy as np

a = np.array([...])
b = np.array([...])
c = np.array([...])
d = np.array([...])

e = a * b
f = e / c
g = f**d

If you've just finished Data Science 101, you'll say 'vectorized array operations!', and you would be correct. NumPy's array arithmetic provides far, far superior performance to the equivalent operations performed by iterating over lists, not to mention being shorter and easier to read.

a = [...]
b = [...]
c = [...]
d = [...]

e = [a*b for a, b in zip(a, b)]
f = [e/c for e, c in zip(e, c)]
g = [f**d for f, d in zip(f, d)]

Now, looking at this you might say: 'Why are you splitting up the steps like this? It's longer than it needs to be, and probably slower too!'

import numpy as np

a = np.array([...])
b = np.array([...])
c = np.array([...])
d = np.array([...])

g = ((a*b)/c)**d

By moving this all into one line, we've made it shorter, and probably faster, no?

The first part is surely true, but you may be surprised to learn the second is not, at least not meaningfully: they’re pretty much identical under the hood.
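You don't have to take my word for it; here's a quick sketch with timeit (the array size, iteration count, and function names are arbitrary, and the absolute numbers will vary by machine):

import timeit

import numpy as np

rng = np.random.default_rng(0)
a, b, c, d = (rng.random(1_000_000) + 0.5 for _ in range(4))

def one_liner():
    return ((a * b) / c) ** d

def three_liner():
    e = a * b
    f = e / c
    return f ** d

# Both versions dispatch the same three vectorized operations, so the
# timings come out essentially identical
print(timeit.timeit(one_liner, number=100))
print(timeit.timeit(three_liner, number=100))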

This is characteristic of NumPy's eager evaluation. Python evaluates the expression operator by operator, and the moment NumPy is handed a valid operation like a*b, it goes ahead and performs it and hands back a complete result array. A hypothetical 'lazy' evaluation would wait until it reached the end of the line, observe that the expression included multiple operations, and try to perform them all together - but that would be a much more complicated endeavor, and it's not NumPy's approach.

When we pass a line like

g = ((a*b)/c)**d

or

e = a * b
f = e / c
g = f**d

In both cases, Python breaks the work down into the same three vectorized array operations. Here's the disassembled output for the single-line version:

  9         178 LOAD_FAST                0 (a)
            180 LOAD_FAST                1 (b)
            182 BINARY_OP                5 (*)
            186 LOAD_FAST                2 (c)
            188 BINARY_OP               11 (/)
            192 LOAD_FAST                3 (d)
            194 BINARY_OP                8 (**)
            198 STORE_FAST               4 (g)

and here's the multi-line version.

 10         178 LOAD_FAST                0 (a)
            180 LOAD_FAST                1 (b)
            182 BINARY_OP                5 (*)
            186 STORE_FAST               4 (e)

 11         188 LOAD_FAST                4 (e)
            190 LOAD_FAST                2 (c)
            192 BINARY_OP               11 (/)
            196 STORE_FAST               5 (f)

 12         198 LOAD_FAST                5 (f)
            200 LOAD_FAST                3 (d)
            202 BINARY_OP                8 (**)
            206 STORE_FAST               6 (g)

The latter looks longer, but the difference comes entirely from four extra STORE_FAST / LOAD_FAST instructions to save and load the intermediary variables (which, as a bonus, remain available afterwards). Those operations do cost time, but only a few nanoseconds each.
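These listings come straight out of Python's built-in dis module. If you want to reproduce them, a minimal sketch looks like this (the exact byte offsets and variable indices will differ depending on what else lives in the enclosing function):

import dis

def one_liner(a, b, c, d):
    g = ((a * b) / c) ** d
    return g

def three_liner(a, b, c, d):
    e = a * b
    f = e / c
    g = f ** d
    return g

# prints the bytecode for each function
dis.dis(one_liner)
dis.dis(three_liner)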

In contrast, all the useful work is being done in the BINARY_OP instructions: what do they actually do?

First, BINARY_OP checks the type of the left operand. In our case, we'll assume that's always an np.ndarray[np.float64]. It looks up the appropriate slot in PyNumberMethods (nb_multiply for *, nb_true_divide for /, and nb_power for **) and calls it (for example, np.ndarray.__mul__()), which then checks the types of the left and right operands, performs other necessary steps such as checking that the dimensions can be broadcast, selects an np.ufunc loop, ALLOCATES THE OUTPUT ARRAY, and then actually goes and does the math element by element.
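To make that dispatch chain a bit more concrete, here's a small sketch showing that the operator, the dunder method, and the ufunc are all the same machinery underneath:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# BINARY_OP * -> nb_multiply -> ndarray.__mul__ -> the np.multiply ufunc:
# all three spellings run the same ufunc loop, and each call allocates
# a fresh output array
print(a * b)
print(a.__mul__(b))
print(np.multiply(a, b))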

The act of storing and loading a stack variable back to back will probably take about as much time as type checking the left operand--a minuscule amount of our total runtime, even if the array is very small.

By contrast, our pure list implementation of the same process would have to do all the same setup work per element, on top of chasing pointers for every element, whereas the array implementation just does the setup once per binary operation and has access to the elements in nice, evenly spaced, contiguous linear memory.

These are the hidden complexities that support a nice, simple syntax like g = ((a*b)/c)**d, and allow us to avoid a ridiculous amount of pointer chasing. For most purposes, it's fine. But when you need peak performance, it's far from sufficient.

Remember my hysterical capslock from a couple paragraphs ago? EVERY SINGLE BINARY_OP ALLOCATES AN ARRAY ON THE HEAP. That's a lot of responsibility to put on a single operator.
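A quick way to see those allocations for yourself (a sketch; np.shares_memory just confirms that every result lands in a brand-new buffer):

import numpy as np

a = np.ones(1_000_000)
b = np.ones(1_000_000)
c = np.ones(1_000_000)
d = np.ones(1_000_000)

e = a * b    # fresh ~8 MB buffer
f = e / c    # another fresh ~8 MB buffer
g = f ** d   # and a third

for result in (e, f, g):
    # none of the outputs reuse any of the input buffers
    assert not any(np.shares_memory(result, arr) for arr in (a, b, c, d))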

Clearly, what we'd like to do is fuse this into a single pass: something like

a_arr = np.array([...])
b_arr = np.array([...])
c_arr = np.array([...])
d_arr = np.array([...])

out_arr = np.zeros(len(a_arr))

for i, (a_scalar, b_scalar, c_scalar, d_scalar) in enumerate(zip(a_arr, b_arr, c_arr, d_arr)):
    out_arr[i] = ((a_scalar*b_scalar)/c_scalar)**d_scalar

We pre-allocate just one output array up front, iterate once in parallel over all four input arrays, and perform the inner calculation without any intermediary allocations.

This should work, right?

In theory, yes. In practice, as implemented above, it'll be waaaay slower than the vectorized operation on any but the smallest arrays.

What's happening under the hood here?

Every single iteration, we have to extract the individual elements: these are coming from contiguous arrays, so we benefit from caching here, but each C double needs to be wrapped in an np.float64 object, which is allocated on the heap.

What, you thought the known, fixed-size numeric type would be stored on the stack? You fool! As mentioned ~in another blog post~ last class, everything in Python is an O.B.J.E.C.T., which stands for Reference-Counted Heap Allocated C Struct! No, it's not a very good acronym, I don't know what Guido was thinking.
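You can watch the boxing happen (a sketch):

import numpy as np

a = np.array([1.0, 2.0, 3.0])

x = a[0]
print(type(x))       # <class 'numpy.float64'>: a full, reference-counted Python object
print(a[0] is a[0])  # False: every scalar access boxes a fresh object on the heap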

Then we have to do all the same binary operation setup we saw before, but on each element rather than once per array, because Python doesn't know or care that these values originally came from a NumPy array: we're firmly in native-Python loop territory here, which means all of NumPy's nice guarantees are void and its optimizations are inaccessible.

Then, once we've actually gone ahead and done the math, the result comes back as yet another heap-allocated object, which has to be unboxed and written into the output array - whereas the vectorized operation does all of this without any per-element intermediary allocation.

We've traded a handful of up-front, per-array allocations for a lot more tiny, per-element allocations.

Oh, and there's also Python loop overhead that doesn't exist for the vectorized operations, which are also capable of SIMD optimizations that the one-by-one Python loop is a dozen layers away from being able to perform.

This is downstream of a subject I've written about at length in the past: being simple is really complicated, and the simplicity of idiomatic Python is bought with layers upon layers of indirection, heap allocation, and runtime type checking, to the point that we need to buy back a user-facing array implementation with a package instead of having one in the standard library.

So what can we actually do to speed this up?

NumExpr

The NumExpr Python package provides a much less well-known alternative to NumPy's vectorized array operations. Rather than working on full arrays one operation at a time or looping over values individually, it chunks the arrays. And instead of writing the mathematical transformation as ordinary code, you pass it in as a string, which NumExpr compiles into a program for its own evaluation engine (it's not clear from the docs whether that program is evaluated eagerly like NumPy's or under some other scheme).

If your code actually decomposes into something like ((a*b)/c)**d, then NumExpr could very well be the right choice for you.
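Usage is about as simple as it gets; a minimal sketch (NumExpr picks the arrays up from the calling frame by name, or you can pass them explicitly via local_dict):

import numexpr as ne
import numpy as np

rng = np.random.default_rng(0)
a, b, c, d = (rng.random(1_000_000) + 0.5 for _ in range(4))

# one chunked (and multithreaded) pass over the inputs, with no full-size
# intermediate arrays materialized for a*b or (a*b)/c
g = ne.evaluate("((a * b) / c) ** d")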

But it comes with some downsides. First, this is all done in a separate virtual machine, which introduces some extra weight to the program and makes the process a lot less observable.

Second, if you're trying to accelerate a function that calls out to other functions in the middle, or otherwise performs operations not represented by straightforward mathematical expressions, you're out of luck. In those cases, you'll want a more general solution.

Numba/JAX

Numba is a JIT compiler for Python. Just take your existing code, slap an @njit on it, and watch that loop get fused automatically!

Unfortunately, I have never used Numba, because the last time I suggested using it my collaborator looked at me like I'd slapped her across the face with a salmon.

Luckily JAX exists, and does pretty much the same thing while being better in every way.

It can even run on the GPU!

It does require you to think in terms of immutable arrays, but I generally consider this a plus.
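A minimal sketch of what that looks like (jax.jit traces the function once and hands the whole expression to XLA, which is free to fuse it into a single pass; the function and array names here are just for illustration):

import jax
import jax.numpy as jnp

@jax.jit
def fused(a, b, c, d):
    # traced and compiled as one XLA computation: no NumPy-style
    # intermediate arrays between the three operations
    return ((a * b) / c) ** d

a = jnp.linspace(1.0, 2.0, 1_000_000)
b = jnp.linspace(2.0, 3.0, 1_000_000)
c = jnp.linspace(1.0, 3.0, 1_000_000)
d = jnp.linspace(0.5, 1.5, 1_000_000)

g = fused(a, b, c, d)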

But what if you need EVEN MORE performance (without running on the GPU)?

Screw It

Drop Python entirely and rewrite this in a low-level language.

An implementation in C might look something like this (for illustration only, it's not safe and I didn't check if it compiles):

static PyObject *_foo(PyObject *self, PyObject *args) {
    npy_intp npts = 0;
    npy_intp i = 0;

    // Py_objects for inputs and output
    PyObject *a_obj = NULL, *b_obj = NULL, *c_obj = NULL, *d_obj = NULL;
    PyObject *out_obj = NULL;
    // PyArray objects
    PyArrayObject *a_arr = NULL, *b_arr = NULL, *c_arr = NULL, *d_arr = NULL;
    PyArrayObject *out_arr = NULL;
    // Data pointers
    double *a_ptr, *b_ptr, *c_ptr, *d_ptr, *out_ptr;

    // Parse four input arguments
    if (!PyArg_ParseTuple(args, "OOOO", &a_obj, &b_obj, &c_obj, &d_obj)) {
        PyErr_SetString(PyExc_TypeError, "Error parsing input");
        return NULL;
    }

    // Build array objects as contiguous arrays of doubles
    a_arr = (PyArrayObject *)PyArray_FROM_OTF(a_obj, NPY_DOUBLE, NPY_ARRAY_IN_ARRAY);
    b_arr = (PyArrayObject *)PyArray_FROM_OTF(b_obj, NPY_DOUBLE, NPY_ARRAY_IN_ARRAY);
    c_arr = (PyArrayObject *)PyArray_FROM_OTF(c_obj, NPY_DOUBLE, NPY_ARRAY_IN_ARRAY);
    d_arr = (PyArrayObject *)PyArray_FROM_OTF(d_obj, NPY_DOUBLE, NPY_ARRAY_IN_ARRAY);
    if (!a_arr || !b_arr || !c_arr || !d_arr) {
        goto error;
    }

    // Get the array length (a real version would check that all four lengths match)
    npts = PyArray_DIM(a_arr, 0);

    // Build output array
    out_obj = PyArray_ZEROS(1, &npts, NPY_DOUBLE, 0);
    if (!out_obj) {
        goto error;
    }
    out_arr = (PyArrayObject *)out_obj;

    Py_BEGIN_ALLOW_THREADS
    for (i = 0; i < npts; i++) {
        a_ptr = PyArray_GETPTR1(a_arr, i);
        b_ptr = PyArray_GETPTR1(b_arr, i);
        c_ptr = PyArray_GETPTR1(c_arr, i);
        d_ptr = PyArray_GETPTR1(d_arr, i);
        out_ptr = PyArray_GETPTR1(out_arr, i);
        *out_ptr = pow(((*a_ptr * *b_ptr) / *c_ptr), *d_ptr);
    }
    Py_END_ALLOW_THREADS

    // Free input arrays
    if (a_arr) { Py_DECREF(a_arr); }
    if (b_arr) { Py_DECREF(b_arr); }
    if (c_arr) { Py_DECREF(c_arr); }
    if (d_arr) { Py_DECREF(d_arr); }

    return out_obj;

error:
    if (a_arr) { Py_DECREF(a_arr); }
    if (b_arr) { Py_DECREF(b_arr); }
    if (c_arr) { Py_DECREF(c_arr); }
    if (d_arr) { Py_DECREF(d_arr); }
    // out_arr is just a cast of out_obj, so only DECREF it once
    if (out_obj) { Py_DECREF(out_obj); }
    PyErr_SetString(PyExc_RuntimeError, "C error");
    return NULL;
}

It's not easy to tell at a glance unless you're experienced with C, but this is the same 'dumb loop' approach we tried earlier - except this time it works!

Unlike Python, which puts a lot of allocations and checks in the way of these same operations in order to avoid having to do any manual memory management, C does everything manually... at the cost of a lot more boilerplate and opportunities for things to go wrong. Real easy to forget to free something and cause a memory leak.

In practice, I'd just use Rust instead, which gives us rather more safety with considerably less boilerplate, though it does require just a bit of unsafe.

#[pyfunction]
pub fn foo<'py>(
    py: Python<'py>,
    // get array objects via numpy bindings
    a_arr: PyReadonlyArray1<f64>,
    b_arr: PyReadonlyArray1<f64>,
    c_arr: PyReadonlyArray1<f64>,
    d_arr: PyReadonlyArray1<f64>,
) -> Bound<'py, PyArray1<f64>> {

    // take arrays as read-only slices
    let a_slice = a_arr.as_slice().unwrap();
    let b_slice = b_arr.as_slice().unwrap();
    let c_slice = c_arr.as_slice().unwrap();
    let d_slice = d_arr.as_slice().unwrap();

    // create a new array of the right length and take it as a mutable slice
    let out_arr = unsafe { PyArray1::new(py, a_slice.len(), false)};
    let out_slice = unsafe {out_arr.as_slice_mut().unwrap()};

    // our dumb loop goes through element by element, fusing the 
    // math operations into a single step without intermediate allocations
    for (i, (((a, b), c), d)) in a_slice.iter()
        .zip(b_slice)
        .zip(c_slice)
        .zip(d_slice)
        .enumerate() {

        // do the math and load the result directly into the relevant index in the out slice
        out_slice[i] = ((a*b)/c).powf(*d);
    }
    // return the out array to Python
    out_arr
}

This is, in my view, a good deal more readable than the C code, and the intent is quite visible: take read-only slices of all the arrays, allocate one (1) new array of the same length in Python's heap, take a mutable slice of that new array, and then run our dumb-as-rocks loop, with the per-element math done on stack values without any intermediary allocations - and with some SIMD on top of that!

But don't take my word for it, let's read some assembly! Here's a simplified pure-Rust rendition of the same logic, so that Compiler Explorer doesn't complain:

pub fn foo(
    a_arr: &[f64],
    b_arr: &[f64],
    c_arr: &[f64],
    d_arr: &[f64],
) -> Vec<f64> {

    // note that we don't need to take a mutable slice here, because all these 
    // arrays are entirely in Rust-controlled memory.
    let mut out = vec![0.0; a_arr.len()];

    for (i, (((a, b), c), d)) in a_arr.iter()
        .zip(b_arr)
        .zip(c_arr)
        .zip(d_arr)
        .enumerate() {

        out[i] = ((a*b)/c).powf(*d);
    }
    out
}

In a dev build, this compiles to ~1700 lines of x86 assembly. With opt-level=3, it goes all the way down to 150! This was on Rust 1.93.0.

The vectorized inner loop is even shorter; here's the whole of it:

.LBB0_9:
        movupd  xmm1, xmmword ptr [r13 + r15]
        movupd  xmm0, xmmword ptr [rbx + r15]
        mulpd   xmm0, xmm1
        movupd  xmm1, xmmword ptr [rcx + r15]
        divpd   xmm0, xmm1
        movapd  xmmword ptr [rsp + 80], xmm0
        movups  xmm1, xmmword ptr [rdx + r15]
        movaps  xmmword ptr [rsp + 96], xmm1
        mov     r12, rbx
        mov     rbx, rdx
        call    rbp
        movapd  xmmword ptr [rsp + 32], xmm0
        movapd  xmm0, xmmword ptr [rsp + 80]
        unpckhpd        xmm0, xmm0
        movaps  xmm1, xmmword ptr [rsp + 96]
        movhlps xmm1, xmm1
        call    rbp
        mov     rdx, rbx
        mov     rbx, r12
        mov     rcx, qword ptr [rsp + 24]
        mov     rax, qword ptr [rsp]
        movapd  xmm1, xmmword ptr [rsp + 32]
        unpcklpd        xmm1, xmm0
        movupd  xmmword ptr [rax + r15], xmm1
        add     r15, 16
        cmp     r14, r15
        jne     .LBB0_9
        mov     rsi, qword ptr [rsp + 56]
        mov     rdi, qword ptr [rsp + 16]

Let's take this apart.

Here, we're loading 128 bits (2 64-bit floats) from arrays A and B with movupd (move unaligned packed doubles): note that xmmword is 128-bit.

movupd  xmm1, xmmword ptr [r13 + r15]   ; load 2 doubles from A
movupd  xmm0, xmmword ptr [rbx + r15]   ; load 2 doubles from B
mulpd   xmm0, xmm1                      ; xmm0 = a * b (packed)
movupd  xmm1, xmmword ptr [rcx + r15]   ; load 2 doubles from C
divpd   xmm0, xmm1                      ; xmm0 /= c (packed)

It then performs mulpd (multiply packed doubles) on the upper and lower doubles simultaneously, storing the result in xmm0, loads another 2 packed doubles from array C into xmm1, and performs divpd (divide packed doubles) on the same.

We're accomplishing the (a*b)/c step in just five assembly instructions, for two elements! This is obviously way better than our attempt at the same dumb loop in Python, since we're not performing so many implicit allocations each time, but it's also superior to NumPy's vectorized array operations on the same hardware: even though NumPy is also taking advantage of SIMD, it's evaluating eagerly and performing a whole heap allocation before it gets to temp /= c.

We also handle temp ** d inside the same loop, but it's not vectorized on this hardware.

Way earlier, we stored the address of the power function (libm's pow, which Rust's f64::powf calls into) in the rbp register:

mov     rbp, qword ptr [rip + pow@GOTPCREL]

Then we save the result of the previous operations to the function-local stack (we'll see why in a second) and set up for temp**d by loading two packed values from D onto that same stack. Some of these moves use the 'packed singles' instructions instead of the 'packed doubles' ones, but for plain moves they're equivalent, so it makes no difference here.

movapd  xmmword ptr [rsp + 80], xmm0    ; xmm0 to local stack (16 bytes)
movups  xmm1, xmmword ptr [rdx + r15]   ; load 2 doubles from D
movaps  xmmword ptr [rsp + 96], xmm1    ; ... and then put them on the local stack, bc x86 doesn't allow mem-to-mem moves
mov     r12, rbx                        ; a little more juggling to save into callee-managed registers
mov     rbx, rdx

Now that we've cleared things out, we can actually perform the power operation; but only on one scalar at a time.

call rbp is going to perform the 64-bit power operation... but since it's operating on 128-bit xmm registers, it just ignores everything beyond the first 64 bits in each. Once that's done, we push the result, stored in the lower 64 bits of xmm0, onto the stack; pull the old value of xmm0 that we stashed on the stack back in and use unpckhpd to move its high double into the low position; do the same for the exponent stored in xmm1; and call rbp a second time:

call    rbp                             ; xmm0[0] ** xmm1[0]
movapd  xmmword ptr [rsp + 32], xmm0    ; store result[0] on stack
movapd  xmm0, xmmword ptr [rsp + 80]    ; move previous temp back from stack
unpckhpd        xmm0, xmm0              ; shift higher bits to lower position
movaps  xmm1, xmmword ptr [rsp + 96]    ; do the same for the exponent in xmm1
movhlps xmm1, xmm1
call    rbp                             ; xmm0[1] ** xmm1[1]

From here, it's all saving the results in their proper place, resetting registers, and continuing the loop.

And to bring it all full circle, let's take another look at some of the bytecode generated by the original NumPy vectorized calculation:

def foo():
    a = np.array([...])
    b = np.array([...])
    c = np.array([...])
    d = np.array([...])

    return ((a*b)/c)**d

 12         178 LOAD_FAST                0 (a)
            180 LOAD_FAST                1 (b)
            182 BINARY_OP                5 (*)
            186 LOAD_FAST                2 (c)
            188 BINARY_OP               11 (/)
            192 LOAD_FAST                3 (d)
            194 BINARY_OP                8 (**)
            198 RETURN_VALUE

Load, load, op, load, op, load, op, return.

Conceptually, it's not so different from the assembly! But the assembly reflects commands to the hardware, each taking fractions of a nanosecond to complete and existing entirely on the stack, while the bytecode exists many layers of abstraction above the hardware, and each command hides multiple levels of pointer indirection, heap allocations, and type checks.

Whether you're working with NumPy or writing assembly directly, the language is an interface between you and the hardware doing what you want it to do: the more precisely 'what you want the code to do' is specified, the faster it can run.

So, how fast do these actually run?
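For reference, 'quick and dirty' here means something along the lines of the sketch below (function names are just for illustration); the Rust version was timed the same way through its compiled Python binding, and the per-element figures are total time divided by array length, so treat all of these as ballpark numbers for my machine.

import timeit

import numpy as np

N = 1_000_000
rng = np.random.default_rng(0)
a, b, c, d = (rng.random(N) + 0.5 for _ in range(4))

def vectorized():
    return ((a * b) / c) ** d

def list_comp():
    e = [x * y for x, y in zip(a, b)]
    f = [x / y for x, y in zip(e, c)]
    return [x ** y for x, y in zip(f, d)]

def dumb_loop():
    out = np.zeros(N)
    for i, (w, x, y, z) in enumerate(zip(a, b, c, d)):
        out[i] = ((w * x) / y) ** z
    return out

for fn in (vectorized, list_comp, dumb_loop):
    per_call = timeit.timeit(fn, number=5) / 5
    print(f"{fn.__name__}: {per_call / N * 1e9:.1f} ns/element")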

With some quick and dirty benchmarking, the basic NumPy implementation takes around 7.3 nanoseconds per element.

Doing the same with list comprehensions takes between 130 and 150 nanoseconds per element, around 20x slower.

The 'dumb loop' in Python sits around 170 ns per element, 23x slower. No good!

And the Rust version comes in at... ~7.7 ns per element, around 5% slower! Not what I expected!

Why might this be? Well, a couple reasons.

First, after experimenting with a different operation, doing just (a*b)/c without the exponentiation, the situation changes:

NumPy: 0.847 ns/element
Rust:  0.575 ns/element

The Rust implementation is now 47% faster. But more importantly, both are around 8-15 times faster without the exponentiation step. It turns out pow was taking anywhere from 80 to 95+% of the function's runtime!

This leads into reason 2: the actual hardware.

The assembly analysis we did earlier was on x86 hardware, since that's what Compiler Explorer offers, but I'm running on Apple Silicon, and NumPy has access to a lot of platform-specific SIMD optimizations here: running np.show_config() gives us:

"SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP",
      "ASIMDDP"
    ],
    "not found": [
      "ASIMDFHM"
    ]
  }

ASIMD is ARM's Advanced SIMD extension (the AArch64 evolution of NEON), which gives NumPy access to, among many other things, a vectorized power function.

Taking advantage of this superior integration with the hardware, especially when the power operation dominates everything else in the function, makes the straightforward NumPy implementation faster, despite the intermediate memory allocations.

Lesson: no matter how much you think about optimizations, you need to actually benchmark different solutions. Ideally, you can do it on the target hardware, though this might not be trivial in many cases.

Having established this for the trivial case, let's take a gander at an actual target for optimization from McFACTS:

def analytical_kick_velocity(
        mass_1,
        mass_2,
        spin_1,
        spin_2,
        spin_angle_1,
        spin_angle_2):
    """
    Compute the analytical gravitational wave recoil (kick) velocity for merging black hole binaries
    as in Akiba et al. 2024 (arXiv:2410.19881).

    Parameters
    ----------
    mass_1 : numpy.ndarray
        Mass [M_sun] of object 1 with :obj:`float` type
    mass_2 : numpy.ndarray
        Mass [M_sun] of object 2 with :obj:`float` type
    spin_1 : numpy.ndarray
        Spin magnitude [unitless] of object 1 with :obj:`float` type
    spin_2 : numpy.ndarray
        Spin magnitude [unitless] of object 2 with :obj:`float` type
    spin_angle_1 : numpy.ndarray
        Spin angle [radian] of object 1 with :obj:`float` type
    spin_angle_2 : numpy.ndarray
        Spin angle [radian] of object 2 with :obj:`float` type

    Returns
    -------
    v_kick : np.ndarray
        Kick velocity [km/s] of the remnant BH with :obj:`float` type
    """
    # As in Akiba et al 2024 Appendix A, mass_2 should be the more massive BH in the binary.
    mask = mass_1 <= mass_2

    m_1_new = np.where(mask, mass_1, mass_2) * u.solMass
    m_2_new = np.where(mask, mass_2, mass_1) * u.solMass
    spin_1_new = np.where(mask, spin_1, spin_2)
    spin_2_new = np.where(mask, spin_2, spin_1)
    spin_angle_1_new = np.where(mask, spin_angle_1, spin_angle_2)
    spin_angle_2_new = np.where(mask, spin_angle_2, spin_angle_1)

    # "perp" and "par" refer to components perpendicular and parallel to the orbital angular momentum axis, respectively.
    # Orbital angular momentum axis of binary is aligned with the disk angular momentum.
    # Find the perp and par components of spin:
    spin_1_par = spin_1_new * np.cos(spin_angle_1_new)
    spin_1_perp = spin_1_new * np.sin(spin_angle_1_new)
    spin_2_par = spin_2_new * np.cos(spin_angle_2_new)
    spin_2_perp = spin_2_new * np.sin(spin_angle_2_new)

    # Find the mass ratio q and asymmetric mass ratio eta
    # as defined in Akiba et al. 2024 Appendix A:
    q = m_1_new / m_2_new
    eta = q / (1 + q)**2

    # Use Akiba et al. 2024 eqn A5:
    S = (2 * (spin_1_new + q**2 * spin_2_new)) / (1 + q)**2

    # As defined in Akiba et al. 2024 Appendix A:
    xi = np.radians(145)
    A = 1.2e4 * u.km / u.s
    B = -0.93
    H = 6.9e3 * u.km / u.s
    V_11, V_A, V_B, V_C = 3678 * u.km / u.s, 2481 * u.km / u.s, 1793* u.km / u.s, 1507 * u.km / u.s
    angle = rng.uniform(0.0, 2*np.pi, size=len(mass_1))

    # Use Akiba et al. 2024 eqn A2:
    v_m = A * eta**2 * np.sqrt(1 - 4 * eta) * (1 + B * eta)

    # Use Akiba et al. 2024 eqn A3:
    v_perp = (H * eta**2 / (1 + q)) * (spin_2_par - q * spin_1_par)

    # Use Akiba et al. 2024 eqn A4:
    v_par = ((16 * eta**2) / (1 + q)) * (V_11 + (V_A * S) + (V_B * S**2) + (V_C * S**3)) * \
            np.abs(spin_2_perp - q * spin_1_perp) * np.cos(angle)

    # Use Akiba et al. 2024 eqn A1:
    v_kick = np.sqrt((v_m + v_perp * np.cos(xi))**2 +
                     (v_perp * np.sin(xi))**2 +
                     v_par**2)
    v_kick = np.array(v_kick.value)
    assert np.all(v_kick > 0), \
        "v_kick has values <= 0"
    assert np.isfinite(v_kick).all(), \
        "Finite check failure: v_kick"
    return v_kick

This is the kind of function that has the issues we've discussed before: a ton of vectorized array operations, all being eagerly evaluated and allocating tons of intermediary arrays.

We can take out the part of the function requiring a call to rng:

def analytical_kick_velocity_optimized(
    ...
):
    angle = rng.uniform(0.0, 2*np.pi, size=len(mass_1))

    v_kick = analytical_kick_velocity_helper(
        mass_1,
        mass_2,
        spin_1,
        spin_2,
        spin_angle_1,
        spin_angle_2,
        angle
    )

    return v_kick

And farm out the rest to a Rust-side helper function:

#[pyfunction]
pub fn analytical_kick_velocity_helper<'py>(
    py: Python<'py>,
    mass1_arr: PyReadonlyArray1<f64>,
    mass2_arr: PyReadonlyArray1<f64>,
    spin1_arr: PyReadonlyArray1<f64>,
    spin2_arr: PyReadonlyArray1<f64>,
    spin_angle1_arr: PyReadonlyArray1<f64>,
    spin_angle2_arr: PyReadonlyArray1<f64>,
    angle_arr: PyReadonlyArray1<f64>,
) -> FloatArray1<'py> {
    
    let m1_slice = mass1_arr.as_slice().unwrap();
    let m2_slice = mass2_arr.as_slice().unwrap();
    let s1_slice = spin1_arr.as_slice().unwrap();
    let s2_slice = spin2_arr.as_slice().unwrap();
    let sa1_slice = spin_angle1_arr.as_slice().unwrap();
    let sa2_slice = spin_angle2_arr.as_slice().unwrap();
    let angle_slice = angle_arr.as_slice().unwrap();

    let out_arr = unsafe { PyArray1::new(py, m1_slice.len(), false)};
    let out_slice = unsafe { out_arr.as_slice_mut().unwrap()};

    let xi: f64 = 145.0f64.to_radians();
    let const_a: f64 = 1.2e4;
    let const_b: f64 = -0.93;
    let const_h: f64 = 6.9e3;
    let v_11: f64 = 3678.0;
    let v_a: f64 = 2481.0;
    let v_b: f64 = 1793.0;
    let v_c: f64 = 1507.0;

    for (i, ((((((m1, m2), s1), s2), sa1), sa2), angle)) in m1_slice.iter()
        .zip(m2_slice)
        .zip(s1_slice)
        .zip(s2_slice)
        .zip(sa1_slice)
        .zip(sa2_slice)
        .zip(angle_slice)
        .enumerate() {

        // Handle the Swap (Akiba et al. Appendix A: mass_2 should be heavier)
        // We rebind to new locals to swap, purely on the stack
        let (loc_m1, loc_m2, loc_s1, loc_s2, loc_a1, loc_a2) = if m1 <= m2 {
            (m1, m2, s1, s2, sa1, sa2)
        } else {
            (m2, m1, s2, s1, sa2, sa1)
        };

        // Spin Components
        let (s1_sin, s1_cos) = loc_a1.sin_cos();
        let (s2_sin, s2_cos) = loc_a2.sin_cos();

        let s1_par = loc_s1 * s1_cos;
        let s1_perp = loc_s1 * s1_sin;
        let s2_par = loc_s2 * s2_cos;
        let s2_perp = loc_s2 * s2_sin;

        // Mass Ratios
        let q = loc_m1 / loc_m2;
        let q_sq = q * q;
        let one_plus_q = 1.0 + q;
        let one_plus_q_sq = one_plus_q * one_plus_q;
        
        let eta = q / one_plus_q_sq;
        let eta_sq = eta * eta;

        // Akiba Eq A5
        let s_big = (2.0 * (loc_s1 + q_sq * loc_s2)) / one_plus_q_sq;
        let s_big_sq = s_big * s_big;
        let s_big_cu = s_big_sq * s_big;

        // Akiba Eq A2 (v_m)
        let term_sqrt = (1.0 - 4.0 * eta).sqrt();
        let v_m = const_a * eta_sq * term_sqrt * (1.0 + const_b * eta);

        // Akiba Eq A3 (v_perp)
        let v_perp_mag = (const_h * eta_sq / one_plus_q) * (s2_par - q * s1_par);

        // Akiba Eq A4 (v_par)
        // Note: Python code used np.abs(spin_2_perp - q * spin_1_perp)
        let term_v = v_11 + (v_a * s_big) + (v_b * s_big_sq) + (v_c * s_big_cu);
        let spin_diff_perp = (s2_perp - q * s1_perp).abs();

        let v_par = ((16.0 * eta_sq) / one_plus_q) * term_v * spin_diff_perp * angle.cos();

        // Akiba Eq A1 (Total Kick)
        let (xi_sin, xi_cos) = xi.sin_cos();
        
        let term_1 = v_m + v_perp_mag * xi_cos;
        let term_2 = v_perp_mag * xi_sin;
        
        out_slice[i] = (term_1.powi(2) + term_2.powi(2) + v_par.powi(2)).sqrt();
    }

    out_arr
}

How do these actually compare in terms of performance?

The raw Python version benchmarks at around 912 microseconds, while the Rust version comes in at 4.3 microseconds: a 210x speedup.

Moreover, only around half of the Rust-accelerated version's runtime, 2.1 microseconds, is spent in the actual Rust helper function: the rest goes to the random draws. Comparing apples to apples, then, the array math itself sees more like a 430x speedup over the NumPy version.

As a rule, the more eagerly-evaluated steps you can fuse into a single pass to eliminate their intermediary allocations, the more the performance gap will grow, even coming to dwarf NumPy's advantages in superior SIMD integration. Of course, we could also implement hardware-specific SIMD optimizations ourselves if we wanted to go EVEN FASTER, but for McFACTS, this is where we're stopping at the moment: by going from 912 to 4 microseconds, this function went from taking ~3% of our total simulation runtime, to being basically invisible.