nrposner

Vectorized Hardware Instructions Rule Everything Around Me, Actually

Yesterday, I shared a rather lengthy post about various ways to optimize NumPy array operations with lower-level solutions. One of the findings was that a Rust loop over zipped arrays could be faster than NumPy, but it depended on several factors:

  • How many binary NumPy operations are you performing? NumPy performs an intermediate array allocation on each operation, so the more operations you can fuse into a single pass, the more allocator and loop overhead you can eliminate (see the sketch just after this list).
  • What operations are you performing, and are they optimized? NumPy is very well integrated with mathematical libraries like BLAS and with hardware acceleration frameworks like Apple's Accelerate, which provide vectorized transcendental functions like vvpow.
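To make the allocation point concrete, here's a minimal sketch (mine, not from the original post) of the difference between evaluating (a*b)/c one binary operation at a time, with an intermediate buffer per step the way NumPy does, and fusing everything into a single pass:

// Operation-by-operation: each step is a full pass plus an intermediate allocation
fn two_passes(a: &[f64], b: &[f64], c: &[f64]) -> Vec<f64> {
    let tmp: Vec<f64> = a.iter().zip(b).map(|(a, b)| a * b).collect(); // intermediate buffer for a*b
    tmp.iter().zip(c).map(|(t, c)| t / c).collect()                    // second pass for the division
}

// Fused: one pass over the inputs, one output allocation
fn fused(a: &[f64], b: &[f64], c: &[f64]) -> Vec<f64> {
    a.iter().zip(b).zip(c).map(|((a, b), c)| (a * b) / c).collect()
}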

I concluded:

Of course, we could also implement hardware-specific SIMD optimizations ourselves if we wanted to go EVEN FASTER, but for McFACTS, this is where we're stopping at the moment: by going from 912 to 4 microseconds, this function went from taking ~3% of our total simulation runtime, to being basically invisible.

So anyway, here's a quick followup on accessing hardware-specific SIMD operations.


I'd never done this before, but it turned out to be simpler than I expected. Apple's online documentation for Accelerate includes the C call signatures, which I just had to translate into Rust:

void vvdiv(double *, const double *, const double *, const int *);
void vvpow(double *, const double *, const double *, const int *);

became

unsafe extern "C" {
    /// The Accelerate framework's vvdiv function for double-precision SIMD division, performing y/x
    #[cfg(target_os = "macos")]
    fn vvdiv(out: *mut f64, y: *const f64, x: *const f64, count: *const i32);
    /// The Accelerate framework's vvpow function for double-precision SIMD power, performing base^exponent
    #[cfg(target_os = "macos")]
    fn vvpow(out: *mut f64, exponents: *const f64, bases: *const f64, count: *const i32);
}
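Since these take raw pointers and a length, it can be worth wrapping them once behind a safe function that checks the slice lengths up front. This is my own addition rather than something the original code does, but it keeps the unsafe surface small:

/// Safe wrapper (hypothetical) around vvpow: out[i] = bases[i] raised to exponents[i]
#[cfg(target_os = "macos")]
fn vvpow_slices(out: &mut [f64], bases: &[f64], exponents: &[f64]) {
    // length checks so the unsafe call can't read or write out of bounds
    assert_eq!(out.len(), bases.len());
    assert_eq!(out.len(), exponents.len());
    let count = out.len() as i32;
    unsafe { vvpow(out.as_mut_ptr(), exponents.as_ptr(), bases.as_ptr(), &count) };
}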

Then at build time, we need to link against the Accelerate framework:

// in build.rs
fn main() {
    #[cfg(target_os = "macos")]
    println!("cargo:rustc-link-lib=framework=Accelerate");
}

Frankly, I'm still surprised it's this easy.
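One hedge worth noting: a build script is compiled for the host machine, so a cfg(target_os = "macos") attribute inside build.rs reflects the OS you're building on, not necessarily the one you're building for. If cross-compilation ever matters, an alternative is to check Cargo's target environment variable instead:

// in build.rs: check the *target* OS via Cargo's env var rather than cfg,
// which in a build script describes the host
fn main() {
    let target_os = std::env::var("CARGO_CFG_TARGET_OS").unwrap_or_default();
    if target_os == "macos" {
        println!("cargo:rustc-link-lib=framework=Accelerate");
    }
}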

Then we can extract this into a helper function with different bodies depending on the hardware:

fn go_brrr(a: &[f64], b: &[f64], c: &[f64], d: &[f64], out: &mut [f64]) {
    #[cfg(target_os = "macos")]
    {
        // one scalar pass for the multiply and divide...
        for (i, ((a, b), c)) in a.iter().zip(b).zip(c).enumerate() {
            out[i] = (a * b) / c;
        }
        // ...then a single vectorized vvpow call, in place: out[i] becomes out[i] to the power d[i]
        let count = out.len() as i32;
        unsafe {
            vvpow(out.as_mut_ptr(), d.as_ptr(), out.as_ptr(), &count);
        }
    }

    #[cfg(not(target_os = "macos"))]
    {
        // portable fallback: everything in one scalar pass, with powf per element
        for (i, (((a, b), c), d)) in a.iter().zip(b).zip(c).zip(d).enumerate() {
            out[i] = ((a * b) / c).powf(*d);
        }
    }
}

#[pyfunction]
pub fn foobar<'py>(
    py: Python<'py>,
    // get array objects via numpy bindings
    a_arr: PyReadonlyArray1<f64>,
    b_arr: PyReadonlyArray1<f64>,
    c_arr: PyReadonlyArray1<f64>,
    d_arr: PyReadonlyArray1<f64>,
) -> Bound<'py, PyArray1<f64>> {

    // take arrays as read-only slices
    let a_slice = a_arr.as_slice().unwrap();
    let b_slice = b_arr.as_slice().unwrap();
    let c_slice = c_arr.as_slice().unwrap();
    let d_slice = d_arr.as_slice().unwrap();

    // create a new (uninitialized) array of the right length and take it as a mutable slice;
    // this is fine because go_brrr writes every element before we hand the array back to Python
    let out_arr = unsafe { PyArray1::new(py, a_slice.len(), false) };
    let out_slice = unsafe { out_arr.as_slice_mut().unwrap() };

    go_brrr(a_slice, b_slice, c_slice, d_slice, out_slice);

    // return the out array to Python
    out_arr
}

And we can of course do the same for vvdiv (note that there doesn't seem to be a vvmul, presumably because multiplication is already so fast, and so readily vectorized by the compiler, that a dedicated call isn't worth it).
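For reference, here's roughly what that Rust+div+pow variant could look like. This is my reconstruction rather than the actual McFACTS code, and it follows the same in-place pattern as the vvpow branch above:

#[cfg(target_os = "macos")]
fn go_brrr_div(a: &[f64], b: &[f64], c: &[f64], d: &[f64], out: &mut [f64]) {
    // one scalar pass for the multiply...
    for (i, (a, b)) in a.iter().zip(b).enumerate() {
        out[i] = a * b;
    }
    let count = out.len() as i32;
    unsafe {
        // ...then out[i] = out[i] / c[i] with vvdiv, and out[i] raised to d[i] with vvpow
        vvdiv(out.as_mut_ptr(), out.as_ptr(), c.as_ptr(), &count);
        vvpow(out.as_mut_ptr(), d.as_ptr(), out.as_ptr(), &count);
    }
}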

Here are our results (all on randomly generated 1000-element arrays of doubles, with outputs checked to agree within rtol=1e-9):

NumPy:          7.3 ns/element     1x
Rust:           7.5 ns/element     0.97x
Rust+pow:       3.78 ns/element    1.92x
Rust+div+pow:   3.86 ns/element    1.88x

The original all-Rust version is slightly slower than NumPy because the powf call dominates everything else and isn't accelerated.

Using vvpow, we end up almost twice as fast as NumPy! I didn't expect such a large boost: I assumed that NumPy was already using vvpow, and that with Rust catching up to it, we would see a more modest speedup based on eliminating loop overhead and allocations, maybe on the order of 1 or 2 nanoseconds gained per element at most, not almost 4. In all likelihood, there's more going on under the hood that I don't know about yet.

Adding vvdiv is, not unexpectedly, unhelpful and slows us down slightly. The slowdown is presumably some combination of the call overhead of a function that can't be inlined, an additional pass over the data, and the fact that vvdiv isn't doing much that the original Rust division wasn't already doing.

Coming into this, I figured the take-home lesson would be 'you can speed up NumPy array operations a good deal, but only if NumPy is losing time to intermediate allocations.' Instead, between the (a*b)/c example from yesterday and this example now, it seems that even trivial operations can be sped up considerably, though you should profile and determine that it's actually a bottleneck before putting in the effort.