Rust is 2x faster than C: PyO3 Edition

As a way of passing the time while waiting for CNNs to converge, I've been experimenting with converting some numerical C-Python libraries to Rust-Python with PyO3.

Today, I'm looking at V.E. Delfavero's basil-core C library, which is the optimized core of the BASIL (Bayesian Astronomical Sampling and Integrating Library) Python package. Specifically, a set of recently updated C functions optimizing the calculation of orbital decay in binary systems (note that I'm not an astrophysicist, I'm just a code monkey who got nerd-sniped).

Here's a representative CPython function for calculating... uhhh. I'm not really sure what it's calculating, exactly, but I've never let that stop me before!

static PyObject *_orb_sep_evol_ecc_integrand_arr(PyObject *self, PyObject *args) {

    // npts will describe the length of our P array
    npy_intp npts = 0;
    // Loop variables
    npy_intp i = 0;
    // Py_objects for inputs and output objects
    PyObject *ecc = NULL;
    double preamble = 0;
    double c0 = 0;
    PyObject *ecc_obj = NULL;
    PyObject *out_obj = NULL;
    // PyArray objects for array data
    PyArrayObject *ecc_arr = NULL;
    PyArrayObject *out_arr = NULL;
    // Get pointers for the arrays
    double *ecc_ptr,*out_ptr;

    // Parse the argument tuple
    if (!PyArg_ParseTuple(args, "ddO", &preamble, &c0, &ecc_obj)) {
        PyErr_SetString(PyExc_TypeError, "Error parsing input");
        return NULL;
    }

    // Fill the array pointers
    ecc_arr = (PyArrayObject *)PyArray_FROM_O(ecc_obj);
    check(ecc_arr, "Failed to build ecc_arr.");
    check(PyArray_NDIM(ecc_arr) == 1, "ecc array should have only one dimension.");

    // Check the dimensions
    npts = PyArray_DIM(ecc_arr, 0);
    check(npts > 0, "npts should be greater than zero.");

    // Build output array
    out_obj = PyArray_ZEROS(1, &npts, NPY_DOUBLE, 0);
    check(out_obj, "Failed to build output object.");
    out_arr = (PyArrayObject *)out_obj;
    check(out_arr, "Failed to build output array.");

    // Allow multithreading
    Py_BEGIN_ALLOW_THREADS
    for (i = 0; i < npts; i++){
        ecc_ptr = PyArray_GETPTR1(ecc_arr, i);
        out_ptr = PyArray_GETPTR1(out_arr, i);
        double sep = c0 / peters_ecc_const_c(*ecc_ptr);
        *out_ptr = preamble * (*ecc_ptr/pow(sep,4)) * (1. + (121./304.)*pow(*ecc_ptr,2)) / pow(1. - pow(*ecc_ptr,2),5./2.);
    }
    // Cut out the multithreading nonsense
    Py_END_ALLOW_THREADS
    // Free things
    if (ecc_arr) {Py_DECREF(ecc_arr);}
    // Return output array
    return out_obj;

// Error
error:
    if (ecc_arr) {Py_DECREF(ecc_arr);}
    if (out_obj) {Py_DECREF(out_obj);}
    if (out_arr) {Py_DECREF(out_arr);}
    PyErr_SetString(PyExc_RuntimeError, "C error");
    return NULL;
}

Some notes from the perspective of someone whose experience with C is limited to reading it, not writing it: this is LONG. I wouldn't call it verbose, per se--any given line is quite self-explanatory and most are quite short. But there's a lot of boilerplate and manual pointer handling, and it would be very easy to just forget a decrement or a dimension check.

But the thing that really struck me when I started translating these functions was how unclear the actual data flow is: all the functions take in raw PyObjects, manually extract them into C-friendly datatypes (double arrays, in this case), and then perform checks for the type and dimensionality of the array. The meat of the function, the loop which releases the GIL, performs the actual calculation, and reclaims the GIL, is squeezed in right at the end!

It's not an issue of optimization--it's very well optimized, with no wasted effort--but it's not exactly readable, and it's especially not scannable. You need to read through the actual function body in order to figure out what kinds of data types it expects to take in, whether it mutates them, and what it outputs (a static PyObject, yes, but representing what?).

You might protest that this is just a matter of the function lacking documentation. And sure, it doesn't have a detailed header, but that really shouldn't be the standard: comments are not a concurrency strategy, and you shouldn't need to rely on complete and accurate documentation just to tell at a glance what a function does.

In contrast, here was my initial, dead simple translation to Rust with PyO3:

use pyo3::prelude::*;

#[pyfunction]
fn orb_sep_evol_ecc_integrand_arr(
    preamble: f64,
    c0: f64,
    ecc_arr: Vec<f64>,
) -> Vec<f64> {
    assert!(!ecc_arr.is_empty());

    ecc_arr.iter().map(|ecc| {
        let sep = c0 / peters_ecc_const_sgl(*ecc);
        preamble * (ecc / sep.powi(4)) * (1.0 + (121.0 / 304.0) * ecc.powi(2)) / (1.0 - ecc.powi(2)).powf(2.5)
    }).collect()
}

A few notes:

First, it's a lot shorter. Part of that comes from the lack of internal comments, but even if you took the comments out of the equivalent C function, it's still about four times as long as the Rust function. Hell, almost half the Rust function's vertical space is spent on its header.

The header, of course, directly communicates how the function interacts with its environment: it takes in, by value, two doubles (f64) and a Vec<f64> (a heap-allocated array of doubles), and it outputs another Vec<f64>.

Its data flow is pretty clear: we need just one check up front to ensure that the vector isn't empty (and if we eliminated that check, we'd just have a function that returns an empty vector when given an empty one). Beyond that, we're focused entirely on the loop that constructs the new vector from the old one. The iterator and map syntax would be unfamiliar to someone who hasn't worked with Rust, but besides that, it's pretty self-explanatory.
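
(Side note, since the assert might look alarming: a panic inside a #[pyfunction] doesn't take down the interpreter; PyO3 catches it at the boundary and re-raises it in Python as a PanicException. If we'd rather report the empty-input case as an ordinary Python exception, a PyResult-returning variant is a small change. A sketch, with a made-up _checked name:)

use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

// A sketch, not the post's benchmarked code: returning PyResult turns the
// empty-input case into an ordinary Python ValueError instead of a panic.
#[pyfunction]
fn orb_sep_evol_ecc_integrand_arr_checked(
    preamble: f64,
    c0: f64,
    ecc_arr: Vec<f64>,
) -> PyResult<Vec<f64>> {
    if ecc_arr.is_empty() {
        return Err(PyValueError::new_err("ecc array must not be empty"));
    }
    Ok(ecc_arr.iter().map(|ecc| {
        let sep = c0 / peters_ecc_const_sgl(*ecc);
        preamble * (ecc / sep.powi(4)) * (1.0 + (121.0 / 304.0) * ecc.powi(2)) / (1.0 - ecc.powi(2)).powf(2.5)
    }).collect())
}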

We don't need to check the dimensionality in the function body, because that requirement is already encoded in the function header: we are expecting an array of doubles. Anything Python provides that PyO3 can safely turn into an array of doubles (a list or a numpy array of integers or floats) is accepted. Any violation of that contract (non-numeric types, a non-one-dimensional input) is rejected at runtime without needing any manual check on our end.

One final little note: the C codebase has two separate functions, peters_ecc_const_sgl and peters_ecc_const_c, which do the same thing: the former called from Python, the latter called from C.

In our implementation, we don't have a separate peters_ecc_const_rust function: we just use the same scalar calculation function that we expose to Python, since it's a pure Rust function we've stuck a #[pyfunction] attribute on top of. The Rust function and the Python function are constructed separately at build time and exist in separate namespaces. Mind, this is only possible when the Rust function in question consumes and produces only common primitives, which is not always practical, as we'll soon see.
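
To make the pattern concrete, here's a sketch of what that looks like (the function body is the textbook Peters (1964) eccentricity function, not necessarily basil-core's exact formula, and the module name is invented):

use pyo3::prelude::*;

// One definition serves both sides: Rust code calls peters_ecc_const_sgl
// directly, while #[pyfunction] generates a separate Python wrapper at
// build time. The body is the standard Peters (1964) g(e); basil-core's
// exact implementation may differ.
#[pyfunction]
fn peters_ecc_const_sgl(ecc: f64) -> f64 {
    ecc.powf(12.0 / 19.0) / (1.0 - ecc.powi(2))
        * (1.0 + (121.0 / 304.0) * ecc.powi(2)).powf(870.0 / 2299.0)
}

// The module name here is hypothetical.
#[pymodule]
fn basil_rs(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(peters_ecc_const_sgl, m)?)?;
    m.add_function(wrap_pyfunction!(orb_sep_evol_ecc_integrand_arr, m)?)?;
    Ok(())
}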

Benchmarking

We build (maturin develop -r makes this dead easy), uv pip install basil-core for comparison, and do some benchmarking.

The first result: our Rust function is reliably ~30% SLOWER than the equivalent C function:

(10000 iterations each on randomly generated data)

orb_sep_evol_ecc_integrand_arr
Orb Arrays: 500 long
    C averaged:   29489.52520 nanoseconds
    Rust averaged:  41358.17790 nanoseconds
    Differential (C / Rust):  0.71x
Orb Arrays: 1000 long
    C averaged:   58592.56450 nanoseconds
    Rust averaged:  82926.42460 nanoseconds
    Differential (C / Rust):  0.71x
Orb Arrays: 5000 long
    C averaged:   290843.57220 nanoseconds
    Rust averaged:  412179.88870 nanoseconds
    Differential (C / Rust):  0.71x
Orb Arrays: 10000 long
    C averaged:   580371.21280 nanoseconds
    Rust averaged:  818367.96210 nanoseconds
    Differential (C / Rust):  0.71x

Where C completes in, on average, 60 nanoseconds per element, Rust averages 80 nanoseconds per element.

Womp womp.

Now, the art of benchmarking is a fickle one, and speculating about why code runs faster or slower (without directly reading the MIR or assembly) can be hazardous. But this result is very robust, and it demands some explanation: where is this slowdown coming from?

Well, mainly from the fact that the Rust and C code aren't quite identical.

The C code is striving to eliminate unnecessary memory allocation: it's reading directly from the input Python array (borrowing, not consuming) without copying. This line in particular,

ecc_arr = (PyArrayObject *)PyArray_FROM_O(ecc_obj);

will copy data if it's a Python list... but if it's a numpy array that enforces contiguity in memory (you are using numpy for your numerical calculations, right?), it will just provide a pointer to the ecc_obj Python object and increment its refcount.

It only actually allocates one new array, of exactly the necessary size, and feeds the contents into it by index without intermediary allocation. The rest of the function's time is just spent looping, reading, and writing, and all in contiguous, cached memory.

Our Rust function, on the other hand, has an extra allocation under its belt: that Vec<f64> we take as input isn't a direct read of the numpy array, but an entirely new, Rust-owned-and-allocated array. This makes the subsequent code simpler, since we're working directly with an owned, standard library struct, but it slows us down considerably.
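
To put code to that claim: declaring a Vec<f64> parameter asks PyO3 to run, before our function body even starts, the moral equivalent of this (a sketch; the helper name is mine, PyO3 generates this glue itself):

use pyo3::prelude::*;

// Roughly what a Vec<f64> parameter costs on entry: PyO3 walks the Python
// object element by element and copies everything into a brand-new,
// Rust-owned Vec. The function name is hypothetical.
fn copy_on_entry(ecc_obj: &Bound<'_, PyAny>) -> PyResult<Vec<f64>> {
    ecc_obj.extract::<Vec<f64>>()
}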

So, how do we actually get these two functions on par?

Zerocopy PyO3

use numpy::prelude::*;
use numpy::{PyArray1, PyReadonlyArray1};
use pyo3::prelude::*;

#[pyfunction]
fn orb_sep_evol_ecc_integrand_arr<'py>(
    py: Python<'py>,
    preamble: f64,
    c0: f64,
    ecc_arr: PyReadonlyArray1<'py, f64>,
) -> Bound<'py, PyArray1<f64>> {
    assert!(!ecc_arr.is_empty());

    // get a read-only slice to iterate over
    let ecc_slice = ecc_arr.as_slice().unwrap();

    // pre-allocate the new numpy array (uninitialized, hence unsafe)
    let out_arr = unsafe { PyArray1::new(py, ecc_slice.len(), false) };
    let out_slice = unsafe { out_arr.as_slice_mut().unwrap() };

    // now loop over the slices (no allocation)
    for (i, &ecc) in ecc_slice.iter().enumerate() {
        let sep = c0 / peters_ecc_const_sgl(ecc);
        out_slice[i] = preamble * (ecc / sep.powi(4)) * (1.0 + (121.0 / 304.0) * ecc.powi(2)) / (1.0 - ecc.powi(2)).powf(2.5);
    }
    out_arr
}

We're a good bit more verbose now, and we've brought in lifetimes and some types from the Rust numpy crate, but the results are worth it. First, let's check out the differences:

We're now explicitly invoking the Python interpreter with its 'py lifetime, and taking a read-only reference to a 1-dimensional Python array of doubles, PyReadonlyArray1<'py, f64>. These are the same assumptions the C code was making, but we're stuffing them into an input type rather than guaranteeing them with manual runtime checks.

Like the C code, we now pre-allocate an output array of the same length as the eccentricity array, which we read through a read-only slice. However, the act of creating this array is unsafe (unlike the C code's PyArray_ZEROS, PyArray1::new leaves the memory uninitialized, so any element we fail to write before it's read is garbage), and so is the act of taking a mutable slice of it using as_slice_mut().
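
(One caveat worth flagging, though none of the benchmarks hit it: as_slice() only succeeds on contiguous input, so that unwrap() would panic on a strided view like arr[::2] passed from Python. The numpy crate's stride-agnostic alternative is an ndarray view; a minimal sketch:)

use numpy::PyReadonlyArray1;

// as_array() yields an ndarray view that honors the input's strides, so it
// works even where as_slice() would return Err on non-contiguous data.
fn sum_any_strides(ecc_arr: PyReadonlyArray1<'_, f64>) -> f64 {
    ecc_arr.as_array().iter().sum()
}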

We then loop directly over ecc_slice, calculating the new value for each iteration, and manually stuffing it into the correct index of the output array, just like the C code does.

(We could rewrite this to use an iterator all the way down rather than a for loop, but it gets more verbose and isn't worth it unless we're trying to add parallelism with rayon; a sketch of that follows below.)
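
For the curious, the rayon version would look roughly like this (a sketch, assuming the rayon crate is added as a dependency; I haven't benchmarked it):

use rayon::prelude::*;

// A drop-in replacement for the serial loop above: rayon splits the zipped
// slices across a thread pool. Only worthwhile once the arrays are large
// enough to amortize the coordination overhead.
out_slice
    .par_iter_mut()
    .zip(ecc_slice.par_iter())
    .for_each(|(out, &ecc)| {
        let sep = c0 / peters_ecc_const_sgl(ecc);
        *out = preamble * (ecc / sep.powi(4))
            * (1.0 + (121.0 / 304.0) * ecc.powi(2))
            / (1.0 - ecc.powi(2)).powf(2.5);
    });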

Finally, we output... uhhh. What the hell is that output signature? Bound<'py, PyArray1<f64>>?

Previously, I've discussed PyO3's Bound wrapper in inputs as a way of saying 'look, Python could pass in anything, we're going to handle it manually', very similar to the C code's manual PyArg unpacking.

In an output, its meaning changes slightly: it's an owned, thread-attached smart pointer bound to the lifetime of the Python interpreter itself (so that Python can copy or deallocate it at its leisure), containing an actual Python object within it: not a Rust Vec<T>, but a full Python object.

It's a bit arcane to read, and if you wanted to pass this around in Rust you'd have some issues (that's what pyo3::Py<T> is for), but it makes the crossing back to the Python side dead easy: PyO3 handles the hand-off of the actual value, which can really only fail if you, the programmer, did something silly like create an uninitialized numpy array and never fill it with valid values.
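
(For completeness, a sketch of that escape hatch: Py<T> is the GIL-independent handle you'd reach for if the array needed to outlive the 'py lifetime.)

use numpy::PyArray1;
use pyo3::prelude::*;

// unbind() strips the 'py lifetime, leaving a reference-counted, Send
// handle that can live in a struct or cross threads.
fn stash(out_arr: Bound<'_, PyArray1<f64>>) -> Py<PyArray1<f64>> {
    out_arr.unbind()
}

// Later, under any GIL token, bind() re-attaches it:
// let arr: &Bound<'_, PyArray1<f64>> = stashed.bind(py);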

Alright, that's enough chatter. How does this impact our benchmarks?

(10000 iterations each on randomly generated data)

orb_sep_evol_ecc_integrand_arr
Orb Arrays: 500 long
    C averaged:   29377.08650 nanoseconds
    Rust averaged:  14909.95010 nanoseconds
    Differential (C / Rust):  1.97x
Orb Arrays: 1000 long
    C averaged:   58475.87110 nanoseconds
    Rust averaged:  29440.20820 nanoseconds
    Differential (C / Rust):  1.99x
Orb Arrays: 5000 long
    C averaged:   290898.78610 nanoseconds
    Rust averaged:  145879.32700 nanoseconds
    Differential (C / Rust):  1.99x
Orb Arrays: 10000 long
    C averaged:   583183.95880 nanoseconds
    Rust averaged:  291285.88850 nanoseconds
    Differential (C / Rust):  2.00x

Oh.

Well that's quite nice.

Where our Rust version used to take ~80ns per element, it's now averaging ~30ns per element across the board: a solid 2.6x speedup, putting it around ~2x ahead of the equivalent C code.

Where the hell is that speedup coming from?

We did that work mainly to get the Rust code on par with C: where is this 2x difference coming from now?

I... don't have a great answer for that.

One possibility is that Rust, compiled in release mode, is more aggressive with autovectorization: but 'it was autovectorization' is the systems engineer's equivalent of a cloud engineer saying 'it was DNS'. Sure, it might be, but unless you're showing assembly, you're not proving it, are you?

Another possibility is that the Rust function uses .powi() for exponentiating to integer powers, while the C code uses pow() for all exponentiation. I'm given to understand that .powi() is faster than .powf() in Rust, and that pow is probably doing the equivalent of .powf(). If this is the case, it implies that the C code could probably get to the same level with a more specialized exponentiation function. I don't know enough about C math modules to say.
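
For reference, the distinction looks like this on the Rust side (a sketch: powi typically compiles down to repeated multiplication, while powf calls through to a generic libm-style pow routine):

// Same mathematical answer, different cost.
fn powi_vs_powf() {
    let x: f64 = 1.5;
    let fast = x.powi(4); // effectively (x * x) * (x * x)
    let slow = x.powf(4.0); // generic pow(x, 4.0)
    assert!((fast - slow).abs() < 1e-12);
}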

The speedup probably isn't due to the C code lacking optimization flags: it was built with meson at optimization level 3, which should be equivalent to Cargo's default opt-level = 3 when compiling in release.