For my internship this summer, I developed a set of Rust utilities on top of the Slurm job scheduler. Slurm is written in C, so to do that, I first had to build an API, and to do that, I first had to build an FFI.
This is how I approached it.
Bindings
At least one person has tried their hand at this problem before: Peter Williams created the slurm-rs repo years ago, and gave up around 2018, leaving it unmaintained.
As best I can tell, the part of the problem that frustrated Williams was the FFI: manually writing the bindings between C and Rust. Not only is this a lot of thankless work, especially if you want to be complete, but the underlying Slurm API isn't very stable either. It would be like building a house on shifting sand.
We completely sidestepped this issue by using rust-bindgen to bind the entire Slurm codebase at build time.
It's possible that Williams tried this, or at least used bindgen as part of his build process at some point, but it wasn't nearly as complete back in 2018. As of 2025, it's so effective that I have no qualms about running it on every build: any time the underlying API would change, we can catch it at build time and adjust our API as necessary to fit the new bindings.
It's simultaneously the most technically impressive and most boring part of the whole project: it would have been impossible to produce this on one intern's time without bindgen, but at the same time, I just installed it on day one, emailed my sysadmin to figure out why it wasn't working (I was accidentally binding against a legacy version of libslurm, natch), and by day two I was working on the API. No muss, no fuss.
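For reference, the build-time setup is small. Here's a minimal sketch of the kind of build.rs involved; the header path and link flags are assumptions about a typical Slurm install, not our exact configuration:

// build.rs — a minimal sketch; the header path and library name are
// assumptions about a typical install, and a real build script would
// locate them more robustly
fn main() {
    // link against the system libslurm
    println!("cargo:rustc-link-lib=slurm");

    // run bindgen on every build, so API drift surfaces as compile errors
    let bindings = bindgen::Builder::default()
        .header("/usr/include/slurm/slurm.h")
        .generate()
        .expect("failed to generate Slurm bindings");

    let out_dir = std::path::PathBuf::from(std::env::var("OUT_DIR").unwrap());
    bindings
        .write_to_file(out_dir.join("bindings.rs"))
        .expect("failed to write bindings.rs");
}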
API
This, from my perspective, was the fun part.
With the bindings in place, I could call Slurm functionality from a Rust program trivially: just make sure that the connection is initialized and off we go. But getting data out of Slurm (and eventually, getting data into Slurm), and doing it safely, is a different story.
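As a rough sketch of what "make sure that the connection is initialized" looks like: slurm_init and slurm_fini are only exposed in relatively recent Slurm versions (older ones initialize implicitly), so treat this as illustrative rather than exact:

// a rough sketch of initializing the Slurm connection via the bindings;
// assumes a Slurm version recent enough to expose slurm_init/slurm_fini
fn main() {
    unsafe {
        // NULL means "use the default slurm.conf location"
        slurm_init(std::ptr::null());
        println!("Slurm API version: {:#x}", slurm_api_version());
        slurm_fini();
    }
}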
Example: Getting Node Info
I wound up defining some intermediary types for safely handling pointers into C-allocated memory. For example, if I'm trying to get node information from Slurm, the C function that Slurm exposes looks like this:
/*
 * slurm_load_node - issue RPC to get slurm all node configuration information
 *	if changed since update_time
 * IN update_time - time of current configuration data
 * OUT resp - place to store a node configuration pointer
 * IN show_flags - node filtering options (e.g. SHOW_FEDERATION)
 * RET 0 or a slurm error code
 * NOTE: free the response using slurm_free_node_info_msg
 */
extern int slurm_load_node(time_t update_time,
                           node_info_msg_t **resp,
                           uint16_t show_flags);
(note the verbosity of the docs explaining each argument. Every time I look at code in another language, the first thing I miss from Rust is the type system! Comments are not a concurrency strategy, and long comments should not be a type safety strategy!)
We can pass in a time argument (nice if you're calling repeatedly and just need the information that has changed since the last time you called), a raw pointer that will end up pointing to a node_info_msg struct, and some optional flags.
Now I'm going to define a Rust-side struct to hold onto that pointer:
pub struct RawSlurmNodeInfo {
    ptr: *mut node_info_msg_t,
}

impl Drop for RawSlurmNodeInfo {
    fn drop(&mut self) {
        if !self.ptr.is_null() {
            unsafe {
                slurm_free_node_info_msg(self.ptr); // a Slurm C function binding
            }
            // nullifying pointer after free, probably unnecessary but good form
            self.ptr = std::ptr::null_mut();
        }
    }
}
Pretty basic so far: the RawSlurmNodeInfo struct just contains a pointer that points to C-allocated memory.
The nice, ergonomic touch here is in the Drop implementation: whenever the RawSlurmNodeInfo struct goes out of scope (or, more rarely, when we manually drop it), the Drop implementation also calls into Slurm to free the memory it references. This way, we don't need to worry about memory leaks or double-frees across the boundary: if the Rust struct exists, so do the pointer and the memory. If it doesn't exist, neither does the memory.
Note that this requires an unsafe operation: any call across the Rust-C boundary with our bound functions is an unsafe call.
impl RawSlurmNodeInfo {
    pub fn load(update_time: time_t) -> Result<Self, String> {
        let mut node_info_msg_ptr: *mut node_info_msg_t = std::ptr::null_mut();
        let show_flags = 2; // only getting SHOW_DETAIL
        let return_code = unsafe {
            slurm_load_node(update_time, &mut node_info_msg_ptr, show_flags)
        };
        if return_code != 0 || node_info_msg_ptr.is_null() {
            Err("Failed to load node information from Slurm".to_string())
        } else {
            Ok(RawSlurmNodeInfo { ptr: node_info_msg_ptr })
        }
    }
}
Now we can define some proper methods on the struct. We're going to create a load method, which creates a null, mutable pointer; we pass that pointer to C, which allocates all the data and then 'stamps' the pointer so it points to that data; and, assuming this all succeeded, we initialize the RawSlurmNodeInfo struct with this pointer.
Once again, we have an unsafe call: calling into C and passing it a pointer to be stamped is unsafe.
Now here's the good stuff:
impl RawSlurmNodeInfo {
    pub fn as_slice(&self) -> &[node_info_t] {
        if self.ptr.is_null() {
            return &[];
        }
        unsafe {
            let msg = &*self.ptr;
            std::slice::from_raw_parts(msg.node_array, msg.record_count as usize)
        }
    }

    pub fn into_slurm_nodes(self) -> Result<SlurmNodes, String> {
        let raw_nodes_slice = self.as_slice();
        let num_nodes = raw_nodes_slice.len();
        let mut nodes_vec = Vec::with_capacity(num_nodes);
        let mut name_to_id_map = HashMap::with_capacity(num_nodes);

        for (id, raw_node) in raw_nodes_slice.iter().enumerate() {
            let safe_node = Node::from_raw_binding(id, raw_node)?;
            name_to_id_map.insert(safe_node.name.clone(), id);
            nodes_vec.push(safe_node);
        }

        let last_update_timestamp = unsafe { (*self.ptr).last_update };
        let last_update = DateTime::from_timestamp(last_update_timestamp, 0).unwrap_or_default();

        Ok(SlurmNodes {
            nodes: nodes_vec,
            name_to_id: name_to_id_map,
            last_update,
        })
    }
}
Now we're going to create methods to extract the node information as a slice: each element of the slice is a separate node_info_t, a C struct that we've bound, which describes one node. This is an unsafe call: we're building a slice over memory managed by another language, after all.
Finally, we can build our into_slurm_nodes method, which calls .as_slice() internally, initializes some memory, and creates safe, owned, Rust-side Node structs (defined elsewhere) from the equivalent node_info_t structs.
This is the only heavy operation we perform in this process: copying all this data from C-owned memory into Rust-owned memory. For a brief moment, the same data is duplicated. If we were more resource-constrained, or worried that the node information would be exceptionally large, we might consider a more involved memory-management regime... but as it stands, all the information comes to much less than a megabyte, and it's worth that cost to make this information stably available in native Rust without making repeated calls across the FFI.
This also involves an unsafe call... but only to access the last_update field, and that's just because we need to dereference a potentially null pointer.
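For a sense of what that per-node copy looks like, here's a trimmed-down sketch of the conversion. The real Node carries many more fields, and this from_raw_binding is a simplified stand-in; the name field here follows what bindgen emitted for our node_info_t, so treat it as an assumption:

use std::ffi::CStr;

pub struct Node {
    pub id: usize,
    pub name: String,
    // the real struct carries many more fields
}

impl Node {
    // simplified sketch: copy the C-owned name into an owned Rust String,
    // so it survives after the C allocation is freed
    pub fn from_raw_binding(id: usize, raw: &node_info_t) -> Result<Self, String> {
        if raw.name.is_null() {
            return Err(format!("node {id} has a null name pointer"));
        }
        let name = unsafe { CStr::from_ptr(raw.name) }
            .to_str()
            .map_err(|e| format!("node {id} name is not valid UTF-8: {e}"))?
            .to_owned();
        Ok(Node { id, name })
    }
}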
Note also that .into_slurm_nodes() takes the self argument, while .as_slice() just takes &self. This could not be done in the opposite order! It also means that, since the RawSlurmNodeInfo struct is now owned by its own into_slurm_nodes method, it only exists within the method's scope: the moment the method finishes running, RawSlurmNodeInfo will be dropped, and as part of being dropped, it will free the C-allocated memory! By this point, the equivalent information has already been copied and returned to the Rust-side caller, so no information is lost. We got what we came for.
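Putting it all together, the caller's view is short and safe (an update_time of 0 just means "everything, from the epoch onward"):

fn fetch_nodes() -> Result<SlurmNodes, String> {
    // load() hands back the RAII wrapper; into_slurm_nodes() consumes it,
    // copying into Rust-owned memory and freeing the C allocation via Drop
    RawSlurmNodeInfo::load(0)?.into_slurm_nodes()
}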
Writing into Slurm
This is broadly the pattern I followed with all of the 'read-only' APIs, which wound up being most of it. The utilities I was building were mainly diagnostic, so there wasn't much need to write information into Slurm, just read out from it. But there were a couple of places later in development where we did need to write, once we expanded the utilities to deal with the Slurm DataBase daemon.
For example, in the course of gathering QoS data from SlurmDB in order to provide information about per-user and per-partition resource usage, it winds up being necessary to read information out of Slurm, parse and transform it, and then pass it back to Slurm as a cond-pattern object, the input to a query. An initial draft of this API, written with the same patterns as the read-only API, resulted in my first-ever segfault!
This occurred for a few reasons, but the thorniest was that the Slurm query type takes its arguments in the form of a custom list_t type, which (I would have known this had I paid more attention to the documentation up front) has no associated free function. Instead, the list defines its own destructor at creation: it takes, as an argument, a function pointer corresponding to a custom, manual free function!
That, and memory passed between Rust and C needed to be Boxed and managed carefully in order to ensure that C didn't try to deallocate it, and that it didn't drop out of scope before C was done with it. Fun!
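To make that concrete, here's a rough sketch of the pattern that fixed it. slurm_list_create and slurm_list_append come from the bindgen output and their exact bindgen'd signatures may differ across Slurm versions, and the element type is simplified here to a C string:

use std::ffi::CString;
use std::os::raw::{c_char, c_void};

// the destructor Slurm will call on each element when it destroys the list;
// we reconstruct the CString so Rust's allocator frees it, not C's free()
unsafe extern "C" fn free_rust_cstring(item: *mut c_void) {
    if !item.is_null() {
        drop(CString::from_raw(item as *mut c_char));
    }
}

fn build_name_list() -> *mut list_t {
    unsafe {
        // hand the destructor over at creation time, as list_t requires
        let list = slurm_list_create(Some(free_rust_cstring));
        let name = CString::new("normal").unwrap();
        // into_raw() leaks ownership to C; our destructor reclaims it later,
        // so the memory neither double-frees nor drops out of scope early
        slurm_list_append(list, name.into_raw() as *mut c_void);
        list
    }
}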