Or, Breaking 2 GB/s on Streaming JSON Queries
This is a long one, with notes on the ongoing development and re-development of crowley, some light literature review, and armchair theorizing about how to think about JSON querying as a process. Headers:
- SIMD Streaming JSON Parsing
-
- Langdale & Lemire (2019) Redux
-
- Rust Implementation
-
- The Frustrating Catch
- What This Structure Enables
- Lurching Towards a Model of SIMD JSON Querying
- Looking To The Future
In the last month or so, I've been occupying myself with improving crowley. This has mostly involved slowing down and rethinking a lot of assumptions, hence why it does not yet have cross-platform binaries (sorry Muk! I promise I'll get to your PR soon!).
I've made a handful of optimizations that speed it up non-trivially - we're now about 3.0-3.3x faster than ijson for comparable tasks, where we were around 2.8-3.1x faster at the time of 0.1.0. But as I was tooling around with those changes, I gradually got a bit frustrated.
crowley started out pretty simple - take json-event-parser's Read implementation, take jsongrep's DFA, fuse them together, ta-da! fast streaming query engine. But combining off-the-shelf parts was leaving a lot of performance on the table, especially because json-event-parser is, by its own admission, not hyper-focused on being the fastest streaming parser possible.
But there were a lot of optimizations we could make in theory that would require entangling the parser logic with the query/DFA logic, and even implementing some of the less intrusive optimizations resulted in a tangled mess. Here's the apply_array_index_transition method for EngineState:
fn apply_array_index_transition(&mut self) { if let Some(frame) = self.stack.last_mut() && let FrameContext::Array { ref mut index } = frame.context { let current_index = *index; *index += 1; if self.track_path { self.path.push(PathType::Index(current_index)); } // Transition from the array frame's DFA state on this index. self.current_state = frame.dfa_state.and_then(|state| { self.dfa .get_index_symbol_id(current_index) .and_then(|sym| self.dfa.transition(state, sym)) }); // Early termination: if this transition is dead AND no future // index in this array can produce a valid transition, AND there // are no active captures being materialized, we can stop. if self.current_state.is_none() && self.active_captures.is_empty() && let Some(parent_state) = frame.dfa_state && !self.has_future_index_match(parent_state, current_index + 1) { self.try_terminate(); } } }
if self.track_path? if self.active_captures.is_empty()? if !self.has_future_index_match()? No fewer than four separate chained boolean checks? What is this stuff doing in an array transition? Why did I organize it this way?
Of course, the answer is 'because I kept getting new ideas for features/optimizations and adding them piecemeal rather than architecting them in from the start'. In fairness, I wouldn't have seen these opportunities at all without having made the first draft - but it does reinforce my belief that an iterative, Agile-style development approach requires a strong commitment to modularity and willingness to burn down and rebuild the current implementation before the unplanned feature load becomes overwhelming.
So that's what I started working on a couple weeks ago: rewriting basically all of it (except for Micah's DFA logic, which for the moment I'm treating as fragile wizardry) with my current set of features and optimizations in mind.
Of course, a complete redesign opens the door to OTHER features and optimizations, forcing me to repeat this whole process recursively up to the limit of my patience/sanity.
SIMD Streaming JSON Parsing
When I was initially sketching out what would become crowley and shopping around for a SAX-style streaming JSON parser, I spent a little while looking for SIMD implementations. I'd previously heard of, attempted to read, and then failed to understand Langdale & Lemire (2019) 'Parsing Gigabytes of JSON per Second', which impressed upon me that REALLY fast JSON parsing in the 2020s would involve SIMD.
But while some SIMD parser implementations existed for Rust, none of them were SAX-style streaming parsers. L&L's reference implementation worked exclusively on in-memory JSON with a built-in 4GB limit, because their goal was to construct an entire tree for repeated use, rather than a one-time stream access for transient queries.
Even with the current rise in RAM prices, workflows that requires parsing <4GB of JSON at a time don't require streaming, so I left it to the side.
But after working with json-event-parser, I figured, no, we could totally do SIMD parsing here - we just need to do it chunk-by-chunk, in a way that requires us to only move forward and never backtrack. The DFA already lets us do this since it never backtracks. The main problem is dealing with content that crosses chunk and buffer boundaries.
Langdale & Lemire (2019) Redux
The basic process for SIMD JSON parsing goes as follows:
We take some chunk of JSON input - for the moment, let's stick to 64-byte-long chunks. Here's the sample L&L use in their paper:
{ "\\\"Nam[{": [ 116,"\\\\" , 234, "true", false ], "t":"\\\"" }
Valid JSON follows a few convenient rules: first, all characters OUTSIDE of strings are required to be valid ASCII, while all characters INSIDE strings are Unicode codepoints (almost always utf8). This means not ALL parsing steps need to be sequential - each individual byte can be parsed individually, in whatever order we like (including simultaneously), and the results will be valid outside of strings.
The catch, of course, is that we need a way to determine whether a given character is inside a string or not. But since double quotes " are themselves ASCII...
L&L use a series of dense bit-twiddling hacks for this purpose. This is harder than it may sound, because strings can themselves contain quotes, which are escaped with \: the bit-twiddling below handles that:
(taken from Langdale & Lemire (2019) page 7)
{ "\\\"Nam[{": [ 116,"\\\\" , 234, "true", false ], "t":"\\\"" } : Original text
1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_ : Even bits (for bit-twiddling hacks)
_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1 : Odd bits (for bit-twiddling hacks)
__1___1_____1________1____1________1____1___________1_1_1___11__ : Q (naive quote locations)
___111________________1111_______________________________111____ : B (Is this byte a backslash?)
___1__________________1__________________________________1______ : S = B & ~(B << 1) (Is this the start of a chain of backslashes?)
______________________1_________________________________________ : ES = S & E (get all the slash starts that begin on an even offset)
___111____________________1______________________________111____ : EC = B + ES (yields carries on backslash sequences with even starts)
__________________________1_____________________________________ : ECE = EC & ~B (filter out backslashes to get carries only)
________________________________________________________________ : OD1 = ECE & ~E (select only end of sequences ending on an odd offset)
___1_____________________________________________________1______ : OS = S & O (get all the slash starts that begin on an odd offset)
______1_______________1111__________________________________1___ : OC = B + OS (otherwise the same as above)
______1_____________________________________________________1___ : OCE = OC & ~B
______1_____________________________________________________1___ : OD2 = OCE & E
______1_____________________________________________________1___ : OD = OD1 | OD2 (merge to get the ends of all odd-length sequences of backslashes)
__1_________1________1____1________1____1___________1_1_1____1__ : Q &= ~OD (remove escaped quotes from naive quote search)
__1111111111_________11111_________11111____________11__11111___ : R = CLMUL (Q ,~0) (carry-less multiplication by all 1s to find the range of strings, including the starting quote but excluding the ending quote)
All of this is branchless and works simultaneously on as many bits as you can fit into a register - the only inconvenient operation here is clmul(), the carry-less multiplication, which isn't necessarily available for wider registers, and we'll need some special handling for it later.
Now that we know the ranges of strings, we can locate our structural and 'pseudo-structural' characters.
Structural characters include {, }, [, ], ,, :, which explicitly demarcate the beginning and end of objects, the beginning and end of arrays, and separate elements and keys from values, respectively.
The paper formally defines 'pseudo-structural' characters as 'any non-whitespace character outside of strings preceded by either a structural character or whitespace'.
Usefully, these are the places where JSON atoms (strings starting from the opening ", numbers, and true/false/null literals) begin. So long as we assume that the subject is valid JSON, this actually allows us to distinguish the type of any encountered JSON atom from the first byte: strings obviously start with ", but numbers always start with either a digit or -, and the true/false/null literals all start with different characters.
(taken from Langdale & Lemire (2019) page 9)
{ " \\\" Nam [{": [ 116 ," \\\\ " , 234 , " true ", false ], "t":" \\\" " }: input data
__1_________1________1____1________1____1___________1_1_1____1__ : Q (quote locations, no longer naive)
__1111111111_________11111_________11111____________11__11111___ : R (string range `[)`)
1_________11_1_1____1_______1____1_______1_______11____1_______1 : S (structural)
_1____________1_1__________1_1____1_______1_____1__1__________1_ : W (whitespace of any kind)
1 ____________1_1____1_______1____1_______1_______11____1_______1 : S = S & ~R (eliminate quoted regions from structural characters)
1 _1_________11_1____11____1_1____1_1____11_______11_1_111____1_1 : S = S|Q (restore ending quotes to pseudo-structural characters)
111 _________11111___11____1111___111____111_____11111_111____111 : P = S|W (to find pseudo-structural, first find all structural or whitespace characters)
_111_________11111___11____1111___111____111_____11111_111____11 : P = P << 1 (then advance the structural-whitespace mask forward by one)
_____________1_1_1__________1_1__________1_1_____11____1_______1 : P &= ~W & ~R (now eliminate remaining whitespace and quoted characters - voila, we have pseudo-structural characters!)
1 _1_________11_1_1__11____1_1_1__1_1____11_1_____11_1_111____1_1 : S = S|P (now merge pseudo-structural and structural character masks together)
1 _1__________1_1_1__11______1_1__1_1_____1_1_____11_1__11______1 : S & ~(Q & ~R) (eliminate ending quotes from the structural character list)
And look at that: for every single byte in the original 64-byte chunk, we now have a corresponding bit that indicates if that byte is an explicit control character ({, }, [, ], ,, :) or the starting point of a value. This is all computed branchlessly, with (not quite but we'll get to that later) fully local reasoning on each byte individually. It trivially scales from 64 bytes to 128, 256, 512, however large a chunk you can handle.
Having calculated this 'relevance mask' we've substantially simplified the task of JSON querying. We can navigate the JSON tree using only the structural and pseudo-structural characters: everything else is necessary for value return, but not navigation - and crucially, the DFA is capable of navigating one byte at a time.
Langdale & Lemire goes considerably further in its design, but this is where our interaction with it mostly ends - we also remove the final step above, since we actually don't want to remove ending quotes from our pseudo-structural characters: L&L do that because their scheme involves building a big tape representing the whole JSON tree and they want to save as much space as possible, but we're not constructing jack, so we actually want to keep the ending quotes.
For reference, here is a Rust implementation of the above logic operating on 64-byte chunks:
Rust Implementation
const EVENS: u64 = 0b1010101010101010101010101010101010101010101010101010101010101010; const ODDS: u64 = 0b0101010101010101010101010101010101010101010101010101010101010101; // an architecture-independent version of clmul: not as performant, but easier for the time being fn prefix_xor(mut x: u64) -> u64 { x ^= x << 1; x ^= x << 2; x ^= x << 4; x ^= x << 8; x ^= x << 16; x ^= x << 32; x } fn find_structural(bytes: &[u8; 64]) -> u64 { let mut s: u64 = 0; for (i, byte) in bytes.iter().enumerate() { if matches!(byte, b'{' | b'}' | b'[' | b']' | b',' | b':') { s |= 1u64 << i; } } s } fn find_odd_escapes(bytes: &[u8; 64], context: &mut Context) -> u64 { let mut base: u64 = 0; for (i, byte) in bytes.iter().enumerate() { if matches!(byte, b'\\') { base |= 1u64 << i; } } let starts: u64 = base & !((base << 1) | (context.slashes.is_some() as u64)); let even_starts: u64 = starts & EVENS; // wrapping behavior is intentional and desired here let even_carries: u64 = base.wrapping_add(even_starts); let ece = even_carries & !base; let od1 = ece & !EVENS; let odd_starts: u64 = starts & ODDS; let odd_carries: u64 = base.wrapping_add(odd_starts); let oce = odd_carries & !base; let od2 = oce & !ODDS; let mut od = od1 | od2; if context.slashes.is_some() { // how many backslashes does the run continue with at the start of this chunk? let leading_bs = (!base).trailing_zeros() as u64; // 0 if bit 0 isn't a backslash // the run is fully resolved (ended within this chunk) iff leading_bs < 64 if leading_bs < 64 { // combined run parity: prev parity XOR new contribution parity let prev_run_was_odd = match context.slashes { Some(SlashParity::Odd) => true, Some(SlashParity::Even) => false, None => unreachable!("We already checked it's some") }; let combined_odd = prev_run_was_odd ^ ((leading_bs & 1) != 0); if combined_odd { // the byte after the (combined) run is escaped od |= 1u64 << leading_bs; } } // if leading_bs == 64, the entire chunk is backslashes; the run still hasn't // terminated. The new context's parity will reflect the continuation } let trailing_ones = (!base).leading_zeros(); // run length at top let prev_was_backslash = trailing_ones > 0; // ended on \? match prev_was_backslash { false => context.slashes = None, true => { if trailing_ones == 64 { // the entire chunk is slashes, so it's a wash // if the context was previously some, preserve it, // if it was None, make it Even match context.slashes { Some(parity) => context.slashes = Some(parity), None => context.slashes = Some(SlashParity::Even), } } else { // getting parity, true if last run was odd match (trailing_ones & 1) != 0 { false => context.slashes = Some(SlashParity::Even), true => context.slashes = Some(SlashParity::Odd), } } } } od } fn find_quotes_and_range(bytes: &[u8; 64], context: &mut Context) -> (u64, u64) { let mut q: u64 = 0; for (i, byte) in bytes.iter().enumerate() { if matches!(byte, b'"') { q |= 1u64 << i; } } let od = find_odd_escapes(bytes, context); // // would need to use architecture-specific version or nightly // let r_chunk = (q & !od).carryless_mul(!0u64); // for now, using the manual version let q_unescaped = q & !od; let r_chunk = prefix_xor(q_unescaped); // taking care of the last chunk ending inside a string let r = r_chunk ^ (-(context.in_string as i64) as u64); // r[63] == 1 iff the last byte of this chunk is inside a string // Convention: opening quote counts as in-string, closing quote does not, // so r[63] correctly reflects the state crossing into the next chunk context.in_string = (r >> 63) != 0; (q_unescaped, r) } fn find_whitespace(bytes: &[u8; 64]) -> u64 { let mut w: u64 = 0; for (i, byte) in bytes.iter().enumerate() { if matches!(byte, b' ' | b'\n' | b'\t' | b'\r') { w |= 1u64 << i; } } w } fn find_structural_and_pseudo( q: u64, // quote locations r: u64, // string spans starting at the opening quote, ending before the closing quote s: u64, // structural w: u64, // whitespace context: &mut Context, // prev_whitespace_or_structural: bool ) -> u64 { // eliminate quoted regions from our structural characters let mut s = s & !r; // restore ending quotes to our structural characters // ( for purposes of building pseudo - structural characters ) s |= q; // begin to calculate pseudostructural characters initially // pseudostructural characters are structural or white space let mut p = s | w; // now move our mask for candidate pseudo - structural characters forward by one p <<= 1; // handles case where there was previously whitespace or structural p |= context.whitespace_or_structural as u64; // sets bit 0 if needed // eliminate white - space and quoted characters from our candidates p &= !w & !r; // compute the carry BEFORE merging pseudo-structurals into s // ps are atom-start bytes that are not themselves ws/struct/quote, so they // mustn't feed the carry // ran into bug where n/ull cross buffer was detecting 'u' as structural let next_ws_or_struct: bool = ((s | w) >> 63) & 1 != 0; context.whitespace_or_structural = next_ws_or_struct; // merge pseudo - structural characters into structural character mask s |= p; s // we don't need this bc we actually want to preserve closing quotes for our bit index // the simdjson paper wanted to eliminate them for tape compactness, but we don't have a tape // eliminate ending quotes from our final structural characters // s & !(q & !r) } fn structure_masks(bytes: &[u8; 64], context: &mut Context) -> u64 { let (q, r) = find_quotes_and_range(bytes, context); let s = find_structural(bytes); let w = find_whitespace(bytes); find_structural_and_pseudo(q, r, s, w, context) }
If you read that instead of just rushing down to the bottom of the code block (hello!) you'll notice that the above is not quite the clean and branchless logic presented in Langdale & Lemire. That's because of the frustrating catch.
The Frustrating Catch
If JSON-parsing could actually be done with fully local reasoning on individual bytes, we'd be able to do it fully in parallel. Unfortunately, this is not the case.
The local, byte-by-byte reasoning above assumes that everything relevant to the current chunk begins and ends within itself. There's probably a formal name for that property, but I'm just going to call it 'convenient'. The sample JSON used above is very convenient:
{ "\\\"Nam[{": [ 116,"\\\\" , 234, "true", false ], "t":"\\\"" }
The object begins and ends within the 64-byte chunk. It doesn't trail off into another chunk, it doesn't end halfway through a string, atom, array, or chain of slashes.
If it did end halfway through anything... well, we'd actually be fine. But the NEXT chunk would be missing crucial details. If we stopped on some slashes, we might mistakenly treat an unescaped character as escaped, or vice versa. We might try to interpret a string as JSON and choke on non-ASCII characters or just plain fail as we try to interpret regular old text as JSON.
So while we can process each individual chunk branchlessly, we can only (correctly) process one chunk at a time, and need to carry several bits of important state between chunks: namely, did we end inside a string, did we end on slashes (and if so, was it an even or odd number of them?) and did we end on a structural or whitespace character (which would mean the first character of the new chunk is a pseudo-structural candidate).
It turns out that we don't actually have to track the stack of brackets, at least not for our implementation: the DFA is handling that part, which makes our job a little less uncomfortable.
The other big catch here is that, since we can potentially stop in the middle of an atom, including long strings, value return lags navigation. Assuming we need to actually return values, instead of just byte offsets, we would need to copy the partially-scanned atom into an external buffer and then re-unite it after we scan the next chunk. This will happen periodically even with four- or five-byte atom literals and decimal numbers, but it will happen a LOT with strings on nontrivial length, even if we parse 512 bytes at a time.
This can potentially get pathological in the case of strings that cross multiple chunks, requiring us to push to the carry-over buffer section by section. We need to set some kind of upper bound on this behavior. I've not yet decided if we want to pre-reserve a buffer of sufficient size (the good news is that we only need to account for a single carry-over atom at a time) or if we just want to grow the buffer as we extend the string - I expect the latter will be superior for 99+% of cases, since even very long utf-8 strings shouldn't be more than a few kilobytes - a 16MB string would cover at least 4 million utf-8 characters, maybe half of the Mahabharata... if you have individual strings that long, that's a you problem.
The above implementation has not yet been tested rigorously across pathological chunks, so there's probably still an edge case or two in there. But for the moment, it's sufficient to talk about what this structure enables.
What This Structure Enables
The process above provides us a way to quickly skip over irrelevant bytes: after we've calculated the mask, we can directly jump to the bytes relevant to the DFA, parse those, and transition the DFA accordingly.
This should give us a nice speedup even on fairly dense JSON where the proportion of structural and pseudo-structural is very dense, and will provide a very large speedup on JSON where the proportion is more sparse. Langdale and Lemire suggest that pseudo/structural characters can plausibly appear as often as 1 in 4 characters, or as sparsely as 1 in 40 characters. While Langdale & Lemire don't quite provide a graph of pseudo/structural proportion to speedup, we can eyeball a few figures:
(all figures representing throughput: GB/s)
file | simdjson | sajson | RapidJSON | bytes / structural
marine_ik | 0.94 | 0.67 | 0.45 | 4.6
apache_builds | 2.3 | 1.2 | 0.48 | 10.3
gsoc_2018 | 3.2 | 1.1 | 0.51 | 43.9
In addition to being generally faster than competing implementations, simdjson's throughput increased considerably the more non-structural bytes it was able to skip. Not the only factor by far, check out the original paper for actual details and modeling, but suffice it to say that it's very effective!
The DFA needs a couple pieces of information in order to correctly transition: one is the tokenized value of the transition element (can be a single structural element like {, or the value of a key field). If it's inside of an array, it also needs to know the index of the current element.
That's... really it. It doesn't actually look at values, only keys (though that may change in the future), and keys are strings, generally ASCII, and generally very short, so even in the case where the field values cross chunk or buffer boundaries, it's quite cheap to store them, and it's easy to find their range: from the byte after the current " to the byte before the next ", which is guaranteed to be the next pseudo/structural character.
This makes certain operations very easy indeed. One of the main features in crowley compared to ijson or jsongrep is that instead of just returning matching values and requiring the user to write any aggregation code around it, crowley provides the option to perform common aggregations internally. Instead of writing:
with open(file, "rb") as f: return sum(i for i in ijson.items(f, query))
you can write
return crowley.Query(file, query).agg("sum")
It's shorter to write and easier to read, sure, but it's also doing a lot less work under the hood - even leaving the performance advantages of our current json-event-parser+DFA setup over ijson+yajl_c aside, the ijson approach has to not only find matches, but it must always materialize them, create a new Python object of the appropriate type (remember, even a single Python integer is ~40 bytes across stack and heap!) and sum them all together, with all the list comprehension and type-checking overhead that entering Python-space demands.
In contrast, crowley can skip the allocation, performing all the materialization and addition under the hood on the Rust side and just return a single terminal value.
This is certainly less flexible than writing the Python yourself, and crowley is more fragile than some code you could write yourself: for example, if you wanted to average over matching values, but knew some of the matches would be non-numeric, you could just have Python manually check the type and skip (or attempt to coerce) non-numeric matches, while crowley would throw and error and stop in the same scenario. If all the elements can be represented as integers, manual Python will be able to keep things in an i32 representation, while crowley will defensively convert to f64 to begin with. But of course, you can always retreat back to the manual case by just using .values() and writing the wrapper code yourself as always, and still get a meaningful speedup.
(is that worth adding as a feature, something like a .non_strict() mode? maybe, but it's not implemented for the moment and won't be until I a) actually get users and b) those users tell me unambiguously they want that feature)
Likewise, if all we need to do is return byte offsets of matches, this whole process is extremely lightweight: we don't need to return values at all (though we may wish to look at them anyway just to validate them).
But if we want to do something more sophisticated, things become heavier and more complicated pretty quickly. If we want to return all unique values (and only unique values) we not only need to parse and materialize all matching values (which might not just be atoms, but arbitrarily large arrays and objects), but then hash and store them. At present, crowley will error out if any single matched value exceeds 16MB - a lot, and certainly enough for most purposes, but could potentially be exceeded by large arrays or string-dense objects.
This was also part of the motivation behind one of the finnickier features of crowley v0.1.0: offset recording. I figured that we might want to perform the same query over the same file multiple times in order to extract different outputs from it: maybe we want the count of all matches, but also the count of all unique matches, and the types of unique matches as well! It wouldn't make sense to re-do the same query from scratch, would it? So I added a feature that records the byte offsets of matching values, allowing re-queries to just scan forward on disk if the matches are sparse enough, making requerying much faster.
Other features of 0.1.0 included multifile querying, spawning a rayon thread and building multiple copies of the same DFA for each file to query them in parallel, and container skipping, which attempts to determine if the DFA can no longer find a valid match in the current container, and if so skip until it exits the container (in profiling, most time was spent doing this!).
These and many other attempted and successful optimizations have made the not-yet-released (and maybe never released) crowley v0.1.1-alpha a bit of a bloated mess.
The SIMD rewrite started as an opportunity to build a stronger performance foundation and simplify a lot of the hackier things I'd attempted before, but gradually led me to start thinking in more formal, abstract terms about what I was doing.
Lurching Towards a Model of SIMD JSON Querying
As is so often the case, I didn't even have a conscious model of this problem when I first started on it.
ijson implicitly models streaming JSON querying as one Query on one File. There isn't really a concept of an output mode: it just returns the matching values.
Something I did early on, again without much thought, was to expand this model to one Query on one File with one Output mode - e.g. values, byte offsets, count, unique values, etc. But at the time, I was thinking of Output mode as a property of the Query.
When I added multiquery support, I implicitly determined that we might want to perform the same Query over multiple files. When I added byte offset recording, I was implicitly admitting that the user might want to perform the same query on the same file with multiple Output modes - but since I considered Output mode as a property of the Query, I rendered this as performing different but related queries in succession.
It was only in the course of architecting out the re-design that I realized I was being a bit stupid.
Going forward, I want to conceive of JSON querying as parametrized over three independent elements: the Query, the Accumulator (values, count, types, etc), and the Source (a file in memory, a file streamed from disk, a stream over the network, whatever).
The number of possible valid JSON sources and queries are functionally infinite, while the number of accumulators is more constrained - fewer than a dozen at present, though we may add more in the future.
Queries and Accumulators may be aware of another in a limited sense, but only because some Accumulators are lazy, requiring only notices that a match has been found or that it exists on a certain byte offset, while others require the fully materialized values. This informs the design of the engine, but for the most part, each is independent of the other.
As a consequence of both this re-conception and also the design of the SIMD parser, it should be possible for a single file to drive multiple Queries simultaneously - after all, the SIMD byte classification is performed in bulk up front per chunk, and is totally agnostic of queries, and all queries transition on the same information - but some return values, while others do not.
Each Query DFA should be cheap enough in terms of memory and transition time that we should be able to easily have dozens running against a single Source - I suppose we can always record byte offsets if we like, but why re-scan at all rather than do all querying up front?
Each Query, in turn, may send to multiple Accumulators simultaneously - once again, every Accumulator associated with a Query receives the same match information - they just do different things with it.
In some cases, it should be possible to fuse Accumulators together. For instance, it's possible to track the number of unique matches requiring only the hashes of matches, while the unique values requires a hashset of the values, and a count of how many times each unique value appears requires a hashmap of value and count. In such a case, there's a lot of overlap: all three receive the same values, perform the same hashes, and each one makes the last redundant. So it should make sense to fuse these to use a single hashmap, and then return the map, the keys, and the key count separately as needed.
It then remains natural to perform many queries with many simultaneous accumulators over many files in parallel - for each file with M Queries with P Accumulators, we're still only using a single thread and tightly bounded memory equal to our max buffer size, max carry-over buffer size, and scratch space for things like hash maps - if we're being generous, at most 32MB per thread, typically much less. We should be able to comfortably use as many threads as a modern machine has without worrying much about memory use.
So we can conceive of JSON querying as taking place over the space of M Queries with P Accumulators over N Files, for an MxPxN tensor - though we can expect them to often be jagged, with not all queries sharing the same accumulators, not all files sharing the same queries, etc.
But it does seem possible (I'm still figuring out what a good API for this would look like) to do something like
from crowley import Query [ [ [f1q1a1, f1q1a2, ..., f1q1ap], [f1q2a1, f1q2a2, ..., f1q2ap], ], . . . [ [fnqma1, fnqma2, ..., fnqmap], ], ] = Query([files,], [queries,], [modes,])
and have its memory performance scale linearly with the number of threads and runtime performance scale with N, while M and P are for all intents and purposes not in consideration.
Figuring out a good API for jagged tensor-shaped returns that isn't godawful might be harder than actually building it.
Looking To The Future
I delayed releasing this blog post until I actually had a working prototype of crowley v0.2.0, to avoid making a dunce of myself in case this whole idea was a flop.
While it's still under construction, the core pipeline is already implemented. The first working version had a throughput of ~240MB/s - slightly less performant than ijson on my machine, whereas crowley v0.1.0 managed 880MB/s.
After several rounds of profiling and optimizing, adding NEON instructions instead of using scalar operations, etc, we've finally landed at ~2.2GB/s throughput on my standard benchmark (counting the number of "[*].repo.name" matches on a 26MB dump of github events). That's over twice the original crowley's performance, not only breaking the 1GB/s barrier that galled me enough to begin this rewrite nearly a month ago, but shattering 2GB/s!
I still need to test more vigorously, benchmark more widely, and see how performance differs across accumulator types and with multiple queries/accumulators per file, not to mention enabling more source types than just streaming from disk. But as things stand, I'm quite happy.
It's been a wild ride getting the crowley rewrite to fruition, so obviously I'm tempted to do another rewrite just to try out some alternate ideas.
One of the big architectural decisions I made early on was to perform character classification on a single chunk, THEN parse it and emit events as necessary, and only THEN to move on to the next chunk.
In theory, we could get some benefit out of classifying the entire buffer first, and then parsing it - a lot of systems benefit from doing one thing on a lot of data, then another thing. But at the time, I decided it would be better to guarantee cache locality, to ensure that the raw 64/128/256/512 byte chunk, the chunk's bitmask, and infrastructure like the DFA itself could fit as close to the metal as possible all at the same time.
Of course, now that we've got a successful reference implementation, we could fork and re-architect to try it out the other way first - it wouldn't be very hard - and benchmark both implementations on a wide range of default buffer sizes. I should also try to see how large a typical compiled DFA is, and how the size changes by query complexity. It could very well be the case that we would benefit more from a small buffer classified and parsed separately, and that could very well end up using less memory.
Another possibility I though of more recently, though I'm not sure how the plumbing for this would work - would be a dual-stream system in which the SIMD classifier 'runs ahead', producing bitmasks and storing them at the same time that the parser is scanning and emitting events from previous chunks.
When I considered this, producing the structural masks accounted for roughly a third of runtime (after factoring out benchmarking overhead) while parsing accounted for half (that included querying with the DFA, which accounted for less than 10% of total runtime). In theory, for every chunk that the parser finished examining, the classifier could produce another mask and be well on its way to producing another. But while the classifier's work is almost exactly constant per chunk, the parser can potentially discard an entire chunk in a single operation if it contains zero structural characters (in which case the bitmask would just be 0CHUNKSIZE, and we always check if self.current_mask == 0 { return None; }).
Moreover, since that time, the runtime cost has flipped almost exactly. At present it takes roughly half the runtime to produce masks and roughly a third of the time to parse them (at least on my basic Count benchmark, that could change with other accumulators).
In very 'thin' JSON with a low ratio of structural characters to bytes, such as in L&L's test case that includes 43 bytes per structural character, it's not inconceivable that the parser could starve as the classifier works ahead of it far more slowly.
Even so, the idea may have legs, especially if there's a way to densely store the produced masks (perhaps interleaved with the raw bytes, leaving a CHUNKSIZE/8 gap in memory between chunks for the classifier to fill?).
I'm also keen on exploring other features - chiefly, implementing iterator methods for stream outputs, not just on the Rust side, but on the Python side as well - the ability to add a .map() with a Python lambda function, a conditional for .filter(), or a fold() or reduce(), would potentially make crowley very ergonomic on the other side of the FFI boundary.
One more bugbear of the current codebase is that, at present, the Exists accumulator no longer terminates early, which was one of the nice features of v0.1.0. It wouldn't be hard to make that happen, but because the current design separates accumulation out, it would mean that every query would need to check if it's entered a terminal state on each match in order to break off. I rejected this earlier on, though I expect I'll return to it later in order to figure out how much of a performance penalty this would cause.
Alright, that's enough for now. I'll report back when I have more to say, hopefully with a release-ready crowley v0.2.0 in the near future.