NeoSmart Technologies is pleased to announce the immediate availability of its open source NeoSmart.Collections library/package of specialized containers and collections designed to eke out performance beyond that which is available in typical libraries and frameworks, by purposely optimizing for specific (common!) use cases at the cost of pretty much everything else.
In many regards, the data structures/containers/collections that ship with pretty much any given framework or standard library are a study in compromise. Language and framework developers have no idea what sort of data users will throw at their collections, or in what order. They have no clue whether or not a malicious third party would ultimately be at liberty to insert carefully crafted values into a container, or even what the general ratio of reads to writes would look like. The shipped code must be resilient, fairly performant, free of any obvious pathological cases with catastrophic memory or computation complexities, and above all, dependable.
It isn’t just language and framework developers that are forced to make choices that don’t necessarily align with your own use case. Even when attempting to identify what alternative data structure you could write up and use for your needs, you’ll often be presented with theoretical 𝒪 numbers that don’t necessarily apply or even have any relevance at all in the real world. An algorithm thrown out the window for having a horrible 𝒪 in Algorithms and Data Structures 101 may very well be your best friend if you can reasonably confine it to a certain subset of conditions or input values – and that’s without delving into processor instruction pipelines, execution units, spatial and temporal data locality, branch prediction, or SIMD processing.
Which brings us to NeoSmart.Collections.SortedList&lt;T&gt;, the star of today's NeoSmart.Collections NuGet package release. Most developers have learned to reach for their library's equivalent of std::set when they need to keep an ordered list of values around for deduplication and querying. Whether it's std::set&lt;T&gt; or System.Collections.SortedSet&lt;T&gt;, the implementation is typically a red-black tree (a type of self-balancing binary search tree) that can guarantee 𝒪(log n) lookups without too much memory overhead or insert-time computation.
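For reference, the tree-backed baseline in .NET looks something like this – a minimal sketch using the BCL's SortedSet&lt;T&gt;, which keeps values deduplicated and in sorted order at the cost of the tree traversal and balancing work discussed in the next paragraph:

```csharp
using System;
using System.Collections.Generic;

class TreeBaseline
{
    static void Main()
    {
        // SortedSet<T> is backed by a self-balancing binary search tree:
        // inserts and lookups are O(log n), and duplicates are silently dropped.
        var timestamps = new SortedSet<long> { 30, 10, 20, 20 };

        Console.WriteLine(timestamps.Contains(20));        // True
        Console.WriteLine(string.Join(", ", timestamps));  // 10, 20, 30
    }
}
```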
But that 𝒪(log n) is extremely misleading. A binary search tree naturally implies a certain amount of memory fragmentation, and a self-balancing binary search tree automatically incurs a fair bit of work on insertion to guarantee the lookup times. Data isn't contiguously allocated, and most tree implementations are not particularly cache friendly. Balancing also means you'll sometimes pull data you don't want or need into the cache, evicting data you might be about to read next.
At first blush, all that sounds downright pleasurable compared to what it costs to store always-sorted-at-rest data in a T[] or List&lt;T&gt;: shifting, on average, n/2 elements on each insert to place items in their correct location, and possibly copying the entire contents of the list or array over to a new memory location if the capacity is exceeded and a larger backing array is needed. But it's also fairly easy to see how, once populated, a sorted T[] array offers preferable real-world lookup times, with better data locality and cache friendliness than your average binary search tree, while still technically making the same (theoretical) 𝒪(log n) guarantee.
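To make that trade-off concrete, here's a minimal sketch of keeping a plain List&lt;T&gt; sorted at rest using nothing but the BCL: every insert pays a binary search plus a shift of everything after the insertion point, but lookups afterwards run against one contiguous, cache-friendly block of memory:

```csharp
using System;
using System.Collections.Generic;

class SortedArrayInsert
{
    // Keep a List<T> sorted at rest: each insert costs an O(log n) binary
    // search plus an O(n) shift of the elements after the insertion point.
    static void InsertSorted<T>(List<T> list, T item)
    {
        int index = list.BinarySearch(item);
        if (index < 0)
        {
            // BinarySearch returns the bitwise complement of the insertion
            // point when the item isn't already present.
            index = ~index;
        }
        list.Insert(index, item); // shifts everything after `index` one slot right
    }

    static void Main()
    {
        var list = new List<int> { 10, 30, 50 };
        InsertSorted(list, 20);
        Console.WriteLine(string.Join(", ", list)); // 10, 20, 30, 50

        // Lookups are the same O(log n) as a tree, but over contiguous memory:
        Console.WriteLine(list.BinarySearch(30) >= 0); // True
    }
}
```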
But that’s on paper in a world of unknowns where anything can happen. In the here and now, we have data we need to load, data we need to insert, and often, a pretty good idea of what (dis)order that data will arrive in. We generally know whether we’re more likely to read or write, and maybe even where those writes will occur. While a binary tree variant may offer better insertion times if you’re dealing with a constant stream of new elements in no particular order, in the real world we often have a chunk of n elements to insert all at once or elements arriving in-order – or even just reasonably close to it! – past a certain point.
Here's a scenario that's certainly familiar to anyone who has developed an application with mutable state loaded/restored from disk: you need an always-sorted collection of T, and once your initial data is loaded from whatever input source, new data generally arrives in-order, except that events/items arriving in close temporal proximity to one another can sometimes reach you out of order due to a queuing mechanism that doesn't guarantee FIFO semantics. Maybe your event queue doesn't guarantee exactly-once semantics either, and the same message can be delivered twice in cases of congestion or network delays; or perhaps you need to deduplicate data that arrives within a 5-second window – or maybe not.
Whatever the case, you find yourself dealing with data that arrives more-or-less in order. Not only can a sorted list match the insertion performance of an RB or AVL tree if the value to be inserted is strictly greater than (or equal to, if that's what you want) the previous maximum value in the collection, it can even guarantee constant-time 𝒪(1) insertions.1 And if the general monotonic trend holds, even when you do need to insert out-of-order data, you only need to shuffle around the few recently arrived items that sort after the element being inserted: in other words, even when the going gets tough, the tough get going; no more 𝒪(n) insertions here!
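A rough sketch of that fast path might look something like the following. This illustrates the idea rather than reproducing the actual NeoSmart.Collections implementation; the MostlySortedList&lt;T&gt; name and its internals are purely for illustration:

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of the mostly-in-order fast path described above,
// not necessarily how NeoSmart.Collections implements it internally.
class MostlySortedList<T> where T : IComparable<T>
{
    private readonly List<T> _items = new List<T>();

    public void Add(T item)
    {
        // Fast path: in-order (or equal) arrivals are a plain append,
        // O(1) with sufficient preallocated capacity (amortized otherwise).
        if (_items.Count == 0 || item.CompareTo(_items[_items.Count - 1]) >= 0)
        {
            _items.Add(item);
            return;
        }

        // Slightly out-of-order arrivals: scan back from the end, shifting
        // only the handful of recently-arrived items that sort after the
        // new element, rather than up to n/2 elements on average.
        int i = _items.Count - 1;
        _items.Add(_items[i]); // make room by duplicating the last element
        while (i > 0 && item.CompareTo(_items[i - 1]) < 0)
        {
            _items[i] = _items[i - 1];
            i--;
        }
        _items[i] = item;
    }

    public IReadOnlyList<T> Items => _items;
}
```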
It is for this reason that the .NET BCL includes SortedList&lt;K,V&gt; in addition to the red-black tree-backed SortedDictionary&lt;K,V&gt;: sometimes a list/vector/array is faster than a tree, even for sorted data. But unfortunately, there's no equivalent System.Collections.SortedList&lt;T&gt; structure to be found, not in .NET Framework 4.8 nor in the upcoming .NET Core 3.0. But that's OK, because we can write our own and even include some nice optimizations for even more specific access patterns that aren't in SortedList&lt;K,V&gt;. There's a good chance that if you've needed a data structure optimized for this particular access pattern, you've already written it and have been using it happily ever since. But in case you haven't, read on.
NeoSmart.Collections provides this missing piece, with our SortedList&lt;T&gt; and UniqueSortedList&lt;T&gt; offering both faster insertions and faster lookups than the System.Collections offerings… provided your data follows the mostly-in-order access patterns described above. SortedSet doesn't give you the option of keeping duplicates or equivalent values around in the collection, but our SortedList&lt;T&gt; handles that without a problem. And if you do need to deduplicate your inputs, then UniqueSortedList&lt;T&gt; is your friend, with no extra cost incurred for the deduplication.
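Usage is meant to feel like any other collection. Here's a quick sketch, assuming an Add method and standard IEnumerable&lt;T&gt; enumeration like the BCL collections; consult the GitHub repository for the exact API surface:

```csharp
using System;
using NeoSmart.Collections;

class Demo
{
    static void Main()
    {
        // Mostly in-order event timestamps, with one late arrival and one duplicate.
        long[] arrivals = { 100, 101, 103, 102, 103, 104 };

        // SortedList<T> keeps everything, duplicates included.
        var all = new SortedList<long>();
        // UniqueSortedList<T> drops the duplicate delivery.
        var unique = new UniqueSortedList<long>();

        foreach (var t in arrivals)
        {
            all.Add(t);
            unique.Add(t);
        }

        // Assumes both types enumerate their contents in sorted order.
        Console.WriteLine(string.Join(", ", all));    // 100, 101, 102, 103, 103, 104
        Console.WriteLine(string.Join(", ", unique)); // 100, 101, 102, 103, 104
    }
}
```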
NeoSmart.Collections is released open source under the MIT license, and contributions are both encouraged and welcome. The source is available on GitHub, and the package binaries and symbols are available on nuget.org as NeoSmart.Collections. The package includes the previously mentioned SortedList&lt;T&gt; and UniqueSortedList&lt;T&gt; classes, as well as some other handy data structures that aren't your textbook go-to collections, but tend to come in handy and perform really well in the real world where it counts.
1. Presuming sufficient preallocation of capacity.