Implementing predecessor searches (finding the largest element no greater than a query key) in static sorted sets lies at the intersection of simple, self-contained problems and useful ones. It's so simple that my introduction to data structures class started with linear and binary search. Yet even binary search enables a compact representation of sets--always interesting for multi-GB datasets--and is a building block for bulk operations like joins.
That's why a lot of research has gone into optimising predecessor searches for each generation of computer microarchitecture. Work on in-memory searches in the past decade or two has focused on cache-oriented performance models. These models do not capture the "costs" that actually matter on contemporary server-class x86 machines; the consequence is data structures that optimise for the wrong cost metric.
I will present results from a collaboration with Pat Morin and show how worrying about latency rather than block transfers helps us exploit latent memory-level parallelism. I will also try to explain the thought processes that allowed us to find this mismatch between orthodox complexity models and real machines.