
better array contains (SIMD Quad)#805

Open
lemire wants to merge 3 commits into master from betterarraycontains

Conversation

@lemire
Member

@lemire lemire commented Apr 28, 2026

In the benchmarks below, the new algorithm significantly improves contains performance.

Daniel Lemire, "You can beat the binary search," in Daniel Lemire's blog, April 27, 2026, https://lemire.me/blog/2026/04/27/you-can-beat-the-binary-search/.

On Apple M4, with the census1881 dataset (200 bitmaps), the "After" build is substantially faster than the "Before" reference on most Contains benchmarks. ContainsColdLow drops from 11.2 ns/query to 6.0 ns/query (about 46% less time), ContainsColdMod from 14.7 ns to 10.4 ns (about 29% less), ContainsWarmLow from 5.5 ns to 3.3 ns (about 40% less), and ContainsWarmMod from 13.7 ns to 5.8 ns (over 2× faster). The ContainsHigh cases are essentially unchanged.

On Intel with GCC, the improvements are similar: ContainsColdLow drops from 47.9 ns/query to 33.7 ns (about 30% less time), ContainsColdMod from 108.4 ns to 71.2 ns (about 34% less), ContainsWarmLow from 34.0 ns to 15.8 ns (over 2× faster), and ContainsWarmMod from 50.9 ns to 21.3 ns (also over 2× faster). ContainsHigh again shows negligible change.

Overall, the change yields major speedups on both platforms for the Low and Mod access patterns, in both the cold and warm scenarios. The High (dense) benchmarks show no gain because they measure bitset performance, which this change does not target.

Apple M4

Before

./build/microbenchmarks/benchref --benchmark_filter="ContainsCold|ContainsWarm"
Permission denied, xnu/kpc requires root privileges.
Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory
This does not affect benchmark measurements, only the metadata output.
***WARNING*** Failed to set thread affinity. Estimated CPU frequency may be incorrect.
2026-04-27T23:17:15-04:00
Running ./build/microbenchmarks/benchref
Run on (14 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x14)
Load Average: 3.79, 2.79, 3.05
In RAM volume in MiB (estimated): 1.802830
benchmarking other files: You may pass is a data directory as a parameter.
data source: /Users/dlemire/CVS/github/CRoaring/benchmarks/realdata/census1881
number of bitmaps: 200
performance counters: No privileged access (sudo may help).
---------------------------------------------------------------------------
Benchmark                 Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------
ContainsColdLow      111604 ns       111589 ns         5799 ns/query=11.1589ns
ContainsColdMod      146836 ns       146666 ns         3681 ns/query=14.6666ns
ContainsColdHigh      38584 ns        38549 ns        18716 ns/query=3.85487ns
ContainsWarmLow       55176 ns        55121 ns        12549 ns/query=5.51212ns
ContainsWarmMod      137196 ns       137012 ns         4189 ns/query=13.7012ns
ContainsWarmHigh      19258 ns        19244 ns        33841 ns/query=1.92436ns

After

 ./build/microbenchmarks/bench --benchmark_filter="ContainsCold|ContainsWarm"
Permission denied, xnu/kpc requires root privileges.
Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory
This does not affect benchmark measurements, only the metadata output.
***WARNING*** Failed to set thread affinity. Estimated CPU frequency may be incorrect.
2026-04-27T23:16:51-04:00
Running ./build/microbenchmarks/bench
Run on (14 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x14)
Load Average: 4.60, 2.87, 3.09
In RAM volume in MiB (estimated): 1.802830
benchmarking other files: You may pass is a data directory as a parameter.
data source: /Users/dlemire/CVS/github/CRoaring/benchmarks/realdata/census1881
number of bitmaps: 200
performance counters: No privileged access (sudo may help).
---------------------------------------------------------------------------
Benchmark                 Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------
ContainsColdLow       60419 ns        60410 ns        10999 ns/query=6.04097ns
ContainsColdMod      103576 ns       103539 ns         6662 ns/query=10.3539ns
ContainsColdHigh      38578 ns        38570 ns        18340 ns/query=3.85695ns
ContainsWarmLow       33009 ns        32975 ns        21066 ns/query=3.29752ns
ContainsWarmMod       57625 ns        57542 ns        12032 ns/query=5.75419ns
ContainsWarmHigh      19781 ns        19766 ns        35798 ns/query=1.97662ns

Intel (GCC)

Before

---------------------------------------------------------------------------
Benchmark                 Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------
ContainsColdLow      480671 ns       479386 ns         1460 GHz=3.49345 cycles=1.65674M instructions=1.08021M ns/query=47.9386ns
ContainsColdMod     1087253 ns      1083995 ns          656 GHz=3.48505 cycles=3.59022M instructions=1.44732M ns/query=108.399ns
ContainsColdHigh     214537 ns       213961 ns         3262 GHz=3.49527 cycles=738.232k instructions=544.209k ns/query=21.3961ns
ContainsWarmLow      340623 ns       340020 ns         2027 GHz=3.49336 cycles=1.0953M instructions=1.24707M ns/query=34.002ns
ContainsWarmMod      510054 ns       509316 ns         1000 GHz=3.49301 cycles=1.62198M instructions=1.70033M ns/query=50.9316ns
ContainsWarmHigh      24476 ns        24432 ns        27107 GHz=3.50449 cycles=85.145k instructions=571.847k ns/query=2.4432ns

After

---------------------------------------------------------------------------
Benchmark                 Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------
ContainsColdLow      337554 ns       336616 ns         2041 GHz=3.49441 cycles=1.06576M instructions=1.09451M ns/query=33.6616ns
ContainsColdMod      714265 ns       712343 ns          988 GHz=3.49333 cycles=2.47495M instructions=1.65688M ns/query=71.2343ns
ContainsColdHigh     214359 ns       213743 ns         3310 GHz=3.49499 cycles=718.668k instructions=536.631k ns/query=21.3743ns
ContainsWarmLow      158881 ns       158361 ns         4461 GHz=3.4944 cycles=441.238k instructions=1.20468M ns/query=15.8361ns
ContainsWarmMod      213602 ns       213297 ns         3305 GHz=3.49394 cycles=627.889k instructions=1.90445M ns/query=21.3297ns
ContainsWarmHigh      24423 ns        24386 ns        28736 GHz=3.50426 cycles=85.9k instructions=551.211k ns/query=2.43861ns

@andreigudkov
Member

Great! Where are the new benchmarks (Contains...)? :)
