FindFirstFunctions.jl is a package for faster findfirst type functions. These are specialized to improve performance
over more generic implementations.
findfirstequal(x::Int64,A::DenseVector{Int64})Finds the first index in A where the value equals x.
bracketstrictlymontonic(v, x, guess; lt=<comparison>, by=<transform>, rev=false)Starting from an initial guess index, find indices (lo, hi) such that v[lo] β€ x β€ v[hi] according to the specified order, assuming that x is actually within the range of
values found in v. If x is outside that range, either lo will be firstindex(v) or
hi will be lastindex(v).
Note that the results will not typically satisfy lo β€ guess β€ hi. If x is precisely
equal to a value that is not unique in the input v, there is no guarantee that (lo, hi)
will encompass all indices corresponding to that value.
This algorithm is essentially an expanding binary search, which can be used as a precursor
to searchsorted and related functions, which can take lo and hi as arguments. The
purpose of using this function first would be to accelerate convergence in those functions
by using correlated guesses for repeated calls. The best guess for the next call of
this function would be the index returned by the previous call to searchsorted.
See Base.sort! for an explanation of the keyword arguments by, lt and rev.
searchsortedfirstcorrelated(v::AbstractVector, x, guess)An accelerated findfirst on sorted vectors using a bracketed search. Requires a guess
to start the search from, which is either an integer or an instance of Guesser.
An analogous function searchsortedlastcorrelated exists.
Some benchmarks:
using Random, BenchmarkTools, FindFirstFunctions
x = rand(Int, 2048); s = sort(x);
perm = randperm(length(x));
function findbench(f, x, vals)
@inbounds for i = eachindex(x, vals)
v = vals[i]
f(x[v], x) == v || throw("oops")
end
end
@benchmark findbench(FindFirstFunctions.findfirstequal, $x, $perm)
@benchmark findbench((x,v)->findfirst(==(x), v), $x, $perm)
@benchmark findbench(FindFirstFunctions.findfirstsortedequal, $s, $perm)
@benchmark findbench((x,v)->searchsortedfirst(v, x), $s, $perm)Sample results using -Cnative,-prefer-256-bit on an AVX512 capable laptop:
julia> @benchmark findbench(FindFirstFunctions.findfirstequal, $x, $perm)
BenchmarkTools.Trial: 6794 samples with 1 evaluation.
Range (min β¦ max): 141.489 ΞΌs β¦ 190.383 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 145.892 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 145.978 ΞΌs Β± 4.697 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
ββ
ββ βββ
βββ βββ β β β
βββββββββββββββββββββββββββββ
β
β
βββββββββββ
ββββββββ
βββββββββ
β
β
β
141 ΞΌs Histogram: log(frequency) by time 163 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark findbench((x,v)->findfirst(==(x), v), $x, $perm)
BenchmarkTools.Trial: 1765 samples with 1 evaluation.
Range (min β¦ max): 547.812 ΞΌs β¦ 663.534 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 564.245 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 565.600 ΞΌs Β± 14.561 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
ββββ β βββ
ββ
ββββ
βββ
β β
βββββββββββββββββββββββββββββββββ
ββ
β
ββ
β
βββββββ
β
β
βββ
β
β
β
βββββββ
β
548 ΞΌs Histogram: log(frequency) by time 628 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark findbench(FindFirstFunctions.findfirstsortedequal, $s, $perm)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 75.857 ΞΌs β¦ 125.111 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 85.811 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 86.135 ΞΌs Β± 3.217 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
β ββββ
βββββββββββββββββ
ββββββββββ
βββββββββββββββββββββββββββββββββ β
75.9 ΞΌs Histogram: frequency by time 101 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark findbench((x,v)->searchsortedfirst(v, x), $s, $perm)
BenchmarkTools.Trial: 8741 samples with 1 evaluation.
Range (min β¦ max): 108.941 ΞΌs β¦ 152.368 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 113.026 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 113.282 ΞΌs Β± 3.812 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
ββ
β ββββ
βββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
109 ΞΌs Histogram: frequency by time 130 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
Note: This is an unofficial build, please report bugs to the project
responsible for this build and not to the Julia project unless you can
reproduce the issue using official builds available at https://julialang.org/downloads
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 36 Γ Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
Threads: 1 on 36 virtual cores
Environment:
JULIA_PATH = @.
LD_LIBRARY_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/:/usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
JULIA_NUM_THREADS = 36
LD_UN_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/Note, if you're searching sorted collections and on an x86 CPU, it is worth setting the ENV variable JULIA_LLVM_ARGS="-x86-cmov-converter=false" before starting Julia, e.g. on an AVX512 capable CPU, you may wish to start Julia from the command line using
JULIA_LLVM_ARGS="-x86-cmov-converter=false" julia -Cnative,-prefer-256-bitWith this, benchmark results are
julia> @benchmark findbench(FindFirstFunctions.findfirstequal, $x, $perm)
BenchmarkTools.Trial: 6623 samples with 1 evaluation.
Range (min β¦ max): 141.304 ΞΌs β¦ 473.786 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 145.581 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 149.690 ΞΌs Β± 28.577 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
βββββββββ β β β
βββββββββββββββββ
ββ
ββ
β
βββ
ββββββββββββββββββββββββ
β
βββββββββββ β
141 ΞΌs Histogram: log(frequency) by time 302 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark findbench((x,v)->findfirst(==(x), v), $x, $perm)
BenchmarkTools.Trial: 1784 samples with 1 evaluation.
Range (min β¦ max): 546.395 ΞΌs β¦ 660.254 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 560.513 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 559.546 ΞΌs Β± 14.138 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
ββββ βββββββββββββ β
βββββββ
βββ
βββββββββββββββββ
ββ
β
ββββ
βββ
ββββββ
βββββββ
βββββ
ββββββ β
546 ΞΌs Histogram: log(frequency) by time 625 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark findbench(FindFirstFunctions.findfirstsortedequal, $s, $perm)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 45.969 ΞΌs β¦ 73.354 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 47.674 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 47.675 ΞΌs Β± 1.999 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
ββββ
βββββ
ββββ βββ β
ββββββ
βββββββββββββββββββββ
βββββββββββββββββββββββββββββ
βββ β
46 ΞΌs Histogram: log(frequency) by time 58.1 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark findbench((x,v)->searchsortedfirst(v, x), $s, $perm)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 35.988 ΞΌs β¦ 224.353 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 37.807 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 38.966 ΞΌs Β± 7.905 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
ββββ
ββ
ββ ββ β
βββββββββββββββββββββββββββ
ββ
β
ββββββββββ
ββ
ββββ
ββββββ
ββ
β
βββββ β
36 ΞΌs Histogram: log(frequency) by time 57 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark findbench((x,v)->FindFirstFunctions.findfirstsortedequal(x,v,Val(64)), $s, $perm)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 43.709 ΞΌs β¦ 182.914 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 45.227 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 45.954 ΞΌs Β± 5.377 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
ββββ
βββββββββ ββββ β β ββ β
ββββββββββββββββββββββββββββββββ
βββ
βββ
ββββββββββ
β
β
ββββββ
ββ
ββ β
43.7 ΞΌs Histogram: log(frequency) by time 61.4 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark findbench((x,v)->FindFirstFunctions.findfirstsortedequal(x,v,Val(32)), $s, $perm)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 42.482 ΞΌs β¦ 172.067 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 44.422 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 45.765 ΞΌs Β± 8.329 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
ββββββ ββββ β
ββββββββββββββββ
βββββββββ
βββββββββββββββββββββββββββββββββββ β
42.5 ΞΌs Histogram: log(frequency) by time 83.3 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark findbench((x,v)->FindFirstFunctions.findfirstsortedequal(x,v,Val(16)), $s, $perm)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 36.870 ΞΌs β¦ 154.299 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 39.764 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 40.400 ΞΌs Β± 2.552 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
βββ ββββββ
ββββββββ ββ β
βββββββ
ββββββββββββββββββββββββββββββββ
βββ
β
β
β
ββββ
β
βββ
ββ
β
ββββ
β
36.9 ΞΌs Histogram: log(frequency) by time 53 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark findbench((x,v)->FindFirstFunctions.findfirstsortedequal(x,v,Val(8)), $s, $perm)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 26.011 ΞΌs β¦ 48.109 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 26.954 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 27.046 ΞΌs Β± 1.677 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
ββ ββ β β
βββββββ
ββββββββ
β
β
βββββ
ββββ
ββββββ
βββββββββββββββββββββ
β
β
ββ
ββ
β
26 ΞΌs Histogram: log(frequency) by time 37.9 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.The branches in a binary search are unpredictable, thus disabling the conversion of cmov into branches results in a substantial performance increase.
Additionally, enablig cmov (i.e., disabling cmov conversion) greatly reduces the optimal base case size for FindFirstFunctions.findfirstsortedequal. Without cmov, we need a very large base case to avoid too many branches, scanning large swaths contiguously.
With cmov, we can reduce the base case size to 8, taking several additional binary search steps without incurring heavy branch prediction penalties.
However, we default to a large base case size, under the assumptions users are not setting this ENV variable; we assume that an expert user concerned about binary search performance who sets this variable will also be able to choose their own basecase size.
Take care when benchmarking JULIA_LLVM_ARGS="-x86-cmov-converter=false": your CPU's branch predictor can probably memorize a sequence of hundreds of perfectly random branches. Branch predcitors are great at defeating microbenchmarks.
Thus, you need a very long unpredictable sequence (which I tried to do in the above benchmark) to prevent the branch predictor from memorizing it.
In "real world" workloads, your branch predictor isn't going to be able to memorize a sequence of left vs right bisections in your binary search, as you won't be performing the same searches over and over again!
Without making your benchmark realistic, the default setting of converting cmov into branches will look unrealistically good.
If you actually are, memoize. If you're looking for close answers, look for something like bracketstrictlymontonic's guess API.