Skip to content

Use bytes.Index instead of KMP for substring search #98

@dkharms

Description

@dkharms

When we process query like foo:'*bar*' we iterate over all tokens that are stored in fractions and perform substring search using KMP algorithm.

We saw that for some queries findSequence takes a lot of CPU time.
It seems like for small haystack and needles KMP are wasteful (since the majority of tokens is pretty small but still we need to build LPS - it adds additional resource consumption).

And we think substring search can be improved with bytes.Index function from Golang standard library because:

  • for small haystacks and needles it uses vector instructions (SSE4.2 extension and even AVX2 for some cases);
  • it falls back to Rabin-Karp algorithm when dealing with large haystacks and needles (so it's anyway better than stupid bruteforce);

I've already made a PoC and I am ready to provide some results:

pattern · main± ⟩ benchstat kmp.txt bytes-index.txt
goos: linux
goarch: amd64
pkg: github.com/ozontech/seq-db/pattern
cpu: 12th Gen Intel(R) Core(TM) i5-12600K
                            │    kmp.txt     │           bytes-index.txt           │
                            │     sec/op     │    sec/op     vs base               │
FindSequence/tiny-16            61.03n ± 13%   13.30n ± 18%  -78.21% (p=0.002 n=6)
FindSequence/small-16          177.40n ± 31%   18.43n ± 55%  -89.61% (p=0.002 n=6)
FindSequence/medium-16         626.15n ± 25%   45.43n ± 59%  -92.75% (p=0.002 n=6)
FindSequence/large-16         10588.0n ±  5%   542.5n ±  7%  -94.88% (p=0.002 n=6)
FindSequence/extra-large-16    633.90µ ±  2%   58.57µ ±  4%  -90.76% (p=0.002 n=6)
geomean                         2.146µ         204.1n        -90.49%

                            │    kmp.txt     │             bytes-index.txt             │
                            │      B/s       │      B/s        vs base                 │
FindSequence/tiny-16          1000.4Mi ± 15%   4588.1Mi ± 22%   +358.63% (p=0.002 n=6)
FindSequence/small-16          1.344Gi ± 45%   12.938Gi ± 36%   +862.37% (p=0.002 n=6)
FindSequence/medium-16         1.524Gi ± 20%   21.430Gi ± 92%  +1305.72% (p=0.002 n=6)
FindSequence/large-16          1.441Gi ±  6%   28.125Gi ±  8%  +1851.48% (p=0.002 n=6)
FindSequence/extra-large-16    1.541Gi ±  2%   16.675Gi ±  4%   +982.38% (p=0.002 n=6)
geomean                        1.348Gi          14.23Gi         +955.57%

Collected CPU profiles you can check out here:

Warning

And of course we are targeting performance improvements only for x86.

Metadata

Metadata

Assignees

Labels

benchmarksFeatures or improvements over our benchmark processesperformanceFeatures or improvements that positively affect seq-db performance

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions