Skip to content

Conversation

dkharms
Copy link
Member

@dkharms dkharms commented Sep 1, 2025

Description

Here's the benchmarks for KMP and bytes.Index (I excluded comparison of heap allocation since both version had zero heap allocations).

goos: linux
goarch: amd64
pkg: github.com/ozontech/seq-db/pattern
cpu: 12th Gen Intel(R) Core(TM) i5-12600K
                                              │    kmp.txt    │              bytes.txt               │
                                              │    sec/op     │    sec/op     vs base                │
FindSequence_Deterministic/regular-cases-0-16   19.605n ±  7%   8.456n ±  1%  -56.87% (p=0.000 n=10)
FindSequence_Deterministic/regular-cases-1-16   14.305n ±  2%   5.590n ±  1%  -60.92% (p=0.000 n=10)
FindSequence_Deterministic/corner-cases-0-16     91.35n ±  3%   22.86n ±  1%  -74.98% (p=0.000 n=10)
FindSequence_Deterministic/corner-cases-1-16    167.70n ±  2%   28.44n ±  1%  -83.04% (p=0.000 n=10)
FindSequence_Deterministic/corner-cases-2-16    2482.5n ±  1%   181.8n ±  1%  -92.68% (p=0.000 n=10)
FindSequence_Deterministic/corner-cases-3-16    39.509µ ±  1%   2.663µ ±  1%  -93.26% (p=0.000 n=10)
FindSequence_Random/tiny-16                      63.56n ± 34%   12.10n ± 11%  -80.96% (p=0.000 n=10)
FindSequence_Random/small-16                    190.70n ±  8%   23.70n ± 39%  -87.57% (p=0.000 n=10)
FindSequence_Random/medium-16                   681.70n ± 25%   41.71n ± 28%  -93.88% (p=0.000 n=10)
FindSequence_Random/large-16                    9804.5n ±  2%   538.2n ±  7%  -94.51% (p=0.000 n=10)
FindSequence_Random/extra-large-16              603.54µ ±  1%   58.99µ ±  2%  -90.23% (p=0.000 n=10)
geomean                                          702.6n         94.94n        -86.49%

                                   │    kmp.txt    │                bytes.txt                 │
                                   │      B/s      │      B/s        vs base                  │
FindSequence_Random/tiny-16          960.4Mi ± 52%   5044.3Mi ± 12%   +425.25% (p=0.000 n=10)
FindSequence_Random/small-16         1.250Gi ±  9%   10.424Gi ± 33%   +733.58% (p=0.000 n=10)
FindSequence_Random/medium-16        1.402Gi ± 20%   22.882Gi ± 32%  +1531.58% (p=0.000 n=10)
FindSequence_Random/large-16         1.556Gi ±  2%   28.363Gi ±  7%  +1722.42% (p=0.000 n=10)
FindSequence_Random/extra-large-16   1.618Gi ±  1%   16.555Gi ±  2%   +923.15% (p=0.000 n=10)
geomean                              1.329Gi          14.07Gi         +959.02%

So we see that on regular cases we have around 53% of speedup by using bytes.Index. And on large strings we see around of 90% speedup (I believe that's because we build LPS for needles, not the haystack).

Here are features of my CPU (it supports both SSE4.2 and AVX2):

Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 **sse4_2** x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 **avx2** smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities

Closes #98


  • I have read and followed all requirements in CONTRIBUTING.md;
  • I used LLM/AI assistance to make this pull request;

If you have used LLM/AI assistance please provide model name and full prompt:

Model: {{model-name}}
Prompt: {{prompt}}

Summary by CodeRabbit

  • Refactor

    • Streamlined pattern matching to a lighter, byte-based approach, reducing overhead and improving runtime efficiency in search operations.
    • Optimized memory usage during multi-part pattern searches.
  • Tests

    • Replaced legacy unit test with comprehensive, data-driven test coverage for sequence matching.
    • Added deterministic and randomized benchmarks to assess performance across typical and edge-case inputs.
    • Introduced utilities to generate large and varied datasets for more robust validation.

@dkharms dkharms self-assigned this Sep 1, 2025
@dkharms dkharms added benchmarks Features or improvements over our benchmark processes performance Features or improvements that positively affect seq-db performance labels Sep 1, 2025
@codecov-commenter
Copy link

codecov-commenter commented Sep 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.78%. Comparing base (3b00e02) to head (abb9bb5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #104      +/-   ##
==========================================
- Coverage   71.87%   71.78%   -0.10%     
==========================================
  Files         200      200              
  Lines       18076    18050      -26     
==========================================
- Hits        12993    12958      -35     
- Misses       4364     4371       +7     
- Partials      719      721       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link

coderabbitai bot commented Sep 1, 2025

📝 Walkthrough

Walkthrough

Replaces the KMP-based substring search with a bytes.Index-based implementation, updates wildcard pattern handling to use [][]byte and unsafe string→[]byte conversions, and adds data-driven tests and benchmarks for findSequence.

Changes

Cohort / File(s) Summary
Substring search refactor
pattern/substring.go
Replaced KMP/prefix-function implementation with a bytes.Index-based sequence search; changed findSequence signature to accept [][]byte; removed substring type and related helpers.
Wildcard search integration
pattern/pattern.go
Changed wildcardSearch.middle from []*substring to [][]byte; use util.StringToByteUnsafe(...) when building middle slices; update middleLen accumulation to len(val).
Tests and benchmarks
pattern/substring_test.go
Replaced previous unit test with TestFindSequence; added deterministic and random benchmarks, test-data generators, and supporting imports.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Assessment against linked issues

Objective Addressed Explanation
Replace KMP with bytes.Index in findSequence (#98)
Maintain findSequence behavior while changing implementation (#98) Tests added, but behavioral equivalence for all edge cases (overlaps, empty needles, boundary conditions) not exhaustively demonstrated.
Optimize for x86 by leveraging stdlib’s SIMD-backed bytes.Index (#98)
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch 98-bytes-index

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
pattern/substring.go (1)

7-16: Tighten the loop and avoid double indexing.

Minor cleanup and a tiny win: range with index+value and use val for bytes.Index. No behavior change.

-func findSequence(haystack []byte, needles [][]byte) int {
-	for cur := range needles {
-		val := needles[cur]
-		start := bytes.Index(haystack, needles[cur])
-		if start == -1 {
-			return cur
-		}
-		haystack = haystack[start+len(val):]
-	}
-	return len(needles)
-}
+func findSequence(haystack []byte, needles [][]byte) int {
+	for i, val := range needles {
+		start := bytes.Index(haystack, val)
+		if start == -1 {
+			return i
+		}
+		haystack = haystack[start+len(val):]
+	}
+	return len(needles)
+}

Do we ever expect empty needles? If yes, current logic treats them as “found” without advancing. If that’s undesirable, we should skip/forbid empties and add a unit test.

pattern/pattern.go (3)

78-81: Unsafe aliasing needs lifetime pinning (keep original strings alive).

middle now holds [][]byte produced via an unsafe string→[]byte view. To avoid GC lifetime hazards and future accidental mutation, keep the original strings alongside the byte slices.

 type wildcardSearch struct {
   baseSearch
   prefix    []byte
   suffix    []byte
-  middle    [][]byte
+  middle    [][]byte
+  // keep original strings to pin GC lifetime when using unsafe conversions
+  middleKeepAlive []string
   middleLen int
   narrowed  bool
 }

99-102: Pin the source strings when using StringToByteUnsafe.

Store the source string to ensure it stays alive for the duration of the search. Also document the read-only invariant.

-      val := util.StringToByteUnsafe(terms[i].Data)
-      s.middle = append(s.middle, val)
-      s.middleLen += len(val)
+      str := terms[i].Data
+      val := util.StringToByteUnsafe(str) // read-only view; do not mutate
+      s.middle = append(s.middle, val)
+      s.middleKeepAlive = append(s.middleKeepAlive, str) // pin lifetime
+      s.middleLen += len(val)

If avoiding any unsafe GC nuance is preferred, switch to a safe copy:

-      val := util.StringToByteUnsafe(terms[i].Data)
+      val := []byte(terms[i].Data) // safe copy

151-152: Guard slice bounds locally (defensive check).

checkMiddle guards length globally, but a local bound check keeps findSequence call obviously safe and future-proof if conditions change.

-  return findSequence(val[len(s.prefix):len(val)-len(s.suffix)], s.middle) == len(s.middle)
+  lo, hi := len(s.prefix), len(val)-len(s.suffix)
+  if hi < lo {
+    return false
+  }
+  return findSequence(val[lo:hi], s.middle) == len(s.middle)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 035c094 and 2b9cefd.

📒 Files selected for processing (3)
  • pattern/pattern.go (2 hunks)
  • pattern/substring.go (1 hunks)
  • pattern/substring_test.go (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (3)
pattern/substring_test.go (3)

24-40: Tests look good.

Nice coverage on order, overlaps, and corner growth cases.


125-131: LGTM.

Simple and fast random byte generator.


133-135: LGTM.

Nice helper for readability in deterministic cases.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
pattern/substring_test.go (3)

72-79: Use the canonical benchmark loop; drop redundant conversion.

Replace b.Loop() and []byte(c.haystack).

-      b.Run(tc.name+"-"+strconv.Itoa(i), func(b *testing.B) {
-        for b.Loop() {
-          findSequence([]byte(c.haystack), c.needles)
-        }
-      })
+      b.Run(tc.name+"-"+strconv.Itoa(i), func(b *testing.B) {
+        b.ReportAllocs()
+        for i := 0; i < b.N; i++ {
+          findSequence(c.haystack, c.needles)
+        }
+      })

97-108: Stabilize random bench; set bytes once.

Move SetBytes out of the loop and use the canonical loop.

   b.Run(size.name, func(b *testing.B) {
     haystack, needles := generateTestData(
       size.haystackSize, size.needleSize, size.needleCount, 256,
     )
-    b.ResetTimer()
-    for b.Loop() {
-      findSequence(haystack, needles)
-      b.SetBytes(int64(len(haystack)))
-    }
+    b.ReportAllocs()
+    b.SetBytes(int64(len(haystack)))
+    b.ResetTimer()
+    for i := 0; i < b.N; i++ {
+      findSequence(haystack, needles)
+    }
   })

111-123: Fix compile error and bound checks in data generator.

Looping over int with range doesn’t compile; rand.Intn(len(haystack)-needleSize) can panic; copy must be bounded.

 func generateTestData(haystackSize, needleSize, needleCount, charset int) ([]byte, [][]byte) {
   haystack := generateRandomBytes(haystackSize, charset)

   needles := make([][]byte, needleCount)
-  for i := range needleCount {
-    pattern := generateRandomBytes(needleSize, charset)
-    pos := rand.Intn(len(haystack) - needleSize)
-    copy(haystack[pos:], pattern)
-    needles[i] = pattern
-  }
+  if needleSize > haystackSize {
+    needleSize = haystackSize
+  }
+  for i := 0; i < needleCount; i++ {
+    pattern := generateRandomBytes(needleSize, charset)
+    if needleSize > 0 {
+      pos := rand.Intn(len(haystack)-needleSize+1)
+      copy(haystack[pos:pos+needleSize], pattern)
+    }
+    needles[i] = pattern
+  }

   return haystack, needles
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2b9cefd and abb9bb5.

📒 Files selected for processing (3)
  • pattern/pattern.go (2 hunks)
  • pattern/substring.go (1 hunks)
  • pattern/substring_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pattern/substring.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (1)
pattern/pattern.go (1)

342-349: Ignore empty-range wrap-around risk
baseSearch.first is int(tp.FirstTID()) (≥1), and util.BinSearchInRange never returns below that, so on “not found”

s.last = s.first - 1  

yields s.last ≥ 0. Casting these non-negative ints to uint32 can’t wrap.

Likely an incorrect or invalid review comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
benchmarks Features or improvements over our benchmark processes performance Features or improvements that positively affect seq-db performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use bytes.Index instead of KMP for substring search
2 participants