-
Notifications
You must be signed in to change notification settings - Fork 4
perf: use bytes.Index
for substring search
#104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #104 +/- ##
==========================================
- Coverage 71.87% 71.78% -0.10%
==========================================
Files 200 200
Lines 18076 18050 -26
==========================================
- Hits 12993 12958 -35
- Misses 4364 4371 +7
- Partials 719 721 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
📝 WalkthroughWalkthroughReplaces the KMP-based substring search with a bytes.Index-based implementation, updates wildcard pattern handling to use [][]byte and unsafe string→[]byte conversions, and adds data-driven tests and benchmarks for findSequence. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Assessment against linked issues
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (4)
pattern/substring.go (1)
7-16
: Tighten the loop and avoid double indexing.Minor cleanup and a tiny win: range with index+value and use
val
forbytes.Index
. No behavior change.-func findSequence(haystack []byte, needles [][]byte) int { - for cur := range needles { - val := needles[cur] - start := bytes.Index(haystack, needles[cur]) - if start == -1 { - return cur - } - haystack = haystack[start+len(val):] - } - return len(needles) -} +func findSequence(haystack []byte, needles [][]byte) int { + for i, val := range needles { + start := bytes.Index(haystack, val) + if start == -1 { + return i + } + haystack = haystack[start+len(val):] + } + return len(needles) +}Do we ever expect empty needles? If yes, current logic treats them as “found” without advancing. If that’s undesirable, we should skip/forbid empties and add a unit test.
pattern/pattern.go (3)
78-81
: Unsafe aliasing needs lifetime pinning (keep original strings alive).
middle
now holds[][]byte
produced via an unsafe string→[]byte view. To avoid GC lifetime hazards and future accidental mutation, keep the original strings alongside the byte slices.type wildcardSearch struct { baseSearch prefix []byte suffix []byte - middle [][]byte + middle [][]byte + // keep original strings to pin GC lifetime when using unsafe conversions + middleKeepAlive []string middleLen int narrowed bool }
99-102
: Pin the source strings when using StringToByteUnsafe.Store the source string to ensure it stays alive for the duration of the search. Also document the read-only invariant.
- val := util.StringToByteUnsafe(terms[i].Data) - s.middle = append(s.middle, val) - s.middleLen += len(val) + str := terms[i].Data + val := util.StringToByteUnsafe(str) // read-only view; do not mutate + s.middle = append(s.middle, val) + s.middleKeepAlive = append(s.middleKeepAlive, str) // pin lifetime + s.middleLen += len(val)If avoiding any unsafe GC nuance is preferred, switch to a safe copy:
- val := util.StringToByteUnsafe(terms[i].Data) + val := []byte(terms[i].Data) // safe copy
151-152
: Guard slice bounds locally (defensive check).
checkMiddle
guards length globally, but a local bound check keepsfindSequence
call obviously safe and future-proof if conditions change.- return findSequence(val[len(s.prefix):len(val)-len(s.suffix)], s.middle) == len(s.middle) + lo, hi := len(s.prefix), len(val)-len(s.suffix) + if hi < lo { + return false + } + return findSequence(val[lo:hi], s.middle) == len(s.middle)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
pattern/pattern.go
(2 hunks)pattern/substring.go
(1 hunks)pattern/substring_test.go
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (3)
pattern/substring_test.go (3)
24-40
: Tests look good.Nice coverage on order, overlaps, and corner growth cases.
125-131
: LGTM.Simple and fast random byte generator.
133-135
: LGTM.Nice helper for readability in deterministic cases.
2b9cefd
to
abb9bb5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (3)
pattern/substring_test.go (3)
72-79
: Use the canonical benchmark loop; drop redundant conversion.Replace
b.Loop()
and[]byte(c.haystack)
.- b.Run(tc.name+"-"+strconv.Itoa(i), func(b *testing.B) { - for b.Loop() { - findSequence([]byte(c.haystack), c.needles) - } - }) + b.Run(tc.name+"-"+strconv.Itoa(i), func(b *testing.B) { + b.ReportAllocs() + for i := 0; i < b.N; i++ { + findSequence(c.haystack, c.needles) + } + })
97-108
: Stabilize random bench; set bytes once.Move
SetBytes
out of the loop and use the canonical loop.b.Run(size.name, func(b *testing.B) { haystack, needles := generateTestData( size.haystackSize, size.needleSize, size.needleCount, 256, ) - b.ResetTimer() - for b.Loop() { - findSequence(haystack, needles) - b.SetBytes(int64(len(haystack))) - } + b.ReportAllocs() + b.SetBytes(int64(len(haystack))) + b.ResetTimer() + for i := 0; i < b.N; i++ { + findSequence(haystack, needles) + } })
111-123
: Fix compile error and bound checks in data generator.Looping over
int
withrange
doesn’t compile;rand.Intn(len(haystack)-needleSize)
can panic; copy must be bounded.func generateTestData(haystackSize, needleSize, needleCount, charset int) ([]byte, [][]byte) { haystack := generateRandomBytes(haystackSize, charset) needles := make([][]byte, needleCount) - for i := range needleCount { - pattern := generateRandomBytes(needleSize, charset) - pos := rand.Intn(len(haystack) - needleSize) - copy(haystack[pos:], pattern) - needles[i] = pattern - } + if needleSize > haystackSize { + needleSize = haystackSize + } + for i := 0; i < needleCount; i++ { + pattern := generateRandomBytes(needleSize, charset) + if needleSize > 0 { + pos := rand.Intn(len(haystack)-needleSize+1) + copy(haystack[pos:pos+needleSize], pattern) + } + needles[i] = pattern + } return haystack, needles }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
pattern/pattern.go
(2 hunks)pattern/substring.go
(1 hunks)pattern/substring_test.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pattern/substring.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (1)
pattern/pattern.go (1)
342-349
: Ignore empty-range wrap-around risk
baseSearch.first isint(tp.FirstTID())
(≥1), andutil.BinSearchInRange
never returns below that, so on “not found”s.last = s.first - 1yields
s.last ≥ 0
. Casting these non-negative ints touint32
can’t wrap.Likely an incorrect or invalid review comment.
Description
Here's the benchmarks for KMP and
bytes.Index
(I excluded comparison of heap allocation since both version had zero heap allocations).So we see that on regular cases we have around 53% of speedup by using
bytes.Index
. And on large strings we see around of 90% speedup (I believe that's because we build LPS for needles, not the haystack).Here are features of my CPU (it supports both SSE4.2 and AVX2):
Closes #98
If you have used LLM/AI assistance please provide model name and full prompt:
Summary by CodeRabbit
Refactor
Tests