-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TBS: drop and recreate badger db after exceeding storage limit for TTL time #15106
base: main
Are you sure you want to change the base?
TBS: drop and recreate badger db after exceeding storage limit for TTL time #15106
Conversation
This pull request does not have a backport label. Could you fix it @carsonip? 🙏
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the code is mostly ready, I plan to split this PR into 2 - a refactoring PR with no behavior changes and a bugfix PR with actual "drop and recreate" logic.
} | ||
|
||
func (s *ManagedReadWriter) WriteTraceEvent(traceID, id string, event *modelpb.APMEvent, opts WriterOpts) error { | ||
s.sm.mu.RLock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: not sure if it should RLock and wait for the swap, or just fail right away with TryRLock to avoid blocking ingestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the changes look good to me, thank you @carsonip!
|
8e65516
to
9605745
Compare
9605745
to
1e4d5b3
Compare
now := time.Now() | ||
if firstExceeded.IsZero() { | ||
firstExceeded = now | ||
s.logger.Warnf("badger db size has exceeded storage limit; db will be dropped and recreated if problem persists for `sampling.tail.ttl` (%s)", ttl.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: maybe report badger db size in this log line as well
return len(filenames) == 0 | ||
}, 10*time.Second, 500*time.Millisecond, filenames) | ||
|
||
b, err := os.ReadFile(subscriberPositionFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also check that backup directory is gone?
And that new writes are accepted by badger after DB was recreated?
Motivation/summary
Alternative to #15081
Workaround for cases where apm-server is stuck at storage limit exceeded state indefinitely because badger DB compaction conditions are not satisfied. This PR implements a goroutine that detects this state, and if the state persists for at least TTL time, as the entries in badger DB would have been expired, just drop and recreate the DB to get out of this state.
Checklist
For functional changes, consider:
How to test these changes
Related issues
Fixes #14923