CNDB-9104: Port over chunk cache improvements from DSE #1495
base: main
Conversation
Using CachingRebuffererTest.calculateMemoryOverhead with 1.5M entries. Cache size set at 4096 MiB. Bytes on heap per entry: 320
Saves at least 40 bytes per cache entry (12.5%) and 20% of the insertion time. Bytes on heap per entry: 280
Bytes on heap per entry: 232
With fileIDs: this has no effect on the performance of the cache. Bytes on heap per entry: 222
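For context, a per-entry figure like the ones above can be obtained with a simple before/after heap measurement around bulk insertion. A minimal sketch, assuming a populateCache callback that fills the cache; the actual CachingRebuffererTest.calculateMemoryOverhead may work differently:

// Hedged sketch of a per-entry heap overhead measurement; illustrative only.
public final class HeapOverheadSketch
{
    static long usedHeap()
    {
        System.gc(); // best-effort; a real harness would settle over several GC cycles
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Prints bytes on heap per entry after running the given cache-population step. */
    public static void report(long entries, Runnable populateCache)
    {
        long before = usedHeap();
        populateCache.run(); // e.g. insert 1.5M chunk cache entries
        long after = usedHeap();
        System.out.printf("Bytes on heap per entry: %d%n", (after - before) / entries);
    }
}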
This reverts commit be317b2.
This reverts commit 1a15fa5.
Quality Gate failed: Failed conditions
…normally-allocated direct byte buffer.
if (buffers.length > 1)
    return new MultiRegionChunk(position, buffers);
else
    return new SingleRegionChunk(position, buffers[0]);
}
public ChunkCache(BufferPool pool, int cacheSizeInMB, Function<ChunkCache, ChunkCacheMetrics> createMetrics)
Nit: I personally find it a tad inconvenient to have the ctor so deep within the class; I'd prefer moving it to the beginning (not new to these changes admittedly, but there is a lot of code before this now).
The file is now rearranged.
@@ -124,6 +125,9 @@ private static FileChannel openChannel(File file)
    try { channel.close(); }
    catch (Throwable t2) { t.addSuppressed(t2); }
}

// Invalidate any cache entries that may exist for a previous file with the same name.
ChunkCache.instance.invalidateFile(file);
Most importantly, ChunkCache.instance can be null, which would break here.
Also, I don't have a super good alternative in mind at the moment, but it does feel rather fragile to me that we have to call this everywhere we write files that may be used by the chunk cache. If some code, especially in cndb, decided to write files by another means in some special case, this would be really easy to forget/get wrong (side note: I'm not saying it's the case right now, I don't think it is, but I'm also really not 100% sure).
Admittedly a half-baked idea, but couldn't we, say, shove the file modification time into the cache fileId so we don't need to do this?
This is a general problem for caches, isn't it? How does NSS deal with it?
For the cache to be efficient, all of this needs to be done only at the time the file is opened (even ignoring the time it takes to fetch the modification time, checking it fully defeats early open).
Currently we invalidate in two places:
- when we start writing to a file
- when we clean up a file handle
Because files enter Cassandra when we write them or on restart (which the chunk cache does not survive), the former is currently sufficient. Another alternative is to invalidate on creating a file handle, which is trickier because I am not sure how to make it preserve early caches. Actually, invalidation on dropping a file handle may also break that; I need to look further into it.
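As an illustration, the first of those two hooks boils down to something like this (a sketch using the ChunkCache.instance and invalidateFile(File) names visible in the diff excerpts; the method name is hypothetical):

// Sketch of "invalidate when we start writing": clear any entries left behind
// by a previous file with the same name before producing new content.
void onOpenForWrite(File file)
{
    if (ChunkCache.instance != null) // the cache can be disabled, leaving instance null
        ChunkCache.instance.invalidateFile(file);
}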
This is a general problem for caches, isn't it?
Is it? What bothers me here is that SequentialWriter, which is a fairly generic file-writing utility, has to worry about the chunk cache, which feels like a completely separate abstraction. It just feels really easy to mess up if every piece of code that writes files has to worry about the cache, is all.
How does NSS deal with it?
NSS assumes that the files for which it stores data are "forever" immutable, and so you never need to invalidate for correctness (only to reclaim space earlier, which NSS doesn't even bother with at the moment). That's why the whole "immutable SAI component" part was so important to NSS, in fact.
Which actually begs the question: do we really need this for the chunk cache in practice? Or is this more defensive/for specific tests that reuse files? Can't we just make the same assumption in the chunk cache that file names of stuff going to the chunk cache are never reused? Am I missing something here?
I'll also note here too that this code assumes that the static instance of the chunk cache is the only instance, which happens to not be true in CNDB today.
We ran into bugs caused by file reuse, usually during restore. I would not assume UUID naming completely resolves it. I don't know if it is still possible to hit with SAI index rebuild.
The alternative we are left with is to invalidate during FileHandle.Builder.complete, with some option to not do it, needed by sstable writers for early open. Making sure we don't invalidate anything else we need is not going to be fun.
We ran into bugs caused by file reuse, usually during restore. I would not assume UUID naming completely resolves it.
Let me pause on this because I don't follow. Are you saying that sstable names don't always happen to be unique, and that the same sstable name could be used by different sstables? That would obviously be a huge issue for CNDB if that were true though?
Yes, file names should be unique, but nothing guarantees they are, at the very least for on-prem deployments. We have hit problems with this multiple times (DB-2334, DB-2497, the SAI index rebuild issue on CNDB, and there was recently a related issue on the C* 5 line as well).
I would be very happy if we didn't have to do this.
I'm sorry to insist, but this seems important and I need to make sure I understand your points.
I think what I forgot about is that we used to identify sstables with just a "generation", and in that case sstable file names were indeed not globally and forever unique. Looking at DB-2334/DB-2497, those look pretty old, so I assume that might be what was going on there.
But with "modern" sstables, the UUID or ULID should guarantee that they are unique. Do we agree? Or are you saying you don't trust UUID/ULID to be genuinely unique?
And if the worry is old-style generation-based sstables, are we sure we still need to worry about them? Because I believe even those would only be a problem if you "restore" sstables into a live cluster, and is that even still supported for those old sstable formats?
Anyway, my point is that it seems we have to buy a bunch of complexity here for "file names that could be reused", and I'm not quite sure that's an actual possibility in modern Cassandra, so I wonder if it's worth the trouble is all.
But mostly, if we do protect against it, I just want us to be clear on why we do it. If it's for old generation-based files, then it's a bit sad we still have to worry about this, but so be it. If instead we're saying "this really shouldn't happen anymore and shouldn't be needed, but we've been bitten before and we'd feel better if we were defensive against future use cases", then I'm not personally convinced it's worth it, but I don't strongly object. But if we're instead saying "no, we could actually have problems with modern UUID/ULID-based sstables", then that's where I'm lost and need an explanation of what the problem is with those.
I can't prove that we won't run into a version of this issue even with UUID. Are we guaranteed, for one, that we can't have a situation like an index rebuild creating files with the same name? Are we sure that (some version of) repair can't insert a slightly different version of a file, for example if it streams new parts of a partially-streamed file after the previous part has been compacted?
We also have to consider HCD and the potential regression not handling this could cause to installations (with generation-based sstable names) that have had the invalidation in place.
Alright, fair enough. I was just hoping that maybe we could save ourselves some trouble. And while I'm not quite sure those are real problems at the moment, I get the point that this could at least easily become a problem someday. Plus, I don't quite know the answer regarding HCD and am happy to play it safe. So happy to handle this if we have a good solution.
@@ -238,7 +238,7 @@ public String name()

 public void tidy()
 {
-    chunkCache.ifPresent(cache -> cache.invalidateFile(name()));
+    ChunkCache.removeFileIdFromCache(channel.getFile());
I'd argue that not using the local chunkCache here is a regression, for 2 reasons:
- it assumes chunkCache.get() == ChunkCache.instance. But currently in CNDB we sometimes use a different chunk cache instance for some files, where this would be incorrect. Admittedly, the benefits of that separate instance are debatable, but it is used currently.
- it also kind of assumes that if the chunk cache is used, then this instance of FileHandle is meant to use it, which is technically not guaranteed. In theory, nothing quite forbids the same file from being opened in one place with use of the chunk cache and in another without it. Given that FileHandle is used a lot, including in CNDB, I think it's better not to rely on this never being a legit use case (see the sketch below).
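A sketch of the instance-scoped alternative argued for here, built around the removed chunkCache.ifPresent line; the surrounding class and the invalidateFile(String) signature are hypothetical:

import java.util.Optional;

// Each handle remembers the cache it was built with (possibly none), so tidying
// only ever touches that instance, never a global singleton.
final class TidySketch
{
    private final Optional<ChunkCache> chunkCache;
    private final String fileName;

    TidySketch(Optional<ChunkCache> chunkCache, String fileName)
    {
        this.chunkCache = chunkCache;
        this.fileName = fileName;
    }

    void tidy()
    {
        // A second cache instance (e.g. CNDB's IndexChunkCache) is never touched by mistake.
        chunkCache.ifPresent(cache -> cache.invalidateFile(fileName));
    }
}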
This whole code is now dropped, because it is obliterating the effect of early open. Whatever solution we choose, it cannot be done during handle cleanup.
Needed to allow simple scanners to complete correctly
 */
public static void removeFileIdFromCache(File file)
{
    if (instance != null)
We may have multiple instances of the ChunkCache; in CNDB we have the IndexChunkCache.
I see that this method is used only in the SequentialWriter class; what about adding some comment or improving the javadocs?
I intend to move the invalidation to FileHandle.Builder, or move it back to the tidier.
This will take some time, though; I need to write new tests to make sure we are not releasing content we shouldn't.
The invalidation is now moved to the builder, using the supplied chunk cache instance.
Invalidation in the tidier does not work: even after changing the test to behave exactly as normal compaction, I see the cache being reliably dropped when it should not be.
@@ -124,6 +125,9 @@ private static FileChannel openChannel(File file)
    try { channel.close(); }
    catch (Throwable t2) { t.addSuppressed(t2); }
}

// Invalidate any cache entries that may exist for a previous file with the same name.
ChunkCache.removeFileIdFromCache(file);
In CNDB we use the ChunkCache also for SAI; should we add some specific handling?
cc @pcmanus
Moved to the file handle builder to use the supplied cache instance.
Change early open caching test to exercise the compaction code
// Invalidate the cache for any previous version of the file that may differ from the one we are opening.
// It is important to do this here (rather than e.g. in complete) to ensure that we don't invalidate when
// opening a file multiple times, e.g. when opening sstables early during compaction.
if (chunkCache != null)
I wonder if this is going to invalidate all of the "ChunkCache priming" that happens on CNDB during Preloading. Maybe we need some flag to disable this behavior on the Writers.
I don't think that's a problem for priming. The cndb chunk cache priming uses the existing FileHandle of the SSTableReader; it doesn't create a new one, so this method isn't involved.
If anything, what I might worry about is an SSTableReader being re-instantiated for some reason (despite the underlying file not having changed). I see no reason to do so, to be clear, but cndb sstable reloading code is not trivial and I could see some inefficiency like that easily go unnoticed. With that said, I don't think the code currently does this, so this "should" be fine (if not quite error-proof maybe).
…er's invalidateCache" This reverts commit ba1232f.
…ng SSTableReader release
Changed this again. It was too broad to work correctly. Pushed three more attempts at this:
The latter is the solution I prefer. Please let me know if any of the other options makes better sense to you.
✔️ Build ds-cassandra-pr-gate/PR-1495 approved by Butler
❌ Build ds-cassandra-pr-gate/PR-1495 rejected by Butler: 6 new test failure(s) in 9 builds. Found 6 new test failures.
Found 369 known test failures
Quality Gate passed
StorageProvider.instance.invalidateFileSystemCache(desc.fileFor(Component.DATA));
StorageProvider.instance.invalidateFileSystemCache(desc.fileFor(Component.ROW_INDEX));
StorageProvider.instance.invalidateFileSystemCache(desc.fileFor(Component.PARTITION_INDEX));
for (Component component : SSTable.discoverComponentsFor(desc))
In theory I like that option, but I'll note 2 things:
- discoverComponentsFor only includes components whose files still exist, but this invalidateFileSystemCache method runs after obsoletion.commit() in SSTableReader.GlobalTidy.tidy(), and the latter deletes the files, so I wonder if this couldn't be a problem in some cases (but I'm only so familiar with the whole tidying code, so maybe the obsoletion only does something in cases where nothing can be in caches?).
- discoverComponentsFor also has the misleading behavior of only including "hard-coded" components, meaning no "custom" ones and so none of the SAI files. To include SAI files we'd probably have to call SSTable.readTOC(desc, false), though that implies the TOC is still there, so the previous point is also a question here. I'll note that C* proper never puts SAI files into the chunk cache, and that we can override this method in CNDB, so I'm ok if we prefer to stick to hard-coded components here and leave the concern for SAI files to CNDB, but figured it was worth mentioning.
Changed the test to catch this problem.
Changed the order of cache invalidation and obsoletion to fix it (see the sketch below). Looking at the relevant code in CC and CNDB, there doesn't appear to be anything that depends on this order.
Kept the discovery's use of discoverComponentsFor for now, because that is what the obsoletion code does. How do SAI components get deleted?
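For clarity, the reordering amounts to something like the following (a simplified sketch based on the diff excerpt above; the surrounding tidy plumbing is elided):

// Invalidate file-system caches for every discovered component while the files
// still exist, and only then let obsoletion delete them; discoverComponentsFor
// would otherwise miss components whose files are already gone.
for (Component component : SSTable.discoverComponentsFor(desc))
    StorageProvider.instance.invalidateFileSystemCache(desc.fileFor(component));
if (obsoletion != null)
    obsoletion.commit();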
How do SAI components get deleted?
Honestly, I'm not sure. StorageAttachedIndexGroup.handleNotification has a bunch of code that runs when it's notified of the removal of an sstable, but looking at it right now, I'm not finding where it actually deletes the files in the case of, say, a compacted sstable. And SSTableIndex has a comment that says it happens in LogTransaction, but I'm not sure how, given the tidier...
@jasonstack do you know off the top of your head?
Improve WriteAndReadTest to catch this
What is the issue
Using buffers of different sizes in the chunk cache causes fragmentation, which in turn results in excessive memory use and a lack of pooling for a large fraction of the buffers used by the chunk cache.
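The single-size buffers this PR ports work around that: a chunk of arbitrary length is covered by one or more fixed-size regions (cf. the SingleRegionChunk/MultiRegionChunk excerpt earlier in the thread), so the pool only ever sees one buffer size and cannot fragment. A rough sketch with illustrative names and sizes:

import java.nio.ByteBuffer;

// Illustrative sketch, not the PR's code: cover chunkLength bytes with
// fixed-size regions so every buffer handed back to the pool is identical.
final class RegionSketch
{
    static final int REGION_SIZE = 64 * 1024; // illustrative; the real size is a pool parameter

    static ByteBuffer[] regionsFor(int chunkLength)
    {
        int count = (chunkLength + REGION_SIZE - 1) / REGION_SIZE; // ceil(chunkLength / REGION_SIZE)
        ByteBuffer[] regions = new ByteBuffer[count];
        for (int i = 0; i < count; i++)
            regions[i] = ByteBuffer.allocateDirect(REGION_SIZE); // stands in for a pooled allocation
        return regions;
    }
}

A single-region chunk is then just the count == 1 case, matching the SingleRegionChunk/MultiRegionChunk split in the excerpt.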
What does this PR fix and why was it fixed
Ports over single-size chunk cache buffers (DB-2904), caching memory addresses (parts of DB-2509) and file cache ids (DB-2489) from DSE.
This does not port any of the BufferPool refactoring in DSE. As C* already has distinct buffer pools for short- vs. longer-term buffers, we should already be receiving similar benefits.
The per-entry on-heap overhead of the chunk cache is reduced from ~350 bytes to ~220. As part of this reduction, the patch drops the collection of lists of keys per file and replaces it with the ability to drop a file id for invalidation, making a file's entries in the cache unreachable, to be reclaimed with some delay by the normal cache eviction (see the sketch below).
Before this patch the cache could use on-heap memory if that was the preference of the compressor in use (e.g. Deflate specifies an ON_HEAP preference). This is highly unexpected and put very low limits on the usable cache size. The cache is now changed to always store data off heap.
Also changes the source of some temporary buffers to the short-lived "networking" pool.
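As a rough illustration of the file-id scheme described above (all names here are hypothetical, not the PR's actual API): cache keys embed a (fileId, chunkPosition) pair instead of the file name, so invalidation is O(1). Forgetting the id makes every entry keyed with it unreachable, and normal eviction reclaims them later:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical file-id registry: no per-file list of cache keys is needed.
final class FileIdRegistry
{
    private final Map<String, Long> idByFileName = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    /** Current id for a file; a new file (or a reused name) gets a fresh id. */
    long idFor(String fileName)
    {
        return idByFileName.computeIfAbsent(fileName, n -> nextId.incrementAndGet());
    }

    /** Invalidate all of a file's chunks: entries keyed with the old id can
     *  never be hit again and age out through the cache's normal eviction. */
    void invalidate(String fileName)
    {
        idByFileName.remove(fileName);
    }
}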
Checklist before you submit for review
- Use NoSpamLogger for log lines that may appear frequently in the logs