CNDB-9104: Port over chunk cache improvements from DSE #1495
base: main
Conversation
Using CachingRebuffererTest.calculateMemoryOverhead with 1.5M entries. Cache size set at 4096 MiB. Bytes on heap per entry: 320
Saves at least 40 bytes per cache entry (12.5%) and 20% of the insertion time. Bytes on heap per entry: 280
Bytes on heap per entry: 232
With fileIDs: this has no effect on the performance of the cache. Bytes on heap per entry: 222
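For context, a per-entry figure like the ones above can be obtained with a simple before/after heap measurement around bulk insertion. A minimal sketch, assuming a populateCache callback that fills the cache; the actual CachingRebuffererTest.calculateMemoryOverhead may work differently:

// Hedged sketch of a per-entry heap overhead measurement; illustrative only.
public final class HeapOverheadSketch
{
    static long usedHeap()
    {
        System.gc(); // best-effort; a real harness would settle over several GC cycles
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Prints bytes on heap per entry after running the given cache-population step. */
    public static void report(long entries, Runnable populateCache)
    {
        long before = usedHeap();
        populateCache.run(); // e.g. insert 1.5M chunk cache entries
        long after = usedHeap();
        System.out.printf("Bytes on heap per entry: %d%n", (after - before) / entries);
    }
}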
This reverts commit be317b2.
This reverts commit 1a15fa5.
Quality Gate failed: Failed conditions
…normally-allocated direct byte buffer.
if (buffers.length > 1)
    return new MultiRegionChunk(position, buffers);
else
    return new SingleRegionChunk(position, buffers[0]);
}
public ChunkCache(BufferPool pool, int cacheSizeInMB, Function<ChunkCache, ChunkCacheMetrics> createMetrics)
Nit: I personally find it a tad inconvenient to have the ctor so deep within the class; I'd prefer moving it to the beginning (not new to these changes admittedly, but there is a lot of code before this now).
The file is now rearranged.
@@ -124,6 +125,9 @@ private static FileChannel openChannel(File file)
    try { channel.close(); }
    catch (Throwable t2) { t.addSuppressed(t2); }
}

// Invalidate any cache entries that may exist for a previous file with the same name.
ChunkCache.instance.invalidateFile(file);
Most importantly, ChunkCache.instance can be null, which would break here.
Also, I don't have a super good alternative in mind at the moment, but it does feel rather fragile to me that we have to call this everywhere we write files that may be used by the chunk cache. If some code, especially in cndb, decided to write files by another means in some special case, this would be really easy to forget/get wrong (side note: I'm not saying it's the case right now, I don't think it is, but I'm also really not 100% sure).
Admittedly a half-baked idea, but couldn't we, say, shove the file modification time into the cache fileId so we don't need to do this?
This is a general problem for caches, isn't it? How does NSS deal with it?
For the cache to be efficient, all of this needs to be done only at the time the file is opened (even ignoring the time it takes to fetch the modification time, checking it fully defeats early open).
Currently we invalidate in two places:
- when we start writing to a file
- when we clean up a file handle
Because files enter Cassandra when we write them or on restart (which the chunk cache does not survive), the former is currently sufficient. Another alternative is to invalidate on creating a file handle, which is trickier because I am not sure how to make it preserve early caches. Actually, invalidation on dropping a file handle may also break that; I need to look further into it.
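As an illustration, the first of those two hooks boils down to something like this (a sketch using the ChunkCache.instance and invalidateFile(File) names visible in the diff excerpts; the method name is hypothetical):

// Sketch of "invalidate when we start writing": clear any entries left behind
// by a previous file with the same name before producing new content.
void onOpenForWrite(File file)
{
    if (ChunkCache.instance != null) // the cache can be disabled, leaving instance null
        ChunkCache.instance.invalidateFile(file);
}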
This is a general problem for caches, isn't it?
Is it? What bothers me here is that SequentialWriter, which is a fairly generic file-writing utility, has to worry about the chunk cache, which feels like a completely separate abstraction. It just feels really easy to mess up if every piece of code that writes files has to worry about the cache, is all.
How does NSS deal with it?
NSS assumes that the files for which it stores data are "forever" immutable, and so you never need to invalidate for correctness (only to reclaim space earlier, which NSS doesn't even bother with at the moment). That's why the whole "immutable SAI component" part was so important to NSS, in fact.
Which actually begs the question: do we really need this for the chunk cache in practice? Or is this more defensive/for specific tests that reuse files? Can't we just make the same assumption in the chunk cache that file names of stuff going to the chunk cache are never reused? Am I missing something here?
I'll also note here too that this code assumes that the static instance of the chunk cache is the only instance, which happens to not be true in CNDB today.
We ran into bugs caused by file reuse, usually during restore. I would not assume UUID naming completely resolves it. I don't know if it is still possible to hit with SAI index rebuild.
The alternative we are left with is to invalidate during FileHandle.Builder.complete, with some option to not do it, needed by sstable writers for early open. Making sure we don't invalidate anything else we need is not going to be fun.
We ran into bugs caused by file reuse, usually during restore. I would not assume UUID naming completely resolves it.
Let me pause on this because I don't follow. Are you saying that sstable names don't always happen to be unique, and that the same sstable name could be used by different sstables? That would obviously be a huge issue for CNDB if that were true though?
Yes, file names should be unique, but nothing guarantees they are, at the very least for on-prem deployments. We have hit problems with this multiple times (DB-2334, DB-2497, the SAI index rebuild issue on CNDB, and there was recently a related issue on the C* 5 line as well).
I would be very happy if we didn't have to do this.
I'm sorry to insist, but this seems important and I need to make sure I understand your points.
I think what I forgot about is that we used to identify sstables with just a "generation", and in that case sstable file names were indeed not globally and forever unique. Looking at DB-2334/DB-2497, those look pretty old, so I assume that might be what was going on there.
But with "modern" sstables, the UUID or ULID should guarantee that they are unique. Do we agree? Or are you saying you don't trust UUID/ULID to be genuinely unique?
And if the worry is old-style generation-based sstables, are we sure we still need to worry about them? Because I believe even those would only be a problem if you "restore" sstables into a live cluster, and is that even still supported for those old sstable formats?
Anyway, my point is that it seems we have to buy a bunch of complexity here for "file names that could be reused", and I'm not quite sure that's an actual possibility in modern Cassandra, so I wonder if it's worth the trouble is all.
But mostly, if we do protect against it, I just want us to be clear on why we do it. If it's for old generation-based files, then it's a bit sad we still have to worry about this, but so be it. If instead we're saying "this really shouldn't happen anymore and shouldn't be needed, but we've been bitten before and we'd feel better if we were defensive against future use cases", then I'm not personally convinced it's worth it, but I don't strongly object. But if we're instead saying "no, we could actually have problems with modern UUID/ULID-based sstables", then that's where I'm lost and need an explanation of what the problem is with those.
I can't prove that we won't run into a version of this issue even with UUID. Are we guaranteed, for one, that we can't have a situation like an index rebuild creating files with the same name? Are we sure that (some version of) repair can't insert a slightly different version of a file, for example if it streams new parts of a partially-streamed file after the previous part has been compacted?
We also have to consider HCD and the potential regression not handling this could cause to installations (with generation-based sstable names) that have had the invalidation in place.
Alright, fair enough. I was just hoping that maybe we could save ourselves some trouble. And while I'm not quite sure those are real problems at the moment, I get the point that this could at least easily become a problem someday. Plus, I don't quite know the answer regarding HCD and am happy to play it safe. So happy to handle this if we have a good solution.
@@ -238,7 +238,7 @@ public String name()

 public void tidy()
 {
-    chunkCache.ifPresent(cache -> cache.invalidateFile(name()));
+    ChunkCache.removeFileIdFromCache(channel.getFile());
I'd argue that not using the local chunkCache here is a regression, for 2 reasons:
- it assumes chunkCache.get() == ChunkCache.instance. But currently in CNDB we sometimes use a different chunk cache instance for some files, where this would be incorrect. Admittedly, the benefits of that separate instance are debatable, but it is used currently.
- it also kind of assumes that if the chunk cache is used, then this instance of FileHandle is meant to use it, which is technically not guaranteed. In theory, nothing quite forbids the same file from being opened in one place with use of the chunk cache and in another without it. Given that FileHandle is used a lot, including in CNDB, I think it's better not to rely on this never being a legit use case (see the sketch below).
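A sketch of the instance-scoped alternative argued for here, built around the removed chunkCache.ifPresent line; the surrounding class and the invalidateFile(String) signature are hypothetical:

import java.util.Optional;

// Each handle remembers the cache it was built with (possibly none), so tidying
// only ever touches that instance, never a global singleton.
final class TidySketch
{
    private final Optional<ChunkCache> chunkCache;
    private final String fileName;

    TidySketch(Optional<ChunkCache> chunkCache, String fileName)
    {
        this.chunkCache = chunkCache;
        this.fileName = fileName;
    }

    void tidy()
    {
        // A second cache instance (e.g. CNDB's IndexChunkCache) is never touched by mistake.
        chunkCache.ifPresent(cache -> cache.invalidateFile(fileName));
    }
}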
This whole code is now dropped, because it is obliterating the effect of early open. Whatever solution we choose, it cannot be done during handle cleanup.
Needed to allow simple scanners to complete correctly
 */
public static void removeFileIdFromCache(File file)
{
    if (instance != null)
We may have multiple instances of the ChunkCache; in CNDB we have the IndexChunkCache.
I see that this method is used only in the SequentialWriter class; what about adding some comment or improving the javadocs?
I intend to move the invalidation to FileHandle.Builder, or move it back to the tidier.
This will take some time, though; I need to write new tests to make sure we are not releasing content we shouldn't.
The invalidation is now moved to the builder, using the supplied chunk cache instance.
Invalidation in the tidier does not work: even after changing the test to behave exactly as normal compaction, I see the cache being reliably dropped when it should not be.
@@ -124,6 +125,9 @@ private static FileChannel openChannel(File file)
    try { channel.close(); }
    catch (Throwable t2) { t.addSuppressed(t2); }
}

// Invalidate any cache entries that may exist for a previous file with the same name.
ChunkCache.removeFileIdFromCache(file);
In CNDB we use the ChunkCache also for SAI; should we add some specific handling?
cc @pcmanus
Moved to the file handle builder to use the supplied cache instance.
Change early open caching test to exercise the compaction code
// Invalidate the cache for any previous version of the file that may differ from the one we are opening.
// It is important to do this here (rather than e.g. in complete) to ensure that we don't invalidate when
// opening a file multiple times, e.g. when opening sstables early during compaction.
if (chunkCache != null)
I wonder if this is going to invalidate all of the "ChunkCache priming" that happens on CNDB during Preloading. Maybe we need some flag to disable this behavior on the Writers.
I don't think that's a problem for priming. The cndb chunk cache priming uses the existing FileHandle of the SSTableReader; it doesn't create a new one, so this method isn't involved.
If anything, what I might worry about is an SSTableReader being re-instantiated for some reason (despite the underlying file not having changed). I see no reason to do so, to be clear, but cndb sstable reloading code is not trivial and I could see some inefficiency like that easily go unnoticed. With that said, I don't think the code currently does this, so this "should" be fine (if not quite error-proof maybe).
…er's invalidateCache" This reverts commit ba1232f.
…ng SSTableReader release
Changed this again. It was too broad to work correctly. Pushed three more attempts at this:
The latter is the solution I prefer. Please let me know if any of the other options makes better sense to you.
✔️ Build ds-cassandra-pr-gate/PR-1495 approved by Butler
❌ Build ds-cassandra-pr-gate/PR-1495 rejected by Butler: 6 new test failure(s) in 9 builds. Found 6 new test failures.
Found 369 known test failures
Quality Gate passed
StorageProvider.instance.invalidateFileSystemCache(desc.fileFor(Component.DATA));
StorageProvider.instance.invalidateFileSystemCache(desc.fileFor(Component.ROW_INDEX));
StorageProvider.instance.invalidateFileSystemCache(desc.fileFor(Component.PARTITION_INDEX));
for (Component component : SSTable.discoverComponentsFor(desc))
In theory I like that option, but I'll note 2 things:
- discoverComponentsFor only includes components whose files still exist, but this invalidateFileSystemCache method runs after obsoletion.commit() in SSTableReader.GlobalTidy.tidy(), and the latter deletes the files, so I wonder if this couldn't be a problem in some cases (but I'm only so familiar with the whole tidying code, so maybe the obsoletion only does something in cases where nothing can be in caches?).
- discoverComponentsFor also has the misleading behavior of only including "hard-coded" components, meaning no "custom" ones and so none of the SAI files. To include SAI files we'd probably have to call SSTable.readTOC(desc, false), though that implies the TOC is still there, so the previous point is also a question here. I'll note that C* proper never puts SAI files into the chunk cache, and that we can override this method in CNDB, so I'm ok if we prefer to stick to hard-coded components here and leave the concern for SAI files to CNDB, but figured it was worth mentioning.
Changed the test to catch this problem.
Changed the order of cache invalidation and obsoletion to fix it (see the sketch below). Looking at the relevant code in CC and CNDB, there doesn't appear to be anything that depends on this order.
Kept the discovery's use of discoverComponentsFor for now, because that is what the obsoletion code does. How do SAI components get deleted?
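For clarity, the reordering amounts to something like the following (a simplified sketch based on the diff excerpt above; the surrounding tidy plumbing is elided):

// Invalidate file-system caches for every discovered component while the files
// still exist, and only then let obsoletion delete them; discoverComponentsFor
// would otherwise miss components whose files are already gone.
for (Component component : SSTable.discoverComponentsFor(desc))
    StorageProvider.instance.invalidateFileSystemCache(desc.fileFor(component));
if (obsoletion != null)
    obsoletion.commit();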
How do SAI components get deleted?
Honestly, I'm not sure. StorageAttachedIndexGroup.handleNotification has a bunch of code that runs when it's notified of the removal of an sstable, but looking at it right now, I'm not finding where it actually deletes the files in the case of, say, a compacted sstable. And SSTableIndex has a comment that says it happens in LogTransaction, but I'm not sure how, given the tidier...
@jasonstack do you know off the top of your head?
Improve WriteAndReadTest to catch this
What is the issue
Using buffers of different sizes in the chunk cache causes fragmentation, which in turn results in excessive memory use and a lack of pooling for a large fraction of the buffers used by the chunk cache.
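The single-size buffers this PR ports work around that: a chunk of arbitrary length is covered by one or more fixed-size regions (cf. the SingleRegionChunk/MultiRegionChunk excerpt earlier in the thread), so the pool only ever sees one buffer size and cannot fragment. A rough sketch with illustrative names and sizes:

import java.nio.ByteBuffer;

// Illustrative sketch, not the PR's code: cover chunkLength bytes with
// fixed-size regions so every buffer handed back to the pool is identical.
final class RegionSketch
{
    static final int REGION_SIZE = 64 * 1024; // illustrative; the real size is a pool parameter

    static ByteBuffer[] regionsFor(int chunkLength)
    {
        int count = (chunkLength + REGION_SIZE - 1) / REGION_SIZE; // ceil(chunkLength / REGION_SIZE)
        ByteBuffer[] regions = new ByteBuffer[count];
        for (int i = 0; i < count; i++)
            regions[i] = ByteBuffer.allocateDirect(REGION_SIZE); // stands in for a pooled allocation
        return regions;
    }
}

A single-region chunk is then just the count == 1 case, matching the SingleRegionChunk/MultiRegionChunk split in the excerpt.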
What does this PR fix and why was it fixed
Ports over single-size chunk cache buffers (DB-2904), caching memory addresses (parts of DB-2509) and file cache ids (DB-2489) from DSE.
This does not port any of the BufferPool refactoring in DSE. As C* already has distinct buffer pools for short- vs. longer-term buffers, we should already be receiving similar benefits.
The per-entry on-heap overhead of the chunk cache is reduced from ~350 bytes to ~220. As part of this reduction, the patch drops the collection of lists of keys per file and replaces it with the ability to drop a file id for invalidation, making a file's entries in the cache unreachable, to be reclaimed with some delay by the normal cache eviction (see the sketch below).
Before this patch the cache could use on-heap memory if that was the preference of the compressor in use (e.g. Deflate specifies an ON_HEAP preference). This is highly unexpected and put very low limits on the usable cache size. The cache is now changed to always store data off heap.
Also changes the source of some temporary buffers to the short-lived "networking" pool.
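As a rough illustration of the file-id scheme described above (all names here are hypothetical, not the PR's actual API): cache keys embed a (fileId, chunkPosition) pair instead of the file name, so invalidation is O(1). Forgetting the id makes every entry keyed with it unreachable, and normal eviction reclaims them later:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical file-id registry: no per-file list of cache keys is needed.
final class FileIdRegistry
{
    private final Map<String, Long> idByFileName = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    /** Current id for a file; a new file (or a reused name) gets a fresh id. */
    long idFor(String fileName)
    {
        return idByFileName.computeIfAbsent(fileName, n -> nextId.incrementAndGet());
    }

    /** Invalidate all of a file's chunks: entries keyed with the old id can
     *  never be hit again and age out through the cache's normal eviction. */
    void invalidate(String fileName)
    {
        idByFileName.remove(fileName);
    }
}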
Checklist before you submit for review
- Use NoSpamLogger for log lines that may appear frequently in the logs