Quantized GGUF style #1523

Merged: 10 commits merged into main from metal6 on Jan 17, 2024

Conversation

@Narsil (Collaborator) commented Jan 4, 2024

Working quantized state for candle.

High level overview:

  • Introduce QStorage to split Metal and Cpu ops (see the sketch after this list).
  • Quantize/dequantize still run on CPU even when asked to run on Metal. Kernels exist but led to different quantization/dequantization results; GGML removed those on-device kernels too (I guess because of the differences).
    I think we should keep the surface for on-Metal quantize/dequantize so we can easily implement them later. They are part of the ggml API imho.
  • Added a bunch of tests, using test_device! in order to get testing behavior similar to regular Tensor.
  • quantized.metal is a direct copy of ggml's ggml-metal.metal. This choice was made so further development can go faster and the bugs mentioned below can be imported more easily. All the glue logic is in candle_metal_kernels.
  • Introduces a new GgmlDType in candle_metal_kernels. Ggml uses different kernels based on the size of the matmul and hardware capacity. This wasn't implemented here, but could be with the current API.
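
To make the QStorage idea concrete, here is a minimal, self-contained sketch of the backend split. The type and method names are illustrative stand-ins, not necessarily the exact ones in this PR; the point is only that the Cpu/Metal dispatch sits above the block-quantization logic, so the existing CPU code can stay mostly untouched:

// Minimal sketch (illustrative types and names, not necessarily the exact ones in this PR).

/// Stand-in for the existing CPU-side quantized block buffer.
struct QCpuStorage {
    data: Vec<u8>,
}

/// Stand-in for a Metal-side quantized buffer; in the real code this would wrap a
/// metal::Buffer rather than a plain Vec.
struct QMetalStorage {
    buffer: Vec<u8>,
}

enum QStorage {
    Cpu(QCpuStorage),
    Metal(QMetalStorage),
}

impl QStorage {
    /// Ops dispatch on the backend variant; per the PR description, quantize/dequantize
    /// currently fall back to the CPU path even for Metal tensors.
    fn size_in_bytes(&self) -> usize {
        match self {
            QStorage::Cpu(s) => s.data.len(),
            QStorage::Metal(s) => s.buffer.len(),
        }
    }
}

fn main() {
    let q = QStorage::Cpu(QCpuStorage { data: vec![0u8; 144] });
    println!("{} bytes of quantized data", q.size_in_bytes());
}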

Worthy bugs already discovered (not fixed in this PR since they do not belong here):

  • Q2K Metal -> Bugged (also present in GGML).
  • Q4K CPU -> Bugged (present previously, new test catches it).
  • Q5K CPU -> Bugged (present previously).
  • Q8_1 Both -> Never really implemented, it seems.
  • Q8K Metal -> Never implemented in Metal.

@Narsil changed the title from "[WIP] Quantized GGUF style (mostly working, but still bogus output on real scenarios)." to "Quantized GGUF style (mostly working, but still bogus output on real scenarios)." on Jan 5, 2024
@Narsil requested a review from LaurentMazare on January 5, 2024 at 13:04
@LaurentMazare (Collaborator):

Thanks for making this PR. It's really large and very hard to review and it seems to contain a bunch of orthogonal things.
Could you start by making a minimal PR that adds a device to the quantized var builder and doesn't do much besides this? This sounds like the natural first step and should be a reasonably small thing to look at before making the rest of the changes.

@Narsil changed the title from "Quantized GGUF style (mostly working, but still bogus output on real scenarios)." to "Quantized GGUF style" on Jan 7, 2024
@Narsil force-pushed the metal6 branch 2 times, most recently from d68e82e to b89c02e on January 11, 2024 at 16:13
@@ -426,7 +426,9 @@ impl Tensor {
if buffer_size != shape.elem_count() {
return Err(Error::ShapeMismatch { buffer_size, shape }.bt());
}
// println!("from vec {buffer_size}");
Contributor:

Probably want these gone

Collaborator Author:

Thanks

@@ -1 +0,0 @@
pub const LAYERNORM_KERNELS: &str = include_str!(concat!(env!("OUT_DIR"), "/layernorm_kernels.ptx"));
Contributor:

Intended?

Collaborator Author:

Hmm, building without cuda will do that, it seems.

&filenames[0],
&device,
)?;
println!("Loaded vb");
Contributor:

Missed

let mut total_size_in_bytes = 0;
for (_, tensor) in model.tensors.iter() {
let elem_count = tensor.shape().elem_count();
total_size_in_bytes +=
elem_count * tensor.dtype().type_size() / tensor.dtype().blck_size();
elem_count * tensor.dtype().type_size() / tensor.dtype().block_size();
Contributor:

Could make this a method?

Collaborator Author:

It is a method; I don't get what you are implying here?

Contributor:

In this file we do elem_count * tensor.dtype().type_size() / tensor.dtype().block_size();
Could be a method: tensor.size_in_bytes

Collaborator Author:

Maybe, but let's not do this in this PR, I think, no?
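
For what it's worth, here is a self-contained sketch of the helper suggested above (hypothetical, not part of this PR); it just wraps the recurring elem_count * type_size / block_size expression:

// Hypothetical helper, not part of this PR: wraps the size computation used above.

#[derive(Clone, Copy)]
struct DTypeInfo {
    type_size: usize,  // bytes per block
    block_size: usize, // elements per block
}

struct QTensorInfo {
    elem_count: usize,
    dtype: DTypeInfo,
}

impl QTensorInfo {
    /// Total storage in bytes for a block-quantized tensor.
    fn size_in_bytes(&self) -> usize {
        self.elem_count * self.dtype.type_size / self.dtype.block_size
    }
}

fn main() {
    // Q4_0: blocks of 32 elements, 18 bytes per block (f16 scale + 16 bytes of nibbles).
    let t = QTensorInfo { elem_count: 4096, dtype: DTypeInfo { type_size: 18, block_size: 32 } };
    assert_eq!(t.size_in_bytes(), 4096 * 18 / 32);
}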

@@ -86,7 +86,6 @@ CAST(cast_i64_f32, cast_i64_f32_strided, int64_t, float)
#endif

#if defined(__HAVE_BFLOAT__)
#if __METAL_VERSION__ >= 310
Contributor:

I think we still need this

Collaborator Author:

No, because of __HAVE_BFLOAT__ above.

pub const INDEXING: &str = include_str!(concat!(env!("OUT_DIR"), "/indexing.ptx"));
pub const REDUCE: &str = include_str!(concat!(env!("OUT_DIR"), "/reduce.ptx"));
pub const TERNARY: &str = include_str!(concat!(env!("OUT_DIR"), "/ternary.ptx"));
pub const UNARY: &str = include_str!(concat!(env!("OUT_DIR"), "/unary.ptx"));
Contributor:

Intended?

Collaborator Author:

Oops, nice catch, I must have messed up the build.


#[derive(Debug, Clone, Copy)]
pub enum GgmlDType {
Q4_0,
Contributor:

https://github.com/rustformers/llm uses an FFI here to keep in line with ggml, something to consider for the future.

Collaborator Author:

Why would we need FFI? This is purely for internal use to call the correct kernel with the correct warps. (Each kernel is written with specific warps in mind.)
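
As a rough illustration of what "calling the correct kernel with the correct warps" means: the kernel names below follow ggml-metal.metal conventions, but the exact dispatch table and threadgroup shapes in candle_metal_kernels may differ, so treat this as a sketch only.

// Illustrative only: maps a dtype to a mat-vec kernel name and the
// threads-per-threadgroup it was written for.

#[derive(Debug, Clone, Copy)]
enum GgmlDType {
    Q4_0,
    Q4_1,
    Q8_0,
    F16,
}

/// Each ggml kernel assumes a specific SIMD-group layout, so the threadgroup
/// dimensions are tied to the kernel rather than chosen freely.
fn mul_mv_kernel(dtype: GgmlDType) -> (&'static str, (u64, u64, u64)) {
    match dtype {
        GgmlDType::Q4_0 => ("kernel_mul_mv_q4_0_f32", (8, 8, 1)),
        GgmlDType::Q4_1 => ("kernel_mul_mv_q4_1_f32", (8, 8, 1)),
        GgmlDType::Q8_0 => ("kernel_mul_mv_q8_0_f32", (8, 8, 1)),
        GgmlDType::F16 => ("kernel_mul_mv_f16_f32", (32, 1, 1)),
    }
}

fn main() {
    let (name, tptg) = mul_mv_kernel(GgmlDType::Q4_0);
    println!("dispatching {} with threads-per-threadgroup {:?}", name, tptg);
}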

r3
)
);
encoder.set_threadgroup_memory_length(0, 8192);
Contributor:

8192 is fine; we could introduce a version check for this in the future: https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf

Collaborator Author:

Yes, this is copy-pasted from ggml for now. And the kernels expect 8192.
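
A sketch of the kind of guard that version check could be (purely illustrative; the device limit would come from MTLDevice.maxThreadgroupMemoryLength, but here it is just a parameter so the snippet stays self-contained):

// Illustrative only: compare the threadgroup memory the copied ggml kernels expect
// against the device limit, instead of hard-coding 8192 and hoping it fits.

const KERNEL_THREADGROUP_BYTES: u64 = 8192; // what the copied ggml kernels expect

/// `device_limit` stands in for MTLDevice.maxThreadgroupMemoryLength.
fn threadgroup_length(device_limit: u64) -> Result<u64, String> {
    if KERNEL_THREADGROUP_BYTES <= device_limit {
        Ok(KERNEL_THREADGROUP_BYTES)
    } else {
        Err(format!(
            "kernel needs {} B of threadgroup memory, device allows {} B",
            KERNEL_THREADGROUP_BYTES, device_limit
        ))
    }
}

fn main() {
    // 32 KiB is the limit on recent Apple-family GPUs per the feature-set tables.
    assert_eq!(threadgroup_length(32 * 1024), Ok(8192));
}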

}
}

kernel void kernel_norm(
Collaborator Author:

This is not used :) If you're up for it, we could maybe discuss implementing something like this: https://github.com/huggingface/candle-rotary (but changing the implementation around so we could add a Metal backend to it).

@LaurentMazare (Collaborator):

Ping on the refactoring into a first, smaller PR as mentioned above. #1534 was actually doing something like this, but I'm not sure why it was closed.

@Narsil (Collaborator, Author) commented Jan 15, 2024

@LaurentMazare have you looked at the PR? Aside from the gigantic file (copy-pasted from GGML without any modification), the PR is actually quite small.

The only real logic is in candle_metal_kernels::call_ggml_gemm, and it's also relatively small (it just adapts the warps for the kernels as they are written in the GGML file).

@LaurentMazare (Collaborator):

I have looked at it indeed (and in a bit of detail), hence my suggestion to split the new kernels out from a simple PR that would just add the device as a parameter for the quantized var store; that would make it a lot easier to read. Also, there seem to be some other unrelated changes that I would prefer to see in isolation.

Nicolas Patry and others added 8 commits January 15, 2024 17:42
- Add a device param, wherever needed.
- Create new QMetal storage thing that implements QuantizedType.
- Update everywhere needed.

Fix Python.

Fixing examples.

Fix: fmt + clippy + stub.

Moving everything around.

Only missing the actual implems.

Fixing everything + adding dequantized kernels.

More work.

Fixing matmul.

Fmt + Clippy

Some clippy fixes.

Working state.

Q2K Metal -> Bugged (also present in GGML).
Q4K CPU -> Bugged (present previously, new test catch it).
Q5K CPU -> Bugged (present previously).
Q8_1 Both -> Never really implemented it seems
Q8K metal -> Never implemented in metal

Fixing Q2K bug (present in ggml).
@Narsil (Collaborator, Author) commented Jan 15, 2024

Here it is: https://github.com/huggingface/candle/pull/1594/files

I closed the previous "small" PR because I had to switch a lot of things around while actually implementing. Mostly because GgmlType (the blocks) made too many assumptions about CPU execution, so I had to move to QStorage and put the enum above the block logic, which lets the CPU code stay mostly as-is while enabling the GPU code to work.

That also means moving away from the generics calls (I benched on my computer; it doesn't seem to make much of a difference).

@Narsil (Collaborator, Author) commented Jan 17, 2024

Okay I'm going to go ahead and merge this.

It has been reviewed by @FL33TW00D. @LaurentMazare, all your comments on this PR and the smaller one are just about splitting PRs, not about some conceptual problem with it, so I'll assume everything is actually mostly OK and we can just fix whatever is wrong in other PRs if needed.

@Narsil merged commit 403680f into main on Jan 17, 2024
23 checks passed
@Narsil deleted the metal6 branch on January 17, 2024 at 09:27
@Narsil mentioned this pull request on Jan 17, 2024
@LaurentMazare (Collaborator):

Please don't merge things like this without addressing my review comments first.
