-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Performance of reduce(hcat, ::ColVecs)
and reduce(vcat, ::ColVecs)
by specialization
#510
base: master
Are you sure you want to change the base?
Improve Performance of reduce(hcat, ::ColVecs)
and reduce(vcat, ::ColVecs)
by specialization
#510
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master JuliaLang/julia#510 +/- ##
==========================================
- Coverage 94.16% 90.45% -3.71%
==========================================
Files 52 52
Lines 1387 1393 +6
==========================================
- Hits 1306 1260 -46
- Misses 81 133 +52
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please bump the patch version, then we can merge when CI passes
I have no idea why the tests would fail - something about constant allocation heuristics in Zygote? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think similar definitions should be added in this PR for RowVecs
for consistency.
Is it clear whether they're happening in areas where the modifications introduced here cause problems, or do they appear completely spurious? |
Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
they definitely appear completely spurious. The area seems completely unrelated. But I also don't understand the code in that area to be honest |
If you A = [ a11 ... a1n;
...
am1 ... amn]
RowVecs(A) == [ [a11 ... a1n], ... [am1, ... amn] ] So reduce(hcat, RowVecs(A)) == [
a11 ... am1
...
a1n ... 1mn
] # == A' But if you have to transpose That is why I did not add it (because I don't know if that would actually result in a performance benefit. But if you believe this is worth the additional lines of code, I can add these definitions for RowVecs :) |
It's defined here: https://github.com/JuliaLang/julia/blob/310f59019856749fb85bc56a1e3c2e0592a134ad/base/abstractarray.jl#L1638-L1701
Many stats packages prefer this format though, I assume at least partly due to consistency with R, Python, and dataframes/datasets in general. Often it also doesn't matter what input format users use - internally you can always use the most performant algorithm. |
Just from looking at the failing tests and logs, I would assume that the test failures are real: They compare the allocations in the pullbacks of transforms that IMO are likely to involve This could be another argument for optimizing |
How do you guys do benchmarks locally? If I want to use |
I often use temporary environments. Changing Project.toml and/or Manifest.toml for benchmarking is not a good idea IMO. |
julia> @benchmark invoke(reduce, Tuple{typeof(hcat), AbstractVector}, hcat, ColVecs($D))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 10.375 μs … 10.470 ms ┊ GC (min … max): 0.00% … 99.78%
Time (median): 18.125 μs ┊ GC (median): 0.00%
Time (mean ± σ): 29.337 μs ± 269.440 μs ┊ GC (mean ± σ): 33.74% ± 3.89%
▁ ▂▄▅▄▆▆▇█▆▆▄▂▁
▃▆███▇▆▆▇████████████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▂▂ ▄
10.4 μs Histogram: frequency by time 38.9 μs <
Memory estimate: 86.12 KiB, allocs estimate: 198.
julia> @benchmark invoke(reduce, Tuple{typeof(hcat), ColVecs}, hcat, ColVecs($D))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 2.208 ns … 23.917 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.375 ns ┊ GC (median): 0.00%
Time (mean ± σ): 2.388 ns ± 0.593 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▂▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁█▅▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂ ▂
2.21 ns Histogram: frequency by time 2.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark invoke(reduce, Tuple{typeof(hcat), AbstractVector}, hcat, RowVecs($D))
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
Range (min … max): 7.073 μs … 1.815 ms ┊ GC (min … max): 0.00% … 95.50%
Time (median): 11.677 μs ┊ GC (median): 0.00%
Time (mean ± σ): 16.796 μs ± 73.342 μs ┊ GC (mean ± σ): 27.95% ± 6.60%
▁ ▄▄▁▂▅█▅▂
▆██▅▄▄██████████▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂ ▃
7.07 μs Histogram: frequency by time 31.9 μs <
Memory estimate: 57.86 KiB, allocs estimate: 128.
julia> @benchmark invoke(reduce, Tuple{typeof(hcat), RowVecs}, hcat, RowVecs($D))
BenchmarkTools.Trial: 10000 samples with 247 evaluations.
Range (min … max): 302.126 ns … 121.268 μs ┊ GC (min … max): 0.00% … 99.45%
Time (median): 679.486 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.117 μs ± 2.958 μs ┊ GC (mean ± σ): 34.72% ± 14.37%
█
▅█▆▂▂▂▂▂▂▁▂▂▂▂▂▂▁▁▂▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▁▂▂▁▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
302 ns Histogram: frequency by time 16.1 μs <
Memory estimate: 4.81 KiB, allocs estimate: 1. Interestingly it does help to implement the RowVersion, but it is still slower than the col version which just does nothing |
Well, that's exactly what I expected, so I'm not surprised. (BTW I would directly benchmark the functions of interest to avoid measuring |
To measure the performance of the fallback I would need to use |
You could just run the same benchmark on the master branch and on the PR. |
That's also safer in so far as you don't have to reason about which methods are called - maybe there's some specialization for |
So if I understand correctly here the number of allocations are counted: kernelmatrix (binary): Test Failed at /home/runner/work/KernelFunctions.jl/KernelFunctions.jl/test/test_utils.jl:387
Expression: pb[1] == pb[2]
Evaluated: 504 == 503 and since |
new julia> @benchmark reduce(hcat, ColVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 2.125 ns … 81.958 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.292 ns ┊ GC (median): 0.00%
Time (mean ± σ): 2.343 ns ± 0.997 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃ █ ▁ ▁
▂▁▁▁▁▁▂▁▁▁▁▁▃▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▂ ▂
2.12 ns Histogram: frequency by time 2.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark reduce(hcat, RowVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 932 evaluations.
Range (min … max): 125.850 ns … 5.302 μs ┊ GC (min … max): 0.00% … 96.04%
Time (median): 142.212 ns ┊ GC (median): 0.00%
Time (mean ± σ): 201.706 ns ± 243.362 ns ┊ GC (mean ± σ): 19.27% ± 14.85%
█▅▃▁▄▄▃ ▁ ▁
██████████▆▅▅▅▃▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▆▇████ █
126 ns Histogram: log(frequency) by time 1.41 μs <
Memory estimate: 1.77 KiB, allocs estimate: 1. vs old julia> @benchmark reduce(hcat, ColVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 258 evaluations.
Range (min … max): 296.349 ns … 5.797 μs ┊ GC (min … max): 0.00% … 92.52%
Time (median): 315.244 ns ┊ GC (median): 0.00%
Time (mean ± σ): 333.317 ns ± 142.009 ns ┊ GC (mean ± σ): 3.90% ± 7.70%
█▇▅▁ ▁
████▇██▆▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃ █
296 ns Histogram: log(frequency) by time 1.5 μs <
Memory estimate: 1.77 KiB, allocs estimate: 1.
julia> @benchmark reduce(hcat, RowVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 343 evaluations.
Range (min … max): 259.233 ns … 5.354 μs ┊ GC (min … max): 0.00% … 92.93%
Time (median): 276.239 ns ┊ GC (median): 0.00%
Time (mean ± σ): 297.589 ns ± 161.529 ns ┊ GC (mean ± σ): 5.91% ± 9.18%
█▅▁ ▁
███▇▆▅▇█▆▅▃▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▆▆ █
259 ns Histogram: log(frequency) by time 1.58 μs <
Memory estimate: 1.77 KiB, allocs estimate: 1. |
Summary
The idea is that
should in principle be a no-op.
Proposed changes
Specialization of
reduce
ontypeof(hcat)
andtypeof(vcat)
.The drawback is more code. But I added test coverage and it was something very easy to do 🤷 . Whether you want this or not is up to you.