I recall some optimization advice to choose a leading dimension that is NOT a power of two, in order to avoid cache set-associativity conflicts. I guess it's highly hardware-dependent (in particular, that advice was for CPUs, not GPUs).
Many Intel CPUs have caches whose number of ways is not a power of two but contains a factor of 3 or 5, corresponding to cache sizes like 2.5 MB, 3 MB, 24 MB or 36 MB.
Regardless of the cache associativity, the leading dimension must always correspond to a size that is a multiple of the cache line size (64 bytes for most CPUs), and the matrix must be aligned to a cache line boundary.
With that condition fulfilled, the total size of a row or of a column (depending on whether the matrix uses C order or Fortran order) may or may not be optimal as a power of two; that is best determined experimentally on the target CPUs or GPUs.
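For concreteness, here is a minimal C sketch of that padding, assuming a 64-byte line (true of most current x86 parts) and row-major storage; alloc_matrix and its parameters are illustrative names, not any library's API:

    #include <stdlib.h>

    #define CACHE_LINE 64  /* assumed line size; 64 bytes on most CPUs */

    /* Allocate a rows x cols float matrix (row-major) whose leading
       dimension is rounded up to a whole number of cache lines and whose
       base address is cache-line aligned. Element (i, j) lives at
       m[i * ld + j]. */
    static float *alloc_matrix(size_t rows, size_t cols, size_t *ld_out) {
        size_t per_line = CACHE_LINE / sizeof(float);            /* 16 */
        size_t ld = (cols + per_line - 1) / per_line * per_line;
        /* aligned_alloc (C11) requires the size to be a multiple of the
           alignment; rows * ld * sizeof(float) is a multiple of 64 by
           construction. */
        float *m = aligned_alloc(CACHE_LINE, rows * ld * sizeof(float));
        if (m) *ld_out = ld;
        return m;
    }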
DRAM page size, surely? All those cheap CASes?
Unlike the cache line size, which is known, you cannot predict how the DRAM address bits are mapped to CPU or GPU addresses, nor what the page size of the DIMMs that happen to be installed is.
Matching the DRAM characteristics influences the performance of a linear algebra algorithm much less than matching the cache line size and the sizes of the L1 and L2 caches.
If you really want to tune an algorithm for the maximum performance achievable on a particular computer, you must use an automatic tuning method, which runs benchmarks of that algorithm, varying all matrix layout parameters until it finds the optimum values. An example of how this is done is provided by the ATLAS library, which implements the BLAS API.
Such a library, tuned for a given computer, should then be used only on that computer.
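As a toy illustration of that empirical approach (not ATLAS's actual search, which explores far more parameters), here's a C sketch that times one blocked matmul per candidate tile size and keeps the fastest; the problem size and candidate list are arbitrary assumptions:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 512  /* problem size used for the tuning runs (arbitrary) */

    /* A plain blocked matmul; "bs" is the tile size being tuned. */
    static void blocked_matmul(const float *a, const float *b, float *c,
                               int n, int bs) {
        for (int ii = 0; ii < n; ii += bs)
          for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
              for (int i = ii; i < ii + bs && i < n; i++)
                for (int k = kk; k < kk + bs && k < n; k++) {
                  float aik = a[i * n + k];
                  for (int j = jj; j < jj + bs && j < n; j++)
                    c[i * n + j] += aik * b[k * n + j];
                }
    }

    int main(void) {
        float *a = calloc((size_t)N * N, sizeof *a);
        float *b = calloc((size_t)N * N, sizeof *b);
        float *c = calloc((size_t)N * N, sizeof *c);
        int candidates[] = {16, 32, 48, 64, 96, 128};  /* tiles to try */
        int best_bs = 0;
        double best_t = 1e30;
        for (size_t i = 0; i < sizeof candidates / sizeof *candidates; i++) {
            clock_t t0 = clock();
            blocked_matmul(a, b, c, N, candidates[i]);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("bs=%3d  %.3f s\n", candidates[i], t);
            if (t < best_t) { best_t = t; best_bs = candidates[i]; }
        }
        printf("best tile size on this machine: %d\n", best_bs);
        free(a); free(b); free(c);
        return 0;
    }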
Yes... and no: you can inspect and measure the SDRAM components at runtime to determine how object sizes are best allocated. That's kind of what I was getting at, and what I've spent the last month implementing ;)
That only makes sense if you know for sure your application is running on a specific architecture. Otherwise, it's a highly specific optimization that is bound to violate another architecture's design, and it would also be extremely challenging to debug.
Great post! I appreciate the diagrams.
TL;DR: make sure your matrix dimensions are divisible by 2 often.
Well, that'll help with a lot :) But dealing with wave quantization requires dimensions that aren't necessarily a multiple of 2, and often are a multiple of the number of SMs on a GPU (e.g. 132 on an H100).
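To see the effect numerically, here's a small C sketch of the wave arithmetic, assuming 128x128 output tiles and the 132 SMs mentioned above (both just illustrative; real kernels pick tile shapes per problem):

    #include <stdio.h>

    /* Back-of-envelope wave-quantization check: given the number of output
       tiles and the SM count, how full is the last wave? */
    int main(void) {
        int sms = 132, tile = 128;
        int m = 4096, n = 4224;  /* example GEMM output shapes */
        for (int nn = n - tile; nn <= n; nn += tile) {
            int tiles = ((m + tile - 1) / tile) * ((nn + tile - 1) / tile);
            int waves = (tiles + sms - 1) / sms;
            printf("%d x %d -> %d tiles, %d waves, last wave %d%% full\n",
                   m, nn, tiles, waves,
                   100 * (tiles - (waves - 1) * sms) / sms);
        }
        return 0;
    }

With these numbers, 4096 x 4096 leaves the final wave only 75% occupied, while padding n to 4224 fills all 8 waves exactly.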
I have always done that instinctively, even when the system makes no formal demand for it. Every time I have implemented matmuls, at any level, with any optimization requirement, partitioning into dyadic blocks has always sped things up. So I try to feed the system the nicest numbers I can muster!
D&C approaches are applicable to lots of problems and, as you’ve noticed, tend to do well with “round” (in binary) numbers.
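For anyone who hasn't seen it, here's a minimal C sketch of that dyadic partitioning, assuming n is a power of two and an arbitrary 64-element cutoff for the base case:

    /* Recursively split an n x n matmul into four n/2 x n/2 quadrants
       until the block is small enough to handle with a plain triple loop.
       Matrices are row-major with leading dimension ld; C accumulates, so
       each quadrant of C gets two recursive calls. */
    static void dc_matmul(const float *a, const float *b, float *c,
                          int n, int ld) {
        if (n <= 64) {  /* base case cutoff is an arbitrary assumption */
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++)
                    for (int j = 0; j < n; j++)
                        c[i * ld + j] += a[i * ld + k] * b[k * ld + j];
            return;
        }
        int h = n / 2;
        /* C11 = A11*B11 + A12*B21, and similarly for the other quadrants */
        dc_matmul(a,              b,              c,              h, ld);
        dc_matmul(a + h,          b + h * ld,     c,              h, ld);
        dc_matmul(a,              b + h,          c + h,          h, ld);
        dc_matmul(a + h,          b + h * ld + h, c + h,          h, ld);
        dc_matmul(a + h * ld,     b,              c + h * ld,     h, ld);
        dc_matmul(a + h * ld + h, b + h * ld,     c + h * ld,     h, ld);
        dc_matmul(a + h * ld,     b + h,          c + h * ld + h, h, ld);
        dc_matmul(a + h * ld + h, b + h * ld + h, c + h * ld + h, h, ld);
    }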
And if you can't make all of them divisible by 2 often, at least make the inner dimensions of your matmuls so (if possible).