I wouldn't touch nvidia "goodies" with a barge pole. A perfectly good 5-year-old GPU got deprecated upon upgrading to Ubuntu 22.04 because of CUDA/drivers. The lock-in trap of the century.
The real issue is having to keep multiple copies of CUDA on a system to satisfy all the different moving parts that each want a different CUDA and cuDNN version. Those CUDA installers then fight over and uninstall each other's nvidia driver, because CUDA is shipped with the nvidia driver bundled and listed as an apt dependency.
My system changes from nvidia-525 to nvidia-535 to nvidia-520 to nvidia-515 on a daily basis because I need to reinstall a different CUDA version just to try some new paper's code.
PyTorch did it right: it now ships with its own CUDA and doesn't give a shit about what version you have in /usr/local. Everything else should do the same.
The solution is to install the latest CUDA driver (which is always 100% backward compatible) but keep multiple versions of the libraries in /usr/local. Granted, it can get messy with various package managers doing their own things.
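To make the driver-vs-runtime distinction concrete, here's a minimal sketch (my own, not from the thread) that prints which CUDA runtime a binary was linked against versus what the installed driver supports; a newer driver can run binaries built against older runtimes, which is why one up-to-date driver plus several toolkits in /usr/local can coexist:

```cuda
// Minimal sketch: compare the driver's supported CUDA version with the
// runtime version this binary was linked against.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVer);  // CUDA runtime this binary was built/linked against
    std::printf("driver supports CUDA %d, runtime is CUDA %d\n", driverVer, runtimeVer);
    return 0;
}
```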
Usually the CUDA versions end up uninstalling other CUDA versions and drivers due to package manager issues.
I wish nvidia would just release a `sudo apt install cuda-all` that keeps ALL possible CUDA versions installed and up to date simultaneously. I know it would be 20GB, but that's fine; it's a drop in the bucket compared to the datasets I play with.
But containers, X forwarding, visualization, and OpenGL are a complete shitshow. I have apps that work fine outside of a container but, when run from within a container, freeze the UI so badly that the mouse pointer won't even work.
> now ships with its own CUDA and doesn't give a shit about what version you have
Bundling a system dependency in Python software is always a giant shit show. I would not hold that up as "everything else should do the same".
Any other Python library (CuPy, say) that also ships its own CUDA, at a different version, will happily segfault because you end up with duplicated symbols in a single process.
What you just quoted is more or less what GNU glibc does with symbol versioning, which is another example of a giant shit show and an ABI nightmare.
And:
(1) It is a solution only Nvidia can provide: CUDA is a binary. And they will certainly not do that just to please the Python community.
(2) It just creates another set of problems, all because Python packaging sucks in the first place.
Hm no, what you actually have is that cuda-libraries-11-8 depends on cuda-drivers, and cuda-drivers depends on cuda-drivers-520 = 520.61.05-1, where the version number is repeated in the package name. That is a really bad pattern for things intended to be backward compatible: multiple CUDA installs will overwrite "cuda-drivers", and each one is pinned to a specific "cuda-drivers-X" where X is part of the name. This results in some really weird dependency hells if you try to install more than one CUDA version at the same time, depending on the order you install them.
Will you publish benchmarks for, e.g., a K80? Or provide a way for users to contribute them? It's really handy to know, comparable to how ResNet50 inference numbers get reported across a bunch of architectures.
Well, the root README.md has some "benchmarks" at the very top. Maybe there's no existing benchmarks doc with more details? Or even a repo behind what's in the README? The point is: if NumPy takes 5 s on a CPU but a K80 takes only 100 ms, then the K80's cost might be worth it instead of springing for the A100 at 3 ms. Similar argument for Jetson.
We have benchmarks in the benchmarks directory, but these are for things like convolution, matrix multiplies, etc. It's not for running a traditional benchmark suite like ResNet.
Like most benchmarks it really depends on what you want to do, and since it's a general library everyone might care about different things.
I saw those; yes, having the code is great, but I'm more interested in the actual numbers. E.g. what does this sample do on an A100, P100, T4, K80, or Jetson Nano? The analogue is how things like ResNet50 get tested and reported, so you know whether ResNet50 might work for your budget/hardware.
Versus the advert in the root README, which is impressive but gives no data on the Pareto front.
Are you talking about for desktop/GUI use? A lot of older Nvidia hardware has been implicitly depreciated for years. Without newer patches to help Wayland support, your options are either to use nouveau or stick with unsupported software and official drivers. It's ugly, but Microsoft and Apple pretty much depreciate hardware the same way; first with 'recommended' cutoff points, then hard depreciations. On Linux it's less spelled-out, but things seem to be Wayland-or-bust now.
*deprecate, or "EOL" (for continuing support). Depreciate is a tax and accounting concept, though at the end of a MACRS depreciation schedule office equipment is often immediately replaced since it's no longer contributing to tax write-offs. (Really, it's just not replaced as long as it continues to contribute to tax deductions.)
Surprisingly this repo is BSD licensed so it might even outgrow nvidia. Eigen is pretty strong but this new MatX syntax might catch on, especially with easy GPU integration.
Are they even comparing apples to apples to claim that they see these improvements over NumPy?
> While the code complexity and length are roughly the same, the MatX version shows a 2100x [speedup] over the Numpy version, and over 4x faster than the CuPy version on the same GPU.
NumPy doesn't use the GPU by default unless you use something like JAX [1] to compile NumPy code to run on GPUs. I think a more honest comparison would mainly run MatX on the same CPU as NumPy, and for the GPU comparison focus on CuPy.
This seems like an FFT-based benchmark... I believe NumPy, for licensing reasons, includes a port of the old FFTPACK, which is only single-threaded. (Py)FFTW or others are surely much faster.
So the test seems to compare one thread of a CPU launched in 2016, running a port of 40-year-old Fortran code, to an optimised FFT library on a GPU launched in 2020.
The sample is really designed to show the simplicity of the syntax, and the performance is just a side effect. Where you'll see a bigger performance difference with NumPy/CuPy is when kernel fusion happens: MatX is typically able to fuse many things into a single kernel at compile time, whereas CuPy launches many kernels. If you have a specific type of expression you'd like to compare, please let us know.
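As a rough illustration of what that fusion looks like at the source level (a sketch based on the README-style syntax; the tensor names are made up and the exact API may differ):

```cuda
#include <matx.h>
using namespace matx;

int main() {
    // Illustrative only: the whole right-hand side below is a lazily built
    // expression template, so the multiply, add, and exp can be emitted as a
    // single GPU kernel, whereas an eager NumPy/CuPy version would typically
    // launch one kernel per operation.
    auto a = make_tensor<float>({1024});
    auto b = make_tensor<float>({1024});
    auto c = make_tensor<float>({1024});

    (c = exp(a * b + 1.0f)).run();  // one fused kernel on the default CUDA stream
    cudaDeviceSynchronize();
    return 0;
}
```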
Not really, for GPU computation, PyTorch, TensorFlow or JAX are the baselines (they are not just for ML stuff). Even for CPU, that's a more relevant comparison.
Hi, I addressed the comparisons in other comments, but in general this is for C++ users and not Python. It's more of a comparison to NumPy/CuPy, and we do have a table showing the comparison in the docs.
Actually just started looking into MatX yesterday to accelerate our radar pipeline. Really interesting to see that this use-case is heavily featured in the documentation.
Is the UCLA/Nvidia/Raytheon collaboration (as presented in a recent GTC talk) a major force behind the development of MatX?
Hi, yes, the original development was started for radar users who did not know CUDA but needed to write in c++. Many of our examples and code are radar related for that reason.
Thanks, looks really interesting. Do functions like matmul support inputs of differing type, like say A=int8 and B=float? Would be nice if you could get memory efficient quantized matmul with operator fusion.
CUTLASS, which is NVIDIA's C++ template library for writing matrix multiply and convolution kernels parametrized over input/output types, operators, and algorithm block sizes, theoretically supports this. But each input of the (k, n)-shaped matrix B will be read from global memory ceil(n / block dimension) times in an algorithm that computes one (block dimension, block dimension) submatrix of the output matrix D per thread block. It will probably be more efficient to cast your B matrix down to FP16 or INT8 in a preprocessing kernel to reduce memory traffic in the matrix multiply kernel (a rough sketch of such a cast kernel is below).
On newer GPUs, though, we have this huge L2 cache which makes the calculus a little different if your working set fits into it. e.g. Ampere A100 has 40MB L2$.
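To make the "cast once, then run the GEMM in lower precision" idea concrete, here is a minimal CUDA sketch of such a preprocessing kernel (my own illustration, not CUTLASS code; the kernel and pointer names are hypothetical):

```cuda
// Hypothetical preprocessing kernel: cast a float B matrix to FP16 once, so
// the subsequent GEMM reads the half-sized copy instead of the float matrix.
#include <cuda_fp16.h>

__global__ void cast_to_half(const float* __restrict__ in,
                             __half* __restrict__ out,
                             size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __float2half(in[i]);  // one read of B in float, one half-sized write
    }
}

// Usage sketch (d_B_f32 and d_B_f16 assumed already allocated on the device):
//   size_t elems = (size_t)k * n;
//   cast_to_half<<<(elems + 255) / 256, 256>>>(d_B_f32, d_B_f16, elems);
//   // ...then pass d_B_f16 to the FP16 GEMM (CUTLASS, cuBLASLt, etc.).
```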
We typically support whatever the underlying library supports. For int8 the support would come from cuBLASLt currently. I don't believe that or Cutlass supports mixed precision inputs, but I can check.
Currently we support both CUDA and CPU to some extent. CPU is done through standard C++ (and soon stdpar). Obviously standard C++ is problematic since it doesn't include everything we support (FFTs, matrix multiplies, etc). One option is to use open-source libraries that do these, but then it ends up being a lot of dependencies that are hard to manage. We have plans to improve CPU support soon, so stay tuned.
I don't actually know a lot about massively parallel libraries like CUDA. Does AMD have an equivalent technology associated with their GPUs? It feels like it should be fairly straightforward to create some kind of high-level library that just uses CUDA or whatever AMD has on the back end.
GPU performance per dollar is only competitive for specific workloads. For extremely large scale compute, getting enough data center GPUs can also be challenging.
This looks great! I'm definitely going to try this; I've been hoping for something like it for a long time. Eigen is great of course, but GPU acceleration with it is a bit clunky to use, and this looks much closer to the code written by the researchers.
It's nice to have easier to use C++ GPU numerical computing libraries, but it seems like there's a good bit of work to go in terms of feature set. Taking CuPy as an example, they've nearly entirely mirrored the enormous featureset of both NumPy and SciPy.
I'm not expecting anything on that scale here, but what's the planned future work? E.g. matrix-free operators and operations, sparse matrix support, or symmetry-aware eigenvalue/singular value decompositions? There are official CUDA libraries that support some of these (e.g. cuSPARSE).
Otherwise this looks good! Documentation is decent too.
Hi, having feature parity with cuPy is a daunting task, especially for a C++ library. At this point we feel we have a good foundation for all kinds of basic and advanced tensor manipulations, and have a growing number of library-based functions on top of that. The low-hanging fruit was wrapping the CUDA math libraries, so that has the most progress.
Since MatX was originally intended for streaming/real-time processing the focus has been on making C++ for CUDA easier to use for those kinds of applications.
There are also a lot of things in cuPy and sciPy that don't make a lot of sense to do on the GPU, like offline tasks such as filter design in signal processing.
Also, since C++ users are typically writing in that language for maximum performance, we've put a large focus on making sure we are as close to writing optimized CUDA as possible. In general, most workloads we've tested are about 3-4x faster than the CuPy counterpart due to better fusion and lower language overhead.
We have discussed supporting cusparse as well, but cusparse typically requires a different input type like a CSR matrix. This is not something you'd typically want to detect/convert, so we're still discussing ways of getting this integrated cleanly.
We do have several versions of SVD, including one that calls cuSolver.
If I'm reading the examples correctly, this is calling parameterized kernels in sequence, with nice-to-write matrix formats and memory movement? You're still ping-ponging from global (on device) to l2/shared memory each time? Cool approach for quick prototyping and replacing/porting some Eigen-stuff.
For those of us in-kernel day-laborers I can't wait for cuBLASDx and cuSolverDx releases, and for my brain to wrap around cutlass and cute. Please don't forget us!
Hi, what matrix sizes and types are you working with, and how many batches? In general it sounds like it would be similar to Eigen, but with GPU support. We have several SVD methods for different scenarios, so if you give us the info above we can let you know about the performance.
Hi, besides a (subjectively) easier syntax, the performance should be higher compared to libtorch. Every operator expression (think of it as an arithmetic expression) is evaluated at compile time and is often fused into a single GPU kernel. This also removes the need for JITing. If there's a specific workflow you're curious about comparing between libtorch and MatX, please let us know and we can try it out.
Why can’t you keep the calling interfaces of functions as close to the Py libraries as possible, simplifying the transition for everyone? Will that really destroy the performance increase? Common calling interfaces make everything much simpler. Even in this simple example, the calls differ significantly.
We've tried our best to match Python as well as we can, falling back to MATLAB-style if Python doesn't have something. Many of our unit tests are verified against Python, so the conversion is typically very easy. The one thing Python has that makes this much easier is keyword arguments; we've tried to use overloads to mimic them as best we could.
That being said, in the example on the home page there are notable differences:
1) We have the run() method. The reason is that the expression before run() is lazily evaluated for performance and does not execute anything by itself. Having the run() method also allows you to run the same line of code on either a CPU or GPU by changing the argument to run().
2) In MatX, memory allocation is explicit. Python does it as needed, but this causes a performance penalty from allocations and deallocations that are not under your control. Specifically in the FFT example, NumPy will allocate an ndarray prior to calling it, but on the same line. In MatX the allocation is (typically) done before the operation so you can control the performance of the hot path of your code (see the sketch below).
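A rough sketch of both points (illustrative only; the CPU executor type name is assumed and may differ from the current MatX API):

```cuda
#include <matx.h>
using namespace matx;

int main() {
    // 2) Allocation is explicit and done once, outside the hot path.
    auto x   = make_tensor<cuda::std::complex<float>>({16384});
    auto out = make_tensor<cuda::std::complex<float>>({16384});

    // 1) The expression is lazy; nothing executes until run(), and the
    //    argument to run() picks where it executes.
    (out = fft(x)).run();                                // GPU, default CUDA stream
    // (out = fft(x)).run(SingleThreadedHostExecutor{}); // CPU (executor name assumed)

    cudaStreamSynchronize(0);
    return 0;
}
```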
If you have any specific suggestions, we would love to hear them.
The main difference is that JAX is primarily for Python, while MatX is C++. This might seem like a poor answer, but in many domains (quasi-real-time, signal processing, etc.) the language is important for giving firm guarantees on performance.
MatX has been used in several projects with hundreds of microsecond deadlines, which is not usually something you'd choose Python for out of the box.
We are instead targeting users who already have Python or other high-level code that they need to port to C++ for whatever reason, and want to do it in the easiest way possible. With C++17 we're able to provide a simple syntax without compromising on performance compared to most native code.
I have to admit I'm only tangentially familiar with GNU Radio, but MatX should be able to integrate with any C++17 codebase. We have several examples of integration with our streaming sensor pipeline called Holoscan.
Good point, and agreed the landing page is a bit sensational. I mentioned it elsewhere but between MatX and cuPy we see a 3-4x performance difference on average. The gap tends to widen with more complex workflows where compile-time kernel fusion gives more improvements compared to something like a single GEMM.
The main difference is the GPU part. This is a large difference because the same lazily evaluated template type can be run on the CPU or GPU through what we call an executor. On the CPU it's likely very similar to how xtensor already uses expression trees, but on the GPU it's quite a bit different because of the libraries backing it and the optimized kernels.
The syntax of MatX also allows us to do more kernel fusion in the future to improve performance without any changes.