I wouldn't touch nvidia "goodies" with a barge pole. A perfectly good 5-year-old GPU got deprecated upon upgrading to Ubuntu 22.04 because of CUDA/drivers. The lock-in trap of the century.
The real issue is having to keep multiple copies of CUDA on a system to satisfy all the different moving parts that each want a different CUDA and cuDNN version. Those CUDA installers then fight over and uninstall each other's nvidia driver, because CUDA is shipped with the nvidia driver bundled and listed as an apt dependency.
My system changes from nvidia-525 to nvidia-535 to nvidia-520 to nvidia-515 on a daily basis because I need to reinstall a different CUDA version just to try some new paper's code.
PyTorch did it right: it now ships with its own CUDA and doesn't give a shit about what version you have in /usr/local. Everything else should do the same.
The solution is to install the latest CUDA driver (which is always 100% backward compatible) but keep multiple versions of the libraries in /usr/local. Granted, it can get messy with various package managers doing their own things.
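To make the driver-vs-runtime distinction concrete, here's a minimal sketch (my own, not from the thread) that prints which CUDA runtime a binary was linked against versus what the installed driver supports; a newer driver can run binaries built against older runtimes, which is why one up-to-date driver plus several toolkits in /usr/local can coexist:

```cuda
// Minimal sketch: compare the driver's supported CUDA version with the
// runtime version this binary was linked against.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVer);  // CUDA runtime this binary was built/linked against
    std::printf("driver supports CUDA %d, runtime is CUDA %d\n", driverVer, runtimeVer);
    return 0;
}
```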
Usually the CUDA versions end up uninstalling other CUDA versions and drivers due to package manager issues.
I wish nvidia would just release a `sudo apt install cuda-all` that keeps ALL possible CUDA versions installed and up to date simultaneously. I know it would be 20GB, but that's fine; it's a drop in the bucket compared to the datasets I play with.
But containers, X forwarding, visualization, and OpenGL are a complete shitshow. I have apps that work fine outside of a container but, when run from within a container, freeze the UI so badly that the mouse pointer won't even work.
> now ships with its own CUDA and doesn't give a shit about what version you have
Bundling a system dependency in Python software is always a giant shit show. I would not hold that up as "everything else should do the same".
Any other Python library (CuPy, say) that also ships its own CUDA, at a different version, will happily segfault because you end up with duplicated symbols in a single process.
What you just quoted is more or less what GNU glibc does with symbol versioning, which is another example of a giant shit show and an ABI nightmare.
And:
(1) It is a solution only Nvidia can provide: CUDA is a binary. And they will certainly not do that just to please the Python community.
(2) It just creates another set of problems, all because Python packaging sucks in the first place.
Hm no, what you actually have is that cuda-libraries-11-8 depends on cuda-drivers, and cuda-drivers depends on cuda-drivers-520 = 520.61.05-1, where the version number is repeated in the package name. That is a really bad pattern for things intended to be backward compatible: multiple CUDA installs will overwrite "cuda-drivers", and each one is pinned to a specific "cuda-drivers-X" where X is part of the name. This results in some really weird dependency hells if you try to install more than one CUDA version at the same time, depending on the order you install them.
Will you publish benchmarks for, e.g., a K80? Or provide a way for users to contribute them? It's really handy to know, comparable to how ResNet50 inference numbers get reported across a bunch of architectures.
Well, the root README.md has some "benchmarks" at the very top. Maybe there's no existing benchmarks doc with more details? Or even a repo behind what's in the README? The point is: if NumPy takes 5 s on a CPU but a K80 takes only 100 ms, then the K80's cost might be worth it instead of springing for the A100 at 3 ms. Similar argument for Jetson.
We have benchmarks in the benchmarks directory, but these are for things like convolution, matrix multiplies, etc. It's not for running a traditional benchmark suite like ResNet.
Like most benchmarks it really depends on what you want to do, and since it's a general library everyone might care about different things.
I saw those; yes, having the code is great, but I'm more interested in the actual numbers. E.g. what does this sample do on an A100, P100, T4, K80, or Jetson Nano? The analogue is how things like ResNet50 get tested and reported, so you know whether ResNet50 might work for your budget/hardware.
Versus the advert in the root README, which is impressive but gives no data on the Pareto front.
Are you talking about for desktop/GUI use? A lot of older Nvidia hardware has been implicitly depreciated for years. Without newer patches to help Wayland support, your options are either to use nouveau or stick with unsupported software and official drivers. It's ugly, but Microsoft and Apple pretty much depreciate hardware the same way; first with 'recommended' cutoff points, then hard depreciations. On Linux it's less spelled-out, but things seem to be Wayland-or-bust now.
*deprecate, or "EOL" (for continuing support). Depreciate is a tax and accounting concept, though at the end of a MACRS depreciation schedule office equipment is often immediately replaced since it's no longer contributing to tax write-offs. (Really, it's just not replaced as long as it continues to contribute to tax deductions.)
Surprisingly this repo is BSD licensed so it might even outgrow nvidia. Eigen is pretty strong but this new MatX syntax might catch on, especially with easy GPU integration.
Are they even comparing apples to apples to claim that they see these improvements over NumPy?
> While the code complexity and length are roughly the same, the MatX version shows a 2100x [speedup] over the Numpy version, and over 4x faster than the CuPy version on the same GPU.
NumPy doesn't use the GPU by default unless you use something like JAX [1] to compile NumPy code to run on GPUs. I think a more honest comparison would mainly run MatX on the same CPU as NumPy, and for the GPU comparison focus on CuPy.
This seems like an FFT-based benchmark... I believe NumPy, for licensing reasons, includes a port of the old FFTPACK, which is only single-threaded. (Py)FFTW or others are surely much faster.
So the test seems to compare one thread of a CPU launched in 2016, running a port of 40-year-old Fortran code, to an optimised FFT library on a GPU launched in 2020.
The sample is really designed to show the simplicity of the syntax, and the performance is just a side effect. Where you'll see a bigger performance difference with NumPy/CuPy is when kernel fusion happens: MatX is typically able to fuse many things into a single kernel at compile time, whereas CuPy launches many kernels. If you have a specific type of expression you'd like to compare, please let us know.
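As a rough illustration of what that fusion looks like at the source level (a sketch based on the README-style syntax; the tensor names are made up and the exact API may differ):

```cuda
#include <matx.h>
using namespace matx;

int main() {
    // Illustrative only: the whole right-hand side below is a lazily built
    // expression template, so the multiply, add, and exp can be emitted as a
    // single GPU kernel, whereas an eager NumPy/CuPy version would typically
    // launch one kernel per operation.
    auto a = make_tensor<float>({1024});
    auto b = make_tensor<float>({1024});
    auto c = make_tensor<float>({1024});

    (c = exp(a * b + 1.0f)).run();  // one fused kernel on the default CUDA stream
    cudaDeviceSynchronize();
    return 0;
}
```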
Not really, for GPU computation, PyTorch, TensorFlow or JAX are the baselines (they are not just for ML stuff). Even for CPU, that's a more relevant comparison.
Hi, I addressed the comparisons in other comments, but in general this is for C++ users and not Python. It's more of a comparison to NumPy/CuPy, and we do have a table showing the comparison in the docs.
Actually just started looking into MatX yesterday to accelerate our radar pipeline. Really interesting to see that this use-case is heavily featured in the documentation.
Is the UCLA/Nvidia/Raytheon collaboration (as presented in a recent GTC talk) a major force behind the development of MatX?
Hi, yes, the original development was started for radar users who did not know CUDA but needed to write in c++. Many of our examples and code are radar related for that reason.
Thanks, looks really interesting. Do functions like matmul support inputs of differing type, like say A=int8 and B=float? Would be nice if you could get memory efficient quantized matmul with operator fusion.
CUTLASS, which is NVIDIA's C++ template library for writing matrix multiply and convolution kernels parametrized over input/output types, operators, and algorithm block sizes, theoretically supports this. But each input of the (k, n)-shaped matrix B will be read from global memory ceil(n / block dimension) times in an algorithm that computes one (block dimension, block dimension) submatrix of the output matrix D per thread block. It will probably be more efficient to cast your B matrix down to FP16 or INT8 in a preprocessing kernel to reduce memory traffic in the matrix multiply kernel (a rough sketch of such a cast kernel is below).
On newer GPUs, though, we have this huge L2 cache which makes the calculus a little different if your working set fits into it. e.g. Ampere A100 has 40MB L2$.
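To make the "cast once, then run the GEMM in lower precision" idea concrete, here is a minimal CUDA sketch of such a preprocessing kernel (my own illustration, not CUTLASS code; the kernel and pointer names are hypothetical):

```cuda
// Hypothetical preprocessing kernel: cast a float B matrix to FP16 once, so
// the subsequent GEMM reads the half-sized copy instead of the float matrix.
#include <cuda_fp16.h>

__global__ void cast_to_half(const float* __restrict__ in,
                             __half* __restrict__ out,
                             size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __float2half(in[i]);  // one read of B in float, one half-sized write
    }
}

// Usage sketch (d_B_f32 and d_B_f16 assumed already allocated on the device):
//   size_t elems = (size_t)k * n;
//   cast_to_half<<<(elems + 255) / 256, 256>>>(d_B_f32, d_B_f16, elems);
//   // ...then pass d_B_f16 to the FP16 GEMM (CUTLASS, cuBLASLt, etc.).
```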
We typically support whatever the underlying library supports. For int8 the support would come from cuBLASLt currently. I don't believe that or Cutlass supports mixed precision inputs, but I can check.
Currently we support both CUDA and CPU to some extent. CPU is done through standard C++ (and soon stdpar). Obviously standard C++ is problematic since it doesn't include everything we support (FFTs, matrix multiplies, etc). One option is to use open-source libraries that do these, but then it ends up being a lot of dependencies that are hard to manage. We have plans to improve CPU support soon, so stay tuned.
I don't actually know a lot about massively parallel libraries like CUDA. Does AMD have an equivalent technology associated with their GPUs? It feels like it should be fairly straightforward to create some kind of high-level library that just uses CUDA or whatever AMD has on the back end.
GPU performance per dollar is only competitive for specific workloads. For extremely large scale compute, getting enough data center GPUs can also be challenging.
This looks great! I'm definitely going to try this; I've been hoping for something like it for a long time. Eigen is great of course, but GPU acceleration with it is a bit clunky to use, and this looks much closer to the code written by the researchers.
It's nice to have easier to use C++ GPU numerical computing libraries, but it seems like there's a good bit of work to go in terms of feature set. Taking CuPy as an example, they've nearly entirely mirrored the enormous featureset of both NumPy and SciPy.
I'm not expecting anything on that scale here, but what's the planned future work? E.g. matrix-free operators and operations, sparse matrix support, or symmetry-aware eigenvalue/singular value decompositions? There are official CUDA libraries that support some of these (e.g. cuSPARSE).
Otherwise this looks good! Documentation is decent too.
Hi, having feature parity with cuPy is a daunting task, especially for a C++ library. At this point we feel we have a good foundation for all kinds of basic and advanced tensor manipulations, and have a growing number of library-based functions on top of that. The low-hanging fruit was wrapping the CUDA math libraries, so that has the most progress.
Since MatX was originally intended for streaming/real-time processing the focus has been on making C++ for CUDA easier to use for those kinds of applications.
There are also a lot of things in cuPy and sciPy that don't make a lot of sense to do on the GPU, like offline tasks such as filter design in signal processing.
Also, since C++ users are typically writing in that language for maximum performance, we've put a large focus on making sure we are as close to writing optimized CUDA as possible. In general, most workloads we've tested are about 3-4x faster than the CuPy counterpart due to better fusion and lower language overhead.
We have discussed supporting cusparse as well, but cusparse typically requires a different input type like a CSR matrix. This is not something you'd typically want to detect/convert, so we're still discussing ways of getting this integrated cleanly.
We do have several versions of SVD, including one that calls cuSolver.
If I'm reading the examples correctly, this is calling parameterized kernels in sequence, with nice-to-write matrix formats and memory movement? You're still ping-ponging from global (on device) to l2/shared memory each time? Cool approach for quick prototyping and replacing/porting some Eigen-stuff.
For those of us in-kernel day-laborers I can't wait for cuBLASDx and cuSolverDx releases, and for my brain to wrap around cutlass and cute. Please don't forget us!
Hi, what matrix sizes and types are you working with, and how many batches? In general it sounds like it would be similar to Eigen, but with GPU support. We have several SVD methods for different scenarios, so if you give us the info above we can let you know about the performance.
Hi, besides a (subjectively) easier syntax, the performance should be higher compared to libtorch. Every operator expression (think of it as an arithmetic expression) is evaluated at compile time and is often fused into a single GPU kernel. This also removes the need for JITing. If there's a specific workflow you're curious about comparing between libtorch and MatX, please let us know and we can try it out.
Why can’t you keep the calling interfaces of functions as close to the Py libraries as possible, simplifying the transition for everyone? Will that really destroy the performance increase? Common calling interfaces make everything much simpler. Even in this simple example, the calls differ significantly.
We've tried our best to match Python as well as we can, falling back to MATLAB-style if Python doesn't have something. Many of our unit tests are verified against Python, so the conversion is typically very easy. The one thing Python has that makes this much easier is keyword arguments; we've tried to use overloads to mimic them as best we could.
That being said, in the example on the home page there are notable differences:
1) We have the run() method. The reason is that the expression before run() is lazily evaluated for performance and does not execute anything by itself. Having the run() method also allows you to run the same line of code on either a CPU or GPU by changing the argument to run().
2) In MatX, memory allocation is explicit. Python does it as needed, but this causes a performance penalty from allocations and deallocations that are not under your control. Specifically in the FFT example, NumPy will allocate an ndarray prior to calling it, but on the same line. In MatX the allocation is (typically) done before the operation so you can control the performance of the hot path of your code (see the sketch below).
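A rough sketch of both points (illustrative only; the CPU executor type name is assumed and may differ from the current MatX API):

```cuda
#include <matx.h>
using namespace matx;

int main() {
    // 2) Allocation is explicit and done once, outside the hot path.
    auto x   = make_tensor<cuda::std::complex<float>>({16384});
    auto out = make_tensor<cuda::std::complex<float>>({16384});

    // 1) The expression is lazy; nothing executes until run(), and the
    //    argument to run() picks where it executes.
    (out = fft(x)).run();                                // GPU, default CUDA stream
    // (out = fft(x)).run(SingleThreadedHostExecutor{}); // CPU (executor name assumed)

    cudaStreamSynchronize(0);
    return 0;
}
```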
If you have any specific suggestions, we would love to hear them.
The main difference is that JAX is primarily for Python, while MatX is C++. This might seem like a poor answer, but in many domains (quasi-real-time, signal processing, etc.) the language is important for giving firm guarantees on performance.
MatX has been used in several projects with hundreds of microsecond deadlines, which is not usually something you'd choose Python for out of the box.
We are instead targeting users who already have Python or other high-level code that they need to port to C++ for whatever reason, and want to do it in the easiest way possible. With C++17 we're able to provide a simple syntax without compromising on performance compared to most native code.
I have to admit I'm only tangentially familiar with GNU Radio, but MatX should be able to integrate with any C++17 codebase. We have several examples of integration with our streaming sensor pipeline called Holoscan.
Good point, and agreed the landing page is a bit sensational. I mentioned it elsewhere but between MatX and cuPy we see a 3-4x performance difference on average. The gap tends to widen with more complex workflows where compile-time kernel fusion gives more improvements compared to something like a single GEMM.
The main difference is the GPU part. This is a large difference because the same lazily evaluated template type can be run on the CPU or GPU through what we call an executor. On the CPU it's likely very similar to how xtensor already uses expression trees, but on the GPU it's quite a bit different because of the libraries backing it and the optimized kernels.
The syntax of MatX also allows us to do more kernel fusion in the future to improve performance without any changes.