I have to say I disagree with you on the AMD/TensorFlow point here, Adam. Your previous points are completely valid. Nothing of note will happen in 2018, I expect. But by early 2019, maybe - if AMD (or Intel) gets their act together on the software side of things. I don't think the "community" will do it for them. But maybe a big player like Amazon will have had enough of Nvidia and support an open alternative. Although we support CUDA in our version of YARN (Hops Hadoop), I expect we'd add ROCm or OpenCL or whatever - if it were a serious alternative. That would happen quickly. The problem, of course, is that we would want great support in TensorFlow first before we did that. Data scientists need a seamless transition - including from a performance perspective. For us, that also means support for distributed deep learning (Ring AllReduce over InfiniBand). I don't expect that will happen in 2018, and it could take until 2020, if I'm being realistic. That means that by the time AMD finally gets some good DL libraries, Nvidia will still have one-up on them with distributed DL (training time shrinks near-linearly as you add GPUs).
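For anyone unfamiliar with the Ring AllReduce I mentioned, here's a toy single-process sketch of the idea in plain Python - no MPI/NCCL/InfiniBand, just lists standing in for the network; the function name and bookkeeping are mine for illustration:

```python
# Toy simulation of Ring AllReduce: each of n workers holds a gradient
# vector; after a scatter-reduce phase and an all-gather phase, every
# worker holds the element-wise sum, with each worker only ever talking
# to its ring neighbour.

def ring_allreduce(vectors):
    n = len(vectors)
    assert len(vectors[0]) % n == 0, "vector length must split into n chunks"
    chunk = len(vectors[0]) // n
    # data[i][c] is worker i's copy of chunk c
    data = [[list(v[c * chunk:(c + 1) * chunk]) for c in range(n)]
            for v in vectors]

    # Phase 1, scatter-reduce: after n-1 steps, worker i holds the fully
    # summed chunk (i+1) mod n. Snapshot the sends so all transfers in a
    # step happen "simultaneously", as on a real ring.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, list(data[i][(i - step) % n]))
                 for i in range(n)]
        for i, c, payload in sends:
            dst = (i + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]

    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(data[i][(i + 1 - step) % n]))
                 for i in range(n)]
        for i, c, payload in sends:
            data[(i + 1) % n][c] = payload

    return [[x for c in w for x in c] for w in data]

# 4 workers, each contributing a 4-element "gradient"
results = ring_allreduce([[1, 1, 1, 1], [2, 2, 2, 2],
                          [3, 3, 3, 3], [4, 4, 4, 4]])
```

Every worker ends up with the same summed vector, and the per-worker traffic is independent of the number of workers - that's why it scales so well over a fast interconnect.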
The other wildcard I haven't seen people mention here is the Neural Network Processor (NNP) from Intel Nervana. The hardware has potential - as long as the software doesn't force us to use BigDL.
Re: AMD/Intel. We've been waiting for them to get their act together for years. Nervana could be great but I'm going to wait on that one. So far their "launches" have been nothing more than marketing fluff.
As for your projections about TensorFlow: it won't be TensorFlow. TensorFlow will be one of many frameworks. Look, I like Hops, but you guys push TensorFlow explicitly. A startup running its own Hadoop distro that happens to push TensorFlow isn't going to move the needle. You guys are great middleware, I'm sure, but I haven't seen the customers where it might be viable. I hope you guys continue to grow, though! Meanwhile, most Hadoop vendors are focused on moving up the stack. It will take players with real resources to move the needle on enterprise adoption.
Amazon is doing this with MXNet and EMR, MapR is pushing TensorFlow in their serving, and CNTK is being pushed in SQL Server and HDInsight. There's some competition there.
What I'm getting at here is: it will take multiple vendors and competition. I'm going to place my bet on the bigger players already involved with the foundations first, though.
Open standards (addressed below) that commoditize the chip will be key. The storage infra will follow from that.
It should be something that doesn't displace current infra but allows interop.
Things like NNVM from the MXNet folks and ONNX (where the framework doesn't matter anymore!), being pushed by the various hardware vendors, will move the needle. You need buy-in from the actual big players who can front the development time to make these things viable alternatives.
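To make the "framework doesn't matter anymore" point concrete, here's a toy Python sketch of the exchange-format idea - not real ONNX (real graphs are protobuf with a far richer op set); the graph layout and backend dicts are mine for illustration:

```python
# Toy sketch of the ONNX/NNVM idea: the model is just data (a graph of
# named ops), so any backend that implements those ops can run it -- the
# framework that trained it no longer matters.

GRAPH = [  # y = relu(x * w + b), as framework-neutral nodes
    {"op": "mul", "inputs": ["x", "w"], "output": "t0"},
    {"op": "add", "inputs": ["t0", "b"], "output": "t1"},
    {"op": "relu", "inputs": ["t1"], "output": "y"},
]

def run(graph, backend_ops, feeds):
    """Execute the graph with whatever op implementations the backend supplies."""
    env = dict(feeds)
    for node in graph:
        args = [env[name] for name in node["inputs"]]
        env[node["output"]] = backend_ops[node["op"]](*args)
    return env["y"]

# Two independent "backends" honouring the same op contract.
cpu_ops = {"mul": lambda a, b: a * b,
           "add": lambda a, b: a + b,
           "relu": lambda a: max(0.0, a)}
accel_ops = dict(cpu_ops)  # stand-in for a vendor accelerator backend

feeds = {"x": 2.0, "w": 3.0, "b": -1.0}
y_cpu = run(GRAPH, cpu_ops, feeds)
y_accel = run(GRAPH, accel_ops, feeds)
```

Swap in a ROCm or NNP backend that honours the same op contract and nothing upstream has to change - that's the commoditization play.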
As for your "seamless transition": I'm not sure that would be that hard, done right. Supporting "great TensorFlow" can come in multiple flavors. As a separate issue, TensorFlow's production story is horrible - that's another topic I could rant about all day. It ultimately comes down to abstracting the framework away, which is a hard problem by itself. (Disclosure: I have my own solution for this that I won't talk about here; just know I'm biased :D)
Lastly, I question whether OpenCL can even be a viable alternative. It's a fragmented, inconsistent standard with a worse API than CUDA. One reason CUDA "won" is that it's in general cleaner and a clear leader in the space.
Yeah, ROCm is the most viable candidate as of today.
In general, Nvidia are not good for middleware vendors. They want to be one, but don't offer a platform that integrates with anything. Licensing costs for the DGX-1 are insanely high.
My problem is mostly from a data scientist's perspective - teams don't need a few high-performance GPUs, like a couple of DGX-1 boxes. They need a hundred 1080 Tis, maybe complemented by a DGX-1 (together costing about the same as 2 DGX-1s). That way they can run lots of parallel experiments (hyperparameter optimization) and do distributed training. Making GPUs a scarce resource just reinforces the lead of the hyperscale AI companies, who have thousands of GPUs available for their data scientists.
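The "lots of parallel experiments" point in code form - a toy random search where each pool worker stands in for one cheap GPU; the pool sizes and the dummy objective are mine for illustration, not a real training run:

```python
# Toy sketch: hyperparameter search is embarrassingly parallel, so trial
# throughput scales with the number of (cheap) GPUs you can throw at it.
import random
from concurrent.futures import ThreadPoolExecutor

def train_and_score(cfg):
    # stand-in for "train a model with these hyperparams on one GPU,
    # return the validation score" -- here just a toy objective
    lr, dropout = cfg
    return -(lr - 0.01) ** 2 - (dropout - 0.5) ** 2

random.seed(0)
trials = [(random.uniform(1e-4, 0.1), random.uniform(0.0, 0.9))
          for _ in range(100)]                      # 100 experiments to run

with ThreadPoolExecutor(max_workers=8) as pool:    # a "cluster" of 8 GPUs
    scores = list(pool.map(train_and_score, trials))

best_score, best_cfg = max(zip(scores, trials))
```

With 8 workers the 100 trials finish roughly 8x faster than serially; with a hundred 1080 Tis every trial runs at once - that's the throughput argument against a couple of scarce DGX boxes.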
Oh, I agree that GPUs should be more of a commodity. We might see alternative ASICs rather than GPUs come out, though. I'm personally more interested in seeing that succeed than in confining the solution space to GPUs and discrete-GPU competition. I'm just not keen on trying to predict what will win (I really don't know); I just have criteria I'd look for before trying to implement support for it, either in my deep learning framework or for customers.