
Yeah. The amount of compute required is pretty high. I wonder, is there enough distributed compute available to bootstrap a truly open model through a system like seti@home or folding@home?


The compute exists, but we'd need some conceptual breakthroughs before DNN training over high-latency internet links makes sense.
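
To put rough numbers on why (back-of-the-envelope; the 7B-parameter model and the 100 Mbit/s home uplink are just assumptions for illustration):

    # Naive data-parallel training ships a full gradient every step.
    params = 7e9                   # assumed model size: 7B parameters
    bytes_per_value = 2            # fp16/bf16 gradients
    payload_bits = params * bytes_per_value * 8
    uplink_bps = 100e6             # assumed 100 Mbit/s residential uplink
    print(payload_bits / uplink_bps / 60)   # ~18.7 minutes per sync step

Compare that with the milliseconds a datacenter interconnect spends on the same exchange, and it's clear you can't just run vanilla data parallelism over the public internet; you need schemes that communicate far less often (like the ones mentioned downthread).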


Distributing the training data also opens up attack vectors. Poisoning or biasing the dataset sent out to contributors needs to be guarded against... but I don't think that's actually possible in a distributed model (in principle?). If the compute is happening off-server, then trust is required (which is not {efficiently} enforceable?).


Trust is kinda a solved problem in distributed computing. The different "@Home" projects and Bitcoin handle this by requiring multiple validations of each block of work for exactly this reason.
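
A minimal sketch of that kind of redundancy, BOINC-style (the quorum size and the idea of comparing canonicalized result hashes are my assumptions, not any project's actual policy):

    from collections import Counter

    def validate_work_unit(result_hashes, quorum=3):
        """Accept a work unit only if at least `quorum` independent
        volunteers returned the same canonicalized result."""
        best, count = Counter(result_hashes).most_common(1)[0]
        return best if count >= quorum else None   # None => reissue the work unit

    # e.g. four volunteers report hashes for the same work unit
    accepted = validate_work_unit(["abc", "abc", "xyz", "abc"])   # -> "abc"

The catch is that this only works when the result is deterministic and cheap to compare, which is exactly where gradient computation gets awkward.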


How do you verify the work of training without redoing that exact same training? (That's the neat part: you don't.)

Bitcoin's trust story works because each new block depends on the previous blocks. With training there is no such structure to verify against (prompt/answer pairs don't depend at all on other prompt/answer pairs; if they did, we wouldn't need to do the training in the first place).

You can rely on multiplying the work and ignoring gross variations (as you suggest), but that costs a lot of extra compute, and it's still susceptible to bad actors (though much more resistant).
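
For what that could look like in practice, here's a sketch using a coordinate-wise median over redundant gradient submissions (the redundancy factor and the choice of median are assumptions; it's one robust-aggregation option, not a complete defense):

    import numpy as np

    def robust_aggregate(gradients):
        """gradients: same-shaped arrays from volunteers who were all
        given the identical data shard. The coordinate-wise median
        ignores gross outliers, so a minority of bad actors can't
        drag the update arbitrarily far."""
        return np.median(np.stack(gradients), axis=0)

    # five volunteers redo the same shard; one is malicious
    honest = [np.random.randn(1000) * 0.01 for _ in range(4)]
    malicious = [np.full(1000, 1e6)]
    update = robust_aggregate(honest + malicious)   # stays near the honest gradients

The overhead is the redundancy factor itself: every shard gets computed several times over, which is the extra compute cost mentioned above.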

There is no solid solution, afaik, for distributed training of an AI (Open Assistant, I think, is working on open training data?). If there is, I'll sign up.


There has been some interesting work on distributed training. For example, DiLoCo (https://arxiv.org/abs/2311.08105). I also know that Bittensor and Nous Research collaborated on some kind of competitive distributed model frankensteining-training thingy that seems to be going well. https://bittensor.org/bittensor-and-nous-research/
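
Very roughly, the DiLoCo trick is that workers only talk every few hundred local steps: each replica trains a private copy with its own inner optimizer, then everyone exchanges the parameter delta ("pseudo-gradient") and an outer optimizer applies the average. A stripped-down sketch (toy model, fake data, workers run sequentially; hyperparameters are placeholders, though the paper does use an AdamW-inner / Nesterov-SGD-outer split):

    import copy, torch

    global_model = torch.nn.Linear(32, 1)    # toy stand-in for the shared model
    outer_opt = torch.optim.SGD(global_model.parameters(),
                                lr=0.7, momentum=0.9, nesterov=True)

    def local_training(model, inner_steps=100):
        """One worker: many steps on its own shard, zero communication."""
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
        for _ in range(inner_steps):
            x = torch.randn(16, 32)           # stands in for the worker's data shard
            loss = model(x).pow(2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        return model

    for outer_round in range(10):             # communication happens only here
        deltas = []
        for _ in range(4):                    # 4 workers
            local = local_training(copy.deepcopy(global_model))
            deltas.append([g.data - l.data    # pseudo-gradient: how far this worker moved
                           for g, l in zip(global_model.parameters(), local.parameters())])
        for i, g in enumerate(global_model.parameters()):
            g.grad = torch.stack([d[i] for d in deltas]).mean(0)
        outer_opt.step()                      # outer optimizer applies the averaged delta
        outer_opt.zero_grad()

Communication drops from every step to once per inner_steps steps, which is what makes home-internet bandwidth and latency survivable.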

Of course it gets harder as models get larger, but distributed training doesn't seem totally infeasible. For example, with MoE transformer models, perhaps separate slices of the model could be trained asynchronously and then combined with some retraining. You could have minimal regular communication about, say, the mean and variance of each layer, plus a new loss term dependent on these statistics to keep the "expertise" of each contributor distinct.
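
That statistics-based loss term is underspecified, but here's one guess at the intent, purely as a sketch (the hinge form, margin, and weighting are all made up): each contributor periodically broadcasts a per-layer mean/variance, which is only a couple of vectors per layer, and adds a penalty when its own statistics drift too close to everyone else's.

    import torch

    def distinctness_loss(my_acts, published_stats, margin=1.0):
        """my_acts: this contributor's activations for one layer, (batch, dim).
        published_stats: list of (mean, var) pairs broadcast by other
        contributors. Penalize being too similar, so each contributor's
        slice keeps a distinct 'expertise'."""
        mu, var = my_acts.mean(0), my_acts.var(0)
        loss = torch.zeros(())
        for other_mu, other_var in published_stats:
            dist = (mu - other_mu).pow(2).mean() + (var - other_var).pow(2).mean()
            loss = loss + torch.relu(margin - dist)   # only fires when stats are close
        return loss

    # usage inside a contributor's training step:
    # total_loss = task_loss + 0.01 * distinctness_loss(hidden_acts, stats_from_peers)

Whether that actually preserves useful expertise rather than just pushing activations apart is an open question, but the communication cost really is tiny.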


Forward-Forward looked promising, but then Hinton got the AI-Doomer heebie-jeebies and bailed. Perhaps someone will pick up the concept and run with it. I'd love to myself, but I don't have the skillz to build stuff at that depth, yet.
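
For anyone wondering why Forward-Forward keeps coming up in this context: each layer is trained with a purely local objective, a "goodness" score that should be high for real data and low for negative data, so there's no end-to-end backprop to synchronize across machines. A minimal single-layer sketch (layer sizes, threshold, and optimizer are arbitrary choices):

    import torch

    layer = torch.nn.Linear(784, 500)
    opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
    theta = 2.0                                        # goodness threshold (assumed)

    def goodness(x):
        return layer(x).relu().pow(2).sum(dim=1)       # sum of squared activities

    def ff_step(x_pos, x_neg):
        """One local update: push goodness above theta for real data and
        below theta for negative data. No gradients flow to other layers,
        so each layer could in principle live on a different machine."""
        logits = torch.cat([goodness(x_pos) - theta,   # should end up > 0
                            theta - goodness(x_neg)])  # should end up > 0
        loss = torch.nn.functional.softplus(-logits).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

The layer-local objective is what makes it attractive for slow links; whether it scales to competitive model quality is still an open question.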



