This is really cool, and I expect groups like Eleuther will want to integrate it into their workflows.
That said, there is a fundamental constraint here -- gradients must, at some point, be shared. This paper says they don't have to be shared every step, which is really great. But gradient sizes are on the order of the size of the network being trained.
For this to be useful for indie LLM training, I'd guess you'll want it to work while sharing gradients no more often than once every few hours to once a day -- more often than that, and the bandwidth costs of sending 50GB weight files around during training are going to kill you.
The paper seems to indicate they can share every few thousand steps? (8 workers x 500x bandwidth reduction?) But sharing every few million steps would help that distributed large-LLM use case, and my guess is that this is going to be a bit harder to get to converge. But hopefully not!
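For anyone who wants a concrete picture, here's a rough sketch of the general approach as I understand it -- lots of cheap local steps, then an infrequent synchronization. This is my own toy simulation (workers are just model copies in one process, plain SGD as the outer step, made-up hyperparameters), not the paper's exact algorithm:

    import copy
    import torch
    import torch.nn as nn

    # Toy simulation: the "workers" are copies of the model in one process.
    # In the real setting each replica sits on its own node, and the averaging
    # at the bottom is the only thing that ever crosses the network.
    NUM_WORKERS, LOCAL_STEPS, OUTER_ROUNDS = 4, 500, 10   # made-up hyperparameters

    global_model = nn.Linear(32, 1)

    for outer_round in range(OUTER_ROUNDS):
        start_params = [p.detach().clone() for p in global_model.parameters()]
        avg_delta = [torch.zeros_like(p) for p in start_params]

        for worker in range(NUM_WORKERS):
            local = copy.deepcopy(global_model)
            opt = torch.optim.AdamW(local.parameters(), lr=1e-3)

            for step in range(LOCAL_STEPS):            # many local steps with
                x = torch.randn(8, 32)                 # no communication at all
                loss = local(x).pow(2).mean()          # stand-in for a real loss
                opt.zero_grad()
                loss.backward()
                opt.step()

            # "Pseudo-gradient": how far this worker drifted from the shared weights.
            for d, sp, lp in zip(avg_delta, start_params, local.parameters()):
                d += (sp - lp.detach()) / NUM_WORKERS

        # Outer update, once every LOCAL_STEPS steps. Plain SGD here; a real
        # implementation would likely use a smarter outer optimizer.
        with torch.no_grad():
            for p, d in zip(global_model.parameters(), avg_delta):
                p -= d

The only thing that would cross the network is the averaged delta once per outer round, which is where the bandwidth savings come from.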
Related: there's an n^2 problem for getting the gradients out to lots of distributed compute resources.
So, if anyone wants to work on step 2, I'd suggest there's an engineering task: 1-to-N gradient upload and distribution, and then a science task: revalidating this with check-ins a few orders of magnitude less frequent.
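Back-of-the-envelope on that n^2 point (the 50GB figure and worker counts are just placeholders to show the scaling; ring all-reduce is one standard way to bring per-round traffic down to roughly linear in the number of workers):

    def naive_all_to_all_bytes(n_workers, grad_bytes):
        # Every worker sends its full update to every other worker: O(n^2) total traffic.
        return n_workers * (n_workers - 1) * grad_bytes

    def ring_all_reduce_bytes(n_workers, grad_bytes):
        # Ring all-reduce: each worker sends ~2*(n-1)/n of the update, so ~O(n) total.
        return n_workers * (2 * (n_workers - 1) / n_workers) * grad_bytes

    GRAD_BYTES = 50e9  # the ~50GB weight-file size mentioned above, as a stand-in
    for n in (8, 64, 512):
        print(f"{n:4d} workers: naive {naive_all_to_all_bytes(n, GRAD_BYTES)/1e12:9.1f} TB"
              f" vs ring {ring_all_reduce_bytes(n, GRAD_BYTES)/1e12:6.1f} TB per round")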
In the paper they usually share every 500 steps. A few million steps is an entire training run; more often than not, models are trained for less than a million steps.
Quick question: What is a "gradient" in this context? Is it a file? Is it some state that is stored somewhere?
My understanding of ML is limited: I made what is basically a "Hot Dog or Not Hot Dog" image classifier, and I know what a neural net is.
The gradient for that simple neural net was just found by running Adam, an optimizer, on the current batch, and then updating the model weights. So by "gradient" do you mean the model weights?
No, it's the adjustment to the model weights that is made during training. Given some input and some expected value there will be some delta, and that is used to calculate the gradient -- these networks are essentially being trained by gradient descent. As for the size of the data that has to be shared, it's going to depend on the network size and what kind of representation you're using for the weights -- probably bfloat16 these days, but we're certainly seeing a lot of 4-bit representations now.
The gradient is the direction and magnitude of change for each model weight. The optimizer determines how to adjust the weight based on the gradient.
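To make the weights/gradient/optimizer distinction concrete, here's a minimal PyTorch snippet (a toy linear model and random data, purely illustrative):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                       # the weights live in model.parameters()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    x, y = torch.randn(4, 10), torch.randn(4, 1)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                                # fills p.grad: one value per weight,
                                                   # same shape as the weight tensor
    for p in model.parameters():
        print(p.shape, p.grad.shape)

    opt.step()                                     # Adam reads p.grad and nudges the weights
    opt.zero_grad()                                # gradients are per-batch state, so clear them

It's that p.grad tensor (or an accumulated version of it), not the weights themselves, that the distributed schemes discussed above have to decide how often to communicate.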
There are a lot of distributed data-parallel and federated learning (FL) algorithms that could be applied to training LLMs, and there have been several papers that tackle applying these. I don't think I've seen 500 steps between communication rounds in the FL literature (a little eye-opening, to be honest), and I don't think I've personally tested more than 50 steps between rounds. I'd be interested in testing other algorithms, partial client participation, hierarchical federation approaches, error feedback, and so on.
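On the error-feedback point, the core mechanism is small enough to sketch -- send only the top-k entries of each update and carry the unsent remainder into the next round (a generic version, not any specific paper's variant):

    import torch

    def topk_with_error_feedback(update, residual, k):
        # Add back whatever we failed to send last round, then keep only the
        # k largest-magnitude entries; the rest becomes the new residual.
        corrected = (update + residual).flatten()
        idx = corrected.abs().topk(k).indices
        compressed = torch.zeros_like(corrected)
        compressed[idx] = corrected[idx]
        new_residual = (corrected - compressed).view_as(update)
        return compressed.view_as(update), new_residual

    # One residual buffer per parameter tensor, per worker.
    g = torch.randn(10_000)
    residual = torch.zeros_like(g)
    to_send, residual = topk_with_error_feedback(g, residual, k=100)
    # In practice you'd ship just the k values + indices (~100x smaller payload).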
So we can SETI@Home or Folding@Home for large language models now? Not sure how small the minimum size of the compute cluster can be. If it's still out of consumer reach, then this would either be just an intermediate research step, or a way for smaller (but still professional/well-resourced) labs to collaborate. I'm not sure the latter would be helpful, since if they wanted to collaborate they could probably already do that by pooling resources for a large cloud compute run.
Someone on here argued this is entirely possible and that results can be merged later on. It seems fascinating. Perhaps it is not compute each node has to contribute, but just bandwidth?
The limitation in the paper is more the maximum number of workers than the minimum. It'd be pretty neat to get something that works well across 100+ distributed workers.
A few years ago, someone (Apple?) was working on a way to distribute the training of ML models across personal devices. The idea was that the master model in the cloud could be trained on your personal data without the data ever leaving your device. Not sure if that was ever put into production, but this feels like a scaled-up version of that, with distributed data centers instead of iPhones.
Google has done a lot of work in this area: https://federated.withgoogle.com/
Google has shipped this in gboard, the Google keyboard - https://arxiv.org/pdf/1812.02903.pdf
(Disclaimer, I worked on a later version of this for other models in the android keyboard at Google)
Yeah, it makes sense that it was Google as it’s in the same wheelhouse. Thank you both for the references
It's unfortunate that the larger model trained to evaluate the technique is "only" 400M parameters, I'd love to see it applied to models with billions of parameters.