This is really cool, and I expect groups like Eleuther will want to integrate it into their workflows.
That said, there is a fundamental constraint here -- gradients must, at some point, be shared. This paper says they don't have to be shared every step, which is really great. But a gradient update is on the same order of magnitude in size as the network being trained.
To make this useful for indie LLM training, I'd guess you'll want it to work while sharing gradients no more often than once every few hours to once a day -- any more often than that, and the bandwidth cost of shipping ~50GB weight-sized gradient files around during training is going to kill you.
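For rough intuition, here's a back-of-envelope sketch; the ~50GB payload is from above, but the link speed and step rate are my own assumptions:

```python
# Back-of-envelope cost of a gradient check-in. Assumptions (mine, not the
# paper's): ~50 GB payload per sync, a 1 Gbit/s uplink, ~3 local steps/sec.
payload_gb = 50.0
link_gbps = 1.0
steps_per_sec = 3.0

upload_hours = payload_gb * 8 / link_gbps / 3600     # ~0.11 h (~7 min) per upload
print(f"one upload: {upload_hours * 60:.0f} min")

for sync_every in (500, 10_000, 100_000):
    compute_hours = sync_every / steps_per_sec / 3600
    overhead = upload_hours / (compute_hours + upload_hours)
    print(f"sync every {sync_every:>7} steps: {compute_hours:5.2f} h of compute, "
          f"{overhead:.0%} of wall-clock spent uploading")
```

Under those made-up numbers, only the hours-apart check-ins keep the upload time a rounding error.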
The paper seems to indicate they can share every few thousand steps (8 workers x a ~500x bandwidth reduction?), but sharing every few million steps is what would really help the distributed large-LLM use case, and my guess is that's going to be a fair bit harder to get to converge. Hopefully not, though!
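To put rough numbers on that gap (again, the step rate is my assumption, not a figure from the paper):

```python
# How far apart is "every few hundred/thousand steps" vs. "hours-to-a-day apart"?
# Assuming ~3 local steps/sec (my number, not the paper's):
steps_per_sec = 3.0
print(f"500-step syncs: ~{500 / steps_per_sec / 60:.0f} min apart")
for hours in (4, 24):
    print(f"{hours:>2} h between syncs -> ~{hours * 3600 * steps_per_sec:,.0f} local steps")
# i.e. the gap to the demonstrated regime is a couple of orders of magnitude,
# and "every few million steps" is further still.
```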
Related: there's an n^2 problem in getting the gradients out to lots of distributed compute resources -- if every worker has to exchange gradients directly with every other worker, traffic grows quadratically with the number of workers.
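A quick count of what that means per sync, assuming naive all-to-all sharing and the ~50GB payload from above:

```python
# Data moved per sync: naive all-to-all vs. a single hub that everyone
# uploads to and downloads from (payload size is an assumption).
def total_gb_moved(n_workers, payload_gb=50):
    all_to_all = n_workers * (n_workers - 1) * payload_gb  # each worker sends to each other worker
    hub = 2 * n_workers * payload_gb                       # one upload + one download per worker
    return all_to_all, hub

for n in (8, 64, 512):
    a2a, hub = total_gb_moved(n)
    print(f"{n:>3} workers: all-to-all {a2a:>12,} GB vs hub {hub:>7,} GB per sync")
```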
So, if anyone wants to work on step 2, I'd suggest there are two pieces: an engineering task -- 1-to-N gradient upload and distribution -- and a science task -- revalidating that training still converges with check-ins a few orders of magnitude less frequent.
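For the engineering half, the communication shape is basically hub-and-spoke. Here's a minimal sketch using plain torch.distributed; SYNC_EVERY and the use of weight averaging in place of explicit gradient exchange are my assumptions, not the paper's method:

```python
import torch.distributed as dist

SYNC_EVERY = 100_000  # hypothetical: local steps between check-ins

def maybe_sync(step, model):
    """Assumes dist.init_process_group(...) was already called on every worker."""
    if step % SYNC_EVERY != 0:
        return
    for p in model.parameters():
        # N-to-1: every worker sends its local weights to rank 0
        dist.reduce(p.data, dst=0, op=dist.ReduceOp.SUM)
        if dist.get_rank() == 0:
            p.data /= dist.get_world_size()   # hub averages
        # 1-to-N: hub redistributes the merged weights to everyone
        dist.broadcast(p.data, src=0)
```

The science task is then whether training still behaves when SYNC_EVERY gets that large.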