Collaborative Parameter Update Based on Average Variance Reduction of Historical Gradients
Abstract: The Stochastic Gradient Descent (SGD) algorithm estimates the gradient from a single randomly chosen sample, which introduces large variance, slows the convergence of machine learning models, and makes training unstable. This paper proposes a distributed SGD based on variance reduction, named DisSAGD. The method updates the parameters of the machine learning model using variance reduction based on the average of historical gradients. It requires neither full gradient computation nor additional storage, and shares parameters across nodes through an asynchronous communication protocol. To address the "update staleness" problem in distributing global parameters, a learning rate with an acceleration factor and an adaptive sampling strategy are adopted: on the one hand, when the parameters deviate from the optimal value, the acceleration factor is increased to speed up convergence; on the other hand, when one worker node is faster than the others, more samples are drawn for its next iteration, so that the node has more time to compute the local gradient. Experiments show that DisSAGD significantly reduces the waiting time between loop iterations, converges faster than the baseline methods, and achieves near-linear speedup on a distributed cluster.
Key words:
- Gradient descent
- Machine learning
- Distributed cluster
- Adaptive sampling
- Variance reduction
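The adaptive sampling idea described in the abstract (a worker that finishes faster than the others draws more samples in its next iteration) can be illustrated with a minimal sketch. The abstract does not give the actual rule, so the proportional scaling, the clipping bound `max_scale`, and the names `next_batch_size`, `node_time`, and `mean_time` are all hypothetical and are not taken from the paper.

```python
def next_batch_size(base_size, node_time, mean_time, max_scale=4.0):
    """Pick the number of samples a worker draws in its next iteration.

    Hypothetical rule (not from the paper): scale the base size by how much
    faster this node was than the average node, never shrink below the base
    size, and cap the growth at `max_scale`.
    """
    if node_time <= 0 or mean_time <= 0:
        raise ValueError("round times must be positive")
    # Ratio > 1 means this node finished faster than the average node.
    scale = mean_time / node_time
    # Slow nodes keep the base size; fast nodes grow up to the cap.
    scale = min(max(scale, 1.0), max_scale)
    return int(round(base_size * scale))
```

Under this assumed rule, a node that took half the average time would draw twice as many samples next time, which keeps it busy computing its local gradient instead of waiting for slower nodes.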
Table 1 Variance reduction algorithm based on the average of historical gradients
Input: learning rate $\lambda$.
Output: $\omega$ and $g$ for the next epoch.
(1) Initialize $\omega$ using plain SGD for 1 epoch;
(2) while not converged do
(3)   $\overline \omega \leftarrow 0$;
(4)   $\overline g \leftarrow 0$;
(5)   for $t = 0, 1, \cdots, T$ do
(6)     Randomly sample ${i_t} \in \{ 1,2, \cdots ,m\}$ without replacement;
(7)     $\overline \omega \leftarrow \frac{1}{m}\sum\limits_{i = 1}^m {\omega _i}$;
(8)     $\overline g \leftarrow \frac{1}{m}\sum\limits_{i = 1}^m {\nabla {f_i}({\omega _i})}$;
(9)     Update $g$ and $\omega$ using Eq. (1) and Eq. (3), respectively;
(10)   end
(11) end
Table 2 Pseudocode of the DisSAGD algorithm
Input: ${D_1},{D_2}, \cdots ,{D_p}$.
Output: ${\omega _1},{\omega _2}, \cdots ,{\omega _p}$.
(1) Initialize ${\omega _p}$;
(2) for $s = 1, 2, \cdots, N$ do
(3)   for each node $k \in \{ 1,2, \cdots ,p\}$ do
(4)     Call subroutine avr_hg($\lambda$);
(5)   end
(6)   Return all ${\omega _{k,t}}$ and ${f_{k,t}}(\omega )$ from node $k$ to the server;
(7)   Compute ${\omega ^{s - 1}} = \dfrac{1}{k}\sum\limits_{i = 1}^k {\omega _i}$ and $f(\omega )$ using Eq. (4) on the server side;
(8)   ${\omega ^s} = {\omega ^{s - 1}} - {\lambda _{s - 1}}\dfrac{1}{k}\sum\limits_{i = 1}^k {\nabla {f_i}(\omega )}$;
(9) end
(10) ${\omega _p} \leftarrow {\omega ^s}$;
Table 3 F1 scores of the models
| Method | SVHN | Cifar100 | KDDcup99 |
| --- | --- | --- | --- |
| PetuumSGD | 0.9547 | 0.6684 | 0.9335 |
| SSGD | 0.9675 | 0.6691 | 0.9471 |
| DisSVRG | 0.9688 | 0.6776 | 0.9441 |
| ASD-SVRG | 0.9831 | 0.6834 | 0.9512 |
| DisSAGD | 0.9827 | 0.6902 | 0.9508 |
| DisSAGD-tricks | 0.9982 | 0.7086 | 0.9588 |
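The control flow of Table 1 and Table 2 can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: Eq. (1), Eq. (3), and Eq. (4) are not reproduced in this section, so the per-sample update is passed in as a hypothetical `update` callable whose default is a plain SGD step; the averages in steps (7) and (8) of Table 1 are assumed to run over the iterates and gradients seen so far in the epoch; the loss is assumed to be least squares; and the server step takes whatever gradients the nodes report. The function names `local_epoch` and `server_step` are also assumptions.

```python
import numpy as np


def local_epoch(w, X, y, lr, rng, update=None):
    """One pass over a node's data, following Table 1, steps (3)-(10).

    `update(w, grad_i, w_bar, g_bar, lr)` is a hypothetical stand-in for
    Eq. (1) and Eq. (3); the default below is a plain SGD step.
    """
    if update is None:
        update = lambda w, grad_i, w_bar, g_bar, lr: w - lr * grad_i
    w = np.asarray(w, dtype=float)
    w_bar = np.zeros_like(w)  # step (3)
    g_bar = np.zeros_like(w)  # step (4)
    # Step (6): visit every sample once, in random order (no replacement).
    for t, i in enumerate(rng.permutation(len(y)), start=1):
        # Assumed loss: f_i(w) = 0.5 * (x_i . w - y_i)^2.
        grad_i = (X[i] @ w - y[i]) * X[i]
        w_bar += (w - w_bar) / t        # step (7), kept as a running mean
        g_bar += (grad_i - g_bar) / t   # step (8), kept as a running mean
        w = update(w, grad_i, w_bar, g_bar, lr)  # step (9)
    return w, w_bar, g_bar


def server_step(worker_params, worker_grads, lr):
    """Table 2, steps (7)-(8): average the node parameters, then take one
    step along the average of the gradients reported by the nodes."""
    w_avg = np.mean(worker_params, axis=0)
    g_avg = np.mean(worker_grads, axis=0)
    return w_avg - lr * g_avg
```

A synchronous round would call `local_epoch` once per node on its shard $D_k$ and pass the returned parameters and averaged gradients to `server_step`. The method described in the abstract shares parameters asynchronously, which this sketch does not model.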
