  • SGD with Momentum
  • Nesterov accelerated gradient
  • SGD-L1(Cumulative)
  • AdaGrad
  • AdaDelta
  • AdaM
  • FTRL
  • FTML


SGD with Momentum

Hyperparameters:

  • Learning rate $\eta \in (0, +\infty) \quad [default=0.001]$
  • Momentum term $\rho \in (0, 1) \quad [default=0.9]$

Update rule:
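
A common way to write the heavy-ball momentum update (Qian, 1999) with the hyperparameters above, where $g_t$ is the stochastic gradient at step $t$ and $v_0 = 0$:

$$
\begin{aligned}
v_t &= \rho\, v_{t-1} + \eta\, g_t \\
w_t &= w_{t-1} - v_t
\end{aligned}
$$

Some implementations accumulate $v_t = \rho\, v_{t-1} + g_t$ and scale by $\eta$ in the weight update instead; the two forms coincide for a constant learning rate.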

Nesterov accelerated gradient

Hyperparameters:

  • Learning rate $\eta \in (0, +\infty) \quad [default=0.001]$
  • Momentum term $\rho \in (0, 1) \quad [default=0.9]$

Update rule:
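
A common stochastic-training formulation of Nesterov's accelerated gradient (Nesterov, 1983), in which the gradient is evaluated at the look-ahead point $w_{t-1} - \rho\, v_{t-1}$:

$$
\begin{aligned}
v_t &= \rho\, v_{t-1} + \eta\, \nabla_w \ell\big(w_{t-1} - \rho\, v_{t-1}\big) \\
w_t &= w_{t-1} - v_t
\end{aligned}
$$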

SGD-L1(Cumulative)

Hyperparameters:

  • Learning rate $\eta \in (0, +\infty) \quad [default=0.001]$
  • Learning-rate decay $\alpha \in (0, 1) \quad [default=0.9]$
  • Learning-rate decay step $\tau \in (1, +\infty) \quad [default=100]$
  • L1 regularization $\lambda_1 \in [0, +\infty) \quad [default=0.001]$

Update rule:
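
A sketch of the cumulative-penalty update of Tsuruoka et al. (2009). The step-size schedule below (decay by $\alpha$ every $\tau$ steps) is an assumption about how $\alpha$ and $\tau$ are used; $u_t$ is the total L1 penalty that could have been applied so far and $q_i$ the penalty actually applied to coordinate $i$, both initialized to 0:

$$
\begin{aligned}
\eta_t &= \eta\, \alpha^{\lfloor t/\tau \rfloor} \\
u_t &= u_{t-1} + \eta_t\, \lambda_1 \\
\tilde w_i &= w_i^{t-1} - \eta_t\, g_{t,i} \\
w_i^{t} &=
\begin{cases}
\max\!\big(0,\ \tilde w_i - (u_t + q_i)\big) & \text{if } \tilde w_i > 0 \\
\min\!\big(0,\ \tilde w_i + (u_t - q_i)\big) & \text{if } \tilde w_i < 0
\end{cases} \\
q_i &\leftarrow q_i + \big(w_i^{t} - \tilde w_i\big)
\end{aligned}
$$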

AdaGrad

Hyperparameters:

  • Learning rate $\eta \in (0, +\infty) \quad [default=0.001]$

Update rule:
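
The per-coordinate AdaGrad update (Duchi et al., 2011); $\epsilon$ is a small stability constant (e.g. $10^{-8}$) that is not listed among the hyperparameters above:

$$
\begin{aligned}
G_t &= G_{t-1} + g_t \odot g_t \\
w_t &= w_{t-1} - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t
\end{aligned}
$$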

AdaDelta

Hyperparameters:

  • Momentum term $\rho \in [0, 1) \quad [default=0.95]$

Update rule:
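
The AdaDelta update (Zeiler, 2012), which tracks running averages of squared gradients and squared updates so that no explicit learning rate is needed; $\epsilon$ is a small stability constant (Zeiler uses $10^{-6}$), not listed above:

$$
\begin{aligned}
E[g^2]_t &= \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2 \\
\Delta w_t &= -\frac{\sqrt{E[\Delta w^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t \\
E[\Delta w^2]_t &= \rho\, E[\Delta w^2]_{t-1} + (1-\rho)\, \Delta w_t^2 \\
w_t &= w_{t-1} + \Delta w_t
\end{aligned}
$$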

AdaM

Hyperparameters:

  • Learning rate $\eta \in (0, +\infty) \quad [default=0.001]$
  • Momentum term $\rho_1 \in [0, 1) \quad [default=0.9]$
  • Momentum term $\rho_2 \in [0, 1) \quad [default=0.999]$
  • Decay rate $\lambda \in [0, 1] \quad [default=1-10^{-8}]$

Update rule:
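
A sketch of the Adam update (Kingma & Ba, 2014) with the hyperparameters above, where the decay $\lambda$ shrinks the first-moment coefficient over time as in the paper's convergence analysis; $\epsilon$ (typically $10^{-8}$) is a small stability constant not listed above:

$$
\begin{aligned}
\rho_{1,t} &= \rho_1\, \lambda^{\,t-1} \\
m_t &= \rho_{1,t}\, m_{t-1} + (1-\rho_{1,t})\, g_t \\
v_t &= \rho_2\, v_{t-1} + (1-\rho_2)\, g_t^2 \\
\hat m_t &= \frac{m_t}{1-\rho_1^t}, \qquad \hat v_t = \frac{v_t}{1-\rho_2^t} \\
w_t &= w_{t-1} - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}
\end{aligned}
$$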

FTRL

Hyperparameters:

  • L1 regularization $\lambda_1 \in [0, +\infty) \quad [default=0.001]$
  • L2 regularization $\lambda_2 \in [0, +\infty) \quad [default=0.001]$
  • Learning-rate control $\alpha \in (0, +\infty) \quad [default=0.01]$
  • Learning-rate control $\beta \in [0, +\infty) \quad [default=1.0]$

Update rule:
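
A sketch of the per-coordinate FTRL-Proximal update (McMahan et al., 2013), where $n_i$ accumulates squared gradients and $z_i$ accumulates adjusted gradients for coordinate $i$, both initialized to 0:

$$
\begin{aligned}
\sigma_i &= \frac{\sqrt{n_i + g_{t,i}^2} - \sqrt{n_i}}{\alpha} \\
z_i &\leftarrow z_i + g_{t,i} - \sigma_i\, w_{t,i} \\
n_i &\leftarrow n_i + g_{t,i}^2 \\
w_{t+1,i} &=
\begin{cases}
0 & \text{if } |z_i| \le \lambda_1 \\
-\Big(\dfrac{\beta + \sqrt{n_i}}{\alpha} + \lambda_2\Big)^{-1}\big(z_i - \operatorname{sign}(z_i)\,\lambda_1\big) & \text{otherwise}
\end{cases}
\end{aligned}
$$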

FTML

Hyperparameters:

  • Momentum control $\rho_1 \in [0, 1) \quad [default=0.6]$
  • Momentum control $\rho_2 \in [0, 1) \quad [default=0.999]$
  • Learning rate $\eta \in (0, +\infty) \quad [default=0.001]$

Update rule:
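
A sketch of the Follow the Moving Leader update (Zheng & Kwok, 2017) with the hyperparameters above; $\epsilon$ is a small stability constant not listed above, and $v_0 = d_0 = z_0 = 0$:

$$
\begin{aligned}
v_t &= \rho_2\, v_{t-1} + (1-\rho_2)\, g_t^2 \\
d_t &= \frac{1-\rho_1^t}{\eta}\left(\sqrt{\frac{v_t}{1-\rho_2^t}} + \epsilon\right) \\
\sigma_t &= d_t - \rho_1\, d_{t-1} \\
z_t &= \rho_1\, z_{t-1} + (1-\rho_1)\, g_t - \sigma_t\, w_{t-1} \\
w_t &= -\frac{z_t}{d_t}
\end{aligned}
$$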

References

  • Nesterov, Y. "A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2)." Doklady AN SSSR (translated as Soviet Math. Dokl.), vol. 269, 1983, pp. 543-547.
  • Qian, N. "On the momentum term in gradient descent learning algorithms." Neural Networks 12.1 (1999): 145-151.
  • Tsuruoka, Yoshimasa, Jun'ichi Tsujii, and Sophia Ananiadou. "Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty." Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics, 2009.
  • Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12 (2011): 2121-2159.
  • Zeiler, Matthew D. "ADADELTA: An adaptive learning rate method." arXiv preprint arXiv:1212.5701 (2012).
  • Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
  • McMahan, H. Brendan, and Matthew Streeter. "Adaptive bound optimization for online convex optimization." COLT, 2010.
  • McMahan, H. Brendan. "Follow-the-Regularized-Leader and mirror descent: Equivalence theorems and L1 regularization." AISTATS, 2011.
  • McMahan, H. Brendan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. "Ad click prediction: A view from the trenches." ACM SIGKDD, 2013.
  • Zheng, Shuai, and James T. Kwok. "Follow the moving leader in deep learning." ICML, 2017.