sgd优化算法的各种变体

背景
优化算法

背景

2017年前，公司内部的算法团队还都是使用XGBoost来训练模型，手动构造的特征已经几百个，特征迭代效果微弱，但在搜索推荐场景下，有大规模的离散特征，这类特征记忆效果非常好，如果加入模型训练会获得不错的效果提升，但树模型并不适合大规模离散特征，所以我开发了一个基于Parameter Server架构的分布式机器学习框架，主要支持大规模离散的浅层模型，比如Logistic Regression、Factorization Machine、Field-aware Factorization Machine分类模型以及对应的回归模型和SVD分解。这个机器学习框架使用Yarn调度在公司的大数据集群上，在线上取得了非常不错的收益，框架后续又开始朝着深度模型和在线学习演化，目前公司算法团队已经基本往大规模离散DNN迁移完毕。这里主要记录一下训练框架支持的一些优化算法，公式脑子只能记个大概，还是写下来方便以后查阅。

窄的深度模型 -> 宽的浅层模型 -> 又宽又深的模型 -> 秒级在线更新的又宽又深的模型

优化算法

SGD with Momentum
Nesterov accelerated gradient
SGD-L1(Cumulative)
AdaGrad
RMSProp
AdaDelta
Adam
AdamW
FTRL
FTML

SGD with Momentum

超参数：

学习率
惯性项

更新公式：

Nesterov accelerated gradient

超参数：

学习率
惯性项

更新公式：

SGD-L1(Cumulative)

超参数：

学习率
学习率衰减
学习率衰减步长
L1正则化项

更新公式：

AdaGrad

超参数：

学习率

更新公式：

RMSProp

超参数：

惯性项
学习率

更新公式：

AdaDelta

超参数：

惯性项

更新公式：

Adam

超参数：

学习率
惯性项
惯性项
衰减率

更新公式：

AdamW

超参数：

学习率
惯性项
惯性项
衰减率
正则项

更新公式：

FTRL

超参数：

L1正则化项
L2正则化项
学习率控制
学习率控制

更新公式：

FTML

超参数：

惯性项控制
惯性项控制
学习率

更新公式：

参考文献

Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence o(1/k2). Doklady ANSSSR (translated as Soviet.Math.Docl.), vol. 269, pp. 543– 547.
Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks : The Official Journal of the International Neural Network Society, 12(1), 145–151.
Tsuruoka, Yoshimasa, Jun'ichi Tsujii, and Sophia Ananiadou. "Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty." Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1. Association for Computational Linguistics, 2009.
Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12.Jul (2011): 2121-2159.
Tieleman T., Hinton G. Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 26–31 (2012).
Zeiler, Matthew D. "ADADELTA: an adaptive learning rate method." arXiv preprint arXiv:1212.5701 (2012).
Kingma, Diederik, and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
Ilya Loshchilov, Frank Hutter. "Decoupled Weight Decay Regularization." arXiv preprint arXiv:1711.05101 (2017).
H. Brendan McMahan & M Streter. Adaptive Bound Optimization for Online Convex Optimization. In COLT, 2010
H. Brendan McMahan. Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. In AISTATS, 2011
H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, Jeremy Kubica, Ad Click Prediction: a View from the Trenches. In ACM SIGKDD, 2013
Shuai Zheng, James T. Kwok. Follow the Moving Leader in Deep Learning. The 34th International Conference on Machine Learning (ICML), Sydney, Australia, August 2017

Blog