Background

Some time ago I built a Parameter Server-based distributed machine learning system at work. It supports multiple synchronization modes: asynchronous ASP, synchronous BSP, and semi-synchronous SSP. In CTR-prediction workloads, however, ASP is the mode we actually use day to day, and the other two had never been benchmarked. This experiment compares all three.

Experiment Setup

  • Training: 90 days of online data, [day-89, day]; validation: the single day [day+1]
  • ~400 million training examples; features in the tens of millions
  • Cluster: 50 workers + 20 parameter servers on YARN, 8 cores / 4 GB per node
  • Task: CTR prediction with the simplest possible LR model
  • Optimizer: FTRL (see the sketch after this list)
  • Runs differ only in synchronization mode; all other settings are identical
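
For reference, FTRL here means the per-coordinate FTRL-Proximal algorithm commonly paired with sparse LR. Below is a minimal single-machine sketch; the hyperparameter values (alpha, beta, l1, l2) are illustrative defaults, not the settings used in these runs:

```python
import math
from collections import defaultdict

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression, stored sparsely.

    Illustrative sketch only; hyperparameters are placeholders, not the
    settings used in the experiments above.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # per-feature accumulated adjusted gradient
        self.n = defaultdict(float)  # per-feature accumulated squared gradient

    def weight(self, i):
        """Lazily derive w_i from (z_i, n_i); the L1 term yields exact zeros."""
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, x):
        """x is a sparse example {feature_id: value}; returns sigmoid(w . x)."""
        s = sum(self.weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, p, y):
        """One update for logistic loss, where g_i = (p - y) * x_i."""
        for i, v in x.items():
            g = (p - y) * v
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self.weight(i)
            self.n[i] += g * g
```

A training step is then `p = model.predict(x)` followed by `model.update(x, p, y)`; in the PS setting, the z and n tables would typically live on the servers while workers push per-feature gradients.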

Results

ASP mode

19:48:18 epoch: 1, train-logloss: 0.0627267, train-auc: 0.690643, valid-logloss: 0.0555645, valid-auc: 0.698787
19:54:11 epoch: 2, train-logloss: 0.0622153, train-auc: 0.704938, valid-logloss: 0.0554252, valid-auc: 0.702617
20:00:21 epoch: 3, train-logloss: 0.0620013, train-auc: 0.710765, valid-logloss: 0.0553471, valid-auc: 0.704912
20:06:43 epoch: 4, train-logloss: 0.0618459, train-auc: 0.714920, valid-logloss: 0.0552953, valid-auc: 0.706354
20:12:43 epoch: 5, train-logloss: 0.0617195, train-auc: 0.718254, valid-logloss: 0.0552550, valid-auc: 0.707505
20:18:48 epoch: 6, train-logloss: 0.0616111, train-auc: 0.721076, valid-logloss: 0.0552250, valid-auc: 0.708359
20:24:47 epoch: 7, train-logloss: 0.0615152, train-auc: 0.723545, valid-logloss: 0.0552006, valid-auc: 0.709060
20:28:27 epoch: 8, train-logloss: 0.0614288, train-auc: 0.725751, valid-logloss: 0.0551802, valid-auc: 0.709684

BSP mode

19:55:23 epoch: 1, train-logloss: 0.0627236, train-auc: 0.690612, valid-logloss: 0.0555393, valid-auc: 0.699665
20:06:58 epoch: 2, train-logloss: 0.0622139, train-auc: 0.704972, valid-logloss: 0.0554050, valid-auc: 0.703530
20:20:43 epoch: 3, train-logloss: 0.0620007, train-auc: 0.710783, valid-logloss: 0.0553289, valid-auc: 0.705700
20:29:27 epoch: 4, train-logloss: 0.0618450, train-auc: 0.714948, valid-logloss: 0.0552777, valid-auc: 0.707155
20:39:05 epoch: 5, train-logloss: 0.0617186, train-auc: 0.718281, valid-logloss: 0.0552391, valid-auc: 0.708264
20:48:00 epoch: 6, train-logloss: 0.0616103, train-auc: 0.721099, valid-logloss: 0.0552086, valid-auc: 0.709087
20:57:18 epoch: 7, train-logloss: 0.0615145, train-auc: 0.723561, valid-logloss: 0.0551841, valid-auc: 0.709804
21:09:38 epoch: 8, train-logloss: 0.0614279, train-auc: 0.725770, valid-logloss: 0.0551629, valid-auc: 0.710393

SSP mode (threshold=50)

20:00:14 epoch: 1, train-logloss: 0.0627070, train-auc: 0.690925, valid-logloss: 0.0555584, valid-auc: 0.698937
20:08:23 epoch: 2, train-logloss: 0.0622105, train-auc: 0.705031, valid-logloss: 0.0554257, valid-auc: 0.702801
20:19:17 epoch: 3, train-logloss: 0.0619988, train-auc: 0.710817, valid-logloss: 0.0553514, valid-auc: 0.704902
20:27:04 epoch: 4, train-logloss: 0.0618438, train-auc: 0.714971, valid-logloss: 0.0553002, valid-auc: 0.706375
20:38:13 epoch: 5, train-logloss: 0.0617177, train-auc: 0.718299, valid-logloss: 0.0552611, valid-auc: 0.707476
20:48:29 epoch: 6, train-logloss: 0.0616096, train-auc: 0.721115, valid-logloss: 0.0552307, valid-auc: 0.708315
21:00:22 epoch: 7, train-logloss: 0.0615140, train-auc: 0.723575, valid-logloss: 0.0552069, valid-auc: 0.709004
21:18:24 epoch: 8, train-logloss: 0.0614276, train-auc: 0.725779, valid-logloss: 0.0551862, valid-auc: 0.709557

SSP mode (threshold=20)

20:38:15 epoch: 1, train-logloss: 0.0627076, train-auc: 0.690827, valid-logloss: 0.0555455, valid-auc: 0.699352
20:46:12 epoch: 2, train-logloss: 0.0622098, train-auc: 0.705044, valid-logloss: 0.0554167, valid-auc: 0.703117
20:55:21 epoch: 3, train-logloss: 0.0619982, train-auc: 0.710830, valid-logloss: 0.0553428, valid-auc: 0.705237
21:06:25 epoch: 4, train-logloss: 0.0618434, train-auc: 0.714978, valid-logloss: 0.0552917, valid-auc: 0.706651
21:20:52 epoch: 5, train-logloss: 0.0617175, train-auc: 0.718303, valid-logloss: 0.0552538, valid-auc: 0.707730
21:32:35 epoch: 6, train-logloss: 0.0616095, train-auc: 0.721119, valid-logloss: 0.0552240, valid-auc: 0.708575
21:45:32 epoch: 7, train-logloss: 0.0615138, train-auc: 0.723582, valid-logloss: 0.0552001, valid-auc: 0.709290
21:56:57 epoch: 8, train-logloss: 0.0614273, train-auc: 0.725787, valid-logloss: 0.0551785, valid-auc: 0.709892

SSP mode (threshold=10)

21:21:47 epoch: 1, train-logloss: 0.0627156, train-auc: 0.690705, valid-logloss: 0.0555498, valid-auc: 0.699055
21:32:30 epoch: 2, train-logloss: 0.0622118, train-auc: 0.705017, valid-logloss: 0.0554154, valid-auc: 0.702878
21:44:33 epoch: 3, train-logloss: 0.0619991, train-auc: 0.710818, valid-logloss: 0.0553408, valid-auc: 0.704996
21:54:42 epoch: 4, train-logloss: 0.0618440, train-auc: 0.714974, valid-logloss: 0.0552890, valid-auc: 0.706415
22:03:06 epoch: 5, train-logloss: 0.0617179, train-auc: 0.718299, valid-logloss: 0.0552505, valid-auc: 0.707568
22:11:31 epoch: 6, train-logloss: 0.0616097, train-auc: 0.721113, valid-logloss: 0.0552197, valid-auc: 0.708433
22:19:22 epoch: 7, train-logloss: 0.0615140, train-auc: 0.723578, valid-logloss: 0.0551951, valid-auc: 0.709113
22:29:30 epoch: 8, train-logloss: 0.0614275, train-auc: 0.725781, valid-logloss: 0.0551743, valid-auc: 0.709645

Analysis

The logs show that BSP does beat ASP, but the gap is tiny, confined to the fourth decimal place (at epoch 8, valid-auc 0.710393 vs. 0.709684); the gap may of course differ on other data. On speed, BSP needs roughly 10 minutes per epoch while ASP needs only about 6. Moreover, this run was done while the cluster was fairly idle; the ML jobs share the cluster with Spark, MapReduce, and other big-data workloads, so under heavy load stragglers would likely get worse and BSP would presumably slow down further. Weighing quality against speed, there is no real case for using BSP here.
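
To make that trade-off concrete: under ASP the servers apply each worker's gradient the moment it arrives, while under BSP they hold gradients at a barrier until every worker has reported. A toy single-process sketch of the two server-side policies (the class and method names are made up for illustration, not our system's API):

```python
class ToyServer:
    """Applies pushed gradients under either ASP or BSP semantics."""

    def __init__(self, dim, num_workers, mode="asp", lr=0.1):
        self.w = [0.0] * dim          # the shared model shard
        self.num_workers = num_workers
        self.mode = mode
        self.lr = lr
        self.pending = []             # BSP only: gradients held at the barrier

    def push(self, grad):
        if self.mode == "asp":
            # Apply immediately; other workers may read a half-updated model,
            # which is the source of ASP's (here negligible) quality loss.
            self._apply(grad)
        else:
            # Buffer until every worker has reported; in a real system the
            # fast workers block here, which is where BSP loses its speed.
            self.pending.append(grad)
            if len(self.pending) == self.num_workers:
                for g in self.pending:
                    self._apply(g)
                self.pending.clear()

    def _apply(self, grad):
        for i, g in enumerate(grad):
            self.w[i] -= self.lr * g
```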

For SSP with a threshold of 50, quality is essentially the same as ASP's, yet it is still slower, roughly as slow as BSP; presumably a straggler was holding things up. As the threshold shrinks, SSP's quality gradually improves and approaches BSP's. Each configuration was run only once, so the numbers carry some noise.
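
The SSP threshold is a bound on clock skew: a worker may run ahead only while the slowest worker is within `threshold` iterations of it, so threshold=0 degenerates to BSP and an unbounded threshold to ASP. A sketch of the admission check (names are illustrative):

```python
def can_proceed(my_clock, all_clocks, threshold):
    """True if this worker may start its next iteration without waiting."""
    return my_clock - min(all_clocks) <= threshold

# One straggler sits at iteration 75 while the fastest worker is at 120:
clocks = [120, 118, 75]
print(can_proceed(120, clocks, 50))   # True  -> threshold=50 behaves ASP-like
print(can_proceed(120, clocks, 10))   # False -> threshold=10 pushes toward BSP
```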

Finally, this test was done on a CTR model, which is extremely sparse, so conflicting gradient updates are rare and ASP shows no clear loss versus BSP. In the image domain, models are dense, and All-Reduce, a fully synchronous scheme akin to BSP, is the standard way to aggregate gradients and update parameters. How ASP fares on dense models is something to try when time permits.
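
What makes All-Reduce BSP-like is its semantics: every worker contributes its gradient and every worker receives the identical averaged result, so all replicas apply the same update in lockstep. A didactic sketch of those semantics (not an actual ring/NCCL implementation):

```python
def all_reduce_mean(grads):
    """grads holds one gradient vector per worker; returns their element-wise
    mean, which in a real All-Reduce every worker would receive."""
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]

worker_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(all_reduce_mean(worker_grads))  # [3.0, 4.0], identical on every worker
```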