OSP: Overlapping Computation and Communication in Parameter Server for Fast Machine Learning – Prof. Song Guo (COMP)
Description:
Our goal is to accelerate large-scale model training under the Parameter Server architecture, in which the training data are partitioned across multiple workers and a global machine learning model is trained cooperatively under the coordination of a server. Each worker is equipped with a GPU, and training runs on multiple GPUs to exploit their considerable computing resources. However, as the number of GPUs grows, the communication overhead of aggregating updates and synchronizing the model among GPUs becomes the bottleneck. To overcome this bottleneck, we propose Overlapping Synchronization Parallelization (OSP), in which each worker exchanges information with the server while simultaneously running computation on its GPU in a non-stop manner, as sketched below. To evaluate OSP, we use multiple GPUs in UBDA to run distributed experiments on various datasets and machine learning models; the results demonstrate that our method significantly improves training efficiency.
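As an illustration of the overlapping idea only (not the paper's exact OSP protocol), the following minimal single-process Python sketch simulates one worker that keeps taking gradient steps while a background thread exchanges its accumulated update with a simulated server. The Server class, the sync_every interval, and the rule of replaying pending local progress onto the returned global model are all illustrative assumptions.

    import threading
    import queue
    import numpy as np

    # Toy least-squares problem standing in for a real model (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 10))
    w_true = rng.normal(size=10)
    y = X @ w_true

    def gradient(w):
        # Gradient of the mean squared error 0.5 * ||Xw - y||^2 / n.
        return X.T @ (X @ w - y) / len(y)

    class Server:
        # Simulated parameter server: aggregates worker updates into a global model.
        def __init__(self, dim):
            self.w = np.zeros(dim)
            self.lock = threading.Lock()

        def exchange(self, update):
            # Apply the worker's accumulated update, hand back the global model.
            with self.lock:
                self.w += update
                return self.w.copy()

    def worker(server, steps=200, lr=0.1, sync_every=10):
        w = server.exchange(np.zeros(10))  # fetch the initial global model
        pending = np.zeros(10)             # local progress not yet sent to server
        results = queue.Queue()
        comm = None

        for t in range(steps):
            # Kick off communication in a background thread, then keep computing.
            if t % sync_every == 0 and comm is None:
                outgoing, pending = pending, np.zeros(10)
                comm = threading.Thread(
                    target=lambda u: results.put(server.exchange(u)),
                    args=(outgoing,))
                comm.start()
            # Computation continues without blocking on the exchange.
            step = -lr * gradient(w)
            w += step
            pending += step
            # When the fresh global model arrives, replay local progress onto it.
            if comm is not None and not results.empty():
                w = results.get() + pending
                comm.join()
                comm = None

        if comm is not None:
            comm.join()
        return w

    w = worker(Server(10))
    print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))

In the actual system the exchange would be network communication and the local step a GPU computation; the point of the sketch is only that the two proceed concurrently instead of the worker stalling at each synchronization.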
Reference:
- Haozhao Wang, Song Guo, and Ruixuan Li. OSP: Overlapping Computation and Communication in Parameter Server for Fast Machine Learning. In Proceedings of the 48th International Conference on Parallel Processing (ICPP 2019), Kyoto, Japan, August 5-8, 2019.