Communication-Efficient Distributed Training at Scale

The Challenge: Modern machine learning requires training on massive datasets across distributed systems, where the cost of communication between machines often dominates computation and limits scalability.

My Approach: I develop methods that integrate communication compression with error compensation mechanisms to mitigate communication bottlenecks while preserving convergence guarantees.
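As a minimal illustration of this idea, the sketch below simulates an EF21-style error-feedback step in NumPy on a single machine: each worker keeps a running gradient estimate and communicates only a compressed correction to it. The Top-K compressor, the toy quadratic objectives, the step size, and names such as topk_compress and ef21_step are illustrative assumptions, not the exact setup from the papers.

```python
import numpy as np

def topk_compress(v, k):
    """Top-K compressor: keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21_step(x, g_workers, local_grads, gamma, k):
    """One EF21-style round: workers send only compressed corrections to their gradient estimates."""
    n = len(g_workers)
    # Server step: move along the current aggregated gradient estimate.
    g = sum(g_workers) / n
    x_new = x - gamma * g
    # Each worker compresses the difference between its fresh gradient and its running estimate.
    for i in range(n):
        grad_i = local_grads[i](x_new)
        c_i = topk_compress(grad_i - g_workers[i], k)   # only c_i is communicated
        g_workers[i] = g_workers[i] + c_i               # error-feedback state update
    return x_new, g_workers

# Toy usage: n workers, each with a quadratic objective f_i(x) = 0.5 * ||A_i x - b_i||^2.
rng = np.random.default_rng(0)
d, n, k = 20, 4, 3
A = [rng.standard_normal((30, d)) for _ in range(n)]
b = [rng.standard_normal(30) for _ in range(n)]
local_grads = [lambda x, A=A[i], b=b[i]: A.T @ (A @ x - b) for i in range(n)]

x = np.zeros(d)
g_workers = [local_grads[i](x) for i in range(n)]  # initialize gradient estimates at x0
for t in range(1000):
    x, g_workers = ef21_step(x, g_workers, local_grads, gamma=0.001, k=k)
print("gradient norm:", np.linalg.norm(sum(g(x) for g in local_grads) / n))
```

Even though each round transmits only k of d coordinates per worker, the running estimates g_i absorb the compression error over time, which is the mechanism behind the convergence guarantees referenced above.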

Key Contributions:

  • EF21: a simple, theoretically strong, and practically fast error-feedback method with optimal communication complexity (NeurIPS 2021, Oral).
  • Six extensions and a comprehensive study of error feedback for modern systems (JMLR 2025).
  • Momentum provably improves error feedback, yielding stability benefits and linear speed-ups (NeurIPS 2023).
  • Safe-EF: communication-efficient methods for nonsmooth constrained optimization in safety-critical settings (ICML 2025).

Impact: The EF21 line of work is now a foundation for communication-efficient federated learning systems.

Figure: Distributed safe reinforcement learning of humanoid agents (federated learning illustration).


Research Impact

The EF21 algorithm and its extensions have become fundamental building blocks for communication-efficient distributed machine learning, enabling large-scale training without sacrificing theoretical guarantees or practical performance.