The Challenge: Modern machine learning requires training on massive datasets across distributed systems, where communication costs often dominate computational costs and can limit scalability.
My Approach: I develop methods that integrate communication compression with error compensation mechanisms to mitigate communication bottlenecks while preserving convergence guarantees.
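To make the approach concrete, below is a minimal NumPy sketch of an EF21-style error-compensated update with a top-k compressor; the compressor choice, function names, and single-process simulation of the workers are illustrative assumptions, not the exact implementation used in the papers.

```python
import numpy as np

def topk(v, k):
    # Contractive compressor (illustrative choice): keep the k largest-magnitude
    # coordinates of v and zero out the rest.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21_step(x, g_states, grads, lr, k):
    # One EF21-style round, simulated on a single machine.
    # x        : current model parameters
    # g_states : per-worker gradient estimates g_i, mirrored on workers and server
    # grads    : fresh local gradients of each worker at x
    new_states = []
    for g_i, grad_i in zip(g_states, grads):
        c_i = topk(grad_i - g_i, k)       # the only message a worker sends upstream
        new_states.append(g_i + c_i)      # worker and server apply the same correction
    g_avg = np.mean(new_states, axis=0)   # server averages the maintained estimates
    return x - lr * g_avg, new_states
```

The key design point the sketch illustrates is that each worker and the server maintain identical estimates g_i, so only the compressed correction needs to be communicated each round while the error left behind by compression is carried forward rather than discarded.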
Empirical Insight: Figure 1 shows the distributed reinforcement learning deployments that motivate this line of work; Figure 2 shows how adding momentum addresses the stochastic limitations of EF21 and restores stability; and Figure 3 demonstrates Safe-EF's scaling with worker count, which keeps communication budgets in check.
Key Contributions:
- EF21, an error-feedback mechanism that combines communication compression with error compensation while preserving convergence guarantees.
- EF21-SGDM, a momentum-enhanced variant that addresses EF21's limitations under stochastic gradients and remains stable near the optimum (Figure 2).
- Safe-EF, which brings error feedback to distributed safe reinforcement learning and reduces communication cost as the number of workers grows (Figures 1 and 3).
Impact: The EF21 line of work is now a foundation for communication-efficient federated learning systems.
The EF21 algorithm and its extensions have become fundamental building blocks for communication-efficient distributed machine learning, enabling large-scale training while maintaining theoretical guarantees and practical performance.
Figure 1. Distributed safe reinforcement learning of humanoid agents across multi-node clusters communicating through compressed updates.
Figure 2. EF21-SGD diverges under simple stochastic noise when using aggressive compression, whereas the momentum-enhanced EF21-SGDM stays stable near the optimum and remains efficient on more challenging tasks.
Figure 3. Safe-EF convergence across worker counts: more workers trim communication cost even as gains taper, highlighting the method's scalability.
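A rough sketch of the momentum fix highlighted in Figure 2, reusing the illustrative topk helper from the sketch above: each worker smooths its stochastic gradient with a momentum buffer before the error-feedback compression, which damps the gradient noise that destabilizes plain EF21-SGD. Parameter names and the exact form of the momentum update are assumptions for illustration.

```python
def ef21_sgdm_step(x, g_states, v_states, stoch_grads, lr, k, eta):
    # Momentum-enhanced EF21 round (sketch): compress the change in a momentum
    # estimate v_i rather than the raw stochastic gradient.
    new_g, new_v = [], []
    for g_i, v_i, grad_i in zip(g_states, v_states, stoch_grads):
        v_next = (1.0 - eta) * v_i + eta * grad_i   # momentum buffer smooths gradient noise
        c_i = topk(v_next - g_i, k)                 # compressed correction sent to the server
        new_g.append(g_i + c_i)
        new_v.append(v_next)
    return x - lr * np.mean(new_g, axis=0), new_g, new_v
```

Setting eta = 1 removes the momentum buffer and recovers the plain stochastic EF21 step, the setting in which Figure 2 shows divergence under aggressive compression.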