Anastasia Koloskova

About me

I am an assistant professor of AI and Optimization at the Department of Mathematical Modeling and Machine Learning at the University of Zurich. I am looking for PhD students and postdocs. If you are interested in pursuing a PhD with us, please visit the following link: PhD position. If you are interested in a postdoctoral position, please contact me directly via email.

My research interests include machine learning, optimization, and their intersections with decentralized and collaborative learning, and privacy. I am part of the ETH AI Center and the ELLIS society. Previously, I was a postdoctoral researcher at Stanford University at STAIR lab with Prof. Sanmi Koyejo. I completed my Ph.D. at EPFL, Switzerland in the laboratory of Machine Learning and Optimization (MLO), supervised by Prof. Martin Jaggi. My PhD was supported by a Google PhD Fellowship in Machine Learning. During my PhD I also spent some time at FAIR (Facebook AI Research), and Google Research. My PhD thesis received ELLIS PhD Awards and EPFL Thesis Distinction Award.

Publications
Selected Talks
Contact

News

• I am excited to announce that I will join the University of Zurich as an assistant professor in August this year!

I am looking for PhD students and postdocs . If you are interested in pursuing a PhD with us, please visit the following link: PhD position. If you are interested in a postdoctoral position, please contact me directly via email.

• 12/2024: I was honored to receive ELLIS PhD Awards and EPFL Thesis Distinction Award for my PhD thesis!

• 10/2024: I am honored to be selected as a Rising Star in EECS by MIT!

• 02/2024: I successfully defended my PhD thesis named "Optimization Algorithms for Decentralized, Distributed and Collaborative Machine Learning". Link to thesis

Publications

Certified Unlearning for Neural Networks
Anastasia Koloskova*, Youssef Allouah*, Animesh Jha, Rachid Guerraoui, Sanmi Koyejo
ICML 2025 • Paper • BibTex @inproceedings{ koloskova2025certified, title={Certified Unlearning for Neural Networks}, author={Anastasia Koloskova and Youssef Allouah and Animesh Jha and Rachid Guerraoui and Sanmi Koyejo}, booktitle={Forty-second International Conference on Machine Learning}, year={2025}, url={https://openreview.net/forum?id=3rWQlV3s1I} }

On Convergence of Incremental Gradient for Non-Convex Smooth Functions
Anastasia Koloskova, Nikita Doikov, Sebastian U. Stich, Martin Jaggi
ICML 2024 • Paper • BibTex @inproceedings{koloskova2024convergence, title={On Convergence of Incremental Gradient for Non-Convex Smooth Functions}, author={Anastasia Koloskova and Nikita Doikov and Sebastian U. Stich and Martin Jaggi}, year={2024}, booktitle={Proceedings of the 41st International Conference on Machine Learning}, }

The Privacy Power of Correlated Noise in Decentralized Learning
Youssef Allouah, Anastasia Koloskova, Aymane El Firdoussi, Martin Jaggi, Rachid Guerraoui
ICML 2024 • Paper • BibTex @inproceedings{allouah2024privacy, title={The Privacy Power of Correlated Noise in Decentralized Learning}, author={Youssef Allouah and Anastasia Koloskova and Aymane El Firdoussi and Martin Jaggi and Rachid Guerraoui}, year={2024}, booktitle={Proceedings of the 41st International Conference on Machine Learning}, }

Asynchronous SGD on Graphs: a Unified Framework for Asynchronous Decentralized and Federated Optimization
Mathieu Even, Anastasia Koloskova, Laurent Massoulié
AISTATS 2024 • Paper • BibTex @InProceedings{pmlr-v238-even24a, title = { Asynchronous {SGD} on Graphs: a Unified Framework for Asynchronous Decentralized and Federated Optimization }, author = {Even, Mathieu and Koloskova, Anastasia and Massoulie, Laurent}, booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics}, pages = {64--72}, year = {2024}, editor = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen}, volume = {238}, series = {Proceedings of Machine Learning Research}, month = {02--04 May}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v238/even24a/even24a.pdf}, url = {https://proceedings.mlr.press/v238/even24a.html}, abstract = { Decentralized and asynchronous communications are two popular techniques to speedup communication complexity of distributed machine learning, by respectively removing the dependency over a central orchestrator and the need for synchronization. Yet, combining these two techniques together still remains a challenge. In this paper, we take a step in this direction and introduce Asynchronous SGD on Graphs (AGRAF SGD) — a general algorithmic framework that covers asynchronous versions of many popular algorithms including SGD, Decentralized SGD, Local SGD, FedBuff, thanks to its relaxed communication and computation assumptions. We provide rates of convergence under much milder assumptions than previous decentralized asynchronous works, while still recovering or even improving over the best know results for all the algorithms covered. } }

Gradient Descent with Linearly Correlated Noise: Theory and Applications to Differential Privacy
Anastasia Koloskova, Ryan McKenna, Zachary Charles, Keith Rush, Brendan McMahan
NeurIPS 2023 • Paper • BibTex @inproceedings{ koloskova2023gradient, title={Gradient Descent with Linearly Correlated Noise: Theory and Applications to Differential Privacy}, author={Anastasia Koloskova and Ryan McKenna and Zachary Charles and J Keith Rush and Hugh Brendan McMahan}, booktitle={Thirty-seventh Conference on Neural Information Processing Systems}, year={2023}, url={https://openreview.net/forum?id=qCglMj6A4z} }

Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees
Anastasia Koloskova*, Hadrien Hendrikx*, Sebastian U Stich
ICML 2023 • Paper • BibTex @inproceedings{koloskova:clipping, TITLE = {{Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees}}, AUTHOR = {Koloskova, Anastasia and Hendrikx, Hadrien and Stich, Sebastian U.}, URL = {https://openreview.net/pdf?id=C3DXiFTrve}, BOOKTITLE = {{ICML 2023 - 40th International Conference on Machine Learning}}, ADDRESS = {Honolulu, Hawaii,, United States}, PAGES = {1-19}, YEAR = {2023}, MONTH = Jul, }

Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning
Anastasia Koloskova, Sebastian U Stich, Martin Jaggi
NeurIPS 2022 (Oral, Notable top 7%) • Paper • BibTex @inproceedings{ koloskova2022sharper, title={Sharper Convergence Guarantees for Asynchronous {SGD} for Distributed and Federated Learning}, author={Anastasia Koloskova and Sebastian U Stich and Martin Jaggi}, booktitle={Advances in Neural Information Processing Systems}, editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho}, year={2022}, url={https://openreview.net/forum?id=4_oCZgBIVI} }

Decentralized Local Stochastic Extra-Gradient for Variational Inequalities
Aleksandr Beznosikov, Pavel Dvurechensky, Anastasia Koloskova, Valentin Samokhin, Sebastian U. Stich, Alexander Gasnikov
NeurIPS 2022 • Paper • BibTex @inproceedings{ beznosikov2022decentralized, title={Decentralized Local Stochastic Extra-Gradient for Variational Inequalities}, author={Aleksandr Beznosikov and Pavel Dvurechensky and Anastasia Koloskova and Valentin Samokhin and Sebastian U Stich and Alexander Gasnikov}, booktitle={Advances in Neural Information Processing Systems}, editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho}, year={2022}, url={https://openreview.net/forum?id=Y4vT7m4e3d} }

An Improved Analysis of Gradient Tracking for Decentralized Machine Learning
Anastasia Koloskova, Tao Lin, Sebastian U Stich
NeurIPS 2021 • Paper • BibTex @inproceedings{ koloskova2021an, title={An Improved Analysis of Gradient Tracking for Decentralized Machine Learning}, author={Anastasia Koloskova and Tao Lin and Sebastian U Stich}, booktitle={Advances in Neural Information Processing Systems}, editor={A. Beygelzimer and Y. Dauphin and P. Liang and J. Wortman Vaughan}, year={2021}, url={https://openreview.net/forum?id=CmI7NqBR4Ua} }

RelaySum for Decentralized Deep Learning on Heterogeneous Data
Thijs Vogels*, Lie He*, Anastasia Koloskova, Sai Praneeth Karimireddy, Tao Lin, Sebastian U Stich, Martin Jaggi
NeurIPS 2021 • Paper • Code • BibTex @inproceedings{ vogels2021relaysum, title={RelaySum for Decentralized Deep Learning on Heterogeneous Data}, author={Thijs Vogels and Lie He and Anastasia Koloskova and Sai Praneeth Karimireddy and Tao Lin and Sebastian U Stich and Martin Jaggi}, booktitle={Advances in Neural Information Processing Systems}, editor={A. Beygelzimer and Y. Dauphin and P. Liang and J. Wortman Vaughan}, year={2021}, url={https://openreview.net/forum?id=Qo6kYy4SBI-} }

A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free!
Dmitry Kovalev, Anastasia Koloskova, Martin Jaggi, Peter Richtárik, Sebastian U. Stich
AISTATS 2021 • Paper • BibTex @InProceedings{pmlr-v130-kovalev21a, title = { A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free! }, author = {Kovalev, Dmitry and Koloskova, Anastasia and Jaggi, Martin and Richtarik, Peter and Stich, Sebastian}, booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics}, pages = {4087--4095}, year = {2021}, editor = {Banerjee, Arindam and Fukumizu, Kenji}, volume = {130}, series = {Proceedings of Machine Learning Research}, month = {13--15 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v130/kovalev21a/kovalev21a.pdf}, url = {https://proceedings.mlr.press/v130/kovalev21a.html}, abstract = { Decentralized optimization methods enable on-device training of machine learning models without a central coordinator. In many scenarios communication between devices is energy demanding and time consuming and forms the bottleneck of the entire system. We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators to the communicated messages. By combining our scheme with a new variance reduction technique that progressively throughout the iterations reduces the adverse effect of the injected quantization noise, we obtain a scheme that converges linearly on strongly convex decentralized problems while using compressed communication only. We prove that our method can solve the problems without any increase in the number of communications compared to the baseline which does not perform any communication compression while still allowing for a significant compression factor which depends on the conditioning of the problem and the topology of the network. We confirm our theoretical findings in numerical experiments. } }

Consensus Control for Decentralized Deep Learning
Lingjing Kong, Tao Lin, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich
ICML 2021 • Paper • BibTex @InProceedings{pmlr-v139-kong21a, title = {Consensus Control for Decentralized Deep Learning}, author = {Kong, Lingjing and Lin, Tao and Koloskova, Anastasia and Jaggi, Martin and Stich, Sebastian}, booktitle = {Proceedings of the 38th International Conference on Machine Learning}, pages = {5686--5696}, year = {2021}, editor = {Meila, Marina and Zhang, Tong}, volume = {139}, series = {Proceedings of Machine Learning Research}, month = {18--24 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v139/kong21a/kong21a.pdf}, url = {https://proceedings.mlr.press/v139/kong21a.html}, abstract = {Decentralized training of deep learning models enables on-device learning over networks, as well as efficient scaling to large compute clusters. Experiments in earlier works reveal that, even in a data-center setup, decentralized training often suffers from the degradation in the quality of the model: the training and test performance of models trained in a decentralized fashion is in general worse than that of models trained in a centralized fashion, and this performance drop is impacted by parameters such as network size, communication topology and data partitioning. We identify the changing consensus distance between devices as a key parameter to explain the gap between centralized and decentralized training. We show in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as the centralized counterpart. We empirically validate that the relation between generalization performance and consensus distance is consistent with this theoretical observation. Our empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop. To this end, we provide practical training guidelines and exemplify its effectiveness on the data-center setup as the important first step.} }

A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
Anastasia Koloskova*, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, Sebastian U. Stich*
ICML 2020 • Paper • BibTex @InProceedings{pmlr-v119-koloskova20a, title = {A Unified Theory of Decentralized {SGD} with Changing Topology and Local Updates}, author = {Koloskova, Anastasia and Loizou, Nicolas and Boreiri, Sadra and Jaggi, Martin and Stich, Sebastian}, booktitle = {Proceedings of the 37th International Conference on Machine Learning}, pages = {5381--5393}, year = {2020}, editor = {III, Hal Daumé and Singh, Aarti}, volume = {119}, series = {Proceedings of Machine Learning Research}, month = {13--18 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v119/koloskova20a/koloskova20a.pdf}, url = {https://proceedings.mlr.press/v119/koloskova20a.html}, abstract = {Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communication-efficiency. In this paper we introduce a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities. Our algorithmic framework covers local SGD updates and synchronous and pairwise gossip updates on adaptive network topology. We derive universal convergence rates for smooth (convex and non-convex) problems and the rates interpolate between the heterogeneous (non-identically distributed data) and iid-data settings, recovering linear convergence rates in many special cases, for instance for over-parametrized models. Our proofs rely on weak assumptions (typically improving over prior work in several aspects) and recover (and improve) the best known complexity results for a host of important scenarios, such as for instance coorperative SGD and federated averaging (local SGD).} }

Decentralized Deep Learning with Arbitrary Communication Compression
Anastasia Koloskova*, Tao Lin*, Sebastian U. Stich, Martin Jaggi
ICLR 2020 • Paper • BibTex @inproceedings{ Koloskova*2020Decentralized, title={Decentralized Deep Learning with Arbitrary Communication Compression}, author={Anastasia Koloskova* and Tao Lin* and Sebastian U Stich and Martin Jaggi}, booktitle={International Conference on Learning Representations}, year={2020}, url={https://openreview.net/forum?id=SkgGCkrKvH} }

Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
Anastasia Koloskova*, Sebastian U. Stich*, Martin Jaggi
ICML 2019 • Paper • BibTex @InProceedings{pmlr-v97-koloskova19a, title = {Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication}, author = {Koloskova, Anastasia and Stich, Sebastian and Jaggi, Martin}, booktitle = {Proceedings of the 36th International Conference on Machine Learning}, pages = {3478--3487}, year = {2019}, editor = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan}, volume = {97}, series = {Proceedings of Machine Learning Research}, month = {09--15 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v97/koloskova19a/koloskova19a.pdf}, url = {https://proceedings.mlr.press/v97/koloskova19a.html}, abstract = {We consider decentralized stochastic optimization with the objective function (e.g. data samples for machine learning tasks) being distributed over n machines that can only communicate to their neighbors on a fixed communication graph. To address the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by \delta <= 1 (\delta=1 meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate O(1/(nT) + 1/(T \rho^2 \delta)^2) for strongly convex objectives, where T denotes the number of iterations and \rho the eigengap of the connectivity matrix. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time O(1/(\rho^2\delta) \log (1/\epsilon)) for accuracy \epsilon > 0. This is (up to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for \delta > 0 and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms do outperform the respective state-of-the-art baselines and CHOCO-SGD can reduce communication by at least two orders of magnitudes.} }

Efficient Greedy Coordinate Descent for Composite Problems
Sai Praneeth Karimireddy*, Anastasia Koloskova*, Sebastian U. Stich, Martin Jaggi
AISTATS 2019 • Paper • BibTex @inproceedings{Karimireddy:271570, title = {Efficient Greedy Coordinate Descent for Composite Problems}, author = {Karimireddy, Sai Praneeth Reddy and Koloskova, Anastasiia and Stich, Sebastian Urban and Jaggi, Martin}, journal = {The 22nd International Conference on Artificial Intelligence and Statistics, 16-18 April 2019}, volume = {89}, series = {Proceedings of Machine Learning Research. 89}, pages = {2887-2896}, year = {2019}, abstract = {Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension n, requiring n times fewer iterations. In this paper, we consider greedy updates that are based on subgradients for a class of non-smooth composite problems, including L1-regularized problems, SVMs and related applications. For these problems we provide (i) the first linear rates of convergence independent of n, and show that our greedy update rule provides speedups similar to those obtained in the smooth case. This was previously conjectured to be true for a stronger greedy coordinate selection strategy. Furthermore, we show that (ii) our new selection rule can be mapped to instances of maximum inner product search, allowing to leverage standard nearest neighbor algorithms to speed up the implementation. We demonstrate the validity of the approach through extensive numerical experiments.}, url = {http://infoscience.epfl.ch/record/271570}, }

Selected Talks and Presentations

Methodological Aspects of Federated Learning, Challenges and Opportunities
Basel Biometric Society, 2023 • link

Convergence of Gradient Descent with Linearly Correlated Noise and Applications to Differentially Private Learning
Simons Berkeley Federated & Collaborative Learning Workshop, 2023 • link

Sharper Convergence Guarantees for Asynchronous SGD
Federated Learning One World Seminar, 2022 • link

A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
Federated Learning One World Seminar, 2021 • link

Choco-SGD: Communication Efficient Decentralized Learning
Applied Machine Learning Days, 2020 • link

Communication Efficient Decentralized Machine Learning
Youtube video @ ZettaBytes, EPFL, 2019 • link

Communication Efficient Decentralized Machine Learning
IC (CS) Department Research Day, EPFL, 2019 • link

Get In Touch
My email:
anastasiia.koloskova@[uzh].ch