Research
Research Interests
I am broadly interested in the theory, algorithms, and applications of machine learning, as well as in non-convex and convex optimization.
Recently, I have focused on using theory to design algorithms elegantly.
Specifically, my recent research topics are:
Deep Learning Theory: theory and theory-inspired algorithms. [1][2][3][4][5][6][8][9][10][11][12][13][14][17][18]
Expressivity: Exploring the expressive power of Transformers through the lens of approximation theory [8][12]; the expressivity of mixture-of-experts (MoE) models [16].
Optimization: When training neural networks, why do optimization algorithms converge to global minima? [2][4][12]
Implicit Bias: When training neural networks, why do optimization algorithms converge to global minima with favorable generalization ability, even without any explicit regularization? Examples include the flat-minima bias [3][5][9][10][11] and the max-margin bias [4][6] (see the sketch after this list).
Generalization: How can we measure the generalization ability of neural networks? [1]
Algorithm Design: For machine learning problems, design new provable optimization algorithms that (i) converge faster [10][13][17][18]; (ii) generalize better [6][10].
Transformers and Large Language Models: theory and algorithms, especially for LLM pre-training. [8][10][12][13][17][18]
Expressive Power: The expressive power and mechanisms of Transformers [8][12]; the expressivity of mixture-of-experts (MoE) models [16]; the mechanisms of in-context learning [12].
Algorithm Design: Design provably faster optimizers for training LLMs [10][13][17][18]; design more efficient model architectures.
Non-convex and Convex Optimization: theory and algorithms. [2][4][6][10][11][12][13][14][17][18]
Convex Optimization in ML. [6]
Non-convex Optimization in ML. [2][4][10][11][12][13][14][17][18]
Algorithm Design: Design provably faster and more stable optimizers for training neural networks [10][13][17][18]; accelerate convergence for problems with specific structure [6].
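To make the implicit-bias notions above concrete, here is a minimal mathematical sketch under standard assumptions; it is a generic textbook-style formulation, not a restatement of the specific results in [3][4][6][11]. For linearly separable data \(\{(x_i, y_i)\}_{i=1}^n\) with \(y_i \in \{\pm 1\}\), gradient descent on the logistic loss is known to diverge in norm while converging in direction to the \(\ell_2\) max-margin (hard-margin SVM) solution \(\hat{w}\):

\[
  \hat{w} \;=\; \operatorname*{arg\,min}_{w}\ \|w\|_2^2
  \quad \text{s.t.}\quad y_i\, w^\top x_i \ge 1 \ \ \forall i,
  \qquad
  \lim_{t \to \infty} \frac{w_t}{\|w_t\|_2} \;=\; \frac{\hat{w}}{\|\hat{w}\|_2}.
\]

Flat-minima bias is typically phrased via the sharpness of the training loss \(L\) at a minimum, e.g. its largest Hessian eigenvalue; sharpness-aware minimization (the optimizer studied in [11], written here in its standard form) explicitly minimizes a worst-case perturbed loss:

\[
  \operatorname{sharpness}(w) \;=\; \lambda_{\max}\!\big(\nabla^2 L(w)\big),
  \qquad
  \min_{w}\ \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon).
\]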
Recent Publications and Preprints
* indicates equal contribution; † indicates project lead.
[18] GradPower: Powering Gradients for Faster Language Model Pre-Training
Mingze Wang*†, Jinbo Wang*, Jiaqi Zhang, Wei Wang, Peng Pei, Xunliang Cai, Weinan E, Lei Wu.
under review, 1-22. May 2025.
[17] A Mechanistic Study of Transformer Training Instability under Mixed Precision
Shengtao Guo*, Mingze Wang*, Jinbo Wang, Lei Wu.
under review. May 2025.
[16] On the Expressive Power of Mixture-of-Experts for Structured Complex Tasks
Mingze Wang†, Weinan E.
under review, 1-18. May 2025.
[15] On the Learning Dynamics of Two-layer ReLU Networks with Label Noise SGD
Tongcheng Zhang, Zhanpeng Zhou, Mingze Wang, Andi Han, Wei Huang, Taiji Suzuki, Junchi Yan.
ICML 2025 Workshop on High-dimensional Learning Dynamics (ICML 2025 - HiLD).
[14] A Single Global Merging Suffices: Recovering Centralized Learning Performance in Decentralized Learning
Tongtian Zhu, Tianyu Zhang, Mingze Wang, Zhanpeng Zhou, Can Wang.
ICLR 2025 Workshop on Weight Space Learning (ICLR 2025 - WSL), 1-23.
[13] The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
Jinbo Wang*, Mingze Wang*†, Zhanpeng Zhou*, Junchi Yan, Weinan E, Lei Wu.
2025 International Conference on Machine Learning (ICML 2025), 1-23.
[12] How Transformers Get Rich: Approximation and Dynamics Analysis
Mingze Wang†, Ruoxi Yu, Weinan E, Lei Wu.
ICML 2025 Workshop on High-dimensional Learning Dynamics (ICML 2025 - HiLD), 1-47.
[11] Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training
Zhanpeng Zhou*, Mingze Wang*, Yuchen Mao, Bingrui Li, Junchi Yan.
2025 International Conference on Learning Representations (ICLR 2025) (Spotlight, top 5.1%), 1-31.
[10] Improving Generalization and Convergence by Enhancing Implicit Regularization
Mingze Wang†, Jinbo Wang, Haotian He, Zilin Wang, Guanhua Huang, Feiyu Xiong, Zhiyu Li, Weinan E, Lei Wu.
2024 Conference on Neural Information Processing Systems (NeurIPS 2024), 1-44.
[9] Loss Symmetry and Noise Equilibrium of Stochastic Gradient Descent
Liu Ziyin, Mingze Wang, Hongchao Li, Lei Wu.
2024 Conference on Neural Information Processing Systems (NeurIPS 2024), 1-26.
[8] Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling
Mingze Wang, Weinan E.
2024 Conference on Neural Information Processing Systems (NeurIPS 2024), 1-76.
[7] Are AI-Generated Text Detectors Robust to Adversarial Perturbations?
Guanhua Huang, Yuchen Zhang, Zhe Li, Yongjian You, Mingze Wang, Zhouwang Yang.
2024 Annual Meeting of the Association for Computational Linguistics (ACL 2024), 1-20.
[6] Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling
Mingze Wang†, Zeping Min, Lei Wu.
2024 International Conference on Machine Learning (ICML 2024), 1-38.
[5] A Theoretical Analysis of Noise Geometry in Stochastic Gradient Descent
Mingze Wang, Lei Wu.
NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning (NeurIPS 2023 - M3L), 1-30.
[4] Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks
Mingze Wang†, Chao Ma.
2023 Conference on Neural Information Processing Systems (NeurIPS 2023) (Spotlight, top 3.5%), 1-94.
[3] The alignment property of SGD noise and how it helps select flat minima: A stability analysis
Lei Wu, Mingze Wang, Weijie J. Su.
2022 Conference on Neural Information Processing Systems (NeurIPS 2022), 1-25.
[2] Early Stage Convergence and Global Convergence of Training Mildly Parameterized Neural Networks
Mingze Wang†, Chao Ma.
2022 Conference on Neural Information Processing Systems (NeurIPS 2022), 1-73.
[1] Generalization Error Bounds for Deep Neural Networks Trained by SGD
Mingze Wang†, Chao Ma.
arXiv preprint, 1-32, June 2022.
Co-authors