Publications

Additional information may be found at Google Scholar, dblp, and Semantic Scholar.

On the Sample Complexity of Robust Binary Hypothesis Testing Shankar Vallinayagam, Ankit Pensia, and Varun Jog.
Manuscript, 2026
arXiv
We study the sample complexity of robust binary hypothesis testing under three standard contamination models: $\epsilon$-additive (Huber), $\epsilon$-subtractive, and $\epsilon$-total variation (TV), denoted by $n^*_{\mathrm{Hub}}(\epsilon)$, $n^*_{\mathrm{Sub}}(\epsilon)$, and $n^*_{\mathrm{TV}}(\epsilon)$, respectively. For subtractive contamination, we show that least favourable distributions exist and provide explicit formulas for the same, bringing this model in line with the classical Huber and TV models. Next we show that in all three models, sample complexity may be highly unstable in the contamination parameter $\epsilon$, increasing by polynomial factors even for $o(\epsilon)$ perturbations. Similarly, there may be polynomial factor gaps between the sample complexities when $\epsilon$ is known exactly versus when it is known up to $o(\epsilon)$ error. Despite the instability of the sample complexity in all models, we show that the sample complexities across models are comparable up to constant-factor rescaling of $\epsilon$. Specifically, for any fixed $\delta_0>0$, the following hold for all distributions $p$ and $q$: (i) $n^*_{\mathrm{Hub}}(\epsilon) \lesssim n^*_{\mathrm{TV}}(\epsilon) \lesssim n^*_{\mathrm{Hub}}(2\epsilon)$, (ii) $n^*_{\mathrm{Sub}}(\epsilon) \lesssim n^*_{\mathrm{TV}}(\epsilon) \lesssim n^*_{\mathrm{Sub}}((2+\delta_0)\epsilon)$, and (iii) $n^*_{\mathrm{Sub}}(\epsilon) \lesssim n^*_{\mathrm{Hub}}(\epsilon) \lesssim n^*_{\mathrm{Sub}}((1+\delta_0)\epsilon)$, and the scaling constants are tight. Finally, we extend our results to adaptive versions of the contamination models.
Robust Regression with Adaptive Contamination in Response:
Optimal Rates and Computational Barriers with Ilias Diakonikolas, Chao Gao, Daniel M. Kane, and Dong Xie.
Manuscript, 2026
arXiv
We study robust regression under a contamination model in which covariates are clean while the responses may be corrupted in an adaptive manner. Unlike the classical Huber's contamination model, where both covariates and responses may be contaminated and consistent estimation is impossible when the contamination proportion is a non-vanishing constant, it turns out that the clean-covariate setting admits strictly improved statistical guarantees. Specifically, we show that the additional information in the clean covariates can be carefully exploited to construct an estimator that achieves a better estimation rate than that attainable under Huber contamination. In contrast to the Huber model, this improved rate implies consistency even when the contamination is a constant. A matching minimax lower bound is established using Fano's inequality together with the construction of contamination processes that match $m> 2$ distributions simultaneously, extending the previous two-point lower bound argument in Huber's setting. Despite the improvement over the Huber model from an information-theoretic perspective, we provide formal evidence---in the form of Statistical Query and Low-Degree Polynomial lower bounds---that the problem exhibits strong information-computation gaps. Our results strongly suggest that the information-theoretic improvements cannot be achieved by polynomial-time algorithms, revealing a fundamental gap between information-theoretic and computational limits in robust regression with clean covariates.
High-dimensional estimation with missing data:
Statistical and computational limits with Kabir A. Verchand, Saminul Haque, and Rohith Kuditipudi.
Manuscript, 2026
arXiv
We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $\epsilon$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $\rho$, for any constant contamination $\epsilon \in (0, 1)$, (roughly) $n \gtrsim d e^{1/\rho^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/\rho^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.
Information-Computation Tradeoffs for Noiseless Linear Regression with Oblivious Contamination with Ilias Diakonikolas, Chao Gao, Daniel M. Kane, and John Lafferty.
Advances in Neural Information Processing Systems (NeurIPS), 2025
arXiv conference version
We study the task of noiseless linear regression under Gaussian covariates in the presence of additive oblivious contamination. Specifically, we are given i.i.d. samples from a distribution $(x, y)$ on $\mathbb R^d \times \mathbb R$ with $x \sim \mathcal N(0,I_d)$ and $y = x^\top \beta + z$, where $z$ is drawn from an unknown distribution that is independent of $x$. Moreover, $z$ satisfies $\mathbb P[z = 0] = \alpha>0$. The goal is to accurately recover the regressor $\beta$ to small $\ell_2$-error. Ignoring computational considerations, this problem is known to be solvable using $O(d/\alpha)$ samples. On the other hand, the best known polynomial-time algorithms require $\Omega(d/\alpha^2)$ samples. Here we provide formal evidence that the quadratic dependence in $1/\alpha$ is inherent for efficient algorithms. Specifically, we show that any efficient Statistical Query algorithm for this task requires sample complexity at least $\tilde{\Omega}(d^{1/2}/\alpha^2)$.
SoS Certificates for Sparse Singular Values and Their Applications:
Robust Statistics, Subspace Distortion, and More with Ilias Diakonikolas, Samuel B. Hopkins, and Stefan Tiegel.
Symposium on Theory of Computing (STOC), 2025
arXiv conference version
We study \emph{sparse singular value certificates} for random rectangular matrices. If $M$ is an $n \times d$ matrix with independent Gaussian entries, we give a new family of polynomial-time algorithms which can certify upper bounds on the maximum of $\|M u\|$, where $u$ is a unit vector with at most $\eta n$ nonzero entries for a given $\eta \in (0,1)$. This basic algorithmic primitive lies at the heart of a wide range of problems across algorithmic statistics and theoretical computer science, including robust mean and covariance estimation, certification of distortion of random subspaces of $\mathbb{R}^n$, certification of the $2 \rightarrow p$ norm of a random matrix, and sparse principal component analysis.

Our algorithms certify a bound which is asymptotically smaller than the naive one, given by the maximum singular value of $M$, for nearly the widest-possible range of $n,d,$ and $\eta$. Efficiently certifying such a bound for a range of $n,d$ and $\eta$ which is larger by any polynomial factor than what is achieved by our algorithm would violate lower bounds in the statistical query and low-degree polynomials models. Our certification algorithm makes essential use of the Sum-of-Squares hierarchy. To prove the correctness of our algorithm, we develop a new combinatorial connection between the \emph{graph matrix} approach to analyze random matrices with dependent entries, and the Efron-Stein decomposition of functions of independent random variables.

As applications of our certification algorithm, we obtain new efficient algorithms for a wide range of well-studied algorithmic tasks. In algorithmic robust statistics, we obtain new algorithms for robust mean and covariance estimation with tradeoffs between breakdown point and sample complexity, which are nearly matched by statistical query and low-degree polynomial lower bounds (that we establish). We also obtain new polynomial-time guarantees for certification of $\ell_1/\ell_2$ distortion of random subspaces of $\mathbb{R}^n$ (also with nearly matching lower bounds), sparse principal component analysis, and certification of the $2\rightarrow p$ norm of a random matrix.
SoS Certifiability of Subgaussian Distributions and its Algorithmic Applications with Ilias Diakonikolas, Samuel B. Hopkins, and Stefan Tiegel.
Symposium on Theory of Computing (STOC), 2025
arXiv conference version slides (10min)slides (25min)
We prove that there is a universal constant $C>0$ so that for every $d \in \mathbb N$, every centered subgaussian distribution $\mathcal D$ on $\mathbb R^d$, and every even $p \in \mathbb N$, the $d$-variate polynomial $(Cp)^{p/2} \cdot \|v\|_{2}^p - \mathbb E_{X \sim \mathcal D} \langle v,X\rangle^p$ is a sum of square polynomials. This establishes that every subgaussian distribution is \emph{SoS-certifiably subgaussian}---a condition that yields efficient learning algorithms for a wide variety of high-dimensional statistical tasks. As a direct corollary, we obtain computationally efficient algorithms with near-optimal guarantees for the following tasks, when given samples from an arbitrary subgaussian distribution: robust mean estimation, list-decodable mean estimation, clustering mean-separated mixture models, robust covariance-aware mean estimation, robust covariance estimation, and robust linear regression. Our proof makes essential use of Talagrand's generic chaining/majorizing measures theorem.
The Sample Complexity of Simple Binary Hypothesis Testing:
Tight Bounds with Sequential Interactivity and Information Constraints Hadi Kazemi, Ankit Pensia, and Varun Jog.
Conference on Learning Theory (COLT), 2025
arXiv conference version
This paper resolves two open problems from a recent paper [Pensia, Jog, and Loh, COLT 2024] concerning the sample complexity of distributed simple binary hypothesis testing under information constraints. The first open problem asks whether interaction reduces the sample complexity of distributed simple binary hypothesis testing. In this paper, we show that sequential interaction does not help. The second problem suggests tightening existing sample complexity bounds for communication-constrained simple binary hypothesis testing. We derive optimally tight bounds for this setting and resolve this problem. Our main technical contributions are: (i) a one-shot lower bound on the Bayes error in simple binary hypothesis testing that tensorises; (ii) a streamlined proof of the formula for the sample complexity of simple binary hypothesis testing without constraints, first established in [PJL, COLT 2024]; and (iii) a reverse data-processing inequality for Hellinger-$\lambda$ divergences, generalising the results from [Bhatt, Nazer, Ordentlich, and Polyanskiy, IEEE Transactions of Information Theory 2021] and [Pensia, Jog, and Loh, IEEE Transactions of Information Theory 2024].
Optimal Robust Estimation under Local and Global Corruptions:
Stronger Adversary and Smaller Error with Thanasis Pittas.
Conference on Learning Theory (COLT), 2025
arXiv conference version
Algorithmic robust statistics has traditionally focused on the contamination model where a small fraction of the samples are arbitrarily corrupted. We consider a recent contamination model that combines two kinds of corruptions: (i) small fraction of arbitrary outliers, as in classical robust statistics, and (ii) local perturbations, where samples may undergo bounded shifts on average. While each noise model is well understood individually, the combined contamination model poses new algorithmic challenges, with only partial results known. Existing efficient algorithms are limited in two ways: (i) they work only for a weak notion of local perturbations, and (ii) they obtain suboptimal error for isotropic subgaussian distributions (among others). The latter limitation led \cite{NieGS24} to hypothesize that improving the error might, in fact, be computationally hard. Perhaps surprisingly, we show that information theoretically optimal error can indeed be achieved in polynomial time, under an even \emph{stronger} local perturbation model (the sliced-Wasserstein metric as opposed to the Wasserstein metric). Notably, our analysis reveals that the entire family of stability-based robust mean estimators continues to work optimally in a black-box manner for the combined contamination model. This generalization is particularly useful in real-world scenarios where the specific form of data corruption is not known in advance. We also present efficient algorithms for distribution learning and principal component analysis in the combined contamination model.
A Sub-Quadratic Time Algorithm for Robust Sparse Mean Estimation Ankit Pensia.
International Conference on Machine Learning (ICML), 2024 (Spotlight)
arXiv conference version slides (of a survey; 1hr)
We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a \emph{corrupted} set of samples from $\mathcal{N}(\mu,\mathbf{I}_d)$, where the unknown mean $\mu \in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity $\mathrm{poly}(k,\log d, 1/\epsilon)$ and runtime $d^2 \mathrm{poly}(k,\log d,1/\epsilon)$, where $\epsilon$ is the fraction of contamination. In particular, the fastest runtime of existing algorithms is quadratic $\Omega(d^2)$, which can be prohibitive in high dimensions. This quadratic barrier in the runtime stems from the reliance of these algorithms on the sample covariance matrix, which is of size $d^2$. Our main contribution is an algorithm for robust sparse mean estimation which runs in \emph{subquadratic} time using $\mathrm{poly}(k,\log d,1/\epsilon)$ samples. We also provide analogous results for robust sparse PCA. Our results build on algorithmic advances in detecting weak correlations, a generalized version of the light-bulb problem by Valiant~\cite{Valiant15}.
The Sample Complexity of Simple Binary Hypothesis Testing with Varun Jog and Po-Ling Loh.
Conference on Learning Theory (COLT), 2024
arXiv conference version slides (10min)
The sample complexity of simple binary hypothesis testing is the smallest number of i.i.d. samples required to distinguish between two distributions $p$ and $q$ in either: (i) the prior-free setting, with type-I error at most $\alpha$ and type-II error at most $\beta$; or (ii) the Bayesian setting, with Bayes error at most $\delta$ and prior distribution $(\alpha, 1-\alpha)$. This problem has only been studied when $\alpha = \beta$ (prior-free) or $\alpha = 1/2$ (Bayesian), and the sample complexity is known to be characterized by the Hellinger divergence between $p$ and $q$, up to multiplicative constants. In this paper, we derive a formula that characterizes the sample complexity (up to multiplicative constants that are independent of $p$, $q$, and all error parameters) for: (i) all $0 \le \alpha, \beta \le 1/8$ in the prior-free setting; and (ii) all $\delta \le \alpha/4$ in the Bayesian setting. In particular, the formula admits equivalent expressions in terms of certain divergences from the Jensen--Shannon and Hellinger families. The main technical result concerns an $f$-divergence inequality between members of the Jensen--Shannon and Hellinger families, which is proved by a combination of information-theoretic tools and case-by-case analyses. We explore applications of our results to robust and distributed hypothesis testing.
Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints with Amir R. Asadi, Varun Jog, and Po-Ling Loh.
IEEE Transactions on Information Theory (Trans. Inf. Theory), 2024
An extended abstract appeared at Conference on Learning Theory (COLT), 2023
arXiv journal version slides (20min)slides (1hr)Code
We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal{A}$ of two distributions $p$ and $q$, defined as $\mathcal{A} := \{(\mathbf{T} p, \mathbf{T} q) : \mathbf{T} \in \mathcal{C}\}$, where $\mathcal{C}$ is the set of channels characterizing the constraints.
Black-Box $k$-to-1-PCA Reductions: Theory and Applications with Arun Jambulapati, Syamantak Kumar, Jerry Li, Shourya Pandey, and Kevin Tian.
Conference on Learning Theory (COLT), 2024
arXiv conference version slides (10min)
The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely-used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, which we only have implicit access to via samples. Motivated by these implicit settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms, where we model access to the unknown target matrix via a black-box $1$-PCA oracle which returns an approximate top eigenvector, under two popular notions of approximation. Despite being arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly-understood.

Our main contribution is significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show deflation methods suffer no parameter loss. For an alternative well-studied approximation notion we term cPCA (correlation PCA), we tightly characterize the parameter regimes where deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms robust to dataset contamination, improving prior work both in sample complexity and approximation quality.
Robust Sparse Estimation for Gaussians with Optimal Error under Huber Contamination with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas.
International Conference on Machine Learning (ICML), 2024
arXiv conference version
We study Gaussian sparse estimation tasks in Huber's contamination model with a focus on mean estimation, PCA, and linear regression. For each of these tasks, we give the first sample and computationally efficient robust estimators with optimal error guarantees, within constant factors. All prior efficient algorithms for these tasks incur quantitatively suboptimal error. Concretely, for Gaussian robust $k$-sparse mean estimation on $\mathbb{R}^d$ with corruption rate $\epsilon>0$, our algorithm has sample complexity $(k^2/\epsilon^2)\mathrm{polylog}(d/\epsilon)$, runs in sample polynomial time, and approximates the target mean within $\ell_2$-error $O(\epsilon)$. Previous efficient algorithms inherently incur error $\Omega(\epsilon \sqrt{\log(1/\epsilon)})$. At the technical level, we develop a novel multidimensional filtering method in the sparse regime that may find other applications.
Semi-supervised Group DRO: Combating Sparsity with Unlabeled Data with Pranjal Awasthi and Satyen Kale.
International Conference on Algorithmic Learning Theory (ALT), 2024
conference version
In this work we formulate the problem of group distributionally robust optimization (DRO) in a semi-supervised setting. Motivated by applications in robustness and fairness, the goal in group DRO is to learn a hypothesis that minimizes the worst case performance over a pre-specified set of groups defined over the data distribution. In contrast to existing work that assumes access to labeled data from each of the groups, we consider the practical setting where many groups may have little to no amount of labeled data.
We design near optimal learning algorithms in this setting by leveraging the unlabeled data from different groups. The performance of our algorithms can be characterized in terms of a natural quantity that captures the similarity among the various groups and the maximum {m best-in-class} error among the groups. Furthermore, for the special case of squared loss and a convex function class we show that the dependence on the best-in-class error can be avoided. We also derive sample complexity bounds for our proposed semi-supervised algorithm.
Robust regression with covariate filtering: Heavy tails and adversarial contamination with Varun Jog and Po-Ling Loh.
Journal of the American Statistical Association (JASA), 2024
arXiv journal version Code
We study the problem of linear regression where both covariates and responses are potentially (i) heavy-tailed and (ii) adversarially contaminated. Several computationally efficient estimators have been proposed for the simpler setting where the covariates are sub-Gaussian and uncontaminated; however, these estimators may fail when the covariates are either heavy-tailed or contain outliers. In this work, we show how to modify the Huber regression, least trimmed squares, and least absolute deviation estimators to obtain estimators which are simultaneously computationally and statistically efficient in the stronger contamination model. Our approach is quite simple, and consists of applying a filtering algorithm to the covariates, and then applying the classical robust regression estimators to the remaining data. We show that the Huber regression estimator achieves near-optimal error rates in this setting, whereas the least trimmed squares and least absolute deviation estimators can be made to achieve near-optimal error after applying a postprocessing step.
Communication-constrained hypothesis testing:
Optimality, robustness, and reverse data processing inequalities with Varun Jog and Po-Ling Loh.
IEEE Transactions on Information Theory (Trans. Inf. Theory), 2024
A shorter version appeared at ISIT 2022
arXiv journal version
We study hypothesis testing under communication constraints, where each sample is quantized before being revealed to a statistician. Without communication constraints, it is well known that the sample complexity of simple binary hypothesis testing is characterized by the Hellinger distance between the distributions. We show that the sample complexity of simple binary hypothesis testing under communication constraints is at most a logarithmic factor larger than in the unconstrained setting and this bound is tight. We develop a polynomial-time algorithm that achieves the aforementioned sample complexity. Our framework extends to robust hypothesis testing, where the distributions are corrupted in the total variation distance. Our proofs rely on a new reverse data processing inequality and a reverse Markov inequality, which may be of independent interest. For simple $M$-ary hypothesis testing, the sample complexity in the absence of communication constraints has a logarithmic dependence on $M$. We show that communication constraints can cause an exponential blow-up leading to $\Omega(M)$ sample complexity even for adaptive algorithms.
Near-Optimal Algorithms for Gaussians with Huber Contamination:
Mean Estimation and Linear Regression with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas.
Advances in Neural Information Processing Systems (NeurIPS), 2023
arXiv conference version
We study the fundamental problems of Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination. Our main contribution is the design of the first sample near-optimal and almost linear-time algorithms with optimal error guarantees for both these problems. Specifically, for Gaussian robust mean estimation on $\mathbb{R}^d$ with contamination parameter $\epsilon \in (0,\epsilon_0)$ for a small absolute constant $\epsilon_0$, we give an algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target mean within $\ell_2$-error $O(\eps)$. This improves on prior work that achieved this error guarantee with polynomially suboptimal sample and time complexity. For robust linear regression, we give the first algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target regressor within $\ell_2$-error $O(\epsilon)$. This is the first polynomial sample and time algorithm achieving the optimal error guarantee, answering an open question in the literature. At the technical level, we develop a methodology that yields almost-linear time algorithms for multi-directional filtering that may be of broader interest.
A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm with Ilias Diakonikolas, Daniel M. Kane, Jasper C.H. Lee, and Thanasis Pittas.
Advances in Neural Information Processing Systems (NeurIPS), 2023 (Spotlight)
arXiv conference version
We study the problem of list-decodable Gaussian covariance estimation. Given a multiset $T$ of $n$ points in $\mathbb{R}^d$ such that an unknown $\alpha<1/2$ fraction of points in $T$ are i.i.d. samples from an unknown Gaussian $\mathcal N(\mu ,\Sigma)$, the goal is to output a list of $O(1/\alpha)$ hypotheses at least one of which is close to $\Sigma$ in relative Frobenius norm. Our main result is a $\mathrm{poly}(d,1/\alpha)$ sample and time algorithm for this task that guarantees relative Frobenius norm error of $\mathrm{poly}(1/\alpha)$. Importantly, our algorithm relies purely on spectral techniques. As a corollary, we obtain an efficient spectral algorithm for robust partial clustering of Gaussian mixture models (GMMs) -- a key ingredient in the recent work of [BDJ+22] on robustly learning arbitrary GMMs. Combined with the other components of [BDJ+22], our new method yields the first Sum-of-Squares-free algorithm for robustly learning GMMs. At the technical level, we develop a novel multi-filtering method for list-decodable covariance estimation that may be useful in other settings.
Nearly-Linear Time and Streaming Algorithms for Outlier-Robust PCA with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas.
International Conference on Machine Learning (ICML), 2023
arXiv conference version
We study principal component analysis (PCA), where given a dataset in $\mathbb R^d$ from a distribution, the task is to find a unit vector $v$ that approximately maximizes the variance of the distribution after being projected along $v$. Despite being a classical task, standard estimators fail drastically if the data contains even a small fraction of outliers, motivating the problem of robust PCA. Recent work has developed computationally-efficient algorithms for robust PCA that either take super-linear time or have sub-optimal error guarantees. Our main contribution is to develop a nearly linear time algorithm for robust PCA with near-optimal error guarantees. We also develop a single-pass streaming algorithm for robust PCA with memory usage nearly-linear in the dimension.
Gaussian Mean Testing Made Simple with Ilias Diakonikolas and Daniel M. Kane.
SIAM Symposium on Simplicity in Algorithms (SOSA), 2023
arXiv conference version
We study the following fundamental hypothesis testing problem, which we term Gaussian mean testing. Given i.i.d. samples from a distribution $p$ on $\mathbb{R}^d$, the task is to distinguish, with high probability, between the following cases: (i) $p$ is the standard Gaussian distribution, $\mathcal{N}(0,I_d)$, and (ii) $p$ is a Gaussian $\mathcal{N}(\mu,\Sigma)$ for some unknown covariance $\Sigma$ and mean $\mu \in \mathbb{R}^d$ satisfying $\|\mu\|_2 \geq \epsilon$. Recent work gave an algorithm for this testing problem with the optimal sample complexity of $\Theta(\sqrt{d}/\epsilon^2)$. Both the previous algorithm and its analysis are quite complicated. Here we give an extremely simple algorithm for Gaussian mean testing with a one-page analysis. Our algorithm is sample optimal and runs in sample linear time.
Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions with Ilias Diakonikolas, Daniel M. Kane, and Jasper C.H. Lee.
Advances in Neural Information Processing Systems (NeurIPS), 2022
arXiv conference version
We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an additive $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.
List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas.
Advances in Neural Information Processing Systems (NeurIPS), 2022 (Oral)
arXiv conference version
We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\mathbb R^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\hat \mu$ such that $\|\hat \mu - \mu\|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with ''certifiably bounded'' $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\textrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta (\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.
Robust Sparse Mean Estimation via Sum of Squares with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas.
Conference on Learning Theory (COLT), 2022
arXiv conference version
We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with `certifiably bounded' $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \textrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs to algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively best possible.
Streaming Algorithms for High-Dimensional Robust Statistics with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas.
International Conference on Machine Learning (ICML), 2022
arXiv conference version
We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements (up to logarithmic factors). Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model. We give an efficient single-pass streaming algorithm for this task with near-optimal error guarantees and space complexity nearly-linear in the dimension. As a corollary, we obtain streaming algorithms with near-optimal space complexity for several more complex tasks, including robust covariance estimation, robust regression, and more generally robust stochastic optimization.
Sharp Concentration Inequalities for the Centered Relative Entropy with Alankrita Bhatt.
Information and Inference: a Journal of the IMA, 2022
arXiv journal version
We study the relative entropy between the empirical estimate of a discrete distribution and the true underlying distribution. If the minimum value of the probability mass function exceeds an $\alpha>0$ (i.e. when the true underlying distribution is bounded sufficiently away from the boundary of the simplex), we prove an upper bound on the moment generating function of the centered relative entropy that matches (up to logarithmic factors in the alphabet size and $\alpha$) the optimal asymptotic rates, subsequently leading to a sharp concentration inequality for the centered relative entropy. As a corollary of this result we also obtain confidence intervals and moment bounds for the centered relative entropy that are sharp up to logarithmic factors in the alphabet size and $\alpha.$
Statistical Query Lower Bounds for List-Decodable Linear Regression with Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, and Alistair Stewart.
Advances in Neural Information Processing Systems (NeurIPS), 2021 (Spotlight)
arXiv conference version
We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x,y) \in \mathbb{R}^d\times \mathbb{R}$ and a parameter $0<\alpha<1/2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a linear regression model with Gaussian covariates, and the remaining $(1−\alpha)$-fraction of the points are drawn from an arbitrary noise distribution. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the target regression vector. Our main result is a Statistical Query (SQ) lower bound of $d^{\textrm{poly}(1/\alpha)}$ for this problem. Our SQ lower bound qualitatively matches the performance of previously developed algorithms, providing evidence that current upper bounds for this task are nearly best possible.
Estimating location parameters in sample-heterogeneous distributions with Varun Jog and Po-Ling Loh.
Information and Inference: a Journal of the IMA, 2021
A shorter version of this article appeared at ISIT 2019
arXiv journal version PDF
Estimating the mean of a probability distribution using i.i.d. samples is a classical problem in statistics, wherein finite-sample optimal estimators are sought under various distributional assumptions. In this paper, we consider the problem of mean estimation when independent samples are drawn from $d$-dimensional non-identical distributions possessing a common mean. When the distributions are radially symmetric and unimodal, we propose a novel estimator, which is a hybrid of the modal interval, shorth, and median estimators, and whose performance adapts to the level of heterogeneity in the data. We show that our estimator is near-optimal when data are i.i.d. and when the fraction of ``low-noise'' distributions is as small as $\Omega\left(\frac{d \log n}{n}\right)$, where $n$ is the number of samples. We also derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we extend our theory to linear regression. In both the mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points.
Outlier Robust Mean Estimation with Subgaussian Rates via Stability with Ilias Diakonikolas and Daniel M. Kane.
Advances in Neural Information Processing Systems (NeurIPS), 2020
arXiv conference version
We study the problem of outlier robust high-dimensional mean estimation under a finite covariance assumption, and more broadly under finite low-degree moment assumptions. We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition. As a corollary, it follows that a number of recently developed algorithms for robust mean estimation, including iterative filtering and non-convex gradient descent, give optimal error estimators with (near-)subgaussian rates. Previous analyses of these algorithms gave significantly suboptimal rates. As a corollary of our approach, we obtain the first computationally efficient algorithm with subgaussian rate for outlier-robust mean estimation in the strong contamination model under a finite covariance assumption.
Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient with Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos.
Advances in Neural Information Processing Systems (NeurIPS), 2020 (Spotlight)
arXiv conference version
The strong /lottery ticket hypothesis/ (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. (2020) establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width $d$ and depth $l$, by pruning a random one that is a factor $O(d^4l^2)$ wider and twice as deep. This polynomial over-parameterization requirement is at odds with recent experimental research that achieves good approximation with networks that are a small factor wider than the target. In this work, we close the gap and offer an exponential improvement to the over-parameterization requirement for the existence of lottery tickets. We show that any target network of width d and depth l can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep. Our analysis heavily relies on connecting pruning random ReLU networks to random instances of the +SubsetSum+ problem. We then show that this logarithmic over-parameterization is essentially optimal for constant depth networks. Finally, we verify several of our theoretical insights with experiments.
Extracting robust and accurate features via a robust information bottleneck with Varun Jog and Po-Ling Loh.
IEEE Journal on Selected Areas in Information Theory (JSAIT), 2020
journal version PDF
We propose a novel strategy for extracting features in supervised learning that can be used to construct a classifier which is more robust to small perturbations in the input space. Our method builds upon the idea of the information bottleneck, by introducing an additional penalty term that encourages the Fisher information of the extracted features to be small when parametrized by the inputs. We present two formulations where the relevance of the features to output labels is measured using either mutual information or MMSE. By tuning the regularization parameter, we can explicitly trade off the opposing desiderata of robustness and accuracy when constructing a classifier. We derive optimal solutions to both robust information bottleneck formulations when the inputs and outputs are jointly Gaussian, proving that the optimally robust features are also jointly Gaussian in this setting. We also propose methods for optimizing variational bounds on the robust information bottleneck objectives in general settings using stochastic gradient descent, which may be implemented efficiently in neural networks. Our experimental results for synthetic and real data sets show that the proposed feature extraction methods indeed produce classifiers with increased robustness to perturbations.
Deep Topic Models for Multi-label Learning with Rajat Panda, Nikhil Mehta, Mingyuan Zhou, and Piyush Rai.
International Conference on Artificial Intelligence and Statistics (AISTATS), 2019
conference version
We present a probabilistic framework for multi-label learning based on a deep generative model for the binary label vector associated with each observation. Our generative model learns deep multi-layer latent embeddings of the binary label vector, which are conditioned on the input features of the observation. The model also has an interesting interpretation in terms of a deep topic model, with each label vector representing a bag-of-words document, with the input features being its meta-data. In addition to capturing the structural properties of the label space (e.g., a near-low-rank label matrix), the model also offers a clean, geometric interpretation. In particular, the nonlinear classification boundaries learned by the model can be seen as the union of multiple convex polytopes. Our model admits a simple and scalable inference via efficient Gibbs sampling or EM algorithm. We compare our model with state-of-the-art baselines for multi-label learning on benchmark data sets, and also report some interesting qualitative results.
Generalization Error Bounds for Noisy, Iterative Algorithms with Varun Jog and Po-Ling Loh.
IEEE International Symposium on Information Theory (ISIT), 2018
arXiv conference version
In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. Recent work [Xu and Raginsky (2017)] has established a bound on the generalization error of empirical risk minimization based on the mutual information $I(S;W)$ between the algorithm input $S$ and the algorithm output $W$, when the loss function is sub-Gaussian. We leverage these results to derive generalization error bounds for a broad class of iterative algorithms that are characterized by bounded, noisy updates with Markovian structure. Our bounds are very general and are applicable to numerous settings of interest, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm. Furthermore, our error bounds hold for any output function computed over the path of iterates, including the last iterate of the algorithm or the average of subsets of iterates, and also allow for non-uniform sampling of data in successive updates of the algorithm.

Some papers naturally reappear across multiple topics.

01. Robust Statistics

Estimation and inference that remain reliable under adversarial contamination, outliers, or model misspecification.

On the Sample Complexity of Robust Binary Hypothesis Testing Shankar Vallinayagam, Ankit Pensia, and Varun Jog.
Manuscript, 2026
arXiv
We study the sample complexity of robust binary hypothesis testing under three standard contamination models: $\epsilon$-additive (Huber), $\epsilon$-subtractive, and $\epsilon$-total variation (TV), denoted by $n^*_{\mathrm{Hub}}(\epsilon)$, $n^*_{\mathrm{Sub}}(\epsilon)$, and $n^*_{\mathrm{TV}}(\epsilon)$, respectively. For subtractive contamination, we show that least favourable distributions exist and provide explicit formulas for the same, bringing this model in line with the classical Huber and TV models. Next we show that in all three models, sample complexity may be highly unstable in the contamination parameter $\epsilon$, increasing by polynomial factors even for $o(\epsilon)$ perturbations. Similarly, there may be polynomial factor gaps between the sample complexities when $\epsilon$ is known exactly versus when it is known up to $o(\epsilon)$ error. Despite the instability of the sample complexity in all models, we show that the sample complexities across models are comparable up to constant-factor rescaling of $\epsilon$. Specifically, for any fixed $\delta_0>0$, the following hold for all distributions $p$ and $q$: (i) $n^*_{\mathrm{Hub}}(\epsilon) \lesssim n^*_{\mathrm{TV}}(\epsilon) \lesssim n^*_{\mathrm{Hub}}(2\epsilon)$, (ii) $n^*_{\mathrm{Sub}}(\epsilon) \lesssim n^*_{\mathrm{TV}}(\epsilon) \lesssim n^*_{\mathrm{Sub}}((2+\delta_0)\epsilon)$, and (iii) $n^*_{\mathrm{Sub}}(\epsilon) \lesssim n^*_{\mathrm{Hub}}(\epsilon) \lesssim n^*_{\mathrm{Sub}}((1+\delta_0)\epsilon)$, and the scaling constants are tight. Finally, we extend our results to adaptive versions of the contamination models.
Robust Regression with Adaptive Contamination in Response:
Optimal Rates and Computational Barriers with Ilias Diakonikolas, Chao Gao, Daniel M. Kane, and Dong Xie.
Manuscript, 2026
arXiv
We study robust regression under a contamination model in which covariates are clean while the responses may be corrupted in an adaptive manner. Unlike the classical Huber's contamination model, where both covariates and responses may be contaminated and consistent estimation is impossible when the contamination proportion is a non-vanishing constant, it turns out that the clean-covariate setting admits strictly improved statistical guarantees. Specifically, we show that the additional information in the clean covariates can be carefully exploited to construct an estimator that achieves a better estimation rate than that attainable under Huber contamination. In contrast to the Huber model, this improved rate implies consistency even when the contamination is a constant. A matching minimax lower bound is established using Fano's inequality together with the construction of contamination processes that match $m> 2$ distributions simultaneously, extending the previous two-point lower bound argument in Huber's setting. Despite the improvement over the Huber model from an information-theoretic perspective, we provide formal evidence---in the form of Statistical Query and Low-Degree Polynomial lower bounds---that the problem exhibits strong information-computation gaps. Our results strongly suggest that the information-theoretic improvements cannot be achieved by polynomial-time algorithms, revealing a fundamental gap between information-theoretic and computational limits in robust regression with clean covariates.
High-dimensional estimation with missing data:
Statistical and computational limits with Kabir A. Verchand, Saminul Haque, and Rohith Kuditipudi.
Manuscript, 2026
arXiv
We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $\epsilon$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $\rho$, for any constant contamination $\epsilon \in (0, 1)$, (roughly) $n \gtrsim d e^{1/\rho^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/\rho^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.
Information-Computation Tradeoffs for Noiseless Linear Regression with Oblivious Contamination with Ilias Diakonikolas, Chao Gao, Daniel M. Kane, and John Lafferty.
Advances in Neural Information Processing Systems (NeurIPS), 2025
arXiv conference version
We study the task of noiseless linear regression under Gaussian covariates in the presence of additive oblivious contamination. Specifically, we are given i.i.d. samples from a distribution $(x, y)$ on $\mathbb R^d \times \mathbb R$ with $x \sim \mathcal N(0,I_d)$ and $y = x^\top \beta + z$, where $z$ is drawn from an unknown distribution that is independent of $x$. Moreover, $z$ satisfies $\mathbb P[z = 0] = \alpha>0$. The goal is to accurately recover the regressor $\beta$ to small $\ell_2$-error. Ignoring computational considerations, this problem is known to be solvable using $O(d/\alpha)$ samples. On the other hand, the best known polynomial-time algorithms require $\Omega(d/\alpha^2)$ samples. Here we provide formal evidence that the quadratic dependence in $1/\alpha$ is inherent for efficient algorithms. Specifically, we show that any efficient Statistical Query algorithm for this task requires sample complexity at least $\tilde{\Omega}(d^{1/2}/\alpha^2)$.
SoS Certificates for Sparse Singular Values and Their Applications:
Robust Statistics, Subspace Distortion, and More with Ilias Diakonikolas, Samuel B. Hopkins, and Stefan Tiegel.
Symposium on Theory of Computing (STOC), 2025
arXiv conference version
We study \emph{sparse singular value certificates} for random rectangular matrices. If $M$ is an $n \times d$ matrix with independent Gaussian entries, we give a new family of polynomial-time algorithms which can certify upper bounds on the maximum of $\|M u\|$, where $u$ is a unit vector with at most $\eta n$ nonzero entries for a given $\eta \in (0,1)$. This basic algorithmic primitive lies at the heart of a wide range of problems across algorithmic statistics and theoretical computer science, including robust mean and covariance estimation, certification of distortion of random subspaces of $\mathbb{R}^n$, certification of the $2 \rightarrow p$ norm of a random matrix, and sparse principal component analysis.

Our algorithms certify a bound which is asymptotically smaller than the naive one, given by the maximum singular value of $M$, for nearly the widest-possible range of $n,d,$ and $\eta$. Efficiently certifying such a bound for a range of $n,d$ and $\eta$ which is larger by any polynomial factor than what is achieved by our algorithm would violate lower bounds in the statistical query and low-degree polynomials models. Our certification algorithm makes essential use of the Sum-of-Squares hierarchy. To prove the correctness of our algorithm, we develop a new combinatorial connection between the \emph{graph matrix} approach to analyze random matrices with dependent entries, and the Efron-Stein decomposition of functions of independent random variables.

As applications of our certification algorithm, we obtain new efficient algorithms for a wide range of well-studied algorithmic tasks. In algorithmic robust statistics, we obtain new algorithms for robust mean and covariance estimation with tradeoffs between breakdown point and sample complexity, which are nearly matched by statistical query and low-degree polynomial lower bounds (that we establish). We also obtain new polynomial-time guarantees for certification of $\ell_1/\ell_2$ distortion of random subspaces of $\mathbb{R}^n$ (also with nearly matching lower bounds), sparse principal component analysis, and certification of the $2\rightarrow p$ norm of a random matrix.
SoS Certifiability of Subgaussian Distributions and its Algorithmic Applications with Ilias Diakonikolas, Samuel B. Hopkins, and Stefan Tiegel.
Symposium on Theory of Computing (STOC), 2025
arXiv conference version slides (10min)slides (25min)
We prove that there is a universal constant $C>0$ so that for every $d \in \mathbb N$, every centered subgaussian distribution $\mathcal D$ on $\mathbb R^d$, and every even $p \in \mathbb N$, the $d$-variate polynomial $(Cp)^{p/2} \cdot \|v\|_{2}^p - \mathbb E_{X \sim \mathcal D} \langle v,X\rangle^p$ is a sum of square polynomials. This establishes that every subgaussian distribution is \emph{SoS-certifiably subgaussian}---a condition that yields efficient learning algorithms for a wide variety of high-dimensional statistical tasks. As a direct corollary, we obtain computationally efficient algorithms with near-optimal guarantees for the following tasks, when given samples from an arbitrary subgaussian distribution: robust mean estimation, list-decodable mean estimation, clustering mean-separated mixture models, robust covariance-aware mean estimation, robust covariance estimation, and robust linear regression. Our proof makes essential use of Talagrand's generic chaining/majorizing measures theorem.
Optimal Robust Estimation under Local and Global Corruptions:
Stronger Adversary and Smaller Error with Thanasis Pittas.
Conference on Learning Theory (COLT), 2025
arXiv conference version
Algorithmic robust statistics has traditionally focused on the contamination model where a small fraction of the samples are arbitrarily corrupted. We consider a recent contamination model that combines two kinds of corruptions: (i) small fraction of arbitrary outliers, as in classical robust statistics, and (ii) local perturbations, where samples may undergo bounded shifts on average. While each noise model is well understood individually, the combined contamination model poses new algorithmic challenges, with only partial results known. Existing efficient algorithms are limited in two ways: (i) they work only for a weak notion of local perturbations, and (ii) they obtain suboptimal error for isotropic subgaussian distributions (among others). The latter limitation led \cite{NieGS24} to hypothesize that improving the error might, in fact, be computationally hard. Perhaps surprisingly, we show that information theoretically optimal error can indeed be achieved in polynomial time, under an even \emph{stronger} local perturbation model (the sliced-Wasserstein metric as opposed to the Wasserstein metric). Notably, our analysis reveals that the entire family of stability-based robust mean estimators continues to work optimally in a black-box manner for the combined contamination model. This generalization is particularly useful in real-world scenarios where the specific form of data corruption is not known in advance. We also present efficient algorithms for distribution learning and principal component analysis in the combined contamination model.
A Sub-Quadratic Time Algorithm for Robust Sparse Mean Estimation Ankit Pensia.
International Conference on Machine Learning (ICML), 2024 (Spotlight)
arXiv conference version slides (of a survey; 1hr)
We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a \emph{corrupted} set of samples from $\mathcal{N}(\mu,\mathbf{I}_d)$, where the unknown mean $\mu \in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity $\mathrm{poly}(k,\log d, 1/\epsilon)$ and runtime $d^2 \mathrm{poly}(k,\log d,1/\epsilon)$, where $\epsilon$ is the fraction of contamination. In particular, the fastest runtime of existing algorithms is quadratic $\Omega(d^2)$, which can be prohibitive in high dimensions. This quadratic barrier in the runtime stems from the reliance of these algorithms on the sample covariance matrix, which is of size $d^2$. Our main contribution is an algorithm for robust sparse mean estimation which runs in \emph{subquadratic} time using $\mathrm{poly}(k,\log d,1/\epsilon)$ samples. We also provide analogous results for robust sparse PCA. Our results build on algorithmic advances in detecting weak correlations, a generalized version of the light-bulb problem by Valiant~\cite{Valiant15}.
Robust Sparse Estimation for Gaussians with Optimal Error under Huber Contamination with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas.
International Conference on Machine Learning (ICML), 2024
arXiv conference version
We study Gaussian sparse estimation tasks in Huber's contamination model with a focus on mean estimation, PCA, and linear regression. For each of these tasks, we give the first sample and computationally efficient robust estimators with optimal error guarantees, within constant factors. All prior efficient algorithms for these tasks incur quantitatively suboptimal error. Concretely, for Gaussian robust $k$-sparse mean estimation on $\mathbb{R}^d$ with corruption rate $\epsilon>0$, our algorithm has sample complexity $(k^2/\epsilon^2)\mathrm{polylog}(d/\epsilon)$, runs in sample polynomial time, and approximates the target mean within $\ell_2$-error $O(\epsilon)$. Previous efficient algorithms inherently incur error $\Omega(\epsilon \sqrt{\log(1/\epsilon)})$. At the technical level, we develop a novel multidimensional filtering method in the sparse regime that may find other applications.
Robust regression with covariate filtering: Heavy tails and adversarial contamination with Varun Jog and Po-Ling Loh.
Journal of the American Statistical Association (JASA), 2024
arXiv journal version Code
We study the problem of linear regression where both covariates and responses are potentially (i) heavy-tailed and (ii) adversarially contaminated. Several computationally efficient estimators have been proposed for the simpler setting where the covariates are sub-Gaussian and uncontaminated; however, these estimators may fail when the covariates are either heavy-tailed or contain outliers. In this work, we show how to modify the Huber regression, least trimmed squares, and least absolute deviation estimators to obtain estimators which are simultaneously computationally and statistically efficient in the stronger contamination model. Our approach is quite simple, and consists of applying a filtering algorithm to the covariates, and then applying the classical robust regression estimators to the remaining data. We show that the Huber regression estimator achieves near-optimal error rates in this setting, whereas the least trimmed squares and least absolute deviation estimators can be made to achieve near-optimal error after applying a postprocessing step.
Near-Optimal Algorithms for Gaussians with Huber Contamination:
Mean Estimation and Linear Regression with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas.
Advances in Neural Information Processing Systems (NeurIPS), 2023
arXiv conference version
We study the fundamental problems of Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination. Our main contribution is the design of the first sample near-optimal and almost linear-time algorithms with optimal error guarantees for both these problems. Specifically, for Gaussian robust mean estimation on $\mathbb{R}^d$ with contamination parameter $\epsilon \in (0,\epsilon_0)$ for a small absolute constant $\epsilon_0$, we give an algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target mean within $\ell_2$-error $O(\eps)$. This improves on prior work that achieved this error guarantee with polynomially suboptimal sample and time complexity. For robust linear regression, we give the first algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target regressor within $\ell_2$-error $O(\epsilon)$. This is the first polynomial sample and time algorithm achieving the optimal error guarantee, answering an open question in the literature. At the technical level, we develop a methodology that yields almost-linear time algorithms for multi-directional filtering that may be of broader interest.
A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm with Ilias Diakonikolas, Daniel M. Kane, Jasper C.H. Lee, and Thanasis Pittas.
Advances in Neural Information Processing Systems (NeurIPS), 2023 (Spotlight)
arXiv conference version
We study the problem of list-decodable Gaussian covariance estimation. Given a multiset $T$ of $n$ points in $\mathbb{R}^d$ such that an unknown $\alpha<1/2$ fraction of points in $T$ are i.i.d. samples from an unknown Gaussian $\mathcal N(\mu ,\Sigma)$, the goal is to output a list of $O(1/\alpha)$ hypotheses at least one of which is close to $\Sigma$ in relative Frobenius norm. Our main result is a $\mathrm{poly}(d,1/\alpha)$ sample and time algorithm for this task that guarantees relative Frobenius norm error of $\mathrm{poly}(1/\alpha)$. Importantly, our algorithm relies purely on spectral techniques. As a corollary, we obtain an efficient spectral algorithm for robust partial clustering of Gaussian mixture models (GMMs) -- a key ingredient in the recent work of [BDJ+22] on robustly learning arbitrary GMMs. Combined with the other components of [BDJ+22], our new method yields the first Sum-of-Squares-free algorithm for robustly learning GMMs. At the technical level, we develop a novel multi-filtering method for list-decodable covariance estimation that may be useful in other settings.
Nearly-Linear Time and Streaming Algorithms for Outlier-Robust PCA with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas.
International Conference on Machine Learning (ICML), 2023
arXiv conference version
We study principal component analysis (PCA), where given a dataset in $\mathbb R^d$ from a distribution, the task is to find a unit vector $v$ that approximately maximizes the variance of the distribution after being projected along $v$. Despite being a classical task, standard estimators fail drastically if the data contains even a small fraction of outliers, motivating the problem of robust PCA. Recent work has developed computationally-efficient algorithms for robust PCA that either take super-linear time or have sub-optimal error guarantees. Our main contribution is to develop a nearly linear time algorithm for robust PCA with near-optimal error guarantees. We also develop a single-pass streaming algorithm for robust PCA with memory usage nearly-linear in the dimension.
Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions with Ilias Diakonikolas, Daniel M. Kane, and Jasper C.H. Lee.
Advances in Neural Information Processing Systems (NeurIPS), 2022
arXiv conference version
We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an additive $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.
List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas.
Advances in Neural Information Processing Systems (NeurIPS), 2022 (Oral)
arXiv conference version
We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\mathbb R^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\hat \mu$ such that $\|\hat \mu - \mu\|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with ''certifiably bounded'' $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\textrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta (\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.
Robust Sparse Mean Estimation via Sum of Squares with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas.
Conference on Learning Theory (COLT), 2022
arXiv conference version
We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with `certifiably bounded' $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \textrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs to algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively best possible.
Streaming Algorithms for High-Dimensional Robust Statistics with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas.
International Conference on Machine Learning (ICML), 2022
arXiv conference version
We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements (up to logarithmic factors). Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model. We give an efficient single-pass streaming algorithm for this task with near-optimal error guarantees and space complexity nearly-linear in the dimension. As a corollary, we obtain streaming algorithms with near-optimal space complexity for several more complex tasks, including robust covariance estimation, robust regression, and more generally robust stochastic optimization.
Statistical Query Lower Bounds for List-Decodable Linear Regression with Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, and Alistair Stewart.
Advances in Neural Information Processing Systems (NeurIPS), 2021 (Spotlight)
arXiv conference version
We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x,y) \in \mathbb{R}^d\times \mathbb{R}$ and a parameter $0<\alpha<1/2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a linear regression model with Gaussian covariates, and the remaining $(1−\alpha)$-fraction of the points are drawn from an arbitrary noise distribution. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the target regression vector. Our main result is a Statistical Query (SQ) lower bound of $d^{\textrm{poly}(1/\alpha)}$ for this problem. Our SQ lower bound qualitatively matches the performance of previously developed algorithms, providing evidence that current upper bounds for this task are nearly best possible.
Outlier Robust Mean Estimation with Subgaussian Rates via Stability with Ilias Diakonikolas and Daniel M. Kane.
Advances in Neural Information Processing Systems (NeurIPS), 2020
arXiv conference version
We study the problem of outlier robust high-dimensional mean estimation under a finite covariance assumption, and more broadly under finite low-degree moment assumptions. We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition. As a corollary, it follows that a number of recently developed algorithms for robust mean estimation, including iterative filtering and non-convex gradient descent, give optimal error estimators with (near-)subgaussian rates. Previous analyses of these algorithms gave significantly suboptimal rates. As a corollary of our approach, we obtain the first computationally efficient algorithm with subgaussian rate for outlier-robust mean estimation in the strong contamination model under a finite covariance assumption.

02. Heavy-Tailed Estimation

Sharp guarantees when data may have weak moment assumptions, rare large deviations, or heavy tails.

Robust regression with covariate filtering: Heavy tails and adversarial contamination with Varun Jog and Po-Ling Loh.
Journal of the American Statistical Association (JASA), 2024
arXiv journal version Code
We study the problem of linear regression where both covariates and responses are potentially (i) heavy-tailed and (ii) adversarially contaminated. Several computationally efficient estimators have been proposed for the simpler setting where the covariates are sub-Gaussian and uncontaminated; however, these estimators may fail when the covariates are either heavy-tailed or contain outliers. In this work, we show how to modify the Huber regression, least trimmed squares, and least absolute deviation estimators to obtain estimators which are simultaneously computationally and statistically efficient in the stronger contamination model. Our approach is quite simple, and consists of applying a filtering algorithm to the covariates, and then applying the classical robust regression estimators to the remaining data. We show that the Huber regression estimator achieves near-optimal error rates in this setting, whereas the least trimmed squares and least absolute deviation estimators can be made to achieve near-optimal error after applying a postprocessing step.
Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions with Ilias Diakonikolas, Daniel M. Kane, and Jasper C.H. Lee.
Advances in Neural Information Processing Systems (NeurIPS), 2022
arXiv conference version
We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an additive $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.
Outlier Robust Mean Estimation with Subgaussian Rates via Stability with Ilias Diakonikolas and Daniel M. Kane.
Advances in Neural Information Processing Systems (NeurIPS), 2020
arXiv conference version
We study the problem of outlier robust high-dimensional mean estimation under a finite covariance assumption, and more broadly under finite low-degree moment assumptions. We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition. As a corollary, it follows that a number of recently developed algorithms for robust mean estimation, including iterative filtering and non-convex gradient descent, give optimal error estimators with (near-)subgaussian rates. Previous analyses of these algorithms gave significantly suboptimal rates. As a corollary of our approach, we obtain the first computationally efficient algorithm with subgaussian rate for outlier-robust mean estimation in the strong contamination model under a finite covariance assumption.

03. Information-Computation Gaps

Settings where the statistically optimal rate appears out of reach for computationally efficient algorithms.

Robust Regression with Adaptive Contamination in Response:
Optimal Rates and Computational Barriers with Ilias Diakonikolas, Chao Gao, Daniel M. Kane, and Dong Xie.
Manuscript, 2026
arXiv
We study robust regression under a contamination model in which covariates are clean while the responses may be corrupted in an adaptive manner. Unlike the classical Huber's contamination model, where both covariates and responses may be contaminated and consistent estimation is impossible when the contamination proportion is a non-vanishing constant, it turns out that the clean-covariate setting admits strictly improved statistical guarantees. Specifically, we show that the additional information in the clean covariates can be carefully exploited to construct an estimator that achieves a better estimation rate than that attainable under Huber contamination. In contrast to the Huber model, this improved rate implies consistency even when the contamination is a constant. A matching minimax lower bound is established using Fano's inequality together with the construction of contamination processes that match $m> 2$ distributions simultaneously, extending the previous two-point lower bound argument in Huber's setting. Despite the improvement over the Huber model from an information-theoretic perspective, we provide formal evidence---in the form of Statistical Query and Low-Degree Polynomial lower bounds---that the problem exhibits strong information-computation gaps. Our results strongly suggest that the information-theoretic improvements cannot be achieved by polynomial-time algorithms, revealing a fundamental gap between information-theoretic and computational limits in robust regression with clean covariates.
High-dimensional estimation with missing data:
Statistical and computational limits with Kabir A. Verchand, Saminul Haque, and Rohith Kuditipudi.
Manuscript, 2026
arXiv
We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $\epsilon$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $\rho$, for any constant contamination $\epsilon \in (0, 1)$, (roughly) $n \gtrsim d e^{1/\rho^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/\rho^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.
Information-Computation Tradeoffs for Noiseless Linear Regression with Oblivious Contamination with Ilias Diakonikolas, Chao Gao, Daniel M. Kane, and John Lafferty.
Advances in Neural Information Processing Systems (NeurIPS), 2025
arXiv conference version
We study the task of noiseless linear regression under Gaussian covariates in the presence of additive oblivious contamination. Specifically, we are given i.i.d. samples from a distribution $(x, y)$ on $\mathbb R^d \times \mathbb R$ with $x \sim \mathcal N(0,I_d)$ and $y = x^\top \beta + z$, where $z$ is drawn from an unknown distribution that is independent of $x$. Moreover, $z$ satisfies $\mathbb P[z = 0] = \alpha>0$. The goal is to accurately recover the regressor $\beta$ to small $\ell_2$-error. Ignoring computational considerations, this problem is known to be solvable using $O(d/\alpha)$ samples. On the other hand, the best known polynomial-time algorithms require $\Omega(d/\alpha^2)$ samples. Here we provide formal evidence that the quadratic dependence in $1/\alpha$ is inherent for efficient algorithms. Specifically, we show that any efficient Statistical Query algorithm for this task requires sample complexity at least $\tilde{\Omega}(d^{1/2}/\alpha^2)$.
SoS Certificates for Sparse Singular Values and Their Applications:
Robust Statistics, Subspace Distortion, and More with Ilias Diakonikolas, Samuel B. Hopkins, and Stefan Tiegel.
Symposium on Theory of Computing (STOC), 2025
arXiv conference version
We study \emph{sparse singular value certificates} for random rectangular matrices. If $M$ is an $n \times d$ matrix with independent Gaussian entries, we give a new family of polynomial-time algorithms which can certify upper bounds on the maximum of $\|M u\|$, where $u$ is a unit vector with at most $\eta n$ nonzero entries for a given $\eta \in (0,1)$. This basic algorithmic primitive lies at the heart of a wide range of problems across algorithmic statistics and theoretical computer science, including robust mean and covariance estimation, certification of distortion of random subspaces of $\mathbb{R}^n$, certification of the $2 \rightarrow p$ norm of a random matrix, and sparse principal component analysis.

Our algorithms certify a bound which is asymptotically smaller than the naive one, given by the maximum singular value of $M$, for nearly the widest-possible range of $n,d,$ and $\eta$. Efficiently certifying such a bound for a range of $n,d$ and $\eta$ which is larger by any polynomial factor than what is achieved by our algorithm would violate lower bounds in the statistical query and low-degree polynomials models. Our certification algorithm makes essential use of the Sum-of-Squares hierarchy. To prove the correctness of our algorithm, we develop a new combinatorial connection between the \emph{graph matrix} approach to analyze random matrices with dependent entries, and the Efron-Stein decomposition of functions of independent random variables.

As applications of our certification algorithm, we obtain new efficient algorithms for a wide range of well-studied algorithmic tasks. In algorithmic robust statistics, we obtain new algorithms for robust mean and covariance estimation with tradeoffs between breakdown point and sample complexity, which are nearly matched by statistical query and low-degree polynomial lower bounds (that we establish). We also obtain new polynomial-time guarantees for certification of $\ell_1/\ell_2$ distortion of random subspaces of $\mathbb{R}^n$ (also with nearly matching lower bounds), sparse principal component analysis, and certification of the $2\rightarrow p$ norm of a random matrix.
Robust Sparse Mean Estimation via Sum of Squares with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas.
Conference on Learning Theory (COLT), 2022
arXiv conference version
We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with `certifiably bounded' $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \textrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs to algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively best possible.
Statistical Query Lower Bounds for List-Decodable Linear Regression with Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, and Alistair Stewart.
Advances in Neural Information Processing Systems (NeurIPS), 2021 (Spotlight)
arXiv conference version
We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x,y) \in \mathbb{R}^d\times \mathbb{R}$ and a parameter $0<\alpha<1/2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a linear regression model with Gaussian covariates, and the remaining $(1−\alpha)$-fraction of the points are drawn from an arbitrary noise distribution. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the target regression vector. Our main result is a Statistical Query (SQ) lower bound of $d^{\textrm{poly}(1/\alpha)}$ for this problem. Our SQ lower bound qualitatively matches the performance of previously developed algorithms, providing evidence that current upper bounds for this task are nearly best possible.

04. Hypothesis Testing

Sample complexity and algorithmic limits for distinguishing distributions, including robust and constrained settings.

On the Sample Complexity of Robust Binary Hypothesis Testing Shankar Vallinayagam, Ankit Pensia, and Varun Jog.
Manuscript, 2026
arXiv
We study the sample complexity of robust binary hypothesis testing under three standard contamination models: $\epsilon$-additive (Huber), $\epsilon$-subtractive, and $\epsilon$-total variation (TV), denoted by $n^*_{\mathrm{Hub}}(\epsilon)$, $n^*_{\mathrm{Sub}}(\epsilon)$, and $n^*_{\mathrm{TV}}(\epsilon)$, respectively. For subtractive contamination, we show that least favourable distributions exist and provide explicit formulas for the same, bringing this model in line with the classical Huber and TV models. Next we show that in all three models, sample complexity may be highly unstable in the contamination parameter $\epsilon$, increasing by polynomial factors even for $o(\epsilon)$ perturbations. Similarly, there may be polynomial factor gaps between the sample complexities when $\epsilon$ is known exactly versus when it is known up to $o(\epsilon)$ error. Despite the instability of the sample complexity in all models, we show that the sample complexities across models are comparable up to constant-factor rescaling of $\epsilon$. Specifically, for any fixed $\delta_0>0$, the following hold for all distributions $p$ and $q$: (i) $n^*_{\mathrm{Hub}}(\epsilon) \lesssim n^*_{\mathrm{TV}}(\epsilon) \lesssim n^*_{\mathrm{Hub}}(2\epsilon)$, (ii) $n^*_{\mathrm{Sub}}(\epsilon) \lesssim n^*_{\mathrm{TV}}(\epsilon) \lesssim n^*_{\mathrm{Sub}}((2+\delta_0)\epsilon)$, and (iii) $n^*_{\mathrm{Sub}}(\epsilon) \lesssim n^*_{\mathrm{Hub}}(\epsilon) \lesssim n^*_{\mathrm{Sub}}((1+\delta_0)\epsilon)$, and the scaling constants are tight. Finally, we extend our results to adaptive versions of the contamination models.
The Sample Complexity of Simple Binary Hypothesis Testing:
Tight Bounds with Sequential Interactivity and Information Constraints Hadi Kazemi, Ankit Pensia, and Varun Jog.
Conference on Learning Theory (COLT), 2025
arXiv conference version
This paper resolves two open problems from a recent paper [Pensia, Jog, and Loh, COLT 2024] concerning the sample complexity of distributed simple binary hypothesis testing under information constraints. The first open problem asks whether interaction reduces the sample complexity of distributed simple binary hypothesis testing. In this paper, we show that sequential interaction does not help. The second problem suggests tightening existing sample complexity bounds for communication-constrained simple binary hypothesis testing. We derive optimally tight bounds for this setting and resolve this problem. Our main technical contributions are: (i) a one-shot lower bound on the Bayes error in simple binary hypothesis testing that tensorises; (ii) a streamlined proof of the formula for the sample complexity of simple binary hypothesis testing without constraints, first established in [PJL, COLT 2024]; and (iii) a reverse data-processing inequality for Hellinger-$\lambda$ divergences, generalising the results from [Bhatt, Nazer, Ordentlich, and Polyanskiy, IEEE Transactions of Information Theory 2021] and [Pensia, Jog, and Loh, IEEE Transactions of Information Theory 2024].
The Sample Complexity of Simple Binary Hypothesis Testing with Varun Jog and Po-Ling Loh.
Conference on Learning Theory (COLT), 2024
arXiv conference version slides (10min)
The sample complexity of simple binary hypothesis testing is the smallest number of i.i.d. samples required to distinguish between two distributions $p$ and $q$ in either: (i) the prior-free setting, with type-I error at most $\alpha$ and type-II error at most $\beta$; or (ii) the Bayesian setting, with Bayes error at most $\delta$ and prior distribution $(\alpha, 1-\alpha)$. This problem has only been studied when $\alpha = \beta$ (prior-free) or $\alpha = 1/2$ (Bayesian), and the sample complexity is known to be characterized by the Hellinger divergence between $p$ and $q$, up to multiplicative constants. In this paper, we derive a formula that characterizes the sample complexity (up to multiplicative constants that are independent of $p$, $q$, and all error parameters) for: (i) all $0 \le \alpha, \beta \le 1/8$ in the prior-free setting; and (ii) all $\delta \le \alpha/4$ in the Bayesian setting. In particular, the formula admits equivalent expressions in terms of certain divergences from the Jensen--Shannon and Hellinger families. The main technical result concerns an $f$-divergence inequality between members of the Jensen--Shannon and Hellinger families, which is proved by a combination of information-theoretic tools and case-by-case analyses. We explore applications of our results to robust and distributed hypothesis testing.
Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints with Amir R. Asadi, Varun Jog, and Po-Ling Loh.
IEEE Transactions on Information Theory (Trans. Inf. Theory), 2024
An extended abstract appeared at Conference on Learning Theory (COLT), 2023
arXiv journal version slides (20min)slides (1hr)Code
We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal{A}$ of two distributions $p$ and $q$, defined as $\mathcal{A} := \{(\mathbf{T} p, \mathbf{T} q) : \mathbf{T} \in \mathcal{C}\}$, where $\mathcal{C}$ is the set of channels characterizing the constraints.
Communication-constrained hypothesis testing:
Optimality, robustness, and reverse data processing inequalities with Varun Jog and Po-Ling Loh.
IEEE Transactions on Information Theory (Trans. Inf. Theory), 2024
A shorter version appeared at ISIT 2022
arXiv journal version
We study hypothesis testing under communication constraints, where each sample is quantized before being revealed to a statistician. Without communication constraints, it is well known that the sample complexity of simple binary hypothesis testing is characterized by the Hellinger distance between the distributions. We show that the sample complexity of simple binary hypothesis testing under communication constraints is at most a logarithmic factor larger than in the unconstrained setting and this bound is tight. We develop a polynomial-time algorithm that achieves the aforementioned sample complexity. Our framework extends to robust hypothesis testing, where the distributions are corrupted in the total variation distance. Our proofs rely on a new reverse data processing inequality and a reverse Markov inequality, which may be of independent interest. For simple $M$-ary hypothesis testing, the sample complexity in the absence of communication constraints has a logarithmic dependence on $M$. We show that communication constraints can cause an exponential blow-up leading to $\Omega(M)$ sample complexity even for adaptive algorithms.
Gaussian Mean Testing Made Simple with Ilias Diakonikolas and Daniel M. Kane.
SIAM Symposium on Simplicity in Algorithms (SOSA), 2023
arXiv conference version
We study the following fundamental hypothesis testing problem, which we term Gaussian mean testing. Given i.i.d. samples from a distribution $p$ on $\mathbb{R}^d$, the task is to distinguish, with high probability, between the following cases: (i) $p$ is the standard Gaussian distribution, $\mathcal{N}(0,I_d)$, and (ii) $p$ is a Gaussian $\mathcal{N}(\mu,\Sigma)$ for some unknown covariance $\Sigma$ and mean $\mu \in \mathbb{R}^d$ satisfying $\|\mu\|_2 \geq \epsilon$. Recent work gave an algorithm for this testing problem with the optimal sample complexity of $\Theta(\sqrt{d}/\epsilon^2)$. Both the previous algorithm and its analysis are quite complicated. Here we give an extremely simple algorithm for Gaussian mean testing with a one-page analysis. Our algorithm is sample optimal and runs in sample linear time.

05. Resource Constraints

Inference under limits on memory, communication, privacy, interaction, or streaming access.

The Sample Complexity of Simple Binary Hypothesis Testing:
Tight Bounds with Sequential Interactivity and Information Constraints Hadi Kazemi, Ankit Pensia, and Varun Jog.
Conference on Learning Theory (COLT), 2025
arXiv conference version
This paper resolves two open problems from a recent paper [Pensia, Jog, and Loh, COLT 2024] concerning the sample complexity of distributed simple binary hypothesis testing under information constraints. The first open problem asks whether interaction reduces the sample complexity of distributed simple binary hypothesis testing. In this paper, we show that sequential interaction does not help. The second problem suggests tightening existing sample complexity bounds for communication-constrained simple binary hypothesis testing. We derive optimally tight bounds for this setting and resolve this problem. Our main technical contributions are: (i) a one-shot lower bound on the Bayes error in simple binary hypothesis testing that tensorises; (ii) a streamlined proof of the formula for the sample complexity of simple binary hypothesis testing without constraints, first established in [PJL, COLT 2024]; and (iii) a reverse data-processing inequality for Hellinger-$\lambda$ divergences, generalising the results from [Bhatt, Nazer, Ordentlich, and Polyanskiy, IEEE Transactions of Information Theory 2021] and [Pensia, Jog, and Loh, IEEE Transactions of Information Theory 2024].
Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints with Amir R. Asadi, Varun Jog, and Po-Ling Loh.
IEEE Transactions on Information Theory (Trans. Inf. Theory), 2024
An extended abstract appeared at Conference on Learning Theory (COLT), 2023
arXiv journal version slides (20min)slides (1hr)Code
We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal{A}$ of two distributions $p$ and $q$, defined as $\mathcal{A} := \{(\mathbf{T} p, \mathbf{T} q) : \mathbf{T} \in \mathcal{C}\}$, where $\mathcal{C}$ is the set of channels characterizing the constraints.
Communication-constrained hypothesis testing:
Optimality, robustness, and reverse data processing inequalities with Varun Jog and Po-Ling Loh.
IEEE Transactions on Information Theory (Trans. Inf. Theory), 2024
A shorter version appeared at ISIT 2022
arXiv journal version
We study hypothesis testing under communication constraints, where each sample is quantized before being revealed to a statistician. Without communication constraints, it is well known that the sample complexity of simple binary hypothesis testing is characterized by the Hellinger distance between the distributions. We show that the sample complexity of simple binary hypothesis testing under communication constraints is at most a logarithmic factor larger than in the unconstrained setting and this bound is tight. We develop a polynomial-time algorithm that achieves the aforementioned sample complexity. Our framework extends to robust hypothesis testing, where the distributions are corrupted in the total variation distance. Our proofs rely on a new reverse data processing inequality and a reverse Markov inequality, which may be of independent interest. For simple $M$-ary hypothesis testing, the sample complexity in the absence of communication constraints has a logarithmic dependence on $M$. We show that communication constraints can cause an exponential blow-up leading to $\Omega(M)$ sample complexity even for adaptive algorithms.
Nearly-Linear Time and Streaming Algorithms for Outlier-Robust PCA with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas.
International Conference on Machine Learning (ICML), 2023
arXiv conference version
We study principal component analysis (PCA), where given a dataset in $\mathbb R^d$ from a distribution, the task is to find a unit vector $v$ that approximately maximizes the variance of the distribution after being projected along $v$. Despite being a classical task, standard estimators fail drastically if the data contains even a small fraction of outliers, motivating the problem of robust PCA. Recent work has developed computationally-efficient algorithms for robust PCA that either take super-linear time or have sub-optimal error guarantees. Our main contribution is to develop a nearly linear time algorithm for robust PCA with near-optimal error guarantees. We also develop a single-pass streaming algorithm for robust PCA with memory usage nearly-linear in the dimension.
Streaming Algorithms for High-Dimensional Robust Statistics with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas.
International Conference on Machine Learning (ICML), 2022
arXiv conference version
We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements (up to logarithmic factors). Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model. We give an efficient single-pass streaming algorithm for this task with near-optimal error guarantees and space complexity nearly-linear in the dimension. As a corollary, we obtain streaming algorithms with near-optimal space complexity for several more complex tasks, including robust covariance estimation, robust regression, and more generally robust stochastic optimization.

06. Beyond-Worst-Case Outliers

Structured corruption models that refine classical adversarial contamination and capture more realistic outlier behavior.

Robust Regression with Adaptive Contamination in Response:
Optimal Rates and Computational Barriers with Ilias Diakonikolas, Chao Gao, Daniel M. Kane, and Dong Xie.
Manuscript, 2026
arXiv
We study robust regression under a contamination model in which covariates are clean while the responses may be corrupted in an adaptive manner. Unlike the classical Huber's contamination model, where both covariates and responses may be contaminated and consistent estimation is impossible when the contamination proportion is a non-vanishing constant, it turns out that the clean-covariate setting admits strictly improved statistical guarantees. Specifically, we show that the additional information in the clean covariates can be carefully exploited to construct an estimator that achieves a better estimation rate than that attainable under Huber contamination. In contrast to the Huber model, this improved rate implies consistency even when the contamination is a constant. A matching minimax lower bound is established using Fano's inequality together with the construction of contamination processes that match $m> 2$ distributions simultaneously, extending the previous two-point lower bound argument in Huber's setting. Despite the improvement over the Huber model from an information-theoretic perspective, we provide formal evidence---in the form of Statistical Query and Low-Degree Polynomial lower bounds---that the problem exhibits strong information-computation gaps. Our results strongly suggest that the information-theoretic improvements cannot be achieved by polynomial-time algorithms, revealing a fundamental gap between information-theoretic and computational limits in robust regression with clean covariates.
High-dimensional estimation with missing data:
Statistical and computational limits with Kabir A. Verchand, Saminul Haque, and Rohith Kuditipudi.
Manuscript, 2026
arXiv
We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $\epsilon$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $\rho$, for any constant contamination $\epsilon \in (0, 1)$, (roughly) $n \gtrsim d e^{1/\rho^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/\rho^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.
Information-Computation Tradeoffs for Noiseless Linear Regression with Oblivious Contamination with Ilias Diakonikolas, Chao Gao, Daniel M. Kane, and John Lafferty.
Advances in Neural Information Processing Systems (NeurIPS), 2025
arXiv conference version
We study the task of noiseless linear regression under Gaussian covariates in the presence of additive oblivious contamination. Specifically, we are given i.i.d. samples from a distribution $(x, y)$ on $\mathbb R^d \times \mathbb R$ with $x \sim \mathcal N(0,I_d)$ and $y = x^\top \beta + z$, where $z$ is drawn from an unknown distribution that is independent of $x$. Moreover, $z$ satisfies $\mathbb P[z = 0] = \alpha>0$. The goal is to accurately recover the regressor $\beta$ to small $\ell_2$-error. Ignoring computational considerations, this problem is known to be solvable using $O(d/\alpha)$ samples. On the other hand, the best known polynomial-time algorithms require $\Omega(d/\alpha^2)$ samples. Here we provide formal evidence that the quadratic dependence in $1/\alpha$ is inherent for efficient algorithms. Specifically, we show that any efficient Statistical Query algorithm for this task requires sample complexity at least $\tilde{\Omega}(d^{1/2}/\alpha^2)$.
Optimal Robust Estimation under Local and Global Corruptions:
Stronger Adversary and Smaller Error with Thanasis Pittas.
Conference on Learning Theory (COLT), 2025
arXiv conference version
Algorithmic robust statistics has traditionally focused on the contamination model where a small fraction of the samples are arbitrarily corrupted. We consider a recent contamination model that combines two kinds of corruptions: (i) small fraction of arbitrary outliers, as in classical robust statistics, and (ii) local perturbations, where samples may undergo bounded shifts on average. While each noise model is well understood individually, the combined contamination model poses new algorithmic challenges, with only partial results known. Existing efficient algorithms are limited in two ways: (i) they work only for a weak notion of local perturbations, and (ii) they obtain suboptimal error for isotropic subgaussian distributions (among others). The latter limitation led \cite{NieGS24} to hypothesize that improving the error might, in fact, be computationally hard. Perhaps surprisingly, we show that information theoretically optimal error can indeed be achieved in polynomial time, under an even \emph{stronger} local perturbation model (the sliced-Wasserstein metric as opposed to the Wasserstein metric). Notably, our analysis reveals that the entire family of stability-based robust mean estimators continues to work optimally in a black-box manner for the combined contamination model. This generalization is particularly useful in real-world scenarios where the specific form of data corruption is not known in advance. We also present efficient algorithms for distribution learning and principal component analysis in the combined contamination model.

07. Fast Algorithms

Scalable methods, such as near-linear, streaming, or subquadratic algorithms, beyond mere polynomial-time feasibility.

A Sub-Quadratic Time Algorithm for Robust Sparse Mean Estimation Ankit Pensia.
International Conference on Machine Learning (ICML), 2024 (Spotlight)
arXiv conference version slides (of a survey; 1hr)
We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a \emph{corrupted} set of samples from $\mathcal{N}(\mu,\mathbf{I}_d)$, where the unknown mean $\mu \in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity $\mathrm{poly}(k,\log d, 1/\epsilon)$ and runtime $d^2 \mathrm{poly}(k,\log d,1/\epsilon)$, where $\epsilon$ is the fraction of contamination. In particular, the fastest runtime of existing algorithms is quadratic $\Omega(d^2)$, which can be prohibitive in high dimensions. This quadratic barrier in the runtime stems from the reliance of these algorithms on the sample covariance matrix, which is of size $d^2$. Our main contribution is an algorithm for robust sparse mean estimation which runs in \emph{subquadratic} time using $\mathrm{poly}(k,\log d,1/\epsilon)$ samples. We also provide analogous results for robust sparse PCA. Our results build on algorithmic advances in detecting weak correlations, a generalized version of the light-bulb problem by Valiant~\cite{Valiant15}.
Black-Box $k$-to-1-PCA Reductions: Theory and Applications with Arun Jambulapati, Syamantak Kumar, Jerry Li, Shourya Pandey, and Kevin Tian.
Conference on Learning Theory (COLT), 2024
arXiv conference version slides (10min)
The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely-used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, which we only have implicit access to via samples. Motivated by these implicit settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms, where we model access to the unknown target matrix via a black-box $1$-PCA oracle which returns an approximate top eigenvector, under two popular notions of approximation. Despite being arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly-understood.

Our main contribution is significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show deflation methods suffer no parameter loss. For an alternative well-studied approximation notion we term cPCA (correlation PCA), we tightly characterize the parameter regimes where deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms robust to dataset contamination, improving prior work both in sample complexity and approximation quality.
Near-Optimal Algorithms for Gaussians with Huber Contamination:
Mean Estimation and Linear Regression with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas.
Advances in Neural Information Processing Systems (NeurIPS), 2023
arXiv conference version
We study the fundamental problems of Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination. Our main contribution is the design of the first sample near-optimal and almost linear-time algorithms with optimal error guarantees for both these problems. Specifically, for Gaussian robust mean estimation on $\mathbb{R}^d$ with contamination parameter $\epsilon \in (0,\epsilon_0)$ for a small absolute constant $\epsilon_0$, we give an algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target mean within $\ell_2$-error $O(\eps)$. This improves on prior work that achieved this error guarantee with polynomially suboptimal sample and time complexity. For robust linear regression, we give the first algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target regressor within $\ell_2$-error $O(\epsilon)$. This is the first polynomial sample and time algorithm achieving the optimal error guarantee, answering an open question in the literature. At the technical level, we develop a methodology that yields almost-linear time algorithms for multi-directional filtering that may be of broader interest.
Nearly-Linear Time and Streaming Algorithms for Outlier-Robust PCA with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas.
International Conference on Machine Learning (ICML), 2023
arXiv conference version
We study principal component analysis (PCA), where given a dataset in $\mathbb R^d$ from a distribution, the task is to find a unit vector $v$ that approximately maximizes the variance of the distribution after being projected along $v$. Despite being a classical task, standard estimators fail drastically if the data contains even a small fraction of outliers, motivating the problem of robust PCA. Recent work has developed computationally-efficient algorithms for robust PCA that either take super-linear time or have sub-optimal error guarantees. Our main contribution is to develop a nearly linear time algorithm for robust PCA with near-optimal error guarantees. We also develop a single-pass streaming algorithm for robust PCA with memory usage nearly-linear in the dimension.
Streaming Algorithms for High-Dimensional Robust Statistics with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas.
International Conference on Machine Learning (ICML), 2022
arXiv conference version
We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements (up to logarithmic factors). Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model. We give an efficient single-pass streaming algorithm for this task with near-optimal error guarantees and space complexity nearly-linear in the dimension. As a corollary, we obtain streaming algorithms with near-optimal space complexity for several more complex tasks, including robust covariance estimation, robust regression, and more generally robust stochastic optimization.

08. Machine Learning

Learning-theoretic questions around generalization, robustness, representations, and neural-network structure.

Black-Box $k$-to-1-PCA Reductions: Theory and Applications with Arun Jambulapati, Syamantak Kumar, Jerry Li, Shourya Pandey, and Kevin Tian.
Conference on Learning Theory (COLT), 2024
arXiv conference version slides (10min)
The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely-used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, which we only have implicit access to via samples. Motivated by these implicit settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms, where we model access to the unknown target matrix via a black-box $1$-PCA oracle which returns an approximate top eigenvector, under two popular notions of approximation. Despite being arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly-understood.

Our main contribution is significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show deflation methods suffer no parameter loss. For an alternative well-studied approximation notion we term cPCA (correlation PCA), we tightly characterize the parameter regimes where deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms robust to dataset contamination, improving prior work both in sample complexity and approximation quality.
Semi-supervised Group DRO: Combating Sparsity with Unlabeled Data with Pranjal Awasthi and Satyen Kale.
International Conference on Algorithmic Learning Theory (ALT), 2024
conference version
In this work we formulate the problem of group distributionally robust optimization (DRO) in a semi-supervised setting. Motivated by applications in robustness and fairness, the goal in group DRO is to learn a hypothesis that minimizes the worst case performance over a pre-specified set of groups defined over the data distribution. In contrast to existing work that assumes access to labeled data from each of the groups, we consider the practical setting where many groups may have little to no amount of labeled data.
We design near optimal learning algorithms in this setting by leveraging the unlabeled data from different groups. The performance of our algorithms can be characterized in terms of a natural quantity that captures the similarity among the various groups and the maximum {m best-in-class} error among the groups. Furthermore, for the special case of squared loss and a convex function class we show that the dependence on the best-in-class error can be avoided. We also derive sample complexity bounds for our proposed semi-supervised algorithm.
Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient with Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos.
Advances in Neural Information Processing Systems (NeurIPS), 2020 (Spotlight)
arXiv conference version
The strong /lottery ticket hypothesis/ (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. (2020) establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width $d$ and depth $l$, by pruning a random one that is a factor $O(d^4l^2)$ wider and twice as deep. This polynomial over-parameterization requirement is at odds with recent experimental research that achieves good approximation with networks that are a small factor wider than the target. In this work, we close the gap and offer an exponential improvement to the over-parameterization requirement for the existence of lottery tickets. We show that any target network of width d and depth l can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep. Our analysis heavily relies on connecting pruning random ReLU networks to random instances of the +SubsetSum+ problem. We then show that this logarithmic over-parameterization is essentially optimal for constant depth networks. Finally, we verify several of our theoretical insights with experiments.
Extracting robust and accurate features via a robust information bottleneck with Varun Jog and Po-Ling Loh.
IEEE Journal on Selected Areas in Information Theory (JSAIT), 2020
journal version PDF
We propose a novel strategy for extracting features in supervised learning that can be used to construct a classifier which is more robust to small perturbations in the input space. Our method builds upon the idea of the information bottleneck, by introducing an additional penalty term that encourages the Fisher information of the extracted features to be small when parametrized by the inputs. We present two formulations where the relevance of the features to output labels is measured using either mutual information or MMSE. By tuning the regularization parameter, we can explicitly trade off the opposing desiderata of robustness and accuracy when constructing a classifier. We derive optimal solutions to both robust information bottleneck formulations when the inputs and outputs are jointly Gaussian, proving that the optimally robust features are also jointly Gaussian in this setting. We also propose methods for optimizing variational bounds on the robust information bottleneck objectives in general settings using stochastic gradient descent, which may be implemented efficiently in neural networks. Our experimental results for synthetic and real data sets show that the proposed feature extraction methods indeed produce classifiers with increased robustness to perturbations.
Deep Topic Models for Multi-label Learning with Rajat Panda, Nikhil Mehta, Mingyuan Zhou, and Piyush Rai.
International Conference on Artificial Intelligence and Statistics (AISTATS), 2019
conference version
We present a probabilistic framework for multi-label learning based on a deep generative model for the binary label vector associated with each observation. Our generative model learns deep multi-layer latent embeddings of the binary label vector, which are conditioned on the input features of the observation. The model also has an interesting interpretation in terms of a deep topic model, with each label vector representing a bag-of-words document, with the input features being its meta-data. In addition to capturing the structural properties of the label space (e.g., a near-low-rank label matrix), the model also offers a clean, geometric interpretation. In particular, the nonlinear classification boundaries learned by the model can be seen as the union of multiple convex polytopes. Our model admits a simple and scalable inference via efficient Gibbs sampling or EM algorithm. We compare our model with state-of-the-art baselines for multi-label learning on benchmark data sets, and also report some interesting qualitative results.
Generalization Error Bounds for Noisy, Iterative Algorithms with Varun Jog and Po-Ling Loh.
IEEE International Symposium on Information Theory (ISIT), 2018
arXiv conference version
In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. Recent work [Xu and Raginsky (2017)] has established a bound on the generalization error of empirical risk minimization based on the mutual information $I(S;W)$ between the algorithm input $S$ and the algorithm output $W$, when the loss function is sub-Gaussian. We leverage these results to derive generalization error bounds for a broad class of iterative algorithms that are characterized by bounded, noisy updates with Markovian structure. Our bounds are very general and are applicable to numerous settings of interest, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm. Furthermore, our error bounds hold for any output function computed over the path of iterates, including the last iterate of the algorithm or the average of subsets of iterates, and also allow for non-uniform sampling of data in successive updates of the algorithm.

09. Probability and Concentration Inequalities

High-dimensional probability and concentration tools underlying statistical guarantees.

SoS Certifiability of Subgaussian Distributions and its Algorithmic Applications with Ilias Diakonikolas, Samuel B. Hopkins, and Stefan Tiegel.
Symposium on Theory of Computing (STOC), 2025
arXiv conference version slides (10min)slides (25min)
We prove that there is a universal constant $C>0$ so that for every $d \in \mathbb N$, every centered subgaussian distribution $\mathcal D$ on $\mathbb R^d$, and every even $p \in \mathbb N$, the $d$-variate polynomial $(Cp)^{p/2} \cdot \|v\|_{2}^p - \mathbb E_{X \sim \mathcal D} \langle v,X\rangle^p$ is a sum of square polynomials. This establishes that every subgaussian distribution is \emph{SoS-certifiably subgaussian}---a condition that yields efficient learning algorithms for a wide variety of high-dimensional statistical tasks. As a direct corollary, we obtain computationally efficient algorithms with near-optimal guarantees for the following tasks, when given samples from an arbitrary subgaussian distribution: robust mean estimation, list-decodable mean estimation, clustering mean-separated mixture models, robust covariance-aware mean estimation, robust covariance estimation, and robust linear regression. Our proof makes essential use of Talagrand's generic chaining/majorizing measures theorem.
Sharp Concentration Inequalities for the Centered Relative Entropy with Alankrita Bhatt.
Information and Inference: a Journal of the IMA, 2022
arXiv journal version
We study the relative entropy between the empirical estimate of a discrete distribution and the true underlying distribution. If the minimum value of the probability mass function exceeds an $\alpha>0$ (i.e. when the true underlying distribution is bounded sufficiently away from the boundary of the simplex), we prove an upper bound on the moment generating function of the centered relative entropy that matches (up to logarithmic factors in the alphabet size and $\alpha$) the optimal asymptotic rates, subsequently leading to a sharp concentration inequality for the centered relative entropy. As a corollary of this result we also obtain confidence intervals and moment bounds for the centered relative entropy that are sharp up to logarithmic factors in the alphabet size and $\alpha.$

10. Sparse Estimation

High-dimensional estimation and testing when the signal has sparse or otherwise low-complexity structure.

A Sub-Quadratic Time Algorithm for Robust Sparse Mean Estimation Ankit Pensia.
International Conference on Machine Learning (ICML), 2024 (Spotlight)
arXiv conference version slides (of a survey; 1hr)
We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a \emph{corrupted} set of samples from $\mathcal{N}(\mu,\mathbf{I}_d)$, where the unknown mean $\mu \in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity $\mathrm{poly}(k,\log d, 1/\epsilon)$ and runtime $d^2 \mathrm{poly}(k,\log d,1/\epsilon)$, where $\epsilon$ is the fraction of contamination. In particular, the fastest runtime of existing algorithms is quadratic $\Omega(d^2)$, which can be prohibitive in high dimensions. This quadratic barrier in the runtime stems from the reliance of these algorithms on the sample covariance matrix, which is of size $d^2$. Our main contribution is an algorithm for robust sparse mean estimation which runs in \emph{subquadratic} time using $\mathrm{poly}(k,\log d,1/\epsilon)$ samples. We also provide analogous results for robust sparse PCA. Our results build on algorithmic advances in detecting weak correlations, a generalized version of the light-bulb problem by Valiant~\cite{Valiant15}.
Robust Sparse Estimation for Gaussians with Optimal Error under Huber Contamination with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas.
International Conference on Machine Learning (ICML), 2024
arXiv conference version
We study Gaussian sparse estimation tasks in Huber's contamination model with a focus on mean estimation, PCA, and linear regression. For each of these tasks, we give the first sample and computationally efficient robust estimators with optimal error guarantees, within constant factors. All prior efficient algorithms for these tasks incur quantitatively suboptimal error. Concretely, for Gaussian robust $k$-sparse mean estimation on $\mathbb{R}^d$ with corruption rate $\epsilon>0$, our algorithm has sample complexity $(k^2/\epsilon^2)\mathrm{polylog}(d/\epsilon)$, runs in sample polynomial time, and approximates the target mean within $\ell_2$-error $O(\epsilon)$. Previous efficient algorithms inherently incur error $\Omega(\epsilon \sqrt{\log(1/\epsilon)})$. At the technical level, we develop a novel multidimensional filtering method in the sparse regime that may find other applications.
Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions with Ilias Diakonikolas, Daniel M. Kane, and Jasper C.H. Lee.
Advances in Neural Information Processing Systems (NeurIPS), 2022
arXiv conference version
We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an additive $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.
List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas.
Advances in Neural Information Processing Systems (NeurIPS), 2022 (Oral)
arXiv conference version
We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\mathbb R^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\hat \mu$ such that $\|\hat \mu - \mu\|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with ''certifiably bounded'' $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\textrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta (\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.
Robust Sparse Mean Estimation via Sum of Squares with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas.
Conference on Learning Theory (COLT), 2022
arXiv conference version
We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with `certifiably bounded' $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \textrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs to algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively best possible.

11. Heterogeneity

Inference when samples come from non-identical populations, mixtures, missingness, or other heterogeneous sources.

A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm with Ilias Diakonikolas, Daniel M. Kane, Jasper C.H. Lee, and Thanasis Pittas.
Advances in Neural Information Processing Systems (NeurIPS), 2023 (Spotlight)
arXiv conference version
We study the problem of list-decodable Gaussian covariance estimation. Given a multiset $T$ of $n$ points in $\mathbb{R}^d$ such that an unknown $\alpha<1/2$ fraction of points in $T$ are i.i.d. samples from an unknown Gaussian $\mathcal N(\mu ,\Sigma)$, the goal is to output a list of $O(1/\alpha)$ hypotheses at least one of which is close to $\Sigma$ in relative Frobenius norm. Our main result is a $\mathrm{poly}(d,1/\alpha)$ sample and time algorithm for this task that guarantees relative Frobenius norm error of $\mathrm{poly}(1/\alpha)$. Importantly, our algorithm relies purely on spectral techniques. As a corollary, we obtain an efficient spectral algorithm for robust partial clustering of Gaussian mixture models (GMMs) -- a key ingredient in the recent work of [BDJ+22] on robustly learning arbitrary GMMs. Combined with the other components of [BDJ+22], our new method yields the first Sum-of-Squares-free algorithm for robustly learning GMMs. At the technical level, we develop a novel multi-filtering method for list-decodable covariance estimation that may be useful in other settings.
List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas.
Advances in Neural Information Processing Systems (NeurIPS), 2022 (Oral)
arXiv conference version
We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\mathbb R^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\hat \mu$ such that $\|\hat \mu - \mu\|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with ''certifiably bounded'' $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\textrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta (\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.
Statistical Query Lower Bounds for List-Decodable Linear Regression with Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, and Alistair Stewart.
Advances in Neural Information Processing Systems (NeurIPS), 2021 (Spotlight)
arXiv conference version
We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x,y) \in \mathbb{R}^d\times \mathbb{R}$ and a parameter $0<\alpha<1/2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a linear regression model with Gaussian covariates, and the remaining $(1−\alpha)$-fraction of the points are drawn from an arbitrary noise distribution. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the target regression vector. Our main result is a Statistical Query (SQ) lower bound of $d^{\textrm{poly}(1/\alpha)}$ for this problem. Our SQ lower bound qualitatively matches the performance of previously developed algorithms, providing evidence that current upper bounds for this task are nearly best possible.
Estimating location parameters in sample-heterogeneous distributions with Varun Jog and Po-Ling Loh.
Information and Inference: a Journal of the IMA, 2021
A shorter version of this article appeared at ISIT 2019
arXiv journal version PDF
Estimating the mean of a probability distribution using i.i.d. samples is a classical problem in statistics, wherein finite-sample optimal estimators are sought under various distributional assumptions. In this paper, we consider the problem of mean estimation when independent samples are drawn from $d$-dimensional non-identical distributions possessing a common mean. When the distributions are radially symmetric and unimodal, we propose a novel estimator, which is a hybrid of the modal interval, shorth, and median estimators, and whose performance adapts to the level of heterogeneity in the data. We show that our estimator is near-optimal when data are i.i.d. and when the fraction of ``low-noise'' distributions is as small as $\Omega\left(\frac{d \log n}{n}\right)$, where $n$ is the number of samples. We also derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we extend our theory to linear regression. In both the mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points.