With recent advances in technology, it is now possible to measure and record large numbers of features on a single individual. The volume, velocity, and variety (the "3Vs") of Big Data pose significant challenges for modeling and analyzing these massive datasets. For example, to understand cancer at the genetic level, researchers need to detect rare and weak signals from thousands, or even millions, of candidate genetic markers obtained from a limited number of subjects. Existing methods typically assume that the number of subjects is very large, an assumption often violated in practice. The main goal of this project is to develop efficient methods for extremely high-dimensional, small-sample-size data. The methodological advances will be extremely valuable in addressing Big Data challenges in areas such as medical research, bioinformatics, financial analysis, and astronomical image analysis. Efficient software packages and algorithms implementing the proposed methods will be developed and made publicly available.
The key innovative idea motivating this research is viewing a high-dimensional problem from a novel packing perspective, which allows the number of variables, p, to be arbitrarily large and the number of observations, n, to be finite. The proposed research will systematically investigate three fundamental problems under this “finite n, arbitrarily large p” paradigm: (1) asymptotic theory of spurious correlations, (2) fast detection of low-rank correlation structures, and (3) detection boundary and optimal testing procedures for detecting rare and weak signals. This research will transform the current asymptotic framework, transitioning from the regimes of “large n, small p” and “large n, larger p” to the regime of “finite n, arbitrarily large p”.
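The spurious-correlation phenomenon central to problem (1) can be illustrated numerically: when the sample size n is held fixed and the number of variables p grows, the largest sample correlation between the response and a set of predictors that are all independent of it drifts toward 1. The sketch below is purely illustrative (the sample size n = 20 and the values of p are arbitrary choices, not parameters from the proposal):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20  # small, fixed sample size


def max_spurious_corr(p, rng):
    """Largest |sample correlation| between y and p predictors
    that are, by construction, independent of y."""
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)
    Xc = X - X.mean(axis=0)          # center each predictor
    yc = y - y.mean()                # center the response
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.abs(corr).max()


# The maximum spurious correlation grows with p even though
# every predictor is pure noise with respect to y.
for p in (10, 1_000, 100_000):
    print(p, round(max_spurious_corr(p, rng), 3))
```

With n fixed, larger p mechanically inflates the maximum spurious correlation, which is why classical "large n" asymptotics can be misleading in this regime.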