In today’s digital world, huge amounts of data, i.e., big data, can be found in almost every aspect of scientific research and human activity. These data need to be managed effectively for reliable prediction and inference to improve decision making. Statistical learning is an emergent scientific discipline wherein mathematical modeling, computational algorithms, and statistical analysis are jointly employed to address these challenging data management problems. Invariably, quantitative criteria need to be introduced for the overall learning process in order to gauge the quality of the solutions obtained. This research focuses on two important criteria: data fitness and sparsity representation of the underlying learning model. Potential applications of the results can be found in computational statistics, compressed sensing, imaging, machine learning, bio-informatics, portfolio selection, and decision making under uncertainty, among many areas involving big data.
Till now, convex optimization has been the dominant methodology for statistical learning in which the two criteria employed are expressed by convex functions either to be optimized and/or set as constraints of the variables being sought. Recently, non-convex functions of the difference-of-convex (DC) type and the difference-of-convex algorithm (DCA) have been shown to yield superior results in many contexts and serve as the motivation for this project. The goal is to develop a solid foundation and a unified framework to address many fundamental issues in big data problems in which non-convexity and non-differentiability are present in the optimization problems to be solved. These two non-standard features in computational statistical learning are challenging and their rigorous treatment requires the fusion of expertise from different domains of mathematical sciences. Technical issues to be investigated will cover the optimality, sparsity, and statistical properties of computable solutions to the non-convex, non-smooth optimization problems arising from statistical learning and its many applications. Novel algorithms will be developed and tested first on synthetic data sets for preliminary experimentation and then on publicly available data sets for realism; comparisons will be made among different formulations of the learning problems.