Mathematics -- The Science of Patterns and Algorithms

Computing with Large Data Sets

Breakthroughs in sensor technology are leading to the generation of unprecedented volumes of high-dimensional data. The continued growth in the raw power of computers, together with algorithmic advances, offers the promise of addressing the tremendous challenges of analyzing such data. Huge data sets, terabytes in size and lying in very high-dimensional spaces, are now collected routinely in almost all sciences [MS]. Examples include images from diverse sources, the dynamics of the Internet, neural recordings and, in the discrete domain, gene expression arrays. Whereas data sets in 1, 2, or 3 dimensions are easily visualized and analyzed, data sets in 1000 dimensions are much harder to understand. Even 10,000 points constitute a very sparse set in 1000-dimensional space, and it is easy to "overfit" the data with a model that detects spurious "patterns" that disappear when more data are acquired. A major challenge is to find methods to analyze the structure of such sets, to fit models robustly, and to identify and validate patterns. Techniques from statistics, harmonic analysis, graph theory and computer science are only beginning to clarify this problem.
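
The remark about spurious patterns can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions, not a method taken from the text: every feature and the target are independent Gaussian noise, the sample and feature counts (20 and 500) are arbitrary choices, and the helper names pearson and strongest_spurious_correlation are hypothetical. Because nothing in the data is truly related, any correlation found is an artifact of searching many dimensions with few points.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_spurious_correlation(features, target, num_features):
    """Largest |correlation| between the target and the first num_features features."""
    return max(abs(pearson(col, target)) for col in features[:num_features])


rng = random.Random(0)  # fixed seed so the run is repeatable
num_samples = 20        # assumption: few observations ...
max_features = 500      # ... but many measured variables

# Every feature and the target are independent noise: there is no real pattern.
features = [[rng.gauss(0.0, 1.0) for _ in range(num_samples)]
            for _ in range(max_features)]
target = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

# Searching more features can only raise the best correlation found.
for d in (5, 50, 500):
    best = strongest_spurious_correlation(features, target, d)
    print(f"{d:4d} features: strongest |correlation| with target = {best:.2f}")

# With many more samples of one noise feature and the noise target,
# the measured correlation settles close to zero.
many = 2000
fresh_feature = [rng.gauss(0.0, 1.0) for _ in range(many)]
fresh_target = [rng.gauss(0.0, 1.0) for _ in range(many)]
fresh_corr = pearson(fresh_feature, fresh_target)
print(f"one feature, {many} samples: correlation = {fresh_corr:.2f}")
```

The strongest correlation is computed over nested sets of features, so it cannot decrease as the number of dimensions searched grows; with far more samples than features examined, the apparent relationship fades, which is the behavior the paragraph describes.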