Back in graduate school, I worked on the so-called small sample size problem. In particular, I worked on linear discriminant analysis (LDA). For high-dimensional data (e.g. images, gene expression, etc.), the within-class scatter matrix is singular when the number of samples is smaller than the dimensionality, so LDA cannot be applied directly. You may think that we no longer have such small sample size problems in the era of Big Data. Well, the challenge is deeper than it looks.
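The singularity is easy to see numerically. Below is a minimal sketch with hypothetical random data (numpy, made-up sizes n = 20 and d = 100): each class contributes at most (samples − 1) to the rank of the within-class scatter matrix, so with n < d the matrix can never reach full rank and cannot be inverted as classical LDA requires.

```python
# Sketch of the small sample size problem with hypothetical Gaussian data:
# when n < d, the within-class scatter matrix Sw is singular.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # far fewer samples than dimensions
X = rng.normal(size=(n, d))
y = np.repeat([0, 1], n // 2)         # two classes, 10 samples each

Sw = np.zeros((d, d))                 # within-class scatter matrix
for c in (0, 1):
    Xc = X[y == c]
    Xc = Xc - Xc.mean(axis=0)         # center each class
    Sw += Xc.T @ Xc

rank = np.linalg.matrix_rank(Sw)
print(f"dimension d = {d}, rank(Sw) = {rank}")  # rank <= n - #classes < d
```

Each centered class matrix has rank at most 9, so rank(Sw) is at most 18, far below d = 100.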

In some areas, the obvious small sample size problem still exists. For example, even though the cost of microarray and next-generation sequencing has dropped dramatically in recent years, it remains very hard to collect a large number of tissue samples to study in biomedical science and bioinformatics.

How about the areas where we do have a lot of data now? For example, we can easily access millions of images on the Internet. Well, it is well known that the optimal rate of convergence for estimating the m-th derivative of θ (where θ is a p-times differentiable regression function, estimated nonparametrically) is at best proportional to n^(−(p−m)/(2p+d)), where n is the number of data points and d is the dimensionality of the data. In a nutshell, the rate of convergence slows down drastically as the dimensionality of the inputs grows, because d sits in the denominator of the exponent. In other words, with a fixed number of training samples, the predictive power shrinks as the dimensionality increases, a phenomenon known as the Hughes effect.
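To get a feel for how quickly this rate degrades, here is a small sketch that plugs hypothetical values into the formula (p = 2, i.e. a twice-differentiable θ; m = 0, i.e. estimating θ itself; n = one million samples — all chosen purely for illustration):

```python
# How the optimal nonparametric rate n^(-(p-m)/(2p+d)) degrades with d.
# p, m, and n are hypothetical values chosen only for illustration.
p, m, n = 2, 0, 1_000_000

for d in (1, 10, 100):
    rate = n ** (-(p - m) / (2 * p + d))
    print(f"d = {d:>3}: error ~ n^(-{p - m}/{2 * p + d}) = {rate:.4f}")
```

With a million samples the error bound is a fraction of a percent in one dimension, but in 100 dimensions it barely drops below 0.8 — the same data budget buys almost nothing.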

A more straightforward way to understand the curse of dimensionality is combinatorial explosion. Suppose each variable can take one of several discrete values. Taken together, a huge number of combinations of values must be considered. Even in the simplest case of d binary variables, the number of possible combinations is already O(2^d), exponential in the dimensionality. Naively, each additional dimension doubles the effort needed to try all combinations. The picture below gives a visual illustration in which each variable takes 10 possible discrete values. Therefore, even millions of images are not really big in the context of the curse of dimensionality.

But deep neural networks have shown excellent results on image recognition in recent years. How do they achieve that? First of all, the model complexity of a deep neural network doesn't depend on the number of samples, in contrast to support vector machines: researchers control the model complexity by defining the network structure before training it on the samples, whereas the model complexity of a support vector machine grows with the sample size and dimensionality. Second, deep convolutional networks focus on high-level features rather than pixel-level information, which effectively reduces the dimensionality of the features. The idea of word embeddings similarly reduces the feature dimension in natural language processing, where it has also shown very promising results.
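The combinatorial explosion above is trivial to verify. A tiny sketch (the function name is mine, for illustration only):

```python
# Combinatorial explosion: the number of joint configurations of d
# discrete variables grows exponentially in d.
def num_combinations(d, values_per_variable=2):
    """Number of value combinations for d discrete variables."""
    return values_per_variable ** d

print(num_combinations(10))        # binary variables: 2^10 = 1024
print(num_combinations(3, 10))     # 10 values each: 10^3 = 1000
print(num_combinations(30, 10))    # 10^30 dwarfs any realistic dataset
```

With just 30 variables at 10 values each, covering even one sample per combination would require 10^30 data points — a million images is a rounding error.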

So have we conquered the curse of dimensionality with deep neural networks? Not yet. Deep neural networks involve tens of millions of parameters. The recent super-deep network from Microsoft, with 150 layers, makes the parameter space even larger. The Microsoft team explored even deeper networks – up to 1000 layers – but the results weren't as good, presumably due to overfitting. Overfitting basically means that the model fits the random errors in the data rather than the underlying structure. In other words, for a model of this size, the dataset is comparatively small.
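Overfitting is easy to demonstrate in miniature. The sketch below (hypothetical data: a noisy sine curve, degrees chosen only for illustration) fits two polynomials to the same 10 training points; the degree-9 model has enough parameters to pass through every noisy point, so its training error collapses to essentially zero while its error on the clean underlying function stays much larger.

```python
# Miniature overfitting demo: a high-capacity model memorizes training
# noise instead of the underlying structure.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)

x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)   # noise-free ground truth

errors = {}
for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The same dynamic, scaled up, is what a 1000-layer network faces: with far more parameters than the data can constrain, the model fits the noise.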