My friends, we are not going to discuss ancient Chinese philosophy, but mathematics. In this era of social networking and Big Data, every data scientist wants more connections in the social network to crunch, because more connections (i.e. more edges in the graph) mean more information, right? So today’s quiz: which of the graphs above contains more information?

For several years I have been developing, in my spare time, a comprehensive machine learning library of advanced algorithms called SMILE (Statistical Machine Intelligence and Learning Engine). Today I am very pleased to announce that SMILE is now available on GitHub under the Apache 2.0 license. SMILE is self-contained and requires only the standard Java library. With advanced data structures and learning algorithms, SMILE achieves state-of-the-art performance.

Classification: Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayesian, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.

Regression: Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, Ridge Regression.

Feature Selection: Genetic Algorithm-based Feature Selection, Ensemble Learning-based Feature Selection, Signal-to-Noise Ratio, Sum of Squares Ratio.

SMILE is well documented, and you can browse the Javadoc for more information.

SMILE also has a Swing-based data visualization library, SmilePlot, which provides scatter plot, line plot, staircase plot, bar plot, box plot, histogram, 3D histogram, dendrogram, heatmap, hexmap, Q-Q plot, contour plot, surface, and wireframe. The class PlotCanvas provides built-in functions such as zoom in/out, export, print, and customization.

SmilePlot requires the SwingX library for JXTable. If your environment cannot use SwingX, it is easy to remove this dependency by using the standard Swing JTable instead.

In statistics, the method of maximum likelihood is widely used to estimate an unobservable population parameter $\Theta$ by choosing the value that maximizes the log-likelihood function

$$L(\Theta; \mathcal{X}) = \sum_{i=1}^{n} \log p(x_i \mid \Theta)$$

where the observations $\mathcal{X} = \{x_1, \ldots, x_n\}$ are independently drawn from the distribution parameterized by $\Theta$. The Expectation-Maximization (EM) algorithm is a general approach to iteratively compute the maximum-likelihood estimates when the observations can be viewed as incomplete data and one assumes the existence of additional but missing data $\mathcal{Y}$ corresponding to $\mathcal{X}$. The observations together with the missing data are called complete data.
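To make the idea concrete, here is a minimal sketch of EM in plain Java. It assumes a two-component, one-dimensional Gaussian mixture, which is an illustrative choice and not something taken from the post: the observations are the data points, and the missing data are the unknown component labels. The class `EmSketch`, its methods (`initialGuess`, `step`, `fit`, `logLikelihood`), and the parameter layout are all hypothetical names for this example and are not part of SMILE.

```java
import java.util.Arrays;

/** Hypothetical sketch: EM for a two-component, one-dimensional Gaussian mixture. */
final class EmSketch {
    // Parameter layout (an assumption of this sketch):
    // {weight of component 1, mean1, sd1, mean2, sd2}
    private static final double MIN_SD = 1e-6; // floor to avoid division by zero

    /** Gaussian density with the given mean and standard deviation, evaluated at x. */
    public static double gaussian(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    /** Log-likelihood of independent observations under the mixture. */
    public static double logLikelihood(double[] x, double[] p) {
        double sum = 0;
        for (double xi : x) {
            sum += Math.log(p[0] * gaussian(xi, p[1], p[2])
                    + (1 - p[0]) * gaussian(xi, p[3], p[4]));
        }
        return sum;
    }

    /** Crude starting point: means at the data extremes, shared overall spread. */
    public static double[] initialGuess(double[] x) {
        double min = x[0], max = x[0], sum = 0;
        for (double xi : x) {
            min = Math.min(min, xi);
            max = Math.max(max, xi);
            sum += xi;
        }
        double mean = sum / x.length, var = 0;
        for (double xi : x) var += (xi - mean) * (xi - mean);
        double sd = Math.max(Math.sqrt(var / x.length), MIN_SD);
        return new double[] {0.5, min, sd, max, sd};
    }

    /** One EM iteration. Degenerate cases (an empty component) are not handled. */
    public static double[] step(double[] x, double[] p) {
        // E-step: posterior probability that each point came from component 1.
        double[] r = new double[x.length];
        double n1 = 0, s1 = 0, s2 = 0;
        for (int i = 0; i < x.length; i++) {
            double a = p[0] * gaussian(x[i], p[1], p[2]);
            double b = (1 - p[0]) * gaussian(x[i], p[3], p[4]);
            r[i] = a / (a + b);
            n1 += r[i];
            s1 += r[i] * x[i];
            s2 += (1 - r[i]) * x[i];
        }
        // M-step: re-estimate weight, means, and standard deviations.
        double n2 = x.length - n1;
        double mean1 = s1 / n1, mean2 = s2 / n2, v1 = 0, v2 = 0;
        for (int i = 0; i < x.length; i++) {
            v1 += r[i] * (x[i] - mean1) * (x[i] - mean1);
            v2 += (1 - r[i]) * (x[i] - mean2) * (x[i] - mean2);
        }
        double sd1 = Math.max(Math.sqrt(v1 / n1), MIN_SD);
        double sd2 = Math.max(Math.sqrt(v2 / n2), MIN_SD);
        return new double[] {n1 / x.length, mean1, sd1, mean2, sd2};
    }

    /** Runs a fixed number of EM iterations from the crude starting point. */
    public static double[] fit(double[] x, int iterations) {
        double[] p = initialGuess(x);
        for (int i = 0; i < iterations; i++) p = step(x, p);
        return p;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9};
        double[] p = fit(x, 50);
        System.out.println(Arrays.toString(p));
        System.out.println("log-likelihood: " + logLikelihood(x, p));
    }
}
```

Each call to `step` should leave the log-likelihood no lower than before, which is the property that makes EM useful in practice. A fixed iteration count keeps the sketch short; a real implementation would more likely stop once the improvement in log-likelihood falls below a tolerance.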