We are proud to announce the release of Smile 1.2.0.

The key features of the 1.2.0 release are:

Headless plot. Smile’s plot functions depends on Java Swing. In server applications, it is needed to generate plots without creating Swing windows. With headless plot (enabled by -Djava.awt.headless=true JVM options), we can create plots as follows:

val canvas = ScatterPlot.plot(x, '.')
val headless = new Headless(canvas);
headless.pack();
headless.setVisible(true);
canvas.save(new java.io.File("zone.png"))

All classification and regression models can be serialized by

write(model) // Java serialization

or

write.xstream(model) // XStream serialization

Refactor of smile.io Scala API.

Parsers are in smile.read object.

Parse JDBC ResultSet to AttributeDataset.

Model serialization methods in smile.write object.

Today I am very excited to announce that Smile 1.1 is released! Among many improvements, we get the new high level Scala API, interactive Shell, and a nice project website with programming guides, API doc, etc.!

With Smile 1.1, data scientists can develop advanced models with high level Scala operators in the Shell and developers can deploy them immediately in the app. That is, data scientists and developers can speak the same language now! Continue reading →

Back to graduate school, I had been working on the so-called small sample size problem. In particular, I was working on linear discriminant analysis (LDA). For high-dimensional data (e.g. images, gene expression, etc.), the within-scatter matrix is singular when the number of samples is smaller than the dimensionality. Therefore LDA cannot be applied directly. You may think that we don’t have such small sample size problems anymore in the era of Big Data. Well, the challenge is deeper than what it looks like. Continue reading →

I have been developing a comprehensive machine learning library of advanced algorithms, called SMILE (Statistical Machine Intelligence and Learning Engine), for several years with my spare time. Today I am very pleased to announce that SMILE is now available on GitHub under Apache 2.0 license. SMILE is self contained and requires only the standard Java library. With advanced data structures and learning algorithms, SMILE achieves the state of the art of performance.

Classification: Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayesian, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.

Regression: Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, Ridge Regression.

Feature Selection: Genetic Algorithm based Feature Selection, Ensemble Learning based Feature Selection, Signal Noise ratio, Sum Squares ratio.

SMILE is well documented and you can browse the javadoc for more information.

SMILE also has a Swing-based data visualization library SmilePlot, which provides scatter plot, line plot, staircase plot, bar plot, box plot, histogram, 3D histogram, dendrogram, heatmap, hexmap, qq plot, contour plot, surface, and wireframe. The class PlotCanvas provides builtin functions such as zoom in/out, export, print, customization, etc.

SmilePlot requires SwingX library for JXTable. But if your environment cannot use SwingX, it is easy to remove this dependency by using JTable.

A picture worths a thousand words. So no needs to repeat the title. But the point is that similarity is the core of many machine learning algorithms. Before talking about your mathematical models, go understand your business and problems. Lead the model with your insights (or a priori in terms of machine learning). Don’t be lead by the uninterpretable numbers of black box models.

In statistics, the method of maximum likelihood is widely used to estimate an unobservable population parameter that maximizes the log-likelihood function

where the observations are independently drawn from the distribution parameterized by . The Expectation-Maximization (EM) algorithm is a general approach to iteratively compute the maximum-likelihood estimates when the observations can be viewed as incomplete data and one assumes the existence of additional but missing data corresponding to . The observations together with the missing data are called complete data. Continue reading →