We are proud to announce the release of Smile 1.2.0.
The key features of the 1.2.0 release are:
- Headless plot. Smile’s plot functions depend on Java Swing. Server applications often need to generate plots without creating Swing windows. With headless plot (enabled by the JVM option -Djava.awt.headless=true), we can create plots as follows:
val canvas = ScatterPlot.plot(x, '.')
val headless = new Headless(canvas)
headless.pack()
headless.setVisible(true)
canvas.save(new java.io.File("zone.png"))
- All classification and regression models can be serialized with:
write(model) // Java serialization
write.xstream(model) // XStream serialization
- Refactoring of the smile.io Scala API:
  - Parsers are in the smile.read object.
  - A JDBC ResultSet can be parsed into an AttributeDataset.
  - Model serialization methods are in the smile.write object.
- Platt scaling for SVM.
- Smile NLP tokenizers are Unicode-aware.
- Least squares can now handle rank-deficient matrices.
- Various code improvements.
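To illustrate the "Java serialization" option mentioned in the list above, here is a minimal sketch of a round trip through the standard java.io object streams. It assumes that this is the mechanism behind write(model), which the release notes do not spell out. The LinearModel class and the helper names toBytes and fromBytes are hypothetical stand-ins for illustration and are not part of Smile.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSketch {
    /** Hypothetical stand-in for a trained model; NOT a Smile class. */
    static class LinearModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] weights;
        final double bias;

        LinearModel(double[] weights, double bias) {
            this.weights = weights;
            this.bias = bias;
        }

        /** Computes bias + sum of weights[i] * x[i]. */
        double predict(double[] x) {
            double y = bias;
            for (int i = 0; i < weights.length; i++) {
                y += weights[i] * x[i];
            }
            return y;
        }
    }

    /** Serializes any Serializable object into a byte array. */
    static byte[] toBytes(Serializable model) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        }
        return buffer.toByteArray();
    }

    /** Reads back an object previously written by toBytes. */
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        LinearModel model = new LinearModel(new double[] {0.5, -2.0}, 1.0);
        LinearModel restored = (LinearModel) fromBytes(toBytes(model));

        // The restored copy should give the same prediction as the original.
        double[] x = {2.0, 0.25};
        System.out.println(model.predict(x) == restored.predict(x));
    }
}
```

In a server application the byte array would typically be replaced by a file or network stream. The only requirement on the model side is that every field reachable from the object is itself serializable.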
Today I am very excited to announce that Smile 1.1 is released! Among many improvements, we get the new high-level Scala API, an interactive Shell, and a nice project website with programming guides, API docs, etc.!
With Smile 1.1, data scientists can develop advanced models with high-level Scala operators in the Shell and developers can deploy them immediately in the app. That is, data scientists and developers can speak the same language now!
Back in graduate school, I worked on the so-called small sample size problem. In particular, I was working on linear discriminant analysis (LDA). For high-dimensional data (e.g. images, gene expression, etc.), the within-class scatter matrix is singular when the number of samples is smaller than the dimensionality. Therefore LDA cannot be applied directly. You may think that we don’t have such small sample size problems anymore in the era of Big Data. Well, the challenge is deeper than it looks.
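The singularity claim above can be made concrete with a short worked equation. This is a sketch in assumed notation, not taken from the post: n samples x_i of dimension d, c classes with class means \mu_k, labels y_i, and the within-class and between-class scatter matrices S_w and S_b.

```latex
\begin{align}
S_w &= \sum_{k=1}^{c} \sum_{i \,:\, y_i = k} (x_i - \mu_k)(x_i - \mu_k)^{\top}
      \in \mathbb{R}^{d \times d}, \\
\operatorname{rank}(S_w) &\le n - c .
\end{align}
```

Each class contributes at most n_k - 1 linearly independent centered samples, which gives the rank bound. When n < d, the rank of S_w is therefore strictly less than d and S_w is singular, so the inverse S_w^{-1} that appears in the classical LDA solution (the leading eigenvectors of S_w^{-1} S_b) does not exist.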
I have been developing a comprehensive machine learning library of advanced algorithms, called SMILE (Statistical Machine Intelligence and Learning Engine), for several years in my spare time. Today I am very pleased to announce that SMILE is now available on GitHub under the Apache 2.0 license. SMILE is self-contained and requires only the standard Java library. With advanced data structures and learning algorithms, SMILE achieves state-of-the-art performance.
- Classification: Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayesian, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.
- Regression: Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, Ridge Regression.
- Feature Selection: Genetic Algorithm based Feature Selection, Ensemble Learning based Feature Selection, Signal-to-Noise Ratio, Sum of Squares Ratio.
- Clustering: BIRCH, CLARANS, DBSCAN, DENCLUE, Deterministic Annealing, K-Means, X-Means, G-Means, Neural Gas, Growing Neural Gas, Hierarchical Clustering, Sequential Information Bottleneck, Self-Organizing Maps, Spectral Clustering, Minimum Entropy Clustering.
- Association Rule & Frequent Itemset Mining: FP-growth mining algorithm.
- Manifold learning: IsoMap, LLE, Laplacian Eigenmap, PCA, Kernel PCA, Probabilistic PCA, GHA, Random Projection
- Multi-Dimensional Scaling: Classical MDS, Isotonic MDS, Sammon Mapping
- Nearest Neighbor Search: BK-Tree, Cover Tree, KD-Tree, LSH
- Sequence Learning: Hidden Markov Model
SMILE is well documented and you can browse the Javadoc for more information.
SMILE also has a Swing-based data visualization library, SmilePlot, which provides scatter plot, line plot, staircase plot, bar plot, box plot, histogram, 3D histogram, dendrogram, heatmap, hexmap, Q-Q plot, contour plot, surface, and wireframe. The class PlotCanvas provides built-in functions such as zoom in/out, export, print, customization, etc.
SmilePlot requires the SwingX library for JXTable. If your environment cannot use SwingX, it is easy to remove this dependency by using JTable instead.
A picture is worth a thousand words, so there is no need to repeat the title. But the point is that similarity is the core of many machine learning algorithms. Before talking about your mathematical models, go understand your business and problems. Lead the model with your insights (or priors, in machine learning terms). Don’t be led by the uninterpretable numbers of black box models.
In statistics, the method of maximum likelihood is widely used to estimate an unobservable population parameter $\theta$ by maximizing the log-likelihood function
$$L(\theta; \mathcal{X}) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$
where the observations $\mathcal{X} = \{x_1, \ldots, x_n\}$ are independently drawn from the distribution $p(x)$ parameterized by $\theta$. The Expectation-Maximization (EM) algorithm is a general approach to iteratively compute the maximum-likelihood estimates when the observations can be viewed as incomplete data and one assumes the existence of additional but missing data $\mathcal{Y}$ corresponding to $\mathcal{X}$. The observations together with the missing data are called complete data.
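The iteration described above can be sketched as the two standard EM steps. The notation is an assumption made here for illustration: \mathcal{X} denotes the observations, \mathcal{Y} the missing data, \theta the parameter, and \theta^{(t)} the estimate after t iterations.

```latex
\begin{align}
\text{E-step:}\quad
Q\bigl(\theta \mid \theta^{(t)}\bigr)
  &= \mathbb{E}_{\mathcal{Y} \mid \mathcal{X},\, \theta^{(t)}}
     \bigl[\log p(\mathcal{X}, \mathcal{Y} \mid \theta)\bigr], \\
\text{M-step:}\quad
\theta^{(t+1)}
  &= \arg\max_{\theta}\; Q\bigl(\theta \mid \theta^{(t)}\bigr).
\end{align}
```

The E-step averages the complete-data log-likelihood over the missing data given the current estimate, and the M-step maximizes that average. Each iteration does not decrease the observed-data log-likelihood \log p(\mathcal{X} \mid \theta), although the limit may be a local rather than a global maximum.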