# Smile leaves R, Python, H2O, Spark, and xgboost in the dust

**11**
*Thursday*
May 2017

Posted in Big Data, Machine Learning

**15**
*Monday*
Aug 2016

Posted in Machine Learning

Dear Smilers,

We are proud to announce the release of Smile 1.2.0.

The key features of the 1.2.0 release are:

- Headless plot. Smile's plot functions depend on Java Swing. Server applications often need to generate plots without creating Swing windows. With headless plot (enabled by the JVM option -Djava.awt.headless=true), we can create plots as follows:

      val canvas = ScatterPlot.plot(x, '.')
      val headless = new Headless(canvas)
      headless.pack()
      headless.setVisible(true)
      canvas.save(new java.io.File("zone.png"))

- All classification and regression models can be serialized by

      write(model) // Java serialization

  or

      write.xstream(model) // XStream serialization

- Refactor of the smile.io Scala API:
  - Parsers are in the smile.read object.
  - Parse a JDBC ResultSet to an AttributeDataset.
  - Model serialization methods are in the smile.write object.

- Platt scaling for SVM.
- Smile NLP tokenizers are Unicode-aware.
- Least squares can now handle rank-deficient problems.
- Various code improvements.
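
For readers unfamiliar with what "Java serialization" of a model involves, here is a minimal sketch that uses only the standard library. `LinearModel`, `toBytes`, and `fromBytes` are hypothetical names made up for illustration; they are not part of Smile, and this is not how `smile.write` is implemented. The sketch only shows the underlying round trip through `ObjectOutputStream` and `ObjectInputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ModelSerializationSketch {
    /** Hypothetical stand-in for a trained model; not a Smile class. */
    static class LinearModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double intercept;
        final double[] weights;

        LinearModel(double intercept, double[] weights) {
            this.intercept = intercept;
            this.weights = weights.clone();
        }

        /** Returns intercept + weights . x */
        double predict(double[] x) {
            double y = intercept;
            for (int i = 0; i < weights.length; i++) {
                y += weights[i] * x[i];
            }
            return y;
        }
    }

    /** Serializes any Serializable object into a byte array. */
    static byte[] toBytes(Serializable model) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        }
        return buffer.toByteArray();
    }

    /** Restores an object previously written by toBytes. */
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        LinearModel model = new LinearModel(0.5, new double[] {2.0, -1.0});
        LinearModel restored = (LinearModel) fromBytes(toBytes(model));
        double[] x = {3.0, 4.0};
        // The restored model should give the same prediction as the original.
        System.out.println(model.predict(x) == restored.predict(x));
    }
}
```

Writing to a file instead of a byte array only requires swapping the `ByteArrayOutputStream` for a `FileOutputStream`.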

**21**
*Monday*
Mar 2016

Posted in Machine Learning

Today I am very excited to announce that Smile 1.1 is released! Among many improvements, we get a new high-level Scala API, an interactive shell, and a nice project website with programming guides, API docs, and more!

With Smile 1.1, data scientists can develop advanced models with high-level Scala operators in the shell, and developers can deploy them immediately in the app. That is, data scientists and developers can now speak the same language! Continue reading

**29**
*Monday*
Feb 2016

Posted in Big Data, Machine Learning

Back in graduate school, I worked on the so-called small sample size problem. In particular, I was working on linear discriminant analysis (LDA). For high-dimensional data (e.g. images, gene expression, etc.), the within-class scatter matrix is singular when the number of samples is smaller than the dimensionality. Therefore LDA cannot be applied directly. You may think that we don't have such small sample size problems anymore in the era of Big Data. Well, the challenge is deeper than it looks. Continue reading
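
To see why the scatter matrix becomes singular, a short worked bound may help. The notation here is an assumption introduced for illustration and is not taken from the post: $n$ samples $x_i \in \mathbb{R}^d$, $c$ classes with class means $\mu_k$, and $n_k$ samples in class $k$.

$$
S_w = \sum_{k=1}^{c} \sum_{i \,:\, y_i = k} (x_i - \mu_k)(x_i - \mu_k)^{\top},
\qquad
\operatorname{rank}(S_w) \le \sum_{k=1}^{c} (n_k - 1) = n - c .
$$

Within each class the centered vectors $x_i - \mu_k$ sum to zero, so class $k$ contributes at most $n_k - 1$ linearly independent directions. $S_w$ is a $d \times d$ matrix, so whenever $n - c < d$ it cannot have full rank, and the inverse $S_w^{-1}$ used in the classical LDA solution does not exist.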

**03**
*Tuesday*
Mar 2015

Posted by Haifeng Li | Filed under Analytics, Big Data, Machine Learning, Programming, Statistics

**20**
*Thursday*
Nov 2014

Posted in Analytics, Big Data, Machine Learning, Programming, Statistics

For several years, I have been developing a comprehensive machine learning library of advanced algorithms in my spare time, called SMILE (Statistical Machine Intelligence and Learning Engine). Today I am very pleased to announce that SMILE is now available on GitHub under the Apache 2.0 license. SMILE is self-contained and requires only the standard Java library. With advanced data structures and learning algorithms, SMILE achieves state-of-the-art performance.

- **Classification**: Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayesian, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.
- **Regression**: Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, Ridge Regression.
- **Feature Selection**: Genetic Algorithm based Feature Selection, Ensemble Learning based Feature Selection, Signal Noise ratio, Sum Squares ratio.
- **Clustering**: BIRCH, CLARANS, DBSCAN, DENCLUE, Deterministic Annealing, K-Means, X-Means, G-Means, Neural Gas, Growing Neural Gas, Hierarchical Clustering, Sequential Information Bottleneck, Self-Organizing Maps, Spectral Clustering, Minimum Entropy Clustering.
- **Association Rule & Frequent Itemset Mining**: FP-growth mining algorithm.
- **Manifold Learning**: IsoMap, LLE, Laplacian Eigenmap, PCA, Kernel PCA, Probabilistic PCA, GHA, Random Projection.
- **Multi-Dimensional Scaling**: Classical MDS, Isotonic MDS, Sammon Mapping.
- **Nearest Neighbor Search**: BK-Tree, Cover Tree, KD-Tree, LSH.
- **Sequence Learning**: Hidden Markov Model.

SMILE is well documented and you can browse the javadoc for more information.

SMILE also has a Swing-based data visualization library SmilePlot, which provides scatter plot, line plot, staircase plot, bar plot, box plot, histogram, 3D histogram, dendrogram, heatmap, hexmap, qq plot, contour plot, surface, and wireframe. The class PlotCanvas provides builtin functions such as zoom in/out, export, print, customization, etc.

SmilePlot requires the SwingX library for JXTable. If your environment cannot use SwingX, it is easy to remove this dependency by using JTable instead.

**25**
*Monday*
Aug 2014

Posted in Machine Learning, Opinion

A picture is worth a thousand words, so there is no need to repeat the title. The point is that similarity is the core of many machine learning algorithms. Before talking about your mathematical models, go understand your business and problems. Lead the model with your insights (or *a priori* knowledge, in machine learning terms). Don't be led by the uninterpretable numbers of black box models.

**17**
*Thursday*
Jul 2014

Posted in Machine Learning, Statistics

In statistics, the method of maximum likelihood is widely used to estimate an unobservable population parameter $\theta$ by maximizing the log-likelihood function

$$L(\theta; \mathcal{X}) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$

where the observations $\mathcal{X} = \{x_1, \ldots, x_n\}$ are independently drawn from the distribution $p(x \mid \theta)$ parameterized by $\theta$. The Expectation-Maximization (EM) algorithm is a general approach to iteratively compute the maximum-likelihood estimates when the observations can be viewed as *incomplete data* and one assumes the existence of additional but *missing* data $\mathcal{Y}$ corresponding to $\mathcal{X}$. The observations together with the missing data are called *complete data*. Continue reading
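
As a concrete illustration of the incomplete-data view, here is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture. In this setting the missing data are the unobserved component labels of each point. The class name, the `fit` method, the initialization (means at the data extremes, unit variances), and the variance floor are all assumptions chosen for brevity, not the post's own code, and the sketch does not guard against numerical underflow on widely spread data.

```java
public class EmGaussianMixtureSketch {
    /** Gaussian density with mean mu and variance var, evaluated at x. */
    static double gaussian(double x, double mu, double var) {
        double d = x - mu;
        return Math.exp(-d * d / (2 * var)) / Math.sqrt(2 * Math.PI * var);
    }

    /**
     * Runs a fixed number of EM iterations on a two-component 1-D Gaussian mixture.
     * Returns {weight of component 1, mean1, var1, mean2, var2}.
     */
    static double[] fit(double[] x, int iterations) {
        int n = x.length;
        double min = x[0], max = x[0];
        for (double v : x) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        // Illustrative initialization: means at the data extremes, unit variances.
        double w = 0.5, mu1 = min, mu2 = max, var1 = 1.0, var2 = 1.0;
        double[] r = new double[n]; // responsibility of component 1 for each point

        for (int it = 0; it < iterations; it++) {
            // E-step: posterior probability that each point came from component 1.
            for (int i = 0; i < n; i++) {
                double p1 = w * gaussian(x[i], mu1, var1);
                double p2 = (1 - w) * gaussian(x[i], mu2, var2);
                r[i] = p1 / (p1 + p2);
            }
            // M-step: re-estimate the parameters from the weighted points.
            double n1 = 0, s1 = 0, s2 = 0;
            for (int i = 0; i < n; i++) {
                n1 += r[i];
                s1 += r[i] * x[i];
                s2 += (1 - r[i]) * x[i];
            }
            double n2 = n - n1;
            mu1 = s1 / n1;
            mu2 = s2 / n2;
            double v1 = 0, v2 = 0;
            for (int i = 0; i < n; i++) {
                v1 += r[i] * (x[i] - mu1) * (x[i] - mu1);
                v2 += (1 - r[i]) * (x[i] - mu2) * (x[i] - mu2);
            }
            // Floor the variances so a component cannot collapse onto one point.
            var1 = Math.max(v1 / n1, 1e-6);
            var2 = Math.max(v2 / n2, 1e-6);
            w = n1 / n;
        }
        return new double[] {w, mu1, var1, mu2, var2};
    }

    public static void main(String[] args) {
        double[] data = {-0.2, 0.0, 0.2, 4.8, 5.0, 5.2};
        double[] p = fit(data, 20);
        System.out.printf("w=%.3f mu1=%.3f var1=%.4f mu2=%.3f var2=%.4f%n",
                p[0], p[1], p[2], p[3], p[4]);
    }
}
```

Each iteration alternates between filling in the missing labels in expectation (E-step) and maximizing the resulting complete-data likelihood (M-step).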