ADP's revenue includes its full HCM services in addition to payroll.
Notice something here? ADP moves a lot more money than PayPal, but makes less revenue on money movement (net of the revenue from other HCM services). It has a smaller market cap too. Why? Well, ADP's business is a mix of solution shop and value-adding process, while PayPal is a facilitated network. Continue reading →
MapReduce provides a scatter-gather parallel computing model, which is very limited. Dryad, a research project at Microsoft Research, attempted to support a more general purpose runtime for parallel data processing. A Dryad job is a directed acyclic graph (DAG) where each vertex is a program and edges represent data channels (files, TCP pipes, or shared-memory FIFOs). Continue reading →
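To make the DAG model concrete, here is a minimal single-process sketch in Python. It is not Dryad's API: the `run_dag` helper and the vertex names are hypothetical, and plain in-memory lists stand in for the data channels. Each vertex is a function that receives the outputs of its upstream vertices, and vertices run in topological order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(vertices, edges):
    """Run every vertex once, feeding it the outputs of its upstream vertices.

    vertices: dict mapping a vertex name to a function that takes a list of
              upstream outputs (empty for source vertices).
    edges:    list of (source, destination) pairs; each edge plays the role
              of a data channel.
    """
    predecessors = {name: [] for name in vertices}
    for src, dst in edges:
        predecessors[dst].append(src)

    results = {}
    # static_order() yields each vertex only after all of its predecessors.
    for name in TopologicalSorter(predecessors).static_order():
        upstream = [results[p] for p in predecessors[name]]
        results[name] = vertices[name](upstream)
    return results


# Hypothetical job: two sources, one transform, a merge, and a final sum.
vertices = {
    "read_a": lambda ins: [1, 2, 3],
    "read_b": lambda ins: [10, 20],
    "square": lambda ins: [x * x for x in ins[0]],
    "merge": lambda ins: sorted(ins[0] + ins[1]),
    "total": lambda ins: sum(ins[0]),
}
edges = [
    ("read_a", "square"),
    ("square", "merge"),
    ("read_b", "merge"),
    ("merge", "total"),
]

results = run_dag(vertices, edges)
```

Unlike a fixed map-then-reduce pipeline, the graph here has a vertex with two inputs (`merge`), which is the kind of shape a general DAG allows.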
For several years I have been developing, in my spare time, a comprehensive machine learning library of advanced algorithms called SMILE (Statistical Machine Intelligence and Learning Engine). Today I am very pleased to announce that SMILE is now available on GitHub under the Apache 2.0 license. SMILE is self-contained and requires only the standard Java library. With advanced data structures and learning algorithms, SMILE achieves state-of-the-art performance.
Classification: Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayes, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.
Regression: Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, Ridge Regression.
Feature Selection: Genetic Algorithm based Feature Selection, Ensemble Learning based Feature Selection, Signal-to-Noise Ratio, Sum of Squares Ratio.
SMILE is well documented and you can browse the javadoc for more information.
SMILE also has a Swing-based data visualization library, SmilePlot, which provides scatter plot, line plot, staircase plot, bar plot, box plot, histogram, 3D histogram, dendrogram, heatmap, hexmap, Q-Q plot, contour plot, surface, and wireframe. The class PlotCanvas provides built-in functions such as zoom in/out, export, print, and customization.
SmilePlot requires the SwingX library for JXTable. If your environment cannot use SwingX, it is easy to remove this dependency by using JTable instead.
We have reviewed Apache Hive and Cloudera Impala, which are great for ad hoc analysis of big data. Today, Facebook’s Hive data warehouse holds 300 PB of data with an incoming daily rate of about 600 TB! That is amazing, but it doesn’t mean that most analytics runs at that scale (even at Facebook). In fact, queries usually focus on a particular subset or time window and touch only a small number of a table's columns. Continue reading →
In the previous post, we discussed Apache Hive, which first brought SQL to Hadoop. There are actually several SQL-on-Hadoop solutions competing with Hive head-to-head. Today, we will look into Google BigQuery, Cloudera Impala and Apache Drill, which all have their roots in Google Dremel, a system designed for interactive analysis of web-scale datasets. In a nutshell, they are native massively parallel processing query engines for read-only data. Continue reading →
In the previous post, we discussed Apache Pig, which provides a data flow DSL, Pig Latin, to ease MapReduce programming. Although many statements in Pig Latin look just like SQL clauses, it is a procedural programming language. Today we will discuss Apache Hive, which first brought SQL to Hadoop. Similar to Pig, Hive translates queries in its own dialect of SQL (HiveQL) into a directed acyclic graph of MapReduce (or Tez since 0.13) jobs. However, the difference between Pig and Hive is not only procedural vs. declarative. Pig is a relatively thin layer on top of MapReduce for offline analytics, whereas Hive aims to be a data warehouse. With the recent Stinger initiative, which targets a 100x performance improvement, Hive is moving closer to interactive analytics. Continue reading →
MapReduce is a good tool for offline, ad hoc analytics, which often involves multiple successive jobs. A single MapReduce job essentially performs a group-by aggregation in a massively parallel way. However, its programming model is very low level. Custom code has to be written for even simple operations like projection and filtering. It is even more tedious and verbose to implement common relational operators such as join. Several efforts have been devoted to simplifying the development of MapReduce programs by providing high-level DSLs that can be translated to native MapReduce code. Unlike many other projects that bring SQL to Hadoop, Pig provides a procedural (data flow) programming language, Pig Latin, because it was designed for experienced programmers. Continue reading →
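The "group-by aggregation" view of a single job can be sketched in a few lines of Python. This is a single-process illustration of the programming model only, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for this example, and word count stands in for a real workload.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key: the 'group-by' step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Aggregate the values of each key independently."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Word count: even this simple job needs a hand-written mapper and reducer.
def word_count_mapper(line):
    return [(word, 1) for word in line.split()]


def word_count_reducer(word, counts):
    return sum(counts)


lines = ["pig hive pig", "hive spark"]
counts = run_job(lines, word_count_mapper, word_count_reducer)
```

A filter or projection would need its own mapper, and a join needs a mapper that tags records by source plus a reducer that pairs them up, which is the verbosity that higher-level DSLs try to remove.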
In the previous post, we discussed MapReduce. Although it is great for large-scale data processing, it is not friendly to iterative algorithms or interactive analytics, because the data have to be reloaded for each iteration or materialized and replicated on the distributed file system between successive jobs. Apache Spark is designed to solve this problem by means of in-memory computing. The overall framework and parallel computing model of Spark are similar to MapReduce, but with an important innovation: the resilient distributed dataset (RDD). Continue reading →
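The benefit of keeping data in memory across iterations can be illustrated with a toy, single-machine sketch. `TinyRDD` below is a hypothetical class written for this example and is not Spark's API; it only mimics the idea that a dataset is defined by how to compute it from its parent and that a cached dataset is loaded once and then reused.

```python
class TinyRDD:
    """Toy dataset defined by a recipe for computing its rows (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute  # function that (re)builds this dataset's rows
        self._persist = False
        self._cached = None

    def map(self, fn):
        # The child remembers how to derive itself from this parent.
        return TinyRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, predicate):
        return TinyRDD(lambda: [x for x in self.collect() if predicate(x)])

    def cache(self):
        """Ask for the rows to be kept in memory after the first computation."""
        self._persist = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        rows = self._compute()
        if self._persist:
            self._cached = rows
        return rows


load_calls = []  # records how many times the "storage" was read


def load_from_storage():
    load_calls.append(1)
    return list(range(10))


base = TinyRDD(load_from_storage).cache()
evens = base.filter(lambda x: x % 2 == 0)

total = 0
for _ in range(3):  # an "iterative algorithm" reusing the same input
    total += sum(evens.map(lambda x: x * x).collect())
```

With `cache()` the storage is read once for all three iterations; without it, each iteration would call `load_from_storage` again, which mirrors the reload cost of chained MapReduce jobs described above.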