The Other Side of Sharing Economy




The Sharing Economy now touches nearly every aspect of everyday life. Besides the skyrocketing valuations, people also talk about things like:

  • Uber, the world’s largest taxi company, owns no vehicles.
  • AirBnB, the world’s largest accommodation provider, owns no real estate.

It is true that AirBnB doesn’t own a single room. But most hotels don’t own real estate either! They lease. Continue reading

Oracle’s Dilemma



Opinions expressed are solely my own and do not express the views or opinions of my employer.

Today I am at Oracle Cloud World. I have high hopes of learning about their cloud strategy, because Oracle is the only hugely successful company founded during the client/server movement that is still led by its founder. Today we are facing another significant change: Cloud and SaaS. It will be very interesting to see how Oracle responds to it. Continue reading

How to Disrupt Financial Services


“Silicon Valley is coming,” JPMorgan Chase CEO Jamie Dimon warned in his annual letter to shareholders. Yes, FinTech startups are coming. In fact, there are currently 12,000 FinTech startups in the field. No doubt most of them will fail, but a few will succeed and disrupt the financial services market. Of course, the million-dollar question is “How?!” I don’t have a crystal ball to divine the future, but history may teach us something. Let’s look at how technologies disrupted other fields (and themselves). Continue reading

Big Bank’s Bet on FinTech

Antony Jenkins, the former CEO of Barclays, says a new wave of tech-savvy startups that can do things better, faster, and cheaper than the big banks will disrupt traditional banking businesses such as lending, payments, and wealth management. A number of fintech startups are working hard to do just that. The big banks, however, are not sitting still. Citi, BoA, and Capital One are opening up APIs to outsiders. Citi also hopes to innovate much faster through its crowdsourced Mobile Challenge initiative, with a mobile-first strategy. There are a lot of interesting moves. Continue reading

Less is More, Sometimes




My friends, we are not going to discuss ancient Chinese philosophy, but mathematics. In this era of social networking and Big Data, every data scientist wants more connections in the social network to crunch, because more connections (i.e. more edges in the graph) mean more information, right? So today’s quiz is: which of the graphs above contains more information? Continue reading

ZooKeeper Insides

For in-depth information on various Big Data technologies, check out my free e-book “Introduction to Big Data”.

In YARN, the Resource Manager is a single point of failure (SPOF). Multiple Resource Manager instances can be brought up for fault tolerance, but only one instance is Active. When the Active goes down or becomes unresponsive, another Resource Manager has to be elected as the Active. Such a leader election problem is common in distributed systems with an active/standby design. YARN relies on ZooKeeper to elect the new Active. In fact, distributed systems also face other common problems such as naming service, configuration management, synchronization, and group membership. ZooKeeper is a highly reliable distributed coordination service for all these use cases. Higher-order constructs, e.g. barriers, message queues, locks, two-phase commit, and leader election, can also be implemented with ZooKeeper. In the rest of the book, we will find that many distributed services depend on ZooKeeper, which is exactly its goal: to implement the coordination service once, do it well, and share it among many distributed applications. Continue reading
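One well-known ZooKeeper recipe for leader election has every candidate create an ephemeral sequential node; the candidate holding the lowest sequence number is the leader, and when its node disappears the next one in line takes over. The sketch below models only that idea, in memory. `ToyCoordinator` and its methods are hypothetical names made up for illustration, not ZooKeeper's client API, and this is not necessarily how YARN itself performs the election. Sessions, watches, and network failures are all ignored.

```python
import itertools


class ToyCoordinator:
    """In-memory stand-in for a coordination service (hypothetical, not ZooKeeper's API)."""

    def __init__(self):
        self._counter = itertools.count()
        self._nodes = {}  # sequence number -> candidate name

    def join(self, name):
        """Register a candidate, as if creating an ephemeral sequential node."""
        seq = next(self._counter)
        self._nodes[seq] = name
        return seq

    def leave(self, seq):
        """Drop a candidate, as if its session expired and its node vanished."""
        self._nodes.pop(seq, None)

    def leader(self):
        """The candidate with the lowest sequence number is the Active."""
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]


coord = ToyCoordinator()
rm1 = coord.join("rm1")
rm2 = coord.join("rm2")
rm3 = coord.join("rm3")

first_leader = coord.leader()   # rm1 registered first, so it is the Active
coord.leave(rm1)                # the Active fails
second_leader = coord.leader()  # the next candidate in line takes over
```

Ordering candidates by sequence number means the survivors never have to negotiate with each other: the coordination service's ordering alone decides who is next.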

The CAP Theorem Revisited

At PODC 2000, Eric Brewer conjectured that a distributed shared-data system cannot simultaneously provide all three of the following desirable properties:

  • Consistency

All nodes see the same data at the same time. It is equivalent to having a single up-to-date copy of the data.

  • Availability

Every request received by a non-failing node in the system must result in a response. Even when severe network failures occur, every request must terminate.

  • Partition tolerance

The system continues to operate despite arbitrary message loss or failure of part of the system.

In 2002, Gilbert and Lynch proved this in the asynchronous and partially synchronous network models. It is therefore now known as the CAP Theorem. Continue reading
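To make the trade-off concrete, here is a minimal toy model, not the proof: two replicas joined by a link that can be cut. `ToyStore`, its `mode` parameter, and `run` are hypothetical names invented for this sketch. Once the link is cut, a write can either be refused (keeping the replicas consistent but making the system unavailable) or accepted on one side only (keeping the system available but letting the replicas diverge).

```python
class Replica:
    def __init__(self):
        self.value = None


class ToyStore:
    """Two replicas joined by a link that can be cut (hypothetical toy model)."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, value):
        """Return True if the write was accepted, False if it was refused."""
        if self.partitioned:
            if self.mode == "CP":
                return False       # stay consistent by giving up availability
            replica.value = value  # stay available; the replicas may diverge
            return True
        # No partition: the write reaches both replicas.
        self.a.value = self.b.value = value
        return True


def run(mode):
    store = ToyStore(mode)
    store.write(store.a, "v1")             # replicated normally
    store.partitioned = True               # the network splits
    accepted = store.write(store.a, "v2")  # a client writes to one side
    consistent = store.a.value == store.b.value
    return accepted, consistent


cp_result = run("CP")  # (accepted, consistent) when consistency is preferred
ap_result = run("AP")  # (accepted, consistent) when availability is preferred
```

In this toy, neither mode ends with both `accepted` and `consistent` true while the partition lasts, which is the informal content of the theorem.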

Inner Join with MapReduce

An inner join operation combines two data sets, A and B, to produce a third one containing all record pairs from A and B with matching attribute values. The sort-merge join and the hash join are two common algorithms for implementing the join operation in a parallel data flow environment. In a sort-merge join, both A and B are sorted by the join attribute and then compared in sorted order. The matching pairs are inserted into the output stream. A hash join first builds a hash table of the smaller data set with the join attribute as the hash key. Then we scan the larger data set and find the relevant rows from the smaller data set by probing the hash table. Continue reading
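Both algorithms can be sketched in a few lines on a single machine. The functions below are illustrative only: `hash_join`, `sort_merge_join`, and the `users`/`orders` sample records are names chosen for this example, records are assumed to be `(key, value)` tuples, and nothing here is distributed or specific to MapReduce.

```python
from collections import defaultdict


def hash_join(small, large):
    """Join two lists of (key, value) records by building a hash table on `small`."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    # Scan the larger data set and probe the hash table for matches.
    return [(key, s_val, l_val)
            for key, l_val in large
            for s_val in table.get(key, [])]


def sort_merge_join(a, b):
    """Join two lists of (key, value) records by sorting both on the key."""
    a = sorted(a, key=lambda r: r[0])
    b = sorted(b, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            key = a[i][0]
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(a) and a[i_end][0] == key:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end][0] == key:
                j_end += 1
            out.extend((key, av, bv)
                       for _, av in a[i:i_end]
                       for _, bv in b[j:j_end])
            i, j = i_end, j_end
    return out


users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (3, "pen"), (3, "ink"), (4, "cup")]

hash_result = hash_join(users, orders)
merge_result = sort_merge_join(users, orders)
```

The hash join needs the smaller data set to fit in memory but reads the larger one only once; the sort-merge join avoids that memory requirement at the cost of sorting both inputs.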

MapReduce Insides

Distributed parallel computing is not new. Supercomputers have been using MPI for years for complex numerical computing. Although MPI provides a comprehensive API for data transfer and synchronization, it is not well suited to big data. Because of the large data size and the shared-nothing architecture needed for scalability, data distribution and I/O are critical to big data analytics, yet MPI almost ignores them. On the other hand, many big data analytics tasks are conceptually straightforward and do not need very complicated communication and synchronization mechanisms. Based on these observations, Google invented MapReduce to deal with the issues of how to parallelize the computation, distribute the data, and handle failures. Continue reading