In statistics, the method of maximum likelihood is widely used to estimate an unobservable population parameter that maximizes the log-likelihood function

$L(\Theta;\mathcal{X})=\sum_{i=1}^{n} \log p(x_i|\Theta)$

where the observations $\mathcal{X}=\{x_i|i=1,\ldots,n\}$ are independently drawn from the distribution $p(x)$ parameterized by $\Theta$. The Expectation-Maximization (EM) algorithm is a general approach to iteratively compute the maximum-likelihood estimates when the observations can be viewed as incomplete data and one assumes the existence of additional but missing data $\mathcal{Y}=\{y_i|i=1,\ldots,n\}$ corresponding to $\mathcal{X}$. The observations together with the missing data are called complete data.

The EM algorithm maximizes the log-likelihood of the incomplete data by exploiting the relationship between the complete data and the incomplete data. In each iteration, two steps, called E-step and M-step, are involved. In the E-step, the EM algorithm determines the expectation of log-likelihood of the complete data based on the incomplete data and the current parameter

$Q(\Theta|\Theta^{(t)})= E\left(\log p(\mathcal{X},\mathcal{Y}|\Theta)\bigl\lvert\mathcal{X},\Theta^{(t)}\right)$

In the M-step, the algorithm determines a new parameter maximizing $Q$

$\Theta^{(t+1)}=\arg\max_{\Theta}Q(\Theta|\Theta^{(t)})$

Each iteration is guaranteed to increase the likelihood, and finally the algorithm converges to a local maximum of the likelihood function.

Clearly, the missing data $\mathcal{Y}$ has strong affects on the performance of the EM algorithm since the optimal parameter $\Theta^{*}$ is obtained by maximizing $E\left(\log p(\mathcal{X},\mathcal{Y}|\Theta)\right)$. For example, the EM algorithm finds a local maximum of the likelihood function, which depends on the choice of $\mathcal{Y}$. Note that given the incomplete-data $\mathcal{X}$, there are many possible specifications of the missing data. Sometimes a natural choice will be obvious. At other times there may be several different ways to define $\mathcal{Y}$. In these cases, how can we choose a suitable $\mathcal{Y}$ to make the solution more physically plausible? This question is not addressed in the EM algorithm because the likelihood function does not reflect any influence of the missing data.

Intuitively, we would like to choose the missing data that has a strong physical relation with the observations so that the reached local maximum of likelihood has good physical plausibility. The strong relationship between the missing data and the observations implies that the missing data has little uncertainty given the observations. In other words, the observations contain a lot of information about the missing data and we can infer the missing data from the observations with a small error rate.

The information about one object contained in another object can be measured by mutual information

$H(X|Y)=\sum_{y}p(y)H(X|Y=y)=-\sum_{x}\sum_{y}p(x,y)\log p(x|y)$

based on Shannon information theory. The relationship between entropy and mutual information

$I(X;Y)=H(X)-H(X\vert Y)=H(Y)-H(Y\vert X)$

demonstrates that the mutual inforamtion measures the amount of information that one random variable contains about another one.

With mutual information as the regularizer, we have the regularized likelihood

$\widetilde{L}(\Theta;\mathcal{X})=L(\Theta;\mathcal{X})+\gamma I(X;Y|\Theta)$

where $X$ is the random variable of observations and $Y$ is the random variable of missing data. Because we usually do not know much about the missing data, we may naturally assume that $Y$ follows a uniform distribution and thus $H(Y)$ is a constant value given the range of $Y$. Since $I(X;Y)=H(Y)-H(Y\vert X)$, we may also use the following regularized likelihood

$\widetilde{L}(\Theta;\mathcal{X}) =L(\Theta;\mathcal{X})-\gamma H(Y|X;\Theta)$

The conditional entropy $H(Y|X)$ measures how uncertain we are of $Y$ on the average when we know $X$. In fact, $H(Y|X)=0$ if and only if $Y$ is a function of $X$. Thus, we expect $H(Y|X)$ to be small if the observations $\mathcal{X}$ and the missing data $\mathcal{Y}$ have a strong co-relation.

To optimize the regularized likelihood, we only need slightly modify the M-step of the EM algorithm. Now it is

$\Theta^{(t+1)}=\arg\max_{\Theta}\widetilde{Q}(\Theta|\Theta^{(t)})$

where

$\widetilde{Q}(\Theta|\Theta^{(t)})=Q(\Theta|\Theta^{(t)})+\gamma I(X;Y|\Theta)$

or

$\widetilde{Q}(\Theta|\Theta^{(t)})=Q(\Theta|\Theta^{(t)})-\gamma H(Y|X;\Theta)$

The modified algorithm is called the regularized EM algorithm. And we can easily prove the convergence of the regularized EM algorithm in the framework of proximal point algorithm.

We applied the regularized EM algorithm to fit the finite mixture model, which is of great use in machine learning. Due to space limit, we didn’t present the derivation of Equation (19) and (22) in the paper. In case that you are interested, here is the derivation.