{"title": "Deterministic Annealing Variant of the EM Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 545, "page_last": 552, "abstract": null, "full_text": "Deterministic Annealing Variant \n\nof the EM Algorithm \n\nN aonori U eda \nueda@cslab.kecl.ntt.jp \n\nRyohei N alcano \nnakano@cslab.kecl.ntt.jp \n\nNTT Communication Science Laboratories \n\nHikaridai, Seika-cho, Soraku-gun, \n\nKyoto 619-02 Japan \n\nAbstract \n\nWe present a deterministic annealing variant of the EM algorithm \nfor maximum likelihood parameter estimation problems. In our \napproach, the EM process is reformulated as the problem of min(cid:173)\nimizing the thermodynamic free energy by using the principle of \nmaximum entropy and statistical mechanics analogy. Unlike simu(cid:173)\nlated annealing approaches, this minimization is deterministically \nperformed. Moreover, the derived algorithm, unlike the conven(cid:173)\ntional EM algorithm, can obtain better estimates free of the initial \nparameter values. \n\n1 \n\nINTRODUCTION \n\nThe Expectation-Maximization (EM) algorithm (Dempster, Laird & Rubin, 1977) \nis an iterative statistical technique for computing maximum likelihood parameter \nestimates from incomplete data. It has generally been employed to a wide variety \nof parameter estimation problems. Recently, the EM algorithm has also been suc(cid:173)\ncessfully employed as the learning algorithm of the hierarchical mixture of experts \n(Jordan & Jacobs, 1993). In addition, it has been found to have some relationship \nto the learning of the Boltzmann machines (Byrne, 1992). \n\nThis algorithm has attractive features such as reliable global convergence, low cost \nper iteration, economy of storage, and ease of programming, but it is not free from \nproblems in practice. The serious practical problem associated with the algorithm \n\n\f546 \n\nNaonori Veda, Ryohei Nakano \n\nis the local maxima problem. 
This problem makes the performance dependent on the initial parameter values. Indeed, the EM algorithm should be run from as wide a choice of starting values as possible, selected according to some ad hoc criterion. \n\nTo overcome this problem, we adopt the principles of statistical mechanics. Namely, using the principle of maximum entropy, the thermodynamic free energy is defined as an effective cost function that depends on the temperature. The maximization of the log-likelihood is then achieved by minimizing this cost function. Unlike simulated annealing (Geman & Geman, 1984), where a stochastic search is performed on the given energy surface, this cost function is deterministically optimized at each temperature. \n\nSuch a deterministic annealing (DA) approach has been successfully adopted for vector quantization and clustering problems (Rose et al., 1992; Buhmann et al., 1993; Wong, 1993). Recently, Yuille et al. (Yuille, Stolorz & Utans, 1994) have shown that the EM algorithm can be used in conjunction with DA. In our previous paper, independent of Yuille's work, we presented a new EM algorithm with DA for mixture density estimation problems (Ueda & Nakano, 1994). The aim of this paper is to generalize our earlier work and to derive a DA variant of the general EM algorithm. Since the EM algorithm can be used not only for mixture estimation problems but also for other parameter estimation problems, this generalization is expected to be of value in practice. \n\n2 GENERAL THEORY OF THE EM ALGORITHM \n\nSuppose that a measure space Y of \"unobservable data\" exists corresponding to a measure space X of \"observable data\". An observable data sample x (∈ X) with density p(x; Θ) is called incomplete, and (x, y) with joint density p(x, y; Θ) is called complete, where y is an unobservable data sample¹ corresponding to x. Note that x ∈ R^n and y ∈ R^m.
Θ is the parameter of the density to be estimated. Given incomplete data samples X = {x_k | k = 1, ..., N}, the goal of the EM algorithm is to compute the maximum-likelihood estimate of Θ, i.e., the Θ that maximizes the following log-likelihood function: \n\nL(Θ; X) = Σ_{k=1}^{N} log p(x_k; Θ), (1) \n\nby using the following complete-data log-likelihood function: \n\nL_c(Θ; X) = Σ_{k=1}^{N} log p(x_k, y_k; Θ). (2) \n\nIn the EM algorithm, the parameter Θ is estimated iteratively. Suppose that Θ(t) denotes the current estimate of Θ after the t-th iteration of the algorithm. Then Θ(t+1) at the next iteration is determined by the following two steps: \n\nE-step: Compute the Q-function, defined as the conditional expectation of the complete-data log-likelihood given X and Θ(t): \n\nQ(Θ; Θ(t)) ≝ E{L_c(Θ; X) | X, Θ(t)}. (3) \n\nM-step: Set Θ(t+1) equal to the Θ that maximizes Q(Θ; Θ(t)). \n\nIt has been shown theoretically that this iterative procedure for maximizing Q over Θ causes the likelihood L to increase monotonically, i.e., L(Θ(t+1)) ≥ L(Θ(t)). Eventually, L(Θ(t)) converges to a local maximum. The EM algorithm is especially useful when the maximization of the Q-function can be performed more easily than that of L. \n\nBy substituting Eq. 2 into Eq. 3, we have \n\nQ(Θ; Θ(t)) = Σ_{k=1}^{N} ∫...∫ {log p(x_k, y_k; Θ)} Π_{j=1}^{N} p(y_j | x_j; Θ(t)) dy_1 ... dy_N (4) \n= Σ_{k=1}^{N} ∫ {log p(x_k, y_k; Θ)} p(y_k | x_k; Θ(t)) dy_k. (5) \n\nThe Θ that maximizes Q(Θ; Θ(t)) should satisfy ∂Q/∂Θ = 0, or equivalently, \n\nΣ_{k=1}^{N} ∫ {(∂/∂Θ) log p(x_k, y_k; Θ)} p(y_k | x_k; Θ(t)) dy_k = 0. \n\n¹In such unsupervised learning as mixture problems, y reduces to an integer value (y ∈ {1, 2, ..., C}, where C is the number of components), indicating the component from which a data sample x originates.
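For concreteness, the two steps can be sketched in code for a two-component, unit-variance normal mixture with fixed weights 0.3 and 0.7 (the setting of the demonstration in Section 4). This is an illustrative sketch with our own names and toy data, not the authors' implementation; the E-step computes the standard Bayes posterior and the M-step reduces to posterior-weighted means.

```python
import math

# Illustrative EM sketch (not from the paper): two-component normal mixture
# p(x; m1, m2) = 0.3 N(x; m1, 1) + 0.7 N(x; m2, 1); only the means are estimated.

def normal_density(x, m):
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)

def em_step(xs, m1, m2):
    # E-step: posterior probability that each sample came from component 1
    post1 = []
    for x in xs:
        j1 = 0.3 * normal_density(x, m1)   # joint p(x, y=1)
        j2 = 0.7 * normal_density(x, m2)   # joint p(x, y=2)
        post1.append(j1 / (j1 + j2))
    # M-step: maximizing Q gives posterior-weighted sample means
    w1 = sum(post1)
    m1_new = sum(p * x for p, x in zip(post1, xs)) / w1
    m2_new = sum((1.0 - p) * x for p, x in zip(post1, xs)) / (len(xs) - w1)
    return m1_new, m2_new

# Toy data drawn near -2 and 2; iterate the two steps from an initial guess.
xs = [-2.3, -1.9, -2.1, 1.8, 2.2, 2.0, 1.9, 2.1]
m1, m2 = -1.0, 1.0
for _ in range(50):
    m1, m2 = em_step(xs, m1, m2)
```

With a well-placed initial guess the iteration settles near the cluster means; Section 3 addresses what happens when the initial guess is poor.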
Here, p(y_k | x_k; Θ(t)) denotes the posterior, which can be computed by the following Bayes rule: \n\np(y_k | x_k; Θ(t)) = p(x_k, y_k; Θ(t)) / ∫ p(x_k, y_k; Θ(t)) dy_k. (6) \n\nThis can be interpreted as estimating the missing information by means of the posterior. However, because the reliability of the posterior depends strongly on the parameter Θ(t), the performance of the EM algorithm is sensitive to the initial parameter value Θ(0). This often causes the algorithm to become trapped at a local maximum. In the next section, we derive a new variant of the EM algorithm as an attempt at global maximization of the Q-function in the EM process. \n\n3 DETERMINISTIC ANNEALING APPROACH \n\n3.1 DERIVATION OF PARAMETERIZED POSTERIOR \n\nInstead of the posterior given in Eq. 6, we introduce another posterior, f(y_k | x_k), whose functional form will be specified later. Using f(y_k | x_k), we consider a new function in place of Q, say E, defined as: \n\nE ≝ Σ_{k=1}^{N} ∫ {−log p(x_k, y_k; Θ)} f(y_k | x_k) dy_k. (7) \n\n(Note that E is always nonnegative.) One can easily see that −E is also a conditional expectation of the complete-data log-likelihood, but it differs from Q in that the expectation is taken with respect to f(y_k | x_k) instead of the posterior given by Eq. 6. In other words, if f(y_k | x_k) = p(y_k | x_k; Θ(t)), then E = −Q. \n\nSince we do not have a priori knowledge about f(y_k | x_k), we apply the principle of maximum entropy to specify it. That is, by maximizing the entropy, given by \n\nH = −Σ_{k=1}^{N} ∫ {log f(y_k | x_k)} f(y_k | x_k) dy_k, (8) \n\nwith respect to f, under the constraint of Eq. 7 and ∫ f dy_k = 1, we obtain the following Gibbs distribution: \n\nf(y_k | x_k) = (1/Z_{x_k}) exp{−β(−log p(x_k, y_k; Θ))}, (9) \n\nwhere Z_{x_k} = ∫ exp{−β(−log p(x_k, y_k; Θ))} dy_k is called the partition function. The parameter β is the Lagrange multiplier, determined by the value of E. By analogy with annealing, 1/β corresponds to the \"temperature\". \n\nBy simplifying Eq. 9, we obtain a new posterior parameterized by β: \n\nf(y_k | x_k) = p(x_k, y_k; Θ)^β / ∫ p(x_k, y_k; Θ)^β dy_k. (10) \n\nClearly, when β = 1, f(y_k | x_k) reduces to the original posterior given in Eq. 6. The effect of β will be explained later. \n\nSince x_1, ..., x_N are independently and identically distributed observations, the partition function Z_β(Θ) for X becomes Π_k Z_{x_k}. Therefore, \n\nZ_β(Θ) = Π_{k=1}^{N} ∫ p(x_k, y_k; Θ)^β dy_k. (11) \n\nOnce the partition function is obtained explicitly, the statistical mechanics analogy lets us define the free energy as an effective cost function that depends on the temperature: \n\nF_β(Θ) = −(1/β) log Z_β(Θ). (12) \n\nAt equilibrium, it is well known that a thermodynamic system settles into a configuration that minimizes its free energy. Hence, Θ should satisfy ∂F_β(Θ)/∂Θ = 0. It follows that \n\nΣ_{k=1}^{N} ∫ {(∂/∂Θ) log p(x_k, y_k; Θ)} f(y_k | x_k) dy_k = 0. (13) \n\nInterestingly, we have arrived at the same equation as the result of the maximization of the Q-function, except that the posterior p(y_k | x_k; Θ(t)) in Eq. 5 is replaced by f(y_k | x_k). \n\n3.2 ANNEALING VARIANT OF THE EM ALGORITHM \n\nLet Q_β(Θ; Θ(t)) be the expectation of the complete-data log-likelihood under the parameterized posterior f(y_k | x_k). Then the following deterministic annealing variant of the EM algorithm can be naturally derived to maximize −F_β(Θ). \n\n[Annealing EM (AEM) algorithm] \n1. Set β ← β_min (0 < β_min < 1) and choose an initial estimate Θ(0). \n2. Iterate the following two steps until convergence²: compute Q_β(Θ; Θ(t)) using f(y_k | x_k) with Θ = Θ(t) (E-step), and set Θ(t+1) equal to the Θ that maximizes Q_β(Θ; Θ(t)) (M-step). \n3. Increase β. \n4. If β ≤ β_max, set Θ(0) to the converged estimate and go to step 2; otherwise stop. \n\nWhen β is small, −F_β(Θ) has few local maxima; for larger β > 0, the function Q_β will have several local maxima. However, at each step it can be assumed that the new global maximum is close to the previous one. Hence, under this assumption, the algorithm can track the global maximum at each β while increasing β.
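The effect of β on the posterior can be illustrated numerically. The sketch below (our own, not code from the paper) evaluates the parameterized posterior of Eq. 10 for a discrete hidden variable y ∈ {1, 2}, where the integral reduces to a sum: small β flattens the posterior toward the uniform distribution (high temperature), while β = 1 recovers the ordinary Bayes posterior of Eq. 6.

```python
import math

# Beta-parameterized posterior of Eq. 10 for discrete y in {1, 2}:
#   f(y | x) = p(x, y)^beta / sum over y' of p(x, y')^beta.
# The mixture (weights 0.3/0.7, unit variances) mirrors Section 4.

def joint(x, y, m1, m2):
    w, m = (0.3, m1) if y == 1 else (0.7, m2)
    return w * math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)

def parameterized_posterior(x, m1, m2, beta):
    p1 = joint(x, 1, m1, m2) ** beta
    p2 = joint(x, 2, m1, m2) ** beta
    z = p1 + p2                      # per-sample partition function Z_x
    return p1 / z, p2 / z

# A sample at x = 1 lies near the second component (m2 = 2):
soft = parameterized_posterior(1.0, -2.0, 2.0, beta=0.1)   # high temperature
bayes = parameterized_posterior(1.0, -2.0, 2.0, beta=1.0)  # ordinary posterior
```

At β = 0.1 the two components receive nearly equal responsibility, so early M-steps are barely influenced by the current (possibly poor) parameter estimates; as β approaches 1 the responsibilities sharpen into the Bayes posterior.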
Clearly, when β = 1, the parameterized posterior coincides with the original one. Moreover, noting that −F_1(Θ) = L(Θ), β_max ought to be one. \n\n4 DEMONSTRATION \n\nTo visualize how the proposed algorithm works, we consider a simple one-dimensional, two-component normal mixture problem. The mixture is given by p(x; m1, m2) = 0.3 (1/√(2π)) exp{−(1/2)(x − m1)²} + 0.7 (1/√(2π)) exp{−(1/2)(x − m2)²}. In this case, Θ = (m1, m2) and y_k ∈ {1, 2}; therefore, the joint densities are p(x, 1; m1, m2) = 0.3 (1/√(2π)) exp{−(1/2)(x − m1)²} and p(x, 2; m1, m2) = 0.7 (1/√(2π)) exp{−(1/2)(x − m2)²}. \n\nOne hundred samples in total were generated from this mixture with m1 = −2 and m2 = 2. Figure 1 shows contour plots of the −F_β(m1, m2)/N surface for β = 0.1, 0.25, 0.3, 0.5, 0.7, and 1. It is interesting to see how F_β(m1, m2) varies with β: a finer and truer structure emerges as β increases. Note that, as explained before, when β = 1, −F_1(m1, m2) ≡ L(m1, m2). The maximum of −F_1 (or L) occurs at m1 = −2.0 and m2 = 1.9. As initial points, (m1(0), m2(0)) = (4, −1) and (−2, −4) were tested. \n\nFigure 1: Contour plots of the −F_β(m1, m2)/N surface (\"+\" denotes the global maximum at each β). \n\nFigure 2: Trajectories of the EM and AEM procedures from the starting points (a) (m1(0), m2(0)) = (4, −1) and (b) (m1(0), m2(0)) = (−2, −4). \n\n²When the sequence converges to a saddle point (e.g., when the Hessian matrix of −F_β(Θ) has at least one positive eigenvalue), a local line search in the direction of the eigenvector corresponding to the largest eigenvalue should be performed to escape from the saddle point.
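The emergence of structure in Figure 1 can be reproduced in a few lines. For discrete y, the per-sample partition function of Eq. 11 is a sum, so −F_β(m1, m2)/N = (1/(βN)) Σ_k log Σ_y p(x_k, y; m1, m2)^β. The sketch below (our own toy version with ten samples, not the paper's 100-sample experiment) locates the grid maximum of this surface at a small β and at β = 1.

```python
import math

# Coarse-grid maximum of -F_beta(m1, m2)/N for the Section 4 mixture
# (illustrative sketch; toy sample standing in for the paper's 100 draws).

def neg_free_energy(xs, m1, m2, beta):
    c = 1.0 / math.sqrt(2.0 * math.pi)
    total = 0.0
    for x in xs:
        z = (0.3 * c * math.exp(-0.5 * (x - m1) ** 2)) ** beta \
            + (0.7 * c * math.exp(-0.5 * (x - m2) ** 2)) ** beta
        total += math.log(z)
    return total / (beta * len(xs))

xs = [-2.4, -1.9, -2.1, 1.6, 2.3, 2.0, 1.8, 2.2, 2.1, 1.9]
grid = [0.5 * i for i in range(-8, 9)]          # m1, m2 in [-4, 4]

def grid_max(beta):
    return max(((m1, m2) for m1 in grid for m2 in grid),
               key=lambda p: neg_free_energy(xs, p[0], p[1], beta))

low = grid_max(0.1)    # high temperature: one broad maximum, means merged
high = grid_max(1.0)   # beta = 1: the maximum-likelihood structure appears
```

At β = 0.1 the surface has a single shallow maximum with m1 ≈ m2 near the overall sample mean; at β = 1 the grid maximum sits at the well-separated maximum-likelihood solution, with the 0.3-weight component on the smaller cluster.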
Note that these points are close to a local maximum. Figure 2 shows how both algorithms converge from these starting points³. \n\nThe original EM algorithm converges to (m1(12), m2(12)) = (2.002, −1.687) (Figure 2(a)) and to (m1(19), m2(19)) = (1.999, −1.696) (Figure 2(b)). In contrast, the proposed AEM algorithm converges to (m1(43), m2(43)) = (−2.022, 1.879) in both cases (Figures 2(a) and 2(b)). Since the starting points are close to the local maximum, the original EM algorithm becomes trapped by it, while the proposed algorithm successfully converges to the global maximum in both cases. \n\n5 DISCUSSION \n\nA new view of the EM algorithm has recently been presented (Hathaway, 1986; Neal & Hinton, 1993). This view states that the EM steps can be regarded as a grouped version of the method of coordinate ascent on the following objective function: \n\nJ ≝ Σ_{k=1}^{N} E_{P(y_k)}{log p(x_k, y_k; Θ)} − Σ_{k=1}^{N} E_{P(y_k)}{log P(y_k)} (15) \n\nover P(y_k) and Θ. That is, the E-step corresponds to the maximization of J with respect to P(y_k) with Θ fixed, while the M-step corresponds to that of J with respect to Θ with P(y_k) fixed. Neal & Hinton mentioned that, apart from the sign change, J is analogous to the \"free energy\" well known in statistical physics. \n\nIt is worth mentioning another interpretation of the derived parameterized posterior, since it plays a central role in the AEM algorithm. Taking the logarithm of both sides of Eq. 10, we have \n\n(1/β) log ∫ p(x_k, y_k; Θ)^β dy_k = log p(x_k, y_k; Θ) − (1/β) log f(y_k | x_k). (16) \n\nMoreover, taking the conditional expectation⁴ given x_k, summing over k, and using Eq. 12, we have \n\nF_β(Θ) = Σ_{k=1}^{N} E_{f(y_k|x_k)}{−log p(x_k, y_k; Θ)} + (1/β) Σ_{k=1}^{N} E_{f(y_k|x_k)}{log f(y_k | x_k)}. (17) \n\nFrom Eqs. 7 and 8, Eq. 17 can be rewritten as F_β(Θ) = E − (1/β)H.
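The identity F_β(Θ) = E − (1/β)H can also be checked numerically. The sketch below (our own, for the discrete-y case with illustrative parameter values) computes the free energy once from the partition function of Eq. 12 and once from Eqs. 7 and 8 under the parameterized posterior of Eq. 10; the two values agree to machine precision.

```python
import math

# Verify F_beta = E - (1/beta) H for discrete y in {1, 2} (Eqs. 7, 8, 10, 12, 17).
# Mixture parameters and sample values are illustrative only.

def joints(x, m1, m2):
    c = 1.0 / math.sqrt(2.0 * math.pi)
    return (0.3 * c * math.exp(-0.5 * (x - m1) ** 2),
            0.7 * c * math.exp(-0.5 * (x - m2) ** 2))

def free_energy_two_ways(xs, m1, m2, beta):
    f_partition = 0.0    # Eq. 12: -(1/beta) sum over k of log Z_{x_k}
    e_term = 0.0         # E of Eq. 7 under f of Eq. 10
    h_term = 0.0         # H of Eq. 8 under f of Eq. 10
    for x in xs:
        p = joints(x, m1, m2)
        z = sum(q ** beta for q in p)
        f_partition += -math.log(z) / beta
        f = [q ** beta / z for q in p]        # parameterized posterior, Eq. 10
        e_term += sum(fy * -math.log(q) for fy, q in zip(f, p))
        h_term += -sum(fy * math.log(fy) for fy in f)
    return f_partition, e_term - h_term / beta

f1, f2 = free_energy_two_ways([-2.1, 0.3, 1.9], m1=-2.0, m2=2.0, beta=0.7)
```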
Noting that 1/β corresponds to the \"temperature\", one can see that this expression exactly agrees with the free energy, whereas Eq. 15 involves no \"temperature\" at all. In other words, F_β(Θ) can be interpreted as an annealing variant of −J. Clearly, when β = 1, F_β(Θ) ≡ −J. \n\n³Although the AEM procedure was actually performed for successive β (β_new ← β × 1.4), the results are superimposed on the −F_1(m1, m2)/N surface for convenience. \n⁴Since the LHS of Eq. 16 is independent of y_k, it does not change after taking the expectation. \n\nThe proposed algorithm is also applicable to the learning of (Generalized) Radial Basis Function (RBF, GRBF) networks. Indeed, Nowlan (Nowlan, 1990), for instance, proposed a maximum likelihood competitive learning algorithm for RBF networks, in which \"soft competition\" and \"hard competition\" were experimentally compared; soft competition was shown to give better performance. In our algorithm, on the other hand, the soft model exactly corresponds to the case β = 1, while the hard model corresponds to the case β → ∞. Consequently, both models can be regarded as special cases of our algorithm. \n\nAcknowledgements \n\nWe would like to thank Dr. Tsukasa Kawaoka, NTT CS Labs, for his encouragement. \n\nReferences \n\nJ. Buhmann & H. Kuhnel. (1993) Complexity optimized data clustering by competitive neural networks. Neural Computation, 5:75-88. \n\nW. Byrne. (1992) Alternating minimization and Boltzmann machine learning. IEEE Trans. Neural Networks, 3:612-620. \n\nA. P. Dempster, N. M. Laird & D. B. Rubin. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B (methodological), 39:1-38. \n\nS. Geman & D. Geman. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6(6):721-741. \n\nR. J. Hathaway. (1986) Another interpretation of the EM algorithm for mixture distributions. Statistics & Probability Letters, 4:53-56. \n\nM. I. Jordan & R. A. Jacobs. (1993) Hierarchical mixtures of experts and the EM algorithm. MIT Dept. of Brain and Cognitive Science preprint. \n\nR. M. Neal & G. E. Hinton. (1993) A new view of the EM algorithm that justifies incremental and other variants. Submitted to Biometrika. \n\nS. J. Nowlan. (1990) Maximum likelihood competitive learning. In D. S. Touretzky et al., eds., Advances in Neural Information Processing Systems 2, Morgan Kaufmann, 574-582. \n\nK. Rose, E. Gurewitz & G. C. Fox. (1992) Vector quantization by deterministic annealing. IEEE Trans. Information Theory, 38(4):1249-1257. \n\nN. Ueda & R. Nakano. (1994) Mixture density estimation via EM algorithm with deterministic annealing. In Proc. IEEE Neural Networks for Signal Processing, 69-77. \n\nY. Wong. (1993) Clustering data by melting. Neural Computation, 5:89-104. \n\nA. L. Yuille, P. Stolorz & J. Utans. (1994) Statistical physics, mixtures of distributions, and the EM algorithm. Neural Computation, 6:334-340.", "award": [], "sourceid": 941, "authors": [{"given_name": "Naonori", "family_name": "Ueda", "institution": null}, {"given_name": "Ryohei", "family_name": "Nakano", "institution": null}]}