{"title": "Recognizing Handwritten Digits Using Mixtures of Linear Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1015, "page_last": 1022, "abstract": null, "full_text": "Recognizing Handwritten Digits Using Mixtures of Linear Models \n\nGeoffrey E. Hinton, Michael Revow, Peter Dayan \nDepartment of Computer Science, University of Toronto \nToronto, Ontario, Canada M5S 1A4 \n\nAbstract \n\nWe construct a mixture of locally linear generative models of a collection of pixel-based images of digits, and use them for recognition. Different models of a given digit are used to capture different styles of writing, and new images are classified by evaluating their log-likelihoods under each model. We use an EM-based algorithm in which the M-step is computationally straightforward principal components analysis (PCA). Incorporating tangent-plane information [12] about expected local deformations only requires adding tangent vectors into the sample covariance matrices for the PCA, and it demonstrably improves performance. \n\n1 Introduction \n\nThe usual way of using a neural network for digit recognition is to train it to output one of the ten classes. When the training data is limited to N examples equally distributed among the classes, there are only N log2 10 bits of constraint in the class labels, so the number of free parameters that can be allowed in a discriminative neural net model is severely limited. An alternative approach, motivated by density estimation, is to train a separate autoencoder network on examples of each digit class and to recognise digits by seeing which autoencoder network gives the best reconstruction of the data. Because the output of each autoencoder has the same dimensionality as the image, each training example provides many more bits of constraint on the parameters. 
In the example we describe, 7,000 training images are sufficient to fit 384,000 parameters and the training procedure is fast enough to do the fitting overnight on an R4400-based machine. \nAutoencoders can be viewed in terms of minimum description length descriptions of data in which the input-hidden weights produce a code for a particular case and the hidden-output weights embody a generative model which turns this code back into a close approximation of the original example [14, 7]. Code costs (under some prior) and reconstruction error (squared error assuming an isotropic Gaussian misfit model) sum to give the overall code length, which can be viewed as a lower bound on the log probability density that the autoencoder assigns to the image. This properly places the emphasis on designing appropriate generative models for the data which will generalise well by assigning high log probabilities to new patterns from the same classes. \n\nWe apply this idea to recognising handwritten digits from grey-level pixel images using linear autoencoders. Linear hidden units for autoencoders are barely worse than non-linear ones when squared reconstruction error is used [1], but have the great computational advantage during training that input-hidden and hidden-output weights can be derived from principal components analysis (PCA) of the training data. In effect a PCA encoder approximates the entire n-dimensional distribution of the data with a lower dimensional \"Gaussian pancake\" [13], choosing, for optimal data compression, to retain just a few of the principal components. One could build a single PCA model for each digit; however, the many different styles of writing suggest that more than one Gaussian pancake should be used, by dividing the population of a single digit class into a number of sub-classes and approximating each class by its own model. 
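The classification scheme above, one linear autoencoder (PCA model) per class with recognition by best reconstruction, can be sketched in a few lines of NumPy. This is our illustrative reading, not the paper's code; the function names are ours, and the squared reconstruction error stands in for the code length under the isotropic Gaussian misfit model.

```python
import numpy as np

def fit_pca_model(X, h):
    """Fit an h-dimensional linear autoencoder (PCA) to patterns X (m x n).

    Returns the class mean and the top-h principal directions; the
    encoder and decoder weights are both given by these directions.
    """
    mu = X.mean(axis=0)
    # Eigen-decomposition of the sample covariance; keep the h largest.
    eigvals, eigvecs = np.linalg.eigh(np.cov(X - mu, rowvar=False))
    W = eigvecs[:, np.argsort(eigvals)[::-1][:h]]  # n x h
    return mu, W

def reconstruction_error(x, mu, W):
    """Squared error between x and its projection onto the model subspace.

    Under the isotropic Gaussian misfit model this is proportional to the
    negative log-likelihood (code length) of x under the model.
    """
    c = x - mu
    return float(np.sum((c - W @ (W.T @ c)) ** 2))

def classify(x, models):
    """Assign x to the class whose model reconstructs it best."""
    return min(models, key=lambda k: reconstruction_error(x, *models[k]))
```

Because the subspace is learned per class, variation along a class's own principal directions is free, while variation off the pancake is penalised.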
A similar idea for data compression was used by [9], where vector quantization was used to define sub-classes and PCA was performed within each sub-class (see also [2]). We used an iterative method based on the Expectation Maximisation (EM) algorithm [4] to fit mixtures of linear models. \nThe limiting case of the local linear approach would have just one training pattern in each model. This approach would amount to a nearest neighbour method for recognition using a Euclidean metric for error, a technique which is known to be infelicitous. One reason for this poor performance is that characters not only differ in writing styles, but also can undergo simple transformations such as translations and rotations. These give rise to somewhat non-linear changes as measured in pixel space. Nearest neighbour methods were dramatically improved by defining a metric in which locally linearised versions of these transformations cost nothing [12]. In their tangent distance method, each training or test point is represented by an h-dimensional linear subspace, where each of these dimensions corresponds to a linear version of one of the transformations, and distances are measured between subspaces rather than points. The local linear autoencoder method can be seen just like this: variations along one of the h principal component directions are free, while variations along the remaining principal directions cost. However, rather than storing and testing each training pattern, the local models summarise the regularities over a number of patterns. This reduces storage and recognition time, and allows the directions of free variation to be averaged over numbers of patterns and also to be determined by the data rather than being pre-specified. 
A priori knowledge that particular transformations are important can be incorporated using a version of the tangent-prop procedure [12], which is equivalent in this case to adding in slightly transformed versions of the patterns. Also, reconstruction error could be assessed either at a test pattern, or, more like tangent distance, between its transformation subspace and the models' principal components subspaces. \n\nFigure 1: Didactic example of tangent information and local linear models. See text for details. \n\nFigure 1 illustrates the idea. Imagine that the four points 1-4 portray in image space different examples of the same digit, subject to some smooth transformation. As in tangent distance, one could represent this curve using the points and their local tangents (thick lines). However, one might do better splitting it into three local linear models rather than four: model 'a' (just a line in this simple case) averages the upper part of the curve more effectively than the combination of the two tangents at '1' and '2'. However, given just the points, one might construct model 'b' for '3' and '4', which would be unfortunate. Incorporating information about the tangents as well would encourage the separation of these segments. Care should be taken in generalising this picture to high dimensional spaces. \nThe next section develops the theory behind variants of these systems (which is very similar to that in [5, 10]), and section 3 discusses how they perform. \n\n2 Theory \n\nLinear autoencoders embody a model in which variations from the mean of a population along certain directions are cheaper than along others, as measured by the log-unlikelihood of examples. Creating such a generative model is straightforward. 
Principal components analysis (PCA) is performed on the training data and the leading h principal components are retained, defining an h-dimensional subspace onto which the n-dimensional inputs are projected. We choose h using cross-validation, although a minimum description length criterion could also be used. We ignore the effect of different variances along the different principal components and use a model in which the code length for example i (the negative log-likelihood) is proportional to the reconstruction error: the squared Euclidean distance between the output of the autoencoder and the pattern itself. \nRather than having just one autoencoder for each digit, there is a whole collection, and therefore we use EM to assign examples to sub-classes, just as in clustering using a mixture of Gaussians generative model. During the E-step, the responsibility for each pattern is assigned amongst the sub-classes, and in the M-step PCA is performed, altering the parameters of a sub-class appropriately to minimise the reconstruction cost of the data for which it is responsible. \nFormally, the algorithm is: \n\n1. Choose initial autoencoder assignments for each example in the training set (typically using a K-means clustering algorithm). \n2. Perform PCA separately for each autoencoder. \n3. Reassign patterns to the autoencoder that reconstructs them the best. \n4. Stop if no patterns have changed sub-class, otherwise return to step 2. \n\nThere is a 'soft' version of the algorithm in which the responsibility of autoencoder q for example i is calculated as r_{i,q} = exp(-||E_{i,q}||^2 / 2σ^2) / Σ_r exp(-||E_{i,r}||^2 / 2σ^2), where E_{i,r} is the reconstruction error. For this, in step 2, the examples are weighted for the PCA by the responsibilities, and convergence is assessed by examining the change in the log-likelihood of the data at each iteration. 
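The hard version of steps 1-4 above can be sketched as follows. This is a minimal illustration in NumPy, not the paper's implementation: initial assignments are random here rather than from K-means, the per-model PCA uses an SVD of the centred cluster data, and empty sub-classes are reseeded from random patterns.

```python
import numpy as np

def fit_mixture_of_pcas(X, n_sub, h, n_iter=50, seed=0):
    """Hard EM for a mixture of h-dimensional PCA models (steps 1-4).

    X is (m, n); n_sub is the number of sub-classes. Returns per-model
    (mean, components) pairs and the final assignment of each pattern.
    """
    rng = np.random.default_rng(seed)
    # Step 1: crude initial assignment (random here; K-means in the text).
    assign = rng.integers(0, n_sub, size=len(X))
    models = [None] * n_sub
    for _ in range(n_iter):
        # Step 2: PCA separately for each autoencoder.
        for q in range(n_sub):
            Xq = X[assign == q]
            if len(Xq) == 0:
                Xq = X[rng.integers(0, len(X), size=h + 1)]  # reseed empty model
            mu = Xq.mean(axis=0)
            _, _, Vt = np.linalg.svd(Xq - mu, full_matrices=False)
            models[q] = (mu, Vt[:h].T)          # (n,), (n, h)
        # Step 3: reassign each pattern to the best-reconstructing model.
        errs = np.stack([
            np.sum((X - mu - (X - mu) @ W @ W.T) ** 2, axis=1)
            for mu, W in models
        ])
        new_assign = errs.argmin(axis=0)
        # Step 4: stop when no pattern changes sub-class.
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return models, assign
```

Each PCA step minimises the reconstruction error of the patterns assigned to that model, and each reassignment step cannot increase the total error, so the procedure converges to a fixed point in which every pattern sits with its best-reconstructing sub-class.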
The soft version requires a choice of σ^2, the assumed variance in the directions orthogonal to the pancake. \nThe algorithm generates a set of local linear models for each digit. Given a test pattern we evaluate the code length (the negative log-likelihood) against all the models for all the digits. We use a hard method for classification, determining the identity of the pattern only by the model which reconstructs it best. The absolute quality of the best reconstruction and the relative qualities of slightly sub-optimal reconstructions are available to reject ambiguous cases. \nFor a given linear model, not counting the code cost implies that deformations of the images along the principal components for the sub-class are free. This is like the metric used by [11] except that they explicitly specified the directions in which deformations should be free, rather than learning them from the data. We wished to incorporate information about the preferred directions without losing the summarisation capacity of the local models, and therefore turned to the tangent prop algorithm [12]. \nTangent prop takes into account information about how the output of the system should vary locally with particular distortions of the input by penalising the system for having incorrect derivatives in the relevant directions. In our case, the overall classification process is highly non-linear, making the application of tangent-prop to it computationally unattractive, but it is easy to add a tangent constraint to the reconstruction step because it is linear. Imagine requiring the system f(p) = Ap to reconstruct x + λt and x - λt as well as x, for an input example x and distortion t (the tangent vector), where λ is a weight. Their contribution to the error is proportional to |x - Ax|^2 + λ^2 |t - At|^2, where the second term is equivalent to the error term that tangent-prop would add. 
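The tangent vectors t themselves (for translation, rotation and scaling, as used later in Figure 2) can be approximated by finite differences of the image. The sketch below is our illustration under assumed sign and axis conventions; the function names are ours and the paper does not prescribe this particular discretisation.

```python
import numpy as np

def translation_tangents(img):
    """Tangent vectors for x- and y-translation of a 2-D image.

    Approximated by central finite differences of one-pixel shifts; each
    returned vector is flattened to match the pixel-vector form used by
    the linear models.
    """
    dx = (np.roll(img, -1, axis=1) - np.roll(img, 1, axis=1)) / 2.0
    dy = (np.roll(img, -1, axis=0) - np.roll(img, 1, axis=0)) / 2.0
    return dx.ravel(), dy.ravel()

def rotation_tangent(img):
    """Tangent vector for a small rotation about the image centre,
    built from the translation tangents and the pixel coordinates."""
    n = img.shape[0]
    yy, xx = np.mgrid[:n, :n] - (n - 1) / 2.0
    dx, dy = translation_tangents(img)
    # dI/dtheta = -y * dI/dx + x * dI/dy for a small rotation angle theta
    return (-yy.ravel()) * dx + xx.ravel() * dy
```

A small shift of the image then differs from the original by approximately a multiple of the corresponding tangent vector, which is exactly the local linearisation the tangent machinery relies on.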
Incorporating this into the PCA is as simple as adding a weighted version of tt^T to the covariance matrix; the tangent vectors are never added to the means of the sub-classes. \n\n3 Results \n\nWe have evaluated the performance of the system on data from the CEDAR CDROM 1 database containing handwritten digits lifted from mail pieces passing through a United States Post Office [8]. We divided the br training set of binary segmented digits into 7,000 training examples, 2,000 validation examples and 2,000 \"internal test\" examples. All digits were equally represented in these sets. The binary images in the database are all of different sizes, so they were scaled onto a 16 x 16 grid and then smoothed with a Gaussian filter. \nThe validation set was used to investigate different choices for the Gaussian filter variances, the numbers of sub-classes per digit and principal components per model, and the different weightings on the tangent vectors. Clearly this is a large parameter space and we were only able to perform a very coarse search. In the results reported here all digits have the same number of sub-classes (10) and the number of principal components per model was picked so as to explain 95% of the training set variance assigned to that model. Once a reasonable set of parameter settings had been decided upon, we used all 11,000 images to train a final version which was tested on the official bs set (2711 images). \n\nTable 1: Classification errors on the validation set when different weightings are used for the tangent vectors during clustering and recognition. No rejections were allowed. \n\nThere are two major steps to the algorithm: defining sub-classes within each digit and reconstructing for recognition. 
We found that the tangent vectors should be more heavily weighted in the sub-class clustering step than during ultimate recognition. Figure 2 shows the means of the 10 sub-classes for the digit two where the clustering has been done (a) without or (b) with tangent vectors. It is clear that the clusters defined in (b) capture different styles of 2s in a way that those in (a) do not: they are more diverse and less noisy. They also perform better. The raw error rates (no rejections allowed) on the validation set with different amounts of tangent vector weighting are shown in Table 1. The results on the official test set (2711 examples) are shown in Table 2. \n\nFigure 2: Cluster means for images of twos. (a) Without tangent vectors. (b) Cluster means using translation, rotation and scaling tangent vectors. \n\n4 Discussion and Conclusions \n\nA mixture of local linear models is an effective way to capture the underlying styles of handwritten digits. The first few principal components (less than 20 on 256-dimensional data) extract a significant proportion of the variance in the images within each sub-class. The resulting models classify surprisingly well without requiring large amounts of either storage or computation during recognition. Further, it is computationally easy and demonstrably useful to incorporate tangent information requiring reconstruction to be good for small transformations of the sample patterns in a handful of pre-specified directions. Adding tangent information is exactly equivalent to replacing each digit by a Gaussian cloud of digits perturbed along the tangent plane and is much more efficient than adding extra stochastically perturbed examples. The weightings applied to reconstruction of the tangent vectors are equivalent to the variances of the cloud. 
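The equivalence between adding tangent information and perturbing the patterns can be checked numerically: augmenting each pattern x with the symmetric pair x + λt and x - λt leaves the mean unchanged and adds exactly λ^2 tt^T to the covariance. The snippet below verifies this identity on synthetic stand-in data (the arrays and names are ours, for illustration only).

```python
import numpy as np

def scatter(X, mean):
    """Second-moment (covariance) matrix of the rows of X about a given mean."""
    C = X - mean
    return C.T @ C / len(X)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))        # stand-in patterns
t = rng.normal(size=8)              # stand-in tangent vector
lam = 0.5                           # tangent weighting

mu = X.mean(axis=0)
# Explicitly augment with the symmetric pair x + lam*t and x - lam*t.
X_aug = np.vstack([X + lam * t, X - lam * t])
cov_aug = scatter(X_aug, mu)        # the mean is unchanged by the pairing
# Equivalent shortcut: add lam^2 * t t^T to the original covariance.
cov_shortcut = scatter(X, mu) + lam ** 2 * np.outer(t, t)
assert np.allclose(cov_aug, cov_shortcut)
```

This is why incorporating the tangents costs only one rank-one update per tangent vector, instead of the extra storage and computation of explicitly perturbed training examples.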
\nThere is an interesting relationship between mixtures of linear models and mixtures of Gaussians. Fitting a mixture of full-covariance Gaussians is typically infeasible when the dimensionality of the input image space, n, is high and the number of training examples is limited. Each Gaussian requires n parameters to define its mean and n(n + 1)/2 more parameters to define its covariance matrix. One way to reduce the number of parameters is to use a diagonal covariance matrix, but this is a very poor approximation for images because neighbouring pixel intensities are very highly correlated. An alternative way to simplify the covariance matrix is to flatten the Gaussian into a pancake that has h dimensions within the pancake and n - h dimensions orthogonal to the pancake. Within this orthogonal subspace we assume an isotropic Gaussian distribution (i.e. a diagonal covariance matrix with equal, small variances along the diagonal), so we eliminate (n - h)(n - h + 1)/2 degrees of freedom, which is nearly all the degrees of freedom of the full covariance matrix if h is small compared with n. Not counting the mean, this leaves h(2n - h + 1)/2 degrees of freedom, which is exactly the number of parameters required to define the first h principal components. (The h-th principal component only requires n - h + 1 parameters because it must be orthogonal to all previous components.) Thus we can view PCA as a way of fiercely constraining a full covariance Gaussian but nevertheless leaving it free to model 
important correlations. \n\nFigure 3: Reconstruction of a 2 (left column) using 2 (upper) and 0 (lower) models. \n\nTable 2: Classification errors on the official test set when no rejections are allowed. Memory requirements are indicated in terms of the number of 16 x 16 images that need to be stored. (Columns: Linear Models, Linear Models + Tangent Distance, Tangent Distance.) \n\nWe have investigated a number of extensions to the basic scheme. As only one of these yielded improved results on the validation set, we will only briefly review them. If instead of assuming an isotropic Gaussian within the pancake we use an ellipsoidal subspace, then we can take into account the different variances along each of the h principal directions. This is akin to incorporating a code cost [14]. Similarly, the squared reconstruction error used in the basic scheme also assumes an isotropic distribution. Again a diagonal covariance matrix may be substituted. We surmise that this was not successful because we had insufficient training data to estimate the variances along some dimensions reliably; for example some of the edge pixels were never turned on in the training data for some models. \nThe Euclidean metric for the reconstruction error is convenient as it authorises the use of powerful methods such as principal components; however, as pointed out in the introduction, it is deficient for situations like character recognition. 
We tried a scheme in which the models were trained as described above, but tangent distance [11] was used during testing. This method yielded marginally improved results on both the validation and test sets (Table 2). More adventurous tangent options, including using them in the clustering phase, were explored by [5, 10]. \nPCA models are not ideal as generative models of the data because they say nothing about how to generate values of components in the directions orthogonal to the pancake. To ameliorate this, the generative model may be formulated as f(p) = Ap + ε, where the components of ε are independent and the factors p have some prior covariance matrix. The generative weights of this autoencoder can be obtained using the technique of maximum likelihood factor analysis (which is closely related to PCA), and the resulting architecture and hierarchical variants of it can be formulated as real valued versions of the Helmholtz machine [3, 6]. The cost of coding the factors relative to their prior is implicitly included in this formulation, as is the possibility that different input pixels are subject to different amounts of noise. Unlike PCA, factor analysis privileges the particular input coordinates (rather than being invariant to rotations of the input covariance matrix). \n\nReferences \n\n[1] Bourlard, H & Kamp, Y (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybernetics 59, 291-294. \n[2] Bregler, C & Omohundro, SM (1995). Non-linear image interpolation using surface learning. This volume. \n[3] Dayan, P, Hinton, GE, Neal, RM & Zemel, RS (1995). The Helmholtz machine. Neural Computation, in press. \n[4] Dempster, AP, Laird, NM & Rubin, DB (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1-38. 
\n[5] Hastie, T, Simard, P & Sackinger, E (1995). Learning prototype models for tangent distance. This volume. \n[6] Hinton, GE, Dayan, P, Frey, BJ & Neal, RM (1995). The wake-sleep algorithm for unsupervised neural networks. Submitted for publication. \n[7] Hinton, GE & Zemel, RS (1994). Autoencoders, minimum description length and Helmholtz free energy. In JD Cowan, G Tesauro & J Alspector, editors, Advances in Neural Information Processing Systems 6. San Mateo, CA: Morgan Kaufmann. \n[8] Hull, JJ (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 550-554. \n[9] Kambhatla, N & Leen, TK (1994). Fast non-linear dimension reduction. In JD Cowan, G Tesauro & J Alspector, editors, Advances in Neural Information Processing Systems 6. San Mateo, CA: Morgan Kaufmann. \n[10] Schwenk, H & Milgram, M (1995). Transformation invariant autoassociation with application to handwritten character recognition. This volume. \n[11] Simard, P, Le Cun, Y & Denker, J (1993). Efficient pattern recognition using a new transformation distance. In SJ Hanson, JD Cowan & CL Giles, editors, Advances in Neural Information Processing Systems 5, 50-58. San Mateo, CA: Morgan Kaufmann. \n[12] Simard, P, Victorri, B, Le Cun, Y & Denker, J (1992). Tangent prop - a formalism for specifying selected invariances in an adaptive network. In JE Moody, SJ Hanson & RP Lippmann, editors, Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann. \n[13] Williams, CKI, Zemel, RS & Mozer, MC (1993). Unsupervised learning of object models. In AAAI Fall 1993 Symposium on Machine Learning in Computer Vision, 20-24. \n[14] Zemel, RS (1993). A Minimum Description Length Framework for Unsupervised Learning. PhD Dissertation, Computer Science, University of Toronto, Canada. 
\n\nThis research was funded by the Ontario Information Technology Research Centre and NSERC. We thank Patrice Simard, Chris Williams, Rob Tibshirani and Yann Le Cun for helpful discussions. Geoffrey Hinton is the Noranda Fellow of the Canadian Institute for Advanced Research. \n", "award": [], "sourceid": 962, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Michael", "family_name": "Revow", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}