## Abstract

We investigate a latent variable model for multinomial classification inspired by recent capsule architectures for visual object recognition (Sabour, Frosst, & Hinton, 2017). Capsule architectures use vectors of hidden unit activities to encode the pose of visual objects in an image, and they use the lengths of these vectors to encode the probabilities that objects are present. Probabilities from different capsules can also be propagated through deep multilayer networks to model the part-whole relationships of more complex objects. Notwithstanding the promise of these networks, there still remains much to understand about capsules as primitive computing elements in their own right. In this letter, we study the problem of capsule regression—a higher-dimensional analog of logistic, probit, and softmax regression in which class probabilities are derived from vectors of competing magnitude. To start, we propose a simple capsule architecture for multinomial classification: the architecture has one capsule per class, and each capsule uses a weight matrix to compute the vector of hidden unit activities for patterns it seeks to recognize. Next, we show how to model these hidden unit activities as latent variables, and we use a squashing nonlinearity to convert their magnitudes as vectors into normalized probabilities for multinomial classification. When different capsules compete to recognize the same pattern, the squashing nonlinearity induces nongaussian terms in the posterior distribution over their latent variables. Nevertheless, we show that exact inference remains tractable and use an expectation-maximization procedure to derive least-squares updates for each capsule's weight matrix. We also present experimental results to demonstrate how these ideas work in practice.

## 1 Introduction

Recently Sabour, Frosst, and Hinton (2017) introduced a novel capsule-based architecture for visual object recognition. A capsule is a group of hidden units that responds maximally to the presence of a particular object (or object part) in an image. But most important, the capsule responds to this presence in a specific way: its hidden units encode a pose vector for the object—a vector that varies (for instance) with the object's position and orientation—while the length of this vector encodes the probability that the object is present. Capsules were conceived by Hinton, Krizhevsky, and Wang (2011) to address a shortcoming of convolutional neural nets, whose hidden representations of objects in deeper layers are designed to be invariant to changes in pose. With such representations, it is difficult to model the spatial relationships between different objects in the same image. By contrast, the pose vectors in capsules learn equivariant representations of visual objects: these vectors do change with the pose, but in such a way that the object is still recognized with high probability. These ideas have led to a surge of interest in multilayer capsule networks for increasingly difficult problems in computer vision (Duarte, Rawat, & Shah, 2018; Kosiorek, Sabour, Teh, & Hinton, 2019; Qin et al., 2020).

Much of this work has focused on the message passing between capsules in different layers, as is needed to model the part-whole relationships of complex objects or entire scenes (Wang & Liu, 2018; Bahadori, 2018; Hinton, Sabour, & Frosst, 2018; Jeong, Lee, & Kim, 2019; Hahn, Pyeon, & Kim, 2019; Ahmed & Torresani, 2019; Tsai, Srivastava, Goh, & Salakhutdinov, 2020; Venkataraman, Balasubramanian, & Sarma, 2020). But capsules also introduced a novel paradigm for subspace learning that deserves to be explored—and elucidated—in its own right (Zhang, Edraki, & Qi, 2018). Consider, for instance, an individual capsule: its vector of hidden unit activities already represents a powerful generalization of the scalar activity computed by a simple neural element. In this letter, we shall discover an additional source of richness by modeling these hidden unit activities as latent variables in a probabilistic graphical model. The models we study are not exactly a special case of previous work, but they are directly motivated by it. As Sabour et al. (2017) wrote, “There are many possible ways to implement the general idea of capsules… . We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input.” This general idea is also the starting point for our work.

In this letter, we study the problem of capsule regression, a problem in which multiple capsules must learn in tandem how to map inputs to pose vectors with competing magnitudes. We have written this letter with two readers in mind. The first is a practitioner of deep learning. She will view our models as a kind of evolutionary precursor to existing capsule networks; certainly, she will recognize at once how pose vectors are computed from inputs and converted via squashing nonlinearities into normalized probabilities. The second reader we have in mind is the working data scientist. Where the letter succeeds, she will view our models as a natural generalization of logistic and softmax regression—essentially, a higher-dimensional analog of these workhorses in which vectors compete in magnitude to determine how inputs should be classified. We hope that both types of readers see the novel possibilities for learning that these models afford.

The organization of this letter is as follows. In section 2, we formulate our latent variable model for capsule regression. Despite the model's squashing nonlinearity, we show that exact inference remains tractable and use an expectation-maximization (EM) procedure (Dempster, Laird, & Rubin, 1977) to derive least-squares updates for each capsule's weight matrix. In section 3, we present experimental results on images of handwritten digits and fashion items. Our results highlight how capsules use their internal distributed representations to learn more accurate classifiers. In section 4, we discuss issues that are deserving of further study, such as regularization, scaling, and model building. Finally, in the appendixes, we fill in the technical details and design choices that were omitted from the main development.

## 2 Model

Our model can be visualized as the belief network with latent variables shown in Figure 1, and our mathematical development is based closely on this representation. Section 2.1 gives an overview of the model and its EM algorithm for parameter estimation; here, we cover what is necessary to understand the model at a high level, though not all of what is required to implement it in practice. The later sections fill in these gaps. Section 2.2 focuses on the problem of inference; we show how to calculate likelihoods and statistics of the posterior distribution of our model. Section 2.3 focuses on the problem of learning; we show how the EM algorithm uses the model's posterior statistics to iteratively reestimate its parameters. Finally, section 2.4 describes a simple heuristic for initializing the model parameters, based on singular value decomposition, that seems to work well in practice.

### 2.1 Overview

We use the model in Figure 1 for multinomial classification—that is, to parameterize the conditional probability $P(y|x)$ where $x \in \mathbb{R}^p$ is a vector-valued input and $y \in \{1, 2, \ldots, m\}$ is a class label. The model predicts the class label $y$ based on the magnitudes of the $m$ vector-valued latent variables $\{h_1, h_2, \ldots, h_m\}$; in particular, for each input $x$, the most likely label $y$ is determined by whichever vector $h_i$ has the largest magnitude. The model's prediction depends essentially on three constructions: how the latent variables $h_i$ depend on the input $x$; how the class label $y$ depends on the magnitudes $\|h_i\|$ of these latent variables; and how the prediction $P(y|x)$ depends on the distribution over these magnitudes. We now describe each of these in turn.

There is one additional model parameter in equation 2.2, namely, the variance $\sigma^2$, which determines the sharpness of the gaussian distribution and which (unlike the weight matrix) we assume is common to all the distributions $P(h_1|x), \ldots, P(h_m|x)$ that appear in equation 2.1. We note that the model has an especially simple behavior in the limits $\sigma^2 \to 0$ and $\sigma^2 \to \infty$: in the former, the latent variables are deterministically specified by $h_i = W_i x$, while in the latter, they are completely delocalized. Though these limits are trivial, we shall see in section 2.2 that many nontrivial aspects of the model can be exactly calculated by a simple interpolation between these two regimes.
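Equation 2.2 itself is not reproduced in this excerpt, but the description above (mean $W_i x$, shared variance $\sigma^2$, latent dimensionality $d$) suggests a spherical gaussian of the form

```latex
P(h_i \mid x) \;=\; \left(2\pi\sigma^2\right)^{-d/2}
\exp\!\left(-\,\frac{\lVert h_i - W_i x\rVert^2}{2\sigma^2}\right),
```

which recovers both limits: as $\sigma^2 \to 0$ the distribution collapses onto $h_i = W_i x$, and as $\sigma^2 \to \infty$ it becomes completely delocalized.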

It is obvious that the model in Figure 1 is far more primitive than the multilayer capsule networks that have been explored for difficult problems in visual object recognition (Sabour et al., 2017; Hinton et al., 2018; Ahmed & Torresani, 2019; Hahn et al., 2019; Jeong et al., 2019; Venkataraman et al., 2020; Tsai et al., 2020). Nevertheless, this model does provide what is arguably the simplest expression of the raison d'être for capsules—namely, the idea that the length of a vector of hidden unit activities can encode the probability that some pattern is present in the input. The model can also be viewed as a higher-dimensional analog of logistic/probit regression (for $m=2$ classes) or softmax regression (for $m \geq 3$ classes) in which each class is modeled by a vector of hidden unit activities as opposed to a single scalar dot product.

In developing the model further, it becomes needlessly cumbersome to list the $m$ latent variables $h_1, h_2, \ldots, h_m$ wherever they are collectively employed. In what follows, therefore, we denote the collection of these $m$ latent variables by $h = (h_1, h_2, \ldots, h_m)$. In this way, we may simply write the factorization in equation 2.1 as $P(h|x) = \prod_i P(h_i|x)$ and the squashing nonlinearity in equation 2.3 as $P(y=j|h) = \|h_j\|^2/\|h\|^2$. As a similar shorthand, we also denote the collection of $m$ weight matrices in the model by $W = (W_1, W_2, \ldots, W_m)$.
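As a concrete illustration of these shorthands, the sketch below (with hypothetical names, not from the letter) computes pose vectors and class probabilities for a single input. Note that it evaluates the squashing nonlinearity at the point $h_i = W_i x$—that is, in the $\sigma^2 \to 0$ limit—rather than averaging over the full posterior distribution:

```python
import numpy as np

def capsule_predict(W, x):
    """Class probabilities from competing pose-vector magnitudes.

    Evaluates the squashing nonlinearity P(y=j|h) = ||h_j||^2 / ||h||^2
    at h_i = W_i x (the sigma^2 -> 0 limit), not the posterior average.
    W : list of m weight matrices, each of shape (d, p); x : shape (p,).
    """
    h = [Wi @ x for Wi in W]                 # pose vector of each capsule
    sq = np.array([hi @ hi for hi in h])     # squared magnitudes ||h_i||^2
    return sq / sq.sum()                     # normalized class probabilities

# toy usage: m = 3 capsules of dimension d = 2 on an input with p = 4
rng = np.random.default_rng(0)
W = [rng.standard_normal((2, 4)) for _ in range(3)]
x = rng.standard_normal(4)
probs = capsule_predict(W, x)
assert np.isclose(probs.sum(), 1.0)
```

The predicted label is then simply the index of the capsule with the largest pose-vector magnitude.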

Having demonstrated how the model makes predictions, we turn briefly to the problem of learning—that is, of estimating parameters $W$ and $\sigma$ that yield a useful model. We note that in the course of learning, the predictions of our model typically pass from a regime of high uncertainty (i.e., larger $\sigma^2$) to low uncertainty (i.e., smaller $\sigma^2$), and therefore it is exactly the intermediate regime between the two limits in equation 2.5 where we expect the bulk of learning to occur.

In sum, our model is defined by the multivariate gaussian distributions in equation 2.1 and the squashing nonlinearity in equation 2.3, and its essential parameters (namely, the weight matrices $W_i$) are reestimated by computing the posterior statistics in equation 2.8 and solving the least-squares problems in equation 2.9. The next sections provide the results that are needed to implement these steps in practice.

### 2.2 Inference

*interpolating* functions $\lambda_0(\beta)$ and $\lambda_1(\beta)$ given by

In sum, we compute the conditional probability $P(y=j|x)$ from equation 2.16, and we compute the posterior means $E[h_i|x, y=j]$ from equation 2.19. We need both of these results for the EM algorithm—the first to verify that the likelihood of the data in equation 2.6 is increasing and the second to facilitate the update in equation 2.9. In the next section, we examine this update in more detail.

### 2.3 Learning

$^{1}$ to reestimate it.

We observed earlier that the update for the weight matrix $W_i$ takes the form of the least-squares problem in equation 2.9. As Dempster et al. (1977) noted, this result is typical of models with gaussian latent variables, such as factor analysis (Rubin & Thayer, 1982; Ghahramani & Hinton, 1996) and probit regression (Liu, Rubin, & Wu, 1998). In these models, the EM procedure also yields iterative least-squares updates for maximum likelihood (ML) estimation, and the update in equation 2.9 is derived in exactly the same way.

*conjugate* input by

*multiplicative* update on each capsule's weight matrix.

The update in equation 2.24 is guaranteed to increase the log-conditional likelihood in equation 2.6 except at stationary points. In practice, however, we have found it useful to modify this update in two ways. These modifications do not strictly preserve the EM algorithm's guarantee of monotonic convergence in the likelihood, but they yield other practical benefits without seeming to compromise the algorithm's stability. We discuss these next.

We emphasize that the correctly classified examples with $r_\ell \leq \nu$ are not dropped from the update altogether; they still appear in the first sum on the right-hand side of equation 2.22. The goal is not to ignore these examples, only to reduce their influence on the model after they are correctly classified with high certainty. The thresholded update can be viewed as a heuristic for large-margin classification (Boser, Guyon, & Vapnik, 1992), focusing on incorrect examples and/or correct examples near the decision boundary. Though motivated differently, it also resembles certain incremental variants of the EM algorithm that have been explored for latent variable modeling (Neal & Hinton, 1998).

### 2.4 Initialization

EM algorithms do not converge in general to a global maximum of the likelihood, and as a result, the models they discover can depend on how the model parameters are initialized. This is true for the update in equation 2.24 as well as the modified updates in equations 2.26 and 2.27.

After fixing $\sigma^2 = 1$, we compared two approaches for initializing the model's weight matrices $W_1, W_2, \ldots, W_m$. In the first approach, we randomly sampled each matrix element from a gaussian distribution with zero mean and small variance. In the second approach, we initialized the matrices based on a singular value decomposition of the training examples in each class. We refer to the first approach as *random* initialization and the second as *subspace* initialization.
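Equation 2.28 is not reproduced in this excerpt, but an initialization in this spirit might take, for each class, the top-$d$ principal directions of that class's training examples. The sketch below is one plausible reading, not the letter's exact formula:

```python
import numpy as np

def subspace_init(X, y, m, d):
    """One (d, p) weight matrix per class from a per-class SVD.

    X : (n, p) training inputs; y : (n,) integer labels in {0, ..., m-1}.
    The rows of W_i span the d directions of largest variance in class i,
    so each capsule starts out modeling its own class's principal subspace.
    """
    W = []
    for i in range(m):
        Xi = X[y == i]
        # right singular vectors = principal directions of class i
        _, _, Vt = np.linalg.svd(Xi, full_matrices=False)
        W.append(Vt[:d])
    return W
```

By construction, each initial $W_i$ has orthonormal rows, so the initial pose vectors are coordinates of the input in each class's dominant subspace.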

## 3 Experiments

We implemented the model described in section 2 and evaluated its performance in different architectures for multinomial classification. In some especially simple settings, we also sought to understand the latent representations learned by capsule regression. The organization of this section is as follows. In section 3.1, we describe the data sets that we used for benchmarking and the common setup for all of our experiments. In section 3.2, we present a visualization of results from the model in Figure 1 with two-dimensional capsules ($d=2$). This visualization reveals how the latent spaces of different capsules are organized in tandem by the updates of section 2.3 to learn an accurate classifier. In section 3.3, we examine the internal representations learned by the model in Figure 1 with eight-dimensional capsules ($d=8$). Here, we explore how distinct patterns of variability are encoded by different elements of the model's latent variables. Finally, in section 3.4, we present our main results—a systematic comparison of classifiers obtained from capsules of varying dimensionality as well as different capsule-based architectures (e.g., multiclass, one versus all, all versus all) for multinomial classification.

### 3.1 Setup

We experimented on two data sets of images—one of handwritten digits (LeCun, Bottou, Bengio, & Haffner, 1998) and the other of fashion items (Xiao, Rasul, & Vollgraf, 2017). Both data sets contain 60,000 training examples and 10,000 test examples of $28 \times 28$ grayscale images drawn from $m=10$ classes; they have also been extensively benchmarked. To speed up our experiments, we began by reducing the dimensionality of these images by a factor of four. Specifically, for each data set, we used a singular value decomposition of the training examples to identify the linear projections with largest variance, and then we used these projections to map each image into an input $x \in \mathbb{R}^p$ for capsule regression, where $p=196$. This was done for all the experiments in this letter.
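The dimensionality reduction described above can be sketched as a truncated SVD. The letter says only that an SVD of the training examples identified the projections with largest variance; mean-centering (i.e., standard PCA) is our assumption here:

```python
import numpy as np

def fit_projection(X_train, p=196):
    """Learn a projection onto the p directions of largest variance.

    X_train : (n, D) matrix of flattened images (D = 784 for 28x28 pixels).
    Returns the training mean and the top-p right singular vectors.
    """
    mu = X_train.mean(axis=0)                      # centering is an assumption
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:p]                              # shape (p, D)

def project(X, mu, V):
    """Map each image to an input x in R^p for capsule regression."""
    return (X - mu) @ V.T
```

The same projection, fit on the training set, is applied to the validation and test images.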

We followed a single common protocol for training, validation, and testing (with one exception, mentioned at the end of the section, where we trained on a much larger data set of 1 million digit images). We trained our models on the first 50,000 examples in each training set while holding out the last 10,000 examples as a validation set. We monitored the classification error rate on the validation set in an attempt to prevent overfitting. We used the subspace initialization in equation 2.28 to seed the weight matrices before training, and we used a fixed momentum hyperparameter of $\gamma = 0.9$ in the update of equation 2.27. We did not use a fixed value for the thresholding hyperparameter $\nu$. Instead, we divided the learning into five rounds, with $\nu$ taking on a fixed value of 0.8 in the first round, 0.6 in the second, 0.4 in the third, 0.2 in the fourth, and 0 in the fifth. We terminated the first round when the error rate on the validation set had not improved for 128 consecutive iterations, and we terminated the next rounds when it had not improved for 64, 32, 16, and 8 consecutive iterations, respectively. Thus, each round had a fixed minimum number of iterations, though not a fixed maximum. Finally, we initialized each subsequent round by the best model obtained in the previous one (as measured by the error rate on the validation set). We present some empirical motivation for these choices of hyperparameters in appendix B.
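The five-round schedule above can be sketched as a short training loop. Here `em_step` and `validation_error` are hypothetical stand-ins for the update of equation 2.27 and the held-out error rate; only the schedule and stopping logic come from the protocol described above:

```python
def train(model, em_step, validation_error):
    """Skeleton of the five-round training protocol.

    Each round fixes the threshold nu and runs until the validation error
    has not improved for `patience` consecutive iterations; the next round
    resumes from the best model seen so far.
    """
    schedule = [(0.8, 128), (0.6, 64), (0.4, 32), (0.2, 16), (0.0, 8)]
    for nu, patience in schedule:
        best_err, best_model, stale = float("inf"), model, 0
        while stale < patience:                   # early stopping per round
            model = em_step(model, nu=nu, gamma=0.9)
            err = validation_error(model)
            if err < best_err:
                best_err, best_model, stale = err, model, 0
            else:
                stale += 1
        model = best_model        # seed the next round with the best model
    return model
```

Note that this loop runs at least `patience` iterations per round but has no fixed maximum, matching the protocol's behavior.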

### 3.2 Visualization of Results with $d=2$ Capsules

Figure 3 shows the corresponding result for the same model trained on images of fashion items. The test error rate (15.14%) is higher for this data set, but the same pattern is evident. From the plots in these panels, we can also see which classes of images are most confusable. For example, the black latent variables (representing images of T-shirts) have large radii not only in the upper left-most panel but also in the bottom panel, second from the left. These two panels—corresponding to the capsules for T-shirts and shirts—show that these two classes of images are among the likeliest to be confused.

Naturally it is more difficult to visualize the results from models of capsule regression with higher-dimensional ($d>2$) latent variables. But conceptually it is clear what happens: the circles of unit radius in the panels of Figures 2 and 3 are replaced by hyperspheres of unit radius in $d$ dimensions. With latent variables of higher dimensionality, we might also expect the capsules to discover richer internal representations of the variability within each class of images. This is what we explore in the next section.

### 3.3 Encoding of Variability by $d=8$ Capsules

Figure 4 shows these prototypical examples for the model of Figure 1 with capsules of dimensionality $d=8$. For the images of digits, the prototypes exhibit variations in orientation, thickness, and style (e.g., the presence of a loop in the digit 2 or an extra horizontal bar in the digit 7). For the images of fashion items, the prototypes exhibit variations in brightness, girth, and basic design. The examples in the figure are suggestive of the prototypes discovered by vector quantization (Lloyd, 1957), but in this case, they have emerged from internal representations of discriminatively trained capsules. It is clear that higher-dimensional capsules can represent a greater diversity of prototypes, and by doing so, they might be expected to yield more accurate classifiers. This is what we explore in the next section.

### 3.4 Results for Classification

Our main experiments investigated the effect of the capsule dimensionality ($d$) on the model's performance as a classifier. We experimented with three types of architectures: (1) the multiclass-capsule architecture shown in Figure 1, in which we jointly train $m$ capsules to recognize patterns from $m$ different classes; (2) a one-versus-all architecture, in which we train $m$ binary-capsule architectures in parallel and label inputs based on the most certain of their individual predictions; and (3) an all-versus-all architecture, in which we train $m(m-1)/2$ binary-capsule architectures in parallel and label inputs based on a majority vote. We note that these architectures employ different numbers of capsules, and therefore they have different numbers of weight matrices and learnable parameters even when all their capsules have the same dimensionality. In particular, the one-versus-all architecture has twice as many learnable parameters as the multiclass architecture, while the all-versus-all architecture has $m-1$ times as many learnable parameters.
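The all-versus-all decision rule can be made concrete with a short sketch. Here `binary_predict` is a hypothetical stand-in for one trained binary-capsule architecture; only the pairwise voting scheme comes from the text:

```python
import numpy as np
from itertools import combinations

def all_versus_all_predict(binary_predict, x, m):
    """Majority vote over the m(m-1)/2 pairwise classifiers.

    binary_predict(i, j, x) returns the winning class, i or j, for the
    classifier trained on classes i and j. Ties are broken toward the
    smaller class index by argmax.
    """
    votes = np.zeros(m, dtype=int)
    for i, j in combinations(range(m), 2):
        votes[binary_predict(i, j, x)] += 1
    return int(np.argmax(votes))
```

The parameter counts in the text follow from the capsule counts: the multiclass architecture has $m$ capsules, one-versus-all has $2m$ (twice as many), and all-versus-all has $m(m-1)$ ($m-1$ times as many).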

The data sets of MNIST handwritten digits and fashion items have been extensively benchmarked, so we can also compare these results to those of other classifiers (LeCun et al., 1998; Xiao et al., 2017). The one-dimensional ($d=1$) capsule architectures in Figure 5 have test error rates comparable to those of other simple linear models; for example, Xiao et al. (2017) report test error rates of 8.3% and 15.8% for one-versus-all logistic regression on the data sets of digits and fashion items, respectively. Likewise, the models with higher-dimensional capsules offer improvements similar to those of other nonlinear approaches, such as $k$-nearest neighbors, random forests (Breiman, 2001), and fully connected neural nets; standardized implementations of these algorithms in Python's scikit-learn yield test error rates of roughly 2% to 3% on digits and 10% to 12% on fashion items (Xiao et al., 2017). However, none of the models in Figure 5 classifies as well as the best-performing support vector machines (Cortes & Vapnik, 1995) or convolutional neural nets (LeCun et al., 1998). We believe that this gap in performance is mainly due to overfitting as opposed to an inability to learn sufficiently complex decision boundaries.

To test this hypothesis, we conducted another set of experiments where we trained the all-versus-all digit classifiers in Figure 5 on a much larger data set of 1 million training images (Loosli, Cani, & Bottou, 2007). The 950,000 additional images for training were obtained from distortions (e.g., rotation, thickening) of the original MNIST training set. To facilitate a direct comparison with our previous results, we also used the same validation and test sets of 10,000 images. The results of these experiments are shown by the bottom (green) curve in the left panel of Figure 5. They show that these all-versus-all capsule architectures have the capacity to learn better classifiers from larger amounts of training data. But even these classifiers are plagued by overfitting: for example, the best of them (with capsule dimensionality $d=16$) still exhibits a large gap between its test error rate (1.38%) on 10,000 images and its training error rate (0.0523%) on 1 million images.

## 4 Discussion

In this letter, we have introduced capsule regression as a higher-dimensional analog of simpler log-linear models such as logistic and softmax regression. We experimented with capsule regression in multiclass, one-versus-all, and all-versus-all architectures, and we showed that in all of these architectures, the model capacity grows in step with the capsule dimensionality. To learn these classifiers, we formulated capsule regression as a latent variable model, and we used the EM procedure to derive iterative least-squares updates for parameter estimation. Despite the squashing nonlinearity in our models, we showed that it remains tractable to perform exact inference over their continuous latent variables. One contribution of our work is to expand the family of tractable latent variable models that can learn meaningful distributed representations of high-dimensional inputs. Our work fits into a larger vision for probabilistic modeling: the “need to develop computationally-tractable representations of uncertainty” has been described as “one of the major open problems in classical AI” (Jordan, 2018).

Another contribution of our work is to highlight certain advantages of capsule regression for supervised learning with distributed representations. Traditional neural nets can learn more flexible decision boundaries than linear models, but this extra capacity comes at a cost: they involve much more complicated optimizations, with learning rates that must be tuned for convergence, and their internal representations (though effective for classification) can be fairly inscrutable. Models for capsule regression benefit equally from their distributed representations; as shown in Figure 5, with higher-dimensional capsules, these models acquire the capacity to learn increasingly accurate classifiers. But as shown in Figures 2 to 4, the internal representations of capsules also have a fairly interpretable structure, and as shown in section 2.3, these representations can be learned by simple least-squares updates. There are other plausible benefits to this structure that we have not yet explored. For example, we have only considered architectures in which all the capsules have the same dimensionality. But these dimensionalities could be varied for classes that exhibit more diversity in their inputs and/or have larger numbers of training examples. It is harder to see how a traditional neural net could be purposefully adapted to reflect these forms of prior knowledge.

Several issues in our work deserve further study. The first is regularization: as previously mentioned, the models in section 2 are prone to overfitting even with early stopping on a validation set. It seems worthwhile to explore $\ell_1$ and/or $\ell_2$ regularization of the model parameters, as is common for other types of regression, and to consider those forms of regularization—such as weight sharing (Nowlan & Hinton, 1992), dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014), and reconstruction penalties (Sabour et al., 2017; Qin et al., 2020)—that have been widely used in deep learning. It should also help to incorporate some prior knowledge about images explicitly into the structure of the weight matrices, as is done in convolutional neural nets (LeCun et al., 1998).

Another issue is scaling: for very large data sets, it will be more practical to implement online or mini-batch versions of the updates in section 2.3. The least-squares form of equation 2.9 suggests certain possibilities beyond stochastic gradient descent for this purpose. There are, for example, passive-aggressive online algorithms for regression (Crammer, Dekel, Keshet, Shalev-Shwartz, & Singer, 2006) that could be adapted to this setting, with the posterior mean $E[h_i|x, y]$ providing a target for the $i$th capsule's regression on the input $x$ with label $y$. These posterior means also provide the sufficient statistics for faster incremental variants of the EM algorithm (Neal & Hinton, 1998).

A final issue is model building: Is it possible to extend the ideas for capsule regression in this letter to deeper and more sophisticated architectures? We have already seen that the one-versus-all and all-versus-all architectures in section 3.4 lead to more accurate classifiers than the basic model of capsule regression in Figure 1. But even these architectures are still too primitive for modeling (say) the part-whole relationships of complex visual objects; those relationships may need to be modeled more explicitly, as in the multilayer capsule networks (Sabour et al., 2017) whose squashing nonlinearities were the motivation for our own study. For deeper architectures with such nonlinearities, we believe that the methods in this letter may serve as useful building blocks. We started this letter by noting the promise of existing capsule networks, and it seems fitting, then, that we have come full circle. We conclude on the hopeful note that this work provides yet another bridge between the traditions of latent variable modeling and deep learning.

## Appendix A: Supporting Calculations

In this appendix we present the more detailed calculations for inference that were omitted from section 2.2. In particular, in section A.1, we show how to calculate the interpolating coefficients $\lambda_0(\beta)$ and $\lambda_1(\beta)$, and in section A.2, we show how to calculate the multidimensional integrals over the model's latent variables required for inference and learning.

### A.1 Computing the Interpolating Coefficients

It is mostly straightforward to implement the EM algorithm in section 2, but some extra steps are needed to compute the interpolating coefficients, $\lambda_0(\beta)$ and $\lambda_1(\beta)$, defined in equations 2.13 and 2.14. In this section we show how to evaluate the one-dimensional integrals that appear in these definitions. We also show that $\lambda_0(\beta)$ and $\lambda_1(\beta)$ are monotonically increasing functions with values in the unit interval $[0,1]$.

To summarize, then, we use the forward recursion in equation A.3 to compute $I_s(\beta)$ when $\beta > s$, and we use the backward recursion in equation A.5 to compute $I_s(\beta)$ when $\beta \leq s$. In practice, we only invoke these recursions to compute $\lambda_0(\beta)$, setting $s = \frac{dm}{2}$, because the forward recursion can then be used to compute $\lambda_1(\beta)$ with $s = \frac{dm}{2}+1$. Note that the value of $s = \frac{dm}{2}$ is fixed in advance by the capsule architecture. Thus, before the outset of learning, it may also be possible to compile lookup tables for $I_s(\beta)$ and use interpolation schemes for even faster inference. However, we have not pursued that approach here.

### A.2 Integrating over the Model's Latent Variables

In this section we show how to calculate the conditional probability $P(y=j|x)$ in equation 2.4 and the posterior mean $E[h_i|x, y=j]$ in equation 2.8. Both calculations involve multidimensional integrals over all of the model's latent variables, which we denote collectively by $h \in \mathbb{R}^D$.

## Appendix B: Supporting Experiments

We obtained the results in section 3 with the modified update in equation 2.27 and the subspace initialization in equation 2.28. Most of these results were devoted to comparing models that were identically trained but had different values of the capsule dimensionality, $d$. To make meaningful comparisons, though, it was first necessary to standardize how we initialized the models and which updates we used to train them. In this appendix, we describe some of the preliminary experiments that informed these choices. In particular, section B.1 explores the effect of different updates (with thresholding and/or momentum), and section B.2 explores the effect of random versus subspace initializations.

### B.1 Effects of Momentum and Thresholding

First we consider the effect of the hyperparameters $\nu$ and $\gamma$ on the course of learning. We used these hyperparameters to modify the EM update in equation 2.24, and these modifications led to the variants in equations 2.26 and 2.27. In this section, we consider how these modifications affect the log-conditional likelihood $L(W, \sigma^2)$ in equation 2.6 and the number of misclassified examples. To understand these effects, we experimented on the model in Figure 1 with capsule dimensionality $d=4$. We trained this model on both data sets in four different ways: using the EM update ($\gamma = \nu = 0$) in equation 2.24, using the thresholded update ($\gamma = 0$, $\nu = 0.8$) in equation 2.26, and using the momentum update ($\gamma = 0.9$) in equation 2.27 both with thresholding ($\nu = 0.8$) and without thresholding ($\nu = 0$). For each experiment, we initialized the model's weight matrices by equation 2.28, and we reestimated them for 1000 iterations on all 60,000 training examples.

*below* the solid curves, with a significant gap emerging after fewer than 10 iterations and growing thereafter. In particular, on both data sets, we see that the error rates converge to a significantly lower value with the thresholded update. In practice, the thresholded update appears to trade the worse likelihoods in Figure 6 for the lower error rates in Figure 7. Again, though not reproduced here, we observed these trends consistently across models of many different cardinalities ($m$) and capsule dimensionalities ($d$).

For the models in section 3, we were ultimately more interested in minimizing their error rates as classifiers than maximizing their log-conditional likelihoods. To be sure, the likelihood provides a useful surrogate for the error rate (as the former is differentiable in the model parameters whereas the latter is not). But for our main experiments—based on the above results—we adopted the modified update in equation 2.27 with the tunable hyperparameters $\nu $ and $\gamma $. Moreover, as shown in section 3, these hyperparameters did not require elaborate tuning to be effective in practice.

### B.2 Effects of Initialization

We also compared the effects of different initializations. For these comparisons, we experimented with the same $d=4$ model as in the previous section, but in addition, we trained models whose weight matrices were initialized by different levels of zero-mean gaussian noise. In particular, for each noise level, we generated 20 different initializations and trained the resulting models for 1000 iterations. Then for each of these models, we recorded the lowest classification error rate that it achieved on the 60,000 training examples over the course of learning.

## Note

$^{1}$ This point is more subtle than we have indicated: it has been shown that the EM algorithm can be accelerated in gaussian latent variable models by representing the variance explicitly (Liu, Rubin, & Wu, 1998). However, we do not pursue that path here.

## Acknowledgments

I am grateful to the anonymous reviewer whose suggestions improved this letter in many ways.