Quaternion Equivariant Capsule Networks for 3D Point Clouds

Quaternion Equiv arian t Capsule Net w orks for 3D P oin t Clouds Y ongheng Zhao 1 , 3 , ∗ , T olga Birdal 2 , ∗ , Jan Eric Lenssen 4 , Eman uele Menegatti 1 , Leonidas Guibas 2 , and F ederico T om bari 3 , 5 1 Univ ersity of Pado v a 2 Stanford Universit y 3 TU Munich 4 TU Dortmund 5 Go ogle Abstract. W e presen t a 3D capsule module for pro cessing point clouds that is equiv arian t to 3D rotations and translations, as well as in v ariant to p ermutations of the input p oints. The operator receives a sparse set of local reference frames, computed from an input point cloud and estab- lishes end-to-end transformation equiv ariance through a nov el dynamic routing procedure on quaternions. F urther, we theoretically connect dy- namic routing betw een capsules to the well-kno wn W eiszfeld algorithm, a scheme for solving iterative r e-weighte d least squar es (IRLS) problems with prov able conv ergence prop erties. It is shown that suc h group dy- namic routing can b e interpreted as robust IRLS rotation a veraging on capsule v otes, where information is routed based on the ﬁnal inlier scores. Based on our operator, we build a capsule netw ork that disentangles ge- ometry from p ose, pa ving the w ay for more informativ e descriptors and a structured latent space. Our architecture allows joint ob ject classiﬁcation and orien tation estimation without explicit sup ervision of rotations. W e v alidate our algorithm empirically on common benchmark datasets. W e release our sources under: https://tolgabirdal.gith ub.io/qecnetw orks. ∗ Keyw ords: 3D, equiv ariance, disen tanglement, rotation, quaternion 1 In tro duction It is no w w ell understoo d that in order to learn a compact and informativ e represen tation of the input data, one needs to resp ect the symmetries in the problem domain [17,73]. Arguably , one of the primary reasons for the success of 2D conv olutional neural netw orks (CNN) is the tr anslation-invarianc e of the 2D con volution acting on the image grid [29,36]. Recent trends aim to transfer this success into the 3D domain in order to supp ort many applications such as shap e retriev al, shape manipulation, pose estimation, 3D ob ject mo deling and detec- tion, etc. There, the data is naturally represen ted as sets of 3D p oints [55,57]. Unfortunately , an extension of CNN arc hitectures to 3D p oint clouds is non- trivial due to t wo reasons: 1) p oint clouds are irregular and unorganized, 2) the group of transformations that we are in terested in is more complex as 3D data is often observed under arbitrary non-comm utative S O (3) rotations. As a result, learning appropriate em b eddings requires 3D p oint-net w orks to b e e quivariant to these transformations, while also b eing inv ariant to p oint p ermutations. ∗ First tw o authors contributed equally to this work. 2 Y. Zhao et al. C h air (b) pai rw i se sh ape al i gn me n t (a) h i e r arch i cal v o t i n g fo r rot at i o n s ( i ) ( ii) (iii ) Fig. 1. ( a ) Our netw ork operates on lo cal reference frames (LRF) of an input point cloud ( i ). A hierarc hy of quaternion equiv arian t capsule mo dules (QEC) then p o ols the LRFs to a set of laten t capsules ( ii , iii ) disentangling the activ ations from p oses. W e can use activ ations in classiﬁcation and the capsule (quaternion) with the highest activ ation in absolute (canonical) pose estimation without needing the sup ervision of rotations. ( b ) Our siamese v ariant can also solv e for the relativ e ob ject pose b y aligning the capsules of tw o shap es with diﬀeren t point samplings. Our netw ork directly consumes p oin t sets and LRFs. Meshes are included only to ease understanding. In order to ﬁll this gap, we present a quaternion equiv arian t p oint capsule net- w ork that is suitable for processing p oint clouds and is equiv ariant to S O (3) rota- tions, compactly parameterized b y quaternions, while also preserving translation and p ermutation in v ariance. Inspired b y the lo cal group equiv ariance [40,17], we eﬃcien tly cov er S O (3) by restricting ourselv es to a sparse set of lo cal reference frames (LRFs) that collectiv ely determine the ob ject orien tation. The prop osed quaternion e quivariant c apsule (QEC) mo dule deduces equiv arian t laten t repre- sen tations by robustly combining those LRFs using the prop osed Weiszfeld dy- namic r outing with inlier scores as activ ations, so as to route information from one lay er to the next. Hence, our latent features specify to lo cal orientations and activ ations, disen tangling orien tation from evidence of ob ject existence. Such ex- plicit and factored storage of 3D information is unique to our work and allows us to p erform rotation estimation join tly with ob ject classiﬁcation. Our ﬁnal arc hitecture is a hierarc hy of QEC mo dules, where LRFs are routed from low er lev el to higher level capsules as shown in Fig. 1. W e use classiﬁcation error as the only training cue and adapt a Siamese version for regression of the relative rotations. W e neither explicitly sup ervise the netw ork with p ose annotations nor train b y augmenting rotations. In summary , our con tributions are: 1. W e propose a no vel, fully S O (3)-equiv ariant capsule module that produces in v ariant latent representations while explicitly decoupling the orien tation in to capsules. Notably , equiv ariance results hav e not b een previously achiev ed for S O (3) capsule net works. 2. W e connect dynamic routing b etw een capsules [60] and generalized W eiszfeld iterations [4]. Based on this connection, we theoretically argue for the con- v ergence of the included rotation estimation on votes and extend our under- standing of dynamic routing approaches. 3. W e prop ose a capsule netw ork that is tailored for simultaneous classiﬁcation and orientation estimation of 3D p oint clouds. W e exp erimentally demon- strate the capabilities of our netw ork on classiﬁcation and orien tation esti- mation on ModelNet10 and Mo delNet40 3D shape data. QE-Net works 3 2 Related W ork De ep le arning on p oint sets. The capability to process raw, unordered p oint clouds within a neural netw ork is in tro duced by the prosp erous P oin tNet [55] thanks to the point-wise con volutions and the p ermutation in v ariant po oling functions. Man y works hav e extended Poin tNet primarily to increase the lo- cal receptive ﬁeld size [57,42,62,71]. P oint-clouds are generally thought of as sets. This makes an y p erm utation-inv arian t net w ork that can op erate on sets an amenable choice for pro cessing p oints [81,58]. Unfortunately , common neural net work op erators in this category are solely equiv ariant to p ermutations and translations but to no other groups. Equivarianc e in neur al networks. Early attempts to achiev e inv ariant data represen tations usually inv olv ed data augmentation tec hniques to accomplish tolerance to input transformations [49,56,55]. Motiv ated b y the diﬃcult y asso- ciated with augmen tation eﬀorts and ackno wledging the imp ortance of theoreti- cally equiv ariant or inv ariant representations, the recen t y ears hav e witnessed a leap in theory and practice of equiv arian t neural net works [6,37]. While la ying out the fundamentals of the group conv olution, G-CNNs [18] guaran teed equiv ariance with resp ect to ﬁnite symmetry groups. Similarly , Steer- able CNNs [21] and its extension to 3D vo xels [75] considered discrete symmetries only . Other works opted for designing ﬁlters as a linear combination of harmonic basis functions, leading to frequency domain ﬁlters [76,74]. Apart from suﬀering from the dense cov erage of the group using group con volution, ﬁlters living in the frequency space are less interpretable and less expressive than their spatial coun terparts, as the basis does not span the full space of spatial ﬁlters. Ac hieving equiv ariance in 3D is possible b y simply generalizing the ideas of the 2D domain to 3D b y vo xelizing 3D data. Ho w ever, methods using dense grids [16,21] suﬀer from increased storage costs, even tually rendering the imple- men tations infeasible. An extensive line of w ork generalizes the harmonic basis ﬁlters to S O (3) by using e . g . , a spherical harmonic basis instead of circular har- monics [19,25,22]. In addition to the same downsides as their 2D counterparts, these approac hes hav e in common that they require their input to b e pro jected to the unit sphere [33], which p oses additional problems for unstructured point clouds. A related line of researc h are metho ds whic h deﬁne a regular structure on the sphere to prop ose equiv arian t conv olution operators [44,13]. T o learn a rotation equiv ariant representation of a 3D shap e, one can either act on the input data or on the netw ork. In the former case, one either presents augmen ted data to the net w ork [55,49] or ensures rotation-in v ariance in the input [23,24,34]. In the latter case one can enforce equiv ariance in the bottleneck so as to achiev e an inv arian t latent representation of the input [50,66,63]. F urther, equiv arian t netw orks for discrete sets of views [27] and cross-domain views [26] ha ve b een prop osed. Here, we aim for a diﬀerent wa y of embedding equiv ariance in the net work by means of an explicit latent rotation parametrization in addition to the in v arian t feature. V e ctor ﬁeld networks [47] follow ed by the 3D T ensor Field Networks (TFN) [66] are closest to our work. Based up on a geometric algebra framework, the authors 4 Y. Zhao et al. did achiev e lo calized ﬁlters that are equiv ariant to rotations, translations and p erm utations. Moreov er, they are able to co ver the contin uous groups. How ever, TFN are designed for physics applications, are memory consuming and a t ypi- cal implementation is neither lik ely to handle the datasets w e consider nor can pro vide orientations in an explicit manner. Capsule networks. The idea of capsule net works was ﬁrst mentioned by Hin- ton et al . [30], before Sab our et al . [60] prop osed the dynamic r outing by agr e ement , which started the recen t line of work in vestigating the topic. Since then, routing by agreement has been connected to several well-kno wn concepts, e.g. the EM algorithm [59], clustering with KL divergence regularization [68] and equiv ariance [40]. They ha v e b een extended to auto enco ders [38] and GANs [32]. F urther, capsule netw orks hav e b een applied for sp eciﬁc kinds of input data, e.g. graphs [78], 3D p oint clouds [83,64] or medical images [1]. 3 Preliminaries and T ec hnical Bac kground W e no w provide the necessary bac kground required for the grasp of the equiv- ariance of point clouds under the action of quaternions. 3.1 Equiv ariance Deﬁnition 1 (Equiv arian t Map) F or a G -sp ac e acting on X , the map Φ : G × X 7→ X is said to b e e quivariant if its domain and c o-domain ar e acte d on by the same symmetry gr oup [18,20]: Φ ( g 1 ◦ x ) = g 2 ◦ Φ ( x ) (1) wher e g 1 ∈ G and g 2 ∈ G . Equivalently Φ ( T ( g 1 ) x ) = T ( g 2 ) Φ ( x ) , wher e T ( · ) is a line ar r epr esentation of the gr oup G . Note that T ( · ) do es not have to c ommute. It suﬃc es for T ( · ) to b e a homomorphism: T ( g 1 ◦ g 2 ) = T ( g 1 ) ◦ T ( g 2 ) . In this p ap er we use a stricter form of e quivarianc e and c onsider g 2 = g 1 . Deﬁnition 2 (Equiv arian t Netw ork) An ar chite ctur e or network is said to b e e quivariant if al l of its layers ar e e quivariant maps. Due to the tr ansitivity of the e quivarianc e, stacking up e quivariant layers wil l r esult in glob al ly e quiv- ariant networks e.g. , r otating the input wil l pr o duc e output ve ctors which ar e tr ansforme d by the same r otation [40,37]. 3.2 The Quaternion Group H 1 The choice of 4-vector quaternions as representation for S O (3) has multiple mo- tiv ations: (1) All 3-vector formulations suﬀer from inﬁnitely many singularities as angle go es to 0, whereas quaternions av oid those, (2) 3-v ectors also suﬀer from inﬁnitely man y redundancies (the norm can gro w indeﬁnitely). Quaternions ha ve a single redundancy: q = − q that is in practice easy to enforce [9], (3) Computing the actual ‘manifold mean’ on the Lie algebra requires iterativ e techniques with subsequen t up dates on the tangent space. Such iterations are computationally and n umerically harmful for a diﬀeren tiable GPU implemen tation. QE-Net works 5 Deﬁnition 3 (Quaternion) A quaternion q is an element of Hamilton alge- br a H 1 , extending the c omplex numb ers with thr e e imaginary units i , j , k in the form: q = q 1 1 + q 2 i + q 3 j + q 4 k = ( q 1 , q 2 , q 3 , q 4 ) T , with ( q 1 , q 2 , q 3 , q 4 ) T ∈ R 4 and i 2 = j 2 = k 2 = ijk = − 1 . q 1 ∈ R denotes the sc alar p art and v = ( q 2 , q 3 , q 4 ) T ∈ R 3 , the ve ctor p art. The conjugate ¯ q of the quaternion q is given by ¯ q := q 1 − q 2 i − q 3 j − q 4 k . A unit quaternion q ∈ H 1 with 1 ! = k q k := q · ¯ q and q − 1 = ¯ q , gives a c omp act and numeric al ly stable p ar ametrization to r epr esent orientation of obje cts on the unit spher e S 3 , avoiding gimb al lo ck and singu- larities [15]. Identifying antip o dal p oints q and − q with the same element, the unit quaternions form a double c overing gr oup of S O (3) . H 1 is close d under the non-c ommutative multiplic ation or the Hamilton pr o duct: ( p ∈ H 1 ) ◦ ( r ∈ H 1 ) = [ p 1 r 1 − v p · v r ; p 1 v r + r 1 v p + v p × v r ] . (2) Deﬁnition 4 (Linear Representation of H 1 ) We fol low [12] and use the p ar- al lelizable natur e of unit quaternions ( d ∈ { 1 , 2 , 4 , 8 } wher e d is the dimension of the ambient sp ac e) to deﬁne T : H 1 7→ R 4 × 4 as: T ( q ) ,     q 1 − q 2 − q 3 − q 4 q 2 q 1 − q 4 q 3 q 3 q 4 q 1 − q 2 q 4 − q 3 q 2 q 1     . T o b e c oncise we wil l use c apital letters to r efer to the matrix r epr esentation of quaternions e.g. Q ≡ T ( q ) , G ≡ T ( g ) . Note that T ( · ) , the inje ctive homo- morphism to the orthonormal matrix ring, by c onstruction satisﬁes the c ondition in Dfn. 1 [65]: det( Q ) = 1 , Q > = Q − 1 , k Q k = k Q i, : k = k Q : ,i k = 1 and Q − q 1 I is skew symmetric: Q + Q > = 2 q 1 I . It is e asy to verify these pr op erties. T line arizes the Hamilton pr o duct or the gr oup c omp osition: g ◦ q , T ( g ) q , Gq . 3.3 3D Poin t Clouds Deﬁnition 5 (Poin t Cloud) We deﬁne a 3D surfac e to b e a diﬀer entiable 2- manifold emb e dde d in the ambient 3D Euclide an sp ac e: M 2 ∈ R 3 and a p oint cloud to b e a discr ete subset sample d on M 2 : X ∈ { x i ∈ M 2 ∩ R 3 } . Deﬁnition 6 (Lo cal Geometry) F or a smo oth p oint cloud { x i } ∈ M 2 ⊂ R N × 3 , a lo cal reference frame (LRF) is deﬁne d as an or der e d b asis of the tan- gent sp ac e at x , T x M , c onsisting of orthonormal ve ctors: L ( x ) = [ ∂ 1 , ∂ 2 , ∂ 3 ≡ ∂ 1 × ∂ 2 ] . Usual ly the ﬁrst c omp onent is deﬁne d to b e the surfac e normal ∂ 1 , n ∈ S 2 : k n k = 1 and the se c ond one is picke d ac c or ding to a heuristic. Note that recent trends, e . g . as in Cohen et al . [17], ackno wledge the ambiguit y and either emplo y a gauge (tangent frame) equiv ariant design or propagate the determination of a certain direction un til the last lay er [54]. Here, we will assume that ∂ 2 can b e uniquely and rep eatably computed, a reasonable assumption for the p oin t sets w e consider [52]. F or the cases where this do es not hold, we will rely on the robustness of the iterativ e routing pro cedures in our netw ork. W e will explain our metho d of choice in Sec. 6 and visualize LRFs of an airplane ob ject in Fig. 1. 6 Y. Zhao et al. 4 S O (3)-Equiv ariant Dynamic Routing Disen tangling orientation from representations requires guaranteed equiv ariances and in v ariances. Y et, the original capsule netw orks of Sabour et al . [60] cannot ac hieve equiv ariance to general groups. T o this end, Lenssen et al . [40] prop osed a dynamic routing pro cedure that guarantees equiv ariance and inv ariance under S O (2) actions, b y applying a manifold-mean and the geodesic distance as routing op erators. W e will extend this idea to the non-ab elian S O (3) and design cap- sule netw orks that sparsely operate on a set of LRFs computed via [53] on lo cal neigh b orho o ds of p oints. The S O (3) elements are paremeterized by quaternions similar to [82]. In the following, we b egin by introducing our nov el equiv arian t dynamic routing procedure, the main building blo ck of our arc hitecture. W e sho w the connection to the w ell kno wn W eiszfeld algorithm, broadening the un- derstanding of dynamic routing b y em b edding it into traditional computer vision metho dology . Then, w e present an example of how to stack those lay ers via a simple aggregation, resulting in an S O (3)-equiv arian t 3D capsule net work that yields inv ariant representations (or activ ations) as w ell as equiv arian t orienta- tions (laten t capsules). 4.1 Equiv arian t Quaternion Mean T o construct equiv arian t lay ers on the group of rotations, we are required to de- ﬁne a left-equiv arian t a veraging operator A that is in v arian t under permutations of the group elements, as well as a distance metric δ that remains unchanged under the action of the group [40]. F or these, w e mak e the following choices: Deﬁnition 7 (Geo desic Distance) The R iemannian (ge o desic) distanc e on the manifold of r otations le ad to the fol lowing ge o desic distanc e δ ( · ) ≡ d quat ( · ) : d ( q 1 , q 2 ) ≡ d quat ( q 1 , q 2 ) = 2 cos − 1 ( |h q 1 , q 2 i| ) (3) Deﬁnition 8 (Quaternion Mean µ ( · ) ) F or a set of Q r otations S = { q i } and asso ciate d weights w = { w i } , the weighte d me an op er ator A ( S , w ) : H 1 n × R n 7→ H 1 n is deﬁne d thr ough the fol lowing maximization pr o c e dur e [48]: ¯ q = arg max q ∈ S 3 q > Mq (4) wher e M ∈ R 4 × 4 is deﬁne d as: M , Q P i =1 w i q i q > i . The av erage quaternion ¯ q is the eigenv ector of M corresp onding to the maxi- m um eigen v alue. This op eration lends itself to both analytic [46] and automatic diﬀeren tiation [39]. The follo wing prop erties allow A ( S , w ) to b e used to build an equiv arian t dynamic routing: Theorem 1 Quaternions, the employe d me an A ( S , w ) and ge o desic distanc e δ ( · ) enjoy the fol lowing pr op erties: QE-Net works 7 Algorithm 1: Quaternion Equiv ariant Dynamic Routing 1 input : Input p oints { x 1 , ..., x K } ∈ R K × 3 , input capsules (LRFs) Q = { q 1 , . . . , q L } ∈ H 1 L , with L = N c · K , N c is the num b er of capsules p er p oint, activ ations α = ( α 1 , . . . , α L ) T , trainable transformations T = { t i,j } i,j ∈ H 1 L × M 2 output: Up dated frames ˆ Q = { ˆ q 1 , . . . , ˆ q M } ∈ H 1 M , up dated activ ations ˆ α = ( ˆ α 1 , . . . , ˆ α M ) T 3 for Al l primary (input) c apsules i do 4 for Al l latent (output) c apsules j do 5 v i,j ← q i ◦ t i,j // compute votes 6 for Al l latent (output) c apsules j do 7 ˆ q j ← A  { v 1 ,j . . . v K,j } , α  // initialize output capsules 8 for k iterations do 9 for A l l primary (input) c apsules i do 10 w i,j ← α i · sigmoid  − δ ( ˆ q j , v i,j )  // the current weight 11 ˆ q j ← A  { v 1 ,j . . . v L,j } , w : ,j  // see Eq (4) 12 ˆ α j ← sigmoid  − 1 K L P 1 δ ( ˆ q j , v i,j )  // recompute activations 1. A ( g ◦ S , w ) is left-e quivariant: A ( g ◦ S , w ) = g ◦ A ( S , w ) . 2. Op er ator A is invariant under p ermutations: A ( { q 1 , . . . , q Q } , w ) = A ( { q σ (1) , . . . , q σ ( Q ) } , w σ ) . (5) 3. The tr ansformations g ∈ H 1 pr eserve the ge o desic distanc e δ ( · ) given in Dfn. 7. Pr o of. The pro ofs are given in the supplemen tary material. W e also note that the ab ov e mean is closed form, diﬀerentiable and can b e com- puted in a batch-wise fashion. W e are now ready to construct the dynamic r outing (DR) b y agreement that is equiv arian t to S O (3) actions, thanks to Thm. 1. 4.2 Equiv arian t W eiszfeld Dynamic Routing Our routing pro cedure extends previous work [60,40] for quaternion v alued input. The core idea is to r oute from the primary c apsules that constitute the input LRF set to the latent c apsules by an iterativ e clustering of votes v i,j . A t each step, w e assign the weigh ted group mean of v otes to the resp ective output capsules. The weigh ts w ← σ ( x , y ) are inv ersely prop otional to the distance b etw een the v ote quaternions and the new quaternion (cluster cen ter). See Alg. 1 for details. In the following, w e analyze our v arian t of routing as an interesting case of the aﬃne, Riemannian W eiszfeld algorithm [4,3]. Lemma 1. F or σ ( x , y ) = δ ( x , y ) q − 2 the e quivariant r outing pr o c e dur e given in Alg. 1 is a variant of the aﬃne subsp ac e Wieszfeld algorithm [4,3] that is a r obust algorithm for c omputing the L q ge ometric me dian. 8 Y. Zhao et al. ! " #$%% & ! ' ! (& " ) * + * , & " * - #$ % (. " ) " * / ! #$ & ! " !"#$ (⋅) (0 " ) * / ! & ! " (0 " '( ) 1 & " transform - ke rne l Dynamic(Routing 2 3 4 5 " * / ! #$ & ! " $) ! 6 . " " * / ! ) ! 6 7 " * - ) ! 7 " * - #$ & ! " 8 * + Fig. 2. Our q uaternion e quiv ariant c apsule (QEC) lay er for pro cessing lo cal patc hes: Our input is a 3D p oint set X on which we query local neighborho o ds { x i } with precomputed LRFs { q i } . Essen tially , we learn the parameters of a fully connected net work that contin uously maps the canonicalized lo cal point set to transformations t i , which are used to compute h yp otheses (v otes) from input capsules. By a sp ecial dynamic routing procedure that uses the activ ations determined in a previous lay er, w e arriv e at latent capsules that are composed of a set of orien tations ˆ q i and new activ ations ˆ α i . Thanks to the decoupling of local reference frames, ˆ α i is in v ariant and orientations ˆ q i are equiv arian t to input rotations. All the op erations and hence the en tire QE-netw ork are equiv ariant achieving a guaranteed disentanglemen t of the rotation parameters. Hat symb ol ( ˆ q ) r efers to ’estimate d’. Pr o of (Pr o of Sketch). The pro of follows from the deﬁnition of W eiszfeld itera- tion [3] and the mean and distance op erators deﬁned in Sec. 4.1. W e ﬁrst sho w that computing the weigh ted mean is equiv alen t to solving the normal equa- tions in the iteratively reweigh ted least squares (IRLS) scheme [14]. Then, the inner-most lo op corresp onds to the IRLS or W eiszfeld iterations. W e pro vide the detailed proof in supplementary material. Note that, in practice one is quite free to c ho ose the w eighting function σ ( · ) as long as it is inv ersely proportional to the geo desic distance and conca ve [2]. The original dynamic routing can also be formulated as a clustering pro cedure with a KL divergence regularization. This holistic view pa ves the wa y to better routing algorithms [68]. Our persp ective is akin y et more geometric due to the group structure of the parameter space. Thanks to the connection to W eiszfeld algorithm, the con v ergence b ehavior of our dynamic routing can be directly analyzed within the theoretical framework presented by [3,4]. Theorem 2 Under mild assumptions pr ovide d in the app endix, the se quenc e of the DR-iter ates gener ate d by the inner-most lo op almost sur ely c onver ges to a critic al p oint. Pr o of (Pr o of Sketch). Pro of, given in the app endix, is a direct consequence of Lemma 1 and directly exploits the connection to the W eiszfeld algorithm. In summary , the provided theorems sho w that our dynamic routing b y agreement is in fact a v ariant of robust IRLS rotation av eraging on the predicted v otes, where reﬁned inlier scores for combinations of input/output capsules are used to route information from one lay er to the next. QE-Net works 9 Algorithm 2: Quaternion Equiv ariant Capsule Mo dule 1 input : Input p oints of one patc h { x 1 , ..., x K } ∈ R K × 3 , input capsules (LRFs) Q = { q 1 , . . . , q L } ∈ H 1 L , with L = N c · K , N c is the num b er of capsules p er p oint, activ ations α = ( α 1 , . . . , α L ) T 2 output: Up dated frames ˆ Q = { ˆ q 1 , . . . , ˆ q M } ∈ H 1 M , up dated activ ations ˆ α = ( ˆ α 1 , . . . , ˆ α M ) T 3 for Each input channel n c of al l the primary capsules channels N c do 4 µ ( n c ) ← A ( Q ( n c )) // Input quaternion average, see Eq (4) 5 for Each p oint x i of this p atch do 6 x 0 i ← µ ( n c ) − 1 ◦ x i // Rotate to a canonical orientation 7 { x 0 i } ∈ R K × N c × 3 // Points in multiple( N c ) canonical frames 8 for Each p oint x 0 i of this p atch do 9 t ← t ( x 0 i ) // Transform kernel, t ( · ) : R N c × 3 → R N c × M × 4 10 T ≡ { t i } ∈ H 1 K × N c i × M ← { t } ∈ H 1 L × M 11 ( ˆ Q , ˆ α ) ← DynamicRouting( X, Q , α , T ) // See Alg. 1 5 Equiv ariant Capsule Netw ork Arc hitecture In the following, we describ e how we lev erage the no vel dynamic routing algo- rithm to build a capsule netw ork for p oint cloud processing that is equiv ariant under S O (3) actions on the input. The essen tial ingredien t of our architecture, the quaternion e quivariant c apsule (QEC) mo dule that implements a capsule la yer with dynamic routing, is described in Sec. 5.1, b efore using it as building blo c k in the full arc hitecture, as describ ed in Sec. 5.2. 5.1 QEC Mo dule The main mo dule of our architecture, the QEC mo dule, is outlined in Fig. 2. W e also pro vide the corresp onding pseudoco de in Alg. 2. Input. The input to the mo dule is a lo cal patc h of p oin ts with co ordinates x i ⊂ R K × 3 , rotations (LRFs) attached to these points, parametrized as quaternions q i ⊂ H 1 K × N c and activ ations α i ⊂ R K × N c . W e also use q i to denote the input capsules. N c is the n umber of input capsule c hannels p er p oint and it is equal to the n umber of output capsules ( M ) from the last lay er. T r ainable tr ansformations. Recalling the original capsule netw orks of Sab our et al . [60], the trainable transformations t , which are applied to the input ro- tations to compute the v otes, lie in a grid kernel in the 2D image domain. Therefore, the pro cedure can learn to pro duce w ell-aligned votes if and only if the learned patterns in t match those in input capsule sets (agreemen t on evi- dence of ob ject existence). Since our input p oints in the local receptive ﬁeld lie in contin uous R 3 , training a discrete set of p ose transformations t i,j based on discrete lo cal coordinates is not p ossible. Instead, we use a similar approach as 10 Y. Zhao et al. QEC$ - Module ! ⊂ ℳ , QEC$ - Module QEC$ - Module QEC$ - Module QEC$ - Module QEC$ - Module Downsample Downsample Intermediate *Capsules Rotat ion ' Estimation Classification ! ! " ! Fig. 3. Our entire capsule-netw ork architecture. W e hierarchically send all the lo cal patc hes to our QEC-mo dule as shown in Fig. 2. A t each level the points are po oled in order to increase the receptiv e ﬁeld, gradually reducing the LRFs in to a single capsule p er class. W e use classiﬁcation and orien tation estimation (in the siamese case) as sup ervision cues to train the transform-k ernels t ( · ). Lenssen et al . [40] and employ a contin uous kernel t ( · ) : R N c × 3 → R M × N c × 4 that is deﬁned on the con tin uous R N c × 3 , instead of only a discrete set of po- sitions. The netw ork is shared ov er all p oints to compute the transformations t i,j = ( t ( x 0 1 ) , ..., t ( x 0 K )) i,j ⊂ R K × M × N c × 4 , whic h are used to calculate the votes for dynamic routing with v i,j = q i ◦ t i,j . The netw ork t ( · ) consists of fully- connected la yers that regresses the transformations, similar to common op er- ators for con tinuous conv olutions [61,70,28], just with quaternion output. The k ernel is able to learn pose patterns in the 3D space, whic h align the result- ing v otes if certain pose sets are presen t. Note that t ( · ) predicts quaternions by unit-normalizing the regressed output: t i,j ⊂ H 1 K × M × N c . Although Rieman- nian la y ers [7] or spherical predictions [43] can impro ve the p erformance, the simple strategy w orks reasonably for our case. In order for the k ernel to be in v arian t, it needs to b e aligned using an equiv- arian t initial orientation candidate [40]. Given p oints x i and rotations q i , we compute the mean µ i in a channel-wise manner like that of the initial candi- dates: µ i ⊂ H 1 N c . These candidates are used to bring the kernels in canonical orien tations by in v ersely rotating the input points: x 0 i = ( µ i − 1 ◦ x i ) ⊂ R K × N c × 3 . Computing the output. After computing the votes, we utilize the input ac- tiv ation α i as initialization w eights and iteratively reﬁne the output capsule rotations (robust rotation estimation on v otes) ˆ q i and activ ations ˆ α i (ﬁnal inlier scores) b y our W eiszfeld routing b y agreement as shown in Alg. 1. 5.2 Net work Arc hitecture F or processing point clouds, w e use m ultiple QEC mo dules in a hierarc hical arc hitecture as sho wn in Fig. 3. In the ﬁrst lay er, the input primary capsules are represen ted by LRFs computed with FLARE algorithm [53]. Therefore, the n umber of input capsule c hannels N c in the ﬁrst lay er is equal to 1 and activ ations are uniform. The output of a former lay er is propagated to the input of the latter, creating the hierarc hy . In order to gradually increase the receptiv e ﬁeld, w e stack QEC modules creating a deep hierarch y , where each lay er reduces the num b er of p oints and increases the receptive ﬁeld. In our exp eriments, we use a tw o level architecture, QE-Net works 11 T able 1. Classiﬁcation accuracy on ModelNet40 dataset [77] for diﬀeren t methods as w ell as ours. W e also rep ort the num b er of parameters optimized for each metho d. X / Y means that we train with X and test with Y . PN PN++ DGCNN KDT reeNet Point2Seq Sph.CNNs PRIN PPF Ours (V ar.) Ours NR/NR 88.45 89.82 92.90 86.20 92.60 - 80.13 70.16 85.27 74.43 NR/AR 12.47 21.35 29.74 8.49 10.53 43.92 68.85 70.16 11.75 74.07 #Params 3.5M 1.5M 2.8M 3.6M 1.8M 0.5M 1.5M 3.5M 0.4M 0.4M whic h receiv es N = 64 patc hes as input. W e call the centers of these patches p o oling c enters and compute them via a uniform farthest point sampling as in [11]. Pooling cen ters serve as the p ositions of output capsules of the curren t la yer. Each of those centers is linked to their immediate vicinity leading to K = 9- star lo cal connectivit y from whic h serve as input to the ﬁrst QEC mo dule to compute rotations and activ ations of 64 × 64 × 4 in termediate capsules. The second mo dule connects those intermediate capsules to the output capsules, whose num b er corresponds to the n umber of classes. Sp eciﬁcally , for lay er 1, w e use K = 9 , N l c = 1 , M l = 64 and for lay er 2, K = 64 , N l c = 64 , M l = C = 40. This wa y , the last QEC mo dule receiv es only one input patc h and po ols all capsules into a single p oint with an estimated LRF. F or further details, we refer to our source code, which we will make av ailable online before publication and pro vide in the supplemen tal materials. 6 Exp erimen tal Ev aluations Implementation details. W e implement our net work in PyT orc h and use the AD AM optimizer [35] with a learning rate of 0 . 001. Our p oint-transformation mapping net work (transform-k ernel) is implemen ted by tw o FC-la y ers comp osed of 64 hidden units. W e set the initial activ ation of the input LRF to 1 . 0. In each la yer, w e use 3 iterations of DR. F or classiﬁcation we use the spread loss [59] and the rotation loss is identical to δ ( · ). The ﬁrst axis of the LRF is the surface normal computed by local plane ﬁts [31]. W e compute the second axis, ∂ 2 , b y FLARE [53], that uses the nor- malized pro jection of the p oint with the largest distance within the p eriphery of the support, onto the tangent plane of the cen ter: ∂ 2 = p max − p k p max − p k . Using other c hoices suc h as SHOT [67] or GF rames [51] is possible. W e found FLARE to b e suﬃcien t for our exp erimen ts. Prior to all operations, we ﬂip all the LRF quaternions suc h that they lie on the northern hemisphere : { q i ∈ S 3 : q w i > 0 } . 3D shap e classiﬁc ation. W e use Mo delNet40 dataset of [77,57] to assess our classiﬁcation p erformance where eac h shap e is comp osed of 10 K p oints randomly sampled from the mesh surfaces of eac h shap e [55,57]. W e use the oﬃcial split with 9,843 shap es for training and 2,468 for testing. W e assign the LRFs to a subset of the uniformly sampled p oints, N = 512 [11]. During training, w e do not augment the dataset with random rotations. All the shap es are trained with single orientation (w ell-aligned). W e call this tr aine d 12 Y. Zhao et al. T able 2. Relativ e angular error (RAE) of rotation estimation in diﬀerent categories of Mo delNet10. Righ t side of the table denotes the ob jects with rotational symmetry , whic h w e include for completeness. PCA-S refers to running PCA only on a resampled instance, while PCA-SR applies b oth rotations and resampling. Method Avg. No Sym Chair Bed Sofa T oilet Monitor T able Desk Dresser NS Bath tub Mean LRF 0.41 0.35 0.32 0.36 0.34 0.41 0.34 0.45 0.60 0.50 0.46 0.32 PCA-S 0.40 0.42 0.60 0.53 0.46 0.32 0.12 0.47 0.23 0.33 0.43 0.55 PCA-SR 0.67 0.67 0.69 0.70 0.67 0.68 0.61 0.67 0.67 0.67 0.66 0.70 Poin tNetLK [5] 0.37 0.38 0.43 0.31 0.40 0.40 0.31 0.40 0.33 0.39 0.38 0.34 IT-Net [80] 0.27 0.19 0.10 0.22 0.17 0.20 0.28 0.31 0.41 0.44 0.40 0.39 Ours 0.27 0.17 0.11 0.20 0.16 0.18 0.19 0.43 0.40 0.48 0.33 0.31 Ours (siamese) 0.20 0.09 0.08 0.10 0.08 0.11 0.08 0.40 0.35 0.34 0.32 0.30 with NR . During testing, we randomly generate m ultiple arbitrary S O (3) rota- tions for each shap e and ev aluate the av erage performance for all the rotations. This is called test with AR . This proto col is similar to [5]’s and is used b oth for our algorithms and for the baselines. Our results are shown in T ab. 1 along with that of P ointNet (PN) [55], P ointNet++ (PN++) [55], DGCNN [72], KD- treeNet [41], Poin t2Seq [45], Spherical CNNs [25], PRIN [79] and the theoretically in v ariant PPF-F oldNet (PPF) [23]. W e also present a version of our algorithm ( V ar ) that av oids the canonicalization within the QE-netw ork. This is a non- equiv arian t netw ork that w e still train without data augmentation or orien tation sup ervision. While this version gets comparable results to the state of the art for the NR/NR case, it cannot handle random S O (3) v ariations (AR). Note that PPF uses the point-pair-feature [10] enco ding and hence creates in v ariant input representations. F or the scenario of NR/AR, our equiv ariant version out- p erforms all the other metho ds, including equiv arian t spherical CNNs [25] by a signiﬁcan t gap of at least 5% even when [25] exploits the 3D mesh. The ob ject rotational symmetries in this dataset are resp onsible for a signiﬁcant p ortion of the errors we make and we provide further details in supplementary material. It is worth men tioning that we also trained TFNs [66] for that task, but their memory demand made it infeasible to scale to this application. Computational asp e cts. As shown in T ab. 1 for Mo delNet40 our netw ork has 0 . 047 M parameters. It incurs a computational cost in the order O ( M K L ). The details are giv en in the supplemen tary material. R otation estimation in 3D p oint clouds. Our netw ork can estimate b oth the absolute and relative 3D ob ject rotations without p ose-sup ervision. T o ev al- uate this desired property , we used the w ell classiﬁed shap es on ModelNet10 dataset, a sub-dataset of Mo delnet40 [77]. This time, we use the oﬃcial Mo de- lenet10 dataset split with 3991 for training and 908 shap es for testing. During testing, w e generate multiple instances p er shap e by transforming the instance with ﬁv e arbitrary S O (3) rotations. As w e are also aﬀected b y the sampling of the point cloud, w e resample the mesh ﬁve times and generate diﬀeren t p o oling graphs across all the instances of the same shape. Our QE- arc hitecture can estimate the p ose in tw o wa ys: 1) c anonic al : by directly using the QE-Net works 13 𝑸 ⊂ ℝ 𝐶 x4 { 𝐐 𝒊 } 𝛂 ⊂ ℝ 𝐶 Sh ape Ali g n m en t Fig. 4. Shape alignment on the monitor (left) and toilet (righ t) ob jects via our siamese equiv arian t capsule arc hitecture. The shapes are assigned to the the maximally activ ated class. The corresp onding p ose capsule pro vides the rotation estimate. output capsule with the highest activ ation, 2) siamese : b y a siamese architecture that computes the relativ e quaternion b etw een the capsules that are maximally activ ated as sho wn in Fig. 4. Both mo des of op eration are free of the data augmen tation and w e give further sc hematics of the latter in our app endix. It is w orth mentioning that unlike regular p ose estimation algorithms whic h utilize the same shap e in b oth training and testing, our net work never sees the test shap es during training. This is also kno wn as c ate gory-level p ose estimation [69]. Our results against the baselines including a naive av eraging of the LRFs (Mean LRF) and principal axis alignmen t (PCA) are rep orted in T ab. 2 as the relativ e angular error (RAE). W e further include results of Poin tNetLK [5] and IT-Net [80], tw o state of the art 3D net w orks that iteratively align tw o given p oin t sets. These metho ds are in nature similar to iterative closest point (ICP) algorithm [8] but 1) do not require an initialization ( e . g . ﬁrst iteration estimates the p ose), 2) learn data driv en up dates. Metho ds that use mesh inputs such as Spherical CNNs [25] cannot be included here as the random sampling of the same surface would not aﬀect those. W e also av oid metho ds that are just in v ariant to rotations (and hence cannot estimate the p ose) such as T ensorﬁeld Net works [66]. Finally , note that , IT-net [80] and Poin tNetLK [5] need to train for a lot of ep o chs ( e . g . 500) with random S O (3) rotation augmentation in order to obtain mo dels with full co verage of S O (3), whereas w e train only for ∼ 100 ep o chs. Finally , the recen t geometric capsule net w orks [64] remains similar to PCA with an RAE of 0 . 42 on No Sym when ev aluated under identical settings. W e include more details ab out the baselines in the appendix. Relativ e Angle in Degrees (RAE) b etw een the ground truth and the pre- diction is computed as: d ( q 1 , q 2 ) /π . Note that resampling and random rota- tions render the job of all methods diﬃcult. Ho wev er, b oth of our canonical and siamese versions which try to ﬁnd a canonical and a relative alignment resp ectiv ely , are b etter than the baselines. As p ose estimation of ob jects with rotational symmetry is a challenging task due to inherent am biguities, we also rep ort results on the non-symmetric subset (No Sym). 14 Y. Zhao et al. T able 3. Ablation study on p oint densit y . LRF Input LRF-10K LRF-2K LRF-1K Dropout 50% 66% 75% 100% 100% 100% Classiﬁcation Accuracy 77.8 83.3 83.4 87.8 85.46 79.74 Angular Error 0.34 0.27 0.25 0.09 0.10 0.12 R obustness against p oint and LRF r esampling. Densit y changes in the lo cal neighborho o ds of the shap e are an imp ortan t cause of error for our net work. Hence, we ablate by applying random resampling (patch-wise drop out) ob jects in Mo delNet10 dataset and rep eating the classiﬁcation and pose estimation as describ ed ab ov e. While w e use all the classes in classiﬁcation accuracy , w e only consider the well classiﬁed non-symmetric (No Sym) ob jects for ablating on the p ose estimation. The ﬁrst part (LRF-10K) of T ab. 3 shows our ﬁndings against gradual increases of the num ber of patches. Here, w e sample 2K LRFs from the 10K LRFs computed on an input point set of cardinality 10K. 100% drop out corresp onds to 2K points in all columns. On second ablation, we reduce the amoun t of p oints on which w e compute the LRFs, to 2K and 1K respectively . As w e can see from the table, our netw ork is robust to wards the changes in the LRFs as w ell as the densit y of the points. 7 Conclusion and Discussion W e ha v e presen ted a new framew ork for ac hieving permutation inv ariant and S O (3) equiv ariant representations on 3D point clouds. Proposing a v ariant of the capsule netw orks, w e op erate on a sparse set of rotations sp eciﬁed b y the input LRFs thereby circum ven ting the eﬀort to cov er the entire S O (3). Our net- w ork natively consumes a compact representation of the group of 3D rotations - quaternions. W e ha ve theoretically shown its equiv ariance and established con- v ergence results for our W eiszfeld dynamic routing b y making connections to the literature of robust optimization. Our net work by construction disentangles the ob ject existence that is used as global features in classiﬁcation. It is among the few for having an explicit group-v alued latent space and th us naturally estimates the orien tation of the input shape, even without a sup ervision signal. Limitations. In the current form our p erformance is severely aﬀected by the shap e symmetries. The length of the activ ation v ector dep ends on the n umber of classes and for a suﬃciently descriptive laten t vector we need to hav e signiﬁcan t n umber of classes. On the other hand, this allows us to p erform with merit on problems where the n umber of classes are large. The computation of LRFs are still sensitive to the p oint densit y changes and resampling. LRFs themselves can also be ambiguous and sometimes non-unique. F utur e work. Inspired by [17] and [54] our feature w ork will in volv e exploring the Lie algebra for equiv ariances, establishing inv ariance to the tangen t direc- tions, application of our netw ork in the broader context of 6DoF ob ject detection from point sets and lo oking for equiv ariances among p oint resampling. QE-Net works 15 References 1. Afshar, P ., Mohammadi, A., Plataniotis, K.N.: Brain tumor type classiﬁcation via capsule net works. In: 2018 25th IEEE International Conference on Image Pro cess- ing (ICIP) (2018) 2. Aftab, K., Hartley , R.: Conv ergence of iterativ ely re-w eighted least squares to ro- bust m-estimators. In: Win ter Conference on Applications of Computer Vision. IEEE (2015) 3. Aftab, K., Hartley , R., T rumpf, J.: Generalized weiszfeld algorithms for lq opti- mization. IEEE transactions on pattern analysis and machine intelligence 37 (4) (2014) 4. Aftab, K., Hartley , R., T rumpf, J.: l q closest-p oin t to aﬃne subspaces using the generalized weiszfeld algorithm. In ternational Journal of Computer Vision (2015) 5. Aoki, Y., Goforth, H., Sriv atsan, R.A., Lucey , S.: Poin tnetlk: Robust & eﬃcient p oin t cloud registration using p ointnet. In: Pro ceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7163–7172 (2019) 6. Bao, E., Song, L.: Equiv arian t neural net works and equiv ariﬁcation. arXiv preprin t arXiv:1906.07172 (2019) 7. Becigneul, G., Ganea, O.E.: Riemannian adaptiv e optimization metho ds. In: In- ternational Conference on Learning Representations (2019) 8. Besl, P .J., McKay , N.D.: Metho d for registration of 3-d shap es. In: Sensor fusion IV: control paradigms and data structures. vol. 1611, pp. 586–606. International So ciet y for Optics and Photonics (1992) 9. Birdal, T., Arb el, M., Simsekli, U., Guibas, L.J.: Sync hronizing probability mea- sures on rotations via optimal transp ort. In: Pro ceedings of the IEEE/CVF Con- ference on Computer Vision and P attern Recognition. pp. 1569–1579 (2020) 10. Birdal, T., Ilic, S.: P oin t pair features based ob ject detection and pose estima- tion revisited. In: 2015 In ternational Conference on 3D Vision. pp. 527–535. IEEE (2015) 11. Birdal, T., Ilic, S.: A p oint sampling algorithm for 3d matching of irregular geome- tries. In: IEEE/RSJ International Conference on Intelligen t Rob ots and Systems (IR OS). IEEE (2017) 12. Birdal, T., Simsekli, U., Eken, M.O., Ilic, S.: Bay esian p ose graph optimization via bingham distributions and temp ered geo desic mcmc. In: Adv ances in Neural Information Pro cessing Systems. pp. 308–319 (2018) 13. Bo omsma, W., F rellsen, J.: Spherical conv olutions and their application in molec- ular mo delling. In: Adv ances in Neural Information Pro cessing Systems 30. pp. 3433–3443 (2017) 14. Burrus, C.S.: Iterativ e rew eighted least squares. OpenStax CNX. Av ailable online: h ttp://cnx. org/con tents/92b90377-2b34-49e4-b26f-7fe572db78a1 12 (2012) 15. Busam, B., Birdal, T., Na v ab, N.: Camera p ose ﬁltering with lo cal regression geo desics on the riemannian manifold of dual quaternions. In: IEEE In ternational Conference on Computer Vision W orkshop (ICCVW) (Octob er 2017) 16. Chakrab ort y , R., Banerjee, M., V emuri, B.C.: H-cnns: Conv olutional neural net- w orks for riemannian homogeneous spaces. arXiv preprint arXiv:1805.05487 (2018) 17. Cohen, T., W eiler, M., Kicanaoglu, B., W elling, M.: Gauge equiv arian t conv olu- tional net w orks and the icosahedral CNN. In: Pro ceedings of the 36th In ternational Conference on Machine Learning. pp. 1321–1330 (2019) 18. Cohen, T., W elling, M.: Group equiv ariant conv olutional netw orks. In: Interna- tional conference on machine learning. pp. 2990–2999 (2016) 19. Cohen, T.S., Geiger, M., K¨ ohler, J., W elling, M.: Spherical cnns. In: 6th Interna- tional Conference on Learning Representations, (ICLR) (2018) 16 Y. Zhao et al. 20. Cohen, T.S., Geiger, M., W eiler, M.: A general theory of equiv ariant cnns on ho- mogeneous spaces. In: Adv ances in Neural Information Pro cessing Systems. pp. 9145–9156 (2019) 21. Cohen, T.S., W elling, M.: Steerable cnns. International Conference on Learning Represen tations (ICLR) (2017) 22. Cruz-Mota, J., Bogdanov a, I., P aquier, B., Bierlaire, M., Thiran, J.P .: Scale in- v arian t feature transform on the sphere: Theory and applications. In ternational Journal of Computer Vision 98 (2), 217–241 (Jun 2012) 23. Deng, H., Birdal, T., Ilic, S.: Ppf-foldnet: Unsupervised learning of rotation inv ari- an t 3d lo cal descriptors. In: Europ ean Conference on Computer Vision (ECCV) (2018) 24. Deng, H., Birdal, T., Ilic, S.: Ppfnet: Global con text aw are local features for robust 3d p oint matc hing. In: Conference on Computer Vision and P attern Recognition (2018) 25. Estev es, C., Allen-Blanc hette, C., Mak adia, A., Daniilidis, K.: Learning so (3) equiv arian t represen tations with spherical cnns. In: Proceedings of the Europ ean Conference on Computer Vision (ECCV). pp. 52–68 (2018) 26. Estev es, C., Sud, A., Luo, Z., Daniilidis, K., Mak adia, A.: Cross-domain 3d equiv arian t image embeddings. In: In ternational Conference on Machine Learn- ing (ICML) (2019) 27. Estev es, C., Xu, Y., Allen-Blanchette, C., Daniilidis, K.: Equiv arian t multi-view net works. In: Proceedings of the IEEE International Conference on Computer Vi- sion. pp. 1568–1577 (2019) 28. F ey , M., Eric Lenssen, J., W eic hert, F., M ¨ uller, H.: Splinecnn: F ast geometric deep learning with contin uous b-spline kernels. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018) 29. Giles, C.L., Maxw ell, T.: Learning, inv ariance, and generalization in high-order neural netw orks. Applied optics 26 (23), 4972–4978 (1987) 30. Hin ton, G.E., Krizhevsky , A., W ang, S.D.: T ransforming auto-enco ders. In: In ter- national Conference on Artiﬁcial Neural Net works. pp. 44–51. Springer (2011) 31. Hopp e, H., DeRose, T., Duchamp, T., McDonald, J., Stuetzle, W.: Surface recon- struction from unorganized p oints, v ol. 26.2. A CM (1992) 32. Jaisw al, A., Ab dAlmageed, W., W u, Y., Natara jan, P .: Capsulegan: Generative adv ersarial capsule net work. In: Computer Vision – ECCV 2018 W orkshops. pp. 526–535. Springer International Publishing (2019) 33. Jiang, C.M., Huang, J., Kashinath, K., Prabhat, Marcus, P ., Niessner, M.: Spher- ical CNNs on unstructured grids. In: International Conference on Learning Repre- sen tations (2019) 34. Khoury , M., Zhou, Q.Y., Koltun, V.: Learning compact geometric features. In: Pro ceedings of the IEEE In ternational Conference on Computer Vision. pp. 153– 161 (2017) 35. Kingma, D.P ., Ba, J.: Adam: A method for sto chastic optimization. arXiv preprin t arXiv:1412.6980 (2014) 36. Kondor, R., Lin, Z., T riv edi, S.: Clebsch–gordan nets: a fully fourier space spher- ical conv olutional neural net work. In: Adv ances in Neural Information Pro cessing Systems (2018) 37. Kondor, R., T riv edi, S.: On the generalization of equiv ariance and conv olution in neural netw orks to the action of compact groups. In: In ternational Conference on Mac hine Learning. pp. 2747–2755 (2018) 38. Kosiorek, A., Sabour, S., T eh, Y.W., Hin ton, G.E.: Stac ked capsule auto enco ders. In: Adv ances in Neural Information Pro cessing Systems. pp. 15512–15522 (2019) QE-Net works 17 39. Laue, S., Mitterreiter, M., Giesen, J.: Computing higher order deriv atives of matrix and tensor expressions. In: Adv ances in Neural Information Pro cessing Systems (2018) 40. Lenssen, J.E., F ey , M., Libuschewski, P .: Group equiv ariant capsule netw orks. In: Adv ances in Neural Information Pro cessing Systems. pp. 8844–8853 (2018) 41. Li, J., Chen, B.M., Hee Lee, G.: So-net: Self-organizing netw ork for p oint cloud analysis. In: Pro ceedings of the IEEE conference on computer vision and pattern recognition (2018) 42. Li, Y., Bu, R., Sun, M., W u, W., Di, X., Chen, B.: P ointcnn: Conv olution on x- transformed points. In: Adv ances in Neural Information Processing Systems (2018) 43. Liao, S., Gavv es, E., Sno ek, C.G.: Spherical regression: Learning viewpoints, sur- face normals and 3d rotations on n-spheres. In: Pro ceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9759–9767 (2019) 44. Liu, M., Y ao, F., Choi, C., Ay an, S., Ramani, K.: Deep learning 3d shap es using alt-az anisotropic 2-sphere conv olution. In: International Conference on Learning Represen tations (ICLR) (2019) 45. Liu, X., Han, Z., Liu, Y.S., Zwick er, M.: P oint2sequence: Learning the shap e rep- resen tation of 3d p oint clouds with an atten tion-based sequence to sequence net- w ork. In: Proceedings of the AAAI Conference on Artiﬁcial Intelligence. vol. 33, pp. 8778–8785 (2019) 46. Magn us, J.R.: On diﬀerentiating eigenv alues and eigenv ectors. Econometric Theory 1 (2) (1985) 47. Marcos, D., V olpi, M., Komodakis, N., T uia, D.: Rotation equiv arian t vector ﬁeld net works. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017) 48. Markley , F.L., Cheng, Y., Crassidis, J.L., Oshman, Y.: Av eraging quaternions. Journal of Guidance, Control, and Dynamics 30 (4), 1193–1197 (2007) 49. Maturana, D., Sc herer, S.: V oxnet: A 3d conv olutional neural net work for real-time ob ject recognition. In: Intelligen t Rob ots and Systems (IR OS). IEEE (2015) 50. Mehr, E., Lieutier, A., Sanchez Berm udez, F., Guitteny , V., Thome, N., Cord, M.: Manifold learning in quotient spaces. In: Pro ceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9165–9174 (2018) 51. Melzi, S., Spezialetti, R., T om bari, F., Bronstein, M.M., Stefano, L.D., Ro dola, E.: Gframes: Gradien t-based lo cal reference frame for 3d shape matching. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019) 52. P etrelli, A., Di Stefano, L.: On the rep eatability of the lo cal reference frame for partial shap e matching. In: 2011 International Conference on Computer Vision. IEEE (2011) 53. P etrelli, A., Di Stefano, L.: A repeatable and eﬃcien t canonical reference for surface matc hing. In: 2012 Second In ternational Conference on 3D Imaging, Mo deling, Pro cessing, Visualization & T ransmission. pp. 403–410. IEEE (2012) 54. P oulenard, A., Ovsjaniko v, M.: Multi-directional geo desic neural netw orks via equiv arian t conv olution. In: SIGGRAPH Asia 2018 T echnical Papers. p. 236. ACM (2018) 55. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Poin tnet: Deep learning on p oint sets for 3d classiﬁcation and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 652–660 (2017) 56. Qi, C.R., Su, H., Nießner, M., Dai, A., Y an, M., Guibas, L.J.: V olumetric and m ulti-view cnns for ob ject classiﬁcation on 3d data. In: Pro ceedings of the IEEE conference on computer vision and pattern recognition. pp. 5648–5656 (2016) 18 Y. Zhao et al. 57. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: P ointnet++: Deep hierarc hical feature learn- ing on point sets in a metric space. In: Adv ances in neural information processing systems. pp. 5099–5108 (2017) 58. Rezatoﬁghi, S.H., Milan, A., Abbasnejad, E., Dic k, A., Reid, I., et al.: Deepsetnet: Predicting sets with deep neural net works. In: 2017 IEEE In ternational Conference on Computer Vision (ICCV). pp. 5257–5266. IEEE (2017) 59. Sab our, S., F rosst, N., Hinton, G.: Matrix capsules with em routing. In: 6th Inter- national Conference on Learning Representations, ICLR (2018) 60. Sab our, S., F rosst, N., Hinton, G.E.: Dynamic routing betw een capsules. In: Ad- v ances in neural information pro cessing systems. pp. 3856–3866 (2017) 61. Sc h ¨ utt, K., Kindermans, P .J., Sauceda F elix, H.E., Chmiela, S., Tk atc henko, A., M ¨ uller, K.R.: Schnet: A con tinuous-ﬁlter con volutional neural netw ork for model- ing quantum interactions. In: Adv ances in Neural Information Processing Systems (2017) 62. Shen, Y., F eng, C., Y ang, Y., Tian, D.: Mining point cloud lo cal structures by k ernel correlation and graph p o oling. In: Pro ceedings of the IEEE conference on computer vision and pattern recognition. pp. 4548–4557 (2018) 63. Sp ezialetti, R., Salti, S., Stefano, L.D.: Learning an eﬀective equiv ariant 3d descrip- tor without sup ervision. In: Pro ceedings of the IEEE In ternational Conference on Computer Vision. pp. 6401–6410 (2019) 64. Sriv asta v a, N., Goh, H., Salakhutdino v, R.: Geometric capsule autoenco ders for 3d p oin t clouds. arXiv preprint arXiv:1912.03310 (2019) 65. Steenro d, N.E.: The topology of ﬁbre bundles, vol. 14. Princeton Univ ersity Press (1951) 66. Thomas, N., Smidt, T., Kearnes, S., Y ang, L., Li, L., Kohlhoﬀ, K., Riley , P .: T ensor ﬁeld net works: Rotation-and translation-equiv ariant neural netw orks for 3d p oint clouds. arXiv preprint arXiv:1802.08219 (2018) 67. T ombari, F., Salti, S., Di Stefano, L.: Unique signatures of histograms for local surface description. In: European conference on computer vision. pp. 356–369. Springer (2010) 68. W ang, D., Liu, Q.: An optimization view on dynamic routing b etw een capsules (2018), https://openreview.net/forum?id=HJjtFYJDf 69. W ang, H., Sridhar, S., Huang, J., V alen tin, J., Song, S., Guibas, L.J.: Normalized ob ject co ordinate space for category-level 6d ob ject p ose and size estimation. In: Pro ceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2642–2651 (2019) 70. W ang, S., Suo, S., Ma, W.C., Pokro vsky , A., Urtasun, R.: Deep parametric con tin- uous con volutional neural net works. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018) 71. W ang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph cnn for learning on p oint clouds. ACM T ransactions on Graphics (TOG) (2019) 72. W ang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph cnn for learning on p oint clouds. ACM T ransactions on Graphics (TOG) 38 (5), 1–12 (2019) 73. W eiler, M., Geiger, M., W elling, M., Boomsma, W., Cohen, T.: 3d steerable cnns: Learning rotationally equiv ariant features in volumetric data. In: Adv ances in Neu- ral Information Pro cessing Systems. pp. 10381–10392 (2018) 74. W eiler, M., Hamprec ht, F.A., Storath, M.: Learning steerable ﬁlters for rotation equiv arian t cnns. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018) QE-Net works 19 75. W orrall, D., Brosto w, G.: Cub enet: Equiv ariance to 3d rotation and translation. In: The Europ ean Conference on Computer Vision (ECCV) (September 2018) 76. W orrall, D.E., Garbin, S.J., T urm ukhambetov, D., Brostow, G.J.: Harmonic net- w orks: Deep translation and rotation equiv ariance. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017) 77. W u, Z., Song, S., Khosla, A., Y u, F., Zhang, L., T ang, X., Xiao, J.: 3d shapenets: A deep representation for v olumetric shapes. In: Pro ceedings of the IEEE conference on computer vision and pattern recognition. pp. 1912–1920 (2015) 78. Xin yi, Z., Chen, L.: Capsule graph neural net work. In: In ternational Conference on Learning Represen tations (ICLR) (2019), openreview.net/forum?id=Byl8BnRcYm 79. Y ou, Y., Lou, Y., Liu, Q., T ai, Y.W., Ma, L., Lu, C., W ang, W.: Poin twise rotation- in v ariant netw ork with adaptiv e sampling and 3d spherical vo xel con volution. In: AAAI. pp. 12717–12724 (2020) 80. Y uan, W., Held, D., Mertz, C., Heb ert, M.: Iterative transformer netw ork for 3d p oin t cloud. arXiv preprint arXiv:1811.11209 (2018) 81. Zaheer, M., Kottur, S., Rav anbakhsh, S., P o czos, B., Salakh utdinov, R.R., Smola, A.J.: Deep sets. In: Adv ances in Neural Information Pro cessing Systems (2017) 82. Zhang, X., Qin, S., Xu, Y., Xu, H.: Quaternion product units for deep learning on 3d rotation groups. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7304–7313 (2020) 83. Zhao, Y., Birdal, T., Deng, H., T om bari, F.: 3d p oint capsule netw orks. In: Con- ference on Computer Vision and P attern Recognition (CVPR) (2019) 20 Y. Zhao et al. A Pro of of Prop osition 1 Before presen ting the proof we recall the three individual statements contained in Prop. 1: 1. A ( g ◦ S , w ) is left-equiv ariant: A ( g ◦ S , w ) = g ◦ A ( S , w ). 2. Op erator A is in v ariant under p ermutations: A ( { q σ (1) , . . . , q σ ( Q ) } , w σ ) = A ( { q 1 , . . . , q Q } , w ). 3. The transformations g ∈ H 1 preserv e the geo desic distance δ ( · ). Pr o of. W e will prov e the prop ositions in order. 1. W e start by transforming each element and replace q i b y ( g ◦ q i ) of the cost deﬁned in Eq. 4 of the main paper: q > Mq = q >  Q X i =1 w i q i q > i  q (6) = q >  Q X i =1 w i ( g ◦ q i )( g ◦ q i ) >  q (7) = q >  Q X i =1 w i Gq i q > i G >  q (8) = q >  GM 1 G > + · · · + GM Q G >  q = q > G  M 1 G > + · · · + M Q G >  q (9) = q > G  M 1 + · · · + M Q  G > q (10) = q > GMG > q (11) = p > Mp , (12) where M i = w i q i q > i and p = G > q . F rom orthogonallity of G it follows p = G − 1 q = ⇒ g ◦ p = q and hence g ◦ A ( S , w ) = A ( g ◦ S , w ). 2. The proof follows trivially from the permutation in v ariance of the symmetric summation operator ov er the outer products in Eq (9). 3. It is suﬃcien t to sho w that | q > 1 q 2 | = | ( g ◦ q 1 ) > ( g ◦ q 2 ) | for any g ∈ H 1 : | ( g ◦ q 1 ) > ( g ◦ q 2 ) | = | q > 1 G > Gq 2 | (13) = | q > 1 Iq 2 | (14) = | q > 1 q 2 | , (15) where g ◦ q ≡ Gq . The result is a direct consequence of the orthonormality of G . QE-Net works 21 B Pro of of Lemma 1 W e will b egin b y recalling some preliminary deﬁnitions and results that aid us to construct the connection betw een the dynamic routing and the W eiszfeld algorithm. Deﬁnition 9 (Aﬃne Subspace) A d -dimensional aﬃne subsp ac e of R N is obtaine d by a tr anslation of a d -dimensional line ar subsp ac e V ⊂ R N such that the origin is include d in S : S = n d +1 X i =1 α i x i | d +1 X i =1 α i = 1 o . (16) Simplest choic es for S involve p oints, lines and planes of the Euclide an sp ac e. Deﬁnition 10 (Orthogonal Pro jection onto an Aﬃne Subspace) An or- tho gonal pr oje ction of a p oint x ∈ R N onto an aﬃne subsp ac e explaine d by the p air ( A , c ) is deﬁne d as: Π i ( x ) , pr oj S ( x ) = c + A ( x − c ) . (17) c denotes the tr anslation to make origin inclusive and A is a pr oje ction matrix typic al ly deﬁne d via the orthonormal b ases of the subsp ac e. Deﬁnition 11 (Distance to Aﬃne Subspaces) Distanc e fr om a given p oint x to a set of aﬃne subsp ac es { S 1 , S 2 . . . S k } c an b e written as [4]: C ( x ) = k X i =1 d ( x , S i ) = k X i =1 k x − pr oj S i ( x ) k 2 . (18) Lemma 2. Given that al l the antip o dal c ounterp arts ar e mapp e d to the northern hemispher e, we wil l now think of the unit quaternion or versor as the unit normal of a four dimensional hyp erplane h , p assing thr ough the origin: h i ( x ) = q > i x + q d := 0 . (19) q d is an adde d term to c omp ensate for the shift. When q d = 0 the origin is incident to the hyp erplane. With this p ersp e ctive, quaternion q i forms an aﬃne subsp ac e with d = 4 , for which the pr oje ction op er ator takes the form: pr oj S i ( p ) = ( I − q i q > i ) p (20) Pr o of. W e consider Eq (20) for the case where c = 0 and A = ( I − qq > ). The former follows from the fact that our subspaces by construction pass through the origin. Th us, we only need to sho w that the matrix A = I − qq > is an orthogonal pro jection matrix on to the aﬃne subspace spanned by q . T o this end, it is suﬃcient to v alidate that A is symmetric and idemp otent: A > A = AA = A 2 = A . Note that by construction q > q is a symmetric matrix and hence A 22 Y. Zhao et al. itself. Using this prop erty and the unit-ness of the quaternion, we arriv e at the pro of: A > A = ( I − qq > ) > ( I − qq > ) (21) = ( I − qq > )( I − qq > ) (22) = I − 2 qq > + qq > qq > (23) = I − 2 qq > + qq > (24) = I − qq > , A (25) It is easy to verify that the pro jections are orthogonal to the quaternion that deﬁnes the subspace by showing pro j S ( q ) > q = 0: q > pro j S ( q ) = q > Aq = q > ( I − qq > ) q = q > ( q − qq > q ) = q > ( q − q ) = 0 . (26) Also note that this choice corresp onds to tr( qq > ) = P d +1 i =1 α i = 1. Lemma 3. The quaternion me an we suggest to use in the main p ap er [48] is e quivalent to the Euclide an Weiszfeld me an on the aﬃne quaternion subsp ac es. Pr o of. W e no w recall and summarize the L q -W eiszfeld Algorithm on aﬃne sub- spaces [4], whic h minimizes a q -norm v ariant of the cost deﬁned in Eq (18): C q ( x ) = k X i =1 d ( x , S i ) = k X i =1 k x − pro j S i ( x ) k q . (27) Deﬁning M i = I − A i , Alg. 3 summarizes the iterative pro cedure. Algorithm 3: L q W eiszfeld Algorithm on Aﬃne Subspaces [4]. 1 input: An initial guess x 0 that do es not lie any of the subspaces { S i } , Pro jection op erators Π i , the norm parameter q 2 x t ← x 0 3 while not c onver ge d do 4 Compute the weigh ts w t = { w t i } : w t i = k M i ( x t − c i ) k q − 2 ∀ i = 1 . . . k (28) 5 Solv e: x t +1 = arg min x ∈ R N k X i =1 w t i k M i ( x − c i ) k 2 (29) Note that when q = 2, the algorithm reduces to the computation of a non- w eighted mean ( w i = 1 ∀ i ), and a closed form solution exists for Eq (29) and is QE-Net works 23 giv en by the normal equations: x =  k X i =1 w i M i  − 1  k X i =1 w i M i c i  (30) F or the case of our quaternionic subspaces c = 0 and w e seek the solution that satisﬁes:  k X i =1 M i  x =  1 k k X i =1 M i  x = 0 . (31) It is w ell kno wn that the solution to this equation under the constrain t k x k = 1 lies in n ullspace of M = 1 k k P i =1 M i and can be obtained by taking the singular v ec- tor of M that corresponds to the largest singular v alue. Since M i is idemp otent, the same result can also b e obtained through the eigendecomp osition: q ? = arg max q ∈S 3 qMq (32) whic h gives us the unw eigh ted Quaternion mean [48]. C Pro of of Theorem 1 Once the Lemma 1 is prov en, we only need to apply the direct conv ergence results from the literature. Consider a set of p oints Y = { y 1 . . . y K } where K > 2 and y i ∈ H 1 . Due to the compactness, w e can sp eak of a ball B ( o , ρ ) encapsulating all y i . W e also deﬁne the D = { x ∈ H 1 | C q ( x ) < C q ( o ) } , the region where the loss decreases. W e ﬁrst state the assumptions that permit our theoretical result. These as- sumptions are required b y the w orks that establish the con vergence of such W eiszfeld algorithms [2,3] : H1. y 1 . . . y K should not lie on a single geo desic of the quaternion manifold. H2. D is b ounded and compact. The top ological structure of S O (3) imp oses a b ounded con vexit y radius of ρ < π / 2. H3. The minimizer in Eq (29) is contin uous. H4. The weigh ting function σ ( · ) is conca ve and diﬀerentiable. H5. Initial quaternion (in our netw ork chosen randomly) does not b elong to any of the subspaces. Note that H5 is not a strict requiremen t as there are multiple wa ys to circum- v ent (simplest b eing a re-initialization). Under these assumptions, the sequence pro duced b y Eq (29) will con verge to a critical p oint unless x t = y i for an y t and i [3]. F or q = 1, this critical point is on one of the subspaces sp eciﬁed in Eq (19) and th us is a geometric median. u t Note that due to the assumption H2 , we cannot conv erge from an y giv en p oin t. F or randomly initialized netw orks this is indeed a problem and do es not 24 Y. Zhao et al. QEC$ - Module ! ⊂ ℳ , QEC$ - Module QEC$ - Module QEC$ - Module QEC$ - Module QEC$ - Module Downsample Downsample Intermediate *Capsules Pose% Estimation Classification ! ! $ ⊂ ℳ 4 " ! ! " " " Classification Quaternion$Equi variant$Capsule$Arc hitecture Fig. 5. Our siamese architecture used in the estimation of relative p oses. W e use a shared net work to process tw o distinct p oint clouds ( X , Y ) to arrive at the laten t represen tations ( C X , α X ) and ( C Y , α Y ) resp ectively . W e then lo ok for the highest activ ated capsules in both p oint sets and compute the rotation from the corresp onding capsules. Thanks to the rotations disentangled into capsules, this ﬁnal step simpliﬁes to a relative quaternion calculation. guaran tee practical conv ergence. Y et, in our exp erimen ts w e ha ve not observed an y issue with the conv ergence of our dynamic routing. As our result is one of the few ones related to the analysis of DR, we still ﬁnd this to b e an important ﬁrst step. F or diﬀeren t choices of q : 1 ≤ q ≤ 2, the weigh ts tak e diﬀerent forms. In fact, this IRLS t yp e of algorithm is shown to conv erge for a larger class of weigh ting c hoices as long as the aforemen tioned conditions are met. That is wh y in practice w e use a simple sigmoid function. D F urther Discussions On c onver genc e, runtime and c omplexity. Note that while the con ver- gence basin is kno wn, to the best of our kno wledge, a conv ergence rate for a W eiszfeld algorithm in aﬃne subspaces is not established. F rom the literature of robust minimization via Riemannian gradient descen t (this is essentially the corresp onding particle optimizer), w e conjecture that suc h a rate dep ends upon the c hoice of the con vex regime (in this case 1 ≤ q ≤ 2) and is at b est linear – though we did not prov e this conjecture. In practice w e run the W eiszfeld iteration only 3 times, similar to the original dynamic routing. This is at least suﬃcien t to conv erge to a p oint goo d enough for the netw ork to explain the data at hand. QEC module summarized in the Alg. 2 of the main pap er can b e dissected in to three main steps: (i) canonicalization of the lo cal oriented p oint set, (ii) the t -k ernel and (iii) dynamic routing. Ov erall the total computational complexit y reads O ( L + K C M LP + C DR ) where C M LP and C DR are the computational costs of the MLP and the DR resp ectively: C DR = LM + M ( K + k (2 L ) + L ) = M ( K + 2( k + 1) L ) C M LP = 64 N c + 4 M N 2 c . (33) QE-Net works 25 chair sofa toilet bed monitor table night_stand bathtub dresser desk chair sofa toilet bed monitor table night_stand bathtub dresser desk 0 20 40 60 80 Classification Accuracy (%) (a) c hair monit or bed t oi l e t sof a un i t sp he r e (b) Fig. 6. (a) Confusion matrix on Mo delNet10 for classiﬁcation. (b) Distribution of initial p oses p er class. Note that Eq (33) depicts the complexity of a single QEC mo dule. In our ar- c hitecture we use a stac k of those eac h of which cause an added increase in the complexit y prop ortional to the n umber of p oints downsampled. Our weigh ted quaternion a v erage relies upon a diﬀeren tiable SVD. While not increasing the theoretical computational complexit y , when done naiv ely , this op eration can cause signiﬁcant increase in run time. Hence, we compute the SVD using CUDA k ernels in a batch-wise manner. This batch-wise SVD mak es it p ossible to av erage a large amount of quaternions with high eﬃciency . Note that w e omit the computational asp ects of LRF calculation as we consider it to b e an input to our system and diﬀeren t LRFs exhibit diﬀeren t costs. W e hav e further conducted a run time analysis in the 3D Shap e Classiﬁc ation exp erimen t on an Nvidia GeF orce R TX 2080 Ti with the net work conﬁguration men tioned in Sec. 5.2 of the main pap er. During training, each batch (where batc h size b = 8) takes 0 . 226s and 1939M of GPU memory . During inference, pro cessing eac h instance takes 0 . 036s and consumes 1107M of GPU memory . Note that the use of LRFs helps us to restrict the rotation group to certain elemen ts and th us w e can use net works with signiﬁcan tly less parameters (as lo w as 0 . 44 M ) compared to others as sho wn in T ab. 1 of the main pap er. Num- b er of parameters in our netw ork dep ends up on the num ber of classes, e.g. for Mo delNet10 w e hav e 0 . 047 M parameters. Quaternion ambiguity. Quaternions of the northern and southern hemi- spheres represen t the same exact rotation, hence one of them is r e dundant . By mapping one hemisphere to the other, we sacriﬁce the closeness of the manifold. This could slightly distort the b ehavior of the linearization op erator around the Ecuador. How ever, the rest of the op erations such as geodesic distances re- sp ect suc h antipo dality , as w e consider the Quaternionic manifold and not the sphere. When the subset of operations we develop and the nature of local refer- ence frames are concerned, w e did not ﬁnd this transformation to cause serious shortcomings. 26 Y. Zhao et al. ( a ) I np ut P oint C lou d ( b ) I ni ti al L RF s ( c ) L RF s Prior t o Q E - Net w or k - 1 ( d ) M ult i - ch an ne l LRF s Prior t o Q E - Net w or k - 2 Fig. 7. Additional intermediate results on car (ﬁrst row) and c hair (second ro w) ob jects. This ﬁgure supplements Fig. 1(a) of the main pap er. Performanc e on diﬀer ent shap es with same orientation. The NR/NR scenario in T ab. 1 of the main paper in volv es classiﬁcation on diﬀeren t shapes within a category without rotation, e . g . c hairs with diﬀerent shap es. In adden- dum, w e no w provide in Fig. 6(b) an additional insigh t in to the p ose distribution for all canonicalized ob jects within a class. T o do so, we rotate the horizontal standard basis vector e x = [1 , 0 , 0] using the predict quaternion (the most ac- tiv ated output capsule) and plot the resulting point on a unit sphere as shown in Fig. 6(b). A qualitativ e observ ation rev eals that for all ﬁv e non-symmetric classes, the p oses of all the instances within a class w ould form a cluster. This roughly holds across all classes and indicates that the relativ e p ose information is consistent within the classes. On the other hand, ob jects with symmetries form m ultiple clusters. E Our Siamese Arc hitecture F or estimation of the relativ e p ose with supervision, we b eneﬁt from a Siamese v ariation of our net w ork. In this case, latent capsule represen tations of t wo point sets X and Y jointly con tribute to the pose regression as sho wn in Fig. 5. W e sho w additional results from the computation of lo cal reference frames and the m ulti-channel capsules deduced from our netw ork in Fig. 7. F Additional Details on Ev aluations Details on the evaluation pr oto c ol. F or Mo delnet40 dataset used in T ab. 1, we stick to the oﬃcial split with 9,843 shapes for training and 2,468 diﬀerent shap es for testing. F or rotation estimation in T ab. 2, w e again used the oﬃcial Mo delenet10 dataset split with 3991 for training and 908 shap es for testing. 3D p oint clouds (10K p oints) are randomly sampled from the mesh surfaces of eac h shap e [55,57]. The ob jects in training and testing dataset are diﬀerent, but QE-Net works 27 Fig. 8. Additional pairwise shape alignment on more categories in Modelnet10 dataset. W e do not p erform any ICP and the transformations that align the t wo p oint clouds are direct results of the forw ard pass of our Siamese netw ork. they are from the same categories so that they can be orien ted meaningfully . During training, we did not augment the dataset with random rotations. All the shap es are trained with single orien tation (well-aligned). W e call this tr aine d with NR . During testing, we randomly generate m ultiple arbitrary S O (3) rotations for each shap e and ev aluate the a verage p erformance for all the rotations. This is called test with AR . This protocol is used in b oth our algorithms and the baselines. Confusion of classiﬁc ation in Mo delNet. T o provide additional insigh t in to how our activ ation features p erform, w e now rep ort the confusion matrix in the task of classiﬁcation on the all the ob jects of ModelNet10. Unique to our algorithm, the classiﬁcation and rotation estimation reinforces one another. As seen from Fig. 6(a) on the righ t, the ﬁrst ﬁve categories that exhibit less rotational symmetry has the higher classiﬁcation accuracy than their rotationally symmetric coun terparts. Distribution of err ors r ep orte d in T ab. 2. W e now pro vide more details on the errors attained by our algorithm as w ell as the state of the art. T o this end, w e rep ort, in Fig. 9 the histogram of errors that fall within quan tized ranges of 28 Y. Zhao et al. orien tation errors. It is noticeable that our Siamese architecture b ehav es b est in terms of estimating the ob jects rotation. F or completeness, we also included the results of the v arian ts presented in our ablation studies: Ours-2kLRF, Ours- 1kLRF. They ev aluate the model on the re-calculated LRFs in order to sho w the robustness to w ards to v arious point densities. W e hav e also modiﬁed IT- Net and Poin tNetLK only to predict rotation b ecause the original works predict b oth rotations and translations. Finally , note here that we do not use data augmen tation for training our net w orks (see AR), while both for Poin tNetLK and for I T-Net w e do use augmen tation. Ours Ours-sia Ours-2kLRF Ours-1kLRF IT-net PointNetLK Mean-LRF PCA < 1 5 ° 23.18% 58.16% 53.64% 41.36% 49.68% 45.40% 16.24% 29.80% < 3 0 ° 49.53% 79.12% 76.88% 66.40% 68.40% 49.28% 33.84% 32.00% < 6 0 ° 69.87% 87.48% 86.68% 78.76% 79.48% 55.84% 57.64% 33.80% 0 10 20 30 40 50 60 70 80 90 100 Percentage (%) Fig. 9. Cum ulative error histograms of rotation estimation on Mo delNet10. Each row ( < θ ◦ ) of this extended table shows the p ercentage of shapes that hav e rotation error less than θ . The colors of the bars corresp ond to the ro ws they reside in. The higher the errors are con tained in the ﬁrst bins (light blue) the b etter. Vice versa, the more the errors are clustered tow ard the 60 ◦ the worse the performance of the method.

Quaternion Equivariant Capsule Networks for 3D Point Clouds

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment