Deep Learning for 2D and 3D Rotatable Data: An Overview of Methods
Deep Learning for 2D and 3D Rotatable Data: An Overview of Methods

Luca DELLA LIBERA, Vladimir GOLKOV*, Yue ZHU, Arman MIELKE and Daniel CREMERS
Computer Vision Group, Technical University of Munich, Germany
{luca.della, vladimir.golkov, yue.zhu, arman.mielke, cremers}@tum.de
* Contact Author

Abstract

Convolutional networks are successful due to their equivariance/invariance under translations. However, rotatable data such as images, volumes, shapes, or point clouds require processing with equivariance/invariance under rotations in cases where the rotational orientation of the coordinate system does not affect the meaning of the data (e.g. object classification). On the other hand, estimation/processing of rotations is necessary in cases where rotations are important (e.g. motion estimation). There has been recent progress in methods and theory in all these regards. Here we provide an overview of existing methods, both for 2D and 3D rotations (and translations), and identify commonalities and links between them.

1 Introduction

Rotational and translational equivariance play an important role in image recognition tasks. Convolutional neural networks (CNNs) are translationally equivariant: the convolution of a translated image with a filter is equivalent to the convolution of the untranslated image with the same filter, followed by a translation. Unfortunately, standard CNNs do not have an analogous property for rotations.

A naive attempt to achieve rotational equivariance/invariance is data augmentation. Its major problem is that rotational equivariance in the data but not in the network architecture forces the network to learn each object orientation "from scratch" and hampers generalization. Methods that achieve rotational equivariance/invariance in more advanced ways have appeared recently. Apart from methods that are invariant under rotations of the input (i.e. where rotation "must not matter"), we also include examples of methods that can return rotations as output, as well as methods that use rotations as input and/or as deep features (i.e. where rotation matters).

This work is structured as follows. In Section 2 we introduce important mathematical concepts such as equivariance and steerability. In Sections 3–4 we present the main approaches used to achieve rotational equivariance. In Section 5.1 we categorize concrete methods that use those approaches to achieve equivariance/invariance. We also categorize methods that can return a rotation as output in Section 5.2 and methods that use rotations as input and/or deep features in Section 5.3. Finally, we draw conclusions in Section 6. The mathematical concepts (Section 2) serve as a foundation for the best (i.e. exact and most general) equivariant approach (Section 3.3).

2 Formal Definitions

2.1 Equivariance

Definition 1. A function f : 𝒳 → 𝒴 is equivariant under a group G (with some group actions π and ψ of G that transform 𝒳 and 𝒴, respectively) if

    f(π_g[x]) = ψ_g[f(x)]   ∀g ∈ G, ∀x ∈ 𝒳,   (1)

where π_g is the action of g on 𝒳, i.e. a transformation (for example rotation) of the input of f, and ψ_g is the action of g on 𝒴, i.e. an "associated" (via the same g) but possibly different transformation (for example rotation of the image and of the feature space) of the output of f.

In other words, for each transformation π_g that modifies the input of f, we know a transformation ψ_g that happens to the output of f (due to transforming the input by π_g), without the need to know the input x. The usage of g ∈ G to associate ψ_g with π_g is important for correct composition of several transformations. For example, if π_g is a 180° rotation, i.e. π_g π_g is the identity mapping, then ψ_g ψ_g should also be the identity mapping.
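As an illustrative sketch (ours, not from the paper), Definition 1 can be checked numerically for the translation case mentioned in the introduction: circular convolution with a fixed filter satisfies Eq. (1) exactly with G the cyclic translations and ψ_g = π_g, while a generic filter has no analogous property for rotations. Image size, filter, and shift below are arbitrary choices.

```python
# Illustrative sketch (not from the paper): numerically checking Eq. (1)
# for f = circular convolution with a fixed filter, G = cyclic translations,
# and psi_g = pi_g. Image size, filter, and shift are arbitrary choices.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
filt = rng.standard_normal((3, 3))

def f(x):
    # boundary="wrap" makes this a circular convolution, so translation
    # equivariance holds exactly (up to floating-point error).
    return convolve2d(x, filt, mode="same", boundary="wrap")

def pi(x, g):
    # Group action of a cyclic translation g = (dy, dx).
    return np.roll(x, shift=g, axis=(0, 1))

g = (3, 5)
lhs = f(pi(image, g))   # f(pi_g[x])
rhs = pi(f(image), g)   # psi_g[f(x)], here with psi_g = pi_g
print(np.allclose(lhs, rhs))  # True: exact translational equivariance

# Standard convolution has no analogous property for rotations:
print(np.allclose(f(np.rot90(image)), np.rot90(f(image))))  # False for a generic filter
```

The second check fails precisely because a random filter is not rotation-symmetric; the approaches in Sections 3–4 are different ways of restoring such a rotation property.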
If the content of a rectangular image is rotated (and/or translated), the "field of view" changes, i.e. features that used to be in the corners disappear and new features appear in the corners. This has caused some confusion as to how rectangular images can be processed in a rotation-equivariant way. The explanation is the following: an output value of the neural network is only affected if the change of features is within its receptive field and the network has not learned from data that such a feature change should be irrelevant for the output. Similarly, if two input features are rotated relative to each other, an output value changes only if both input features are within its receptive field and the network has not learned that such a relative rotation should be processed equivariantly.

Definition 2. A special case of equivariance is same-equivariance [Dieleman et al., 2016], when ψ = π. In some sources, same-equivariance is called equivariance, and what we call equivariance is called covariance.

Definition 3. A special case of equivariance is invariance, when ψ = I, the identity.

Definition 4. Equivariance is exact if Eq. (1) holds strictly, approximate (for example approximated through learning) if Eq. (1) holds approximately.

2.2 Steerability

Definition 5. A function f : 𝒳 → 𝒴 is steerable if rotated versions of f can be expressed using linear combinations of a fixed set of basis functions h_j for j = 1, …, M, that is:

    f(π[x]) = ∑_{j=1}^{M} k_j(π) h_j(x),   (2)

where π is a rotation and the k_j are complex-valued rotation-dependent steering factors.
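As a quick numerical sketch (ours, not from the paper), Eq. (2) can be verified on a sampling grid for the classic steerable filter discussed next, the first x-derivative of a 2D Gaussian: its copy rotated by θ is an exact linear combination of the 0° and 90° copies, with steering factors cos(θ) and sin(θ). Grid resolution and the angle θ are arbitrary choices.

```python
# Illustrative sketch (not from the paper): verifying steerability (Eq. (2))
# for the first x-derivative of a 2D Gaussian on a sampling grid.
# Grid resolution and the angle theta are arbitrary choices.
import numpy as np

xs = np.linspace(-3.0, 3.0, 121)
x, y = np.meshgrid(xs, xs)

def G_x(u, v):
    # d/du of exp(-(u^2 + v^2)), the non-normalized Gaussian derivative.
    return -2.0 * u * np.exp(-(u**2 + v**2))

theta = np.deg2rad(30)

# Rotated filter: evaluate G_x at rotated coordinates pi_theta[x, y].
u = np.cos(theta) * x + np.sin(theta) * y
v = -np.sin(theta) * x + np.cos(theta) * y
rotated = G_x(u, v)

# Steered filter: linear combination of the basis filters h_1 = G_x at 0 deg
# and h_2 = G_x at 90 deg, with steering factors k_1 = cos(theta), k_2 = sin(theta).
basis_0 = G_x(x, y)     # -2*x*exp(-(x^2 + y^2))
basis_90 = G_x(y, -x)   # -2*y*exp(-(x^2 + y^2))
steered = np.cos(theta) * basis_0 + np.sin(theta) * basis_90

print(np.allclose(rotated, steered))  # True: steering is exact
```

The identity holds analytically (not only on the grid), which is why convolution with such basis filters is rotationally equivariant; the same property underlies the circular-harmonic filter banks mentioned below.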
For example, if we consider a standard non-normalized 2D Gaussian G(x, y) = e^{−(x²+y²)}, its first derivative G_x(x, y) = ∂G/∂x (x, y) in the x direction can be steered to an arbitrary orientation θ through a linear combination of G_x^{0°}(x, y) = G_x(x, y) = −2x e^{−(x²+y²)} and G_x^{90°}(x, y) = G_x(π_{90°}[x, y]) = −2y e^{−(x²+y²)}:

    G_x^{θ}(x, y) = G_x(π_θ[x, y]) = cos(θ) G_x^{0°}(x, y) + sin(θ) G_x^{90°}(x, y).   (3)

For the case θ = 30°, this reads:

    G_x^{30°}(x, y) = cos(30°) G_x^{0°}(x, y) + sin(30°) G_x^{90°}(x, y) ≈ 0.87 G_x^{0°}(x, y) + 0.5 G_x^{90°}(x, y).   (4)

A useful consequence of steerability is that convolution of an image with the basis filters h_j is rotationally equivariant. The mapping ψ in Eq. (1) in this case corresponds to a linear combination of the feature maps. As a side note, the mapping ψ can also be a certain kind of linear combination even if the filters are not a basis of a steerable filter, see Section 3.3.

A harmonic function f is a twice continuously differentiable function that satisfies Laplace's equation, i.e. ∇²f = 0. Circular harmonics and spherical harmonics are defined on the circle and the sphere, respectively, and are similar to the Fourier series (i.e. sinusoids with different frequencies). 2D or 3D rotational equivariance can be hardwired into the network architecture by restricting the filters' angular component to belong to the circular harmonic or spherical harmonic family, respectively. The proof utilizes steerability properties of such filter banks. The radial profile of these filters, on the other hand, can be learned. There are various techniques to parameterize the radial profile (with learnable parameters).

2.3 Group Convolution

Definition 6. The group convolution [Cohen and Welling, 2016, Esteves et al., 2018a] between a feature map F and a filter W is defined as

    (F ⋆_G W)(x) = ∫_{g∈G} F(φ_g[η]) W(φ_g^{−1}[x]) dg,   (5)

where G is a group, g ∈ G is a group element with group action φ_g, and η is typically a canonical element in the domain of F (e.g. the origin if the domain is ℝⁿ). The group convolution can be shown [Esteves et al., 2018b, Kondor and Trivedi, 2018] to be equivariant. The ordinary convolution is a special case of the group convolution.

3 Approaches that Guarantee Exact Rotational Equivariance

In this section we list and briefly discuss the approaches used to achieve exact rotational equivariance/invariance. The state-of-the-art approach is described in Section 3.3.

3.1 Hardwired Pose Normalization

A basic approach to address the problem of rotational invariance consists in trying to "erase" the effect of rotations by reverting the input to a canonical pose via a hardwired reversion function such as PCA. Problems are: small noise can strongly influence the result, especially for objects with symmetries; learned low-level feature detectors do not generalize to other orientations.

3.2 Handcrafted Features

Extractors of simple rotationally invariant features can be handcrafted rather than learned. Examples include features based on distances between pairs of points (SE(n)-invariant), and/or between each point and the origin (invariant under rotations around the origin). Handcrafted feature extractors are not trained, i.e. not optimal.

3.3 General Linear Equivariant Mappings and Equivariant Nonlinearities

In many situations, the best approach is given by the most general methods that guarantee equivariance [Kondor and Trivedi, 2018, Weiler et al., 2018a, Cohen et al., 2019]. Several formulations are available.

When using so-called irreducible representations of the rotation group, the equivariant mapping corresponds to the usage of steerable filters (see Section 2.2).
This enables equivariance under all of the infinitely many rotation angles, but pointwise nonlinearities such as ReLU are not equivariant in this basis. Only other, special nonlinearities are equivariant. (See Section 3.3 of [Cohen et al., 2019] for an overview of equivariant nonlinearities, equivariant batch normalization, and equivariant residual learning.)

On the other hand, when using regular representations, the mapping corresponds to the group convolution, Eq. (5), i.e. the usage of a finite number of rotated versions of arbitrary filters. This enables equivariance only under a discrete subgroup (e.g. 45° rotations in a plane) of the rotation group, but pointwise nonlinearities like ReLU are equivariant.

For 2D rotations, results are best in practice when group convolutions with small rotation angles and pointwise nonlinearities are used [Weiler and Cesa, 2019]. For the most common purposes, this is arguably the best of all approaches: it guarantees exact equivariance (unlike Section 4), without an obligation to use untrainable feature extractors (unlike Section 3.2) or unstable pose normalization (unlike Section 3.1).

Note that the exactness of the equivariance is slightly reduced when data are discretized to a pixel/voxel grid. Deeper layers might further amplify the impact of the angle between the grid and the object on the features. A part of this "missing part of equivariance" can be learned from training data (see Section 4.1). If the trained network has such "partially learned" equivariance, it is not obvious how it achieves it, i.e. the group action ψ of intermediate layers is not known, unlike in the case of exact equivariance.

4 Approaches to Learn Approximate Rotational Equivariance

Various approaches exist that facilitate the learning of approximate (inexact) rotational equivariance.
Note that for datasets where exact equivariance is appropriate, approaches that provide exact equivariance (Section 3.3) usually work better.

4.1 Data Augmentation

Data augmentation (i.e. random rotations of samples during training) is the most naive approach to deal with rotational equivariance. Such rotational equivariance in the training data but not in the network architecture forces the network to learn to recognize each orientation of each object part "from scratch" and hampers generalization.

4.2 Learned Pose Normalization

Instead of hardwiring a pose normalization function as described in Section 3.1, it is possible to force or encourage the network to learn a reversion function directly from the training data. As an example of the "encourage" case, in spatial transformer networks [Jaderberg et al., 2015], learning a pose normalization is a facilitated but not a guaranteed side effect of learning to classify.

4.3 Soft Constraints

Another approach to let the network learn rotational equivariance/invariance is to introduce additional soft constraints, which are typically expressed by auxiliary loss functions that are added to the main loss function. For example, a similarity loss [Coors et al., 2018] can be defined, which penalizes large distances between the predictions or feature embeddings of rotated copies of the input that are simultaneously fed into separate streams of a siamese network.

The advantage of this approach is the ease of implementation. Furthermore, it can be used in combination with other approaches that provide non-exact equivariance (e.g. pose normalization) in order to enhance it. The disadvantage is that equivariance/invariance is only approximate. The quality of the approximation depends on the loss formula, training data, network architecture and optimization algorithm.

4.4 Deformable Convolution

The deformable convolution [Dai et al., 2017] augments a CNN's capability of modeling geometric transformations. Input-dependent offsets are added to the sampling locations of the standard convolution. The offsets are computed by applying an additional convolutional layer over the same input feature map. Bilinear image interpolation is used due to non-integer offsets. The advantage of this approach is that it can learn to handle very general transformations such as rotation, scaling, and deformation, if training data encourage this. The disadvantage is that there is no guarantee of equivariance.

5 Overview of Methods

In this section we list and categorize deep learning methods for handling rotatable data and rotations. This includes methods that are equivariant under rotations of the input, methods that output a rotation, and methods that use rotations as inputs and/or deep features.

5.1 Equivariance under Rotations of the Input

Methods that are equivariant under rotations of the input are categorized in Table 1 (for 2D rotations) and Table 2 (for 3D rotations) according to the following criteria:
• Input:
  – Pixel grid: a grid representation of 2D image data
  – Voxel grid: a grid representation of 3D volumetric data
  – Point cloud: a set of 3D point coordinates
  – Spherical signal: a function defined on the sphere
  – Polygon mesh: a collection of vertices, edges, and faces that describes a surface consisting of polygons
  – dMRI (6D): six-dimensional diffusion-weighted magnetic resonance images [Müller et al., 2021]
• Approach: see Definition 4 and Sections 3–4
• Property: equivariance (Definition 1) or invariance (Definition 3)
• Group:
  – SO(2): the group of 2D rotations
  – SE(2): the group of 2D rigid-body motions
  – SO(3): the group of 3D rotations
  – SE(3): the group of 3D rigid-body motions
• Cardinality: continuous (entire group) or discretized to specific angles

5.2 Rotations as Output

Examples of deep learning methods that output a (3D) rotation are categorized in Table 3 and Table 4 according to the following characteristics:
• Input to the network that outputs the rotation, and according rotation-prediction task:
  – Image: The task is to estimate the orientation of a depicted object relative to the camera.
  – Cropped stereo image: A pair of images is taken at the same time from two cameras that are close together and point in the same direction. The images are cropped in a predetermined fashion. Each pair of cropped images constitutes one input. The position of the cameras relative to each other is fixed. The task is to estimate the orientation of a depicted object relative to the cameras.
  – Volumetric data: The task is to estimate the orientation of an object relative to volume coordinates.
  – Slice of volumetric data: The task is to estimate the orientation of a 2D slice relative to an entire (predefined) 3D object.
  – Video: Two or three or more images (video frames). The task is to estimate the relative rotation and translation of the camera between (not necessarily consecutive) frames.
• Number of objects/rotations: Describes how many rotations the network outputs.
  – One rigid object: One rotation is estimated that is associated with a rigid object. The visible "object" is the entire scene in cases where camera motion relative to a static scene is estimated. Other small moving objects can be additionally accounted for (for example to refine the optical-flow estimation), but their rotation is not estimated.
  – Hierarchy of object parts: Relative rotations between the objects and their parts are estimated, with several hierarchy levels, i.e. an "object" consisting of parts can itself be one of several parts of a "higher-level" object. Among the methods listed in Tables 3–4, only capsule networks belong to this category.
• Specialization:
  – Specialized on one object: The network can only process one type of object on which it was trained.
  – Specialized on multiple objects: The network can process an object from an arbitrarily large but fixed set of object types on which it was trained.
  – Generalizing to new objects: The network can generalize to unseen types of objects.
• Group: SO(3) or SE(3)
• Representation (embedding) of the rotation(s):
  – Rotation matrix
  – Quaternion
  – Euler angles
  – Axis-angle representation
  – Discrete bins: Rotations are grouped into a finite set of bins.
  – Transformation matrix
  – 3D coordinates of four keypoints on the object (predefined object-specifically, e.g. four of its corners)
  – Eight corners and centroid of 3D bounding box projected into 2D image space
  – Learned representation: The latent space (e.g. of an autoencoder) is used to represent the rotation. There are several interesting aspects at play:
    * Learned representations allow for ambiguity: If an object looks very similar from two angles and the loss allows for it, the network can learn to use the same encoding to represent both rotations. On the other hand, unambiguous representations (like the ones listed above) would require generative/probabilistic models to deal with ambiguity.
    * Certain representations are encouraged due to the overall network architecture. For example, in capsule networks, learned representations of rotation are processed in a very specific way (multiplied by learned transformation matrices).
    * Features other than rotation might be entangled into the learned representation. This is not even always discouraged. For example, in capsule networks, the learned representation may also contain other object features such as color.

Method | Approach | Property | Group | Cardinality
Many | Learned (data augmentation) | * | * | *
Spatial Transformer Networks [Jaderberg et al., 2015] | Learned (learned pose normalization) | Invariance | SE(2) | Continuous
Cyclic Symmetry in CNNs [Dieleman et al., 2016] | Exact | Equivariance | SE(2) | Discretized (90° angles)
Group Equivariant CNNs [Cohen and Welling, 2016] | Exact | Equivariance | SE(2) | Discretized (90° angles)
Harmonic Networks [Worrall et al., 2017] | Exact | Equivariance | SE(2) | Continuous
Vector Field Networks [Marcos et al., 2017] | Exact | Equivariance | SE(2) | Discretized (any angle)
Oriented Response Networks [Zhou et al., 2017b] | Exact | Equivariance | SE(2) | Discretized (any angle)
Deformable CNNs [Dai et al., 2017] | Learned (deformable convolution) | Equivariance | SE(2) | Continuous
Polar Transformer Networks [Esteves et al., 2018b] | Learned (learned pose normalization) | Equivariance | SE(2) | Continuous
Steerable Filter CNNs [Weiler et al., 2018b] | Exact | Equivariance | SE(2) | Discretized (any angle)
Learning invariance with weak supervision [Coors et al., 2018] | Learned (soft constraints) | Invariance | SE(2) | Continuous
Roto-Translation Covariant CNNs [Bekkers et al., 2018] | Exact | Invariance | SE(2) | Discretized (any angle)
RotDCF: Decomposition of Convolutional Filters [Cheng et al., 2019] | Exact | Equivariance | SE(2) | Discretized (any angle)
RiCNN [Chidester et al., 2018] | Exact | Invariance | SO(2) | Discretized (any angle)
Siamese Equivariant Embedding [Véges et al., 2019] | Learned (soft constraints) | Equivariance | SO(2) | Continuous
CNN model of primary visual cortex [Ecker et al., 2019] | Exact | Equivariance | SE(2) | Discretized (any angle)
General Steerable CNNs [Weiler and Cesa, 2019] | Exact | Equivariance | SE(2) | Continuous

Table 1: Methods with equivariance under 2D rotations. The terminology is summarized in Section 5.1. The input to each method is an image. In Polar Transformer Networks, the image is transformed to a circular signal in an intermediate layer. Methods with identical cell entries differ in terms of details. Potential weaknesses are highlighted. General Steerable CNNs and their implementation in the e2cnn [Weiler and Cesa, 2019] library are the "best" in that they provide various hyperparameter choices, with the other exact methods being special cases thereof (see Section 3.3).

Method | Input | Approach | Property | Group | Cardinality
Many | * | Learned (data augmentation) | * | * | *
Spatial Transformer Networks [Jaderberg et al., 2015] | Voxel grid | Learned (learned pose normalization) | Invariance | SE(3) | Continuous
Equivariant Representations [Esteves et al., 2018a] | Spherical signal | Exact | Equivariance | SO(3) | Continuous
Spherical CNNs [Cohen et al., 2018] | Spherical signal | Exact | Equivariance | SO(3) | Continuous
Tensor Field Networks [Thomas et al., 2018] | Point cloud | Exact | Equivariance | SE(3) | Continuous
N-body Networks [Kondor, 2018] | Point cloud | Exact | Equivariance | SO(3) | Continuous
CubeNet [Worrall and Brostow, 2018] | Voxel grid | Exact | Equivariance | SE(3) | Discretized (90° angles)
3D G-CNNs [Winkels and Cohen, 2018] | Voxel grid | Exact | Equivariance | SE(3) | Discretized (90°/180° angles)
3D Steerable CNNs [Weiler et al., 2018a] | Voxel grid | Exact | Equivariance | SE(3) | Continuous
PPF-FoldNet [Deng et al., 2018] | Point cloud | Exact (handcrafted features) | Invariance | SE(3) | Continuous
Gauge Equivariant Mesh CNNs [de Haan et al., 2021] | Polygon mesh | Exact | Equivariance | SE(3) | Continuous
SE(3)-Equivariant DL for dMRI [Müller et al., 2021] | dMRI (6D) | Exact | Equivariance | SE(3) | Continuous

Table 2: Methods with equivariance under 3D rotations. The terminology is summarized in Section 5.1. Potential weaknesses are highlighted. Tensor Field Networks are the "best" for point clouds in that they provide continuous exact SE(3)-equivariance. Similarly, 3D Steerable CNNs are the "best" neural networks for voxel grids. On the other hand, CubeNet and 3D G-CNNs offer only discrete rotations but are compatible with nonlinearities such as ReLU. The exact methods for 3D data are available via the e3nn [Geiger et al., 2020] library.

• Loss function. We distinguish the following categories:
  – Rotations are estimated at the output layer. The loss measures the similarity to ground truth rotations of training samples. These methods are listed in Table 3.
    * Geodesic distance between prediction and ground truth: This loss is rotationally invariant, i.e. the network miscalculating a rotation by 10° always results in the same loss value, regardless of the ground truth rotation and of the direction into which the prediction is biased.
    * L_p distance in embedding space: This loss value is fast to compute but not rotationally invariant, i.e. an error of 10° yields different loss values depending on the ground truth and on the prediction. Due to this "unfairness"/"arbitrariness", such losses are highlighted in the table.

Method | Input | Specialization | Group | Embedding | Loss function
PoseNet [Kendall et al., 2015] | Image | One object | SE(3) | Quaternion | L2 distance in embedding space
Relative camera pose estimation using CNNs [Melekhov et al., 2017] | Video (two non-consecutive frames) | Generalizing to new objects | SE(3) | Quaternion | L2 distance in embedding space
3D pose regression using CNNs [Mahendran et al., 2017] | Image | Multiple objects | SO(3) | Axis-angle or quaternion | Geodesic distance
Real-time seamless single shot 6D object pose prediction [Tekin et al., 2018] | Image | Multiple objects | SE(3) | Bounding box | Squared L2 distance in embedding space
Registration of a slice to a predefined volume [Mohseni Salehi et al., 2019] | Slice of volume | One object | SE(3) | Axis-angle representation | Geodesic distance¹
Registration of a volume to another, predefined volume [Mohseni Salehi et al., 2019] | Volume | One object | SE(3) | Axis-angle representation | Geodesic distance¹
SSD-AF [Pandey et al., 2018] | Cropped stereo image | Multiple objects | SE(3) | Various² | Smoothed L1 distance in embedding space
Learning local RGB-to-CAD correspondences [Georgakis et al., 2019] | Image and 3D model | Multiple objects | SE(3) | Rotation matrix | Squared L2 distance in embedding space

¹ Initially squared L2 distance in embedding space (fast to compute); then geodesic distance for rotation and squared L2 distance for translation.
² Each method from the SSD-AF family uses a different embedding: discrete bins, four keypoint locations in 3D space, quaternion, Euler angles.

Table 3: Examples of deep learning methods that can output a 3D rotation, where a ground truth rotation is used for training. The terminology is summarized in Section 5.2. Losses that lack rotational invariance are highlighted.

Method | Input | Specialization | Group | Embedding | Loss function
Capsule Networks [Sabour et al., 2017] | Image | Generalizing to new objects | SE(3) | Transformation matrices¹ | Object classification
Spatial Transformer Networks [Jaderberg et al., 2015] | Volume | Generalizing to new objects | SE(3) | Transformation matrix | Object classification
Unsupervised learning of depth and ego-motion [Zhou et al., 2017a] | Video (three consecutive frames) | Generalizing to new objects | SE(3) | Euler angles | View warping
Learning implicit representations of 3D object orientations from RGB [Sundermeyer et al., 2018] | Image | One object | SO(3) | Learned representation | Autoencoder
GeoNet [Yin and Shi, 2018] | Video (several consecutive frames) | Generalizing to new objects | SE(3) | Euler angles | View warping

¹ Poses of lowest-level object parts: learned representation; part-to-object pose transformations: transformation matrices (as trainable parameters).

Table 4: Examples of deep learning methods that can output a 3D rotation, where a ground truth rotation is not necessary for training. The terminology is summarized in Section 5.2. Losses are highlighted for which a good loss value does not guarantee a good prediction of rotations.

  – Rotations are estimated in an intermediate layer and used in subsequent layers for a "higher-level" goal of a larger system. Ground truth rotations are not required. These methods are listed in Table 4.
    * Object classification: Rotation prediction is trained as part of a larger system for object classification. The estimated rotation is used to rotate the input or feature map (in spatial transformer networks) or predicted poses (in capsule networks) as an intermediate processing step. It is assumed that learning to rotate to a canonical pose (in spatial transformer networks) or to let object parts vote about the overall object pose (in capsule networks) is beneficial for object classification. The estimation of rotation is incidental and encouraged by the overall setup. However, its approximate correctness is not necessary for perfect object classification. Therefore, the "predicted rotation" can be very wrong, and due to this danger this loss is highlighted in the table.
    * View warping: At least two video frames are used to estimate the scene geometry (depth maps) and camera motion between the views (rotation, translation). These estimates are used to warp one view (image, and possibly depth map) to resemble another view. The loss measures this resemblance. This is a form of self-supervised learning: ground truth geometry and motion are not given, but are estimated such that they cause warping that is consistent with the input images.
The rotation estimation can be expected to be good, because it is necessary for good view synthesis.
    * Autoencoder reconstruction loss: The network is trained to reconstruct its input (a view of the object) after passing it through a lower-dimensional latent space. The output target has a neutral image background and lacks other objects that were visible in the input image. This allows the network to learn to discard the information about the background and other objects before the bottleneck layer. If the network is specialized on one object, then maintaining in the latent space only the information about the object pose is sufficient for such reconstruction. If additionally the latent space is sufficiently low-dimensional, then the learning is encouraged to be economic about the amount of information encoded in the latent space, i.e. to encode nothing but the pose.

Estimation of 2D rotations is simpler in terms of representation. Predicting the sine and cosine of the rotation angle (and normalizing the predicted vector to length 1, because otherwise the predicted sine and cosine might slightly contradict each other, or be beyond [−1, 1]) is better than predicting the angle, because the latter requires learning a function that has a jump (from 360° to 0°), which is not easy for (non-generative) neural networks.

5.3 Rotations as Input or as Deep Features

Other uses of rotations in deep learning are to take rotations as input, or to restrict deep features to belong to SO(n) (without requiring them to directly approximate rotations present in the data). For example, [Huang et al., 2017] use rotation matrices as inputs and as deep features. They restrict deep features to SO(3) by using layers that map from SO(3) to SO(3).

6 Conclusions

Among the methods for equivariance/invariance, the most successful ones are based on the exact and most general approach (Section 3.3). They are very effective in 3D input domains as well.
With emerging theory [Kondor and Trivedi, 2018, Cohen et al., 2019] for exact equivariance and with emerging approaches, it appears to be the perfect time to use the methods in various application domains and to tune them. Existing pipelines that do not have (exact) equivariance yet and for example rely on data augmentation are likely to benefit from incorporating exact-equivariance approaches.

Acknowledgments

We thank Erik Bekkers, Maurice Weiler, Gabriele Cesa, Antonij Golkov, Taco Cohen, Christine Allen-Blanchette, Qadeer Khan, Philip Müller, Philip Häusser, and Remco Duits for valuable discussions. This manuscript was supported by the ERC Consolidator Grant "3DReloaded", the Munich Center for Machine Learning (Grant No. 01IS18036B), and the BMBF project MLwin.

References

[Bekkers et al., 2018] E. J. Bekkers, M. W. Lafarge, M. Veta, K. A. J. Eppenhof, J. P. W. Pluim, and R. Duits. Roto-translation covariant convolutional networks for medical image analysis. In MICCAI, pages 440–448, 2018.
[Cheng et al., 2019] X. Cheng, Q. Qiu, R. Calderbank, and G. Sapiro. RotDCF: decomposition of convolutional filters for rotation-equivariant deep networks. In ICLR, 2019.
[Chidester et al., 2018] B. Chidester, M. N. Do, and J. Ma. Rotation equivariance and invariance in convolutional neural networks, 2018.
[Cohen and Welling, 2016] T. S. Cohen and M. Welling. Group equivariant convolutional networks. In ICML, pages 2990–2999, 2016.
[Cohen et al., 2018] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical CNNs. In ICLR, 2018.
[Cohen et al., 2019] T. S. Cohen, M. Geiger, and M. Weiler. A general theory of equivariant CNNs on homogeneous spaces. In NeurIPS, pages 9142–9153, 2019.
[Coors et al., 2018] B. Coors, A. Condurache, A. Mertins, and A. Geiger. Learning transformation invariant representations with weak supervision. In VISAPP, pages 64–72, 2018.
[Dai et al., 2017] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017.
[de Haan et al., 2021] P. de Haan, M. Weiler, T. Cohen, and M. Welling. Gauge equivariant mesh CNNs: Anisotropic convolutions on geometric graphs. In ICLR, 2021.
[Deng et al., 2018] H. Deng, T. Birdal, and S. Ilic. PPF-FoldNet: unsupervised learning of rotation invariant 3D local descriptors. In ECCV, pages 620–638, 2018.
[Dieleman et al., 2016] S. Dieleman, J. De Fauw, and K. Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. In ICML, pages 1889–1898, 2016.
[Ecker et al., 2019] A. S. Ecker, F. H. Sinz, E. Froudarakis, P. G. Fahey, S. A. Cadena, E. Y. Walker, E. Cobos, J. Reimer, A. S. Tolias, and M. Bethge. A rotation-equivariant convolutional neural network model of primary visual cortex. In ICLR, 2019.
[Esteves et al., 2018a] C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Daniilidis. Learning SO(3) equivariant representations with spherical CNNs. In ECCV, pages 54–70, 2018.
[Esteves et al., 2018b] C. Esteves, C. Allen-Blanchette, X. Zhou, and K. Daniilidis. Polar transformer networks. In ICLR, 2018.
[Geiger et al., 2020] M. Geiger, T. Smidt, Alby M., B. K. Miller, W. Boomsma, B. Dice, K. Lapchevskyi, M. Weiler, M. Tyszkiewicz, S. Batzner, M. Uhrin, J. Frellsen, N. Jung, S. Sanborn, J. Rackers, and M. Bailey. Euclidean neural networks: e3nn, 2020.
[Georgakis et al., 2019] G. Georgakis, S. Karanam, Z. Wu, and J. Kosecka. Learning local RGB-to-CAD correspondences for object pose estimation. In ICCV, pages 8967–8976, 2019.
[Huang et al., 2017] Z. Huang, C. Wan, T. Probst, and L. V. Gool. Deep learning on Lie groups for skeleton-based action recognition. In CVPR, pages 1243–1252, 2017.
[Jaderberg et al., 2015] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NeurIPS, pages 2017–2025, 2015.
[Kendall et al., 2015] A. Kendall, M. K. Grimes, and R. Cipolla. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In ICCV, pages 2938–2946, 2015.
[Kondor and Trivedi, 2018] R. Kondor and S. Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In ICML, pages 2747–2755, 2018.
[Kondor, 2018] R. Kondor. N-body networks: a covariant hierarchical neural network architecture for learning atomic potentials, 2018.
[Mahendran et al., 2017] S. Mahendran, H. Ali, and R. Vidal. 3D pose regression using convolutional neural networks. In CVPR Workshops, pages 494–495, 2017.
[Marcos et al., 2017] D. Marcos, M. Volpi, N. Komodakis, and D. Tuia. Rotation equivariant vector field networks. In ICCV, pages 5058–5067, 2017.
[Melekhov et al., 2017] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu. Relative camera pose estimation using convolutional neural networks. In ACIVS, pages 675–687, 2017.
[Mohseni Salehi et al., 2019] S. S. Mohseni Salehi, S. Khan, D. Erdogmus, and A. Gholipour. Real-time deep pose estimation with geodesic loss for image-to-template rigid registration. IEEE Trans Med Imaging, pages 470–481, 2019.
[Müller et al., 2021] P. Müller, V. Golkov, V. Tomassini, and D. Cremers. Rotation-equivariant deep learning for diffusion MRI. arXiv preprint, 2021.
[Pandey et al., 2018] R. Pandey, P. Pidlypenskyi, S. Yang, and C. Kaeser-Chen. Efficient 6-DoF tracking of handheld objects from an egocentric viewpoint. In ECCV, pages 426–441, 2018.
[Sabour et al., 2017] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In NeurIPS, pages 3856–3866, 2017.
[Sundermeyer et al., 2018] M. Sundermeyer, E. Y. Puang, Z.-C. Marton, M. Durner, and R. Triebel.
Learning implicit representations of 3D object orientations from RGB. In ICRA , 2018. 6 [ T ekin et al. , 2018 ] B. T ekin, S. N. Sinha, and P . Fua. Real- time seamless single shot 6D object pose prediction. In CVPR , pages 292–301, 2018. 5 [ Thomas et al. , 2018 ] N. Thomas, T . Smidt, S. M. Kearnes, L. Y ang, L. Li, K. K ohlhoff, and P . Riley . T ensor field networks: rotation- and translation-equiv ariant neural net- works for 3D point clouds. , 2018. 5 [ Véges et al. , 2019 ] M. Véges, V . V arga, and A. L ˝ orincz. 3D human pose estimation with siamese equiv ariant embed- ding. Neur ocomputing , pages 194 – 201, 2019. 4 [ W eiler and Cesa, 2019 ] M. W eiler and G. Cesa. General E(2)-equiv ariant steerable CNNs. In NeurIPS , pages 14334– 45, 2019. 2, 4 [ W eiler et al. , 2018a ] M. W eiler , M. Geiger , M. W elling, W . Boomsma, and T . Cohen. 3D steerable CNNs: learn- ing rotationally equiv ariant features in volumetric data. In NeurIPS , pages 10381–10392, 2018. 2, 5 [ W eiler et al. , 2018b ] M. W eiler , F . A. Hamprecht, and M. Storath. Learning steerable filters for rotation equiv ari- ant CNNs. In CVPR , pages 849–858, 2018. 4 [ W inkels and Cohen, 2018 ] M. W inkels and T . S. Cohen. 3D G-CNNs for pulmonary nodule detection. In MIDL , 2018. 5 [ W orrall and Brostow , 2018 ] D. W orrall and G. Brostow . CubeNet: equiv ariance to 3D rotation and translation. In ECCV , pages 567–584, 2018. 5 [ W orrall et al. , 2017 ] D. E. W orrall, S. J. Garbin, D. T ur- mukhambetov , and G. J. Brosto w . Harmonic networks: deep translation and rotation equiv ariance. In CVPR , pages 7168–7177, 2017. 4 [ Y in and Shi, 2018 ] Z. Y in and J. Shi. GeoNet: Unsupervised learning of dense depth, optical flo w and camera pose. In CVPR , pages 1983–1992, 2018. 6 [ Zhou et al. , 2017a ] T . Zhou, M. Brown, N. Snavely, and D. G. Lo we. Unsupervised learning of depth and ego- motion from video. In CVPR , pages 6612–6619, 2017. 6 [ Zhou et al. , 2017b ] Y . Zhou, Q. Y e, Q. 
Qiu, and J. Jiao. Ori- ented response networks. In CVPR , pages 4961–70, 2017. 4 8