MoGlow: Probabilistic and controllable motion synthesis using normalising flows
MoGlow: Probabilistic and Controllable Motion Synthesis Using Normalising Flows

GUSTAV EJE HENTER∗, SIMON ALEXANDERSON∗, and JONAS BESKOW, Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden

Fig. 1. Probabilistic motion generation. Random samples from our method can give many distinct output motions even if the input signal is the same.

Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive, task-specific assumptions regarding the motion or the character morphology. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method outperforms task-agnostic baselines and attains a motion quality close to recorded motion capture.

CCS Concepts: • Computing methodologies → Animation; Neural networks; Motion capture.

Additional Key Words and Phrases: Generative models, machine learning, normalising flows, Glow, footstep analysis, data dropout

ACM Reference Format:
Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. 2020.
MoGlow: Probabilistic and Controllable Motion Synthesis Using Normalising Flows. ACM Trans. Graph. 39, 4, Article 236 (November 2020), 14 pages. https://doi.org/10.1145/3414685.3417836

∗Gustav Eje Henter and Simon Alexanderson contributed equally and are joint first authors.

Authors' address: Gustav Eje Henter, ghe@kth.se; Simon Alexanderson, simonal@kth.se; Jonas Beskow, beskow@kth.se, Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2020 Copyright held by the owner/author(s).
0730-0301/2020/11-ART236
https://doi.org/10.1145/3414685.3417836

1 INTRODUCTION

A recurring problem in fields such as computer animation, video games, and artificial agents is how to generate convincing motion conditioned on high-level, "weak" control parameters. Video-game characters, for example, should be able to display a wide range of motions controlled by game-pad inputs, and embodied agents should generate complex non-verbal behaviours based on, e.g., semantic and prosodic cues. The advent of deep learning and the growing availability of large motion-capture databases have increased the interest in data-driven, statistical models for generating motion. Given that the control signal is weak, a fundamental challenge for such models is to handle the large variation of possible outputs – the limbs of a real person walking the same path twice will always follow different trajectories.
Deterministic models of motion, which return a single predicted motion, suffer from regression to the mean pose and produce artefacts like foot sliding in the case of gait. They also lack motion diversity, leading to repetitive and non-engaging characters in applications. Taken together, we are led to conclude that for motion generated from the model to be perceived as realistic, it cannot be completely deterministic; the model should instead generate different motions upon each subsequent invocation, given the same control signal. In other words, a stochastic model is required. Furthermore, real-time interactive systems such as video games require models with the lowest possible latency.

This paper introduces MoGlow, a novel autoregressive architecture for generating motion-data sequences based on normalising flows [Deco and Brauer 1994; Dinh et al. 2015, 2017; Huang et al. 2018; Kingma and Dhariwal 2018]. This new modelling paradigm has the following principal advantages:

(1) It is probabilistic, meaning that it endeavours to describe not just one motion, but all possible motions, and how likely each possibility is. Plausible motion samples can then be generated also in the absence of conclusive control-signal input (Fig. 1).
(2) It uses an implicit model structure [Mohamed and Lakshminarayanan 2016] to parameterise distributions. This makes it fast to sample from without assuming that observed values follow restrictive, low-degree-of-freedom parametric families such as Gaussians or their mixtures, as done in, e.g., Fragkiadaki et al. [2015]; Uria et al. [2015].
(3) It allows exact and tractable probability computation, unlike variational autoencoders (VAEs) [Kingma and Welling 2014; Rezende et al.
2014], and can be trained to maximise likelihood directly, unlike generative adversarial networks (GANs) [Goodfellow 2016; Goodfellow et al. 2014].
(4) It is task-agnostic – that is, it does not rely on restrictive, situational assumptions such as characters being bipedal or motion being quasi-periodic (unlike, e.g., Holden et al. [2017]).
(5) It generates output sequentially and permits control schemes for the output motion with no algorithmic latency.
(6) It is capable of generating high-quality motion both in objective terms and as judged by human observers.

To the best of our knowledge, our proposal is the first motion model based on normalising flows. We evaluate our method on locomotion synthesis for two radically different morphologies – humans and dogs – since locomotion makes it easy to quantify artefacts and spot poor adherence to the control. A video presentation of our work is available on YouTube, with more information on our project page.

2 BACKGROUND AND PRIOR WORK

Mathematically, motion generation requires creating a sequence of poses from control input. We here review (Sec. 2.1) probabilistic machine-learning models of sequences, and then describe (Secs. 2.2 and 2.3) prior work on machine learning for motion synthesis.

2.1 Probabilistic generative sequence models

Probabilistic sequence models for continuous-valued data have a long history, with linear autoregressive models being an early example [Yule 1927]. Model flexibility improved with the introduction of hidden-state models like HMMs [Rabiner 1989] and Kalman filters [Welch and Bishop 1995], both of which still allow efficient probability computation (inference). Deep learning extended autoregressive models of continuous-valued data further by enabling highly nonlinear dependencies on previous observations, for example Fragkiadaki et al. [2015]; Graves [2013]; Uria et al.
[2015]; Zen and Senior [2014], as well as nonlinear (continuous-valued) hidden-state evolution through recurrent neural networks, e.g., Hochreiter and Schmidhuber [1997]. All of these model classes have been extensively applied to sequence-modelling tasks, but have consistently failed to produce high-quality random samples for complicated data such as motion and speech. We attribute this shortcoming to the explicit distributional assumptions (e.g., Gaussianity) common to all these models – real data, e.g., motion capture, is seldom Gaussian.

Three methods for relaxing the above distributional constraints have gained recent interest. The first is to quantise the data and then fit a discrete model to it. Deep autoregressive models on quantised data, such as Kalchbrenner et al. [2018]; Salimans et al. [2017]; van den Oord et al. [2016, 2017]; Wang et al. [2018], are the state of the art in many low-dimensional (R³ or less) sequence-modelling problems. However, it is not clear if these approaches scale up to motion data, with 50 or more dimensions. Quantisation may also introduce perceptual artefacts. A second approach is variational autoencoders [Kingma and Welling 2014; Rezende et al. 2014], which optimise a variational lower bound on model likelihood while simultaneously learning to perform approximate inference. The gap between the true maximum likelihood and that achieved by VAEs has been found to be significant [Cremer et al. 2018].

The third approach is GANs [Goodfellow 2016; Goodfellow et al. 2014], which generate samples from complicated distributions implicitly, by passing simple random noise through a nonlinear neural network. As GAN architectures do not allow inference, they are instead trained via a game against an adversary. GANs have produced some very impressive results in image generation [Brock et al.
2019], illustrating the power of implicit sample generation, but their optimisation is fraught with difficulty [Lucic et al. 2018; Mescheder et al. 2018]. GAN output quality usually improves by artificially reducing the generator entropy during sampling, compared to sampling from the distribution actually learned from the data, cf. Brock et al. [2019]. This is often referred to as "reducing the temperature".

While VAEs in principle have a partially-implicit generator structure, an issue dubbed "posterior collapse" means that VAEs with strong decoders, which can represent highly flexible distributions given the latent variable, tend to learn models where latent variables have little impact on the output distribution [Chen et al. 2017; Huszár 2017; Rubenstein 2019]. This largely nullifies the benefits of the implicit parts of the generator, leading to blurry and noisy output.

This article considers a less explored methodology called normalising flows [Deco and Brauer 1994; Dinh et al. 2015, 2017; Huang et al. 2018] (no relation to optical flow), especially a variant called Glow [Kingma and Dhariwal 2018], which, like GANs and quantisation, gained attention for highly realistic-looking image samples. We believe normalising flows offer the best of both worlds, combining a basis in likelihood maximisation and efficient inference like VAEs with purely implicit generator structures like GANs. Consequently, our paper presents one of the first Glow-based sequence models, and the first to our knowledge to combine autoregression and control, as well as to integrate long memory via a hidden state.

The most closely-related methods are WaveGlow [Prenger et al. 2019] and FloWaveNet [Kim et al. 2019] for audio waveforms and VideoFlow [Kumar et al. 2020] for video. We extend these in several novel directions: Unlike Kim et al. [2019]; Prenger et al.
[2019], our architecture is autoregressive ("closed-loop"), avoiding costly dilated convolutions and continuity issues (e.g., blocking artefacts) common in open-loop systems, cf. Juvela et al. [2019]. Unlike Kumar et al. [2020], our architecture permits output control. In contrast to all three models, we add a recurrent hidden state to enable long memory, which significantly improves the model. We also consider data dropout to increase adherence to the control signal.

2.2 Deterministic data-driven motion synthesis

While traditional motion synthesis uses concatenative approaches such as motion graphs [Arikan and Forsyth 2002; Kovar and Gleicher 2004; Kovar et al. 2002], there has been a strong trend towards statistical approaches. These can roughly be categorised into deterministic and probabilistic methods. Deterministic methods yield a single prediction for a given scenario, whereas probabilistic methods attempt to describe a range of possible motions. Deterministically predicted pose sequences usually quickly regress towards the mean pose, cf. Ferstl et al. [2019]; Fragkiadaki et al. [2015], since that is the a-priori (i.e., no-information) minimiser of the MSE. Such methods thus require additional information to disambiguate pose predictions. Sometimes adding an external control signal suffices – lip motion, for example, is highly predictable from speech and has been successfully modelled with deterministic methods [Karras et al. 2017; Suwajanakorn et al. 2017; Taylor et al. 2017]. Locomotion generation represents a more challenging task, where path-based motion control does not suffice to unambiguously define the overall motion, and simple MSE minimisation results in characters that "float" along the control path.
Proposals to overcome this issue in deterministic models include learning and predicting foot contacts [Holden et al. 2016], or the phase [Holden et al. 2017] or pace [Pavllo et al. 2018] of the gait cycle. Starke et al. [2020] generalised the idea of motion phase to complex motion by letting each bone in a character follow a separate motion phase. Autoregressively feeding in previously-generated poses might help combat regression to the mean, and has been used in motion generation without control inputs [Bütepage et al. 2017; Fragkiadaki et al. 2015; Zhou et al. 2018]. Zhang et al. [2018] use a similar approach to generate controllable quadruped motion, letting autoregressive and control information modify network weights, and demonstrate successful generation of both cyclic motion (gait) and simple non-cyclic motion such as jumping.

For many types of motion, no information is readily available that successfully disambiguates motion predictions. One example is co-speech gestures like head and hand motion, where the motion is unstructured and aperiodic and the dependence on the control signal (speech acoustics or transcriptions) is weak and nonlinear. The absence of strongly predictive input information means that deterministic motion-generation methods such as Ding et al. [2015]; Hasegawa et al. [2018]; Kucherenko et al. [2019]; Yoon et al. [2019] largely fail to produce distinct and lifelike motion.

2.3 Probabilistic data-driven motion synthesis

Probabilistic models represent another path to avoid collapsing on a mean pose: by building models of all plausible pose sequences given the available information (prior poses and/or control inputs), any randomly-sampled output sequence should represent convincing motion. As discussed in Sec. 2.1, many older models assume a Gaussian or Gaussian-mixture distribution for poses given the state of the process, for example the (hidden) LSTM state.
Conditional restricted Boltzmann machines (cRBMs) [Taylor and Hinton 2009; Taylor et al. 2011] are one example of this. The hidden state can also be made probabilistic. Examples include the SHMMs used for motion generation in Brand and Hertzmann [2000], locally linear models like switching linear dynamic systems (SLDSs) [Bregler 1997; Murphy 1998], Gaussian process latent-variable models (GP-LVMs) [Lawrence 2005], and VAEs [Kingma and Welling 2014; Rezende et al. 2014]. Locally linear models were used for motion synthesis in Chai and Hodgins [2005]; Pavlović et al. [2000], but have primarily been applied in recognition tasks. GP-LVMs and the closely related Gaussian process dynamical models (GPDMs) have been extensively studied in motion generation [Grochow et al. 2004; Levine et al. 2012; Wang et al. 2008] but they – along with other kernel-based motion-generation methods such as the radial basis functions (RBFs) in Kovar and Gleicher [2004]; Mukai and Kuriyama [2005]; Rose et al. [1998] – are unattractive in the big-data era since their memory and computation demands scale quadratically (or worse) in the number of training examples. VAEs circumvent computational issues by using a variational and amortised (see Cremer et al. [2018]) approximation of the likelihood for training. They have been applied to model controllable human locomotion [Habibie et al. 2017; Ling et al. 2020] and to generate head motion from speech [Greenwood et al. 2017a,b]. Ling et al. [2020] describe an autoregressive unconditional motion model based on VAEs, using a deterministic decoder based on the mixture-of-experts architecture from Zhang et al. [2018]. β-VAEs [Higgins et al. 2016] are used to mitigate posterior collapse, while scheduled sampling [Bengio et al. 2015] is necessary to stabilise long-term motion generation.
Reinforcement learning is used to enable character control, although response time is somewhat sluggish. Notably, many VAE methods either generate noisy motion samples (e.g., Taylor et al. [2011]) or choose not to sample from the (Gaussian) observation distribution given the latent state of the process, instead generating only the mean of the conditional Gaussian [Greenwood et al. 2017a,b; Ling et al. 2020]. This risks re-introducing mean collapse and artificially reduces output entropy. We take this as evidence that these methods failed to learn an accurate and convincing motion distribution.

Variations of GANs [Sadoughi and Busso 2018] and adversarial training [Ferstl et al. 2019; Starke et al. 2020; Wang et al. 2019] have also been applied for motion generation and the related task of generating speech-driven video of talking faces [Pham et al. 2018; Pumarola et al. 2018; Vougioukas et al. 2018, 2020]. In contrast to GANs and VAEs, Starke et al. [2020] add latent-space noise to motion only at synthesis time (not during training), to obtain more varied motion, albeit at the expense of deviating from the desired input control. This approach also means that the distribution of the motion is not learned, and need not match that of natural motion.

Unlike previously-cited probabilistic motion-generation methods, GANs do not assume that observations are Gaussian given the state of the data-generating process. This avoids both regression towards the mean and Gaussian noise in output samples. The same goes for the discretisation-based approach in Sadoughi and Busso [2019], which learns a probabilistic model that triggers motion sequences from a fixed motion library. We consider another method for avoiding Gaussian assumptions, by introducing the first probabilistic motion model based on normalising flows. In contrast to MVAEs [Ling et al.
2020], our method can model conditional motion distributions, and so has controllability built in.

3 METHOD

This section introduces our new probabilistic motion model. The basic idea is to treat motion as a series of poses, and model these poses using an autoregressive model. In other words, we describe the conditional probability distribution of the next pose in the sequence as a function of previous poses and relevant control inputs. Like in a conditional GAN, the next pose of the motion is generated by drawing a random sample from a simple distribution such as a Gaussian, and then nonlinearly transforming that sample by passing it through a neural network. This has the effect of reshaping the simple starting distribution into a more complex distribution that fits the distribution of the next pose in the data. However, unlike a GAN, the neural network we use is invertible, which allows us to directly compute and maximise the likelihood of the data under the model. This makes the model stable to train.

Fig. 2. Glow steps 𝒇ₙ⁻¹ during inference. Detail of coupling layer on right.

We now introduce basic notation and (in Sec. 3.1) describe how to construct normalising flows. Secs. 3.2 and 3.3 then detail, step by step, how to build a controllable autoregressive sequence model out of such flows. For notation, we write vectors, and sequences thereof, in bold font. Upper case is used for random variables and matrices, and lower case for deterministic quantities or specific outcomes of the random variables.
In particular, 𝑿 typically represents randomly-distributed motion with 𝒙 ∈ ℝ^{𝐷×𝑇} being an outcome of the same, while 𝒄 ∈ ℝ^{𝐶×𝑇} represents the matching control-signal inputs, which in our experiments are relative and rotational velocities that describe motion along a path on the ground plane. Non-bold capital letters generally denote indexing ranges, with matching lower-case letters representing the indices themselves, e.g., 𝑡 ∈ {1, ..., 𝑇}. Indices into sequences extract specific time frames, for example individual poses 𝒙_𝑡 ∈ ℝ^𝐷, or sub-sequences 𝒙_{1:𝑡} = [𝒙_1, ..., 𝒙_𝑡]. Each pose parameterises the positions and orientations of objects such as a whole body, parts of a body, or keypoints on a body or face. In this paper, the pose vector 𝒙_𝑡 is created by concatenating vectors that represent either joint positions or joint rotations on a 3D skeleton.

3.1 Normalising flows and Glow

Normalising flows are flexible generative models that allow both efficient sampling and efficient inference. The idea is to subject samples from a simple, fixed base (or latent) distribution 𝒁 on ℝ^𝐷 to an invertible and differentiable nonlinear transformation 𝒇 : ℝ^𝐷 → ℝ^𝐷, in order to produce samples from a new, more complex distribution 𝑿. If this transformation has many degrees of freedom, a wide variety of different distributions can be described.

Flows construct expressive transformations 𝒇 by chaining together numerous simpler nonlinear transformations {𝒇_𝑛}_{𝑛=1}^{𝑁}, each of them parameterised by a 𝜽_𝑛 such that 𝜽 = {𝜽_𝑛}_{𝑛=1}^{𝑁}. We define the observable random variable 𝑿, the latent random variable 𝒁 ∼ 𝒩(𝟎, 𝑰), and intermediate variables 𝒁_𝑛 as follows:

$$\boldsymbol{z} = \boldsymbol{z}_N \xrightarrow{\boldsymbol{f}_N} \boldsymbol{z}_{N-1} \xrightarrow{\boldsymbol{f}_{N-1}} \cdots \xrightarrow{\boldsymbol{f}_2} \boldsymbol{z}_1 \xrightarrow{\boldsymbol{f}_1} \boldsymbol{z}_0 = \boldsymbol{x} \tag{1}$$
$$\boldsymbol{x} = \boldsymbol{f}(\boldsymbol{z}) = \boldsymbol{f}_1 \circ \boldsymbol{f}_2 \circ \cdots \circ \boldsymbol{f}_N(\boldsymbol{z}) \tag{2}$$
$$\boldsymbol{z}_n(\boldsymbol{x}) = \boldsymbol{f}_n^{-1} \circ \cdots \circ \boldsymbol{f}_1^{-1}(\boldsymbol{x}). \tag{3}$$

The sequence of (inverse) transformations 𝒇_𝑛^{−1} in (3) is known as a normalising flow, since it transforms the distribution 𝑿 into an isotropic standard normal random variable 𝒁.

Similar to the generators in GANs, normalising flows are implicit probabilistic models according to the definition in Mohamed and Lakshminarayanan [2016]. While explicit models draw samples from probability density functions defined in the space of the observations, GANs and normalising flows instead generate output by drawing samples 𝒛 from a latent base distribution 𝒁 that acts as a source of entropy, and then subjecting these samples to a deterministic, nonlinear transformation 𝒇 to obtain samples 𝒙 = 𝒇(𝒛) from 𝑿. Unlike GANs, however, normalising flows permit fast and easy probability computation (inference), since the transformation 𝒇 is invertible: using the change-of-variables formula, we can write the log-likelihood of a sample 𝒙, as used in likelihood maximisation, as

$$\ln p_{\boldsymbol{\theta}}(\boldsymbol{x}) = \ln p_{\mathcal{N}}(\boldsymbol{z}_N(\boldsymbol{x})) + \sum_{n=1}^{N} \ln \left\lvert \det \frac{\partial \boldsymbol{z}_n(\boldsymbol{x})}{\partial \boldsymbol{z}_{n-1}} \right\rvert, \tag{4}$$

where ∂𝒛_𝑛(𝒙)/∂𝒛_{𝑛−1} is the Jacobian matrix of 𝒇_𝑛^{−1} at 𝒙, which depends on 𝜽, and p_𝒩 is the probability density function of the 𝐷-dimensional standard normal distribution. The general determinant in (4) has computational complexity close to O(𝐷³), so many improvements to normalising flows involve the development of 𝒇_𝑛-transformations with tractable Jacobian determinants that nonetheless yield highly flexible transformations under iterated composition. An in-depth review of normalising flows and different flow architectures can be found in Papamakarios et al. [2019]. In this work, we consider the Glow architecture [Kingma and Dhariwal 2018], first developed for images, and extend it to model controllable motion sequences.
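To make Eqs. (1)–(4) concrete, the following is a minimal NumPy sketch of a flow built from elementwise affine steps. This is a toy stand-in, not the Glow architecture described below: the step shapes and parameter values are purely illustrative. The inverse direction accumulates the log-determinant terms of Eq. (4):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 4, 3  # data dimension and number of flow steps (illustrative sizes)

# Each step f_n maps z_n -> z_{n-1} = s_n * z_n + t_n (elementwise affine)
s = rng.uniform(0.5, 2.0, size=(N, D))
t = rng.normal(size=(N, D))

def forward(z):
    """Sampling direction of Eq. (2): latent z = z_N through f_N, ..., f_1 to x = z_0."""
    for n in reversed(range(N)):  # f_N is applied first, f_1 last
        z = s[n] * z + t[n]
    return z

def log_likelihood(x):
    """Eq. (4): base log-density of z_N(x) plus log|det| of each inverse step."""
    z, log_det = x, 0.0
    for n in range(N):            # inverse steps f_1^{-1}, ..., f_N^{-1}
        z = (z - t[n]) / s[n]
        log_det -= np.sum(np.log(s[n]))  # Jacobian of an elementwise affine inverse
    base = -0.5 * (D * np.log(2.0 * np.pi) + np.sum(z ** 2))  # ln p_N(z_N(x))
    return base + log_det

z0 = rng.standard_normal(D)  # draw from the base distribution
x = forward(z0)              # a sample from the modelled distribution
ll = log_likelihood(x)       # exact log-likelihood of that sample
```

Because every step is invertible, `log_likelihood` recovers the latent `z0` exactly, so the likelihood is exact rather than a variational bound.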
Each component transformation 𝒇_𝑛^{−1} in Glow contains three sub-steps: activation normalisation, also known as actnorm; a linear transformation; and a so-called affine coupling layer, together shown as a step of flow in Fig. 2. The first two are affine or linear transformations while the latter amounts to a more powerful nonlinear transformation that is nonetheless invertible. We will let 𝒂_{𝑡,𝑛} and 𝒃_{𝑡,𝑛} denote intermediate results of Glow computations for observation 𝒙_𝑡 in flow step 𝑛, as shown in Fig. 2.

Actnorm, the first sub-step, is an affine transformation 𝒂_{𝑡,𝑛} = 𝒔_𝑛 ⊙ 𝒛_{𝑡,𝑛−1} + 𝒕_𝑛 (with ⊙ denoting elementwise multiplication) intended as a substitute for batchnorm [Ioffe and Szegedy 2015]. The parameters 𝒔_𝑛 > 0 and 𝒕_𝑛 are initialised such that the output has zero mean and unit variance and are then treated as trainable parameters.

After actnorm follows a linear transformation 𝒃_{𝑡,𝑛} = 𝑾_𝑛 𝒂_{𝑡,𝑛} where 𝑾_𝑛 ∈ ℝ^{𝐷×𝐷}. By representing 𝑾_𝑛 by an LU-decomposition 𝑾_𝑛 = 𝑳_𝑛 𝑼_𝑛 with one matrix diagonal set to one (say 𝑙_{𝑛,𝑑𝑑} = 1), the Jacobian determinant of the sub-step is just the product of the diagonal elements 𝑢_{𝑛,𝑑𝑑}, which is computable in linear time. The non-fixed elements of 𝑳_𝑛 and 𝑼_𝑛 are the trainable parameters of the sub-step.

The affine coupling layer is more complex. The idea is to affinely transform half of the input elements based on the values of the other half. By passing those remaining elements through unchanged, it is easy to use their values to undo the transformation when reversing the computation. Mathematically, we define 𝒃_{𝑡,𝑛} and 𝒛_{𝑡,𝑛} as concatenations 𝒃_{𝑡,𝑛} = [𝒃_{𝑡,𝑛}^{lo}, 𝒃_{𝑡,𝑛}^{hi}] and 𝒛_{𝑡,𝑛} = [𝒛_{𝑡,𝑛}^{lo}, 𝒛_{𝑡,𝑛}^{hi}]. The coupling can then be written

$$[\boldsymbol{z}_{t,n}^{\mathrm{lo}},\, \boldsymbol{z}_{t,n}^{\mathrm{hi}}] = [\boldsymbol{b}_{t,n}^{\mathrm{lo}},\, (\boldsymbol{b}_{t,n}^{\mathrm{hi}} + \boldsymbol{t}'_{t,n}) \odot \boldsymbol{s}'_{t,n}]. \tag{5}$$

The scaling 𝒔'_{𝑡,𝑛} > 0 and bias 𝒕'_{𝑡,𝑛} terms in the affine transformation of 𝒃_{𝑡,𝑛}^{hi} are computed via a neural network, 𝐴_𝑛, that only takes 𝒃_{𝑡,𝑛}^{lo} as input. (We use '𝐴' for "affine".) We can therefore unambiguously invert Eq. (5) based on 𝒛_{𝑡,𝑛} by feeding 𝒛_{𝑡,𝑛}^{lo} = 𝒃_{𝑡,𝑛}^{lo} into 𝐴_𝑛 to compute 𝒔'_{𝑡,𝑛} > 0 and 𝒕'_{𝑡,𝑛}. The coupling computations during inference are visualised in Fig. 2. The weights that define 𝐴_𝑛 are also elements of the parameter set 𝜽_𝑛, while the constraint 𝒔'_{𝑡,𝑛} > 0 is enforced by using a sigmoid nonlinearity [Nalisnick et al. 2019, App. D]. Random weights are used for initialisation except in the output layer, which is initialised to zero [Kingma and Dhariwal 2018]; this has the effect that the coupling initially is close to an identity transformation, reminiscent of Fixup initialisation [Zhang et al. 2019].

Interleaved linear transformations and couplings are both necessary for an expressive flow. Without couplings, a stack of flows collapses to compute a single, fixed affine transformation of 𝒁, meaning that 𝑿 will be restricted to a Gaussian distribution; a stack of couplings alone will only perform a nonlinear transformation of half of 𝒁, doing nothing to the other half. The linear layers 𝑾_𝑛 can be seen as generalised permutation operations between couplings, ensuring that all variables (not just one half) can be nonlinearly transformed with respect to each other by the full flow.

3.2 MoGlow

Let 𝑿 = 𝑿_{1:𝑇} = [𝑿_1, ..., 𝑿_𝑇] be a sequence-valued random variable. Like all autoregressive models of time sequences, we develop our model from the decomposition

$$p(\boldsymbol{x}) = p(\boldsymbol{x}_{1:\tau}) \prod_{t=\tau+1}^{T} p(\boldsymbol{x}_t \mid \boldsymbol{x}_{1:t-1}). \tag{6}$$

We assume the distribution of 𝑿_𝑡 only depends on the 𝜏 previous values (i.e., is a Markov chain of order 𝜏), except for a latent state 𝒉_𝑡 ∈ ℝ^𝐻 that represents the effect of recurrence in a recurrent neural network (RNN) and evolves according to a relation 𝒉_𝑡 = 𝒈(𝒉_{𝑡−1}) at each timestep. To achieve control over the output we further condition the 𝑿-distribution on another sequence variable 𝑪, acting as the control signal. We assume that, for each training-data frame 𝒙_𝑡, the matching control-signal values 𝒄_𝑡 ∈ ℝ^𝐶 are known. Moreover, the experiments in this paper focus on causal control schemes, where only current and former control inputs 𝒄_{1:𝑡} may influence the conditional distributions from (6) at 𝑡. (Letting the model also depend on future 𝒄-values might improve motion quality, but inevitably introduces algorithmic latency.) Putting the Markov assumption, the hidden state, and the control together gives our temporal model

$$p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{c}) = p(\boldsymbol{x}_{1:\tau} \mid \boldsymbol{c}_{1:\tau}) \prod_{t=\tau+1}^{T} p_{\boldsymbol{\theta}}(\boldsymbol{x}_t \mid \boldsymbol{x}_{t-\tau:t-1}, \boldsymbol{c}_{t-\tau:t}, \boldsymbol{h}_{t-1}) \tag{7}$$
$$\boldsymbol{h}_t = \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t-\tau:t-1}, \boldsymbol{c}_{t-\tau:t}, \boldsymbol{h}_{t-1}), \tag{8}$$

where we have decided to condition on the control signal at most 𝜏 steps back only, just like for the previous poses. The subscript in p_𝜽 indicates that the distributions depend on model parameters 𝜽. The initial hidden state can be learned, but in our experiments we initialise 𝒉_𝜏 as 𝟎.¹ For the deterministic hidden-state evolution 𝒈, a straightforward choice to implement Eq. (8) is to use a recurrent neural network, here an LSTM [Hochreiter and Schmidhuber 1997]. The vector 𝒉_𝑡 is then the concatenation of the LSTM cell-state vectors and the LSTM-unit output vectors at time 𝑡.

Finally, we also assume stationarity, meaning that 𝒈 and the distributions in (7) are independent of 𝑡.
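As a concrete illustration of Eq. (5), the sketch below implements a single affine coupling and its exact inverse in NumPy. The small two-layer network standing in for 𝐴_𝑛 is hypothetical (fixed random weights, no pose or control conditioning); any function of the unchanged half works, because the inverse can recompute it exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 6            # illustrative feature dimension
half = D // 2

# Hypothetical stand-in for the coupling network A_n (fixed random weights)
W1 = rng.normal(size=(half, 8))
W2 = rng.normal(size=(8, 2 * half))

def A(b_lo):
    """Compute scale s' > 0 and bias t' from the unchanged half only."""
    h = np.tanh(b_lo @ W1)
    raw = h @ W2
    s = 1.0 / (1.0 + np.exp(-raw[:half]))  # sigmoid enforces s' > 0
    return s, raw[half:]

def couple(b):
    """Eq. (5): z = [b_lo, (b_hi + t') * s']."""
    b_lo, b_hi = b[:half], b[half:]
    s, t = A(b_lo)
    return np.concatenate([b_lo, (b_hi + t) * s])

def uncouple(z):
    """Exact inverse: z_lo equals b_lo, so (s', t') can be recomputed."""
    z_lo, z_hi = z[:half], z[half:]
    s, t = A(z_lo)
    return np.concatenate([z_lo, z_hi / s - t])

b = rng.standard_normal(D)
assert np.allclose(uncouple(couple(b)), b)  # round trip is exact
```

The log-Jacobian of this coupling is simply the sum of ln 𝑠'_𝑑 over the transformed dimensions, which is why these terms appear directly in the training objective.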
This is an exceedingly common assumption in practical sequence models, since it means that all timesteps in the training data can be treated as samples from a single, time-independent distribution p_𝜽(𝒙_𝑡 | 𝒙_{𝑡−𝜏:𝑡−1}, 𝒄_{𝑡−𝜏:𝑡}, 𝒉_{𝑡−1}). The central innovation in this paper is to learn that controllable next-step distribution using normalising flows.

To adapt Glow to parameterise the next-step distribution in the autoregressive hidden-state model in Eqs. (7) and (8), we made a number of changes to the original image-oriented Glow architecture in Kingma and Dhariwal [2018]. There, dependencies between 𝒁_{𝑡,𝑛}-values at different image locations were introduced by making 𝐴_𝑛 a convolutional neural network. We instead use unidirectional (causal) LSTMs inside 𝐴_𝑛 to enable dependence between timesteps, which is simpler than the dilated convolutions used in recent audio models based on Glow [Kim et al. 2019; Prenger et al. 2019] while giving better models than making 𝐴_𝑛 a simple feedforward network.

We added a small epsilon 𝜀 = 0.05 to the sigmoids in 𝐴_𝑛 that define the scale-factor outputs 𝒔'_{𝑡,𝑛}, in order to bound the dynamic range of the scaling and stabilise training. This modification restricts the possible scale-factor values to the interval (𝜀, 1 + 𝜀). Unlike Dinh et al. [2017]; Kim et al. [2019] we did not use any multiresolution architecture in our flow, as that did not provide any noticeable improvements in preliminary experiments, nor do we include squeeze operations, as those would add algorithmic latency.

To provide motion control and enable explicit dependence on recent pose history in Glow distributions, we take inspiration from recent sequence-to-sequence audio models [Kim et al. 2019; Prenger et al. 2019], which feed the conditioning information (here 𝒙_{𝑡−𝜏:𝑡−1} and 𝒄_{𝑡−𝜏:𝑡}) as additional inputs to the affine couplings 𝐴_𝑛, these being the only neural networks in Glow.
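The ε-shifted sigmoid can be sketched in a single function; this mirrors the bounding described above, although the surrounding network wiring is of course more involved in the actual model:

```python
import numpy as np

EPS = 0.05  # value stated in the text

def bounded_scale(raw):
    """Map an unconstrained network output to a scale in (EPS, 1 + EPS)."""
    return EPS + 1.0 / (1.0 + np.exp(-raw))

raw = np.linspace(-10.0, 10.0, 101)
scales = bounded_scale(raw)
# Scale factors can neither collapse towards 0 nor grow without bound,
# which limits the dynamic range of the coupling and stabilises training.
assert scales.min() > EPS and scales.max() < 1.0 + EPS
```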
The scaling and bias terms, together with the next state h_{t,n} of net A_n, are then computed as

$$[\boldsymbol{s}'_{t,n},\, \boldsymbol{t}'_{t,n},\, \boldsymbol{h}_{t,n}] = A_n(\boldsymbol{b}^{\mathrm{lo}}_{t,n},\, \boldsymbol{x}_{t-\tau:t-1},\, \boldsymbol{c}_{t-\tau:t},\, \boldsymbol{h}_{t-1,n}). \quad (9)$$

We call our proposed model structure MoGlow, for motion Glow. If we let z_{t,N} denote the observation x_t mapped back onto the latent space by the (conditional) flow transformation f^{−1}, the full log-likelihood training objective of MoGlow applied to a sequence x given the control input c can be written

$$\ln p_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau+1:T} \mid \boldsymbol{x}_{1:\tau},\, \boldsymbol{c}) = \sum_{t=\tau+1}^{T} \ln p_{\mathcal{N}}\bigl(\boldsymbol{z}_{t,N}(\boldsymbol{x}_{1:t}, \boldsymbol{c}_{1:t})\bigr) + \sum_{n=1}^{N} \sum_{d=1}^{D} \sum_{t=\tau+1}^{T} \bigl( \ln s_{n,d} + \ln |u_{n,dd}| + \ln s'_{t,n,d}(\boldsymbol{x}_{1:t}, \boldsymbol{c}_{1:t}) \bigr), \quad (10)$$

where we have made explicit which terms depend on x and c. A schematic illustration of MoGlow sample generation is presented in Fig. 3.

¹For this article, we will ignore how to model the initial distribution p(x_{1:τ}) from (7). Experimentally, we found that initialisation with natural motion snippets or with a static mean pose both give competitive results.

ACM Trans. Graph., Vol. 39, No. 4, Article 236. Publication date: November 2020.

Fig. 3. Schematic of autoregressive motion generation with MoGlow. Inputs are blue, outputs yellow. Dropout is only applied at training time.
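To illustrate why the objective in Eq. (10) is tractable, here is a minimal sketch (our simplification, not the paper's code) of one conditional affine coupling step: the lower half of the features passes through unchanged and, together with the conditioning information, determines the scale s' and offset t' applied to the upper half. Because s' > 0, the step is exactly invertible and its log-determinant is simply the sum of log-scales, which is what makes exact maximum-likelihood training possible.

```python
import numpy as np

def coupling_forward(b_lo, b_hi, s_prime, t_prime):
    """z_hi = s' * b_hi + t'; the log-determinant is sum(log s')."""
    z_hi = s_prime * b_hi + t_prime
    logdet = np.sum(np.log(s_prime))
    return b_lo, z_hi, logdet

def coupling_inverse(b_lo, z_hi, s_prime, t_prime):
    """Exact inverse of coupling_forward for the same s', t'."""
    return b_lo, (z_hi - t_prime) / s_prime
```

In the full model, s' and t' are produced by the LSTM network A_n from b_lo together with x_{t−τ:t−1}, c_{t−τ:t}, and the hidden state, as in Eq. (9); the inverse only needs b_lo and the same conditioning, so it is equally cheap.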
At generation time, latent z_t-vectors are sampled independently from p_N (acting as a source of randomness for the next-step distribution) and then transformed into new poses x_t by the flow f conditioned on x_{t−τ:t−1}, c_{t−τ:t}, and h_{t−1}.

Because Z_t is supported on all of R^D, so is X_t. This is a natural fit for pose representations that take values on R^D, e.g., joint positions in Cartesian coordinates. Pose representations supported on a non-zero-volume subset X ⊂ R^D, for example the exponential map [Grassia 1998], can also be used. In practice, we recommend parameterisations that minimise angular discontinuities, e.g., by expressing angles relative to a T-pose and wrapping at ±180 degrees, since the method works best for continuous density functions.

3.3 Data dropout

Early MoGlow models had a problem with poor adherence to the control input, where the generated character often would walk or run even when the control input (in this case, the path followed by the root node) specified that no movement through space was taking place. This indicates an over-reliance on autoregressive pose information, compared to the control input. Such behaviour is a frequent issue with long-term prediction in powerful autoregressive models (cf. Chen et al. [2017]; Liu et al. [2019]), for example in generative models of speech as in Tachibana et al. [2018]; Uria et al. [2015]; Wang et al. [2018]. Established methods to counter this failure mode include applying dropout to entire frames of autoregressive history inputs – conventionally called data dropout – as in Bowman et al. [2016]; Wang et al. [2018], or downsampling the data sequences as in Tachibana et al. [2018]. Dropout and bottlenecks in the autoregressive path can also be combined with a lowered frame rate, e.g., Shen et al. [2018]; Wang et al. [2017].
All of these approaches have the net effect of reducing the informational value of the most-recent autoregressive feedback, thus making the information in the current control input relatively more valuable. We found that applying data dropout during training substantially improved the consistency between the generated motion and the control signal in MoGlow models. In particular, the issue of MoGlow running in place vanished with frame dropout rates of 50% and above.

4 EXPERIMENTAL SETUP

The goal of MoGlow is to introduce a probabilistic and controllable motion model capable of delivering high-quality output without task-specific assumptions. This section presents the data and systems used for comparative experiments that evaluate the quality of MoGlow output across different tasks. Associated evaluations and results are reported in Sec. 5, along with skinned-character experiments designed to validate the probabilistic aspects of the model.

Objectively evaluating motion plausibility is difficult in the general case, as there is no single natural realisation of the motion given typical, weak control signals. Comparing low-level properties such as frame-wise joint positions between recorded and synthesised motion is therefore not particularly informative. To enable meaningful objective evaluation, we chose to evaluate MoGlow on locomotion synthesis, for which some perceptually-salient aspects of the motion can be studied objectively. Specifically, foot-ground contacts are easy to identify, as they should have zero velocity, and foot-sliding artefacts (often attributable to mean collapse) are both pervasive in synthetic locomotion and known to greatly affect the perceived naturalness of the resulting animation. We stress that, unlike Holden et al. [2017, 2016]; Pavllo et al. [2018]; Starke et al. [2020], we do not use foot-contact information as part of our model, but only use it to objectively evaluate the generated output motion.
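The per-pose data dropout described above can be sketched in a few lines. This is an illustrative implementation under our own assumptions (the function name and interface are ours, not the authors'): during training, each of the τ past pose frames fed to the model is independently zeroed out with probability p, which forces the model to lean more on the control signal than on recent poses.

```python
import numpy as np

def frame_dropout(x_hist, p, rng):
    """Zero out entire pose frames of the autoregressive history.

    x_hist: (tau, D) pose history; p: dropout probability per frame;
    rng: a numpy Generator. Applied only at training time.
    """
    keep = rng.random(x_hist.shape[0]) >= p   # one keep/drop decision per frame
    return x_hist * keep[:, None]             # dropped frames become all-zero
```

Note that whole frames are dropped, not individual pose dimensions: dropping single coordinates would still leak most of the pose, whereas removing a full frame genuinely reduces the informational value of the history.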
4.1 Data for objective and subjective evaluations

We considered two sources of motion-capture data in our evaluations, namely human (bipedal) and animal (quadrupedal) locomotion on flat surfaces. Bipedal and quadrupedal locomotion represent significantly different modelling problems, and to our knowledge no method has been demonstrated to perform well on both tasks, with the exception of Starke et al. [2020], which appeared while this paper was in review.

For the human data, we used the data and preprocessing code provided by Holden et al. [2016, 2015].² We pooled this dataset with the locomotion trials from the CMU [CMU Graphics Lab 2003] and HDM05 [Müller et al. 2007] databases. We held out a subset of the data with a roughly equal amount of motion in different categories (such as walking, running, and sidestepping) for evaluation, and used the rest for training. For the animal motion, we used the 30 minutes of dog motion capture from Zhang et al. [2018], excluding clips on uneven terrain. Quadrupedal locomotion allows more gaits than bipedal locomotion (see Zhang et al. [2018]), but the data also contains motions like sitting, standing, idling, lying, and jumping. We held out two sequences comprising 72 s of data.

Both datasets were downsampled to 20 frames per second and sliced into fixed-length 4-second windows with 50% overlap for training. The lowered frame rate both reduces computational demands and decreases over-reliance on autoregressive feedback, as discussed in Sec. 3.3. The training data was subsequently augmented by lateral mirroring. To increase the amount of backwards and side-stepping motion, we further augmented the data by reversing it in time. This way we obtained 13,710 training sequences from the human data and 3,800 from the animal material. Preliminary comparisons indicated that the reverse-time augmentation substantially improved the naturalness of synthesised motion.
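The windowing and time-reversal augmentation described above can be sketched as follows. This is our own illustrative implementation under the stated settings (20 fps, 4-second windows, 50% overlap); lateral mirroring depends on the skeleton definition and is omitted here.

```python
import numpy as np

def make_windows(clip, win_len):
    """Slice a motion clip into overlapping windows and add reversed copies.

    clip: (T, D) array of pose frames at 20 fps;
    win_len: frames per window (80 frames = 4 s at 20 fps).
    """
    hop = win_len // 2                                  # 50% overlap
    starts = range(0, len(clip) - win_len + 1, hop)
    windows = [clip[s:s + win_len] for s in starts]
    windows += [w[::-1] for w in windows]               # time-reversal augmentation
    return np.stack(windows)
```

Reversing a window in time turns forward walking into backward walking along the same path, which is why the augmentation specifically enriches backwards and side-stepping motion.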
We used the same pose representation and control scheme as in Habibie et al. [2017]. Each pose frame x_t in the data thus comprised the 3D joint positions of a skeleton expressed in a floor-level (root) coordinate system following the character's position and direction. The root motion was calculated by Gaussian-filtering the horizontal, floor-projected hip motion from the original data, which yielded an (x, z) trajectory on the ground together with the up (y) axis rotation. The filtering is essential for generalising the synthesis to smooth control signals as provided by an artist or from game-pad input. The human data had 21 joints (D = 63 degrees of freedom), while the dog data had 27 joints (D = 81 degrees of freedom).

²Please see http://theorangeduck.com/page/deep-learning-framework-character-motion-synthesis-and-editing.

Table 1. Overview of system configurations considered in this paper. Numbers marked ^h pertain only to the human model, ^d to the dog.

Configuration                ID    Prob.?     Task-agn.?  Algo. latency  Context frames  Hidden state  Pose dropout  Params (man)  Params (dog)  Loss func.       Epochs  Time     GPUs
Baselines:
  Plain LSTM                 RNN   ✗          ✓           None           -               LSTM          -             1M            1M            MSE              40      0.7 h^h  8
  Greenwood et al. [2017a]   VAE   Partially  ✓           Full seq.      -               BLSTM         -             4M            4M            MSE+KLD          40      6.1 h^h  8
  Pavllo et al. [2018]       QN    ✗          ✗           1 sec.         -               GRU           -             10M           -             Angl./pos.+reg.  2k/4k   10 h^h   2
  Zhang et al. [2018]        MA    ✗          ✗           1 sec.         12              -             -             -             5M            MSE              150     30 h^d   1
MoGlow                       MG    ✓          ✓           None           10              LSTM          95%           74M           80M           Log-likelihood   291^h   26 h^h   1
Ablations:
  No pose dropout            MG-D  "          "           "              10              "             0%            74M           -             "                "       26 h^h   "
  No pose context            MG-A  "          "           "              10              "             100%          74M           -             "                "       26 h^h   "
  Minimal history            MG-H  "          "           "              1               "             95%           54M           -             "                "       23 h^h   "
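The Gaussian filtering of the root path can be illustrated as below. This is a sketch under our own assumptions (a truncated Gaussian kernel with edge padding; the kernel width σ is not specified in the text), not the authors' preprocessing code.

```python
import numpy as np

def smooth_root_path(hip_xz, sigma):
    """Gaussian-filter floor-projected hip positions to get a smooth root path.

    hip_xz: (T, 2) horizontal hip positions; sigma: kernel width in frames.
    Returns an array of the same shape.
    """
    radius = int(3 * sigma)                      # truncate the kernel at 3 sigma
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()                       # normalise to preserve position scale
    padded = np.pad(hip_xz, ((radius, radius), (0, 0)), mode="edge")
    return np.stack([np.convolve(padded[:, d], kernel, mode="valid")
                     for d in range(2)], axis=1)
```

Smoothing matters because the raw hip trajectory oscillates with each step; training against a smoothed path makes the model robust to the smooth control curves an artist or game-pad would supply.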
This was supplemented with the frame-wise delta translation and delta rotation (around the up axis) of the root, which together constitute the control signal c_t ∈ R³ for each frame. The trajectory of the root over time is computed from the control signal c_t using integration, and is therefore completely determined by the sequence of control inputs c. The end result is that the root is constrained to exactly follow a specific path on the ground and path-following is essentially perfect; the task of the motion-synthesis model is to generate a sequence of body poses that are consistent with motion along this trajectory. Each dimension in the data and control signal was standardised to zero mean and unit variance over the training data prior to training.

4.2 Proposed model and ablations

We trained the same PyTorch implementation³ of MoGlow on both the human and the animal data. We used a τ = 10-frame time window (0.5 seconds) with N = 16 steps of flow. The neural network in each coupling layer comprised two LSTM layers (512 nodes each), followed by a linear (for t_n) and sigmoid (for s_n) output layer. Model parameters were estimated by maximising the log-likelihood of the training-data sequences using Adam [Kingma and Ba 2015] for 160k steps (human) or 80k (quadruped) with batch size 100. Both models used a learning rate of 10⁻⁴, but for the quadruped we used the Noam learning-rate scheduler [Vaswani et al. 2017] with 1k steps of warm-up and a peak learning rate of 10⁻³. The autoregressive frame dropout rate was set to 0.95 during training (no dropout was used during synthesis). We denote this system "MG", for MoGlow. While many GAN and normalising-flow applications heuristically reduce the temperature (standard deviation) of the latent distribution Z_t at generation time, we found this to be unnecessary, and in fact detrimental to the visual quality of motion sampled from the system.
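The integration that turns the control signal into a root trajectory can be sketched as follows. This is our own illustration of the described scheme (interface and names are ours): each frame's delta translation is expressed in the current root frame, so the world-space path follows by rotating each delta by the accumulated heading and summing.

```python
import numpy as np

def integrate_control(deltas):
    """Integrate per-frame root deltas into a world-space path.

    deltas: (T, 3) rows of (dx, dz, dtheta), i.e., the control signal c_t.
    Returns (T+1, 2) world-space (x, z) positions of the root.
    """
    pos = np.zeros(2)
    theta = 0.0
    path = [pos.copy()]
    for dx, dz, dtheta in deltas:
        c, s = np.cos(theta), np.sin(theta)
        pos = pos + np.array([c * dx - s * dz, s * dx + c * dz])  # rotate into world frame
        theta += dtheta                                           # accumulate heading
        path.append(pos.copy())
    return np.stack(path)
```

Because the path is a pure function of c, path-following cannot fail; all modelling effort goes into generating body poses consistent with that path.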
³Please see our project page https://simonalexanderson.github.io/MoGlow for links to code, data, and hyperparameters from the evaluation, as well as updated hyperparameter settings that we think further improve output quality.

To assess the impact of important design decisions, we trained three additional versions of the MoGlow architecture on the human data. In these, specific components had been disabled from the full MG system: The first ablated configuration, "MG-D" (for "minus dropout"), turned off data dropout by setting the dropout rate to zero. As discussed in Sec. 3.3, we expect this system to exhibit poor adherence to the control signal and establish the utility of introducing data dropout. The second, "MG-A" (for "minus autoregression"), instead increased the dropout rate to 100%, thereby completely disabling autoregressive feedback from recent poses x_{t−τ:t−1}. We expect the contrast between MG and MG-A to show the utility of the autoregressive feedback in the model. The final ablation, "MG-H" (for "minus history"), changed τ from ten frames (0.5 s of history information) down to a single frame. This is the minimum history length at which the model remains autoregressive; any pose or control information older than t − 1 must now be propagated by the LSTMs in A_n instead. (Unlike MG-D and MG-A, MG-H also affects the control information, in addition to the autoregressive feedback.) We expect this ablation to demonstrate the utility of providing the flows with an explicit memory buffer of the most recent pose and control inputs, in addition to the long-range information about past inputs propagated through the recurrent hidden state. Table 1 summarises the properties and training of the proposed system and its ablations.

4.3 Baseline systems

To put the performance of MoGlow in perspective, we compared against a number of other motion-generation approaches.
The first of these is held-out motion-capture recordings, which we label "NAT", for natural. (We prefer not to use the term "ground truth", since there is no one true way to perform a given motion.) These motion examples function as a top line.

We also compared against two task-agnostic motion-synthesis approaches, labelled "RNN" and "VAE". The first of these, RNN, is a deterministic system that maps control signals c_t to poses x_t using a standard unidirectional LSTM network (one hidden layer of 512 nodes followed by a linear output layer) and was trained to minimise the mean squared error (MSE). Because our path-based control signal does not suffice to disambiguate the motion, we expect this generic method to exhibit considerable regression to the mean, for instance visible through foot-sliding. This is emblematic of task-agnostic deterministic methods.

Fig. 4. Still images cropped from videos of MG output. The path followed by the root node, which is completely determined by c, is visualised as a blue curve projected onto the ground plane.

The other task-agnostic baseline, VAE, is a reimplementation of the conditional variational autoencoder architecture used for speech-driven head-motion generation in Greenwood et al. [2017a, b], but in our case predicting motion x from c. We used encoders and decoders with two bidirectional LSTM layers (256 nodes each way) and a linear output layer. The encoder used mean-pooling to map to a latent space with two dimensions per sequence. Due to the bidirectional LSTMs in the conditional decoder, interactive control is not possible with this approach. Unlike the RNN baseline, VAE represents a partially probabilistic model, which should enable it to cope with motion that is random and ambiguous also when conditioned on the control signal.
The model does not incorporate any assumptions specific to head-motion data, and can be considered representative of the state of the art in probabilistic, task-agnostic motion generation. We say that this system is "partially probabilistic" since the decoder is trained to minimise the MSE and treated as deterministic rather than stochastic at synthesis time. As a consequence, output samples from the system have artificially reduced randomness compared to sampling from the full probabilistic model described by the fitted VAE, whose decoder is a Gaussian distribution. Such reduced-entropy generation procedures are common in practice since they tend to improve subjective output quality (see Sec. 2.3), but they also indicate that the underlying model has failed to convincingly model the natural variation in the data.

Finally, we also compared our proposed method with a leading task-specific system in each of the two domains. Human locomotion generation, to begin with, is a mature field where many approaches may be considered state-of-the-art. One example is the recently proposed QuaterNet [Pavllo et al. 2018], which we included in our evaluation as system "QN". In order not to compromise QN motion quality, we used the code, hyperparameters, and control scheme made available by the original QuaterNet authors.⁴ This introduced a number of minor differences compared to the other systems. Specifically, the QuaterNet reference implementation contains a number of preprocessing steps that change the motion: First, the input path is approximated by a spline, and facing information and local motion speed are replaced. This control scheme causes the character to always face the direction of motion, preventing sidestepping or walking backwards. Short spline segments are then lengthened, preventing the model from standing still.
One goal with MoGlow is to deliver high-quality motion without such custom, task-specific processing steps. Finally, we resampled the output from the trained QN system to 20 fps to match the other systems in the evaluation.

⁴Please see https://github.com/facebookresearch/QuaterNet.

For the quadruped locomotion task, we compared with the mode-adaptive neural networks from Zhang et al. [2018]. Since they trained on the same dataset as us, we used their pretrained model⁵ as our system "MA" for best results. To our knowledge, no data was held out from their training. In the absence of held-out control signals, MA was therefore only evaluated on synthetic control input. For the experiments, we set the MA style input to "move" and the correction parameter τ to 1, to make the model follow the input path exactly, like the other systems in the evaluation. MA output was also resampled to 20 fps.

In summary, RNN and VAE are task-agnostic systems – one deterministic, one probabilistic – while QN and MA instead represent the task-specific state of the art in their respective tasks. We note that, unlike RNN and the MG systems, VAE, QN, and MA are noncausal, in the sense that their output depends on future control-input information. We expect this ability to "see the future" to benefit the quality of the motion generated by these systems, but it comes at the cost of introducing algorithmic latency, preventing the type of responsive control that MG allows.

All our models were trained on a system with 8 Nvidia 2080Ti GPUs. An overview of the different systems, including information such as training time, model size, and the number of GPUs used, is provided in Table 1.

5 RESULTS AND DISCUSSION

This section details our subjective (Secs. 5.1 through 5.2) and objective (Sec. 5.3) evaluations of the different motion-generation methods from Sec. 4, and how we interpret the results. We then describe (Sec.
5.4) experiments that explore the probabilistic aspects of the model, and consider its use beyond locomotion. We then conclude with a discussion of drawbacks and limitations (Sec. 5.5).

5.1 Subjective evaluation setup

Since our goal is to create lifelike synthetic motion that appears convincing to human observers, subjective evaluation is the gold standard. To this end, we conducted several user studies to measure motion quality on the two tasks. The stimuli used in both studies were short animation clips where motion was visualised using a stick figure seen from a fixed camera angle; see Fig. 4. A curve on the ground marked the path taken by the figure in the clip.

Clips were generated for all systems in Table 1 and from held-out motion-capture recordings ("NAT"). For MG, one second of preceding motion was pre-generated before the four seconds that were displayed and scored, to remove the effects of motion initialisation. Since the QuaterNet preprocessing changes the motion duration, the segmentation points for the evaluation clips (and also the camera azimuth) differ between QN and the other systems.

In addition to motion generated from held-out natural control signals (20 human, 8 dog), the evaluation also included synthetic control signals (7 human, 10 dog) with a range of motion speeds and directions, for which no natural counterpart was available. Generalising well to synthetic control is important for computer animation, video games, and similar applications.

⁵Available at https://github.com/sebastianstarke/AI4Animation.

Table 2. Mean subjective ratings with confidence intervals. Significant differences from MG are indicated by ** (p < 0.01) and * (p < 0.05).
        Human                           Quadruped
ID      Held-out c      Synthetic c     Held-out c      Synthetic c
NAT     4.27 ± 0.11     -               4.25 ± 0.06**   -
RNN     3.10 ± 0.15**   1.9 ± 0.2**     2.81 ± 0.10**   1.14 ± 0.04**
VAE     3.95 ± 0.13     3.1 ± 0.3**     3.55 ± 0.08     2.14 ± 0.20**
QN      4.21 ± 0.10     -               -               -
MA      -               -               -               3.78 ± 0.10
MG      4.17 ± 0.11     4.0 ± 0.2       3.71 ± 0.18     3.57 ± 0.20
MG-D    3.66 ± 0.16**   2.1 ± 0.2**     -               -
MG-A    2.86 ± 0.16**   3.2 ± 0.3**     -               -
MG-H    3.87 ± 0.13*    3.9 ± 0.3       -               -

Evaluation participants were recruited using the Figure Eight crowdworker platform at the highest-quality contributor setting (allowing only the most experienced, highest-accuracy contributors). For each clip, participants were asked to grade the perceived naturalness of the animation on a scale of integers from 1 to 5, with 1 being completely unnatural (the motion could not possibly be produced by a real person/dog) and 5 being completely natural (looks like the motion of a real person/dog). Every system in Table 1 had one stimulus generated for every control signal considered, with a few exceptions: QN was not applied to synthetic control signals, since these contained a large fraction of control inputs involving walking sideways, walking backwards, and standing still – motion that the QN reference implementation from Pavllo et al. [2018] cannot perform (instead rendering these as forwards motion). MA was not applied to our natural test inputs, since these were not held out from MA training. The ablated systems were only evaluated on the human locomotion task. This yielded a total of 202 human animations being evaluated (160 with held-out control and 42 with synthetic control) and 72 dog animations (32 held-out, 40 synthetic control). The order of the animation clips was randomised, and no information was given to the raters about which system had generated a given video, nor about the number of systems being evaluated in the test.
Interspersed among the regular stimuli were a handful of clips with deliberately bad animation taken from early iterations in the training process (labelled "BAD"). These were added as "attention checks" to be able to filter out unreliable raters: Any rater that had given any one of the BAD animations a rating of 4 or above, or had given any of the NAT clips a rating below 2, was removed from the analysis. Ratings that were too fast (the rater replied before the video had finished playing) were also discarded. Prior to the start of the rating phase, participants were trained by viewing example motion videos from the different conditions evaluated, as well as some of the bad examples mentioned above. Motion examples can be seen in our presentation video and in the supplementary material, which contains all video clips from the subjective evaluation.

5.2 Analysis and discussion of subjective evaluation

A total of 645 raters (296 human data/349 dog data) participated in the evaluation, of which 89 (49/40) were removed as unreliable (see above). In total, 10,355 ratings were collected (5,083/5,272). 1,533/983 of these were discarded due to unreliable raters (1,344/813) or too-fast response times (189/170), resulting in a total of 3,550/4,289 ratings across 227/80 clips being evaluated (both regular and BAD), amounting to between 8 and 60 ratings per stimulus. The mean scores for each system configuration and control-signal class are tabulated in Table 2.

For the human motion, a one-way ANOVA revealed a main effect of the naturalness rating (F = 223, p < 10⁻²⁸⁸). A post-hoc Tukey multiple-comparisons test was applied in order to identify significant differences between conditions (FWER = 0.05). For the held-out control conditions, MG was rated significantly higher than RNN and all ablations. For the synthetic control conditions, MG was rated significantly higher than all other systems except the ablation system MG-H.
The same analysis for the quadruped motion again revealed a main effect of the naturalness rating (F = 172, p < 10⁻¹⁰⁰ for held-out c; F = 803, p < 10⁻²⁹⁶ for synthetic). The post-hoc Tukey multiple-comparisons test revealed significant differences between MG and all other systems, except between MG and VAE on the held-out control and between MG and MA on the synthetic control. 95%-confidence intervals for the mean scores based on these analyses are included in Table 2, which also indicates significant differences between MG and other systems.

Among the task-agnostic methods in the experiment, MG substantially outperforms both RNN and VAE. Despite these MG systems being trained to predict joint positions rather than joint rotations, they are seen to respect constraints due to bone lengths, ground contacts, etc. Furthermore, the rated motion quality of MG on each task is comparable to the respective task-specific state of the art (the difference between MG and either QN or MA is not statistically significant), and comes within 0.1 points of natural motion for the biped. This is despite the task-specific systems having a full second of algorithmic latency, while MG is task-agnostic and has none. We note that stimuli where the root is completely still are generally rated lowest for MG and MA, and are not possible to generate with QN.

Among other results, the performance of the ablations MG-D and MG-A versus the full MG system indicates that both autoregression and data dropout are of great importance for synthesising natural motion. A longer memory length of τ = 10 frames for MG, compared to τ = 1 for MG-H, also benefited the model. It can be observed that RNN, VAE, and MG-D quality degrades substantially on synthetic control signals, creating a highly significant difference with respect to MG. We hypothesise that this, for MG-D, is due to artefacts of poor control without data dropout (such as running in place; see Sec.
3.3), and, for RNN and VAE, due to these systems being dependent on footfall cues (e.g., residual periodicity in the root-node motion) not present in the synthetic motion control. The full MoGlow model, in contrast, generalises robustly to synthetic control signals.

5.3 Objective evaluation

Given the salience and importance of foot-sliding artefacts in locomotion synthesis, we base our objective evaluation on footstep analysis, with footsteps estimated as time intervals where the horizontal speed of the heel joints (bipeds) or toe joints (quadrupeds) is below a specified tolerance value v_tol. At low values of v_tol, many ground contacts exhibit too much motion (due to foot sliding or motion-capture uncertainty) and are not classified as steps. As the tolerance is increased, the number of footsteps identified, f_est, first rises but then quickly plateaus at a static maximum value representing the total number of footsteps in the sequence. A model that produces foot-sliding artefacts will require a higher tolerance before reaching its maximum. If the tolerance is increased further, the estimated number of footsteps eventually begins to decrease as separate footsteps start to be merged. Plots of f_est as a function of v_tol on held-out data are provided in Fig. 5; the human and dog motion clips used as the basis for these plots and for the associated analysis are available in the supplement.

Fig. 5. Footstep count f_est as a function of speed tolerance v_tol (cm/s) for the human (left) and quadruped (right) datasets. Black dots identify locations used to determine v_tol^(95) for each curve.
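The footstep analysis described above can be sketched in a few lines. This is our own illustrative implementation (function names are ours): a footstep is a maximal run of frames where the horizontal heel/toe speed stays below the tolerance, so f_est counts the below-threshold intervals, and v_tol^(95) is the first tolerance at which the count reaches 95% of its maximum over the tested tolerances.

```python
import numpy as np

def count_footsteps(speed, v_tol):
    """Count contiguous intervals where speed < v_tol.

    speed: (T,) horizontal heel/toe speed per frame; returns f_est.
    """
    below = speed < v_tol
    # an interval starts wherever a below-threshold run begins
    starts = below & ~np.concatenate(([False], below[:-1]))
    return int(starts.sum())

def first_tol_at_95(speed, tols):
    """Return the first tolerance whose count reaches 95% of the maximum."""
    counts = np.array([count_footsteps(speed, v) for v in tols])
    return tols[np.argmax(counts >= 0.95 * counts.max())]
```

Merging of adjacent steps at large tolerances falls out of this definition automatically: once the speed between two contacts also dips below v_tol, the two runs fuse into one and the count drops.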
(MA is not included since no data was held out from its training.) The plots show that MG is able to stay close to NAT in both scenarios. QN, which is only available for the human data, generates slightly too many steps, but is otherwise close to the natural footstep profile. The quadruped data appears to be more challenging than the human data, with the peaked behaviour of the estimated number of footsteps f_est for RNN and VAE indicating less distinctive synthetic locomotion that is likely to exhibit substantial foot sliding. MG, in contrast, again shows an f_est-profile very similar to that of natural motion.

For each model, we incremented v_tol in small steps (1.0 cm/s for human, 0.3 cm/s for quadruped) and extracted the first tolerance value v_tol^(95) that reached 95% of the maximum number of footsteps identified for that model in our evaluation. These points are shown as black dots on the curves in Fig. 5. The tolerance threshold v_tol^(95) essentially measures the 95th percentile of foot sliding in the motion. The lower it is, the crisper the motion is likely to be.

Table 3 shows the total estimated number of footsteps, the speed threshold, and the mean and standard deviation of the duration of the steps for the different systems when resynthesising the held-out data from the two datasets. We note that MG is almost always the model that most closely adheres to the ground-truth behaviour. Especially interesting is that MoGlow matches not only the mean but also the standard deviation of the natural step durations. Such

Table 3. Results from the objective evaluations: total number of footsteps f_est, speed tolerance v_tol^(95) (cm/s) for capturing 95% of steps, mean and standard deviation of step durations (s), and bone-length RMSE (cm). The number closest to its natural counterpart in each column is shown in bold.
        Human                                   Quadruped
ID      f_est  v_tol^(95)  μ     σ     RMSE     f_est  v_tol^(95)  μ     σ     RMSE
NAT     297    5.0         0.31  0.26  -        290    3.2         0.61  0.71  -
RNN     328    8.0         0.39  0.39  1.7      216    2.6         0.72  1.05  2.3
VAE     278    7.0         0.35  0.30  1.7      277    2.9         0.61  0.90  2.0
QN      318    5.0         0.23  0.19  0.07     -      -           -     -     -
MG      278    5.0         0.32  0.23  0.50     295    2.9         0.57  0.75  0.51

behaviour might be expected from an accurate probabilistic model, whereas deterministic models, having no randomness and thus no entropy, are fundamentally unable to match the statistics of the natural distribution in all respects.

Since the task-agnostic models in the objective evaluation were trained on joint positions, bone lengths need not be conserved in model output. This can lead to bone-stretching artefacts, and joints may even fly apart; cf. Ling et al. [2020]. Fortunately, bone-length deviation is easy to quantify objectively. Table 3 reports the RMSE of bone length in cm, averaged simultaneously across all joints and time-frames in the test data. We see that the error is small, meaning that bone lengths in MG output are stable and consistent.

5.4 Probabilistic aspects and further experiments

Having evaluated motion quality in depth across tasks, we now present evidence to validate the wide applicability and the probabilistic aspects of the model. To increase the relevance for computer-graphics applications, we here change the pose representation to joint angles and apply the synthesised motion to a skinned character. We note that another option for obtaining skinned characters would be to train on joint positions in a skeleton with virtual joints like in Smith et al. [2019], and then apply inverse kinematics to recover joint angles, although this would add another computational step.

We created a new MoGlow model designed to investigate the ability of the method to learn from diverse motion data and reproduce its distribution. For this model, we constructed a new dataset by pooling the LaFAN1 dataset from Harvey et al.
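The bone-length consistency check is straightforward to state in code. Below is a sketch under our own assumptions (interface and names are ours): for each bone, compare the frame-wise distance between parent and child joints against the skeleton's reference length, and report one RMSE over all bones and frames.

```python
import numpy as np

def bone_length_rmse(joints, bones, ref_lengths):
    """RMSE of bone lengths against reference lengths, over all frames.

    joints: (T, J, 3) joint positions;
    bones: list of (parent, child) joint-index pairs;
    ref_lengths: (B,) reference lengths in the same units (e.g., cm).
    """
    lengths = np.stack([np.linalg.norm(joints[:, c] - joints[:, p], axis=-1)
                        for p, c in bones], axis=1)      # shape (T, B)
    return float(np.sqrt(np.mean((lengths - ref_lengths) ** 2)))
```

A rotation-based model such as QN conserves bone lengths by construction (up to resampling error), which is consistent with its near-zero RMSE in Table 3; position-based models must learn this invariance from data.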
[2020], along with the Kinematica dataset.⁶ We excluded trials involving wall and obstacle interaction as well as dancing, falling, stumbling, fighting, and sitting or lying on the ground. Nonetheless, this new data contains more varied motion than the data from Sec. 4.1, including crouching, hopping, walking while aiming, etc. This yielded a total of 1 h of data at 20 Hz (augmented to 4 h as before). All motion was retargeted to a uniform skeleton and the joint angles were converted to exponential maps [Grassia 1998]. The hips were expressed local to the floor-projected root, similar to before. For the new model, data dropout was reduced to 60%, which proved to generate smooth motion without losing adherence to the control. During synthesis, the raw model output was applied directly to the character, without any post-processing such as foot stabilisation.

⁶The data is available at https://github.com/ubisoft/Ubisoft-LaForge-Animation-Dataset and at https://github.com/Unity-Technologies/Kinematica_Demo, respectively.

ACM Trans. Graph., Vol. 39, No. 4, Article 236. Publication date: November 2020.

As shown in our presentation video and in Fig. 1, we find that MoGlow not only is able to learn to produce high-quality motion from the new data, but that model output also successfully reflects the diversity of the material, and random samples of motion along the same path may take very different forms. MoGlow can thus produce a wide gamut of different motions for fixed control input, as expected for a strong probabilistic model under weak control signals. This is beneficial for increasing variation and naturalness, for example automatically generating sniffing behaviour when the dog is moving slowly.
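The joint-angle preprocessing above stores each rotation as an exponential map. As a concrete illustration, the following sketch converts a unit quaternion to the corresponding exponential-map vector (rotation axis scaled by angle), in the spirit of Grassia [1998]; it is a generic conversion, not the paper's actual preprocessing code, and the function name is ours:

```python
import numpy as np

def quat_to_expmap(q):
    """Convert a unit quaternion (w, x, y, z) to an exponential-map
    rotation vector (axis * angle). Illustrative helper, not from MoGlow."""
    q = np.asarray(q, dtype=float)
    if q[0] < 0.0:          # q and -q encode the same rotation; pick w >= 0
        q = -q              # so that the recovered angle lies in [0, pi]
    w, v = q[0], q[1:]
    s = np.linalg.norm(v)   # |v| = sin(angle / 2)
    if s < 1e-8:
        return np.zeros(3)  # near-identity rotation: axis is arbitrary
    angle = 2.0 * np.arctan2(s, w)
    return (v / s) * angle  # unit axis times rotation angle
```

For example, a 90° rotation about the z-axis, q = (√½, 0, 0, √½), maps to the vector (0, 0, π/2).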
By training a similar model on all the human motion-capture material, excluding only trials with climbing and running on walls, even more varied output was produced, as shown at the very end of our presentation video.

In situations where greater control over motion diversity is desired, this may be obtained by reducing the sampling temperature or by using other, stronger control signals. For example, crouching or crawling motion might be consistently recovered without manual annotation of training data by training models where pelvic distance above ground is a control input instead of a model output.

Nothing about MoGlow is specific to locomotion. The generality of the approach is demonstrated by follow-up work [Alexanderson et al. 2020], performed after the locomotion studies described in this article but published before this article appeared, that shows that MoGlow successfully generalises to synthesising speech-driven gesture motion from speech acoustic features. Since gestures require time to prepare in order to be in synchrony with speech, it was necessary to provide that model with 1 second of future speech. That article also investigates style control of the output motion, which provides another option for constraining motion diversity.

5.5 Drawbacks and limitations

While being a powerful machine-learning method, MoGlow comes with some disadvantages of note in computer-graphics scenarios. Aside from the fact that machine learning affords less direct control over motion than hand animation does (and thus is more suited to high-level style control as mentioned in Sec. 5.4), the most relevant limitations relate to resource use at training and synthesis time.

Training a model like MoGlow demands substantial amounts of data and computation. In many graphics applications, waiting several hours to obtain an updated model is undesirable.
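The sampling-temperature control mentioned in Sec. 5.4 has a very short realisation in flow-based models: scale the standard deviation of the latent Gaussian before inverting the flow, so that temperatures below 1 trade diversity for typicality. The sketch below uses a toy affine map as a stand-in for the flow inverse; the function and argument names are illustrative, not the interface of the MoGlow codebase:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(inverse_flow, dim, temperature=1.0, n=1):
    """Sample from a flow by drawing latents z ~ N(0, temperature^2 I)
    and mapping them through the flow's inverse. Lower temperature
    concentrates samples near the model's typical output."""
    z = temperature * rng.standard_normal((n, dim))
    return inverse_flow(z)

# Toy invertible "flow": the affine map x = 2 z + 3. At temperature 1 the
# outputs have standard deviation 2; at temperature 0.1 they cluster
# tightly around the mean value 3.
hot = sample_with_temperature(lambda z: 2.0 * z + 3.0, dim=1, temperature=1.0, n=10000)
cold = sample_with_temperature(lambda z: 2.0 * z + 3.0, dim=1, temperature=0.1, n=10000)
```

Because the flow itself is unchanged, this diversity knob is available at synthesis time without retraining.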
Iteration time during model development may be sped up by training on multiple GPUs and by using model-surgery techniques [OpenAI et al. 2019] to avoid re-training new architectures from scratch. As for data, the various training and validation curves reported in Alexanderson and Henter [2020] suggest that the MG systems in this article are “data-limited”, and that more training data should improve held-out data likelihood. Aside from recording additional material or pre-training on other motion databases, one might use high-quality data-augmentation techniques like those in Lee et al. [2018] to increase training-set size. This can be seen as a way to inject domain knowledge into the model-creation process.

MoGlow requires that frames are generated in sequence. Since the method describes an entire distribution of plausible poses, models furthermore tend to be deep and large. These properties may complicate interactive applications such as games. In general, it is easier to make good models fast than it is to make fast models good, and we expect it to be entirely possible to speed up MoGlow generation, e.g., using density-distillation techniques like Huang et al. [2020] to create shallower models with similar accuracy as deeper ones. To compress the model footprint, neural-network pruning techniques like those surveyed in Blalock et al. [2020] are a compelling choice.

While MoGlow has performed well on the various motion tasks we have tried it on, we note that it does not contain any explicit physics model. We have seen rare instances of physically inappropriate motion, such as leaning stances where a real character would fall over. Reverse-time augmentation, when used, can give similar issues such as leaning forwards when running backwards at speed.
We expect that these issues can be mitigated by more training data (reducing the need for augmentation), and by providing contact information as an input signal, but it might be more efficient to consider methods for introducing physics directly into the model. MoGlow also does not contain any model of human behaviour and intent, so in the absence of external information to guide the choice of behaviour, model output may switch between diverse locomotion modes and styles in an unstructured manner.

6 CONCLUSION AND FUTURE WORK

We have described the first model of motion-data sequences based on normalising flows. This paradigm is attractive because flows 1) are probabilistic (unlike many established motion models), 2) utilise powerful, implicitly defined distributions (like GANs, but unlike classical autoregressive models), yet 3) are trained to directly maximise the exact data likelihood (unlike GANs and VAEs). Our model uses both autoregression and a hidden state (recurrence) to generate output sequentially, and incorporates a control scheme without algorithmic latency. (Non-causal control is a straightforward extension.) To our knowledge, no other Glow-based sequence models combine these desirable traits, and no other such model has incorporated hidden states, nor data dropout for more consistent control. Moreover, our approach is probabilistic from the ground up and generates convincing samples without entropy-reduction schemes like those in Brock et al. [2019]; Greenwood et al. [2017a,b]; Henter and Kleijn [2016]. Experimental evaluations show that the model produces high-quality synthetic locomotion for both bipedal and quadrupedal motion-capture data, despite their disparate morphologies. Subjective and objective results show that our proposal significantly outperforms task-agnostic LSTM- and VAE-based approaches, coming close to natural motion recordings and performing on par with task-specific state-of-the-art locomotion models.
In light of the quality of the synthesised motion and the generally-applicable nature of the approach, we believe that models based on normalising flows can prove valuable for a wide variety of tasks incorporating motion data. Future work includes applying the method to additional tasks and domains, and making models lighter and faster for applied scenarios. Since models based on normalising flows allow exact and tractable inference, another interesting application would be to use the probabilities inferred by these models to also enable classification.

ACKNOWLEDGMENTS

This research was partially supported by Swedish Research Council proj. 2018-05409 (StyleBot), Swedish Foundation for Strategic Research contract no. RIT15-0107 (EACare), and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

REFERENCES

Simon Alexanderson and Gustav Eje Henter. 2020. Robust model training and generalisation with Studentising flows. In Proceedings of the Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models (INNF+'20, Vol. 2). Article 15, 9 pages. https://arxiv.org/abs/2006.06599
Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. 2020. Style-controllable speech-driven gesture synthesis using normalising flows. Comput. Graph. Forum 39, 2 (2020), 487–496. https://doi.org/10.1111/cgf.13946
Okan Arikan and David A. Forsyth. 2002. Interactive motion generation from examples. ACM Trans. Graph. 21, 3 (2002), 483–490. https://doi.org/10.1145/566570.566606
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems (NIPS'15).
Curran Associates, Inc., Red Hook, NY, USA, 1171–1179. http://papers.nips.cc/paper/5956-scheduled-sampling-for-sequence-prediction-with-recurrent-neural-networks
Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. 2020. What is the state of neural network pruning? In Proceedings of the Conference on Machine Learning and Systems (MLSys'20). 129–146. https://proceedings.mlsys.org/paper/2020/hash/d2ddea18f00665ce8623e36bd4e3c7c5-Abstract.html
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of the SIGNLL Conference on Computational Natural Language Learning (CoNLL'16). ACL, Berlin, Germany, 10–21. https://doi.org/10.18653/v1/K16-1002
Matthew Brand and Aaron Hertzmann. 2000. Style machines. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH'00). ACM Press/Addison-Wesley Publishing Co., USA, 183–192. https://doi.org/10.1145/344779.344865
Christoph Bregler. 1997. Learning and recognizing human dynamics in video sequences. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'97). IEEE Computer Society, Los Alamitos, CA, USA, 568–574. https://doi.org/10.1109/CVPR.1997.609382
Andrew Brock, Jeff Donahue, and Karen Simonyan. 2019. Large scale GAN training for high fidelity natural image synthesis. In Proceedings of the International Conference on Learning Representations (ICLR'19). 35. https://openreview.net/forum?id=B1xsqj09Fm
Judith Bütepage, Michael J. Black, Danica Kragic, and Hedvig Kjellström. 2017. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'17). IEEE Computer Society, Los Alamitos, CA, USA, 1591–1599.
https://doi.org/10.1109/CVPR.2017.173
Jinxiang Chai and Jessica K. Hodgins. 2005. Performance animation from low-dimensional control signals. ACM Trans. Graph. 24, 3 (2005), 686–696. https://doi.org/10.1145/1073204.1073248
Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2017. Variational lossy autoencoder. In Proceedings of the International Conference on Learning Representations (ICLR'17). 17. https://openreview.net/forum?id=BysvGP5ee
CMU Graphics Lab. 2003. Carnegie Mellon University motion capture database. http://mocap.cs.cmu.edu/
Chris Cremer, Xuechen Li, and David Duvenaud. 2018. Inference suboptimality in variational autoencoders. In Proceedings of the International Conference on Machine Learning (ICML'18). PMLR, 1078–1086. http://proceedings.mlr.press/v80/cremer18a.html
Gustavo Deco and Wilfried Brauer. 1994. Higher order statistical decorrelation without information loss. In Advances in Neural Information Processing Systems (NIPS'94). MIT Press, Cambridge, MA, USA, 247–254. https://papers.nips.cc/paper/901-higher-order-statistical-decorrelation-without-information-loss
Chuang Ding, Pengcheng Zhu, and Lei Xie. 2015. BLSTM neural networks for speech driven head motion synthesis. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH'15). ISCA, Grenoble, France, 3345–3349. https://www.isca-speech.org/archive/interspeech_2015/i15_3345.html
Laurent Dinh, David Krueger, and Yoshua Bengio. 2015. NICE: Non-linear independent components estimation. In Proceedings of the International Conference on Learning Representations, Workshop Track (ICLR'15). 13.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. 2017. Density estimation using Real NVP. In Proceedings of the International Conference on Learning Representations (ICLR'17). 32.
https://openreview.net/forum?id=HkpbnH9lx
Ylva Ferstl, Michael Neff, and Rachel McDonnell. 2019. Multi-objective adversarial gesture generation. In Proceedings of the ACM SIGGRAPH Conference on Motion, Interaction and Games (MIG'19). ACM, New York, NY, USA, Article 3, 10 pages. https://doi.org/10.1145/3359566.3360053
Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV'15). IEEE Computer Society, Los Alamitos, CA, USA, 4346–4354. https://doi.org/10.1109/ICCV.2015.494
Ian Goodfellow. 2016. NIPS 2016 tutorial: Generative adversarial networks. arXiv:1701.00160
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS'14). Curran Associates, Inc., Red Hook, NY, USA, 2672–2680. http://papers.nips.cc/paper/5423-generative-adversarial-nets
F. Sebastian Grassia. 1998. Practical parameterization of rotations using the exponential map. J. Graph. Tools 3, 3 (1998), 29–48. https://doi.org/10.1080/10867651.1998.10487493
Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv:1308.0850
David Greenwood, Stephen Laycock, and Iain Matthews. 2017a. Predicting head pose from speech with a conditional variational autoencoder. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH'17). ISCA, Grenoble, France, 3991–3995. https://doi.org/10.21437/Interspeech.2017-894
David Greenwood, Stephen Laycock, and Iain Matthews. 2017b. Predicting head pose in dyadic conversation. In Proceedings of the International Conference on Intelligent Virtual Agents (IVA'17). Springer, Cham, Switzerland, 160–169. https://doi.org/10.1007/978-3-319-67401-8_18
Keith Grochow, Steven L.
Martin, Aaron Hertzmann, and Zoran Popović. 2004. Style-based inverse kinematics. ACM Trans. Graph. 23, 3 (2004), 522–531. https://doi.org/10.1145/1015706.1015755
Ikhansul Habibie, Daniel Holden, Jonathan Schwarz, Joe Yearsley, and Taku Komura. 2017. A recurrent variational autoencoder for human motion synthesis. In Proceedings of the British Machine Vision Conference (BMVC'17). BMVA Press, Durham, UK, Article 119, 12 pages. https://doi.org/10.5244/C.31.119
Félix G. Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. 2020. Robust motion in-betweening. ACM Trans. Graph. 39, 4, Article 60 (2020), 12 pages. https://doi.org/10.1145/3386569.3392480
Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, and Kazuhiko Sumi. 2018. Evaluation of speech-to-gesture generation using bi-directional LSTM network. In Proceedings of the ACM International Conference on Intelligent Virtual Agents (IVA'18). ACM, New York, NY, USA, 79–86. https://doi.org/10.1145/3267851.3267878
Gustav Eje Henter and W. Bastiaan Kleijn. 2016. Minimum entropy rate simplification of stochastic processes. IEEE T. Pattern Anal. 38, 12 (2016), 2487–2500. https://doi.org/10.1109/TPAMI.2016.2533382
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2016. beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR'16). 22. https://openreview.net/forum?id=Sy2fzU9gl
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput. 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-functioned neural networks for character control. ACM Trans. Graph. 36, 4, Article 42 (2017), 13 pages. https://doi.org/10.1145/3072959.3073663
Daniel Holden, Jun Saito, and Taku Komura. 2016.
A deep learning framework for character motion synthesis and editing. ACM Trans. Graph. 35, 4, Article 138 (2016), 11 pages. https://doi.org/10.1145/2897824.2925975
Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. 2015. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs (SA'15). ACM, New York, NY, USA, Article 18, 4 pages. https://doi.org/10.1145/2820903.2820918
Chin-Wei Huang, Faruk Ahmed, Kundan Kumar, Alexandre Lacoste, and Aaron Courville. 2020. Probability distillation: A caveat and alternatives. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI'20, Vol. 115). PMLR, 1212–1221. http://proceedings.mlr.press/v115/huang20c.html
Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. 2018. Neural autoregressive flows. In Proceedings of the International Conference on Machine Learning (ICML'18). PMLR, 2078–2087. http://proceedings.mlr.press/v80/huang18d.html
Ferenc Huszár. 2017. Is maximum likelihood useful for representation learning? http://www.inference.vc/maximum-likelihood-for-representation-learning-2/
Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML'15). PMLR, 448–456. http://proceedings.mlr.press/v37/ioffe15.html
Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, and Paavo Alku. 2019. GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH'19). ISCA, Grenoble, France, 694–698.
https://doi.org/10.21437/Interspeech.2019-2008
Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient neural audio synthesis. In Proceedings of the International Conference on Machine Learning (ICML'18). PMLR, 2410–2419. http://proceedings.mlr.press/v80/kalchbrenner18a.html
Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graph. 36, 4, Article 94 (2017), 12 pages. https://doi.org/10.1145/3072959.3073658
Sungwon Kim, Sang-Gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. 2019. FloWaveNet: A generative flow for raw audio. In Proceedings of the International Conference on Machine Learning (ICML'19). PMLR, 3370–3378. http://proceedings.mlr.press/v97/kim19b.html
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR'15). 15. http://arxiv.org/abs/1412.6980
Diederik P. Kingma and Prafulla Dhariwal. 2018. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems (NeurIPS'18). Curran Associates, Inc., Red Hook, NY, USA, 10236–10245. http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions
Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR'14). 14. http://arxiv.org/abs/1312.6114
Lucas Kovar and Michael Gleicher. 2004. Automated extraction and parameterization of motions in large data sets. ACM Trans. Graph. 23, 3 (2004), 559–568. https://doi.org/10.1145/1015706.1015760
Lucas Kovar, Michael Gleicher, and Frédéric Pighin. 2002. Motion graphs. ACM Trans. Graph. 21, 3 (2002), 473–482.
https://doi.org/10.1145/566654.566605
Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, and Hedvig Kjellström. 2019. Analyzing input and output representations for speech-driven gesture generation. In Proceedings of the ACM International Conference on Intelligent Virtual Agents (IVA'19). ACM, New York, NY, USA, 97–104. https://doi.org/10.1145/3308532.3329472
Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. 2020. VideoFlow: A conditional flow-based model for stochastic video generation. In Proceedings of the International Conference on Learning Representations (ICLR'20). 18. https://openreview.net/forum?id=rJgUfTEYvH
Neil Lawrence. 2005. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. J. Mach. Learn. Res. 6, Nov. (2005), 1783–1816. http://www.jmlr.org/papers/v6/lawrence05a.html
Kyungho Lee, Seyoung Lee, and Jehee Lee. 2018. Interactive character animation by learning multi-objective control. ACM Trans. Graph. 37, 6, Article 180 (2018), 10 pages. https://doi.org/10.1145/3272127.3275071
Sergey Levine, Jack M. Wang, Alexis Haraux, Zoran Popović, and Vladlen Koltun. 2012. Continuous character control with low-dimensional embeddings. ACM Trans. Graph. 31, 4, Article 28 (2012), 10 pages. https://doi.org/10.1145/2185520.2185524
Hung Yu Ling, Fabio Zinno, George Cheng, and Michiel van de Panne. 2020. Character controllers using motion VAEs. ACM Trans. Graph. 39, 4, Article 40 (2020), 12 pages. https://doi.org/10.1145/3386569.3392422
Peng Liu, Xixin Wu, Shiyin Kang, Guangzhi Li, Dan Su, and Dong Yu. 2019. Maximizing mutual information for Tacotron. arXiv:1909.01145
Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. 2018. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems (NeurIPS'18).
Curran Associates, Inc., Red Hook, NY, USA, 700–709. http://papers.nips.cc/paper/7350-are-gans-created-equal-a-large-scale-study
Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. 2018. Which training methods for GANs do actually converge? In Proceedings of the International Conference on Machine Learning (ICML'18). PMLR, 3481–3490. http://proceedings.mlr.press/v80/mescheder18a.html
Shakir Mohamed and Balaji Lakshminarayanan. 2016. Learning in implicit generative models. arXiv:1610.03483
Tomohiko Mukai and Shigeru Kuriyama. 2005. Geostatistical motion interpolation. ACM Trans. Graph. 24, 3 (2005), 1062–1070. https://doi.org/10.1145/1073204.1073313
Meinard Müller, Tido Röder, Michael Clausen, Bernhard Eberhardt, Björn Krüger, and Andreas Weber. 2007. Documentation Mocap Database HDM05. Technical Report CG-2007-2. Universität Bonn, Bonn, Germany. http://resources.mpi-inf.mpg.de/HDM05/07_MuRoClEbKrWe_HDM05.pdf
Kevin P. Murphy. 1998. Switching Kalman Filters. Technical Report 98-10. Compaq Cambridge Research Lab, Cambridge, MA, USA. https://www.cs.ubc.ca/~murphyk/Papers/skf.ps.gz
Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. 2019. Do deep generative models know what they don't know? In Proceedings of the International Conference on Learning Representations (ICLR'19). 19. https://openreview.net/forum?id=H1xwNhCcYm
OpenAI et al. 2019. Dota 2 with large scale deep reinforcement learning. arXiv:1912.06680
George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2019. Normalizing flows for probabilistic modeling and inference. arXiv:1912.02762
Dario Pavllo, David Grangier, and Michael Auli. 2018. QuaterNet: A quaternion-based recurrent model for human motion. In Proceedings of the British Machine Vision Conference (BMVC'18). BMVA Press, Durham, UK, 14.
http://www.bmva.org/bmvc/2018/contents/papers/0675.pdf
Vladimir Pavlović, James M. Rehg, and John MacCormick. 2000. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems (NIPS'00). MIT Press, Cambridge, MA, USA, 981–987. https://papers.nips.cc/paper/1892-learning-switching-linear-models-of-human-motion
Hai X. Pham, Yuting Wang, and Vladimir Pavlovic. 2018. Generative adversarial talking head: Bringing portraits to life with a weakly supervised neural network. arXiv:1803.07716
Ryan Prenger, Rafael Valle, and Bryan Catanzaro. 2019. WaveGlow: A flow-based generative network for speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'19). IEEE Signal Processing Society, Piscataway, NJ, USA, 3617–3621. https://doi.org/10.1109/ICASSP.2019.8683143
Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. 2018. GANimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV'18). Springer, Cham, Switzerland, 835–851. https://doi.org/10.1007/978-3-030-01249-6_50
Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2 (1989), 257–286. https://doi.org/10.1109/5.18626
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning (ICML'14). PMLR, 1278–1286. http://proceedings.mlr.press/v32/rezende14.html
Charles Rose, Michael F. Cohen, and Bobby Bodenheimer. 1998. Verbs and adverbs: Multidimensional motion interpolation. IEEE Comput. Graph. 18, 5 (1998), 32–40. https://doi.org/10.1109/38.708559
Paul Rubenstein. 2019. Variational autoencoders are not autoencoders.
http://paulrubenstein.co.uk/variational-autoencoders-are-not-autoencoders/
Najmeh Sadoughi and Carlos Busso. 2018. Novel realizations of speech-driven head movements with generative adversarial networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'18). IEEE Signal Processing Society, Piscataway, NJ, USA, 6169–6173. https://doi.org/10.1109/ICASSP.2018.8461967
Najmeh Sadoughi and Carlos Busso. 2019. Speech-driven animation with meaningful behaviors. Speech Commun. 110 (2019), 90–100. https://doi.org/10.1016/j.specom.2019.04.005
Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. 2017. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In Proceedings of the International Conference on Learning Representations (ICLR'17). 10. https://openreview.net/forum?id=BJrFC6ceg
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'18). IEEE Signal Processing Society, Piscataway, NJ, USA, 4779–4783. https://doi.org/10.1109/ICASSP.2018.8461368
Harrison Jesse Smith, Chen Cao, Michael Neff, and Yingying Wang. 2019. Efficient neural networks for real-time motion style transfer. Proceedings of the ACM on Computer Graphics and Interactive Techniques 2, 2, Article 13 (2019), 17 pages. https://doi.org/10.1145/3340254
Sebastian Starke, Yiwei Zhao, Taku Komura, and Kazi Zaman. 2020. Local motion phases for learning multi-contact character movements. ACM Trans. Graph. 39, 4, Article 54 (2020), 14 pages. https://doi.org/10.1145/3386569.3392450
Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman.
2017. Synthesizing Obama: Learning lip sync from audio. ACM Trans. Graph. 36, 4, Article 95 (2017), 13 pages. https://doi.org/10.1145/3072959.3073640
Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara. 2018. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'18). IEEE Signal Processing Society, Piscataway, NJ, USA, 4784–4788. https://doi.org/10.1109/ICASSP.2018.8461829
Graham W. Taylor and Geoffrey E. Hinton. 2009. Factored conditional restricted Boltzmann machines for modeling motion style. In Proceedings of the International Conference on Machine Learning (ICML'09). 1025–1032. https://icml.cc/Conferences/2009/papers/178.pdf
Graham W. Taylor, Geoffrey E. Hinton, and Sam T. Roweis. 2011. Two distributed-state models for generating high-dimensional time series. J. Mach. Learn. Res. 12, 28 (2011), 1025–1068. http://jmlr.org/papers/v12/taylor11a.html
Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. 2017. A deep learning approach for generalized speech animation. ACM Trans. Graph. 36, 4, Article 93 (2017), 11 pages. https://doi.org/10.1145/3072959.3073699
Benigno Uria, Iain Murray, Steve Renals, Cassia Valentini-Botinhao, and John Bridle. 2015. Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'15). IEEE Signal Processing Society, Piscataway, NJ, USA, 4465–4469.
https://doi.org/10.1109/ICASSP.2015.7178815
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv:1609.03499
Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems (NIPS'17). Curran Associates, Inc., Red Hook, NY, USA, 6306–6315. http://papers.nips.cc/paper/7210-neural-discrete-representation-learning
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS'17). Curran Associates, Inc., Red Hook, NY, USA, 5998–6008. https://papers.nips.cc/paper/7181-attention-is-all-you-need
Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. 2018. End-to-end speech-driven facial animation with temporal GANs. In Proceedings of the British Machine Vision Conference (BMVC'18). BMVA Press, Durham, UK, 12. http://www.bmva.org/bmvc/2018/contents/papers/0539.pdf
Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. 2020. Realistic speech-driven facial animation with GANs. Int. J. Comput. Vis. 128, 5 (2020), 1398–1413. https://doi.org/10.1007/s11263-019-01251-8
Jack M. Wang, David J. Fleet, and Aaron Hertzmann. 2008. Gaussian process dynamical models for human motion. IEEE T. Pattern Anal. 30, 2 (2008), 283–298. https://doi.org/10.1109/TPAMI.2007.1167
Xin Wang, Shinji Takaki, and Junichi Yamagishi. 2018. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE/ACM T. Audio Speech 26, 8 (2018), 1406–1419. https://doi.org/10.1109/TASLP.2018.2828650
Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J.
W eiss, Navdeep Jaitly , Zongheng Y ang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Y annis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. 2017. Tacotron: Towards end-to- end speech synthesis. In Proceedings of the A nnual Conference of the International Speech Communication Association (IN TERSPEECH’17) . ISCA, Grenoble, France, 4006–4010. https://doi.org/10.21437/Interspeech.2017- 1452 Zhiyong W ang, Jinxiang Chai, and Shihong Xia. 2019. Combining Recurrent Neural Networks and Adv ersarial Training for Human Motion Synthesis and Control. IEEE T . Vis. Comput. Gr. (2019), 14. https://doi.org/10.1109/T V CG.2019.2938520 Greg W elch and Gar y Bishop. 1995. An Introduction to the Kalman Filter . Te chnical Report 95-041. Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA. https://techreports.cs.unc.edu/papers/95- 041.pdf Y oungwoo Y oon, W oo-Ri Ko, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2019. Robots learn social skills: End-to-end learning of co-spe ech gesture generation for humanoid robots. In Procee dings of the IEEE International Conference on Robotics and Automation (ICRA ’19) . IEEE Robotics and Automation Society , Piscataway, NJ, USA, 4303–4309. https://doi.org/10.1109/ICRA.2019.8793720 G. Udny Y ule. 1927. On a method of investigating periodicities disturbed series, with special reference to Wolfer’s sunspot numbers. Philos. T . R. Soc. Lond. 226, 636–646 (1927), 267–298. https://doi.org/10.1098/rsta.1927.0007 Heiga Zen and Andrew Senior . 2014. Deep mixture density networks for acoustic modeling in statistical parametric spe ech synthesis. In Proceedings of the IEEE Inter- national Conference on Acoustics, Speech, and Signal Processing (ICASSP ’14) . IEEE Signal Processing Society , Piscataway, NJ, USA, 3844–3848. https://doi.org/10.1109/ ICASSP.2014.6854321 Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. 2019. Fixup initialization: Re- sidual learning without normalization. 
In Proceedings of the International Confer- ence on Learning Representations (ICLR’19) . 16. https://op enre view .net/forum?id= H1gsz30cKX He Zhang, Sebastian Starke, T aku Komura, and Jun Saito. 2018. Mo de-adaptive neural networks for quadruped motion control. ACM Trans. Graph. 37, 4, Article 145 (2018), 11 pages. https://doi.org/10.1145/3197517.3201366 Yi Zhou, Zimo Li, Shuangjiu Xiao, Chong He, Zeng Huang, and Hao Li. 2018. Auto- conditioned recurrent networks for extended complex human motion synthesis. In Proceedings of the International Conference on Learning Representations (ICLR’18) . 13. https://openreview .net/forum?id=r11Q2SlRW ACM T rans. Graph., V ol. 39, No. 4, Article 236. Publication date: November 2020.