3D-Aware Scene Manipulation via Inverse Graphics


Authors: Shunyu Yao, Tzu-Ming Harry Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, William T. Freeman, Joshua B. Tenenbaum

Shunyu Yao* (IIIS, Tsinghua University), Tzu-Ming Harry Hsu* (MIT CSAIL), Jun-Yan Zhu (MIT CSAIL), Jiajun Wu (MIT CSAIL), Antonio Torralba (MIT CSAIL), William T. Freeman (MIT CSAIL, Google Research), Joshua B. Tenenbaum (MIT CSAIL)

Abstract

We aim to obtain an interpretable, expressive, and disentangled scene representation that contains comprehensive structural and textural information for each object. Previous scene representations learned by neural networks are often uninterpretable, limited to a single object, or lacking 3D knowledge. In this work, we propose 3D scene de-rendering networks (3D-SDN) to address the above issues by integrating disentangled representations for semantics, geometry, and appearance into a deep generative model. Our scene encoder performs inverse graphics, translating a scene into a structured object-wise representation. Our decoder has two components: a differentiable shape renderer and a neural texture generator. The disentanglement of semantics, geometry, and appearance supports 3D-aware scene manipulation, e.g., rotating and moving objects freely while keeping consistent shape and texture, and changing the object appearance without affecting its shape. Experiments demonstrate that our editing scheme based on 3D-SDN is superior to its 2D counterpart.

1 Introduction

Humans are incredible at perceiving the world, but more distinguishing is our mental ability to simulate and imagine what will happen. Given a street scene as in Fig. 1, we can effortlessly detect and recognize cars and their attributes, and, more interestingly, imagine how cars may move and rotate in the 3D world. Motivated by such human abilities, in this work we seek to obtain an interpretable, expressive, and disentangled scene representation for machines, and to employ the learned representation for flexible, 3D-aware scene manipulation.

* indicates equal contributions. The work was done when Shunyu Yao was a visiting student at MIT CSAIL.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: We propose to learn a holistic scene representation that encodes scene semantics as well as 3D and textural information. Our encoder-decoder framework learns disentangled representations for image reconstruction and 3D-aware image editing. For example, we can move cars to various locations with new 3D poses.
Deep generative models have led to remarkable breakthroughs in learning hierarchical representations of images and decoding them back to images [Goodfellow et al., 2014]. However, the obtained representation is often limited to a single object, hard to interpret, and missing the complex 3D structure behind the visual input. As a result, these deep generative models cannot support manipulation tasks such as moving an object around as in Fig. 1. On the other hand, computer graphics engines use a predefined, structured, and disentangled input (e.g., graphics code) that is often intuitive for scene manipulation. However, it is in general intractable to infer the graphics code from an input image.

In this paper, we propose 3D scene de-rendering networks (3D-SDN) to incorporate an object-based, interpretable scene representation into a deep generative model. Our method employs an encoder-decoder architecture and has three branches: one for scene semantics, one for object geometry and 3D pose, and one for the appearance of objects and the background. As shown in Fig. 2, the semantic de-renderer aims to learn the semantic segmentation of a scene. The geometric de-renderer learns to infer the object shape and 3D pose with the help of a differentiable shape renderer. The textural
de-renderer learns to encode the appearance of each object and background segment. We then employ the geometric renderer and the textural renderer to recover the input scene using the above semantic, geometric, and textural information.

Disentangling 3D geometry and pose from texture enables 3D-aware scene manipulation. For example in Fig. 1, to move a car closer, we can edit its position and 3D pose but leave its texture representation untouched. Both quantitative and qualitative results demonstrate the effectiveness of our method on two datasets: Virtual KITTI [Gaidon et al., 2016] and Cityscapes [Cordts et al., 2016]. Furthermore, we create an image editing benchmark on Virtual KITTI to evaluate our editing scheme against 2D baselines. Finally, we investigate our model design by evaluating the accuracy of the obtained internal representation. Please check out our code and website for more details.

2 Related Work

Interpretable image representation. Our work is inspired by prior work on obtaining interpretable visual representations with neural networks [Kulkarni et al., 2015, Chen et al., 2016]. To achieve this goal, DC-IGN [Kulkarni et al., 2015] freezes a subset of latent codes while feeding images that move along a specific direction on the image manifold. A recurrent model [Yang et al., 2015] learns to alter disentangled latent factors for view synthesis. InfoGAN [Chen et al., 2016] proposes to disentangle an image into independent factors without supervised data. Another line of approaches is built on intrinsic image decomposition [Barrow and Tenenbaum, 1978] and has shown promising results on faces [Shu et al., 2017] and objects [Janner et al., 2017]. While prior work focuses on a single object, we aim to obtain a holistic scene understanding. Our work most resembles the paper by Wu et al.
[2017a], who propose to 'de-render' an image with an encoder-decoder framework that uses a neural network as the encoder and a graphics engine as the decoder. However, their method cannot back-propagate gradients from the graphics engine or generalize to a new environment, and their results were limited to simple game environments such as Minecraft. Unlike Wu et al. [2017a], both our encoder and decoder are differentiable, making it possible to handle more complex natural images.

Deep generative models. Deep generative models [Goodfellow et al., 2014] have been used to synthesize realistic images and learn rich internal representations. Representations learned by these methods are typically hard for humans to interpret and understand, often ignoring the 3D nature of our visual world. Many recent papers have explored the problem of 3D reconstruction from a single color image, depth map, or silhouette [Choy et al., 2016, Kar et al., 2015, Tatarchenko et al., 2016, Tulsiani et al., 2017, Wu et al., 2017b, 2016b, Yan et al., 2016b, Soltani et al., 2017]. Our model builds upon and extends these approaches. We infer the 3D object geometry with neural nets and re-render the shapes into 2D with a differentiable renderer. This improves the quality of the generated results and allows 3D-aware scene manipulation.

Figure 2: Framework overview. The de-renderer (encoder) consists of a semantic, a textural, and a geometric branch. The textural renderer and geometric renderer then learn to reconstruct the original image from the representations obtained by the encoder modules.

Deep image manipulation. Learning-based methods have enabled various image editing tasks, such as style transfer [Gatys et al., 2016], image-to-image translation [Isola et al.
, 2017, Zhu et al., 2017a, Liu et al., 2017], automatic colorization [Zhang et al., 2016], inpainting [Pathak et al., 2016], attribute editing [Yan et al., 2016a], interactive editing [Zhu et al., 2016], and denoising [Gharbi et al., 2016]. Different from prior work that operates in a 2D setting, our model allows 3D-aware image manipulation. Besides, while the above methods often require a given structured representation (e.g., a label map [Wang et al., 2018]) as input, our algorithm learns an internal representation suitable for image editing by itself. Our work is also inspired by previous semi-automatic 3D editing systems [Karsch et al., 2011, Chen et al., 2013, Kholgade et al., 2014]. While these systems require human annotations of object geometry and scene layout, our method is fully automatic.

3 Method

We propose 3D scene de-rendering networks (3D-SDN) in an encoder-decoder framework. As shown in Fig. 2, we first de-render (encode) an image into disentangled representations for semantic, textural, and geometric information. Then, a renderer (decoder) reconstructs the image from the representation.

The semantic de-renderer learns to produce the semantic segmentation (e.g., trees, sky, road) of the input image. The 3D geometric de-renderer detects and segments objects (cars and vans) from the image, and infers the geometry and 3D pose of each object with a differentiable shape renderer. After inference, the geometric renderer computes an instance map, a pose map, and normal maps for the objects in the scene for the textural branch. The textural de-renderer first fuses the semantic map generated by the semantic branch and the instance map generated by the geometric branch into an instance-level semantic label map, and learns to encode the color and texture of each instance (object or background semantic class) into a texture code.
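The fusion step just described, overlaying the geometric branch's instance map on the semantic map with instances taking priority, can be sketched in a few lines of numpy. This is a toy illustration under assumed conventions, not the paper's code: the array layout, the `OBJECT_CLASS` ID, and the function name are all hypothetical.

```python
import numpy as np

OBJECT_CLASS = 99  # hypothetical class ID for foreground objects (cars/vans)

def fuse_label_maps(semantic_map, instance_map):
    """Fuse a per-pixel semantic map with an object instance map into an
    instance-level label map, resolving conflicts in favor of instances.

    semantic_map: (H, W) int array of background class IDs.
    instance_map: (H, W) int array of object instance IDs (0 = no object).
    Returns an (H, W, 2) array: channel 0 holds the class label,
    channel 1 the instance ID (0 for background classes).
    """
    fused = np.stack([semantic_map.copy(),
                      np.zeros_like(semantic_map)], axis=-1)
    obj = instance_map > 0            # pixels claimed by a detected object
    fused[obj, 0] = OBJECT_CLASS      # the instance map wins over semantics
    fused[obj, 1] = instance_map[obj]
    return fused

# toy 2x3 scene: semantic classes {1: road, 2: sky}, one object instance 5
sem = np.array([[2, 2, 2],
                [1, 1, 1]])
ins = np.array([[0, 0, 0],
                [0, 5, 5]])
fused = fuse_label_maps(sem, ins)
```

The priority rule mirrors the panoptic-style fusion cited later in Sec. 3.2: wherever an instance mask overlaps a background class, the pixel is relabeled as that instance.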
Finally, the textural renderer combines the instance-wise label map (from the textural de-renderer), texture codes (from the textural de-renderer), and 3D information (instance, normal, and pose maps from the geometric branch) to reconstruct the input image.

3.1 3D Geometric Inference

Fig. 3 shows the 3D geometric inference module of the 3D-SDN. We first segment object instances with Mask R-CNN [He et al., 2017]. For each object, we infer its 3D mesh model and other attributes from its masked image patch and bounding box.

Figure 3: 3D geometric inference. Given a masked object image and its bounding box, the geometric branch of the 3D-SDN predicts the object's mesh model, scale, rotation, translation, and the free-form deformation (FFD) coefficients. We then compute 3D information (instance map, normal maps, and pose map) using a differentiable renderer [Kato et al., 2018].

3D estimation. We describe a 3D object with a mesh $M$, its scale $s \in \mathbb{R}^3$, its rotation, a unit quaternion $q \in \mathbb{R}^4$, and its translation $t \in \mathbb{R}^3$. In most real-world scenarios such as road scenes, objects lie on the ground, so the quaternion has only one rotational degree of freedom, i.e., $q \in \mathbb{R}$. As shown in Fig. 3, given an object's masked image and estimated bounding box, the geometric de-renderer learns to predict the mesh $M$ by first selecting a mesh from eight candidate shapes and then applying a free-form deformation (FFD) [Sederberg and Parry, 1986] with inferred grid point coordinates $\phi$. It also predicts the scale, rotation, and translation of the 3D object. Below we describe the training objective for the network.

3D attribute prediction loss.
The geometric de-renderer directly predicts the values of scale $s$ and rotation $q$. For translation $t$, it instead predicts the object's distance to the camera $t$ and the image-plane 2D coordinates of the object's 3D center, denoted $[x_{3D}, y_{3D}]$. Given the intrinsic camera matrix, we can calculate $t$ from $t$ and $[x_{3D}, y_{3D}]$. We parametrize $t$ in log-space [Eigen et al., 2014]. As determining $t$ from the image patch of the object is under-constrained, our model predicts a normalized distance $\tau = t\sqrt{wh}$, where $[w, h]$ are the width and height of the bounding box. This reparameterization improves results, as shown in later experiments (Sec. 4.2). For $[x_{3D}, y_{3D}]$, we follow prior work [Ren et al., 2015] and predict the offset $e = [(x_{3D} - x_{2D})/w, (y_{3D} - y_{2D})/h]$ relative to the estimated bounding box center $[x_{2D}, y_{2D}]$. The 3D attribute prediction loss for scale, rotation, and translation is

$\mathcal{L}_{\mathrm{pred}} = \|\log \tilde{s} - \log s\|_2^2 + \left(1 - (\tilde{q} \cdot q)^2\right) + \|\tilde{e} - e\|_2^2 + (\log \tilde{\tau} - \log \tau)^2$, (1)

where $\tilde{\cdot}$ denotes the predicted attributes.

Reprojection consistency loss. We also use a reprojection loss to ensure that the 2D rendering of the predicted shape fits its silhouette $S$ [Yan et al., 2016b, Rezende et al., 2016, Wu et al., 2016a, 2017b]. Fig. 4a and Fig. 4b show an example. Note that for mesh selection and deformation, the reprojection loss is the only training signal, as we do not have a ground-truth mesh model. We use a differentiable renderer [Kato et al., 2018] to render the 2D silhouette of a 3D mesh $M$, according to the FFD coefficients $\phi$ and the object's scale, rotation, and translation $\tilde{\pi} = \{\tilde{s}, \tilde{q}, \tilde{t}\}$: $\tilde{S} = \mathrm{RenderSilhouette}(\mathrm{FFD}_\phi(M), \tilde{\pi})$. We then calculate the reprojection loss as $\mathcal{L}_{\mathrm{reproj}} = \|\tilde{S} - S\|$. We ignore regions occluded by other objects.

3D model selection via REINFORCE.
We choose the mesh $M$ from a set of eight meshes to minimize the reprojection loss. As the mesh selection process is non-differentiable, we formulate it as a reinforcement learning problem and adopt a multi-sample REINFORCE paradigm [Williams, 1992]: the network predicts a multinomial distribution over the mesh models, and we use the negative reprojection loss as the reward. We experimented with a single mesh without FFD in Fig. 4c. Fig. 4d shows a significant improvement when the geometric branch learns to select from multiple candidate meshes and allows flexible deformation.

Figure 4: (a)(b) Reprojection consistency loss: object silhouettes rendered without and with the reprojection consistency loss. (c)(d) Multiple CAD models and free-form deformation (FFD): in (c), a generic car model without FFD fails to represent the input vans. In (d), our model learns to choose the best-fitting mesh from eight candidate meshes and allows FFD. As a result, we can reconstruct the silhouettes more precisely.

3.2 Semantic and Textural Inference

The semantic branch of the 3D-SDN uses a semantic segmentation model, DRN [Yu et al., 2017, Zhou et al., 2017], to obtain a semantic map of the input image. The textural branch first obtains an instance-wise semantic label map $L$ by combining the semantic map generated by the semantic branch and the instance map generated by the geometric branch, resolving any conflict in favor of the instance map [Kirillov et al., 2018]. Built on recent work on multimodal image-to-image translation [Zhu et al., 2017b, Wang et al., 2018], our textural branch encodes the texture of each instance into a low-dimensional latent code, so that the textural renderer can later reconstruct the appearance of the original instance from the code.
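The multi-sample REINFORCE mesh selection from Sec. 3.1 above can be illustrated with a minimal numpy sketch. This is our own toy stand-in, not the released implementation: the logits, learning rate, sample count, and the tabulated reprojection losses are invented for illustration. The score-function estimator weights the gradient of $\log \pi(m)$ by the advantage of each sampled mesh, using the multi-sample mean reward as a baseline:

```python
import numpy as np

rng = np.random.default_rng(0)
N_MESHES = 8  # eight candidate CAD meshes, as in the paper

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reinforce_grad(logits, reproj_loss, n_samples=16):
    """Multi-sample REINFORCE gradient estimate w.r.t. mesh-selection logits.

    logits: (N_MESHES,) unnormalized scores over candidate meshes.
    reproj_loss: callable mesh_index -> reprojection loss; the reward is
    its negative. The mean reward over samples serves as a baseline.
    """
    probs = softmax(logits)
    samples = rng.choice(N_MESHES, size=n_samples, p=probs)
    rewards = np.array([-reproj_loss(m) for m in samples])
    baseline = rewards.mean()
    grad = np.zeros_like(logits)
    for m, r in zip(samples, rewards):
        # gradient of log softmax w.r.t. logits: one_hot(m) - probs
        grad += (r - baseline) * (np.eye(N_MESHES)[m] - probs)
    return grad / n_samples

# toy setup: mesh 3 has the lowest (best) reprojection loss
loss_table = np.array([0.9, 0.8, 0.7, 0.1, 0.8, 0.9, 0.85, 0.95])
logits = np.zeros(N_MESHES)
for _ in range(300):  # gradient ascent on the expected reward
    logits += 2.0 * reinforce_grad(logits, lambda m: loss_table[m])
best = int(np.argmax(logits))
```

Because the sampled advantages sum to zero with this baseline, only sampled mesh indices receive net gradient; with enough samples per step, the distribution concentrates on the best-fitting mesh (index 3 in this toy setup).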
By 'instance' we mean a background semantic class (e.g., road, sky) or a foreground object (e.g., car, van). Later, we combine the object texture code with the estimated 3D information to better reconstruct objects.

Formally, given an image $I$ and its instance label map $L$, we want to obtain a feature embedding $z$ such that $(L, z)$ can later reconstruct $I$. We formulate the textural branch of the 3D-SDN as a conditional adversarial learning framework with three networks $(G, D, E)$: a textural de-renderer $E: (L, I) \to z$, a textural renderer $G: (L, z) \to I$, and a discriminator $D: (L, I) \to [0, 1]$, trained jointly with the following objectives.

To increase the photorealism of generated images, we use a standard conditional GAN loss [Goodfellow et al., 2014, Mirza and Osindero, 2014, Isola et al., 2017]:*

$\mathcal{L}_{\mathrm{GAN}}(G, D, E) = \mathbb{E}_{L,I}\left[\log D(L, I) + \log\left(1 - D(L, \tilde{I})\right)\right]$, (2)

where $\tilde{I} = G(L, E(L, I))$ is the reconstructed image. To stabilize training, we follow prior work [Wang et al., 2018] and use both a discriminator feature matching loss [Wang et al., 2018, Larsen et al., 2016] and a perceptual loss [Dosovitskiy and Brox, 2016, Johnson et al., 2016], both of which aim to match the statistics of intermediate features between generated and real images:

$\mathcal{L}_{\mathrm{FM}}(G, D, E) = \mathbb{E}_{L,I}\left[\sum_{i=1}^{T_F} \frac{1}{N_i}\left\|F^{(i)}(I) - F^{(i)}(\tilde{I})\right\|_1 + \sum_{i=1}^{T_D} \frac{1}{M_i}\left\|D^{(i)}(I) - D^{(i)}(\tilde{I})\right\|_1\right]$, (3)

where $F^{(i)}$ denotes the $i$-th layer, with $N_i$ elements, of a pre-trained VGG network [Simonyan and Zisserman, 2015]. Similarly, for our discriminator $D$, $D^{(i)}$ denotes its $i$-th layer with $M_i$ elements. $T_F$ and $T_D$ denote the numbers of layers in networks $F$ and $D$. We fix the network $F$ during training. Finally, we use a pixel-wise image reconstruction loss:

$\mathcal{L}_{\mathrm{Recon}}(G, E) = \mathbb{E}_{L,I}\left[\|I - \tilde{I}\|_1\right]$. (4)
The final training objective is a minimax game between $(G, E)$ and $D$:

$G^*, E^* = \arg\min_{G,E}\left(\max_D \mathcal{L}_{\mathrm{GAN}}(G, D, E) + \lambda_{\mathrm{FM}} \mathcal{L}_{\mathrm{FM}}(G, D, E) + \lambda_{\mathrm{Recon}} \mathcal{L}_{\mathrm{Recon}}(G, E)\right)$, (5)

where $\lambda_{\mathrm{FM}}$ and $\lambda_{\mathrm{Recon}}$ control the relative importance of each term.

* We denote $\mathbb{E}_{L,I} \triangleq \mathbb{E}_{(L,I) \sim p_{\mathrm{data}}(L,I)}$ for simplicity.

Decoupling geometry and texture. We observe that the textural de-renderer often learns not only texture but also object poses. To further decouple these two factors, we concatenate the inferred 3D information (i.e., pose map and normal map) from the geometric branch to the texture code map $z$ and feed both to the textural renderer $G$. Also, we reduce the dimension of the texture code so that it can focus on texture, as the 3D geometry and pose are already provided. These two modifications help encode textural features that are independent of the object geometry. They also resolve ambiguity in object poses: e.g., cars share similar silhouettes when facing forward or backward. Therefore, our renderer can synthesize an object under different 3D poses (see Fig. 5b and Fig. 7b for examples).

3.3 Implementation Details

Semantic branch. Our semantic branch adopts Dilated Residual Networks (DRN) for semantic segmentation [Yu et al., 2017, Zhou et al., 2017]. We train the network for 25 epochs.

Geometric branch. We use Mask R-CNN for object proposal generation [He et al., 2017]. For object meshes, we choose eight CAD models from ShapeNet [Chang et al., 2015], including cars, vans, and buses. Given an object proposal, we predict its scale, rotation, translation, 4³ FFD grid point coefficients, and an 8-dimensional distribution across candidate meshes with a ResNet-18 network [He et al., 2015]. The translation $t$ can be recovered using the estimated offset $e$, the normalized distance $\log \tau$, and the ground-truth focal length of the image. They are then fed to a differentiable renderer [Kato et al.
, 2018] to render the instance map and normal map. We empirically set $\lambda_{\mathrm{reproj}} = 0.1$. We first train the network with $\mathcal{L}_{\mathrm{pred}}$ using Adam [Kingma and Ba, 2015] with a learning rate of $10^{-3}$ for 256 epochs, and then fine-tune the model with $\mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{reproj}} \mathcal{L}_{\mathrm{reproj}}$ and REINFORCE with a learning rate of $10^{-4}$ for another 64 epochs.

Textural branch. We first train the semantic branch and the geometric branch separately, and then train the textural branch using the input from the above two branches. We use the same architecture as Wang et al. [2018], with two discriminators of different scales and one generator. We use the VGG network [Simonyan and Zisserman, 2015] as the feature extractor $F$ for the loss $\mathcal{L}_{\mathrm{FM}}$ (Eqn. 3). We set the dimension of the texture code to 5. We quantize the object's rotation into 24 bins with one-hot encoding and fill each rendered silhouette of the object with its rotation encoding, yielding a pose map of the input image. Then we concatenate the pose map, the predicted object normal map, the texture code map $z$, the semantic label map, and the instance boundary map together, and feed them to the neural textural renderer to reconstruct the input image. We set $\lambda_{\mathrm{FM}} = 5$ and $\lambda_{\mathrm{Recon}} = 10$, and train the textural branch for 60 epochs on Virtual KITTI and 100 epochs on Cityscapes.

4 Results

We report our results in two parts. First, we present how the 3D-SDN enables 3D-aware image editing. For quantitative comparison, we compile a Virtual KITTI image editing benchmark to contrast 3D-SDNs with baselines that lack 3D knowledge. Second, we analyze our design choices and evaluate the accuracy of the representations obtained by different variants. The code and full results can be found at our website.

Datasets. We conduct experiments on two street scene datasets: Virtual KITTI [Gaidon et al., 2016] and Cityscapes [Cordts et al., 2016]. Virtual KITTI serves as a proxy to the KITTI dataset [Geiger et al., 2012].
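The pose-map construction described above, quantizing each object's rotation into 24 one-hot bins and painting its rendered silhouette with that encoding, might look roughly as follows. This is a hedged sketch under assumed conventions (bin 0 starting at yaw 0; names like `pose_map` and `rotation_bin` are ours, not the paper's code):

```python
import numpy as np

N_BINS = 24  # the paper quantizes rotation into 24 bins

def rotation_bin(yaw):
    """Quantize a yaw angle (radians) into one of 24 bins over [0, 2*pi)."""
    return int(yaw % (2 * np.pi) // (2 * np.pi / N_BINS))

def pose_map(silhouettes, yaws, height, width):
    """Build an (H, W, 24) pose map: each object's rendered silhouette is
    filled with the one-hot encoding of its quantized rotation.

    silhouettes: list of (H, W) boolean masks, one per object.
    yaws: list of yaw angles (radians), aligned with silhouettes.
    """
    out = np.zeros((height, width, N_BINS), dtype=np.float32)
    for mask, yaw in zip(silhouettes, yaws):
        out[mask] = np.eye(N_BINS, dtype=np.float32)[rotation_bin(yaw)]
    return out

# toy scene: a 4x4 image with one object covering the top-left 2x2 block,
# rotated just past pi, which falls into bin 12 of 24
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
pm = pose_map([mask], [np.pi + 0.01], 4, 4)
```

In the full pipeline this map would be concatenated with the normal map, texture code map, label map, and instance boundary map before entering the textural renderer.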
The dataset contains five virtual worlds, each rendered under ten different conditions, for a total of 21,260 images. For each world, we use either the first or the last 80% of consecutive frames for training and the rest for testing. For object-wise evaluations, we use objects with more than 256 visible pixels, an occlusion ratio below 70%, and a truncation ratio below 70%, following the ratios defined in Gaidon et al. [2016]. In our experiments, we downscale Virtual KITTI images to 624×192 and Cityscapes images to 512×256.

We have also built the Virtual KITTI Image Editing Benchmark, allowing us to evaluate image editing algorithms systematically. The benchmark contains 92 pairs of images in the test set with the camera either stationary or almost still. Fig. 1 shows an example pair. For each pair, we formulate the edit with object-wise operations. Each operation is parametrized by a starting position $(x^{\mathrm{src}}_{3D}, y^{\mathrm{src}}_{3D})$, an ending position $(x^{\mathrm{tgt}}_{3D}, y^{\mathrm{tgt}}_{3D})$ (both are the object's 3D center in the image plane), a zoom-in factor $\rho$, and a rotation $\Delta r_y$ with respect to the $y$-axis of the camera coordinate system.

Figure 5: Example user editing results on Virtual KITTI. (a) We move a car closer to the camera, keeping the same texture. (b) We synthesize the same car with different 3D poses; the same texture code is used for different poses. (c) We modify the appearance of the input red car using new texture codes; its geometry and pose stay the same. We can also change the environment by editing the background texture codes. (d) We can inpaint occluded regions and remove objects.

Table 1: Evaluations on the Virtual KITTI editing benchmark. (a) We evaluate the perceptual similarity [Zhang et al., 2018] on the whole image (whole), all edited regions (all), and the largest edited region (largest), respectively; lower is better. (b) Human subjects compare our method against two baselines; the percentage shows how often they prefer 3D-SDNs to the baselines. Our method outperforms the 2D approaches consistently.

(a) Perceptual similarity scores:
                     3D-SDN (ours)   2D       2D+
    LPIPS (whole)    0.1280          0.1316   0.1317
    LPIPS (all)      0.1444          0.1782   0.1799
    LPIPS (largest)  0.1461          0.1795   0.1813

(b) Human study results:
                     vs. 2D   vs. 2D+
    3D-SDN (ours)    76.88%   74.28%

The Cityscapes dataset contains 2,975 training images with pixel-level semantic segmentation and instance segmentation ground truth, but no 3D annotations, which makes geometric inference more challenging. Therefore, given each image, we first predict 3D attributes with our geometric branch pre-trained on the Virtual KITTI dataset; we then optimize both the attributes and the mesh parameters $\pi$ and $\phi$ by minimizing the reprojection loss $\mathcal{L}_{\mathrm{reproj}}$. We use the Adam solver [Kingma and Ba, 2015] with a learning rate of 0.03 for 16 iterations.

4.1 3D-Aware Image Editing

The semantic, geometric, and textural disentanglement provides an expressive 3D image manipulation scheme. We can modify the 3D attributes of an object to translate, scale, or rotate it in the 3D world, while keeping its visual appearance consistent. We can also change the appearance of the object or the background by modifying the texture code alone.
Figure 6: Example user editing results on Cityscapes. (a) We move two cars closer to the camera. (b) We rotate a car to different angles. (c) We recover a tiny and occluded car and move it closer; our model can synthesize the occluded region as well as view the occluded car from the side. (d) We move a small car closer and then change its location.

Methods. We compare our 3D-SDN with the following two baselines:
• 2D: Given the source and target positions, the naïve 2D baseline applies only 2D translation and scaling, discarding the $\Delta r_y$ rotation.
• 2D+: The 2D+ baseline includes the 2D operations above and rotates the 2D silhouette (instead of the 3D shape) along the $y$-axis according to the rotation $\Delta r_y$ in the benchmark.

Metrics. Pixel-level distance is not a meaningful similarity metric, as two visually similar images may have a large L1/L2 distance [Isola et al., 2017]. Instead, we adopt the Learned Perceptual Image Patch Similarity (LPIPS) metric [Zhang et al., 2018], which is designed to match human perception. LPIPS ranges from 0 to 1, with 0 being most similar. We apply LPIPS to (1) the full image, (2) all edited objects, and (3) the largest edited object. Besides, we conduct a human study, where we show the target image as well as the edited results from two different methods: 3D-SDN vs. 2D and 3D-SDN vs. 2D+. We ask 120 human subjects on Amazon Mechanical Turk which edited result looks closer to the target. For better visualization, we highlight the largest edited object in red. We then compute, between each pair of methods, how often one method is preferred, across all test images.

Results. Fig. 5 and Fig. 6 show qualitative results on Virtual KITTI and Cityscapes, respectively. By modifying semantic, geometric, and texture codes, our editing interface enables a wide range of scene manipulation applications. Fig.
7 shows a direct comparison to a state-of-the-art 2D manipulation method, pix2pixHD [Wang et al., 2018]. Quantitatively, Table 1a shows that our 3D-SDN outperforms both baselines by a large margin in terms of LPIPS. Table 1b shows that a majority of human subjects prefer our results to the 2D baselines.

4.2 Evaluation on the Geometric Representation

Methods. As described in Section 3.1, we adopt multiple strategies to improve the estimation of 3D attributes. As an ablation study, we compare the full 3D-SDN, which is first trained using $\mathcal{L}_{\mathrm{pred}}$ and then fine-tuned using $\mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{reproj}} \mathcal{L}_{\mathrm{reproj}}$, with its four variants:
• w/o $\mathcal{L}_{\mathrm{reproj}}$: we use only the 3D attribute prediction loss $\mathcal{L}_{\mathrm{pred}}$.
• w/o quaternion constraint: we use the full rotation space characterized by a unit quaternion $q \in \mathbb{R}^4$, instead of limiting it to $\mathbb{R}$.

Figure 7: Comparison between 3D-SDN (ours) and pix2pixHD [Wang et al., 2018]. (a) We successfully recover the mask of an occluded car and move it closer to the camera, while pix2pixHD fails. (b) We rotate the car from back to front. With the texture code encoded from the back view and a frontal pose, our model can remove the tail lights, while pix2pixHD cannot, given the same instance map.

Table 2: Performance of 3D attribute prediction on Virtual KITTI. We compare our full model with its four variants. Our full model performs best on most metrics and obtains a much lower reprojection error. Refer to the text for details about our metrics.

                               Orientation   Distance   Scale   Reprojection
                               similarity    (×10⁻²)            error (×10⁻³)
    Mousavian et al. [2017]    0.976         4.41       0.391   9.80
    w/o L_reproj               0.980         3.76       0.372   9.54
    w/o quaternion constraint  0.970         4.59       0.403   7.58
    w/o normalized distance τ  0.979         4.27       0.420   6.42
    w/o MultiCAD and FFD       0.984         3.37       0.464   4.60
    3D-SDN (ours)              0.987         3.87       0.382   3.37
• w/o normalized distance $\tau$: we predict the original distance $t$ in log-space rather than the normalized distance $\tau$.
• w/o MultiCAD and FFD: we use a single CAD model without free-form deformation (FFD).

We also compare with a 3D bounding box estimation method [Mousavian et al., 2017], which first infers the object's 2D bounding box and pose from the input and then searches for its 3D bounding box.

Metrics. We use different metrics for different quantities. For rotation, we compute the orientation similarity $(1 + \cos\theta)/2$ [Geiger et al., 2012], where $\theta$ is the geodesic distance between the predicted and ground-truth rotations; for distance, we adopt an absolute logarithm error $|\log t - \log \tilde{t}|$; and for scale, we adopt the Euclidean distance $\|s - \tilde{s}\|_2$. In addition, we compute the per-pixel reprojection error between projected 2D silhouettes and ground-truth segmentation masks.

Results. Table 2 shows that our full model has a significantly smaller 2D reprojection error than the other variants. All of the proposed components contribute to the performance.

5 Conclusion

In this work, we have developed 3D scene de-rendering networks (3D-SDN) to obtain an interpretable and disentangled scene representation with rich semantic, 3D structural, and textural information. Though our work mainly focuses on 3D-aware scene manipulation, the learned representations could be useful for various tasks such as image reasoning, captioning, and analogy-making. Future directions include better handling of uncommon object appearances and poses, especially those not in the training set, and dealing with deformable shapes such as human bodies.

Acknowledgements. This work is supported by NSF #1231216, NSF #1524817, ONR MURI N00014-16-1-2007, Toyota Research Institute, and Facebook.

References

Harry G Barrow and Jay M Tenenbaum. Recovering intrinsic scene characteristics from images. Computer Vision Systems, 1978.
Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. 2015.

Tao Chen, Zhe Zhu, Ariel Shamir, Shi-Min Hu, and Daniel Cohen-Or. 3-Sweep: Extracting editable objects from a single photo. ACM Transactions on Graphics (TOG), 32(6):195, 2013.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.

Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016.

David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.

Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.

Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.

Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.

Michaël Gharbi, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand. Deep joint demosaicking and denoising. ACM Transactions on Graphics (TOG), 35(6):191, 2016.
3 Ian Goodfello w , Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David W arde-Farley , Sherjil Ozair , Aaron Courville, and Y oshua Bengio. Generative adv ersarial nets. In NIPS , 2014. 1 , 2 , 5 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR , 2015. 6 Kaiming He, Georgia Gkioxari, Piotr Dollár , and Ross Girshick. Mask r -cnn. In ICCV , 2017. 3 , 6 Phillip Isola, Jun-Y an Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR , 2017. 3 , 5 , 8 Michael Janner , Jiajun W u, T ejas D. Kulkarni, Ilker Y ildirim, and Josh T enenbaum. Self-supervised intrinsic image decomposition. In NIPS , 2017. 2 Justin Johnson, Ale xandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super - resolution. In ECCV , 2016. 5 Abhishek Kar , Shubham T ulsiani, Joao Carreira, and Jitendra Malik. Category-specific object reconstruction from a single image. In CVPR , 2015. 2 Ke vin Karsch, V arsha Hedau, Da vid Forsyth, and Derek Hoiem. Rendering synthetic objects into le gacy photographs. ACM T ransactions on Graphics (TOG) , 30(6):157, 2011. 3 Hiroharu Kato, Y oshitaka Ushiku, and T atsuya Harada. Neural 3d mesh renderer . In CVPR , 2018. 4 , 6 Natasha Kholgade, T omas Simon, Alexei Efros, and Y aser Sheikh. 3d object manipulation in a single photograph using stock 3d models. ACM T ransactions on Graphics (TOG) , 33(4):127, 2014. 3 Diederik P . Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR , 2015. 6 , 7 10 Alexander Kirillov , Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár . P anoptic segmentation. arXiv pr eprint arXiv:1801.00868 , 2018. 5 T ejas D Kulkarni, W illiam F Whitney , Pushmeet Kohli, and Josh T enenbaum. Deep conv olutional in verse graphics network. In NIPS , 2015. 2 Anders Boesen Lindbo Larsen, Søren Kaae Sønderby , and Ole Winther . Autoencoding beyond pix els using a learned similarity metric. 
In ICML , 2016. 5 Ming-Y u Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS , 2017. 3 Mehdi Mirza and Simon Osindero. Conditional generativ e adversarial nets. arXiv preprint , 2014. 5 Arsalan Mousavian, Dragomir Anguelov , John Flynn, and Jana Košecká. 3d bounding box estimation using deep learning and geometry . In CVPR . IEEE, 2017. 9 Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Tre vor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR , 2016. 3 Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: T ow ards real-time object detection with region proposal networks. In NIPS , 2015. 4 Danilo Jimenez Rezende, SM Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In NIPS , 2016. 4 Thomas W Sederberg and Scott R Parry . Free-form deformation of solid geometric models. A CM T ransactions on Graphics (TOG) , 20(4):151–160, 1986. 4 Zhixin Shu, Ersin Y umer , Sunil Hadap, Kalyan Sunkav alli, Eli Shechtman, and Dimitris Samaras. Neural face editing with intrinsic image disentangling. In CVPR , 2017. 2 Karen Simonyan and Andre w Zisserman. V ery deep con volutional networks for lar ge-scale image recognition. In ICLR , 2015. 5 , 6 Amir Arsalan Soltani, Haibin Huang, Jiajun W u, T ejas D Kulkarni, and Joshua B T enenbaum. Synthesizing 3d shapes via modeling multi-vie w depth maps and silhouettes with deep generativ e networks. In CVPR , 2017. 3 Maxim T atarchenko, Ale xey Doso vitskiy , and Thomas Brox. Multi-view 3d models from single images with a con volutional netw ork. In ECCV , 2016. 2 Shubham T ulsiani, T inghui Zhou, Alex ei A Efros, and Jitendra Malik. Multi-view supervision for single-vie w reconstruction via differentiable ray consistenc y . In CVPR , 2017. 3 T ing-Chun W ang, Ming-Y u Liu, Jun-Y an Zhu, Andrew T ao, Jan Kautz, and Bryan Catanzaro. 
High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR , 2018. 3 , 5 , 6 , 8 , 9 Ronald J W illiams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. MLJ , 8(3-4):229–256, 1992. 4 Jiajun W u, T ianfan Xue, Joseph J Lim, Y uandong T ian, Joshua B T enenbaum, Antonio T orralba, and William T Freeman. Single image 3d interpreter network. In ECCV , 2016a. 4 Jiajun W u, Chengkai Zhang, Tianf an Xue, W illiam T Freeman, and Joshua B T enenbaum. Learning a Probabilis- tic Latent Space of Object Shapes via 3D Generativ e-Adversarial Modeling. In NIPS , 2016b. 3 Jiajun W u, Joshua B T enenbaum, and Pushmeet K ohli. Neural scene de-rendering. In CVPR , 2017a. 2 Jiajun W u, Y ifan W ang, T ianfan Xue, Xingyuan Sun, W illiam T Freeman, and Joshua B T enenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches. In NIPS , 2017b. 3 , 4 Xinchen Y an, Jimei Y ang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV , 2016a. 3 Xinchen Y an, Jimei Y ang, Ersin Y umer, Y ijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS , 2016b. 3 , 4 Jimei Y ang, Scott E Reed, Ming-Hsuan Y ang, and Honglak Lee. W eakly-supervised disentangling with recurrent transformations for 3d view synthesis. In NIPS , 2015. 2 11 Fisher Y u, Vladlen K oltun, and Thomas A Funkhouser . Dilated residual netw orks. In CVPR , 2017. 5 , 6 Richard Zhang, Phillip Isola, and Alex ei A Efros. Colorful image colorization. In ECCV , 2016. 3 Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oli ver W ang. The unreasonable ef fecti veness of deep networks as a perceptual metric. In CVPR , 2018. 7 , 8 Bolei Zhou, Hang Zhao, Xa vier Puig, Sanja Fidler , Adela Barriuso, and Antonio T orralba. Scene parsing through ade20k dataset. In CVPR , 2017. 
5 , 6 Jun-Y an Zhu, Philipp Krähenbühl, Eli Shechtman, and Alex ei A Efros. Generativ e visual manipulation on the natural image manifold. In ECCV , 2016. 3 Jun-Y an Zhu, T aesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adv ersarial networks. In ICCV , 2017a. 3 Jun-Y an Zhu, Richard Zhang, Deepak Pathak, Tre vor Darrell, Alexei A Efros, Oli ver W ang, and Eli Shechtman. T o ward multimodal image-to-image translation. In NIPS , 2017b. 5 12
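As a supplementary note, the evaluation metrics in Section 4 can be sketched in a few lines of NumPy. The function names and the array-based silhouette representation here are illustrative assumptions, not taken from the paper's released code; they simply implement the stated formulas: orientation similarity (1 + cos θ)/2, absolute logarithm error |log t − log t̃|, Euclidean scale distance ‖s − s̃‖₂, and per-pixel reprojection error between binary masks.

```python
import numpy as np

def orientation_similarity(theta):
    """Orientation similarity (1 + cos(theta)) / 2, where theta is the
    geodesic distance between predicted and ground-truth rotations."""
    return (1.0 + np.cos(theta)) / 2.0

def distance_error(t_pred, t_true):
    """Absolute logarithm error |log t - log t~| for object distance."""
    return abs(np.log(t_pred) - np.log(t_true))

def scale_error(s_pred, s_true):
    """Euclidean distance ||s - s~||_2 between predicted and
    ground-truth scale vectors."""
    return np.linalg.norm(np.asarray(s_pred, float) - np.asarray(s_true, float))

def reprojection_error(pred_silhouette, gt_mask):
    """Per-pixel reprojection error: fraction of pixels where the projected
    2D silhouette disagrees with the ground-truth segmentation mask."""
    pred = np.asarray(pred_silhouette, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    return float(np.mean(pred != gt))
```

A perfect rotation prediction (θ = 0) yields a similarity of 1, an opposite one (θ = π) yields 0, and the reprojection error is 0 for identical masks and 1 for complementary ones.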