Convolutional sparse coding for capturing high speed video content
Authors: Ana Serrano, Elena Garces, Diego Gutierrez, Belen Masia
Computer Graphics Forum, Volume 35 (2016), Number 0, pp. 1–9

Convolutional sparse coding for capturing high speed video content
Ana Serrano¹, Elena Garces¹, Diego Gutierrez¹, Belen Masia¹·²
¹Universidad de Zaragoza - I3A  ²MPI Informatik

Figure 1: Reconstruction of a high speed video sequence from a single, temporally-coded image using convolutional sparse coding (CSC). The sequence shows a lighter igniting. Left: Coded image, from which 20 individual frames will be reconstructed; inset shows a close-up of the coded temporal information. Middle: Three frames of the reconstructed video. Right: CSC models the signal of interest as a convolution between sparse feature maps and trained filter banks; the image shows a sparse feature map for one of the frames, and the inset marked in blue some of the trained filters.

Abstract
Video capture is limited by the trade-off between spatial and temporal resolution: when capturing videos of high temporal resolution, the spatial resolution decreases due to bandwidth limitations in the capture system. Achieving both high spatial and temporal resolution is only possible with highly specialized and very expensive hardware, and even then the same basic trade-off remains. The recent introduction of compressive sensing and sparse reconstruction techniques allows for the capture of single-shot high-speed video, by coding the temporal information in a single frame, and then reconstructing the full video sequence from this single coded image and a trained dictionary of image patches. In this paper, we first analyze this approach, and find insights that help improve the quality of the reconstructed videos.
We then introduce a novel technique, based on convolutional sparse coding (CSC), and show how it outperforms the state-of-the-art, patch-based approach in terms of flexibility and efficiency, due to the convolutional nature of its filter banks. The key idea for CSC high-speed video acquisition is extending the basic formulation by imposing an additional constraint in the temporal dimension, which enforces sparsity of the first-order derivatives over time.

Categories and Subject Descriptors (according to ACM CCS): I.4.1 [Computer Graphics]: Digitization and Image Capture—Sampling

© 2016 The Author(s). Computer Graphics Forum © 2016 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.

1. Introduction
In recent years, video capture technologies have seen large progress, driven by the need to acquire information at high temporal and spatial resolution. However, cameras still face a basic bandwidth limitation, which poses an intrinsic trade-off between the temporal and spatial dimensions. This trade-off is mainly determined by hardware restrictions, such as readout and analog-to-digital conversion times of the sensors. As a result, capturing high speed video at high spatial resolution remains an open problem. Recent works try to overcome these limitations either with hardware-based approaches, such as the camera array prototype proposed by Wilburn et al. [WJV∗04], or with software-based approaches, like the work of Gupta et al. [GBD∗09], where the authors proposed combining low resolution videos with a few key frames at high resolution. Other approaches rely on computational imaging techniques, combining optical elements and processing algorithms [WILH11, MWDG13, WLGH12].
For instance, a novel approach based on the emerging field of compressive sensing was presented [LGH∗13, HGG∗11]. This technique makes it possible to fully recover a signal even when sampled at rates lower than the Nyquist-Shannon theorem dictates, provided that the signal is sufficiently sparse in a given domain. These works rely on this technique to selectively sample pixels at different time instants, thus coding the temporal information of a frame sequence in a single image. They then recover the full frame sequence from that coded image. The key assumption is that the time-varying appearance of the captured scenes can be represented as a sparse linear combination of elements of an overcomplete basis (dictionary). This representation, and the subsequent reconstruction, is done in a patch-based manner, which is not free of limitations. Training and reconstruction usually take a long time; moreover, the overcomplete basis needed for reconstruction has to be made up of atoms of similar nature to the video that is being reconstructed. This in turn imposes the need to use a specialized, expensive camera to capture such a basis.

This paper is an extended version of our previous work [SGM15], where we i) performed an in-depth analysis of the main parameters defined in Liu's patch-based compressive sensing and sparse reconstruction framework [LGH∗13]; ii) introduced the LARS/lasso algorithm for training and reconstruction, and showed how it improved the quality of the results; iii) presented a novel algorithm for choosing the training blocks, which further improved performance as well as reconstruction time; and iv) further explored the existence of a good predictor of the quality of the reconstructed video. These contributions are briefly summarized in Sections 4, 6, and 7, and in the supplemental material.
In this work, we additionally introduce a novel convolutional sparse coding (CSC) approach for high-speed video acquisition, and show how it outperforms existing patch-based sparse reconstruction techniques in several aspects. In particular, both training and reconstruction times are significantly reduced, while the convolutional nature of the atoms allows for reconstruction of videos using generic, content-agnostic dictionaries. This is due to the fact that the learned basis no longer needs to reconstruct each signal block in isolation; instead, shiftable basis functions can discover a lower rank structure. In other words, image patches are no longer considered independent; interactions are modeled as convolutions, which translates into a more expressive basis that better reconstructs the underlying mechanics of the signal [BL14]. This completely removes the need to capture similar scenes using an expensive high-speed camera, and to explicitly train a dictionary, which significantly extends the applicability of our CSC framework. As an example, all the results shown in this paper have been reconstructed using an existing dictionary containing images of fruits [HHW15]. Note that we sample less than 15% of the pixels, which are additionally integrated over time to form a single image; despite this extremely suboptimal input data, we are able to successfully reconstruct high-speed videos of good quality. We provide source high speed videos, code, and results of our implementation†.

2. Related Work
Coded exposures. Coded exposure techniques have been used to improve certain aspects of image and video acquisition in the field of computational photography. The goal is to optically code the incoming light before it reaches the sensor, either with coded apertures or shutter functions. For instance, Raskar et al. [RAT06] proposed the use of a flutter shutter to recover motion-blurred details in images.
With the same purpose, Gu et al. [GHMN10] proposed the coded rolling shutter as an improvement over the conventional rolling shutter. Alternatively, codes in the spatial domain have been used for light field reconstruction [VRA∗07], high dynamic range imaging [NM00], to recover from defocus blur [MCPG11, MPCG12], or to obtain depth information [LFDF07, ZLN09].

Compressive sensing. The theory of compressive sensing has raised interest in the research community since its formalization in the seminal works of Candès et al. [CRT06] and Donoho [Don06]. Numerous recent works have been devoted to applying this theory to several fields, including image and video acquisition. In one of the most significant works, the Single Pixel Camera of Wakin et al. [WLD∗06], the authors introduce a camera prototype with only one pixel, which allows the reconstruction of complete images acquired with several captures under different exposure patterns. Other examples in imaging include the work of Marwah et al. [MWBR13], in which they achieve light field acquisition from a single coded image; high dynamic range imaging [SBN∗12]; or capturing hyperspectral information [LLWD14, JCK16]. Recently, compressive sensing was proposed to reconstruct high-speed video from a single image [LGH∗13, HGG∗11], combining coded exposure and dictionary learning. Some of the key design choices and parameters were analyzed in [SGM15], leading to an improvement of the original design.

Convolutional sparse coding. Convolutional sparse coding has quickly become one of the most powerful tools in machine learning, with many applications in signal processing, computer vision, and computational imaging, to name a few. Grosse et al. [GRKN07] introduced convolutional constraints to sparse coding, as well as an efficient minimization algorithm to make CSC practical, for the particular problem of 1D audio signals.
Since then, many other works have extended this basic, canonical framework, with the goal of making it faster or more efficient (e.g. [BEL13, KF14, HHW15]). Bristow and Lucey [BL14] recently presented a large collection of examples covering different application domains, showing the fast spread and general applicability of CSC. Some examples of application domains include learning hierarchical image representations for applications in vision [CPS∗13, SKL10]; decomposition of transient light transport [HDL∗14]; imaging in scattering media [HST∗14]; or high dynamic range capture [SHG∗16]. In this paper we present a practical application of CSC for the particular case of high-speed video acquisition.

† http://webdiis.unizar.es/~aserrano/projects/CSC-Video.html

3. Background on sparse reconstruction
Compressive sensing has revolutionized the field of signal processing by providing a means to reconstruct signals that have been sampled at rates lower than what the Nyquist-Shannon theorem dictates. In general, this undersampling or incomplete acquisition of the signal is represented as:

    y = Φx    (1)

where the signal of interest (in our case a video sequence) is represented by x ∈ ℝ^m, the captured signal (a coded image) is y ∈ ℝ^n, with n ≪ m, and Φ ∈ ℝ^{n×m} contains the sampling pattern and is called the measurement matrix. An example of this sampling in the case of video acquisition is shown in Figure 2. Recovering the signal of interest having as input Φ and y requires solving an underdetermined system of equations and finding a basis in which the signal of interest is sparse; this is known as the sparse reconstruction problem.
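Eq. 1 is easy to instantiate numerically. The sketch below is a toy version of the measurement model, with sizes far smaller than a real capture and a simple random binary pattern standing in for the actual shutter; all values are illustrative assumptions:

```python
import numpy as np

# Toy instance of Eq. (1), y = Phi x: each coded measurement integrates a
# few randomly chosen entries of the signal. Sizes and the binary pattern
# are illustrative, not those of the paper's capture system.
rng = np.random.default_rng(0)

m, n = 64, 16                  # signal length m, measurement length n << m
x = rng.random(m)              # stand-in for the vectorized video signal

Phi = np.zeros((n, m))
for i in range(n):
    idx = rng.choice(m, size=4, replace=False)
    Phi[i, idx] = 1.0          # each measurement sums 4 signal entries

y = Phi @ x                    # the coded, undersampled capture
print(y.shape)                 # (16,)
```

With n = 16 measurements of an m = 64 sample signal, the system is underdetermined exactly as described above; recovery hinges on the sparsity prior introduced next.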
Figure 2: Video acquisition via compressive sensing. The signal of interest x is sampled using a certain pattern Φ which varies in time, and integrated into a single 2D image y. Sparse reconstruction aims to recover x from y, knowing the sampling pattern Φ, a severely underdetermined problem. Note that the Φ displayed in the image is an actual example of a pattern used for our captures.

Patch-based sparse coding. To solve the sparse reconstruction problem posed by Eq. 1, the signal of interest x needs to be sparse, meaning that it can be represented in some alternative domain with only a few coefficients. This can be expressed as:

    x = ∑_{i=1}^{N} ψ_i α_i    (2)

where ψ_i are the elements of the basis that form the alternative domain, and α_i are the coefficients, most of which are zero or close to zero if the signal is sparse. Many natural signals, such as images or audio, can be considered sparse if represented in an adequate domain.

In order to reconstruct the original signal from the undersampled, acquired one, we jointly consider the sampling process (Eq. 1) together with the representation in the sparse dictionary (Eq. 2), yielding the following formulation:

    y = Φx = ΦΨα    (3)

where Ψ ∈ ℝ^{m×q} represents an overcomplete basis (also called dictionary) with q elements. If the original sequence x is s-sparse in the domain of the basis formed by the measurement matrix Φ and the dictionary Ψ, it can be well represented by a linear combination of at most s coefficients in α ∈ ℝ^q. Note that we are looking for a sparse solution; therefore, the search for the coefficients α has to be posed as a minimization problem. This optimization will search for the unknown α coefficients, seeking a sparse solution to Eq. 3.
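This search for sparse coefficients can be illustrated with a toy iterative soft-thresholding loop (ISTA), a simple stand-in for the dedicated solvers used later in the paper. The Gaussian matrix A (playing the role of ΦΨ in Eq. 3) and all sizes are illustrative assumptions:

```python
import numpy as np

# Toy recovery of sparse coefficients alpha from y = A alpha, where A
# stands in for Phi Psi in Eq. (3), via iterative soft-thresholding (ISTA).
rng = np.random.default_rng(3)
n, q = 30, 60
A = rng.standard_normal((n, q)) / np.sqrt(n)

alpha_true = np.zeros(q)
alpha_true[[3, 17, 42]] = [1.5, -2.0, 1.0]      # 3-sparse ground truth
y = A @ alpha_true

lam = 0.02                                      # sparsity weight
step = 1.0 / np.linalg.norm(A, 2) ** 2          # safe gradient step size
alpha = np.zeros(q)
for _ in range(500):
    g = alpha - step * A.T @ (A @ alpha - y)    # gradient step on data term
    alpha = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # shrink

support = np.argsort(np.abs(alpha))[-3:]        # three largest coefficients
print(sorted(support.tolist()))
```

From only 30 measurements of a 60-coefficient, 3-sparse vector, the loop recovers the support of the true coefficients; this is the behavior the L1 formulation below makes precise.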
This is typically formulated in terms of the L1 norm, since L2 does not promote sparsity and L0 presents an ill-posed problem which is difficult to solve:

    min_α ||α||₁  subject to  ||y − ΦΨα||₂² ≤ ε    (4)

where ε is the residual error. Eq. 4 is usually solved in a patch-based manner, dividing the signal spatially into a series of blocks (in the case of videos, the blocks are of size p_x × p_y × p_z), reconstructing them individually, and merging them to yield the final reconstructed signal.

Convolutional sparse coding. As an alternative, recent works propose modeling x as a sum of sparsely-distributed convolutional features [GRKN07]:

    x = ∑_{k=1}^{K} d_k ∗ z_k    (5)

where d_k are a set of convolutional filters that form the dictionary, and z_k are sparse feature maps. The filters have a fixed spatial support, and the feature maps are of the size of the signal of interest. Heide and colleagues [HHW15] presented a formulation for the recovery of a signal x, modeled as in Eq. 5, from a degraded signal y measured as shown in Eq. 1:

    argmin_z (1/2) ||y − Φ ∑_{k=1}^{K} d_k ∗ z_k||₂² + β ∑_{k=1}^{K} ||z_k||₁    (6)

In their paper, the authors also propose efficient algorithms to train the filter bank and to solve the minimization problem. Note that training the filter bank amounts to solving the minimization in Eq. 6 optimizing for both the feature maps z_k and the filters d_k.

Coded images from high speed video sequences. In the case of video, the measurement matrix Φ introduced above is implemented as a shutter function that samples different time instants for every pixel. The final image is thus formed as the integral of the light arriving at the sensor over all the temporal instants sampled with the shutter function:

    I(x, y) = ∑_{t=1}^{T} S(x, y, t) X(x, y, t)    (7)

where I(x, y) is the captured image, S the shutter function, and X the original scene.
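Eq. 7 is straightforward to simulate. The sketch below assumes a toy scene and a per-pixel shutter that opens once for a short contiguous bump, in the spirit of (but not identical to) the hardware-constrained shutter discussed next; sizes, the bump length, and the random start times are illustrative:

```python
import numpy as np

# Simulating Eq. (7): the coded image is the temporal integral of the
# scene X(x, y, t) modulated by a per-pixel shutter S(x, y, t).
rng = np.random.default_rng(1)
H, W, T = 8, 8, 20                 # 20 frames coded into a single image
bump = 3                           # pixel open for 3 consecutive instants

X = rng.random((H, W, T))          # stand-in for the high-speed scene
S = np.zeros((H, W, T))
starts = rng.integers(0, T - bump + 1, size=(H, W))
for i in range(H):
    for j in range(W):
        S[i, j, starts[i, j]:starts[i, j] + bump] = 1.0

I = (S * X).sum(axis=2)            # I(x, y) = sum_t S(x, y, t) X(x, y, t)
print(I.shape)                     # (8, 8)
```

Note that every pixel of I integrates exactly `bump` time instants, so each pixel carries information about a different, short window of the sequence.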
In a conventional capture system S(x, y, t) = 1 ∀ x, y, t, but in this case S should fulfil the mathematical properties of a measurement matrix suitable for sparse reconstruction, as well as the constraints imposed by the hardware. An easy way to fulfil the mathematical requirements is to build a random sampling matrix. However, since a fully-random sampling matrix cannot be implemented in current hardware, we use the shutter function proposed by Liu et al. [LGH∗13], which can be easily implemented in a DMD (Digital Micromirror Device) or an LCoS (Liquid Crystal on Silicon) placed before the sensor, and approximates randomness while imposing additional restrictions to make a hardware implementation possible. In particular, the proposed shutter is implemented in an LCoS. Each pixel is only sampled once throughout the sequence, with a fixed bump length. We refer the reader to the work of Liu et al. [LGH∗13] for more details about the coded shutter implementation.

In Section 4 we show how to apply traditional, patch-based sparse coding to the problem of high speed video acquisition, and provide an analysis of the most important parameters. Section 5 then offers an alternative solution, focused on efficiency, by modifying the convolutional sparse coding formulation in Eq. 6. This alternative solution also lifts some of the restrictions imposed by the patch-based sparse coding approach, such as the need to build a dictionary whose atoms are similar to the videos that are going to be reconstructed.

4. Patch-based sparse coding approach
This section describes the specifics of how to solve the sparse reconstruction problem using a patch-based approach (described in Section 3) for high speed video. The mathematical formulation is given by Eqs. 2 to 4; here we describe how we train the dictionary Ψ, and how we solve the optimization problem in Eq. 4. We summarize the main ideas here, and provide more details in the supplemental material.

Learning high speed video dictionaries. We have captured a database of high-speed videos which we use for training and validation of our techniques. The database consists of 14 videos captured at 1000 frames per second with a high-speed Photron SA2 camera, part of which are used for training, and part for testing. The Photron SA2 provides up to 4 Megapixels at a rate of 1000 fps. The acquisition setup is shown in Figure 3. Our database provides scenes of different nature, and a wide variety of spatial and temporal features. Representative frames of the videos in the database are shown in the supplemental material.

Figure 3: Setup for the acquisition of our high speed video database with a Photron SA2 camera. In order to capture videos with high quality, the scene must be illuminated with a strong, direct light.

We learn the fundamental building blocks (atoms) from our captured videos, and create an overcomplete dictionary. For training, we use the DLMRI-Lab implementation [RB11] of the K-SVD [AEB06] algorithm, which has been widely used in the compressive sensing literature. We propose an alternative to random selection that maximizes the presence of blocks with relevant information, by giving higher priority to blocks with high variance.

Reconstructing high speed videos. Once the dictionary Ψ is trained, and knowing the measurement matrix Φ, we need to solve Eq. 4 to estimate the α coefficients and reconstruct the signal. Many algorithms have been developed for solving this minimization problem for compressive sensing reconstruction. We use the implementations available in the SPArse Modeling Software (SPAMS) [MBPS10].

5. Introducing CSC for high-speed video acquisition
The use of patch-based, conventional sparse coding for the recovery of high-speed video produces good results, as we will show in Section 7. However, convolutional sparse coding (CSC) can offer significant improvements. First, learning convolutional filters allows for a richer representation of the signal, since they span a larger range of orientations and are spatially-invariant, as opposed to the patches learnt in a conventional dictionary. Second, due to their convolutional nature, dictionaries made up of filter banks are more versatile, in the sense that they are content-agnostic: they do not need to contain atoms of similar nature to the signals that are to be reconstructed with them. Finally, reconstruction time is a major bottleneck in patch-based approaches, but is significantly reduced in a convolutional framework; this is especially important when dealing with video content.

Recent efficient solutions for CSC have been proposed for images [HHW15, SHG∗16]. In principle, adapting these solutions to video could be done simply by extending them to three dimensions (x-y-t), with 3D filters d_k and 3D feature maps z_k. The optimization in Eq. 6 could then be solved in a manner analogous to 2D [HHW15, Algorithm 1]. Doing this, however, is tremendously computationally expensive‡. We therefore revert to a 2D formulation instead, which is computationally manageable, and adapt it to properly deal with video content, as explained next. In our proposed formulation, the sparse feature maps z_k and the filters d_k remain two-dimensional; thus their convolution yields two-dimensional images, which correspond to the individual frames.
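To make the per-frame model concrete, a single frame can be synthesized as in Eq. 5 from small 2D filters and sparse 2D feature maps. The direct "full" convolution below is a library-free sketch; all sizes and the random filters are illustrative assumptions:

```python
import numpy as np

# Toy instance of Eq. (5): a frame as a sum of convolutions between small
# filters d_k and sparse feature maps z_k.
def conv2d_full(d, z):
    """Direct 2D 'full' convolution of filter d over map z."""
    fh, fw = d.shape
    zh, zw = z.shape
    out = np.zeros((zh + fh - 1, zw + fw - 1))
    for i in range(fh):
        for j in range(fw):
            out[i:i + zh, j:j + zw] += d[i, j] * z   # shifted, scaled copies
    return out

rng = np.random.default_rng(5)
K = 3
filters = [rng.standard_normal((5, 5)) for _ in range(K)]   # small d_k
maps = []
for _ in range(K):                                          # sparse z_k
    z = np.zeros((16, 16))
    z[rng.integers(0, 16, 4), rng.integers(0, 16, 4)] = 1.0
    maps.append(z)

frame = sum(conv2d_full(d, z) for d, z in zip(filters, maps))
print(frame.shape)   # (20, 20)
```

Because each nonzero entry of z_k stamps a shifted copy of its filter, a handful of active coefficients per map already produces a structured frame; this is the shift-invariance that makes the convolutional basis more expressive than independent patches.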
This per-frame reconstruction of the video could be achieved using the optimization in Eq. 6. To adapt this solution to video, we impose an additional constraint in the temporal dimension, which enforces sparsity of the first-order derivatives over time. The resulting reconstruction process is then:

    argmin_z (β_d/2) ||y − Φ ∑_{k=1}^{K} d_k ∗ z_k||₂² + β₁ ∑_{k=1}^{K} ||z_k||₁ + β₂ ∑_{k=1}^{K} ||∇_t z_k||₁    (8)

The operator ∇_t represents the first-order backward finite difference along the temporal dimension: ∇_t z_k = (z_k^t − z_k^{t−1}) ∀ t ∈ {2, ..., T}, where T is the number of frames being reconstructed. β_d, β₁ and β₂ are the relative weights of the data term, the sparsity term, and the temporal smoothness term, respectively.

‡ To give a practical example: 32 GB of RAM allow training a maximum of 100 filters of size 11 × 11 × 10; while this number of filters is adequate for 2D images, it is insufficient in the presence of the extra dimension in video.

A modified ADMM algorithm can be used to solve this problem by posing the objective as a sum of closed, convex functions f_j as follows:

    argmin_z ∑_{j=1}^{J} f_j(K_j z)    (9)

where J = 3, f₁(ξ) = (β_d/2) ||y − Φξ||₂², f₂(ξ) = β₁ ||ξ||₁, and f₃(ξ) = β₂ ||ξ||₁. Consequently, the matrices K_j are K₁ = D, K₂ = I, and K₃ = ∇, where D is formed by the convolution matrices corresponding to the filters d_k, and ∇ is the matrix corresponding to the first-order backward differences in the temporal dimension, as explained above. We use the implementation of ADMM (Alternating Direction Method of Multipliers) described in Algorithm 1 to solve Eq. 9.
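Two ingredients of this formulation are simple enough to write down explicitly: the temporal difference operator ∇_t, and the proximal operator of the L1 terms f₂ and f₃ (soft-thresholding), which appears in the z-update of the ADMM iterations. A minimal numpy sketch with toy shapes:

```python
import numpy as np

# (a) The temporal operator of Eq. (8): backward differences z[t] - z[t-1]
#     over a stack of per-frame feature maps of shape (T, H, W).
def temporal_grad(z):
    return z[1:] - z[:-1]

# (b) The proximal operator of the L1 terms f_2, f_3 (soft-thresholding).
def prox_l1(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

z = np.zeros((5, 4, 4))
z[2:] = 1.0                        # a single temporal step edge
g = temporal_grad(z)
print(np.abs(g).sum())             # 16.0: only the transition frame differs

v = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
print(prox_l1(v, 1.0))             # small entries shrink to exactly zero
```

The first print shows why the ||∇_t z_k||₁ term rewards temporally stable reconstructions: a sequence that changes only where the scene actually changes has a sparse temporal gradient.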
More details about this implementation can be found in the literature [AF13; HHW15, Algorithm 1].

Algorithm 1 ADMM iterations for solving Eq. 9
1: for i = 1 to I do
2:   y^{i+1} = argmin_y ||K y − z^i + λ^i||₂²
3:   z_j^{i+1} = prox_{f_j/ρ}(K_j y^{i+1} + λ_j^i)  ∀ j ∈ {1, ..., J}
4:   λ^{i+1} = λ^i + (K y^{i+1} − z^{i+1})
5: end for

6. Analysis
In this section we discuss implementation details of the system, and analyze the two approaches proposed in Sections 4 and 5, exploring the parameters of influence for both methods: patch-based and convolutional sparse coding. We find the parameter combination yielding the best results. None of the test videos were used during training. As measures of quality, we use PSNR (Peak Signal to Noise Ratio), widely used in the signal processing literature, and the MS-SSIM metric [WSB03], which takes visual perception into account. The complete analysis can be found in the supplementary material.

6.1. Patch-based sparse coding
Implementation details: We use the K-SVD algorithm [AEB06] to train our dictionary and the LARS-Lasso solver [EHJT04] for solving the minimization problem. In order to achieve faithful reconstructions it is important that each atom (patch) is large enough to contain significant features (such as edges or corners), but not so large that it learns very specific features of the training videos (and thus overfits). We have tested several patch sizes (results included in the supplemental material) and have chosen the size yielding the best quality: 7 pixels × 7 pixels × 20 frames.

Training a dictionary: The number of blocks (3D patches) resulting from splitting the training videos is unmanageable for the training algorithm; thus the dimensionality of the training set has to be reduced. The straightforward solution is to randomly choose a manageable number of blocks.
However, a high percentage of these do not contain meaningful information about the scene (as in static backgrounds). We thus explore several other ways to select the blocks for training:

• Variance sampling: We calculate the variance of each block and bin the blocks into three categories: high, medium and low variance. We then randomly select the same number of blocks from every bin to ensure the presence of high variance blocks in the resulting set.
• Stratified gamma sampling: We sort the blocks by increasing variance and sample them with a gamma curve (f(x) = x^γ). We analyze the effect of γ = 0.7, which yields a curve closer to a linear sampling, and γ = 0.3. The goal of this stratification is to ensure the presence of all the strata in the final distribution. We divide the range uniformly and calculate thresholds for the strata applying the gamma function. We then randomly choose a sample from every stratum and remove that sample from the original set. This process is iterated until the desired number of samples is reached.
• Gamma sampling: We choose samples directly from the original set following a gamma curve sampling. We also test values of γ = 0.7 and γ = 0.3.

Figure 4 shows results for one of the tested videos, with dictionaries built with the different selection methods explained above. Results for all the videos tested were consistent. Random and Variance sampling clearly outperform the other methods, with Variance sampling yielding slightly better results.

Figure 4: Quality of the reconstruction (in terms of PSNR) for a sample video (PourSoda) as a function of the method used to select training blocks for learning the dictionary. For each method we show an inset with the histogram of variances of video blocks of the resulting training set.
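The variance sampling strategy above can be sketched in a few lines. Block extraction is omitted and all sizes, the number of candidate blocks, and the per-bin quota are illustrative assumptions:

```python
import numpy as np

# Sketch of variance sampling: bin candidate training blocks into
# low/medium/high variance and draw the same number from each bin.
rng = np.random.default_rng(4)

blocks = rng.random((300, 7, 7, 20))            # 300 candidate 7x7x20 blocks
var = blocks.reshape(len(blocks), -1).var(axis=1)

# Three equal-width variance bins between the observed min and max.
edges = np.linspace(var.min(), var.max(), 4)
bins = np.digitize(var, edges[1:-1])            # 0 = low, 1 = medium, 2 = high

per_bin = 10                                    # blocks to keep per bin
chosen = []
for b in range(3):
    idx = np.flatnonzero(bins == b)
    take = min(per_bin, len(idx))               # a bin may be underpopulated
    chosen.extend(rng.choice(idx, size=take, replace=False))
print(len(chosen))
```

Drawing equally from each bin guarantees that high-variance blocks, which carry most of the scene's structure, are represented in the training set even though low-variance background blocks dominate the raw pool.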
6.2. Convolutional sparse coding
Implementation details: We use a trained dictionary of 100 filters of size 11 × 11 pixels. We have performed tests with different filter sizes, and have found the reconstruction quality to be very similar across sizes (see the supplemental material for results). The convolutional nature of the algorithm makes it more flexible and, unlike the patch-based approach, more robust towards variations in filter size. Nevertheless, we chose to train filters of size 11 × 11 pixels because they yield slightly better results. Regarding the number of filters, we found that 100 filters are enough to make the training converge.

Training a dictionary: One of the theoretical advantages of convolutional sparse coding over the patch-based approach is that the learned dictionaries need not be of similar nature to the signal being reconstructed. To analyze this, we compare the quality of the reconstruction for two different dictionaries: a generic dictionary obtained from the fruits dataset (it simply contains pictures of fruits) provided by Heide et al. [HHW15], and a specific dictionary trained with a set of frames from our captured video database (different from the ones we then reconstruct). Figure 5 shows that the quality of the reconstructions using both dictionaries is very similar for all the videos analyzed. This confirms the theory for the particular case of high-speed video reconstruction; as a consequence, we no longer need to acquire specific data to train dictionaries adapted to every particular problem. We use this generic fruit dictionary for all our results.

Figure 5: Quality of the reconstruction (in terms of the quality metric MS-SSIM) for two different dictionaries of filters: a specific one trained with frames from our video database (orange bars), and a generic one trained with the fruits dataset (blue bars). For all the videos analyzed (x-axis) the quality of the reconstruction is very similar with both dictionaries, showing our CSC method does not require training a specific dictionary.

Choosing the best parameters: We analyze the relative weight of each of the three terms in Eq. 8. We set β₁, which weights the sparsity term, to 10, and vary the parameter β_d, which weights the data term, and the parameter β₂, which weights the temporal smoothness term. Based on the MS-SSIM results shown in Figure 6, we choose for all our reconstructions the values β_d = 100 and β₂ = 1.

Figure 6: Analysis of parameters β_d and β₂ in Eq. 8, which control the weights of the data term and the temporal smoothness term, respectively. We plot the mean MS-SSIM value from eight reconstructed videos.

7. Results and discussion
In this section we show and discuss our results with both sparse coding approaches: patch-based and convolutional. In recent work, Koller et al. [KSM∗15] have performed an analysis of several state-of-the-art approaches for high spatio-temporal resolution video reconstruction with compressed sensing. They prove that the approach proposed by Liu et al. [LGH∗13, HGG∗11] achieves better reconstruction quality than other state-of-the-art approaches. Therefore we compare the results of our convolutional sparse coding approach with our implementation of the framework proposed by Liu et al. The parameters used in all the reconstructions are derived from the analysis in the previous section.
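The parameter selection of Section 6.2 amounts to a grid search over (β_d, β₂). The sketch below is a runnable stand-in: the `score` function is a placeholder for the real reconstruct-and-measure-MS-SSIM loop, chosen here only so that it peaks at the values the analysis selects:

```python
import itertools

# Illustrative grid search over the weights of Eq. (8). `score` is a dummy
# metric standing in for "mean MS-SSIM over the reconstructed test videos";
# it is constructed to peak at beta_d = 100, beta_2 = 1.
def score(beta_d, beta_2):
    return -((beta_d - 100) ** 2 + (beta_2 - 1) ** 2)   # placeholder only

grid_d = [1, 10, 100, 1000, 10000]      # candidate data-term weights
grid_2 = [0, 1, 10, 100]                # candidate temporal-term weights
best = max(itertools.product(grid_d, grid_2), key=lambda p: score(*p))
print(best)   # (100, 1)
```

In the real pipeline each grid point requires reconstructing the full test set, so the grid is kept coarse (logarithmic steps), as in Figure 6.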
The videos used for training in the patch-based approach are different from the reconstructed ones (see supplemental material), whereas for CSC we use an existing, generic dictionary trained from images of fruits [HHW15]. We code 20 frames into a single image; this number yields a good trade-off between quality and speed-up of the reconstructed video. For each frame, we sample less than 15% of the pixels. Despite this huge loss of information, we are able to reconstruct high-speed videos of good quality.

In general, both techniques yield results of similar quality, with a slight advantage for the patch-based approach in terms of MS-SSIM [WSB03] values, of about 0.05 on average (see also Figure 8). However, as discussed earlier in the paper, this approach has two important shortcomings. First, it requires training a dictionary made up of atoms containing structures similar to those of the reconstructed videos; otherwise, the quality of the reconstruction degrades significantly. Second, training and reconstruction times are rather long (see Table 1 for reconstruction times). To overcome these, we introduced a second solution based on convolutional sparse coding. This is not only significantly faster, but it also allows us to bypass the need to capture and train a dictionary, as discussed in Section 6.

As explained in the paper, a naïve 3D CSC approach quickly becomes computationally intractable due to the huge convolutional matrices involved. On the other hand, reverting to a 2D, per-frame solution yields many artifacts in the results due to the low sampling rate and the lack of temporal stability, as we show in Figure 7 (leftmost image) and the supplemental material. We therefore adapted the 2D convolutional sparse coding approach to our problem, taking advantage of the coded temporal information by enforcing sparsity of the derivatives in time while solving the optimization. Figure 7 (middle and right images) shows the effect of this term in the optimization, while Figures 9 and 10 show additional results with each of the two techniques (patch-based and CSC-based). Please refer to the supplemental material for the full videos. Last, Table 1 shows reconstruction times for all the videos shown in this paper; on average, our CSC-based approach is 14x faster than the patch-based solution.

© 2016 The Author(s) Computer Graphics Forum © 2016 The Eurographics Association and John Wiley & Sons Ltd.

Figure 7: Effect of our proposed temporal smoothness term on the reconstructed videos. From left to right: result of CSC without this term (β2 = 0), and two increasing values of its weight in the optimization (β2 = 1 and β2 = 10). Not including this term yields results with many artifacts due to the low sampling rate of each image separately (see also the video in the supplemental material), while too high a value tends to over-blur the result.

Table 1: Reconstruction times for eight videos (20 frames each) with convolutional sparse coding and with patch-based sparse coding. Note the great speed-up achieved by the former.

Time (in seconds)
Video       Convolutional   Patch-based
Brain       249             4208
Coke        239             3239
Dice        236             3175
Flower      237             3524
Foreman     237             3746
Balloon     236             3275
FireHold    237             3071
FireStart   238             3180

8. Conclusions

Computational imaging aims at enhancing imaging technology by means of the co-design of optical elements and algorithms; capturing and displaying the full, high-dimensional plenoptic function is an open, challenging problem, for which compressive sensing and sparse coding techniques are already providing many useful solutions.
In this paper we have focused on the particular case of high-speed video acquisition, and on the intrinsic trade-off between temporal and spatial resolution imposed by bandwidth limitations. We have presented two sparse coding approaches, in which we code the temporal information by sampling different time instants at every pixel. First, we analyzed the key parameters in the patch-based sparse coding approach proposed by Liu et al. [LGH∗13, HGG∗11], which allowed us to offer insights that lead to better quality in the reconstructed videos. We then introduced a novel convolutional sparse coding framework, customized to enforce sparsity on the first-order derivatives in the temporal domain. The convolutional nature of the filter banks used in the reconstruction allows for a more flexible and efficient approach compared with its patch-based counterpart: we bypass the need to capture a database of high-speed videos and train a dictionary, while reconstruction times improve significantly.

Many exciting avenues for future research lie ahead. For instance, our strategy of imposing an additional constraint in the temporal dimension is motivated by the fact that, due to the size of the convolutional matrices required to train the dictionary, it is not feasible to deal with (x-y-t) blocks directly. This is currently the main limitation of our approach, and it would be interesting to investigate other strategies in follow-up work. Last, we hope that the development of computational techniques like ours will progressively allow the development of commercial imaging hardware with enhanced capabilities.

Figure 8: Representative frames of two reconstructed videos: FireStart (top) and FireHold (bottom). The sequences show a lighter at different stages of ignition. We show reconstruction results for the two approaches discussed in the paper. Left: input coded image, from which 20 frames will be reconstructed (inset shows a close-up). Right: three sample frames of the reconstructed sequence, with the patch-based approach (top row) and with our CSC approach (bottom row).

Figure 9: Additional video sequences reconstructed using our CSC-based approach. Left: coded image which serves as input to the reconstruction algorithm; inset shows a close-up. Right: two of the frames reconstructed for each sequence. Detail is recovered despite the large loss of information undergone during sampling. Note that the blur in the coin of the bottom row is not motion blur, but is due to limited depth of field.

Figure 10: Additional video sequences reconstructed using patch-based sparse reconstruction. Left: coded image which serves as input to the reconstruction algorithm; inset shows a close-up. Right: two of the frames reconstructed for each sequence.

9. Acknowledgements

We would like to thank the Laser & Optical Technologies department of the Aragon Institute of Engineering Research (I3A), as well as Universidad Rey Juan Carlos, for providing a high-speed camera and some of the videos used in this paper. This research has been partially funded by an ERC Consolidator Grant (project CHAMELEON) and the Spanish Ministry of Economy and Competitiveness (projects LIGHTSLICE, LIGHTSPEED, BLINK, and IMAGER).
Ana Serrano was supported by an FPI grant from the Spanish Ministry of Economy and Competitiveness; Elena Garces was partially supported by a grant from Gobierno de Aragon; Diego Gutierrez was additionally funded by a Google Faculty Research Award and the BBVA Foundation; and Belen Masia was partially supported by the Max Planck Center for Visual Computing and Communication.

References

[AEB06] Aharon M., Elad M., Bruckstein A.: K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54 (2006), 4311–4322.

[AF13] Almeida M. S. C., Figueiredo M. A. T.: Frame-based image deblurring with unknown boundary conditions using the alternating direction method of multipliers. In IEEE ICIP (2013), pp. 582–585.

[BEL13] Bristow H., Eriksson A., Lucey S.: Fast convolutional sparse coding. In Proc. CVPR (2013), pp. 391–398.

[BL14] Bristow H., Lucey S.: Optimization methods for convolutional sparse coding. arXiv:1406.2407 (2014).

[CPS∗13] Chen B., Polatkan G., Sapiro G., Blei D., Dunson D., Carin L.: Deep learning with hierarchical convolutional factor analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (Aug 2013), 1887–1901.

[CRT06] Candès E., Romberg J., Tao T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52 (2006), 489–509.

[Don06] Donoho D.: Compressed sensing. IEEE Transactions on Information Theory 52 (2006), 1289–1306.

[EHJT04] Efron B., Hastie T., Johnstone I., Tibshirani R.: Least angle regression. Annals of Statistics 32 (2004), 407–499.

[GBD∗09] Gupta A., Bhat P., Dontcheva M., Curless B., Deussen O., Cohen M.: Enhancing and experiencing spacetime resolution with videos and stills. In IEEE International Conference on Computational Photography (2009).

[GHMN10] Gu J., Hitomi Y., Mitsunaga T., Nayar S.: Coded rolling shutter photography: Flexible space-time sampling. In IEEE International Conference on Computational Photography (2010).

[GRKN07] Grosse R. B., Raina R., Kwong H., Ng A.: Shift-invariant sparse coding for audio classification. In Proceedings UAI (2007), pp. 149–158.

[HDL∗14] Hu X., Deng Y., Lin X., Suo J., Dai Q., Barsi C., Raskar R.: Robust and accurate transient light transport decomposition via convolutional sparse coding. Optics Letters 39, 11 (Jun 2014), 3177–3180.

[HGG∗11] Hitomi Y., Gu J., Gupta M., Mitsunaga T., Nayar S.: Video from a single coded exposure photograph using a learned over-complete dictionary. In IEEE International Conference on Computer Vision (ICCV) (2011), pp. 287–294.

[HHW15] Heide F., Heidrich W., Wetzstein G.: Fast and flexible convolutional sparse coding. In Proc. CVPR (2015).

[HST∗14] Heide F., Steinberger M., Tsai Y.-T., Rouf M., Pajak D., Reddy D., Gallo O., Liu J., Heidrich W., Egiazarian K., Kautz J., Pulli K.: FlexISP: A flexible camera image processing framework. ACM Transactions on Graphics 33, 6 (2014).

[JCK16] Jeon D. S., Choi I., Kim M. H.: Multisampling compressive video spectroscopy. Computer Graphics Forum 35, 2 (2016).

[KF14] Kong B., Fowlkes C. C.: Fast Convolutional Sparse Coding (FCSC). Tech. rep., UC Irvine, 2014.

[KSM∗15] Koller R., Schmid L., Matsuda N., Niederberger T., Spinoulas L., Cossairt O., Schuster G., Katsaggelos A. K.: High spatio-temporal resolution video with compressed sensing. Optics Express 23, 12 (Jun 2015), 15992–16007. doi:10.1364/OE.23.015992.

[LFDF07] Levin A., Fergus R., Durand F., Freeman W. T.: Image and depth from a conventional camera with a coded aperture. ACM Transactions on Graphics 26 (2007).

[LGH∗13] Liu D., Gu J., Hitomi Y., Gupta M., Mitsunaga T., Nayar S.: Efficient space-time sampling with pixel-wise coded exposure for high speed imaging. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2013), 248–260.

[LLWD14] Lin X., Liu Y., Wu J., Dai Q.: Spatial-spectral encoded compressive hyperspectral imaging. ACM Transactions on Graphics 33 (2014), 1–11.

[MBPS10] Mairal J., Bach F., Ponce J., Sapiro G.: Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research 11 (2010), 16–60.

[MCPG11] Masia B., Corrales A., Presa L., Gutierrez D.: Coded apertures for defocus blurring. In Ibero-American Symposium in Computer Graphics (2011).

[MPCG12] Masia B., Presa L., Corrales A., Gutierrez D.: Perceptually optimized coded apertures for defocus deblurring. Computer Graphics Forum 31 (2012), 1867–1879.

[MWBR13] Marwah K., Wetzstein G., Bando Y., Raskar R.: Compressive light field photography using overcomplete dictionaries and optimized projections. ACM Transactions on Graphics 32 (2013), 1–11.

[MWDG13] Masia B., Wetzstein G., Didyk P., Gutierrez D.: A survey on computational displays: Pushing the boundaries of optics, computation, and perception. Computers & Graphics 37, 8 (2013), 1012–1038.

[NM00] Nayar S., Mitsunaga T.: High dynamic range imaging: spatially varying pixel exposures. In CVPR (2000), vol. 1, pp. 472–479.

[RAT06] Raskar R., Agrawal A., Tumblin J.: Coded exposure photography: Motion deblurring using fluttered shutter. ACM Transactions on Graphics 25 (2006), 795–804.

[RB11] Ravishankar S., Bresler Y.: MR image reconstruction from highly undersampled k-space data by dictionary learning. IEEE Transactions on Medical Imaging 30 (2011), 1028–1041.

[SBN∗12] Schöberl M., Belz A., Nowak A., Seiler J., Kaup A., Foessel S.: Building a high dynamic range video sensor with spatially nonregular optical filtering. In Proc. SPIE (2012), vol. 8499, p. 84990C.

[SGM15] Serrano A., Gutierrez D., Masia B.: Compressive high speed video acquisition. In CEIG (2015).

[SHG∗16] Serrano A., Heide F., Gutierrez D., Wetzstein G., Masia B.: Convolutional sparse coding for high dynamic range imaging. Computer Graphics Forum (Proc. EUROGRAPHICS) 35, 2 (2016).

[SKL10] Szlam A., Kavukcuoglu K., LeCun Y.: Convolutional matching pursuit and dictionary training. arXiv:1010.0422 (2010).

[VRA∗07] Veeraraghavan A., Raskar R., Agrawal A., Mohan A., Tumblin J.: Dappled photography: Mask enhanced cameras for heterodyned light fields and coded aperture refocusing. ACM Transactions on Graphics 26, 3 (July 2007).

[WILH11] Wetzstein G., Ihrke I., Lanman D., Heidrich W.: Computational plenoptic imaging. Computer Graphics Forum 30, 8 (2011), 2397–2426.

[WJV∗04] Wilburn B., Joshi N., Vaish V., Levoy M., Horowitz M.: High-speed videography using a dense camera array. In Computer Vision and Pattern Recognition (2004), vol. 2, pp. 294–301.

[WLD∗06] Wakin M., Laska J., Duarte M., Baron D., Sarvotham S., Takhar D., Kelly K., Baraniuk R.: Compressive imaging for video representation and coding. In Proceedings of the Picture Coding Symposium (2006).

[WLGH12] Wetzstein G., Lanman D., Gutierrez D., Hirsch M.: Computational displays. ACM SIGGRAPH Course Notes, 2012.

[WSB03] Wang Z., Simoncelli E., Bovik A.: Multi-scale structural similarity for image quality assessment. In IEEE Conf. on Signals, Systems and Computers (2003), pp. 1398–1402.

[ZLN09] Zhou C., Lin S., Nayar S. K.: Coded aperture pairs for depth from defocus. In IEEE International Conference on Computer Vision (ICCV) (Oct 2009), pp. 325–332.