Camera Fingerprint Extraction via Spatial Domain Averaged Frames


Authors: Samet Taspinar, Manoranjan Mohanty, Nasir Memon

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY

Abstract—Photo Response Non-Uniformity (PRNU) based camera attribution is an effective method to determine the source camera of visual media (an image or a video). To apply this method, images or videos need to be obtained from a camera to create a "camera fingerprint," which can then be compared against the PRNU of the query media whose origin is in question. The fingerprint extraction process can be time consuming when a large number of video frames or images have to be denoised. This may be necessary when the individual images have been subjected to high compression or to geometric processing such as video stabilization. This paper investigates a simple yet effective and efficient technique for creating a camera fingerprint when many still images need to be denoised. The technique utilizes Spatial Domain Averaged (SDA) frames. An SDA-frame is the arithmetic mean of multiple still images. When it is used for fingerprint extraction, the number of denoising operations can be decreased significantly with little or no performance loss. Experimental results show that the proposed method can work more than 50 times faster than conventional methods while providing similar matching results.

Index Terms—PRNU, video forensics, camera fingerprint extraction, image forensics.

I. INTRODUCTION

PRNU-based source camera attribution is a well-studied and successful method in media forensics for finding the source camera of an anonymous image or video [1]. The method is based on the unique Photo Response Non-Uniformity (PRNU) noise of a camera sensor array, which stems from manufacturing imperfections. This PRNU noise can act as a camera fingerprint. The PRNU approach is often used in two scenarios: camera verification and camera identification.
Camera verification aims to establish whether a given query image or video was taken by a suspect camera. This is done by correlating the noise estimated from the query image or video with the fingerprint of the camera, which is usually computed by taking pictures with the camera under controlled conditions. In camera identification, the potential source camera of the query image or video is determined from a large database of camera fingerprints. One can view camera identification as essentially performing n camera verification tasks, where n is the number of camera fingerprints in the database. However, when performing identification, it is assumed that the camera fingerprints are pre-computed.

In both verification and identification, it is often the case that there is no camera available to create fingerprints under controlled conditions. Rather, camera fingerprints are estimated from a set of publicly available media assumed to be from the same camera. Such media can have a very diverse range of quality and content and often lacks metadata.

(Samet Taspinar (email: st89@nyu.edu) and Manoranjan Mohanty (email: manoranjan.mohanty@nyu.edu) are with the Center for Cyber Security, New York University Abu Dhabi, UAE. Nasir Memon (email: nm1214@nyu.edu) is with the Department of Computer Science and Engineering, New York University, New York, USA.)

For efficient fingerprint matching in large databases, various approaches have been proposed. Fridrich et al. [2] proposed the use of fingerprint digests, in which a subset of fingerprint elements having the highest sensitivity is used instead of the entire fingerprint. Bayram et al. [3] introduced binarization, where each fingerprint element is represented by a single bit. Valsesia et al. [4] proposed applying random projections to reduce fingerprint dimension. Bayram et al.
[5] introduced group testing via composite fingerprints, which focuses on decreasing the number of correlations rather than the size (storage) of a fingerprint. Recently, Taspinar et al. [6] proposed a hybrid approach that both decreases the size of a fingerprint and reduces the number of correlations. All these methods were designed and tested for images; however, they can also be used for videos.

Although the image-centric PRNU-based method can be extended to video [7]–[9], source camera attribution with video presents a number of new challenges. First, a video frame is much more heavily compressed than a typical image. Therefore, the PRNU signal extracted from a video frame is of significantly lower quality than one obtained from an image. As a result, a larger number of video frames is required to compute the fingerprint. In fact, Chuang et al. [7] found that it is best to use all the frames instead of only the I- or P-frames to compute a fingerprint. Using a large number of frames can introduce significant computation overhead. For example, computing a fingerprint from 60 I-frames of a one-minute HD video requires one to two minutes, whereas 30 to 40 minutes are required if all frames are used.

In the case of camera identification, the amount of computation can be prohibitive in practical scenarios. For example, computing fingerprints from a thousand one-minute Full HD videos (using all 1800 frames) on a PC may take more than 20 days. Clearly, with billions of media objects uploaded to the Internet every day, large-scale camera source identification quickly becomes infeasible. Although camera fingerprints stored in a database may have to be computed just once by a system, computing a fingerprint estimate at run time from each query video can be prohibitive when a reasonable number of query videos is presented to the camera identification system in a day.
Besides source camera identification, digital stabilization operations performed within modern cameras also present a significant challenge for PRNU-based source camera verification for video [8], [10], [11]. Video stabilization results in sensor-pixel misalignments between individual frames of the video, because the geometric transformations performed to compensate for camera motion and spatially align each frame differ from frame to frame. An accurate camera fingerprint cannot be obtained from misaligned frames as is done with non-stabilized video, even if the video quality is very high. Although there are some preliminary methods that address source camera verification for stabilized video [8], [10], these methods are either limited in scope or have low performance (low true positive rate) and high computation overhead. An alternative approach to the stabilization issue for a fairly long video (at least a couple of minutes) [12] is to use a large number of frames for computing the fingerprint, the idea being that with a large number of frames there will be a sufficient number of aligned pixels at each spatial location to allow the computation of an accurate fingerprint. As discussed above, however, this approach can again introduce computation overhead too high for practical use.

As a third example, modern devices such as smartphones capture different types of media at different resolutions. For example, most cameras do not use the full sensor resolution when capturing a video and downsize the sensor output to a lower resolution using proprietary and often unknown in-camera processing. For such a challenging task, PRNU-based source camera matching may often fail if only I-frames are used.
This paper proposes a computationally efficient way to compute a camera fingerprint from a large number of media objects, such as the individual frames of a video or a large number of highly compressed images taken from a social media platform. In contrast to the two-step conventional fingerprint computation method (which first estimates the PRNU noise of each frame using a denoising filter and then averages the individual PRNU noise estimates to get a reliable fingerprint estimate), the proposed method uses a three-step approach: frame averaging, denoising, and noise averaging. The frame averaging step takes the arithmetic mean of the frames in the spatial domain, resulting in a Spatial Domain Averaged frame (SDA-frame) (Figure 2). Then, in the second step, each SDA-frame is denoised, and the estimated PRNU noise patterns are averaged to arrive at the final fingerprint estimate. The goal is to minimize the number of denoising operations (as denoising is the most expensive step) and also to suppress scene-dependent noise by averaging multiple frames. Experiments with the VISION dataset [13] and NYUAD-MMD [14] show that the proposed method provides a significant speedup in computing accurate fingerprints. It achieves a significantly higher true positive rate than a fingerprint computed from I-frames only, and a much lower computation cost than a fingerprint obtained from all available frames while yielding similar performance.

The rest of the paper is organized as follows. Section II summarizes the PRNU-based method and provides an overview of how digital video stabilization works. Section III explains the proposed fingerprint extraction method using SDA-frames, along with an analysis comparing it with the conventional approach. The insights obtained from the analysis are experimentally validated in Section IV.
Section V examines applications for which the SDA-frame-based technique can be used and reports the improvement that can be achieved using an SDA-based method in those cases. Section VI discusses future work and concludes the paper.

II. BACKGROUND AND RELATED WORK

In this section, we provide a brief review of PRNU-based source camera attribution and video stabilization.

A. PRNU-based Source Camera Attribution

PRNU-based camera attribution rests on the fact that the output of the camera sensor, I, can be modeled as

    I = I^{(0)} + I^{(0)} K + \psi,    (1)

where I^{(0)} is the noise-free still image, K is the PRNU noise, and \psi is the combination of additional noise sources, such as readout noise, dark current, shot noise, content-related noise, and quantization noise. The multiplicative PRNU noise pattern K is unique to each camera and can be used as a camera fingerprint, which enables the attribution of visual media to its source camera. Using a denoising filter F (such as a wavelet filter) on a set of images (or video frames) from a camera, we can estimate the camera fingerprint by first computing the noise residual W_k (i.e., the estimated PRNU) of the k-th image as

    W_k = I_k - \hat{I}_k^{(0)}, \quad \hat{I}_k^{(0)} = F(I_k),

and then averaging the noise residuals of all the images. To determine whether a specific camera has taken a given query image, we first obtain the noise residual of the query image using F and then correlate this noise residual with the camera fingerprint estimate.

For images, the PRNU-based method has been well studied. Following the seminal work in [1], much research has been done to improve the scheme [15]–[19] and to make camera identification effective in practical situations [2], [3], [5], [6], [20].
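As a minimal sketch of this residual-averaging pipeline, the snippet below extracts noise residuals and averages them. A simple box blur stands in for the wavelet denoising filter F used in the paper (an assumption for illustration only; the helper names are ours):

```python
import numpy as np

def box_denoise(img, k=3):
    """Placeholder denoiser: a k x k box blur standing in for the
    wavelet filter F of the paper (an assumption, not the real filter)."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def noise_residual(img, denoise=box_denoise):
    """W_k = I_k - F(I_k): the estimated PRNU of one frame."""
    return img.astype(np.float64) - denoise(img)

def fingerprint_by_averaging(frames):
    """Plain average of the per-frame noise residuals."""
    return np.mean([noise_residual(f) for f in frames], axis=0)
```

Each frame thus costs one denoising operation; the SDA method described later reduces precisely this count.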
Researchers have also studied the effectiveness of the PRNU-based method by proposing various counter-forensics and anti-counter-forensics methods [21], [22]. It has also been shown that the PRNU method can withstand a multitude of image processing operations, such as cropping, scaling [23], compression [24], [25], blurring [24], and even printing and scanning [26].

In contrast, less work has been dedicated to PRNU-based camera attribution from video [27]. Chen et al. [28] first extended the PRNU-based approach to camcorder videos. They used Normalized Cross-Correlation (NCC) to correlate fingerprints calculated from two videos, as the videos may be subject to translation shift, e.g., due to letter-boxing. To compensate for the blockiness artifacts introduced by heavy compression (such as MPEG-x and H.26x compression), they discard the boundary pixels of a block (e.g., a JPEG block). In [29], McCloskey proposed a confidence weighting scheme that can improve PRNU estimation from a video by minimizing the contribution from regions of the scene that are likely to distort the PRNU noise (e.g., excluding high-frequency content). Chuang et al. [7] studied the PRNU-based source camera identification problem with a focus on smartphone cameras. Since smartphone videos are subject to high compression, they considered only I-frames for fingerprint calculation and correlation. Chen et al. [9] proposed a method to estimate PRNU noise from wirelessly streamed videos, which are subject to blocking and blurring. In their approach, they divided a video frame into multiple blocks and discarded the blocks having significant blocking or blurring artifacts. Chuang et al. [7] showed that the best fingerprint is computed when all frames are considered (instead of only the I- or P-frames).
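The shift-tolerant correlation used for videos can be sketched with an FFT-based circular cross-correlation. This is a simplified stand-in for the NCC of [28] (which additionally handles letter-boxing and block-boundary cropping); the function name and normalization are our assumptions:

```python
import numpy as np

def circular_ncc(a, b):
    """Normalized circular cross-correlation of two equal-size noise
    patterns, computed via FFT. If b is a copy of a cyclically rolled
    by (dy, dx), the correlation map peaks at (dy, dx) with value ~1."""
    a = a - a.mean()
    b = b - b.mean()
    corr = np.fft.ifft2(np.fft.fft2(b) * np.conj(np.fft.fft2(a))).real
    return corr / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
```

The best alignment is the location of the peak, e.g. `np.unravel_index(np.argmax(circular_ncc(w1, w2)), w1.shape)`.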
However, to the best of our knowledge, efficient computation of a fingerprint from a given video is a relatively unexplored area.

B. Affine Transformation in Video Stabilization

Fig. 1: Video stabilization pipeline. This figure is a modified version of a figure that appeared in [30].

An out-of-camera digital video stabilization process contains three major stages: camera motion estimation, motion smoothing, and motion correction (Figure 1) [31], [30]. In the motion estimation step, the global inter-frame motion between adjacent frames of a non-stabilized video is modeled from the optical flow vectors of the frames using an affine transformation. In the motion smoothing step, unintentional translations, rotations, and shearing are filtered out from the global motion vectors using a low-pass filter. Finally, in the motion correction step, the stabilized video is created by shifting, rotating, shearing, or zooming frames according to the parameters in the filtered motion vector. Since each video frame can use different parameters, pixels can become misaligned with the sensor array. For example, one frame may be rotated by an angle of -1 degree while another is rotated by 0.5 degrees.

Digital video stabilization presents a big challenge for PRNU-based camera attribution. The frame-specific affine transformations described above render the PRNU method ineffective because of the misalignment between frames. The brute-force methods [10], [22] proposed to address the stabilization issue have had limited success and low performance. These methods try to overcome the desynchronization by first finding the stabilization parameters through an exhaustive search and then performing the corresponding inverse affine transformation. Such methods, therefore, have very high computation overhead. Recently, Mandelli et al.
[11] improved over brute-force approaches by using a best-fit reference frame in the parameter search rather than the first frame of the given video. The best-fit reference frame is the frame that matches the largest number of other frames. Their approach also has high computation overhead.

III. SPATIAL DOMAIN AVERAGING

As mentioned in the introduction, this paper proposes spatial domain averaging for computing camera fingerprints, which reduces the number of denoising operations when many visual objects are available. In the proposed method, efficient computation of a fingerprint is achieved by first creating averaged frames from a large collection and then using these averaged frames to compute the fingerprint. For example, given a video with m frames, g non-intersecting equal-sized subgroups are formed, each with d = m/g frames. A Spatial Domain Averaged frame (SDA-frame) is created from each subgroup by taking the mean of the d frames in the subgroup. In the second step, each SDA-frame is denoised, and the estimated PRNU noise patterns are averaged to arrive at the final camera fingerprint estimate. In this manner, the number of frames that are denoised is reduced by a factor of d. An SDA-frame obtained from three different images is shown in Figure 2.

Fig. 2: An SDA-frame (d) formed as the average of three frames (a)–(c).

The proposed method is motivated by the fact that although the denoising filter is designed to remove random noise originating from the camera sensor (e.g., readout noise, shot noise, dark current noise), as well as noise caused by processing (e.g., quantization and compression), it cannot do a perfect job. Therefore, some scene content leaks into the extracted noise pattern.
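The grouping-and-averaging step above can be sketched as follows (a minimal implementation under our own naming; the handling of frames left over after the last full group is our assumption, as the paper does not spell it out):

```python
import numpy as np

def sda_frames(frames, d):
    """Group a list of equal-size frames into non-overlapping sets of d
    frames and return the spatial-domain average (SDA-frame) of each set.
    Frames beyond the last full group are dropped (our assumption)."""
    g = len(frames) // d
    stack = np.stack(frames[:g * d]).astype(np.float64)
    return stack.reshape(g, d, *stack.shape[1:]).mean(axis=1)
```

Only the g resulting SDA-frames are then denoised, instead of all m input frames, which is where the factor-d saving comes from.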
Averaging in the spatial domain acts as a preliminary filter that smooths the image and potentially reduces the content noise that leaks into the extracted noise pattern. Of course, the effectiveness of the approach then depends on the nature of the two noise signals. Below we analyze this and characterize the relationship between the noise signal arrived at by the conventional approach and by the SDA approach.

Several questions arise when using the proposed approach. First, does frame averaging lead to a drop in the accuracy of the computed fingerprint compared to the conventional method, assuming the same number of images is used for both? If so, what is the trade-off between the decrease in computation and the loss in accuracy? Can accuracy be increased by utilizing more images in the SDA method? If so, what is the optimal combination of averaging and denoising that requires the least computation while yielding the best performance? We investigate these questions both theoretically and experimentally. We first provide a mathematical analysis using a simple framework in the two subsections below. We then validate this analysis in the next section with experimental results. The results show that the spatial domain averaging strategy can indeed yield significant savings in computation while maintaining performance and, in some cases, improving it.

The rest of this section provides an analysis of spatial domain averaging. To this end, we first analyze the conventional method and then the SDA method.

A. Conventional method

As discussed in Section II, in the conventional method the camera fingerprint is estimated from n images from a known camera. Each image I can be modeled as I = I^{(0)} + I^{(0)} K + \psi, where \psi is the random noise accumulated from a variety of sources (as in (1)) and K is the PRNU noise.
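The sensor model just restated can be simulated to generate synthetic test data for experimenting with either pipeline. The PRNU strength and noise level below are arbitrary illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64

# A fixed multiplicative PRNU pattern K shared by all images from one
# simulated "camera" (strength chosen arbitrarily for illustration).
K = 0.02 * rng.standard_normal((H, W))

def capture(scene, sigma1=2.0):
    """Simulate I = I0 + I0*K + psi for a noise-free scene I0,
    with psi ~ N(0, sigma1^2) modeling the combined random noise."""
    psi = sigma1 * rng.standard_normal(scene.shape)
    return scene + scene * K + psi

# e.g. a flatfield scene of constant luminance 128:
flat = capture(np.full((H, W), 128.0))
```

Averaging many such captures of a flat scene converges to 128(1 + K), which is why flatfield images give the cleanest PRNU estimates.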
To estimate K, a denoising filter F, such as [32] or BM3D [33], is used to estimate the noise-free signal I^{(0)}. Using such a filter, we write the noise residual as W = I^{(0)} K + \psi + \xi, where \xi is the content noise. This noise is due to the sub-optimal denoising filter, which is unable to completely separate the content from the PRNU noise. Then, from the n known images, the camera fingerprint estimate \hat{K} can be obtained by Maximum Likelihood Estimation (MLE) as

    \hat{K} = \frac{\sum_{i=1}^{n} W_i I_i}{\sum_{i=1}^{n} I_i^2},    (2)

where W_i is the noise pattern extracted from I_i. In the estimated camera fingerprint \hat{K}, \psi and \xi are the unwanted noise. The quality of \hat{K} can be assessed from its variance Var(\hat{K}) [34]: the lower the variance (e.g., for images with smooth content), the higher the quality. Assuming that \psi and \xi are independent white Gaussian noise with variances \sigma_1^2 and \sigma_2^2 respectively, Var(\hat{K}) is bounded (using the Cramer-Rao Lower Bound, as shown by Fridrich et al. [34]) as

    Var(\hat{K}) \geq \frac{\sigma_1^2 + \sigma_2^2}{\sum_{i=1}^{n} I_i^2}.    (3)

Thus a better PRNU estimate is obtained when \sigma_1^2 and \sigma_2^2 are low (i.e., from high-luminance, low-texture images [34]).

B. Proposed SDA method

In this subsection, we derive the variance of the estimated camera fingerprint obtained using frame averaging. We then compare this variance with that obtained by the conventional approach (in (3)).

Suppose I_1, I_2, ..., I_m are the m images used to compute the camera fingerprint with the SDA method. With frame averaging, these m images are divided into g = m/d disjoint sets of equal size, with d pictures in each set. From each set, an SDA-frame is computed. Thereafter, the process is similar to the conventional approach: each SDA-frame is denoised, and the camera fingerprint is computed from the g noise residuals using MLE. Let I_i^{SDA} be the SDA-frame obtained from the i-th image set.
Then

    I_i^{SDA} = \frac{1}{d} \sum_{j=(i-1)d+1}^{id} I_j = \frac{1}{d} \sum_{j=(i-1)d+1}^{id} \left( I_j^{(0)} + I_j^{(0)} K + \psi_j \right).

We can write this as

    I_i^{SDA} = I_i^{(0),SDA} + I_i^{(0),SDA} K + \psi_i^{SDA},    (4)

where I_i^{(0),SDA} is the noise-free image and \psi_i^{SDA} is the random noise (from pre-filtering sources) in the SDA-frame. This noise can be written as \psi_i^{SDA} = \frac{1}{d} \sum_{j=(i-1)d+1}^{id} \psi_j. If \sigma_1^2 is the variance of the \psi's (assumed to be white Gaussian noise), then the variance of \psi_i^{SDA} is \sigma_1^2 / d.

Let W^{SDA} be the noise residual of an SDA-frame I^{SDA}. Then

    W^{SDA} = I^{SDA} - F(I^{SDA}) = I^{(0),SDA} K + \psi^{SDA} + \xi',

where F is the denoising filter and \xi' = I^{(0),SDA} - F(I^{SDA}) is the content noise due to the sub-optimal nature of the denoising filter. Note that \xi' is assumed to be independent of the PRNU signal I^{(0),SDA} K; although \xi' contains content leakage, it is negligible compared to I^{(0),SDA} K [34]. We know that \xi' depends on the smoothness of the SDA-frames: if the frames contain textured content, \xi' is high. Assuming that SDA-frames have smoothness similar to that of the input frames from which they are created, we take \xi' and \xi to have the same variance \sigma_2^2.

Using MLE, the camera fingerprint can now be estimated from the g SDA-frames I_1^{SDA}, I_2^{SDA}, ..., I_g^{SDA} as

    \hat{K}^{SDA} = \frac{\sum_{i=1}^{g} W_i^{SDA} I_i^{SDA}}{\sum_{i=1}^{g} \left( I_i^{SDA} \right)^2}.

Using the Cramer-Rao Lower Bound, the variance of the estimated fingerprint \hat{K}^{SDA} becomes

    Var(\hat{K}^{SDA}) \geq \frac{\sigma_1^2 / d + \sigma_2^2}{\sum_{i=1}^{g} \left( I_i^{SDA} \right)^2}.    (5)

In the ideal case, the averaging operation does not degrade the quality of the PRNU estimated from the SDA-frames. In other words, we want Var(\hat{K}^{SDA}) to be approximately equal to the variance of the conventional method, Var(\hat{K}).
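The two MLE estimators, (2) for individual images and its SDA counterpart above, share the same element-wise form and can be sketched with one helper (our naming; the small denominator regularizer is an assumption for numerical safety):

```python
import numpy as np

def mle_fingerprint(residuals, images):
    """Element-wise MLE fingerprint estimate of Eq. (2):
    K_hat = sum_i(W_i * I_i) / sum_i(I_i^2).
    Works unchanged for SDA-frames and their residuals."""
    num = np.zeros_like(images[0], dtype=np.float64)
    den = np.zeros_like(images[0], dtype=np.float64)
    for w, img in zip(residuals, images):
        img = img.astype(np.float64)
        num += w * img
        den += img ** 2
    return num / (den + 1e-12)
```

Passing SDA-frames and their residuals instead of individual images yields \hat{K}^{SDA} with no other change.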
Using the results from (3) and (5), it is thus desired that

    \frac{\sigma_1^2 / d + \sigma_2^2}{\sum_{i=1}^{g} \left( I_i^{SDA} \right)^2} \approx \frac{\sigma_1^2 + \sigma_2^2}{\sum_{i=1}^{n} I_i^2}.

Rearranging, we get

    \frac{\sigma_1^2 / d + \sigma_2^2}{\sigma_1^2 + \sigma_2^2} \approx \frac{\sum_{i=1}^{g} \left( I_i^{SDA} \right)^2}{\sum_{i=1}^{n} I_i^2}.

Suppose

    \frac{\sum_{i=1}^{g} \left( I_i^{SDA} \right)^2}{\sum_{i=1}^{n} I_i^2} = \frac{g}{n} \times k, \quad \text{where} \quad k = \frac{\left( \sum_{i=1}^{g} (I_i^{SDA})^2 \right) / g}{\left( \sum_{i=1}^{n} I_i^2 \right) / n}.

The temporary variable k is less than or equal to 1, as the numerator (\sum_{i=1}^{g} (I_i^{SDA})^2)/g is less than or equal to the denominator (\sum_{i=1}^{n} I_i^2)/n. Substituting, we get

    \frac{g}{n} \times k \approx \frac{\sigma_1^2 / d + \sigma_2^2}{\sigma_1^2 + \sigma_2^2}.

Putting g = m/d into the equation above gives

    \frac{m \times k}{d \times n} \approx \frac{\sigma_1^2 + d \sigma_2^2}{d (\sigma_1^2 + \sigma_2^2)},

or

    m \approx \frac{n}{k} \times \frac{\sigma_1^2 + d \sigma_2^2}{\sigma_1^2 + \sigma_2^2}.    (6)

We then discard the temporary variable k. Since 0 < k \leq 1, we have n/k \geq n, and the final relation becomes

    m \geq n \times \frac{\sigma_1^2 + d \sigma_2^2}{\sigma_1^2 + \sigma_2^2}.    (7)

From (7), we can derive the following concluding remarks:

• Since d \geq 1, the fraction on the right-hand side is at least 1. Therefore, the number of images required by the proposed SDA method (i.e., m) is greater than or equal to the number required by the conventional method (i.e., n).

• For smooth images, \sigma_2^2 is close to zero, so the impact of the SDA-depth d is negligible. The SDA and conventional approaches will therefore have similar performance, while the SDA technique will be d times faster in the best case.

• For textured images, when the number of images is equal for both techniques (i.e., m = n), the conventional approach is expected to outperform the SDA approach because \sigma_2^2 is greater than zero.

• Since \sigma_2^2 is greater than zero for textured images, the ratio m/n of images needed by the SDA approach relative to the conventional approach increases as the SDA-depth d increases.
Therefore, the SDA approach will require more images to achieve the same performance on textured images.

Note that it is hard to characterize the relationship between \sigma_1 and \sigma_2; moreover, \sigma_1 depends on various factors such as shot noise, exposure time, temperature, illumination, and image content. Therefore, we do not focus on their relationship in this work. In the following section, we experimentally validate the observations listed above.

IV. VALIDATION OF ANALYSIS

In this section, we experimentally verify the main conclusions of the analysis performed in the previous section. In our experiments we use both flatfield and textured images from the VISION dataset [13]. The implementation used Matlab 2016a on a Windows 7 PC with 32 GB memory and an Intel Xeon E5-2687W v2 @ 3.40 GHz CPU. The wavelet denoising algorithm [32] was used to obtain fingerprints and PRNU noise. PCE and NCC were used for comparison. A preset threshold of 60 [35] was used for PCE values; values higher than this threshold were taken to indicate that the two media objects originated from the same camera.

A. Studying the effect of smoothness

To verify the observations of the analysis related to the smoothness of the images used to compute a camera fingerprint, we randomly selected 50 flatfield images and 50 textured images from each camera in the dataset. For each of these types, five experiments were conducted using a random set of 5, 10, 20, 30, and 50 images for computing the fingerprint. For example, when we chose 30 flatfield images, we created one fingerprint using the conventional approach by denoising each of the 30 images and then averaging the PRNU noise patterns to arrive at the fingerprint estimate.
A second fingerprint estimate was then computed with the SDA approach by first averaging the same 30 images in the spatial domain and then denoising this SDA-frame of depth 30 to arrive directly at another fingerprint estimate. Therefore, a total of 20 fingerprints was obtained for each camera (2 types of images; 2 fingerprint extraction techniques; 5 different cardinalities of image sets used for fingerprint computation). Each fingerprint was correlated with the PRNU noise obtained from the remaining images in the dataset taken with the same camera; this test set consisted of both textured and flatfield images. To create an abundance of test cases, we divided each full-resolution fingerprint into 500 x 500 disjoint blocks and correlated them with the corresponding blocks in the test images. As a result, a total of 244,127 comparisons were made.

Fig. 3: The effect of texture in terms of PCE.

Fig. 3 shows how image content affects the PCE for fingerprints obtained from 5, 10, 20, 30, or 50 flatfield and textured images. The figure shows that for flatfield images, despite the significantly lower number of denoising operations performed by the SDA approach, the results are similar to the conventional approach. This observation holds regardless of the number of images averaged for fingerprint extraction. The performance of the SDA approach drops for textured images. However, this difference can be overcome by increasing the number of images used for the SDA technique while still keeping the number of denoising operations lower than in the conventional approach. We investigate this issue in the next subsection.

If we consider the above results in terms of TPR, the SDA approach fares better once the PCE is thresholded at a set value (60 in our case) to arrive at the attribution result.
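The PCE decision statistic used here can be sketched as the squared correlation peak divided by the mean squared correlation away from the peak. This is a simplified version; the size of the excluded neighborhood around the peak is our assumption:

```python
import numpy as np

def pce(corr_map, exclude=5):
    """Simplified Peak-to-Correlation-Energy: squared peak value divided
    by the mean squared value outside a small square around the peak.
    A sketch of the statistic thresholded at 60 in the experiments."""
    peak_idx = np.unravel_index(np.argmax(np.abs(corr_map)), corr_map.shape)
    peak = corr_map[peak_idx]
    y, x = peak_idx
    mask = np.ones(corr_map.shape, dtype=bool)
    mask[max(0, y - exclude):y + exclude + 1,
         max(0, x - exclude):x + exclude + 1] = False
    energy = np.mean(corr_map[mask] ** 2)
    return peak ** 2 / (energy + 1e-12)
```

A correlation map with a clear peak yields a PCE far above the threshold of 60, while a map of pure noise stays well below it.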
So a drop in PCE does not necessarily result in a wrong decision. This can be observed in Fig. 4, which shows the TPR for the same experiments when the threshold is set to 60 as proposed in [35]. The other implications of these figures are already well known in the field (i.e., flatfield images are better than textured ones, and as the number of images increases, the quality of the fingerprint increases, resulting in higher PCE and TPR).

Fig. 4: The effect of texture in terms of TPR.

Table I shows the average time it takes to extract a fingerprint estimate with the two methods in the above experiment. Note that in both cases the same number of images, m, is read from disk, but the SDA technique needs only one denoising operation whereas the conventional approach performs m denoising operations. This implies that as the number of training images increases, the speedup also increases: a speedup of 13.5 times can be achieved by averaging 50 images before denoising.

TABLE I: Average time to extract fingerprints with the proposed and conventional methods (in seconds)

                  5       10      20      30       50
    SDA           4.97    5.99    8.22    10.35    14.49
    Conventional  21.57   40.81   79.96   118.79   196.59
    Speedup       4.34    6.81    9.73    11.48    13.57

B. Fingerprint equivalence for textured images

For textured images, our analysis indicated that more images are needed by the SDA method, with a corresponding reduction in the speedup obtained. In this experiment, our goal is to investigate how many images SDA requires, compared to the conventional approach, to yield similar performance for textured images while still retaining a speedup in fingerprint computation. This experiment was again performed using images from the VISION dataset [13]. We created a training set of 50 textured images for each camera in the VISION dataset. Nineteen fingerprints were created using 2, 3, ...,
20 images with the conventional approach. We also created 49 fingerprints with the SDA method using 2, 3, ..., 50 images. As in the previous experiment, each fingerprint was partitioned into disjoint 500 x 500 blocks and correlations were computed with the corresponding blocks of the test PRNU noise patterns.

Fig. 5: Fingerprint equivalence for the SDA and conventional approaches. The x-axis indicates the number of images for the conventional approach; the left y-axis (red) is the number of images required for SDA and the right y-axis (blue) is the speedup gained in this case.

Figure 5 shows the number of images required by the SDA approach to achieve at least the same TPR as the conventional approach, along with the speedup gained in each case. For example, when a fingerprint is created from 20 textured images the conventional way, the same TPR can be achieved using 48 images with the SDA approach; fingerprint extraction is then approximately 3.85 times faster with SDA. The figure shows that by using 2-3 times more images with the SDA method, up to a 4 times speedup can be achieved with no loss in TPR when the images are textured.

C. Effect of SDA-depth on image fingerprints

In Section III, we showed that as the SDA-depth increases while the number of images for fingerprint extraction stays constant, the TPR is expected to drop. To verify this remark, we used 50 textured images per camera for fingerprint extraction. We did not include any flatfield images in this set, as flatfield images result in a negligible performance difference between SDA and conventional fingerprints. We then created fingerprints using the 50 textured images from each camera in the VISION dataset, setting the SDA-depth to 1, 2, 5, 10, 25, and 50, thereby creating 50, 25, 10, 5, 2, and 1 SDA-frames, respectively.
The SDA-frames were denoised and then averaged to arrive at the final fingerprint estimate. For each fingerprint estimate computed, the rest of the images were used as test images. We correlated each fingerprint with the PRNU noise extracted from the test images in a block-wise manner, as done in previous experiments. Notice that SDA-1 is the same as the conventional approach.

TABLE II: SDA-depth vs. PCE and TPR

       SDA-1  SDA-2  SDA-5  SDA-10  SDA-25  SDA-50
PCE    652.8  514.6  390.0  332.2   285.0   252.4
TPR    0.80   0.78   0.75   0.72    0.69    0.67

Table II shows that as the SDA-depth increases, the average PCE decreases. For textured images, the more images we combine to create an SDA-frame, the lower the resulting PCE and TPR values. This supports the third observation of the analysis in Section III.

This section has provided a validation of Section III by experimentally supporting all three observations derived from the analysis. Namely, when images are not textured, and hence post-filtering noise is low, SDA and conventional fingerprints from the same images perform similarly, which can lead to a 13.5 times speedup. On the other hand, textured images and larger SDA-depths require a higher number of images to achieve the same performance as the conventional approach. Yet, a speedup by a factor of 4 can still be achieved in most cases.

In the next section, we apply the proposed approach to practical problems and show that SDA fingerprints can perform with a significantly higher accuracy, or result in a significant speedup, compared to state-of-the-art fingerprint extraction techniques.

V. APPLICATION TO COMPUTING VIDEO FINGERPRINTS

In this section, we investigate a more practical use case of the proposed SDA technique: its use for extracting fingerprint estimates (FE) from videos.
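The PCE statistic used throughout these experiments can be sketched as follows. This is a generic, minimal implementation (FFT-based circular cross-correlation with a small exclusion zone around the peak), not the authors' exact code; the 11 × 11 exclusion neighborhood is an assumption.

```python
import numpy as np

def pce(fingerprint, residual, excl=5):
    """Peak-to-Correlation Energy: squared correlation peak divided by
    the average squared correlation outside a small peak neighborhood."""
    f = fingerprint - fingerprint.mean()
    r = residual - residual.mean()
    # circular cross-correlation via FFT
    xcorr = np.real(np.fft.ifft2(np.fft.fft2(f) * np.conj(np.fft.fft2(r))))
    peak_pos = np.unravel_index(np.argmax(np.abs(xcorr)), xcorr.shape)
    peak = xcorr[peak_pos]
    # exclude an (2*excl+1)^2 neighborhood around the peak (with wraparound)
    rows = np.arange(peak_pos[0] - excl, peak_pos[0] + excl + 1) % xcorr.shape[0]
    cols = np.arange(peak_pos[1] - excl, peak_pos[1] + excl + 1) % xcorr.shape[1]
    mask = np.ones(xcorr.shape, dtype=bool)
    mask[np.ix_(rows, cols)] = False
    return peak ** 2 / np.mean(xcorr[mask] ** 2)

rng = np.random.default_rng(0)
f = rng.standard_normal((64, 64))
match = pce(f, f + 0.05 * rng.standard_normal((64, 64)))   # same "camera"
non_match = pce(f, rng.standard_normal((64, 64)))          # different "camera"
```

With this statistic, a matching pair produces a PCE orders of magnitude above the threshold of 60, while an unrelated pair stays far below it.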
As Section II explains, two of the most common ways to extract a fingerprint from a video are using only I-frames or using all frames (or the first n frames). While the former results in low performance, the latter can be impractical in many real-life applications due to very high computational needs. For example, computing fingerprints from 50 one-minute videos (i.e., approximately 1800 frames per video) using a single thread may take up to a day. In this section, we provide experimental results that demonstrate how the SDA approach can significantly reduce the time needed for computing fingerprint estimates from video, while retaining the same performance obtained by conventional approaches that use a significantly larger number of denoising operations.

In each experiment below, three different types of fingerprints (i.e., I-frames only, SDA-frames, and all frames) were obtained from each video. For the sake of simplicity, we refer to them as I-FE (i.e., Fingerprint Estimate), SDA-FE, and ALL-FE, respectively. Moreover, in some cases, we append the SDA-depth when we need to highlight it. For example, SDA-50-FE indicates that the video frames were divided into groups of 50 and each group averaged to create an SDA-frame.

In the first experiment, we examine source matching for videos: that is, given two videos, can we determine whether they are from the same camera? Next, we investigate a more difficult case that involves mixed media. In that subsection, we also analyze an important question related to mixed media: “What is a good balance of SDA-depth which optimizes speed and performance?” In the next two subsections, we examine the performance achieved with videos and images obtained from social media such as Facebook and YouTube.
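Building SDA-frames from a video does not require holding all frames in memory; a running sum per group suffices. The sketch below assumes a hypothetical `frames` iterable yielding equal-sized float arrays; any leftover frames at the end simply form a shallower final group.

```python
import numpy as np

def sda_frames_from_stream(frames, depth):
    """Average consecutive groups of `depth` video frames into SDA-frames,
    keeping only one accumulator frame in memory at a time."""
    sda, acc, n = [], None, 0
    for frame in frames:
        acc = frame.astype(np.float64) if acc is None else acc + frame
        n += 1
        if n == depth:            # group complete: emit its mean
            sda.append(acc / n)
            acc, n = None, 0
    if n:                         # leftover frames form a last, smaller group
        sda.append(acc / n)
    return sda

# 1200 frames at SDA-depth 50 -> 24 SDA-frames (as with an SDA-50-FE)
stream = (np.full((2, 2), float(i)) for i in range(1200))
groups = sda_frames_from_stream(stream, 50)
```

The first SDA-frame is the mean of frames 0–49, so in this toy stream each of its pixels equals 24.5.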
Finally, we show how the proposed technique can be used for source attribution with moderate-length stabilized videos (i.e., up to 4 minutes), from which obtaining a “reliable” FE might otherwise take a couple of hours each using all frames.

Two datasets were used in the experiments: the NYUAD-MMD and VISION datasets. The NYUAD-MMD dataset contains images and videos of different resolutions and aspect ratios from 78 cameras of different models and brands. This makes it a challenging dataset for mixed media attribution. Moreover, it contains stabilized videos longer than 4 minutes from 5 cameras. Hence, we used this dataset for the experiments with mixed media and stabilized video. The videos in the dataset are typically around 40 seconds long (i.e., approximately 1200 frames each), and the images are pristine (i.e., no out-camera operations). The VISION dataset contains various high-quality videos and images from social media such as Facebook and YouTube. Hence, we used this dataset in the experiments involving social media.

A. Matching Two Non-Stabilized Videos

In the first experiment, we examine source matching for videos using FEs computed with the three different approaches that have been presented. Our goal was to estimate the length of video, and the resulting computation time, needed to achieve greater than 99% TPR for I-FEs, SDA-FEs, and ALL-FEs. This way, a clear comparison of the three approaches could be made. FEs were first created from the non-stabilized videos of the same resolution in the VISION dataset. FEs were extracted from the first 5, 10, . . . , 40 seconds of each video using the two techniques mentioned in Section II and the proposed method. On average, each video had approximately one I-frame per second. We selected an SDA-depth of 30, resulting in one SDA-frame per second of video.

Fig.
6: TPR for different lengths of video using I-FEs, SDA-FEs, and ALL-FEs.

Figure 6 shows the TPR using I-FEs, SDA-FEs, and ALL-FEs as the length of the videos increases. As seen, SDA-FEs outperform ALL-FEs in this setting for all video lengths. The difference varies between 0.5% (for 5-second videos) and 1.7% (for 15-second videos). Both achieve significantly higher TPR than I-FEs. For example, for 10-second videos, SDA-FEs and ALL-FEs result in 94.1% and 95.6% TPR, respectively, whereas I-FEs can only reach 62.2% TPR.

The highest TPR achieved using I-FEs was 83.7% (i.e., for 40-second videos), which is still lower than the TPR of SDA-FEs and ALL-FEs when they were computed from only 5-second videos (i.e., more than 87%). This is because SDA-FEs and ALL-FEs use all 150 frames in a 5-second video (i.e., I-, B-, and P-frames), whereas the I-FEs use only 40 I-frames on average and “waste” the rest of the frames. Hence, for this setting, I-FEs fail to reach an accuracy comparable to the other two methods.

TABLE III: Time for video fingerprint extraction (in seconds)

type     averaging  I/O + denoising  total
I-FE     0          50               50
SDA-FE   12         50               62
ALL-FE   0          1407             1407

We then estimated the time required for extraction of each FE from a 40-second Full HD video captured at 30 FPS. Table III compares the average times. It takes 50, 62, and 1407 seconds for an I-FE, SDA-FE, and ALL-FE, respectively. However, these times are for FEs obtained from 40-second videos. When we evaluate the time required to achieve 83% TPR, we need less than 5 seconds of video for SDA-FEs and ALL-FEs, whereas I-FEs require 40 seconds of video. This suggests that the required times for SDA-FEs and ALL-FEs are less than 8 and 176 seconds, respectively. Hence, the SDA technique is at least 6 times faster than I-FEs and requires 8 times shorter videos, yet still achieves a higher TPR. Moreover, it performs up to 1.
7% higher than ALL-FEs in terms of TPR and speeds up the computation by approximately 22.5 times in this setting. Moreover, while SDA-FEs can achieve 99% TPR with 20-second videos, ALL-FEs need 30-second videos to do the same. Therefore, close to 34 times speedup can be achieved in this case when the SDA-depth is set to 30.

Notice that these results involve videos that did not undergo any processing such as scaling or compression by social media. Also, all videos in the VISION dataset were taken with high luminance. Therefore, it is possible to see lower performance on more difficult datasets, such as when videos are dark or processed. However, our intention here was to demonstrate the effectiveness of the SDA approach first for the simplest of cases. We examine more challenging situations in the further experiments below.

B. Mixed Media Attribution

As we have seen in the previous subsection, using I-FEs causes a significant drop in TPR, whereas 20–30 seconds of video is enough to achieve more than 99% TPR for both SDA-FEs and ALL-FEs. In this subsection, we investigate a more challenging scenario where a video FE needs to be matched with a single query image. In [14], source attribution with mixed media was investigated using the NYUAD-MMD dataset, a very challenging dataset containing images and videos of various resolutions from 78 cameras. Here, we performed a “train on videos, test on images” experiment for I-FEs, SDA-FEs, and ALL-FEs. That is, a camera FE was computed from video, and the query image was cropped and resized and its PRNU matched with the FE. The resizing and cropping parameters used for the matching were obtained from the “train on images, test on videos” experiment done in [14]. The videos in this dataset were typically around 40 seconds long, each having approximately 1200 frames. The dataset contains a total of 301 non-stabilized videos and 6892 images from those cameras.
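Given the PCE scores of the matching (“true”) and non-matching (“false”) pairs, the TPR and FPR at the fixed threshold of 60 reduce to simple counting. A minimal sketch, with made-up score lists for illustration only:

```python
def tpr_fpr(true_scores, false_scores, threshold=60.0):
    """Fraction of matching pairs above the threshold (TPR) and of
    non-matching pairs above it (FPR)."""
    tpr = sum(s > threshold for s in true_scores) / len(true_scores)
    fpr = sum(s > threshold for s in false_scores) / len(false_scores)
    return tpr, fpr

# Hypothetical PCE scores (not from the paper)
true_pce = [652.8, 95.0, 41.2, 300.5]    # matching pairs
false_pce = [3.1, 12.7, 64.0, 8.9]       # non-matching pairs
tpr, fpr = tpr_fpr(true_pce, false_pce)  # 3 of 4 true and 1 of 4 false exceed 60
```

Sweeping the threshold over all observed scores and plotting (FPR, TPR) at each value yields the ROC curves used later in this section.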
Each video FE was correlated with the PRNU noise of all test images from the same camera to estimate the “true cases,” which resulted in 23571 correlations. Then, each video FE from the i-th camera was compared with the PRNU noise of images from the (i + 1)-th camera, using the resizing and cropping parameters that maximize the PCE for the image FE (i.e., the FE obtained from all images of the camera using the conventional approach). This way, we estimated the “false cases,” which resulted in 17755 correlations.

In the previous experiment we had used a fixed SDA-depth, d, of 30. In this experiment we used different SDA-depths to investigate their impact on performance and speed. Given a video of m frames (in our case approximately 1200 frames), we divided the frames into groups of d = 1, 5, 10, 30, 50, 200, 1200. Therefore, the number of SDA-frames, g, became 1200, 240, 120, 40, 24, 6, 1, respectively. When d = 1, the technique becomes the same as using all frames, whereas when d = 1200, only a single SDA-frame is created by averaging all 1200 frames.

After obtaining the PCE of the “true” and “false” cases, we created an ROC curve for each video FE type/depth. Figure 7 shows the ROC curves for each of the SDA-FEs of different depths, as well as the I-FE and ALL-FE. The results show that the ALL-FE yields the highest performance, whereas the I-FE performs significantly poorer than the others. The proposed SDA method performs close to the ALL-FE method for all depths.

Fig. 7: The ROC curves for varying SDA-depths.

Table IV shows more detailed results. |PCE| stands for the average of the PCE ratios with respect to I-FEs. For example, when an ALL-FE from the i-th video is correlated with the noise of the j-th image, its PCE is on average 3.2 times higher than that of the I-FE obtained from the same video.
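The group counts above follow directly from g = ⌈m/d⌉; a quick check:

```python
import math

m = 1200                                     # frames per video
depths = [1, 5, 10, 30, 50, 200, 1200]       # SDA-depths d
groups = [math.ceil(m / d) for d in depths]  # number of SDA-frames g
```

For m = 1200 this reproduces the sequence 1200, 240, 120, 40, 24, 6, 1 used in the experiment.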
The reason we used such a normalization instead of the average PCE is that outliers have a big impact on the average PCE. Moreover, the table shows the TPR for the PCE threshold of 60, the average time to extract an FE, and the speedup compared to ALL-FEs. As seen, the results indicate that the TPR of the SDA method is very close to that of the ALL-FE. However, a speedup of up to 52 times can be achieved using the SDA method.

TABLE IV: Detailed information for mixed media attribution

          I-    ALL-   5     10    30    50    200   1200
|PCE|     1.0   3.2    3.1   2.9   2.6   2.6   2.5   2.4
TPR (%)   64.0  83.1   82.3  81.3  80.0  79.8  80.1  79.8
time (s)  50    1407   276   142   62    48    32    27
speedup   28.1  1.0    5.1   9.9   22.7  29.3  44.0  52.1

Similar to the previous experiment, using I-FEs yields significantly lower accuracy (at least 16% lower TPR). Moreover, when the SDA-depth is ≥ 30, SDA-FEs are faster to extract than I-FEs. Notice that when ALL-FEs are used, it takes approximately five days to extract all the FEs from the 301 videos in the NYUAD-MMD dataset using a single-threaded implementation. This type of performance will clearly be impractical for many applications.

C. Train and test on YouTube videos

This experiment explores the performance achieved when two video FEs from YouTube are correlated. Although this experiment is essentially the same as that of Section V-A, it is relevant in practice as high compression is involved. Note that a key motivation of the SDA approach is that when high compression is used, a large number of frames are needed for computing a reliable FE. We created FEs from all non-stabilized YouTube videos in the VISION dataset (i.e., the ones labeled flatYT, indoorYT, and outdoorYT) using only I-frames, SDA-50, SDA-100, SDA-200, and ALL-frames. Here, we used the first 10, 20, . . . , 60 seconds of the YouTube videos to extract FEs.
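The speedup row in Table IV is simply the ALL-FE extraction time divided by each method's time; a quick check against the reported values:

```python
all_fe_time = 1407.0                          # seconds, from Table IV
times = [50, 1407, 276, 142, 62, 48, 32, 27]  # I-, ALL-, SDA-5 ... SDA-1200
speedups = [round(all_fe_time / t, 1) for t in times]
```

Rounded to one decimal, this reproduces the speedup row of the table, e.g. 1407/27 ≈ 52.1 for SDA-1200.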
Each 60-second video had approximately 1800 frames that were used for SDA- or ALL-FEs, whereas it contained 31.3 I-frames on average. After fingerprint extraction, we correlated each video FE with the others of the same type and same length taken by the same camera. For example, an I-FE from 20 seconds of video was correlated with all I-FEs obtained from the rest of the 20-second videos from the same camera. The same was done for SDA- and ALL-FEs. This way, a total of 3124 correlations were done for each type.

Fig. 8: The effect of FE type and video length on TPR for YouTube videos.

Figure 8 shows the TPR for varying lengths of video for each FE type. The figure shows that I-FEs perform very poorly in all cases, and any FE type created from more than 20 seconds of video outperforms I-FEs. While ALL-FEs perform better than SDA-FEs for videos of the same length, this difference can be overcome by increasing the video length while still using far fewer denoising operations. For example, SDA-50 FEs obtained from 50-second videos, or SDA-100 FEs from 60-second videos, perform approximately the same as ALL-FEs obtained from 30 seconds (within a ±1% TPR range). Hence, instead of using 900 frames for ALL-FEs, using 1800 frames for SDA-100 can result in a significant speedup with no loss in TPR. While an ALL-FE from 900 frames of a Full HD video takes 1045 seconds to compute, an SDA-100 FE from 1800 frames, which performs only 18 denoising operations instead of 900, takes 56 seconds to compute. Therefore, a speedup of close to 19 times can be achieved with SDA-100, with a 1% increase in TPR. Notice that, because most videos in the VISION dataset are around 60 seconds long, this limits the maximum video length we could use in our experiments.

D.
Train on Facebook images, test on YouTube videos

From the previous experiments, we know that the SDA method can achieve a significant speedup for both videos and images, with a small loss in performance that can be overcome by increasing the number of still images used for fingerprint extraction, if available. In this experiment, our goal was to show that the proposed method can be successfully applied to other social media. Specifically, in this subsection, we extract FEs from Facebook images and match them with the FEs of YouTube videos. We call this the “train on Facebook images, test on YouTube videos” experiment. The importance of this experiment is that both media-sharing services contain billions of visual media items, and computing ALL-FEs from such collections can have a very high time complexity. Therefore, faster fingerprint extraction methods (along with search techniques) that speed up attribution are badly needed.

In this experiment, for the cameras in the VISION dataset that had non-stabilized videos, we created an FE from 100 Facebook images (i.e., the ones labeled FBH) using the conventional fingerprint computation method. We then used the FEs from non-stabilized YouTube videos (those created in the previous experiment). We again used I-frames, SDA-50, SDA-200, SDA-600, and ALL-frames computed from the first 60 seconds of the YouTube videos. We then correlated the image FE of a camera with the FE of each video of each type using the efficient search proposed in [14]; a total of 343 pairs were compared for each FE type. Table V shows the TPR of these correlations. Similar to the “train on videos, test on images” experiment, these results show that the FEs obtained from Facebook images match the YouTube videos with 81.4% TPR for SDA-50, which is higher than both ALL-FEs and I-FEs. On the other hand, FEs from I-frames yield approximately 30% lower TPR.
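All of the matching experiments above ultimately correlate a fingerprint estimate with a PRNU noise residual, often block by block. A minimal normalized cross-correlation sketch (the 500 × 500 block size follows the earlier experiments; the rest is a generic illustration, not the authors' code):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equal-sized arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

def blockwise_ncc(fingerprint, residual, block=500):
    """Correlate corresponding disjoint `block` x `block` tiles."""
    h, w = fingerprint.shape
    return [ncc(fingerprint[i:i + block, j:j + block],
                residual[i:i + block, j:j + block])
            for i in range(0, h - block + 1, block)
            for j in range(0, w - block + 1, block)]

rng = np.random.default_rng(7)
fp = rng.standard_normal((1000, 1000))
same = blockwise_ncc(fp, fp + 0.1 * rng.standard_normal((1000, 1000)))
other = blockwise_ncc(fp, rng.standard_normal((1000, 1000)))
```

A 1000 × 1000 array yields four disjoint blocks; the correlated pair scores near 1 in every block, while the unrelated pair stays near 0.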
These results show that the SDA approach is a good replacement for I-FEs or ALL-FEs in this scenario.

TABLE V: TPR of different FE types when an FE from Facebook images and another from YouTube videos are correlated

       I-FE   SDA-50  SDA-200  SDA-600  ALL-FE
TPR    51.60  81.4    79.88    78.13    79.59

E. Matching two stabilized videos

A recent work [12] has shown that an FE obtained from a long stabilized video can successfully be matched with other videos from the same camera. However, thousands of frames must be denoised, which may not be practical in many circumstances. A potential alternative is the SDA method, which may lead to a significant speedup. To evaluate this, we captured stabilized videos from 5 cameras: a total of 37 videos, adding up to 260 minutes. We extracted FEs from the frames of 20, 40, . . . , 240-second video lengths using the conventional (I-frame and ALL-frame) methods as well as the SDA method with SDA-depths of 30, 50, and 200. These depths were deemed reasonable choices based on the previous experiments. As shown in [8], [10], [11], the first frame of a video is typically not geometrically transformed. Since we divide each video into pieces, some pieces would not contain an untransformed frame; we therefore discarded the first frame of each video to avoid inconsistencies. We correlated each FE with the other FEs of different videos from the same camera that were created using the same number of frames. For example, SDA-30-FEs of 20-second videos were correlated with FEs of the same type from the same camera. Figure 9 shows the TPR for three cameras (i.e., Huawei Honor, Samsung S8, and iPhone 6 Plus) as well as the overall average of all five cameras.

Fig.
9: TPR for stabilized videos for varying SDA-depths.

The results show that as videos get longer, ALL-FEs and SDA-FEs achieve higher TPR. Moreover, the effect of increased SDA-depth is more significant for this case than for non-stabilized videos. While for some cameras ALL-FEs and SDA-FEs perform similarly (e.g., the Huawei and Samsung cameras), for others (e.g., the iPhone cameras) there is a significant difference between the two. For example, for the Samsung S8, the SDA-200-FE from a 120-second video performs similarly to the 180-second ALL-FE. Therefore, for this particular case, SDA-200 can speed up the computation by 66 times, i.e., (180/120) × (1407/32) (see Table IV for the times). On the other hand, for the iPhone 6 Plus, ALL-FEs from 60-second videos and SDA-50-FEs from 160-second videos have similar TPR. Therefore, an 11 times speedup, i.e., (60/160) × (1407/48), can be achieved in this case. Hence, a speedup between these two numbers (i.e., 11 and 66) can be achieved without any loss in TPR if a long video is available.

Overall, this section shows that the proposed SDA-FEs outperform the commonly used I-frame-only technique in all the video cases considered, including mixed media, stabilized videos, and social media. At the same time, SDA-FEs achieve results comparable to ALL-FEs with up to 52 times speedup in these experiments. We also show the impact of the SDA-depth on the performance that can be achieved in various cases.

VI. CONCLUSION AND FUTURE WORK

This paper has investigated camera fingerprint extraction using Spatial Domain Averaged frames, which are the arithmetic mean of multiple still images. By adding one extra step of averaging before denoising, a significant speedup can be achieved for fingerprint extraction. We have shown that this technique can successfully be used for images, non-stabilized videos, as well as stabilized videos, to speed up the fingerprint extraction process.
The proposed method is especially useful when the number of denoising operations needed can be very high, for example, when dealing with non-stabilized videos, highly compressed stabilized videos, or images from social media.

It is often assumed that for video source attribution, using only I-frames for fingerprint extraction (I-FEs) is “enough” to achieve high performance. However, in this research, we have shown that I-FEs perform poorly compared to ALL-FEs in all cases. On the other hand, using ALL-FEs is impractical due to the large computation time needed in practical scenarios where thousands of videos can be available. The proposed SDA approach resolves both the accuracy problem of I-FEs and the speed problem of ALL-FEs: SDA- and ALL-FEs perform similarly in most cases, and when the SDA method performs worse, this can be overcome by using more of the available frames, if any.

The proposed technique can also be used for other source-attribution problems where many denoising operations are needed. For instance, the method can be applied when many “partially misaligned” still images and a suspect camera are available. For example, a seam-carved video contains many frames partially misaligned with its source camera. In such a scenario, instead of denoising all frames of the video, the SDA technique can be used to speed up the process. Moreover, determining whether a video is stabilized or not is another problem that requires a number of denoising operations. As an alternative to using only I-frames, the proposed SDA technique could work with only 2 denoising operations. Another avenue for future research is to create an SDA-FE in a weighted manner such that the performance achieved with the SDA method can be increased.
Two of the potential ways to achieve this are weighting I-, P-, and B-frames differently, and weighting the frames in a block-by-block manner. For example, it has been shown that flatfield images perform better with the SDA method than textured ones. Using this idea, one may weight textured regions differently from smooth regions.

REFERENCES

[1] J. Lukas, J. Fridrich, and M. Goljan, “Digital camera identification from sensor pattern noise,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 205–214, 2006.
[2] M. Goljan, J. Fridrich, and T. Filler, “Managing a large database of camera fingerprints,” in Media Forensics and Security II, vol. 7541. International Society for Optics and Photonics, 2010, p. 754108.
[3] S. Bayram, H. T. Sencar, and N. Memon, “Efficient sensor fingerprint matching through fingerprint binarization,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 4, pp. 1404–1413, 2012.
[4] D. Valsesia, G. Coluccia, T. Bianchi, and E. Magli, “Compressed fingerprint matching and camera identification via random projections,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 7, pp. 1472–1485, July 2015.
[5] S. Bayram, H. T. Sencar, and N. Memon, “Sensor fingerprint identification through composite fingerprints and group testing,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 3, pp. 597–612, March 2015.
[6] S. Taspinar, H. T. Sencar, S. Bayram, and N. Memon, “Fast camera fingerprint matching in very large databases,” in Image Processing (ICIP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4088–4092.
[7] W.-H. Chuang, H. Su, and M. Wu, “Exploring compression effects for improved source camera identification using strongly compressed video,” in Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE, 2011, pp.
1953–1956.
[8] S. Taspinar, M. Mohanty, and N. Memon, “Source camera attribution using stabilized video,” in Information Forensics and Security (WIFS), 2016 IEEE International Workshop on. IEEE, 2016, pp. 1–6.
[9] S. Chen, A. Pande, K. Zeng, and P. Mohapatra, “Video source identification in lossy wireless networks,” in IEEE INFOCOM, 2013, pp. 215–219.
[10] M. Iuliani, M. Fontani, D. Shullani, and A. Piva, “A hybrid approach to video source identification,” arXiv preprint, 2017.
[11] S. Mandelli, P. Bestagini, L. Verdoliva, and S. Tubaro, “Facing device attribution problem for stabilized video sequences,” IEEE Transactions on Information Forensics and Security, 2019.
[12] J. Lubin, M. Isnardi, C. Spence, I. Sur, and A. Chaudhry, “Joint sensor fingerprinting and processing history recovery for visual media forensics,” private conversation, 2018.
[13] D. Shullani, M. Fontani, M. Iuliani, O. Al Shaya, and A. Piva, “Vision: a video and image dataset for source identification,” EURASIP Journal on Information Security, vol. 2017, no. 1, p. 15, 2017.
[14] S. Taspinar, M. Mohanty, and N. Memon, “Source camera attribution of multi-format devices.”
[15] J. Lukáš, J. Fridrich, and M. Goljan, “Digital camera identification from sensor pattern noise,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 205–214, 2006.
[16] Y. Sutcu, S. Bayram, H. T. Sencar, and N. Memon, “Improvements on sensor noise based source camera identification,” in IEEE International Conference on Multimedia and Expo, 2007, pp. 24–27.
[17] C. T. Li and Y. Li, “Color-decoupled photo response non-uniformity for digital image forensics,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 2, pp. 260–271, 2012.
[18] G. Chierchia, S. Parrilli, G. Poggi, C. Sansone, and L.
Verdoliva, “On the influence of denoising in PRNU based forgery detection,” in ACM Multimedia in Forensics, Security and Intelligence, 2010, pp. 117–122.
[19] C. T. Li, “Source camera identification using enhanced sensor pattern noise,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 2, pp. 280–287, 2010.
[20] W. Yaqub, M. Mohanty, and N. Memon, “Towards camera identification from cropped query images,” in 25th ICIP. IEEE, 2018, pp. 3798–3802.
[21] S. Bayram, H. T. Sencar, and N. Memon, “Seam-carving based anonymization against image & video source attribution,” in IEEE Workshop on Multimedia Signal Processing, 2013, pp. 272–277.
[22] S. Taspinar, M. Mohanty, and N. Memon, “PRNU based source attribution with a collection of seam-carved images,” in Image Processing (ICIP), 2016 IEEE International Conference on. IEEE, 2016, pp. 156–160.
[23] M. Goljan and J. Fridrich, “Camera identification from scaled and cropped images,” Proc. SPIE, Electronic Imaging, Forensics, Security, Steganography, and Watermarking of Multimedia Contents X, vol. 6819, p. 68190E, 2008.
[24] E. J. Alles, Z. J. Geradts, and C. J. Veenman, “Source camera identification for low resolution heavily compressed images,” in Computational Sciences and Its Applications, 2008. ICCSA'08. International Conference on. IEEE, 2008, pp. 557–567.
[25] K. Rosenfeld and H. T. Sencar, “A study of the robustness of PRNU-based camera identification,” in Media Forensics and Security, ser. SPIE Proceedings, E. J. Delp, J. Dittmann, N. D. Memon, and P. W. Wong, Eds., vol. 7254. SPIE, 2009, p. 72540.
[26] M. Goljan, J. Fridrich, and J. Lukáš, “Camera identification from printed images,” Proceedings of SPIE, vol. 6819, p. 68190I, 2008. [Online]. Available: http://www.ws.binghamton.edu/fridrich/Research/Printed.pdf
[27] S. Milani, M. Fontani, and P. B. et
al., “An overview on video forensics,” Signal Processing Systems, vol. 1, pp. 1–18, June 2012.
[28] M. Chen, J. Fridrich, M. Goljan, and J. Lukas, “Source digital camcorder identification using sensor photo response non-uniformity,” in SPIE Electronic Imaging, 2007, pp. 1G–1H.
[29] S. McCloskey, “Confidence weighting for sensor fingerprinting,” in IEEE CVPR Workshops, 2008, pp. 1–6.
[30] N. Ejaz, W. Kim, S. I. Kwon, and S. W. Baik, “Video stabilization by detecting intentional and unintentional camera motions,” in IEEE International Conference on Intelligent Systems, Modelling and Simulation, 2012, pp. 312–316.
[31] Y. Matsushita, E. Ofek, W. Ge, X. Tang, and H.-Y. Shum, “Full-frame video stabilization with motion inpainting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 1150–1163, July 2006.
[32] M. K. Mihcak, I. Kozintsev, K. Ramchandran, and P. Moulin, “Low-complexity image denoising based on statistical modeling of wavelet coefficients,” IEEE Signal Processing Letters, vol. 6, no. 12, pp. 300–303, 1999.
[33] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “BM3D image denoising with shape-adaptive principal component analysis,” in SPARS'09 – Signal Processing with Adaptive Sparse Structured Representations, 2009.
[34] J. Fridrich, “Sensor defects in digital image forensic,” Digital Image Forensics, pp. 1–43, 2013.
[35] M. Goljan, J. Fridrich, and T. Filler, “Large scale test of sensor fingerprint camera identification,” in Media Forensics and Security, vol. 7254. International Society for Optics and Photonics, 2009, p. 72540I.
