A Robust Image Watermarking System Based on Deep Neural Networks
Authors: Xin Zhong (University of Nebraska at Omaha, xzhong@unomaha.edu), Frank Y. Shih (New Jersey Institute of Technology, shih@njit.edu)
Abstract — Digital image watermarking is the process of embedding and extracting a watermark covertly on a carrier image. Incorporating deep learning networks with image watermarking has attracted increasing attention in recent years. However, existing deep learning-based watermarking systems cannot achieve robustness, blindness, and automated embedding and extraction simultaneously. In this paper, a fully automated image watermarking system based on deep neural networks is proposed to generalize the image watermarking process. An unsupervised deep learning structure and a novel loss computation are proposed to achieve high capacity and high robustness without any prior knowledge of possible attacks. Furthermore, a challenging application, watermark extraction from camera-captured images, is presented to validate the practicality as well as the robustness of the proposed system. Experimental results show the superior performance of the proposed system compared against several currently available techniques.

Index Terms — Image watermarking, robustness, deep learning, convolutional neural networks, phone camera scan

I. INTRODUCTION

Digital image watermarking refers to the process of embedding and extracting information covertly on a carrier image. The data (i.e., the watermark) is hidden in a cover-image to create a marked-image that will be distributed over the Internet; only the authorized recipients can extract the watermark information correctly. According to the user's demands, the watermark can take different forms, for instance, random bits or electronic signatures for image protection and authentication, or hidden messages for covert communication [1]. The watermark can be encoded for different purposes, such as increasing the perceivable randomness for additional security via encryption methods, or restoring the impact of noise via error correction codes for watermark integrity under attacks [2, 3].
While the primary concern of a steganographic system is imperceptibility to human vision as well as undetectability to computer analysis, an image watermarking system treats robustness as its priority: the watermark should survive even if the marked-image is degraded or distorted [4]. Ideally, a robust image watermarking system keeps the watermark intact under a designated class of distortions without the assistance of other techniques. In practice, however, robust image watermarking systems often extract the watermark only approximately under malicious attacks and apply various encoding methods for restoration [5, 6]. Traditional image watermarking schemes manually design algorithms for watermark embedding and extraction. For example, least significant bit (LSB) based strategies place the watermark on a cover-image through bit substitutions or other mathematical operations [5, 7]. Although the trivial replacement enables invisibility, LSB-based methods are less robust and can easily be revealed by statistical analysis. More advanced watermarking schemes place the watermark in various image domains. For example, Cox et al. [8] embedded the watermark in the frequency spectrum for high fidelity and high security. Shih and Zhong [9] increased the frequency-domain capacity while preserving the fidelity. Pevny et al. [10] enhanced the security with an embedding scheme that maintains the cover-image statistics. Zong et al. [11] improved the robustness by embedding the watermark into the image histogram. Incorporating deep neural networks with image watermarking has attracted increasing attention in recent years. In contrast to significant achievements in steganalysis for revealing hidden data [12, 13], very few attempts at applying deep learning to watermark embedding and extraction have been reported.
Earlier methods [14-16] used neural networks to assign the significance of the bits of each pixel instead of determining it manually. Tang et al. [17] proposed a generative adversarial network to determine the embedding position and strength on the cover-image. Kandi et al. [18] used two deep autoencoders for non-blind binary watermark extraction from the marked-image, where the pixels produced by the first autoencoder represent bit zero and the pixels produced by the second autoencoder represent bit one. Baluja et al. [19] applied deep autoencoders to blind image watermarking to achieve high fidelity as well as high capacity. Li et al. [20] embedded the watermark in the discrete cosine domain and used convolutional neural networks for extraction. However, due to the fragility of deep neural networks [21], robustness becomes a challenge, since feeding a modified image to a pre-trained deep learning system can cause failure. Mun et al. [22] proposed adversarial networks to address this issue by including attack simulation in the training. Developing robust image watermarking systems for watermark extraction from camera resamples requires that the watermark simultaneously resist multiple distortions, such as geometric distortions, optical tilt, quality degradation, compression, lens distortions, and lighting variations [23, 24]. Researchers have developed various methods to solve these problems. Katayama et al. [25] proposed a sinusoidal watermark pattern for robust watermark embedding and a visible frame for marked-image rectification.
Other methods based on the autofocus function of a phone camera have been developed, such as embedding the watermark through a correlation function, placing the watermark in selected positions via spread spectrum, and applying the log-polar transformation [26-28]. Pramila et al. [24] proposed watermark extraction from a camera resample of an image printed on blank paper by combining computational photography and robust image watermarking, but the non-blind property of the system restricts its application range. In this paper, we develop an automated image watermarking system using deep learning networks based on three main motivations. First, exploring the fitting ability of deep learning models in learning the rules of watermark embedding is helpful in developing an automated system. Second, the proposed system is tested on the application of watermark extraction from camera resamples, providing a potential solution to this challenging issue. Third, image watermarking is viewed from a novel perspective: an image fusion task [29, 30] between the cover-image and the latent spaces of the watermark, where the fused result (i.e., the marked-image) contains the watermark while referencing the visual appearance of the cover-image. The remainder of this paper is organized as follows. The proposed system is presented in Section 2. Experiments and analyses are described in Section 3. The application of watermark extraction using a phone camera to scan a screen is given in Section 4. Finally, conclusions are drawn in Section 5.

II. THE PROPOSED SYSTEM

A. Preliminaries

Fig. 1 shows a general image watermarking system. The watermark w is inserted into the cover-image c to generate a marked-image m that will be transported through a communication channel. The receiver extracts the watermark data w* from the received marked-image m*, which may be a modified version of m if some distortions or attacks occur during transmission.
A robust image watermarking system intends to secure the integrity of the watermark, i.e., to minimize the difference between w and w*. Conventional strategies formulate an image watermarking task as preserving certain parts of the cover-image for the watermark. As given in Eq. (1), w is embedded by taking some proportions in a domain of c:

D(m) = α D(c) + β w,   (1)

where α and β are the weights which control the watermark strength and D(c) denotes an image domain of the cover-image. Different optimization schemes can be applied to control the embedding and enable the extraction of w* from m* according to the user's purposes. Some keys, as in cryptographic systems, can also be used in generating, embedding, or extracting the watermark for various applications and extra protection [5]. In contrast, we view image watermarking as an image fusion task. Given the two input spaces of the watermark and the cover-image, W and C, the input watermark space W is first mapped to one of its latent spaces (a feature space F = e1(W)) by a function e1, and then the watermark embedding is performed by a mapping function e2 that fuses the feature space F of the watermark and the input cover-image space C to produce an intermediate latent space M. M is the space of the marked-image, subject to two main constraints: the visual appearance of M must be similar to C, while the features of M must correlate with the features of F. Therefore, M has the desired attributes of marked-images. On the other hand, watermark extraction is performed by two mapping functions: d1, which reconstructs the feature space F from M, and d2, which reconstructs the watermark data from F.

B. Overall Architecture

We apply deep neural networks E1, E2, D1, and D2 with parameters θ_E1, θ_E2, θ_D1, and θ_D2 to learn the mapping functions e1, e2, d1, and d2. The architecture of the proposed image watermarking system is shown in Fig. 2, where w, c, m, and w* are examples of the spaces W, C, and M.

Fig. 1. A general image watermarking system.
Fig. 2. The architecture of the proposed system.
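For contrast with the learned fusion view, the conventional formulation of Eq. (1) can be sketched in a few lines. This is a minimal illustration that takes the identity as the image domain D and uses illustrative weight values; note that the extraction is non-blind, since it needs the original cover-image:

```python
def embed_additive(cover, watermark, alpha=1.0, beta=0.05):
    # Eq. (1) with D(c) = c: every marked pixel is alpha*c + beta*w.
    return [[alpha * c + beta * w for c, w in zip(cr, wr)]
            for cr, wr in zip(cover, watermark)]

def extract_additive(marked, cover, alpha=1.0, beta=0.05):
    # Non-blind inversion of Eq. (1); the original cover-image is required.
    return [[(m - alpha * c) / beta for m, c in zip(mr, cr)]
            for mr, cr in zip(marked, cover)]
```

The proposed extractor network, by contrast, reads only the (possibly distorted) marked-image, which is exactly the blindness property discussed in the text.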
E1 and E2 are named the embedder network, and D1 and D2 the extractor network. Taking two inputs, the embedder network transforms the input spaces W and C to the intermediate space M. Instead of assigning some unnoticeable portions of the visual components to the watermark, E2 learns to give F the visual appearance of C while maintaining the characteristics of F. Hence, the space M after the fusion contains the information from both W and C. Conversely, the extractor network takes in a transformation of M and learns to separate and reconstruct F and W. The overall structure of the proposed system is compatible with unsupervised deep autoencoders [31], in which an input space can be transformed to a latent space containing the most representative features, and the original input can be recovered from the latent space. Similarly, the proposed system transforms two input spaces to a desired latent space and reconstructs one of the inputs from that latent space. The recovery ability of autoencoders, which ensures an exact reconstruction of the input with appropriate features extracted by the deep neural networks, secures the feasibility of the proposed structure. The blindness property is enabled since the reconstruction takes only the latent space as input, and the fidelity is enabled by the constraints placed on the learned latent space. A latent space in autoencoders is often learned through a bottleneck for dimensionality compression, whereas the proposed system learns over-complete representations for both accurate watermark reconstruction and robustness. The entire system is trained as a single deep neural network. In this presentation, the samples of the space C are 128 × 128 × 3 color images. The watermark is assumed to be binary data, raw or encoded, of 1,024-bit information (reshaped to 32 × 32). Hence, the presented system has a fixed capacity of 1,024 bits. C.
Invariance Layer

To tolerate distortions on the marked-images without considering all possible attacks, an invariance layer is developed to reject irrelevant information. The invariance layer introduces a function that maps the space M to an over-complete transformation space T. The neurons in this layer are activated in a sparse manner, not only to allow a possible loss in T for robustness but also to enhance computational efficiency. As shown in Fig. 3, it converts a 3-channel instance of M into an N-channel (N ≥ 3) instance of T by a fully-connected layer, where N is the redundant parameter. Increasing N means higher redundancy in T, which implies a higher tolerance of errors in T and thus enhances the robustness. Referring to the contractive autoencoder [32], the invariance layer employs a regularization term to achieve the sparse activation, obtained as the Frobenius norm of the Jacobian matrix of the layer outputs with respect to the training inputs. Mathematically, the regularization term P is given as

P = ‖J(X)‖_F² = Σ_ij (∂h_j(X) / ∂X_i)²,   (2)

where X_i denotes the i-th input and h_j denotes the output of the j-th hidden unit. Similar to the common gradient computation in neural networks, the entries of the Jacobian matrix can be written as

∂h_j(X) / ∂X_i = φ′(Σ_i W_ij X_i) W_ij,   (3)

where φ is an activation function and W_ij is the weight between X_i and h_j. The hyperbolic tangent (tanh) is applied as the activation function of the invariance layer for strong gradients as well as bias avoidance [33]. With φ assigned as the hyperbolic tangent, P can be computed as

P = Σ_j (1 − h_j²)² Σ_i W_ij².   (4)

Minimizing the term P alone essentially renders the weights in the layer unchangeable for all the inputs X. However, placing it as a regularization term in the total loss computation enables the layer to preserve only useful information while rejecting other noises and irrelevant information, achieving robustness. Different from the contractive autoencoder, each channel in M is treated as a single input in the invariance layer to improve the computational efficiency.
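The penalty of Eq. (4) can be checked numerically for a tiny tanh layer; the input and weight values below are made up for illustration:

```python
import math

def contractive_penalty(x, W):
    """Regularization P of Eqs. (2)-(4) for a tanh fully-connected layer.

    x: list of inputs X_i; W[i][j] is the weight between X_i and h_j.
    With tanh units, dh_j/dX_i = (1 - h_j**2) * W[i][j], so
    P = sum_j (1 - h_j**2)**2 * sum_i W[i][j]**2.
    """
    n_in, n_hid = len(W), len(W[0])
    h = [math.tanh(sum(x[i] * W[i][j] for i in range(n_in)))
         for j in range(n_hid)]
    return sum((1 - h[j] ** 2) ** 2 * sum(W[i][j] ** 2 for i in range(n_in))
               for j in range(n_hid))
```

The closed form matches the brute-force squared Frobenius norm of the Jacobian, which is an easy sanity check when implementing the layer.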
For example, treating one pixel of M as an input means 49,152 inputs for a 128 × 128 × 3 marked-image. Setting the redundant parameter N to its smallest value, 3, would imply 147,456 units in the fully-connected invariance layer, which requires at least 7,247,757,312 parameters. This is impractical on most current graphics processing units and significantly lowers the efficiency. In contrast, treating one channel as an input unit considers only 3 input units for an RGB marked-image, which enables faster computation as well as a much larger N for higher robustness.

D. Embedder and Extractor Network Structure

Taking the samples w from the space W, E1 with the parameter θ_E1 learns a mapping from W to its feature space F, and D2 learns the reverse mapping from F to W with samples w*. As shown in Fig. 4, the structures of E1 and D2 are symmetric. In E1, the 32 × 32 × 1 binary watermark samples are successively increased to 32 × 32 × 24 and 32 × 32 × 48 by each of two convolution blocks. The result, reshaped to 128 × 128 × 3, is the feature-space sample f. Reversely, D2 reshapes the 128 × 128 × 3 sample back to 32 × 32 × 48 and successively decreases it to a 32 × 32 × 1 binary watermark. The space is thus increased 48-fold and then restored. The purpose of the increase is two-fold. First, it produces an f that has the same size as the cover-image sample c, to facilitate a concatenation step in E2. Second, the increase in the latent space introduces redundancy, decomposition, and perceivable randomness to f, which not only helps robustness but also provides additional security. A few 32 × 32 binary watermark samples and their corresponding 128 × 128 × 3 samples from F are shown in Fig. 5. To partition the patterns of the binary watermark into different channels, the inception residual block [34] is adopted as the convolution block in the proposed system.

Fig. 3. The invariance layer.
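The parameter-count argument above can be reproduced directly, assuming a dense layer from k inputs to u units needs at least k × u weights:

```python
# Pixel-as-input: every value of a 128 x 128 x 3 marked-image is an input.
pixel_inputs = 128 * 128 * 3               # 49,152 input units
units_at_n3 = pixel_inputs * 3             # 147,456 units for N = 3
dense_weights = pixel_inputs * units_at_n3  # weights of the dense layer

# Channel-as-input: one unit per RGB channel, so only 3 inputs,
# which keeps the layer small even for a much larger N.
channel_inputs = 3
```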
It consists of a 1 × 1, a 3 × 3, and a 5 × 5 convolution, plus a residual connection that sums up the features and the input itself, so that various perception fields are included in the feature extraction. In the proposed structure, each convolution has 32 filters, and the 5 × 5 convolution is replaced by two 3 × 3 convolutions for efficiency. These 32-channel features are concatenated along the channel dimension to form a 96-channel feature, and a 1 × 1 convolution converts the 96-channel feature back to the original number of input channels for the summation in the residual connection. Fig. 6 presents a convolution block f, where the three annotated dimensions denote the height, width, and channels of the block input, respectively. Taking the samples f from the space F along with the samples c from the space C, E2 with the parameter θ_E2 learns to fuse these two spaces to obtain the marked-image space M. Reversely, D1 learns to detect and extract f from the transformation space of M. As shown in Fig. 7, the convolution block f is first used to extract features that are concatenated along the channel dimension with the cover-image sample c. Another convolution block takes the 128 × 128 × 6 concatenation and fuses it to generate the space M. To achieve the fidelity, M contains the features of F while referencing the visual contents of C. On the other hand, D1 takes in the 128 × 128 × N transformation sample produced by the invariance layer and maps it back to F by two convolution blocks. Instead of using the space W directly, the proposed structure fuses the feature space F, obtained through the convolution block f, into the space M.

Fig. 4. E1 and D2.
Fig. 5. Samples of the spaces W and F. First row: samples from W; second row: their corresponding samples from F.
Fig. 6. A convolution block f.
Fig. 7. E2 and D1.
Fig. 8. Samples of the spaces C, M, and W. First row: samples from C; second row: samples from M; third row: the corresponding original and extracted samples from W.
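The channel arithmetic of the convolution block can be sketched in a few lines (a pure shape computation; "same" padding that preserves spatial size is assumed):

```python
def inception_residual_channels(c_in, filters=32):
    """Channel flow of the inception residual block described above:
    three parallel paths (1x1, 3x3, and 5x5 realized as two 3x3), each
    producing `filters` channels, are concatenated, then a 1x1 convolution
    maps back to c_in so the residual sum with the input is shape-compatible.
    Returns (concatenated channels, output channels)."""
    concat_channels = 3 * filters   # e.g., 96 channels for 32 filters
    out_channels = c_in             # 1x1 conv restores the input depth
    return concat_channels, out_channels
```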
This fusion mainly controls the appearance of the space M. Visually, the intermediate latent space M should primarily rely on the components of C, so the input sample c is directly exploited in the structure. In contrast, the information of W should not be displayed on M, and hence the features of M are designed to be correlated with the features of F. This indirect fusion enables the fidelity of the proposed system. In summary, the space M borrows the visual contents from C and preserves the features from F. Various samples of C, M, and W are shown in Fig. 8. Human vision can hardly tell the differences between marked- and cover-images in the spatial domain, while the convolution blocks in D1 are able to find and extract f.

E. System Objective

The proposed system intends to learn the mapping functions e1, e2, d1, and d2, using the neural networks E1, E2, D1, and D2 parametrized by θ_E1, θ_E2, θ_D1, and θ_D2, given the data samples w and c. The proposed system is trained as a single deep neural network with a few constraints. Like autoencoders, the system maps the space W to itself. Hence, the ground truth of w* is w itself, and the distance between the input and the system output must be minimized. What is dissimilar to autoencoders is that the intermediate latent space M in the proposed system is an image that looks similar to the input space C but contains features extracted from W. For this purpose, the system minimizes the distance between the generated samples m of the intermediate latent space and the samples c of the input space, while maximizing the correlation between the samples from the feature space of W and the samples from the feature space of M. Denoting the parameters to be learned as θ, the empirical risk L_E of the proposed system can be expressed as

L_E(θ) = (1/B) Σ_{b=1}^{B} [ |c_b − m_b| + |w_b − w_b*| + g(F(w_b), F(m_b)) ],   (5)

where B is the number of training examples and g is a function computing the correlation as given below:

g(F_1, F_2) = |G(F_1) − G(F_2)|,   (6)

where G denotes the Gram matrix of all possible inner products.
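A minimal sketch of the Gram-matrix correlation term of Eq. (6), with feature maps represented as lists of flattened channels (the mean absolute distance mirrors the text's choice of error metric):

```python
def gram(features):
    """Gram matrix of all pairwise inner products between channels;
    `features` is a list of channels, each flattened to a flat list."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]

def correlation_distance(f1, f2):
    """Mean absolute distance between two Gram matrices, as in Eq. (6);
    minimizing it drives the two features toward the same correlations."""
    g1, g2 = gram(f1), gram(f2)
    n = len(g1) * len(g1[0])
    return sum(abs(a - b) for r1, r2 in zip(g1, g2)
               for a, b in zip(r1, r2)) / n
```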
Besides E1, the convolutional block f in E2 also extracts features from w, and the correlation between these features is maximized by minimizing the distance between their Gram matrices. To emphasize the overall performance rather than a few outliers, the mean absolute error is selected to compute the distance. Along with the regularization term P computed by Eq. (4), the structural risk of the proposed model can be represented as L_S = L_E + λP, where λ is the weight controlling the strength of the regularization term. The objective of the system is to learn the parameter θ* that minimizes the structural risk:

θ* = argmin_θ [ L_E(θ) + λP ].   (7)

In the gradient flow during backpropagation, the watermark reconstruction term is applied by all the components of the proposed structure in their weight updates, while only the embedder network (E1 and E2) applies the cover/marked distance term and the correlation term to its weight updates.

III. EXPERIMENTS AND ANALYSES

A. Training and Testing

Providing a fixed watermarking capacity of 1,024 bits, the proposed system is trained using ImageNet [35] (rescaled to 128 × 128) as the cover-images and the binary version of CIFAR [36] (32 × 32) as the watermarks. Together, the datasets include over a million images, introducing a large scope of instances to the system. The ADAM [37] optimizer, which applies a moving window in the gradient computation, is adopted for its ability to continue learning after large numbers of epochs. Fig. 9 shows the values of the terms in the empirical risk L_E and of the structural risk L_S during 200 epochs. In training and testing, both T1 and T2 in L_E converge smoothly below 1.5%, and L_S converges below 3%. Term T1 has slightly more error because there are some modifications on the marked-image to indicate the watermark features. λ is set to 0.01 in this case, and all the layers in the system apply the rectified linear unit (ReLU) as the activation function, except that E2 and D2 use the sigmoid to limit the output range to (0, 1) and the invariance layer uses the hyperbolic tangent.
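With made-up loss values, the combination of the empirical risk and the invariance-layer penalty in Eq. (7) looks as follows (lambda = 0.01 as in the text; all numbers are purely illustrative):

```python
def structural_risk(term_cm, term_ww, term_gram, penalty_p, lam=0.01):
    """L_S = L_E + lambda * P, where L_E sums the cover/marked distance,
    the watermark reconstruction distance, and the Gram-correlation term."""
    empirical = term_cm + term_ww + term_gram
    return empirical + lam * penalty_p
```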
Testing is performed on 10,000 image samples from the Microsoft COCO dataset [38] as the cover-images and 10,000 images from the testing division of the binary CIFAR as the watermarks. To demonstrate that the proposed system generalizes the watermarking rules without over-fitting to the training samples, neither the testing cover-images nor the testing watermarks are used in training. The peak signal-to-noise ratio (PSNR) and the bit-error-rate (BER) are used to quantitatively evaluate the fidelity of the marked-image and the quality of the watermark extraction, respectively. The PSNR is defined as

PSNR = 10 log₁₀(255² / MSE),   (8)

where MSE is the mean squared error. The BER is computed as the percentage of error bits in the binarized watermark extraction. In the testing, the BER is zero, indicating that the original and the extracted watermarks are identical. The testing PSNR is 39.72 dB, indicating a high fidelity of the marked-images, so that the hidden information cannot be noticed by human vision. A few testing examples with various image contents and colors are presented in Fig. 10. The residual error, showing the absolute difference in each RGB channel between the marked- and cover-images, is also displayed; it demonstrates that the watermark is dispersed over the marked-image. This provides extra security: even if the cover-image is leaked, its subtraction from the marked-image would not reveal the watermark. After the pixel values are rescaled between 0 and 255, the mean of the absolute difference for each RGB channel is computed. The averages over the testing set are 2.57, 2.10, and 1.63, respectively. The average maxima of the RGB absolute differences are 14.11, 24.79, and 17.08, respectively.

Fig. 9. The empirical risk and the structural risk during 200 epochs.
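Both evaluation metrics are easy to state in code; this is a minimal sketch for 8-bit images, not the exact evaluation script used in the experiments:

```python
import math

def psnr_from_mse(mse, peak=255.0):
    """Eq. (8): peak signal-to-noise ratio in dB for 8-bit images."""
    return 10.0 * math.log10(peak ** 2 / mse)

def bit_error_rate(original, extracted):
    """Fraction of mismatched bits between two equal-length bit lists,
    i.e., the BER of the binarized watermark extraction."""
    errors = sum(1 for a, b in zip(original, extracted) if a != b)
    return errors / len(original)
```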
These numbers indicate that there are only slightly spiky modifications to enable the extraction; on average, the watermark insertion does not alter the channels significantly.

B. Synthetic Images

To further validate that the watermark embedding and extraction rules are learned without over-fitting, the proposed system is exposed to some extreme cases with synthetic images. In particular, synthetic situations that are not included in the training process are analyzed, and results involving blank and randomly generated images and watermarks are presented. Fig. 11 shows the results of embedding watermarks into synthetic blank cover-images of red, green, and blue, where the residual errors are magnified tenfold. Although blank cover-images are not included in the training, the proposed system provides promising results. The residual errors display more green, and the blank green marked-image displays relatively more noticeable noise than those in the other colors, implying that the proposed system modifies the green channel slightly more. Applying blank cover-images is known to be extremely difficult in conventional watermarking methods due to the lack of psycho-visual information. However, instead of assigning some unnoticeable portions of the visual components to the watermark, the proposed deep learning model learns to exploit the correlation between the features of the space F and the features of the fused space M. Fig. 12 presents the result of embedding a randomly generated binary image into a natural cover-image, as well as the result of embedding a testing binary watermark into a random color-spotted cover-image. For random watermarks, 10,000 randomly generated bit sequences are tested on 10,000 cover-images from the testing dataset, and the average BER is 0.36%, which indicates that applying a random binary stream as the watermark does not cause problems for the proposed system.

Fig. 10. A few testing examples.
First column: the watermark; second column: the cover-image; third column: the marked-image; fourth, fifth, and sixth columns: the absolute differences of the R, G, and B channels between the marked- and cover-images.

Fig. 11. Embedding watermarks into blank cover-images. First column: the watermark; second column: the blank cover-image; third column: the extracted watermark; fourth column: the marked-image; fifth and sixth columns: the residual errors.

Fig. 12. With noise images. First column: the watermark; second column: the cover-image; third column: the extracted watermark; fourth column: the marked-image; fifth and sixth columns: the residual errors.

Fig. 13. Visual comparison. First row: marked-images; second row: distorted marked-images, where the distortions from left to right are histogram equalization, Gaussian blur, random noise, salt-and-pepper noise, and cropping; third row: original watermarks; fourth row: watermark extractions from the distorted marked-images.

When it comes to embedding watermarks into random cover-images, a test of embedding 10,000 watermarks from the testing dataset into 10,000 randomly generated cover-images yields a higher average BER of 11.98%. Although the general shape is still recognizable, there are obvious distortions in the watermark extraction. However, in practical applications, embedding a watermark into random noise means that the appearance of the marked media is noisy and meaningless, so encryption methods mapping a watermark into random patterns could be used instead.

C. Robustness

The robustness of the proposed system against different distortions applied to the marked-image is evaluated by analyzing the distortion tolerance range. Fig. 13 shows some visual comparisons between the marked-images and their distorted versions, as well as between the original watermarks and the watermark extractions from the distorted marked-images.
Quantitatively, distortions with swept-over parameters that control the attack strength are applied to the marked-images produced from the testing dataset. The watermark extraction BER caused by each distortion under each parameter is averaged over the testing dataset. Some distortions with swept-over parameters versus the average BER are plotted in Fig. 14. Since the proposed system is designed against image-processing attacks, and the input to the system is assumed to be pre-processed to rectify geometric distortions such as rotation, scaling, and translation, the responses of the proposed system against some challenging and common image-processing attacks are discussed here. The extracted watermarks have, respectively, 10.6%, 7.8%, 32.2%, 11.6%, 46.2%, and 12.3% average BER when the distortions are a Gaussian blur with mean 0 and variance 85%, a cropping discarding 65% of the marked-image, a Gaussian additive noise with mean 0 and variance 20%, a JPEG compression with quality factor 10, a 20% random noise, and a 90% salt-and-pepper noise. The proposed system shows a high tolerance range on these challenges, especially for cropping, salt-and-pepper noise, and JPEG compression. The attacks that randomly fluctuate the pixel values throughout the image channels, including Gaussian additive noise and random modificative noise, show higher BER. However, a 20% Gaussian additive noise or a 20% random modificative noise destroys most of the content of the marked-image, as shown in Fig. 15, and the proposed system delivers acceptable performance given a moderate distortion parameter, such as 16% BER under 10% Gaussian noise.

D. Comparison

The proposed system is analytically compared against several state-of-the-art image watermarking methods that incorporate deep neural networks, as shown in Table I. Kandi et al. [18] proposed to use convolutional neural networks for image watermarking.
It applies two deep autoencoders to rearrange a cover-image into a marked-image. To indicate a watermark in the marked-image, the pixels produced by the first autoencoder represent bit zero and the pixels produced by the second represent bit one. However, the method is a non-blind scheme, although it achieves robustness. Embedding by incrementally changing an image block to represent a watermark bit, the system in [22] is trained to extract the watermark bits from their corresponding blocks with attack simulation, and achieves both blindness and robustness. However, it requires the distortions to be included in the training phase for robustness. In reality, we have no way to predict and enumerate all kinds of attacks. To overcome this, our proposed system not only applies deep neural networks to learn the rules of both embedding and extraction, but also achieves blindness and robustness simultaneously without requiring prior knowledge of the attacks, and hence has a wider range of applications.

Fig. 14. Distortions with swept-over parameters versus average BER.
Fig. 15. Sample distortions. Left: the marked-image; middle: after 20% Gaussian additive noise; right: after 20% random modificative noise.

TABLE I
COMPARISON BETWEEN THE PROPOSED SYSTEM AND STATE-OF-THE-ART IMAGE WATERMARKING METHODS APPLYING DEEP NEURAL NETWORKS

Method | Function of the deep neural network | Blind | Robust | Concentration
[17]   | Embedding                           | no    | no     | Undetectability
[18]   | Embedding and extraction            | no    | yes    | Robustness
[19]   | Embedding and extraction            | yes   | no     | Capacity
[20]   | Extraction                          | yes   | no     | Undetectability
[22]   | Extraction                          | yes   | yes    | Robustness
Ours   | Embedding and extraction            | yes   | yes    | Robustness

The proposed system is also quantitatively compared against several related competitors that are blind and robust image watermarking systems. The selection of the competitors considers their variety and concentrations. Mun et al.
[22] applied convolutional neural networks, while Zong et al. [11], Zareian and Tohidypour [39], and Ouyang et al. [40] used manually designed, traditional, robust methods on different image domains, including the histogram domain adopting statistical image features, the frequency domain, and the log-polar domain with summarized image features. All the selected competitors focus on robustness against image-processing attacks. The testing is performed on the same cover-image sets as well as the same watermarks reported in the references. As the proposed system focuses on common image-processing attacks, the crucial results in this category are presented in Table II, where "/" denotes not applicable, S&P denotes salt-and-pepper noise, and GF denotes Gaussian filtering. The proposed system shows advantages by covering more distortions among image-processing attacks and obtaining a lower BER under the same distortion parameters. For instance, traditional methods such as those manipulating the image histogram cannot tolerate the histogram equalization attack. In addition, the proposed method has a higher tolerance range; for example, [22] and [40] can only extract the watermark at a high JPEG quality of 80 to 90, while the proposed method covers quality factors as low as 10. Although the method in [39], which focuses on compression, performs better under JPEG, the proposed method outperforms the competitors on all other listed distortions. Remarkably, the competitors tolerate cropping of 20% to 30%, while the proposed system's BER is only 7.8% even when 66% of the marked-image is cropped. Finally, under a similar PSNR, the proposed method shows its advantages by simultaneously achieving the highest robustness and the highest capacity.

IV.
A N A PPLICATION : W ATERMARK EXTRACTION USING A PHONE CAMERA TO SCAN A SC REEN To the best of our knowledge, all the methods solving the problem of watermark extraction from camera resample focus on printed papers up to now [ 23 - 28 ]. Applying deep neural networks for watermark extract ion from camera resamples of a screen remai ns unexplored. Although the paper printings sometimes bring noises such as p rinting quality and bending, the watermark extraction from the resa mples of a screen presents a much more challenging tas k . Besides the noises brought by the camera including geometric distortion, optical tilt, quality degradation, compression, lens distortions, and lighting variation, it introduces much more possible noises from the screen, such as the Moire pattern (i.e., the RGB ripple), the refresh rate of the screen, and t he spatial resolution of a m onitor (see the examples of camera resam ples in Fig. 16). Developing a blind image watermarking system that is simultaneously robust to all of these disto rtions is extremely difficult . Since our proposed watermarking system is designed to reject all irrelevant noises instea d of focusing on certa in types of attacks, its application to deal with this problem seems feasible. The outlined process of this applicati on is shown in Fig. 16. First, an information provider prepares the data by encoding through some error correction coding (ECC) techniques. Then, the marked-image can be obtained by fusing the encoded watermark and the cover-image using the trained embedder network. The marked-image that looks identical to the cover- image is distributed online and display ed on the user ’ s screen . Finally, the user scans the marked-image to extract the hidden watermark by the trained extractor network in our proposed system. The distortions occurred in the application can be divided into two categories: projective and image-processing distortions. 
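The encode-embed-display-scan-extract flow above can be sketched end to end. The embedder and extractor below are toy least-significant-bit stand-ins, chosen only so the example runs; they are not the paper's trained deep networks and have none of their robustness, and every name here is illustrative.

```python
import numpy as np

# Toy stand-ins for the trained embedder/extractor networks, used purely
# to illustrate the pipeline shape (NOT the paper's deep-network method).
def embed(cover, wm_bits):
    """Hide wm_bits in the lowest bit of the first len(wm_bits) pixels."""
    flat = cover.ravel().copy()
    n = wm_bits.size
    flat[:n] = (flat[:n] // 2) * 2 + wm_bits   # overwrite the LSB
    return flat.reshape(cover.shape)

def extract(marked, n_bits):
    """Blindly read the hidden bits back out of the marked-image."""
    return marked.ravel()[:n_bits] % 2

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)   # cover-image
wm = rng.integers(0, 2, size=1024, dtype=np.uint8)            # 32x32 watermark
marked = embed(cover, wm)                                      # marked-image
assert np.array_equal(extract(marked, wm.size), wm)            # blind extraction
```

In the real system, the ECC-encoded watermark takes the place of `wm`, and the camera-and-screen distortions act on `marked` between embedding and extraction, which is exactly where the trained networks' robustness matters.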
The geometric and projective distortions are rectified by image registration techniques, and the major function of the proposed system in this application is to overcome the pixel-level modifications coming from image-processing distortions, such as compression, lighting variations, the Moiré pattern, and the interpolation errors introduced by the rectification. The autofocus function of the smartphone is utilized. To simulate a realistic situation, a prototype is developed for a user study, and a 32×16 block of information is used for its clear structure.

TABLE II
QUANTITATIVE COMPARISON BETWEEN THE PROPOSED SYSTEM AND SOME BLIND AND ROBUST COMPETITORS
(Columns HE through GF 10% report the BER (%) under each distortion.)

Method |  HE  | JPEG 10 | Cropping 20% | S&P 5% | GF 10% | PSNR (dB) | Capacity (bits)
[11]   |  /   | 17.50   | 7.06         | 3.51   | 6.33   | 46.63     | 25
[22]   |  /   | /       | 6.61         | 7.98   | 4.81   | 38.01     | 1/block
[39]   |  /   | 2.15    | /            | 4.94   | 0.21   | 41.00     | 256
[40]   |  /   | /       | 7.51         | 9.41   | 27.91  | 36.77     | 24
Ours   | 0.43 | 8.16    | 0            | 0.97   | 0      | 39.93     | 1,024

Fig. 16. Process of the application.

The user interface (UI) and the sample information are shown in Fig. 17. The Reed-Solomon (RS) code [41] is adopted as the ECC to protect the information under a certain BER. RS(32,16) is applied to protect each row of the 32×16 information, so the encoded information becomes a 32×32 watermark satisfying the fixed watermarking capacity of the proposed system. In the watermark, each row is a codeword consisting of 16 data bits and 16 parity bits, and hence can correct errors of length up to 8. Therefore, inside this watermark of length 1,024, up to 256 errors can be corrected, provided no row contains more than 8 errors. With half of the bits used as parity, the watermarking payload is 512 bits. As shown in the UI, the prototype only analyzes the region of interest (ROI) in the camera view, and hand-held pictures can hardly be parallel to the screen. Therefore, there exist some geometric, affine, and perspective distortions, which the proposed system does not concentrate on.
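The RS(32,16) capacity arithmetic described above can be checked directly. This is only a verification of the bit counts (codeword, parity, and correctable-error totals), not a Reed-Solomon implementation:

```python
# Check of the RS(32,16) layout described above (bit counts only;
# not an actual Reed-Solomon encoder).
n, k, rows = 32, 16, 32      # codeword length, data length, watermark rows
parity = n - k               # 16 parity bits per row
t = parity // 2              # a rate-1/2 code corrects up to 8 errors per row
capacity = rows * n          # 1,024 embedded bits: the 32x32 watermark
payload = rows * k           # 512 information bits, half devoted to parity
worst_case = rows * t        # up to 256 errors, if no row holds more than 8
assert (t, capacity, payload, worst_case) == (8, 1024, 512, 256)
```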
Therefore, the image registration techniques in [42, 43] are adopted to rectify these distortions before a picture is input to the proposed system for extraction. To simplify the prototype, as shown in Fig. 18, the four corners of the largest contour inside the ROI are used as the reference points. The contoured content is mapped onto the bird's-eye view plane, and the watermark is extracted from the rectification. Five volunteers were asked to take pictures of marked-images displayed at 425×425 pixels on a 2,560×1,440 screen using a mobile phone camera. Two rules were given to the users. First, the entire image should be placed as large as possible inside the ROI. As a prototype for demonstration, this rule simplifies our segmentation, since the largest contour inside the ROI is then the marked-image, so that this application can focus on testing the proposed system instead of complicated segmentation algorithms. In addition, placing the image large in the ROI helps capture the details and features needed for watermark extraction. Second, the camera should be kept as still as possible. Although the proposed system tolerates some blurring effects, it is not designed to extract watermarks under high-speed motion. Fig. 19 presents a few extractions and their corresponding ROIs, where the BERs from left to right are 3.71%, 4.98%, 1.07%, 4.30%, and 8.45%, respectively. It can be observed that the closer the picture is taken, the lower the error; likewise, the more parallel the camera is to the screen, the lower the error. The angle tolerance between the camera and the screen is around 30°. The flashlight introduces more errors, since it may over- or underexpose some image areas; it can be turned off in this application because the screen is backlit. There are 20 images in the user test, and the average BER is 5.13%.
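The four-corner rectification step can be sketched as a direct linear transform: given the four detected corners of the largest contour, solve for the 3×3 projective transform that maps them onto the bird's-eye plane. In practice a library routine such as OpenCV's getPerspectiveTransform plays this role; the corner coordinates below are made-up examples, and the numpy-only solver is a minimal sketch of the same computation.

```python
import numpy as np

def homography_from_corners(src, dst):
    """Direct linear transform: solve for the 3x3 projective matrix H
    (with H[2,2] fixed to 1) mapping four source corners to four
    destination corners -- 8 equations in 8 unknowns."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, pt):
    """Map one pixel coordinate through the homography."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return x / w, y / w

# Made-up corners of a tilted marked-image inside the ROI, mapped onto
# a 425x425 bird's-eye plane (the display size used in the user study).
corners = [(31, 18), (402, 44), (388, 401), (12, 377)]
square = [(0, 0), (424, 0), (424, 424), (0, 424)]
H = homography_from_corners(corners, square)
u, v = warp_point(H, corners[1])
assert abs(u - 424) < 1e-6 and abs(v) < 1e-6
```

Warping every pixel of the contoured content through `H` (with interpolation) yields the rectified image that is fed to the extractor network; the interpolation errors this introduces are among the image-processing distortions the system must tolerate.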
For visual comparison, the displayed sample watermark extractions are the raw results before error correction. After applying RS(32,16), all the watermark extractions in the testing cases can be restored to the original information in Fig. 17 without error. In these tests, the proposed system extracts the watermark within a second, as it only applies the trained weights of the extractor network to the rectified marked-image.

V. CONCLUSIONS

This paper introduces an automated image watermarking system using deep convolutional neural networks. The proposed blind image watermarking system achieves robustness without requiring prior knowledge of possible distortions on the marked-image. The proposed system constructs an unsupervised deep neural network structure with a novel loss computation for automated image watermarking. Experimental results, along with a challenging application of watermark extraction from camera-resampled marked-images, have confirmed the superior performance of the proposed system. By exploring the ability of deep neural networks to fuse the cover-image with the latent space of the watermark, the proposed system has successfully developed an image fusion application for image watermarking.

REFERENCES
[1] H. Berghel and L. O'Gorman, "Protecting ownership rights through digital watermarking," Computer, vol. 29, no. 7, pp. 101-103, 1996.
[2] R. Caldelli, F. Filippini, and R. Becarelli, "Reversible watermarking techniques: An overview and a classification," EURASIP J. Inform. Security, vol. 2010, no. 1, p. 134546, 2010.
[3] B. Gunjal and R. R. Manthalkar, "An overview of transform domain robust digital image watermarking algorithms," J. Emerg. Trends Comput. Inform. Sci., vol. 2, no. 1, pp. 37-42, 2010.
[4] I. Cox, M. Miller, J. Bloom, J. Fridrich, and T. Kalker, Digital Watermarking and Steganography, Burlington, MA, USA: Morgan Kaufmann, 2008.
[5] F. Y.
Shih, Digital Watermarking and Steganography: Fundamentals and Techniques, 2nd ed., Boca Raton, FL, USA: CRC Press, 2017.
[6] X. Kang, J. Huang, Y. Q. Shi, and Y. Lin, "A DWT-DFT composite watermarking scheme robust to both affine transform and JPEG compression," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 8, pp. 776-786, 2003.

Fig. 17. A prototype and sample information. Left: UI of the prototype; right: the sample information.
Fig. 18. A marked-image rectification inside an ROI.
Fig. 19. A few watermark extractions before ECC and their ROIs.

[7] A. A. Tamimi, A. M. Abdalla, and O. Al-Allaf, "Hiding an image inside another image using variable-rate steganography," Int. J. Adv. Comput. Sci. Appl., vol. 4, no. 10, pp. 18-21, 2013.
[8] I. J. Cox, J. Kilian, F. T. Leighton, and T. Shamoon, "Secure spread spectrum watermarking for multimedia," IEEE Trans. Image Process., vol. 6, no. 12, pp. 1673-1687, 1997.
[9] F. Y. Shih and X. Zhong, "Intelligent watermarking for high-capacity low-distortion data embedding," Int. J. Pattern Recognit. AI, vol. 30, no. 5, p. 1654003, 2016.
[10] T. Pevný, T. Filler, and P. Bas, "Using high-dimensional image models to perform highly undetectable steganography," in Proc. Int. Workshop Inform. Hiding, Calgary, Canada, Jun. 2010, pp. 161-177.
[11] T. Zong, Y. Xiang, I. Natgunanathan, S. Guo, W. Zhou, and G. Beliakov, "Robust histogram shape-based method for image watermarking," IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 5, pp. 717-729, 2015.
[12] Y. Qian, J. Dong, W. Wang, and T. Tan, "Deep learning for steganalysis via convolutional neural networks," in Media Watermarking, Security, and Forensics, vol. 9409, p. 94090J, International Society for Optics and Photonics, 2015.
[13] L. Pibre, J. Pasquet, D. Ienco, and M. Chaumont, "Deep learning is a good steganalysis tool when embedding key is reused for different images, even if there is a cover source mismatch," Electron. Imag., vol. 1, no. 11, pp. 1-11, 2016.
[14] H. Sabah and B. Haitham, "Artificial neural network for steganography," Neural Comput. Appl., vol. 26, no. 1, pp. 111-116, 2015.
[15] S. B. Alexandre and C. J. David, "Artificial neural networks applied to image steganography," IEEE Latin Amer. Trans., vol. 14, no. 3, pp. 1361-1366, 2016.
[16] J. Robert, V. Eva, and K. Martin, "Neural network approach to image steganography techniques," in Mendel, Springer, 2015, pp. 317-327.
[17] W. Tang, S. Tan, B. Li, and J. Huang, "Automatic steganographic distortion learning using a generative adversarial network," IEEE Signal Process. Lett., vol. 24, no. 10, pp. 1547-1551, 2017.
[18] H. Kandi, D. Mishra, and S. R. S. Gorthi, "Exploring the learning capabilities of convolutional neural networks for robust image watermarking," Comput. & Security, vol. 65, pp. 247-268, 2017.
[19] S. Baluja, "Hiding images in plain sight: Deep steganography," in Proc. Adv. Neural Inform. Process. Syst., Long Beach, CA, 2017, pp. 2069-2079.
[20] D. Li, L. Deng, B. B. Gupta, H. Wang, and C. Choi, "A novel CNN based security guaranteed image watermarking generation scenario for smart city applications," Inform. Sci., vol. 479, pp. 432-447, Apr. 2019.
[21] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in Proc. IEEE Eur. Symp. Security Privacy, Saarbrucken, Germany, Mar. 2016, pp. 372-387.
[22] S. M. Mun, S. H. Nam, H. U. Jang, D. Kim, and H. K. Lee, "Finding robust domain from attacks: A learning framework for blind watermarking," Neurocomputing, vol. 337, pp. 191-202, 2019.
[23] A. Pramila, A. Keskinarkaus, and T. Seppänen, "Camera based watermark extraction - problems and examples," in Proc. Finnish Signal Process. Symp., Tampere, Finland, 2007.
[24] A. Pramila, A. Keskinarkaus, and T. Seppänen, "Increasing the capturing angle in print-cam robust watermarking," J. Syst. Softw., vol. 135, pp. 205-215, 2018.
[25] A. Katayama, T. Nakamura, M. Yamamuro, and N. Sonehara, "New high-speed frame detection method: Side trace algorithm (STA) for i-appli on cellular phones to detect watermarks," in Proc. 3rd Int. Conf. Mobile Ubiquitous Multimedia, College Park, MD, Oct. 2004, pp. 109-116.
[26] W. G. Kim, S. H. Lee, and Y. S. Seo, "Image fingerprinting scheme for print-and-capture model," in Proc. Pacific-Rim Conf. Multimedia, Berlin, Heidelberg, Germany, Nov. 2006, pp. 106-113.
[27] T. Yamada and M. Kamitani, "A method for detecting watermarks in print using smart phone: Finding no mark," in Proc. 5th Workshop Mobile Video, Oslo, Norway, Feb. 2013, pp. 49-54.
[28] L. A. Delgado-Guillen, J. J. Garcia-Hernandez, and C. Torres-Huitzil, "Digital watermarking of color images utilizing mobile platforms," in Proc. IEEE 56th Int. Midwest Symp. Circuits Syst., Columbus, OH, Aug. 2013, pp. 1363-1366.
[29] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., Venice, Italy, 2017, pp. 2223-2232.
[30] H. Li and X. J. Wu, "DenseFuse: A fusion approach to infrared and visible images," IEEE Trans. Image Process., vol. 28, no. 5, pp. 2614-2623, 2019.
[31] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[32] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, "Contractive auto-encoders: Explicit invariance during feature extraction," in Proc. 28th Int. Conf. Mach. Learn., Bellevue, WA, Jun. 2011, pp. 833-840.
[33] Y. LeCun, L. Bottou, G. B. Orr, and K. R. Müller, "Efficient backprop," in Neural Networks: Tricks of the Trade, Germany: Springer, 2012, pp. 9-48.
[34] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Proc. AAAI Conf. AI, San Francisco, CA, Feb. 2017, pp. 4278-4284.
[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, and A. C. Berg, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211-252, 2015.
[36] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Univ. Toronto, Toronto, Canada, Tech. Rep., 2009.
[37] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Representations, San Diego, CA, May 2015, pp. 1-13.
[38] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., Zurich, Switzerland, Sept. 2014, pp. 740-755.
[39] M. Zareian and H. R. Tohidypour, "Robust quantisation index modulation-based approach for image watermarking," IET Image Process., vol. 7, no. 5, pp. 432-441, 2013.
[40] J. Ouyang, G. Coatrieux, B. Chen, and H. Shu, "Color image watermarking based on quaternion Fourier transform and improved uniform log-polar mapping," Comput. & Elect. Eng., vol. 46, pp. 419-432, 2015.
[41] I. S. Reed and G. Solomon, "Polynomial codes over certain finite fields," J. Soc. Ind. Appl. Math., vol. 8, no. 2, pp. 300-304, 1960.
[42] L. G. Brown, "A survey of image registration techniques," ACM Comput. Surveys, vol. 24, no. 4, pp. 325-376, 1992.
[43] S. Zokai and G. Wolberg, "Image registration using log-polar mappings for recovery of large-scale similarity and projective transformations," IEEE Trans. Image Process., vol. 14, no. 10, pp. 1422-1434, 2005.