Towards improved lossy image compression: Human image reconstruction with public-domain images

Ashutosh Bhown 1, Soham Mukherjee 2, Sean Yang 3, Irena Fischer-Hwang 4, Shubham Chandak 4, Kedar Tatwawadi 4, Judith Fan 5, Tsachy Weissman 4

1 Palo Alto High School, 2 Monta Vista High School, 3 Saint Francis High School, 4 Stanford University, 5 University of California San Diego

schandak@stanford.edu

Abstract

Lossy image compression has been studied extensively in the context of typical loss functions such as RMSE, MS-SSIM, etc. However, compression at low bitrates generally produces unsatisfying results. Furthermore, the availability of massive public image datasets appears to have hardly been exploited in image compression. Here, we present a paradigm for eliciting human image reconstruction in order to perform lossy image compression. In this paradigm, one human describes images to a second human, whose task is to reconstruct the target image using publicly available images and text instructions. The resulting reconstructions are then evaluated by human raters on the Amazon Mechanical Turk platform and compared to reconstructions obtained using the state-of-the-art compressor WebP. Our results suggest that prioritizing semantic visual elements may be key to achieving significant improvements in image compression, and that our paradigm can be used to develop a more human-centric loss function.

Data: The images, results and additional data are available at https://compression.stanford.edu/human-compression.

Introduction

Image compression is critical for achieving efficient storage and communication of image data. Today's most popular image formats and compression techniques include PNG [1], JPEG [2], JPEG2000 [3], JPEG XR [4], BPG [5] and WebP [6]. In order to achieve significant reduction in image size, most compression techniques allow some loss while compressing images.
However, traditional loss functions prioritize pixel-by-pixel fidelity, leading to blurry and unnatural images at high loss levels. The left two panels of Figure 1 show an example in which compression and reconstruction using WebP [6] result in a severely blurred image. It is natural to posit that better compression results might be achieved by optimizing for visual properties that a human viewer cares about preserving, rather than local, pixel-level differences. The rightmost panel of Figure 1 shows an example of a reconstruction which prioritizes image content over pixel-level fidelity. Indeed, there has been a large body of work in the computer vision community [7][8][9] towards developing loss functions that more accurately reflect human visual priorities. Some compression methods, for example, take advantage of the fact that human visual perception is more sensitive to differences in intensity than in color, and quantize color space more crudely than intensity space in order to achieve better compression performance.

Footnote: Most of this work was performed as part of the first three authors' summer internship at the Stanford Electrical Engineering department.

Figure 1: Original giraffe image with WebP and human reconstructions. (A) Original giraffe image, (B) reconstruction of a WebP compressed version that is over 1,000× smaller than the original, and (C) a human reconstruction whose compressed representation is similar in size to the WebP compressed version.

Here we propose a novel strategy for improving lossy image compression that leverages human image reconstruction behavior in order to expose which visual properties humans care about preserving in images. We introduce a novel experimental paradigm for accomplishing “human image compression” involving two human participants who interact with one another in real time.
During each interaction, the “describer” privately views a target image and provides natural language descriptions of the image with the goal of helping the “reconstructor” (who cannot view the target image) produce a reconstruction of that image. Both participants have full access to publicly available images on the internet, allowing them to augment their natural language descriptions with links to images, which, in turn, may be combined during target image reconstruction. Since this experimental setup results in a set of instructions for creating a reconstruction of some target image under some definition of loss, the paradigm may be thought of as accomplishing lossy compression.

To determine the quality of the reconstructions, we solicited human judgments about the reconstructed images on the Amazon Mechanical Turk (MTurk) platform [10], and benchmarked the performance of human reconstructions against the state-of-the-art compressor WebP. We present the results of human compression for 13 high-resolution images of different types.

Related Works

There has been significant work on incorporating aspects of the human visual system towards improving lossy image compressors. Many commonly used compressors such as JPEG, JPEG2000 and WebP already attempt to implicitly capture properties of human perception. For example, the human visual system is relatively insensitive to fine, high-frequency detail, so JPEG quantizes high-frequency components heavily. The MS-SSIM metric was developed to emulate the higher-level image similarity that seems to be valued by humans, and is used by [11] and [12] for optimizing image compression. The compressor Guetzli [13] includes a perceptual JPEG encoder optimized for a new image similarity metric dubbed “butteraugli” [14].
More recently, [15] trained a neural network to predict human perceptual quality scores on a large dataset of human-scored images.

Another interesting line of work attempts to capture the effects of human perception by using generative models for lossy compression, which implicitly capture distributions of natural images. Discriminator models, rather than image similarity metrics like RMSE or MS-SSIM, are then used to train the generative models. The discriminator models are themselves trained to distinguish between natural and synthetically generated images. For example, [16] uses generative adversarial networks to obtain visually pleasing images at low bitrates.

Video encoders such as MPEG [17] attempt to exploit extreme structural similarity (i.e., translational similarity) between adjacent video frames. However, apart from video data, exploitation of semantic similarities (i.e., similar high-level features such as objects, persons, etc.) remains a secondary priority in image compression.

Methods

Our “human image compression” setup circumvents modeling aspects of human visual perception by directly utilizing humans in a lossy image compression task. The setup involves two human participants, referred to as the describer and reconstructor, as presented in the Introduction.

Figure 2: The human image compression setup. The describer fixes a target image, and attempts to describe it using text instructions, including URL links. The reconstructor attempts to reconstruct the image based on the text instructions. The describer is able to view the reconstruction, hear the reconstructor's voice and receive text feedback from the reconstructor. Both have access to the internet.

Figure 3: Excerpts from a human compression session. A simplified excerpt of the reconstruction process for the giraffe image. The text on the left shows text communications sent from the describer to the reconstructor.
Various stages of reconstruction are shown on the right, describing the background grass and bush, and the giraffes. Examples of internet links (blue) to publicly available images are also shown.

For every input target image, the roles of the describer and reconstructor are as follows:

• Describer: Analyzes/recognizes the input image and informs the reconstructor of the necessary steps to best recreate the target image. The describer communicates with the reconstructor only via real-time text chat and may view the reconstructed image in progress. They may also receive verbal communications from the reconstructor.

• Reconstructor: Interprets text instructions from the describer in order to produce a reconstruction of the original image. The reconstructor is not permitted to view the original image until the reconstruction is complete, but may communicate with the describer.

This human image compression paradigm combines two key aspects. First, it exploits human participants' preexisting competence in visual scene understanding and natural language use in order to elicit what information is prioritized in images. Second, it leverages public domain image data, thus avoiding the need to allocate additional disk space for visual information that is well-approximated by publicly available data. By permitting the exchange of natural language and pointers to publicly available images, our approach aims to characterize the limits of human-guided image compression under realistic expectations about the semantic knowledge and visual data that are shared between a sender and receiver [18]. In principle, this empirical “human-centric” approach may lead to the discovery of improved loss functions that respect human visual priorities to a greater degree than current compression techniques do.
In addition to descriptions and links to images, the describer may also send instructions for manipulating the linked images in order to create a satisfactory reconstruction. Altogether, our human compression scheme involves two streams of one-way communication: one text-based from the describer to the reconstructor, and one in any format from the reconstructor back to the describer. However, only the text from the describer to the reconstructor is considered to be the “compressed” representation of the input image; any communication from the reconstructor to the describer is not counted towards the size of the final compressed representation of the image. To justify our accounting, we compare our human compression scheme to a machine implementation of compression.

In our experiments, the compression process involves communication between a describer and reconstructor which produces a text transcript, as well as a reconstructed image. The interactions between describer and reconstructor may be thought of as a sequence of instructions and actions. The describer's role is to issue instructions, and the reconstructor's job is to perform actions (e.g., reference image cropping, scaling, translation, etc.). Each instruction issued by the describer is based on their access to the target image as well as the previous action performed by the reconstructor, while the reconstructor performs actions in response to each instruction issued by the describer. The instruction-action process is repeated until the target image is reconstructed to the describer's satisfaction.

In lossy compression algorithm implementations, the compression process involves elements that function similarly to the human describer and human reconstructor.
The elements that perform the description and reconstruction functions also interact much like their human counterparts do: machine “instructions” are issued based on the target image and previous actions, and actions (e.g., prediction of the next block of pixels) are produced in response to the instructions received. The instruction-action process is repeated until the entire image is compressed, and generates a compressed representation, which is analogous to the transcript produced by the human compression setup.

However, unlike the human compression setup, in machine-implemented compression a reconstructed image is only produced when the decompression process is executed on the transcript. Notably, the decompression process may be thought of as identical to the compression process, but with the describer replaced by the transcript. The stipulation of identicality necessitates that the actions performed by the reconstructor during decompression are identical to those performed during compression. In other words, the reconstructor must perform the same action in response to the same instruction, whether that instruction is from the describer or from a transcript. As a result, only the machine instructions flowing from the machine describer to the machine reconstructor are recorded, and the actions may be discarded from the transcript (cf., e.g., [19]).

Of course, due to the large amount of variation in human cognition and behavior, it is unlikely for a human reconstructor to perform actions during the decompression process exactly as they did during the compression process.
However, the fact remains that were the human reconstructor's actions able to be reproduced identically upon demand in response to a received instruction, then the text transcript containing only the describer's text instructions would suffice for creating an image reconstruction identical to that produced during the compression process. For this reason, in our human compression scheme we also consider only the describer's text to be counted towards the compressed representation of the target image. Furthermore, since a reconstructed image is produced in addition to a transcript of instructions, our setup may be thought of as simultaneous execution of both compression and decompression processes.

Implementation details

The describer was provided an input image for compression, and a Skype call was initiated between the describer and reconstructor with the following restrictions. First, the describer could only communicate to the reconstructor through the inbuilt Skype text chat. The describer turned off their outgoing audio/video to avoid inadvertently leaking information to the reconstructor. Second, the reconstructor could communicate with the describer through audio, video, or text chat. Finally, the reconstructor could share partial, in-progress reconstructions with the describer in real time using Skype's screen share feature.

With these restrictions in place, the describer would begin to send a series of instructions for the reconstructor to attempt image reconstruction. Generally, the describer could send URL links to reference images that already exist on the internet, as well as specific text instructions for altering the image. A variety of image editing tasks could be sent, including: spatial translation of image elements, affine or perspective transformations, erasure or addition of certain objects in the image, enlargement of a portion of the image, compositing multiple images, etc.
Figure 3 shows parts of the reconstruction process for the giraffe image.

When reconstruction had been completed to the level of the describer's satisfaction, the experiment was stopped. The Skype text transcript containing all instructions from the describer to the reconstructor was saved. Finally, the transcript was processed by removing timestamps and compressing it using the bzip2 [20] compressor. The bzip2-encoded Skype transcript represented the final compressed representation of the input image. The quality of image reconstruction can then be compared to that of a standard lossy image compressor.

Experiments

Data Collection

We first created a dataset of original images that are not publicly available on the web. The creation of original images prevents trivial encoding via an exact copy of a non-original picture. Original images were captured with a digital camera or smartphone camera at high resolution. A wide variety of images (e.g., faces, landscapes, sketches, etc.) unknown to the describers and reconstructors were captured for the experiments. From these, we selected 13 diverse high-resolution images for our comparison experiments. The images and additional details are available in the appendix and at https://compression.stanford.edu/human-compression.

Experimental Setup

We describe the experimental procedure for evaluating the quality of reconstructions by human compressors and WebP:

1. Human compression: The input image is compressed and reconstructed by the human compression system using the procedure described in the Methods. The size (in bytes) of the compressed text instructions is recorded.

2. WebP compression: The WebP compressor is used to lossily compress the input image to a size similar to that of the compressed human text instructions.

3. Quality evaluation: The quality of the WebP and human compressed images was compared using human scorers on the MTurk platform.
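The transcript processing described under Implementation details (removing timestamps, compressing with bzip2, and recording the size in bytes) can be sketched in Python as follows. This is only a minimal illustration: the timestamp pattern is a hypothetical stand-in, since the exact format of the saved Skype transcript is not specified here, and the function names are invented for the example.

```python
import bz2
import re

# Hypothetical timestamp prefix such as "[10/18/2018 2:15:07 PM] ".
# The real Skype transcript format is an assumption, not taken from the paper.
TIMESTAMP = re.compile(r"^\[\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M\]\s*")


def strip_timestamps(transcript: str) -> str:
    """Remove a leading timestamp from every line of the transcript."""
    return "\n".join(TIMESTAMP.sub("", line) for line in transcript.splitlines())


def compressed_size_bytes(transcript: str) -> int:
    """Size in bytes of the bzip2-compressed, timestamp-free transcript."""
    cleaned = strip_timestamps(transcript).encode("utf-8")
    return len(bz2.compress(cleaned, compresslevel=9))


if __name__ == "__main__":
    # Made-up describer messages, for illustration only.
    example = (
        "[10/18/2018 2:15:07 PM] Start with a photo of two giraffes in a savanna.\n"
        "[10/18/2018 2:16:41 PM] Crop to the left giraffe and move it to the center.\n"
    )
    print(strip_timestamps(example))
    print(compressed_size_bytes(example), "bytes")
```

The byte count returned by `compressed_size_bytes` plays the role of the "compressed chat size" that the WebP file is later matched against.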
WebP [6] is a relatively recent image compressor released by Google. We chose WebP as the reference compressor for comparing image reconstruction quality since WebP outperforms JPEG and JPEG2000 at the high compression levels achieved by the human compression scheme. This is illustrated in Figure 4. However, even when compressing images using WebP at the lowest allowed quality level (quality parameter set to 0), the compressed files were much larger than those of the human compressors. As a result, we first reduced the resolution of the images before compressing with WebP with quality parameter 0 in order to attain the target size, always erring on the side of the WebP file being larger than the compressed human text instructions.

Figure 4: A comparison of JPEG, JPEG2000 and WebP compression. The giraffe image compressed using (A) JPEG, (B) JPEG2000 and (C) WebP. All reconstructions are generated from a file size that is similar to that of the human-compressed giraffe file.

Quality Evaluation using MTurk

[Figure 5 screenshot of the MTurk HIT. Instructions shown to workers: “The second image is a reconstruction of the first image. Compare the two images and rate your level of satisfaction from the reconstruction using the scale below (1 = completely unsatisfied, 10 = completely satisfied).” Reward: $0.30 per HIT. Duration: 3 minutes. Qualification required: Masters.]

Figure 5: Screen capture of the evaluation HIT that MTurk workers see in their web browser: a descriptive survey prompt at the top, the original image on the left, one image reconstruction on the right, and a rating scale below.

We compared the quality of compressed images using human scorers (workers) on Amazon Mechanical Turk (MTurk) [10], an online platform for conducting behavioral studies.
For each image, we displayed the original image and a reconstructed image in a human intelligence task (HIT) which asked workers to rate the reconstruction on a scale of 1 to 10, according to their “level of satisfaction” with the reconstruction. For each HIT and for both types of reconstruction (human compression and WebP), we collected 100 survey responses and obtained summary statistics. Figure 5 shows a screenshot of the MTurk survey as seen by the workers.

Results

Image         Original   Compressed chat  WebP size  Mean score       σ_mean
              size (KB)  size (KB)        (KB)       Human   WebP     Human  WebP
arch          1119       3.805            3.840      4.04    5.1*     0.23   0.21
balloon       92         1.951            2.036      6.22*   5.45     0.23   0.25
beachbridge   3263       4.604            4.676      4.34*   3.92     0.23   0.22
eiffeltower   2245       4.363            4.394      5.98*   5.77     0.22   0.22
face          1885       2.649            2.762      2.95    5.47*    0.19   0.20
fire          4270       2.407            2.454      6.74*   5.09     0.23   0.23
giraffe       5256       3.107            3.144      6.28*   4.48     0.24   0.21
guitarman     1648       2.713            2.730      4.88*   4.07     0.26   0.20
intersection  3751       3.157            3.238      6.8*    4.15     0.19   0.22
rockwall      4205       6.613            6.674      4.41    4.85*    0.23   0.23
sunsetlake    1505       4.077            4.088      5.08*   4.82     0.23   0.23
train         3445       1.948            2.024      6.85*   3.62     0.23   0.21
wolfsketch    1914       0.869            0.922      8.25*   3.46     0.20   0.19

Table 1: Original image size and compressed sizes along with mean MTurk scores for human and WebP reconstructions. The better mean score in each row (boldfaced in the original) is marked with an asterisk, and the standard error of the mean (σ_mean, sample size 100 for each compression method) is shown for all scores.

Table 1 shows the mean ratings given by MTurk workers to human and WebP compressed reconstructions of the 13 high-resolution images. Figure 6 shows the mean ratings with corresponding 95% confidence intervals, which were obtained via bootstrap resampling with 1000 resamples [21]. We fit these ratings with a linear mixed-effects regression model predicting rating from compressor type (human vs. WebP), with random intercepts for different images and human scorers.
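As a side note on the confidence intervals mentioned above, a bootstrap interval for one image's mean rating can be sketched as follows. This is a minimal sketch assuming the percentile variant of the bootstrap (the exact variant used is not stated here), and the ratings in the example are synthetic, not the collected MTurk data.

```python
import random
import statistics


def bootstrap_ci(ratings, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean rating."""
    rng = random.Random(seed)
    n = len(ratings)
    # Resample the ratings with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(ratings, k=n)) for _ in range(n_resamples)
    )
    # Take the empirical quantiles of the resampled means as the interval.
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Synthetic ratings on the 1-10 scale, standing in for 100 worker responses.
    rng = random.Random(1)
    ratings = [rng.randint(1, 10) for _ in range(100)]
    low, high = bootstrap_ci(ratings)
    print(f"mean = {statistics.fmean(ratings):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With 100 ratings and 1000 resamples, the interval is read off the 25th and 975th sorted resample means.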
The mixed-effects analysis revealed a marginal advantage for human image reconstructions, which were rated 0.984 points higher on average than WebP compressed images (t = 1.82, p = 0.090). This suggests that while the current study may be underpowered to detect a statistically reliable effect across images, larger studies containing more images may reveal more consistent differences in the reconstruction quality accomplished by each compression method.

[Figure 6 plot: rating (1 to 10) for each image, by compressor (human, webp). Images in plotted order: face, arch, rockwall, eiffeltower, sunsetlake, beachbridge, balloon, guitarman, fire, giraffe, intersection, train, wolfsketch.]

Figure 6: Mean ratings and 95% confidence intervals for human and WebP compression, ordered by the difference in mean ratings between the two compressor types.

Figure 7: Face image results. (A) Original face image with (B) WebP and (C) human reconstructions.

Importantly, our study also revealed a large degree of variation in both the absolute ratings given to different images and the magnitude of the difference between human and WebP reconstructions. For some images, the human reconstructions were judged to be clearly higher in quality relative to the WebP compressed images (see Figures 1, 7, 8), while still achieving high compression ratios ranging from around 100× to 1000×. For the giraffe image (Figure 1), we suspect that human reconstruction achieved a better rating than WebP because human scorers prioritize image sharpness over accuracy. In contrast, for the face image (Figure 7) human compression achieved a significantly lower quality score than WebP. We speculate that this is because facial identity is more important than individual semantic features of the image (such as the presence of facial blemishes).
On the other hand, humans achieve a much better score than WebP for the wolfsketch image (Figure 8), perhaps because human scorers are not as sensitive to differences in wolf identity. We also observed that human compression achieved better compression ratios and MTurk scores when semantically similar images were publicly available. This was the case for images of famous monuments such as the eiffeltower image, and for the intersection image where Google Street View provided images of similar road intersections.

Figure 8: Wolfsketch image results. (A) Original wolfsketch image with (B) WebP and (C) human reconstructions.

Discussion & Conclusion

We designed an experiment to better understand the potential for improving lossy image compression based on a human-centric loss. In the context of this two-player image reconstruction game, human participants played the roles of describer and reconstructor and generated compressed versions of 13 diverse images of landscapes, portraits, animals and urban settings. We evaluated the quality of human compression by comparing the human reconstructions with those generated from WebP compression. For several of the images, the human reconstructions were preferred to the WebP reconstructions (e.g., wolf, train). For those images, the human compression process was better at identifying and preserving image properties that were relevant to human scorers. However, for a number of images the WebP reconstructions were preferred over the human reconstructions (e.g., face, arch). For those images, it appears that the publicly available image data and reconstruction interface may not have been sufficient to preserve the attributes that people considered to be most important. We plan to follow up these preliminary observations with a larger human reconstruction study containing more images.
A wider breadth of test images should provide a more precise estimate of the relative quality of human reconstructions, as well as a better understanding of how the type of semantic information in an image affects how well a reconstruction can be achieved using simple operations on publicly available image data.

The human compression scheme is able to exploit semantically similar images quite effectively during compression. However, most popular compressors do not appear to take advantage of this rich public resource. Our experiment suggests that effective utilization of semantically and structurally similar images (or parts of images) can dramatically improve compression ratios. This is particularly relevant today, when images can be easily found using image search tools such as the one offered freely by Google.

While the human compression framework is useful as an exploratory tool, it is clearly not practical due to its labor-intensive nature. We did not strive to optimize our protocols in any way, and we could undoubtedly have achieved substantially better compression and reconstruction scores had we done so. Notably, each of the image reconstructions took a few hours to complete. Furthermore, redundancies in the English language resulted in sub-optimal compression, even though this is partly resolved by the use of bzip2. Our drawing skills, the use of rudimentary software for image editing, inefficiencies due to occasional misunderstandings of describer instructions by the reconstructor, and difficulty in manually searching for similar images all contributed to transcript size. Improvements on any of these fronts would further improve image reconstruction quality.

We plan to use the insights obtained from this work to build an image compressor that is both optimized for a human perception loss and able to utilize side information in the form of publicly available databases.
We look to the work in [15], which trains a neural network to predict human scores, as a strategy for training machine-based compressors for the human perception loss. We also expect to take advantage of reverse image search tools in order to better utilize side information. We believe these techniques will be key to significantly improved lossy image compression.

We also intend to further explore the theoretical limits of information transfer using both state-of-the-art image compressors and our human-inspired image compression setup. Our work was inspired in part by Claude Shannon's 1951 paper [22], where humans were used to establish an upper bound on the fundamental limit of English language compression. At the time, humans were better compressors than any practically implementable algorithm, and the paper motivated subsequent developments in text compression to match and eventually surpass the 2.3 bits/symbol shown to be achievable by human compressors. Towards this end, in future experiments we plan to generate and score human reconstructions at several compression levels for each image, and to compare the resulting curves of reconstruction quality versus compression level with those achieved by WebP. This approach would provide a more comprehensive characterization of the fundamental tradeoff between compression rate and reconstruction quality [23] for both state-of-the-art compressors and human compressors, calibrated to the same evaluation metric. The results from such a study may guide development of lossy image compression algorithms that will achieve and eventually surpass human performance.

Acknowledgement

We thank Meltem Tolunay, Yihui Quek, Jay Mardia, Yanjun Han, Dmitri Pavlichin and Ariana Mann for fruitful discussions. We also thank Debargha Mukherjee for his helpful comments on the manuscript.
We thank Lucas Washburn for permitting us to take his photo and use it in our experiments. We also thank the NSF Center for the Science of Information, NIH, the Stanford Compression Forum and Google for funding various parts of this project.

References

[1] Greg Roelofs, PNG: The Definitive Guide, O'Reilly & Associates, Inc., Sebastopol, CA, USA, 1999.
[2] Gregory K Wallace, “The JPEG still picture compression standard,” Communications of the ACM, vol. 34, no. 4, pp. 30–44, 1991.
[3] David Taubman and Michael Marcellin, JPEG2000 Image Compression Fundamentals, Standards and Practice, Springer Publishing Company, Incorporated, 2013.
[4] “JPEG XR,” https://jpeg.org/jpegxr/, Accessed: 2018-10-22.
[5] “BPG,” https://bellard.org/bpg/, Accessed: 2018-10-22.
[6] “WebP,” https://developers.google.com/speed/webp/, Accessed: 2018-10-16.
[7] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[8] Zhou Wang, Eero P Simoncelli, and Alan C Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. IEEE, 2003, vol. 2, pp. 1398–1402.
[9] Troy Chinen, Johannes Ballé, Chunhui Gu, Sung Jin Hwang, Sergey Ioffe, Nick Johnston, Thomas Leung, David Minnen, Sean O'Malley, Charles Rosenberg, et al., “Towards a semantic perceptual image metric,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 624–628.
[10] Michael Buhrmester, Tracy Kwang, and Samuel D Gosling, “Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data?,” Perspectives on Psychological Science, vol. 6, no. 1, pp. 3–5, 2011.
[11] Thomas Richter and Kil Joong Kim, “A MS-SSIM optimal JPEG 2000 encoder,” in Data Compression Conference, 2009. DCC’09. IEEE, 2009, pp. 401–410.
[12] Johannes Ballé, Valero Laparra, and Eero P Simoncelli, “End-to-end optimized image compression,” arXiv preprint arXiv:1611.01704, 2016.
[13] Giaime Ginesu, Maurizio Pintus, and Daniele D Giusto, “Objective assessment of the WebP image coding algorithm,” Signal Processing: Image Communication, vol. 27, no. 8, pp. 867–874, 2012.
[14] “butteraugli,” https://github.com/google/butteraugli/, Accessed: 2018-10-16.
[15] Troy Chinen, Johannes Ballé, Chunhui Gu, Sung Jin Hwang, Sergey Ioffe, Nick Johnston, Thomas Leung, David Minnen, Sean O’Malley, Charles Rosenberg, et al., “Towards a semantic perceptual image metric,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 624–628.
[16] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool, “Generative adversarial networks for extreme learned image compression,” arXiv preprint arXiv:1804.02958, 2018.
[17] Didier Le Gall, “MPEG: A video compression standard for multimedia applications,” Communications of the ACM, vol. 34, no. 4, pp. 46–58, 1991.
[18] H.H. Clark, Using Language, [ACLS Humanities E-Book]. Cambridge University Press, 1996.
[19] A. No and T. Weissman, “Rateless lossy compression via the extremes,” IEEE Transactions on Information Theory, vol. 62, no. 10, pp. 5484–5495, Oct 2016.
[20] “bzip2,” http://www.bzip.org/, Accessed: 2018-10-16.
[21] Bradley Efron and Robert Tibshirani, “Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy,” Statistical Science, pp. 54–75, 1986.
[22] Claude E Shannon, “Prediction and entropy of printed English,” Bell System Technical Journal, vol. 30, no. 1, pp. 50–64, 1951.
[23] Claude Elwood Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.

Appendix

Additional details

Table 2 contains additional details about the images and the Mechanical Turk experiments.

Image          Original      WebP          Original     Compressed chat   WebP
               resolution    resolution    size (KB)    size (KB)         size (KB)
arch           1762 × 2286   506 × 656     1119         3.805             3.840
balloon        1024 × 683    630 × 420     92           1.951             2.036
beachbridge    4032 × 3024   500 × 375     3263         4.604             4.676
eiffeltower    2448 × 3264   492 × 656     2245         4.363             4.394
face           3024 × 4032   435 × 580     1885         2.649             2.762
fire           3036 × 4048   375 × 500     4270         2.407             2.454
giraffe        5472 × 3648   528 × 352     5256         3.107             3.144
guitarman      1136 × 640    550 × 310     1648         2.713             2.730
intersection   3024 × 4032   450 × 600     3751         3.157             3.238
rockwall       3036 × 4048   531 × 708     4205         6.613             6.674
sunsetlake     3264 × 2448   1148 × 861    1505         4.077             4.088
train          4032 × 3024   340 × 255     3445         1.948             2.024
wolfsketch     2698 × 3539   290 × 380     1914         0.869             0.922

Table 2: Resolution and original/compressed size for the images. Chat transcripts were compressed with bzip2. WebP resolution was reduced till the file size just exceeded the compressed chat transcript size, keeping quality parameter 0 and aspect ratio fixed.

Images

This section contains all 13 original images along with their WebP and human reconstructions.

Figure 9: (A) Original arch image with (B) WebP and (C) human reconstructions.
Figure 10: (A) Original balloon image with (B) WebP and (C) human reconstructions.
Figure 11: (A) Original beachbridge image with (B) WebP and (C) human reconstructions.
Figure 12: (A) Original eiffeltower image with (B) WebP and (C) human reconstructions.
Figure 13: (A) Original face image with (B) WebP and (C) human reconstructions.
Figure 14: (A) Original fire image with (B) WebP and (C) human reconstructions.
Figure 15: (A) Original giraffe image with (B) WebP and (C) human reconstructions.
Figure 16: (A) Original guitarman image with (B) WebP and (C) human reconstructions.
Figure 17: (A) Original intersection image with (B) WebP and (C) human reconstructions.
Figure 18: (A) Original rockwall image with (B) WebP and (C) human reconstructions.
Figure 19: (A) Original sunsetlake image with (B) WebP and (C) human reconstructions.
Figure 20: (A) Original train image with (B) WebP and (C) human reconstructions.
Figure 21: (A) Original wolfsketch image with (B) WebP and (C) human reconstructions.
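
The size-matching procedure described in the caption of Table 2 can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the pixel-by-pixel downward scan over widths, and the `encoded_size` callback (standing in for "encode at this resolution with quality parameter 0 and report the file size in bytes") are all assumptions made for illustration. Only the use of bzip2 for the chat transcript and the stopping rule (the last resolution whose file size still exceeds the compressed transcript size, aspect ratio fixed) come from the text.

```python
import bz2
from typing import Callable, Optional, Tuple


def compressed_transcript_size(transcript: str) -> int:
    """Size in bytes of the bzip2-compressed chat transcript."""
    return len(bz2.compress(transcript.encode("utf-8")))


def match_resolution(
    encoded_size: Callable[[int, int], int],  # hypothetical: bytes at (width, height)
    orig_width: int,
    orig_height: int,
    target_bytes: int,
) -> Tuple[int, int, int]:
    """Shrink the image until its encoded size just exceeds target_bytes.

    Scans widths downward from the original (an assumed search strategy),
    keeps the aspect ratio fixed, and returns the last (width, height, size)
    whose encoded size is still strictly larger than target_bytes.
    """
    best: Optional[Tuple[int, int, int]] = None
    for width in range(orig_width, 0, -1):
        # Keep the aspect ratio fixed; never let the height reach zero.
        height = max(1, round(width * orig_height / orig_width))
        size = encoded_size(width, height)
        if size <= target_bytes:
            break  # one step too far: the previous resolution is the answer
        best = (width, height, size)
    if best is None:
        raise ValueError("full-resolution file does not exceed the target size")
    return best
```

In practice `encoded_size` would resize the original and encode it as WebP at quality 0, for example by writing to an in-memory buffer with an image library and measuring the buffer length; the paper does not state which tool was used for this step. A downward scan is slow for large originals, and a bisection over widths would be faster if file size is assumed to grow monotonically with resolution.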
