Wavelet decomposition of software entropy reveals symptoms of malicious code

Michael Wojnowicz, Glenn Chisholm, Matt Wolff, Xuan Zhao
Dept. of Research and Intelligence, Cylance, Inc.
18201 Von Karman Drive, Irvine, CA 92612
{mwojnowicz, gchisholm, mwolff, xzhao}@cylance.com

ABSTRACT

Sophisticated malware authors can sneak hidden malicious contents into portable executable files, and these contents can be hard to detect, especially if encrypted or compressed. However, when an executable file switches between content regimes (e.g., native, encrypted, compressed, text, and padding), there are corresponding shifts in the file's representation as an entropy signal. In this paper, we develop a method for automatically quantifying the extent to which patterned variations in a file's entropy signal make it "suspicious." In Experiment 1, we use wavelet transforms to define a Suspiciously Structured Entropic Change Score (SSECS), a scalar feature that quantifies the suspiciousness of a file based on its distribution of entropic energy across multiple levels of spatial resolution. Based on this single feature, it was possible to raise predictive accuracy on a malware detection task from 50.0% to 68.7%, even though the single feature was applied to a heterogeneous corpus of malware discovered "in the wild." In Experiment 2, we describe how wavelet-based decompositions of software entropy can be applied to a parasitic malware detection task involving large numbers of samples and features. By extracting only string and entropy features (with wavelet decompositions) from software samples, we are able to obtain almost 99% detection of parasitic malware with fewer than 1% false positives on good files.
Moreover, the addition of wavelet-based features uniformly improved detection performance across plausible false positive rates, both in a strings-only model (e.g., from 80.90% to 82.97%) and a strings-plus-entropy model (e.g., from 92.10% to 94.74%, and from 98.63% to 98.90%). Overall, wavelet decomposition of software entropy can be useful for machine learning models for detecting malware based on extracting millions of features from executable files. [Footnote 1]

Footnote 1: This article is a post-print of [18] which corrects typos introduced during editing.

KEYWORDS: wavelet decomposition, structural entropy, malware detection, parasitic malware, machine learning

1. INTRODUCTION

1.1 The Entropy Of Malicious Software

A fundamental goal in the information security industry is malware detection. In this paper, we focus our malware detection efforts on the fact that malicious files (e.g., parasitics, or exploits with injected shellcode) commonly contain encrypted or compressed ("packed") segments which conceal malicious contents [3]. Thus, the information security industry has been interested in developing methodologies which can automatically detect the presence of encrypted or compressed segments hidden within portable executable files. To this end, entropy analysis has been used, because files with high entropy are relatively likely to have encrypted or compressed sections inside them [5].

In general, the entropy of a random variable reflects the amount of uncertainty (or lack of knowledge) about that variable. In the context of software analysis, zero entropy would mean that the same character was repeated over and over (as might occur in a "padded" chunk of code), and maximum entropy would mean that a chunk consisted of entirely distinct values. Thus, chunks of code that have been compressed or encrypted tend to have higher entropy than native code.
For instance, in the software corpus studied by [5], plain text had an average entropy of 4.34, native executables had an average entropy of 5.09, packed executables had an average entropy of 6.80, and encrypted executables had an average entropy of 7.17.

1.2 Suspiciously Structured Entropy

Based on the reasoning above, previous research has used high mean entropy as an indicator of encryption or compression. However, malicious contents, when concealed in a sophisticated manner, may not be detectable through simple entropy statistics, such as mean file entropy. Malware writers sometimes try to conceal hidden encrypted or compressed code that they introduce in creating files such as parasitic malware; for instance, they may add additional padding (zero entropy chunks), so that the file passes through high entropy filters. However, files with concealed encrypted or compressed segments tend to vacillate markedly between native code, encrypted and compressed segments, and padding, with each segment having distinct and characteristic expected entropy levels. Thus, the field of cybersecurity has started to pay attention to files with highly structured entropy [11], [2], that is, files whose code flips between various distinguishing levels of entropy through the file.

In order to automatically identify the degree of entropic structure within a piece of software, we represent each portable executable file as an "entropy stream." The entropy stream describes the amount of entropy over a small snippet of code in a certain location of the file. The "amount" of entropic structure can then be quantified, such that we can differentiate, for example, between a low-structured signal with a single local mean and variation around that mean, versus a highly-structured signal whose local mean changes many times over the course of the file.
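The entropy stream just described can be illustrated with a minimal sketch. It assumes Shannon entropy in bits computed over non-overlapping 256-byte chunks (the chunking detailed later in Section 2.2.1). The function names `chunk_entropy` and `entropy_stream` are hypothetical, and dropping a trailing partial chunk is an assumption of this sketch rather than something the paper specifies.

```python
import math
from collections import Counter

def chunk_entropy(chunk):
    """Shannon entropy (in bits) of the byte-value distribution within one chunk."""
    n = len(chunk)
    counts = Counter(chunk)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_stream(data, chunk_size=256):
    """Entropy of each non-overlapping chunk of `data`.

    Assumption: a trailing partial chunk is dropped.
    """
    n_chunks = len(data) // chunk_size
    return [
        chunk_entropy(data[i * chunk_size:(i + 1) * chunk_size])
        for i in range(n_chunks)
    ]

# Toy "file": one padded chunk (a single repeated byte) followed by
# one chunk in which every byte value occurs exactly once.
toy_file = bytes(256) + bytes(range(256))
stream = entropy_stream(toy_file)
```

On this toy input the first chunk sits at the minimum entropy (0 bits) and the second at the maximum (8 bits), which is the kind of regime switch the entropy stream is meant to expose.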
In this paper [Footnote 2], we define suspiciously structured entropy as a particular pattern of entropic structure which matches those of malicious files. To quantify the suspiciousness of the structured entropy within a piece of software, we develop the notion of a "Suspiciously Structured Entropic Change Score" (SSECS). We first describe how to calculate SSECS as a single predictive feature, and analyze its performance in malware detection. We then generalize this feature to large-scale malware detection tasks. The derivation of the SSECS feature depends upon the notion of a wavelet transform, which we now briefly review.

1.3 Brief Overview Of Wavelets

The Wavelet Transform is the primary mathematical operator underlying our quantification of structurally suspicious entropy. The Wavelet Transform extracts the amount of "detail" exhibited within a signal at various locations over various levels of resolution [8]. In essence, it transforms a one-dimensional function of "location" (in our case, file location) into a two-dimensional function of "location" and "scale." By using the output of the wavelet transform (the so-called "wavelet coefficients"), it is possible to obtain a series of coarse-to-fine approximations of an original function. These successive approximations allow us to determine the multi-scale structure of the entropy signal, in particular the "energy" available at different levels of resolution.

For this paper, we apply Haar Wavelets, which are a particularly simple family of wavelets whose members are piecewise constant. The Haar Wavelet Transform projects the original entropy signal onto a collection of piecewise constant functions which oscillate as a square wave over bounded support (i.e., the functions assume non-zero values only on certain bounded intervals).
Since these piecewise constant functions have supports which vary in their scale (width) and location, the resulting projections describe the "detail" within the signal at various locations and resolutions.

More specifically, the Haar Wavelet Transform is based upon the so-called "mother function", $\psi(t)$, defined by:

\[
\psi(t) =
\begin{cases}
1, & t \in [0, 1/2) \\
-1, & t \in [1/2, 1) \\
0, & \text{otherwise}
\end{cases}
\]

a very simple step function. Given the Haar mother function $\psi(t)$, a collection of dyadically scaled and translated wavelet functions $\psi_{j,k}(t)$ is formed by:

\[
\psi_{j,k}(t) = 2^{j/2} \, \psi(2^{j} t - k) \tag{1}
\]

where the integers $j, k$ are scaling parameters. The dilation parameter $j$ indexes the level of detail or resolution, and the translation parameter $k$ selects a certain location within the signal to be analyzed. Note that as the scaling parameter $j$ increases, the function $\psi_{j,k}$ applies to (is non-zero over) successively finer intervals of the signal. Some example Haar wavelet functions are shown in Figure 1.

Footnote 2: This paper is a development of earlier research originally published in conference proceedings [15]. For a more comprehensive viewpoint, see [16].

Figure 1: Examples of Haar wavelet functions. [Three panels: "Wavelet Functions (Resolution J=0)", "Wavelet Functions (Resolution J=1)", "Wavelet Functions (Resolution J=2)".] Here we show some Haar wavelet functions over the unit interval. Each colored square wave represents (the non-zero part of) a different wavelet function. The Haar wavelet functions are defined in Equation 1. In particular, we plot wavelet functions for resolution levels $j = 0, 1, 2$ and locations $k = 0, \ldots, j$. These wavelet functions are used as filters to pick up the magnitude of entropic change in a piece of software at different levels of resolution and in different file locations.
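As a small illustration of the mother function and of Equation 1, the following sketch evaluates Haar wavelet functions pointwise. The function names `haar_mother` and `haar_wavelet` are hypothetical and chosen only for this example.

```python
def haar_mother(t):
    """Haar mother function: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0

def haar_wavelet(j, k, t):
    """Dyadically scaled and translated wavelet psi_{j,k}(t) from Equation 1."""
    return 2.0 ** (j / 2.0) * haar_mother(2.0 ** j * t - k)

# psi_{1,1} is non-zero only on [1/2, 1): it is zero at t = 0.25,
# positive on the first half of its support, negative on the second half.
outside_support = haar_wavelet(1, 1, 0.25)
first_half = haar_wavelet(1, 1, 0.6)
second_half = haar_wavelet(1, 1, 0.9)
```

Increasing `j` halves the width of the support and scales the amplitude by a factor of the square root of 2, which is what keeps every wavelet at unit norm.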
Given a signal $x(t)$ where $t = 1, \ldots, T$, we first rescale the signal so that the first observation occurs at time $t = 0$ and the final observation occurs at time $t = 1$. Then, the so-called "mother wavelet coefficient" at scale $j$ and location $k$ is given by the inner product of the signal with the wavelet. Since we are dealing with discrete signals, the inner product takes the form:

\[
d_{j,k} = \langle x, \psi_{j,k} \rangle = \sum_{t=1}^{T} x(t) \, \psi_{j,k}(t).
\]

One interpretation of this coefficient is that it gives the (scaled) difference between local averages of the signal across neighboring chunks or bins. The size of the neighboring chunks is determined by the scaling parameter $j$.

The family of mother wavelet coefficients, $\{d_{j,k}\}$, enables a "Multi-Resolution Analysis" (MRA) of the signal $x(t)$. In particular, the signal $x(t)$ can be decomposed into a series of approximations $x_j(t)$, whereby each successive approximation $x_{j+1}(t)$ is a more detailed refinement of the previous approximation, $x_j(t)$. The functional approximations are obtained through the wavelet coefficients by the formula:

\[
x_{j+1}(t) = x_j(t) + \sum_{k=0}^{2^{j}-1} d_{j,k} \, \psi_{j,k}(t) \tag{2}
\]

where $x_0(t)$, the coarsest-level functional approximation, is the mean of the full signal. Thus, the collection of mother wavelet coefficients $\{d_{j,k}\}$ stores the "details" that allow one to move from a coarser approximation to a finer approximation. Examples of successive functional approximations, in the context of software entropy signals, are shown in Figure 2.

Figure 2: Wavelet-based functional approximations to a software's entropy signal at different levels of resolution. [Three panels: "Projected signal, resolution level 3", "Projected signal, resolution level 5", "Projected signal, resolution level 8"; x-axis: File Location (Index of Chunk of Raw Bytes); y-axis: Entropy.] Here, we show the entropy signal from a single Portable Executable (PE) file projected onto Haar father wavelet space at different levels of resolution ($j \in \{2, 5, 8\}$ from Equation 2). In general, each successive functional approximation adds the incremental detail provided at that level of spatial resolution, compared to the next-most-coarse level of spatial resolution, and does so across various spatial locations.

Using the wavelet transform, it is possible to "summarize" the overall amount of detail in a signal at various levels of resolution. The total amount of detail at a particular ($j$th) level of resolution is known as the energy at that level of resolution:

\[
E_j = \sum_{k=1}^{2^{j-1}} (d_{jk})^2 \tag{3}
\]

The distribution of energy across various levels of resolution is known as an energy spectrum. Note that the energy at resolution level $j$ is just the squared Euclidean norm of the vector of mother wavelet coefficients from resolution level $j$. After this step, we have reduced the original signal of size $T = 2^J$ (and resultant wavelet vector of size $T - 1$) to a vector of $J$ elements, where each element represents the amount of "energy" at a single level of resolution.

1.4 Wavelet-Based Classifiers

The energy spectra of signals have been very useful features for classifiers such as neural networks. In fact, this combined strategy, whereby the coefficients from a discrete wavelet transform are used as node activations in a neural network, is referred to as a wavelet neural network (WNN) strategy (see e.g. [10], [?]). Using WNNs, researchers have been able to automatically classify lung sounds into categories (crackles, wheezes, stridors, squawks, etc.)
[6], to automatically determine whether brain EEG scans originated from healthy patients, patients with epilepsy, or patients who were in the middle of having a seizure [9], or to automatically determine whether EMG signals collected from the bicep originated from patients who were healthy, suffering from myopathy, or suffering from neurogenic disease [4].

We refer to the overall strategy of using wavelet coefficients as features in a classifier as a Wavelet-Based Classifier strategy. We prefer this term over WNN, which, although well-established in the literature, is specific to neural network classifiers. Indeed, in this paper, we choose logistic regression (both standard and regularized) rather than a neural network to model our data, because the logistic regression model provides an atomic analysis of the relationship between the wavelet-based features and classification categories.

1.5 Suspiciously Structured Entropic Change Score (SSECS)

The initial fundamental problem with applying wavelet-based classifiers to malware analysis is that executable files out "in the wild" have different lengths. This contrasts with controlled observational situations, e.g. those described above, which produce signal samples of fixed length that are held constant across the data set. In controlled observational situations, all samples will produce the same number of features, $J$, and variation across this set of $J$ features can be immediately associated with a classification variable in a straightforward manner, for example by setting the input layer of the neural network to have $J$ activation nodes.

However, in uncontrolled observational contexts, signal lengths can differ wildly from sample to sample. Imagine, for instance, comparing signal A of length 32 (so $J = 5$, and if $E_{f,j}$ represents the energy at resolution level $j = 1, \ldots, J$ for portable executable file $f$, we would have $E_{a,1}, \ldots, E_{a,5}$) with signal B of length 256 (so $J = 8$, and we have $E_{b,1}, \ldots, E_{b,8}$). How should we compare these two files?

Our solution to this problem, for smaller data sets [Footnote 3], is to transform each file's $J$-dimensional energy spectrum into a single scalar feature, a 1-dimensional "Suspiciously Structured Entropic Change Score" (SSECS). The computation of SSECS is a two-step process: first, we compute the wavelet-based energy spectrum of a file's entropy signal, and second, we compute the file's malware propensity score from that energy spectrum. In our case, we fit a logistic regression model to the binary classification response (malware or not) which uses these wavelet energy features as predictor variables. We fit $J$ separate regression models, one for each file size grouping. Given the Energy Spectrum $\{E_{f,j}\}$, which is the set of wavelet energies for each resolution level $j = 1, \ldots, J$ of portable executable file $f$, the logistic regression model estimates $\hat{P}_f$, the predicted probability that file $f$ is malware, by the formula

\[
\hat{P}_f = \frac{1}{1 + \exp\left[ -\left( \beta_0 + \sum_{j=1}^{J} E_{f,j} \, \beta^{(J)}_j \right) \right]}
\]

where $\beta^{(J)}_j$ is a model parameter, known as a "logistic regression coefficient", from the $J$th logistic regression model. This number, $\hat{P}_f$, is what we refer to as the SSECS.

Footnote 3: A second solution, for larger datasets, is described in Experiment 2.

2. EXPERIMENT 1: ANALYZING AND EVALUATING THE PREDICTIVE PERFORMANCE OF A SINGLE WAVELET-BASED FEATURE

In Experiment 1, we attempt to assess the predictive value of SSECS as a single feature describing potentially suspicious variation in software entropy.
In particular, as discussed in Section 2.2, the wavelet-based feature is constructed in an attempt to describe the "suspiciousness" of a piece of software's entropy signal when that entropy signal is re-represented, through a wavelet transform, in terms of entropic change distributed across different levels of spatial resolution.

2.1 Data

Data are a set of n=39,968 portable executable files from a Cylance repository. 19,988 (50.01%) of these files were known to be malicious, and the remaining files were benign. These files were collected "from the wild," and are thus highly heterogeneous. For example, the "malware" category contains different types of malicious software (e.g., viruses, Trojan horses, spyware, backdoors, bots, and ransomware, but not adware).

2.2 Method

2.2.1 Constructing the entropy stream

To compute the entropy of an executable file, the original file, represented in hexadecimal (00h-FFh), is split into non-overlapping chunks of fixed length, typically 256 bytes. For each chunk of code, the entropy is then computed using the formula below:

\[
H(c) = -\sum_{i=1}^{m} p_i(c) \log_2 p_i(c), \tag{4}
\]

where $c$ represents a particular chunk of code, $m$ represents the number of possible characters (here, $m = 256$), and $p_i$ is the probability (observed frequency) of each character in the given chunk of code. The entropy for any given chunk then ranges from a minimum of 0 to a maximum of 8.

2.2.2 Computing the Suspiciously Structured Entropic Change Score (SSECS)

The procedure for computing the suspiciously structured entropic change score (SSECS) is as follows:

1) Partition data set by size: Group sampled files into $j = \{1, \ldots, J\}$ groups, where $j = \lfloor \log_2 T \rfloor$ and $T$ is the length of the file's entropy stream.

2) Iterate: For all files which fall into the $j$th length group:

2a) Compute Haar Discrete Wavelet Coefficients: The discrete wavelet transform takes as input a discrete series of size $T = 2^J$ observations. Because the transform requires the series to have a dyadic length, if the number of observations in the executable file's entropy stream is not an integer power of 2, we right-truncate the series at value $2^{\lfloor \log_2 T \rfloor}$. The so-called "mother" wavelet coefficients, $d_{jk}$, describe the "detail" at successively fine-grained resolutions. In particular, the mother wavelet coefficients are indexed such that $j \in \{1, \ldots, J\}$ represents the resolution level, ordered from coarse-grained to fine-grained, and $k \in \{1, \ldots, K = 2^{j-1}\}$ represents the particular location (or bin) of the entropy signal at that resolution level. At each resolution level $j$, the signal is divided into $N_j = 2^{j}$ non-overlapping, adjacent bins such that each bin includes $B_j = 2^{J-j}$ observations, and each coefficient compares one pair of neighboring bins. Note that the number of bins increases as $j$ increases to finer resolutions. The mother wavelet coefficient at index $(k, j)$ is then given by:

\[
d_{kj} = \frac{1}{s_j} \left( \sum_{i=(2k-1)B_j+1}^{2kB_j} y_i \; - \sum_{i=(2k-2)B_j+1}^{(2k-1)B_j} y_i \right) \tag{5}
\]

where the scaling factor is $s_j = (\sqrt{2})^{J-j+1}$ and is necessary for the wavelet transform to preserve the size (norm) of the signal. There are $T-1$ mother wavelet coefficients.

2b) Compute Wavelet Energy Spectrum: The wavelet energy spectrum summarizes the "detail" or "variation" available at various resolution levels. The energy spectrum is computed as a function of the mother wavelet coefficients, $d_{jk}$. In particular, the "energy", $E_j$, of the entropy stream at the $j$th resolution level is defined by Equation 3.
Given a particular executable file's entropy stream, we refer to its distribution of energy over different resolutions as the file's "energy spectrum."

2c) Compute Wavelet Energy Suspiciousness: Now we use the wavelet energy spectrum to determine the "propensity" of each file to be malware (i.e., its suspiciousness). Computing this propensity requires training. We use 5-fold cross-validation.

2c1) Partition The Current Sample Of Files: Split the entire set of $F_J$ files which are of the appropriate size into 5 mutually exclusive subsets $F^1_J, \ldots, F^5_J$, each of which represents exactly 20% of the entire sample.

2c2) Iterate: For each subset $F^i_J$, where $i \in \{1, \ldots, 5\}$:

2c2a) Fit a logistic regression: Fit a logistic regression model on the other four subsets $\{F^k_J : k \neq i\}$, where the model fits the class variable (malware or not) as a function of the wavelet energy spectrum. The logistic regression model will produce a set of beta coefficients to weigh the strength of each resolution energy on the file's probability of being malware.

2c2b) Calculate malware propensity: Use the logistic regression model above to then make a prediction about files in subset $F^i_J$. In particular, use the model learned in step 2c2a to calculate the predicted probability that each file in set $F^i_J$ is malware, given its wavelet energy spectrum. This malware propensity (i.e., predicted malware probability) lies within the interval $[0, 1]$, and is what we call the Suspiciously Structured Entropic Change Score (SSECS).

2.3 Results

2.3.1 Suspicious Patterns of Entropic Change in A Single File Size Group

How does the model transform these wavelet energy spectra into predictions about whether the file is malware (that is, into a Suspiciously Structured Entropic Change Score)? To illustrate, we consider the subset of n=1,599 files in our corpus belonging to file size group $J = 5$.
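For a file in size group $J = 5$ (an entropy stream of 32 chunks), the energy features produced by steps 2a and 2b of Section 2.2.2 could be computed as in the following sketch. It follows Equations 5 and 3 directly rather than calling a wavelet library. The function name `haar_energy_spectrum` and the two synthetic streams are hypothetical, chosen only to show where energy lands.

```python
import math

def haar_energy_spectrum(y):
    """Return [E_1, ..., E_J] for entropy stream y, per Equations 5 and 3."""
    J = len(y).bit_length() - 1          # J = floor(log2(T))
    y = y[: 2 ** J]                      # right-truncate to dyadic length 2^J
    energies = []
    for j in range(1, J + 1):
        B = 2 ** (J - j)                 # bin size B_j
        s = math.sqrt(2.0) ** (J - j + 1)  # scaling factor s_j
        E = 0.0
        for k in range(1, 2 ** (j - 1) + 1):
            # 0-based slices matching the 1-based index ranges in Equation 5
            first = sum(y[(2 * k - 2) * B:(2 * k - 1) * B])
            second = sum(y[(2 * k - 1) * B:2 * k * B])
            d = (second - first) / s     # mother wavelet coefficient d_{kj}
            E += d * d
        energies.append(E)
    return energies

# One large regime switch (padding, then high entropy): energy sits at level 1.
coarse = haar_energy_spectrum([0.0] * 16 + [8.0] * 16)

# Chunk-to-chunk flicker: energy sits at the finest level, j = 5.
fine = haar_energy_spectrum([0.0, 8.0] * 16)
```

Both synthetic streams have the same mean and the same total energy (512 squared bits), yet the energy is assigned to opposite ends of the spectrum, which is the distinction the energy features are designed to capture.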
Because these files can be analyzed at $J = 5$ different spatial resolutions, we extract 5 features from each file, with each feature representing the energy at one level of spatial resolution in the file's entropy stream.

For illustrative purposes, we begin by analyzing the wavelet energy spectrum for two files from this size category, as they embody more general trends in the energy patterns of malicious versus clean files. Figure 3 shows wavelet-based functional approximations for two different entropy streams. The left column of the plot depicts the entropy signal from File A, which is legitimate software, whereas the right column of the plot depicts the entropy signal from File B, which is malware. Reading these columns from top to bottom, we see that the wavelet transform produces successively detailed functional approximations to these files' entropy signals. The title above each subplot shows the wavelet energy, as computed in Equation (3) in the text, of the signal at a particular spatial resolution level. The wavelet energy is simply the sum of the squares of the scaled differences in the mean entropy levels, where the differences are only taken between even/odd index pairings (i.e., the algorithm takes the differences mean bin 2 − mean bin 1, mean bin 4 − mean bin 3, and so forth). Thus, we can gain some visual intuition about how the energy spectra can be derived from these successive functional approximations.

Based on this entropic energy spectrum decomposition (or distribution of energy across various levels of spatial resolution), the model believes that File A is legitimate software, whereas File B is malware. Investigating this conclusion, we see that these two files have radically different wavelet energy distributions across the 5 levels of spatial resolution.
The legitimate software (File A) has its "entropic energy" mostly concentrated at finer levels of resolution, whereas the piece of malware (File B) has its "entropic energy" mostly concentrated at coarser levels of resolution. For the clean file, the energy in the entropy stream is concentrated at the resolution levels $j = 4$ and $j = 5$ (where the energy is 34.5 and 23.84 squared bits, respectively). For the dirty file, the energy in the entropy signal is concentrated at coarser levels of analysis, peaking especially strongly at level $j = 2$ (where the energy is 139.99 squared bits).

The fit of the logistic regression model (for both raw and normalized features) is summarized in Table 1. Note that for the entire table, numbers outside the parentheses represent results for the normalized features, whereas numbers inside the parentheses represent results for raw features. The two "Energy" columns list the energy at all five levels of spatial resolution for these two files. The "Value of $\beta_j$" column describes the estimated beta weight in a logistic regression fitting file maliciousness to the five wavelet energy values, based on a corpus of n=1,599 files. The "P-value" column describes the probability of getting a test statistic at least as extreme as the one we observed (not shown; it is a function of the data) under the hypothesis that there is no relationship between energy at that level and file maliciousness. The codes are: * = p < .05, ** = p < .01, *** = p < .001, **** = p < .0001, ***** = p < .00001. The "Malware Sensitivity" represents the estimated change in the odds that a file is malware associated with an increase of one unit in the corresponding feature. It is calculated by $(e^{\beta} - 1) \times 100\%$. For the normalized values (those outside the parentheses), an increase of one unit refers to an increase of one standard deviation.
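The "Malware Sensitivity" calculation can be checked with a few lines of arithmetic. The sketch below applies $(e^{\beta} - 1) \times 100\%$ to the normalized-feature beta weights reported in Table 1. The function name `malware_sensitivity` is hypothetical, and because the published betas are rounded to three decimals, the last digit may differ slightly from the table.

```python
import math

def malware_sensitivity(beta):
    """Percent change in the odds of malware per one-unit increase in a feature."""
    return (math.exp(beta) - 1.0) * 100.0

# Normalized-feature beta weights from Table 1, resolution levels 1 to 5.
betas = [0.448, 0.174, 0.847, -0.106, -0.240]
sensitivities = [malware_sensitivity(b) for b in betas]

# Positive betas (levels 1 to 3) raise the odds of malware;
# negative betas (levels 4 and 5) lower them.
raises_odds = [s > 0 for s in sensitivities]
```

Note that the quantity is a change in odds, not in probability: a sensitivity of about +56.5% at level 1 means the odds are multiplied by roughly 1.565 per standard deviation of level-1 energy.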
Based on these logistic regression beta weight ($\beta_j$) values, we see that the two sample files from Figure 3 are indeed representative of a larger trend: having high energy at spatial resolution levels 1, 2 and 3 (the coarser levels) is associated with a higher probability of the file being malware (since those $\beta_j$'s are positive), whereas having high energy at levels 4 and 5 (the finer levels) is associated with a lower probability of the file being malicious (since those $\beta_j$'s are negative). Moreover, these associations appear to be reflective of trends in the larger population of files, since the p-values are largely strongly statistically significant. This finding makes sense if artificial encryption and compression tactics tend to elevate moderate to large sized chunks of malicious files into high entropy states.

2.3.2 Suspicious Patterns of Entropic Change Across All File Size Groups

Do the trends found in the single level analysis of n = 1,599 files hold up in the full corpus of n = 39,968 files? In particular, regardless of file size, can we corroborate the simply stated conclusion that "malware tends to concentrate entropic energy at relatively coarse levels of spatial resolution?" And if so, where is the dividing line between "coarse" and "fine"?

In Figure 4, we summarize the results of logistic regression model fits across all file size groupings. The plot shows logistic regression beta coefficients for determining the probability that a portable executable file is malware based upon the magnitude of the file's entropic energy at various levels of spatial resolution within the code.
Figure 3: Wavelet-based functional approximations, and the corresponding wavelet energy spectrum, for the entropy signals of two representative portable executable files from one file size group. [Ten panels in two columns, File A on the left and File B on the right; x-axis: File Location (Each Chunk is 256 Bytes); y-axis: Entropy. Panel titles give the wavelet energy in squared bits for File A / File B: Level 1 = 4.35 / 14.44; Level 2 = 0.8 / 139.99; Level 3 = 5.29 / 53.84; Level 4 = 34.5 / 9.75; Level 5 = 23.84 / 19.22.]

Table 1: Investigating the relationship between the entropic wavelet energy spectrum and maliciousness for files in one size group.

        Resolution        |        Energy Spectra           |      Statistical Model For File Size J = 5
Level  # Bins  Bin Size   |  File A         File B          |  Value of β_j      P-value   Malware Sensitivity
1      2       16         |  -0.39 (4.35)   -0.01 (14.44)   |  0.448 (0.017)     *****     +56.5% (+1.7%)
2      4       8          |  -0.79 (0.80)   6.27 (139.99)   |  0.174 (0.008)     *         +19.0% (+0.89%)
3      8       4          |  -0.48 (5.29)   2.18 (53.83)    |  0.847 (0.046)     *****     +133.2% (+4.74%)
4      16      2          |  1.42 (34.50)   -0.37 (9.75)    |  -0.106 (-0.008)   n.s.      -10.0% (-0.75%)
5      32      1          |  1.77 (23.84)   1.19 (19.22)    |  -0.240 (-0.030)   **        -21.4% (-2.99%)

Positive betas (red colors) mean that higher "entropic energy" at that resolution level is associated with a greater probability of being malware. Negative betas (blue colors) mean that higher "entropic energy" at that resolution level is associated with a lower probability of being malware.
For both colors, stronger intensities represent stronger magnitudes of the relationship between entropic energy and malware. Mathematically, the dot product between a file's energy spectrum and these beta weights determines the fitted probability that the file is malicious. Thus, the Danger Map can be interpreted as follows: for any file size grouping (or row), files that have high energies in the red spots and low energies in the blue spots are significantly more likely to be "dangerous." Conversely, files that have low energies in the red spots and high energies in the blue spots are significantly more likely to be "safe."

Taking this Danger Map into consideration, we draw the following conclusions:

• To a first approximation, the full analysis supports the "coarse-energy-is-bad, fine-energy-is-good" mantra (observed in Section 2.3.1's analysis of a single file-size group). Visually, most diagonal elements of the matrix are blue (and also more blue than the off-diagonals). Thus, across most file sizes, high energies at the finest level of spatial resolution appear to be indicative of file legitimacy, and high energies at coarse levels of spatial resolution are often associated with suspiciousness.

• However, what qualifies as a suspicious pattern in the wavelet decomposition of a file's entropy stream appears to be more complex than the simplistic summary above. For example, the appearance of the double diagonal bands in blue suggests somewhat regular vacillations in terms of how "suspicious" high entropic energy would look at various levels of spatial resolution. We find that the particular patterning depicted in the Danger Map provides a statistically significantly better description of malware than random (baseline-informed) guessing alone.
Likelihood ratio tests comparing the fit of the size-specific models (where the beta coefficients of each size-specific model are given by the specific colorings in the corresponding row of the Danger Map) versus the fit of models with no features (interpretable as a uniform color across rows, where the intensity of the color is determined by baseline malware rates, independent of the wavelet energy spectrum) yield the test statistics below. Moving from bottom (J=3) to top (J=15) of the figure, we have: $\chi^2(3) = 198.36$, $\chi^2(4) = 563.51$, $\chi^2(5) = 257.52$, $\chi^2(6) = 235.09$, $\chi^2(7) = 150.11$, $\chi^2(8) = 585.57$, $\chi^2(9) = 662.22$, $\chi^2(10) = 283.24$, $\chi^2(11) = 385.33$, $\chi^2(12) = 305.04$, $\chi^2(13) = 233.39$, $\chi^2(14) = 116.17$, $\chi^2(15) = 61.88$. All of these test statistics achieve statistical significance at the $\alpha = .05$ level. Moreover, even after a conservative Bonferroni's correction for simultaneous hypothesis testing (of 10 null hypotheses), we can still reject the null hypothesis of a uniform color across rows for each spatial resolution except spatial resolution level 9.

Figure 4: A "Danger Map" for entropy patterns within a piece of software. [Heat map; x-axis: Resolution Level (1 to 16); y-axis: File Size Grouping (J) (3 to 15); color scale: Beta Value (−1.0 to 2.0).] The danger map is derived from a statistical model of malware classification which learns suspicious patterns inherent within each software's entropy streams. In particular, a wavelet decomposition of these entropy streams reveals the entropic energy at various levels of resolution. The plot shows logistic regression beta coefficients for determining the probability that a portable executable file is malware based upon the magnitude of the file's entropic energy at various levels of resolution within the code.
This finding suggests that the distribution of colors in the “Danger Map” of Figure 4, while not simple enough to be easily verbalized, is unlikely to be obtainable by random chance (see footnote 4).

Footnote 4: We reject the null hypothesis that the colors in each row are uniform, and this rejection is consistent with the hypothesis that the complex patterns of colors are meaningful in predicting malware. However, we point out for the sake of completeness that this finding is also consistent with simpler but more specific hypotheses, such as that the right-most off-diagonal cell is driving the result. Ideally, a more sophisticated statistical model, well-tailored to the structure of a multi-resolution dataset, would be applied here to tease apart these remaining possibilities.

2.4 Predictive performance of the single wavelet feature

How can we use the information distributed across the “Danger Map” to construct a single number which could score a piece of software's suspiciousness based on the wavelet decomposition of its entropy signal? We studied the predictive performance of SSECS in identifying malware by constructing a hold-out test set of n = 7,991 files and found:

1. SSECS as a single feature improved predictions of malware, within a balanced sample of malware and legitimate software, from 50% to 68.7% accuracy. This makes SSECS a particularly impressive feature, considering that most machine learning models of malware consist of millions of features.

2. SSECS provides predictive information beyond what is contained in a mean entropy feature. A model with mean entropy as a single feature achieved 66.2% predictive accuracy. Thus, mean entropy is indeed also an impressive single predictor of malware (perhaps not surprisingly given its prevalence in the literature).
However, unlike mean entropy, the wavelet energy spectrum detects suspicious patterns of entropic change across the code of the executable file. We found that a 2-feature model which includes both mean entropy and SSECS achieves 73.3% predictive accuracy (so adding wavelet-based information to the model yields a 7.1 percentage-point boost in predictive accuracy beyond what is obtained by mean entropy alone).

3. SSECS provides predictive information beyond what is contained in a “standard deviation of entropy” feature. A skeptic might ask: why not simply use standard deviation, a more commonly used and more computationally straightforward measure of variation? Standard deviation is useful, but it is a relatively crude measure of variation, as it operates on only a single spatial scale. Indeed, a 2-feature model which includes both mean entropy and standard deviation achieves merely 70.4% predictive accuracy.

3. EXPERIMENT 2: LARGER-SCALE DETECTION OF PARASITIC MALWARE

In Experiment 1, we evaluated the predictive value of a single wavelet-based feature that describes how software's entropic shifts are distributed across multiple spatial scales. We found that this feature can exploit valuable information from a software's entropy signal which is relevant to malware status and which goes beyond the predictive value of the most commonly used entropy measure, mean entropy, as well as a potentially conceptually simpler measure of entropy variation, entropy standard deviation. In Experiment 2, we apply a broader system of wavelet-based features to a larger-scale malware prediction task. In particular, the task is to identify parasitic malware from a large corpus of otherwise good files. Parasitic malware generally infects existing files on a user's system, and the infected part of the file typically conceals itself through encryption or compression.
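To illustrate why a concealed segment of this kind shows up in an entropy stream, the following minimal Python sketch computes chunk-wise byte entropy for a synthetic file. The 256-byte chunk size, the function names, and the synthetic sample (padding surrounding a pseudo-random segment that stands in for encrypted or compressed content) are assumptions made for illustration only; they are not taken from the experiments reported here.

```python
import math
import random
from collections import Counter


def chunk_entropy(chunk: bytes) -> float:
    """Shannon entropy (in bits) of the byte values in one chunk."""
    counts = Counter(chunk)
    n = len(chunk)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def entropy_stream(data: bytes, chunk_size: int = 256) -> list[float]:
    """Entropy of each consecutive, non-overlapping chunk of `data`."""
    return [
        chunk_entropy(data[i:i + chunk_size])
        for i in range(0, len(data) - chunk_size + 1, chunk_size)
    ]


# Hypothetical "infected" file: padding, then pseudo-random bytes standing in
# for an encrypted or compressed payload, then padding again.
rng = random.Random(0)
hidden = bytes(rng.randrange(256) for _ in range(4 * 256))
sample = b"\x00" * (4 * 256) + hidden + b"\x00" * (4 * 256)

stream = entropy_stream(sample)
print([round(h, 2) for h in stream])
```

In this toy stream the four padded chunks on either side have zero entropy, while the four pseudo-random chunks sit well above the 6.5-bit level, so the concealed segment appears as an abrupt, sustained shift in the signal.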
Thus, if wavelet decomposition of software entropy indeed yields features which successfully track the presence of suspicious chunks of encrypted or compressed code, then these features should be particularly valuable for a parasitic detection task.

3.1 Data

Data were 699,121 samples of Portable Executable (PE) files from a Cylance repository. Of these samples, 17,605 files (2.51%) were parasitic malware, and the remaining files were legitimate software. We randomly selected 80% of the dataset for training, and the remaining 20% were allocated to the test set.

3.2 Method

To validate the utility of wavelet features in distinguishing parasitic malware from clean software, we compared four models (in the sense of types of features extracted from executable files to feed into a machine learning classifier):

1. Strings Model: A strings-only model is a common way to build features for a machine learning classifier [13]. Thus, we extract the $P_1$ = 1,117,127 most common strings observed in our corpus and use them as binary features in a predictive model.

2. Strings+Wavelet Model: We would like to investigate whether wavelet-based features can add predictive value to a strings-only model. Because of the relatively large size of the dataset ($\approx 20\times$ the size of Experiment 1), we streamline the feature generation process. Rather than computing SSECS, the energy spectrum suspiciousness score, which requires a nested modeling step, we follow the feature generation algorithm of Section 2.2 only up to Step 2.2.2, computing the wavelet energy spectrum. We then represent the wavelet energy spectrum separately for each file size group. In particular, a sample with $T$ points in its entropy stream will have $J = \lfloor \log_2 T \rfloor$ features in its wavelet energy spectrum.
If $J_{\max}$ is the maximum observed value of $J$ in the dataset, then there are $\sum_{J=1}^{J_{\max}} J = \frac{J_{\max}(J_{\max}+1)}{2}$ features, where any given sample with $T$ points in its entropy stream will only have non-zero values for $J = \lfloor \log_2 T \rfloor$ of these features (namely, for the part of the vector that corresponds to its file-size group). Although this procedure obviously creates a huge proliferation of features relative to the single SSECS feature studied in Experiment 1, the procedure is more informative and becomes more feasible as more data is collected, while simultaneously streamlining the modeling pipeline for larger datasets. Finally, we bin the wavelet energy spectrum features, which are originally continuous, to create a sparse binary dataset. In this way, we obtain 24,009 binary features derived from the wavelet energy spectrum. After adding in the strings as well, the Strings+Wavelet model includes $P_2$ = 1,141,136 binary features.

3. Strings+Entropy+Wavelet Model: The wavelet features capture some information about the entropy signal, but that information is incomplete. For example, the wavelet energy spectrum describes variation at multiple levels of resolution, but ignores first-order information (i.e., measures of central tendency, such as the mean). Thus, in an attempt to construct a more powerful predictive model from strings and the entropy signal, here we add simple summary statistics about the entropy signal: mean, standard deviation, signal-to-noise ratio, maximum entropy, percentage of the signal with “high” entropy ($\geq 6.5$ bits), percentage of the signal with zero entropy, and length and squared length of the signal. As these supplementary entropy features are relatively simple to compute, we obtain these measurements separately for each PE section.
As these features are also continuous, they are then binned through an internal binning process to create a sparse binary dataset. This procedure creates 108,835 additional features to add to the strings model (24,009 derived from the wavelet energy spectrum, and 84,826 other entropy features). All together, this model contains $P_3$ = 1,225,962 binary features.

4. Strings+Entropy Model: In order to provide a more rigorous test of the value of the wavelet features, we create a fourth model which includes strings and the summary entropy features described above, but no wavelet features. Our reasoning is that, even if the wavelet features improve the strings-only model, this improvement could, in theory, have been merely driven by the inclusion of some entropy information (or even file length). By constructing this model, we can compare the performance of the Strings+Entropy+Wavelet model with the performance of the Strings+Entropy model to answer the question: do wavelet features provide additional predictive information that goes above and beyond the information inherent in summary entropy statistics (mean, max, standard deviation, etc.)? Thus, this model includes the 84,826 summary entropy features, but not the wavelet features. All together, with the string features as well, this model contains $P_4$ = 1,201,953 features.

Because we have a large number of predictors (up to $P_{\max}$ = 1,225,962) relative to samples ($N$ = 699,121), we apply a “logistic lasso” model (i.e., $\ell_1$-penalized logistic regression) to perform classification and feature selection simultaneously. As with unregularized logistic regression, we can use the learned regression (or beta) weights as a proxy for feature importance. Since the features are all binary, each $\beta_j$, $j = 1, \ldots, P$, can be interpreted as the increase in log odds that the file is malware which is associated with the $j$th feature “turning on” (i.e., flipping from 0 to 1) while all other features stay constant. Thus, features with large positive (respectively, negative) beta weights can be considered particularly strong predictors of badness (respectively, goodness). In the results section, we explore properties of the most “influential” features, defined as the collection of 100 features with the largest positive weights and 100 features with the largest negative weights. As our purpose in this paper is to compare the effect of different feature subsets on predictive performance, and not to explore the predictive benefits of varying levels of sparsity in feature selection, we simply fix the sparsity parameter to 1.0.

3.3 Results and Discussion

In Figure 5 and Table 2, we compare the performance of the logistic lasso parasitic malware classifier using datasets with and without wavelet features. In particular, the ROC curves in Fig. 5 graphically depict performance results across a range of decision thresholds, and Table 2 highlights numerical results at particular samples of the ROC curves. The left hand column of Table 2 shows the hit rate of the model, and the right hand column shows the correct rejection rate. Each pair of rows in Table 2 can be seen as providing concrete values for samples of points from the ROC curves in Fig. 5, where the rows for each pair represent samples from the blue and red curves which have nearly aligned x-coordinates. Thus, each pair of rows describes the effect of adding wavelet features at roughly comparable tolerances for risking a false positive.

The wavelet features improved the strings-only model's ability to detect parasitics while simultaneously reducing false positives. The effect of wavelet features on detection was fairly strong for most false positive rates.
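As a minimal sketch of what comparing two models at matched false positive tolerances involves, the Python snippet below finds the best hit rate a scorer can achieve without exceeding a given false positive rate. The function names and the toy scores and labels are hypothetical and chosen purely for illustration; they are not drawn from the data or tooling used in this experiment.

```python
def rates_at_threshold(scores, labels, threshold):
    """Hit rate and false positive rate when flagging scores >= threshold."""
    flagged = [s >= threshold for s in scores]
    n_malware = sum(labels)
    n_clean = len(labels) - n_malware
    hits = sum(f and y == 1 for f, y in zip(flagged, labels))
    false_alarms = sum(f and y == 0 for f, y in zip(flagged, labels))
    return hits / n_malware, false_alarms / n_clean


def hit_rate_at_fpr(scores, labels, max_fpr):
    """Best hit rate over thresholds whose false positive rate is <= max_fpr."""
    best = 0.0
    for threshold in sorted(set(scores)):
        hit, fpr = rates_at_threshold(scores, labels, threshold)
        if fpr <= max_fpr:
            best = max(best, hit)
    return best


# Toy data (hypothetical): 1 = parasitic malware, 0 = clean software.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
baseline_scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.2, 0.1, 0.1, 0.1]
wavelet_scores = [0.9, 0.8, 0.75, 0.3, 0.7, 0.2, 0.2, 0.1, 0.1, 0.1]

print(hit_rate_at_fpr(baseline_scores, labels, 0.0),
      hit_rate_at_fpr(wavelet_scores, labels, 0.0))  # prints "0.5 0.75"
```

With these toy scores the richer model catches more malware at the same zero-false-positive budget, which is the kind of paired comparison reported in Table 2.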
For example, for false positive rates around one-third of one percent, the wavelet features boosted detection of parasitic malware from 80.90% to 82.97% despite only adding ∼24k features to the original corpus of ∼1.1 million strings. Moreover, Fig. 5 (right plot) reveals that inclusion of wavelet features boosted the parasitic detection performance of a strings-plus-entropy model in a fairly pronounced way as well. For false positive rates around .02-.03%, detection of parasitic malware jumped from 92.10% to 94.27%. For false positive rates around .77-.79%, detection of parasitic malware jumped from 98.63% to 98.90%. These results in Fig. 5 (right side) reinforce the conclusion of Experiment 1: the wavelet features capture information that goes beyond more pedestrian entropy-based information (mean, max, standard deviation, etc.). Overall, these results suggest that the wavelet energy spectrum extracted from the entropy signal of an executable file provides a useful set of features for a machine learning model for automatically detecting parasitic malware. Moreover, the predictive value of these features seems not to be redundant with other, simpler summary features derivable from the entropy signal.

In Table 3, we report some additional results about the most influential features in the various models. In the strings-only model, we found that the 100 most influential strings in terms of pushing the model towards a parasitics classification included examples such as: CreateKernelThread, Trampoline, FreeAllBuffers, VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV, UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU, SetProcessPriorityBoost, CreateProcessA, and ! Best regards 2 Tommy Salo 002E [Nov-2005] yours [Dziadulja Apanas].
For the strings+wavelet model, we see that even though the wavelet features comprise a relatively small proportion (2.1%) of the strings+wavelet model, they constitute a relatively large proportion (7.0%) of that model's set of influential features. From an adversarial point of view, it is encouraging that wavelet-based features can displace some of the importance of strings, as it is presumably easier for an evasive malware writer to alter a suggestive string such as Trampoline (the string is suspicious as it evokes derivatives of the state-sponsored Stuxnet parasitic worm) than to shift an entropic energy spectral configuration in a direction favored by a machine learning model. Finally, in the strings+wavelet+entropy model, wavelet features were also disproportionately influential on the final classification; they were about 2.5 times more likely to be influential features than would have been predicted based on their overall prevalence in the feature corpus alone.

4. GRAND DISCUSSION

All together, wavelet decompositions of software entropy seem to be useful for malware prediction tasks by capturing the degree to which a portable executable file exhibits suspicious patterns of shifting entropy within its byte-level code. In particular, we considered the problem that certain kinds of malware (e.g., parasitic malware) tend to contain chunks of encrypted and compressed code embedded in an otherwise normal-looking executable file. To address this situation, we applied a wavelet decomposition to each file's entropy stream so as to obtain each file's entropic wavelet energy spectrum.
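As a rough illustration of what such an energy spectrum captures, the following Python sketch computes a Haar wavelet energy spectrum for two toy entropy streams. It rests on several assumptions that are not confirmed by this section: a Haar mother wavelet, truncation of the stream to its first $2^J$ points with $J = \lfloor \log_2 T \rfloor$, energy at each level defined as the sum of squared detail coefficients, and levels ordered from coarsest to finest. The function name is hypothetical.

```python
import math


def haar_energy_spectrum(signal):
    """Sum of squared Haar detail coefficients per level, coarsest first."""
    if len(signal) < 2:
        raise ValueError("need at least two points")
    n_levels = len(signal).bit_length() - 1      # J = floor(log2 T)
    approx = list(signal[: 2 ** n_levels])       # keep the first 2^J points
    energies = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        energies.append(sum(d * d for d in details))  # finest level comes first
    return energies[::-1]                         # reorder: coarsest first


# Two toy entropy streams with the same mean and standard deviation:
step = haar_energy_spectrum([0.0] * 4 + [7.0] * 4)   # one large-scale shift
alternating = haar_energy_spectrum([0.0, 7.0] * 4)   # chunk-to-chunk jitter
print(step)
print(alternating)
```

The step-shaped stream concentrates all of its energy (about 98) at the coarsest level, whereas the alternating stream places the same energy at the finest level. A standard deviation feature cannot tell the two streams apart, but the energy spectrum can.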
[Figure 5 image: two ROC panels, “Strings” and “Strings + Entropy”; x-axis: False Positive Rate (%); y-axis: Hit Rate (%); legend: Wavelets (Yes / No).]

Figure 5: Performance boost on parasitic malware detection task caused by adding wavelet-based features to two different baseline feature processing methods. Performance here was measured as accuracy by a logistic lasso classifier on a hold-out test set of software samples.

Predictive Accuracy (Test Set)

| Model | Parasitic Malware | Clean Software |
|---|---|---|
| Strings | 80.90% | 99.64% |
| Strings+Wavelet | 82.97% | 99.65% |
| Strings+Entropy | 92.10% | 99.97% |
| Strings+Entropy+Wavelet | 94.27% | 99.98% |
| Strings+Entropy | 98.63% | 99.19% |
| Strings+Entropy+Wavelet | 98.90% | 99.23% |

Table 2: Wavelet-based decompositions of software entropy boost performance on a parasitic malware detection task. The left hand column shows the hit rate of the model, and the right hand column shows the correct rejection rate. Each pair of rows shows the numerical values for points that form approximate vertical slices through the red and blue ROC curves in Fig. 5. That is, each pair of rows compares hit rates on parasitic malware for approximately equal false positive rates on clean software.

Contribution of Wavelet Features

| Model | % of All Features | % of Influential Features |
|---|---|---|
| Strings+Wavelet | 2.10% | 7.00% |
| Strings+Entropy+Wavelet | 1.96% | 4.50% |

Table 3: Wavelet-based features are disproportionately likely to be influential features. As defined in Section 3.2, influential features have a particularly strong impact on the machine learning model's classification.

The entropic wavelet energy spectrum characterizes how a file distributes entropic change across multiple levels of spatial resolution. In the first study, we found that a single feature derived from wavelet decompositions of software entropy can yield valuable predictive information in a heterogeneous corpus of malware.
In the second study, we found that features derived from the wavelet decompositions boosted performance on a large-scale parasitic malware detection task, and that a classifier built solely on three types of features (strings+entropy+wavelet) can produce excellent predictive performance. In both studies, we found that the information provided by wavelet decompositions of software entropy is not merely redundant with more common measures such as mean entropy or standard deviation of the entropy.

Future research relating wavelet decompositions to malware classification in machine learning tasks might consider any of the following goals:

1. Exploit predictive value from information about the location of entropic change (perhaps as pointers for extracting further information about those parts of the file). This location of entropic change is provided in the mother wavelet coefficients across which we have marginalized to obtain the wavelet energy spectrum.

2. Apply a more powerful classifier, such as a deep-learning neural network, which could consider more complicated interactions between features when modeling the response. In addition, incorporate other classes of features (n-grams [7], statistical functions of n-grams [14], etc.). What kinds of features interact usefully with the wavelet energy spectrum in predicting malware, and what can we learn from that about the existing corpus of parasitic malware? (See footnote 5.)

3. Investigate the potential utility of non-entropic wavelet energy spectra from byte-level representations of executable files. Indeed, entropy streams are just one possible example of real-valued streams derivable from byte-level file content (see e.g. [14]), and wavelet energy spectra can be extracted from any real-valued function on the raw bytes.

Footnote 5: Note that the predictive performance of the model would likely improve by first applying appropriate dimensionality reduction techniques; see e.g. [17].

5. REFERENCES

[1] Anderson, B., Storlie, C., & Lane, T. (2012, October). Improving malware classification: bridging the static/dynamic gap. In Proceedings of the 5th ACM workshop on Security and artificial intelligence (pp. 3-14). ACM.
[2] Baysa, D., Low, R. M., & Stamp, M. (2013). Structural entropy and metamorphic malware. Journal of computer virology and hacking techniques, 9(4), 179-192.
[3] Brosch, T., & Morgenstern, M. (2006). Runtime packers: The hidden problem. Black Hat USA.
[4] Subasi, A., Yilmaz, M., & Ozcalik, H. R. (2006). Classification of EMG signals using wavelet neural network. Journal of neuroscience methods, 156(1-2), 360-367.
[5] Lyda, R., & Hamrock, J. (2007). Using entropy analysis to find encrypted and packed malware. IEEE Security & Privacy, 5(2).
[6] Kandaswamy, A., Kumar, C. S., Ramanathan, R. P., Jayaraman, S., & Malmurugan, N. (2004). Neural classification of lung sounds using wavelet coefficients. Computers in biology and medicine, 34(6), 523-537.
[7] Kolter, J. Z., & Maloof, M. A. (2004, August). Learning to detect malicious executables in the wild. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 470-478). ACM.
[8] Nason, G. (2010). Wavelet methods in statistics with R. Springer Science & Business Media.
[9] Omerhodzic, I., Avdakovic, S., Nuhanovic, A., & Dizdarevic, K. (2013). Energy distribution of EEG signals: EEG signal wavelet-neural network classifier. arXiv preprint
[10] Pati, Y. C., & Krishnaprasad, P. S. (1993). Analysis and synthesis of feedforward neural networks using discrete affine wavelet transformations. IEEE Transactions on Neural Networks, 4(1), 73-85.
[11] Sorokin, I. (2011). Comparing files using structural entropy. Journal in computer virology, 7(4), 259.
[12] Subasi, A. (2007). EEG signal classification using wavelet feature extraction and a mixture of expert model. Expert Systems with Applications, 32(4), 1084-1093.
[13] Schultz, M. G., Eskin, E., Zadok, F., & Stolfo, S. J. (2001). Data mining methods for detection of new malicious executables. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on (pp. 38-49). IEEE.
[14] Tabish, S. M., Shafiq, M. Z., & Farooq, M. (2009, June). Malware detection using statistical analysis of byte-level file content. In Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics (pp. 23-31). ACM.
[15] Wojnowicz, M., Chisholm, G., & Wolff, M. (2016, March). Suspiciously Structured Entropy: Wavelet Decomposition of Software Entropy Reveals Symptoms of Malware in the Energy Spectrum. In FLAIRS Conference (pp. 294-298).
[16] Wojnowicz, M., Chisholm, G., Wallace, B., Wolff, M., Zhao, X., & Luan, J. (2017). SUSPEND: Determining software suspiciousness by non-stationary time series modeling of entropy signals. Expert Systems with Applications, 71, 301-318.
[17] Wojnowicz, M., Zhang, D., Chisholm, G., Zhao, X., & Wolff, M. (2016, October). Projecting "better than randomly": How to reduce the dimensionality of very large datasets in a way that outperforms random projections. In Data Science and Advanced Analytics (DSAA), 2016 IEEE International Conference on (pp. 184-193). IEEE.
[18] Wojnowicz, M., Chisholm, G., Wolff, M., & Zhao, X. (2016). Wavelet decomposition of software entropy reveals symptoms of malicious code. Journal of Innovation in Digital Ecosystems, 3(2), 130-140.
