Improved evolutionary generation of XSLT stylesheets

Pablo García-Sánchez, JLJ Laredo, JP Sevilla, Pedro Castillo, JJ Merelo
Depto. Arquitectura y Tecnología de Computadoras
ETSIIT - Universidad de Granada (Spain)
E-Mail: jj@merelo.net
September 3, 2021

Abstract

This paper introduces a procedure based on genetic programming to evolve XSLT programs (usually called stylesheets or logicsheets). XSLT is a general purpose, document-oriented functional language, generally used to transform XML documents (or, in general, to solve any problem that can be coded as an XML document). The proposed solution uses a tree representation for the stylesheets as well as diverse specific operators in order to obtain, in the studied cases and in a reasonable time, an XSLT stylesheet that performs the transformation. Several types of representation have been compared, resulting in different performance and degree of success.

Keywords: genetic programming, XML, XSLT, JEO, DREAM, constrained evolutionary computation, document transformation

1 Introduction

XML (eXtensible Markup Language) [10, 14, 6] encompasses a set of specifications with different semantics but a common syntactic structure; XML documents must have a single root element and paired tags, with attributes, which can be nested. Thus, all XML documents have a tree structure (the so-called Document Object Model, or DOM, tree) with a single root element that contains (encapsulates) all the contents of the document. Optionally, the syntax or semantics of elements and attributes may be determined by a Document Type Definition (DTD) or XSchema (an equivalent concept that uses XML for its definition, [9]), in which case the document can be validated; however, in most applications what is called well-formed XML is more than enough.
Since the IT industry has settled on different XML dialects as information exchange formats, there is a business need for programs that transform from one XML set of tags to another, extracting information or combining it in many possible ways; a typical example of this transformation could be the extraction of news headlines from an Internet newspaper that uses XHTML (an XML version of the Hypertext Markup Language (HTML) used in web pages, see figure 1).

∗ Supported by projects TIN2007-68083-C02-01 and P06-TIC-02025.

    <?xml version="1.0"?>
    <html>
      <head>
        <title>Test page</title>
      </head>
      <body>
        <h1>Test page</h1>
        <h2>First test</h2>
        <p>Some stuff<br/>Some more stuff</p>
        <h2>Second test</h2>
        <h2>That's another test</h2>
      </body>
    </html>

Figure 1: An example simplified XHTML document. It looks like HTML, but it has an XML syntax: mainly, tags must be strictly paired.

XSLT stylesheets (eXtensible Stylesheet Language Transformations) [7], also called logicsheets, are designed for this purpose: applied to an XML document, they produce another. There are other possible solutions: programs written in any language that work with text as input and output, programs using regular expressions, or SAX filters [18], which process each tag in an XML document in a different way and do not need to load the whole XML document into memory. However, these need external languages to work, while XSLT is a part of the XML set of standards and, in fact, XSLT logicsheets are XML documents, which can be integrated within an XML framework; that is why XSLT is, if not the most common, at least a quite usual way of transforming XML documents.

The amount of work needed for logicsheet creation is a problem that scales quadratically with the quantity of initial and final formats. For n input and m output formats, n × m transformations will be needed¹.
Considering that each conversion is a hand-written program and that the initial and final formats can vary with a certain frequency, any automation of the process means a considerable saving of effort on the part of the programmers.

¹ If an intermediate language is used, just n + m, but this increases the complexity of the transformation and decreases its speed.

The objective of this work is to find the XSLT logicsheet that, from one input XML document, is able to obtain an output XML document that contains exclusively the information that is considered important from the original XML documents. This information may be ordered in any possible way, possibly in an order different from that of the input document. This logicsheet will be evolved using evolutionary operators that take into account the structure of the program and its components. This could be considered, in a way, Genetic Programming, since XSLT logicsheets are XML documents that have a tree structure; but, since they have to follow grammatical conventions, it is better to guide evolution using specific operators than to allow all types of GP operators.

XSLT provides a general mechanism for associating patterns in the source XML document with the application of format rules to these elements, but in order to simplify the search space for the evolutionary algorithm, only three instructions of XSLT will be used in this work: template, which sets which XML fragment will be included when the element in its match attribute is found; apply-templates, which is used to select the elements to which the transformation is going to be applied and to delegate control to the corresponding templates; and value-of², which simply includes content of the XML document in the output file.
This also implies a simplification of the general XML-to-XML transformation problem: we will just extract information from the original document, without adding new elements (tags) that did not exist in the original document. In fact, this makes the problem more similar to the creation of a scraper, or program that extracts information from legacy websites or documents. Thus, we intend this paper just as a proof of concept and initial performance measurement, whose generalization, if not straightforward, is at least possible.

XSLT stylesheets combine XSLT commands with embedded XPath [8] expressions to map XML documents into others. For instance, to extract all H2 elements in the XHTML example shown in figure 1, both XSLT logicsheets shown in figures 2 and 3 would be valid, but the second one is simpler, making use of a single XPath expression, while the other one would obtain the same result using only XSLT templates. In addition, XPath provides a way to select groups of elements (node-sets) and to filter them by using predicates, allowing, for instance, selection of the element that occupies a certain position within a node-set.

Previously, we published the initial XSLT evolution experiments [19], testing different document structures and operators. In this paper we will try to improve on those results, choosing XSLT stylesheet structure and operators so that convergence to a solution is assured. We will also try to examine the influence of the different operator rates on the result.

The rest of the paper is structured as follows: the state of the art is presented in section 2. Section 3 describes the solution presented in this work. Experiments are described in section 4, with the automatic generation of XSLT stylesheets for two examples, and finally the conclusions and possible lines of future work are presented in section 5.
² With text used for easy visualization of the final document.

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" indent='yes'/>
      <xsl:template match="/">
        <output>
          <xsl:apply-templates select='html'/>
        </output>
      </xsl:template>
      <xsl:template match='html'>
        <xsl:apply-templates select='body'/>
      </xsl:template>
      <xsl:template match='body'>
        <xsl:apply-templates select='h2'/>
      </xsl:template>
      <xsl:template match='h2'>
        <line><xsl:value-of select='.'/></line>
      </xsl:template>
    </xsl:stylesheet>

Figure 2: Example XSLT logicsheet that extracts the content of h2 tags to an XML file; each content will be contained in line tags. This example shows a structure with templates for all elements in the path that leads to the element being extracted.

2 State of the art

So far, very few papers about applying genetic programming techniques to the automatic generation of XSLT logicsheets have been published; one of these, by Scott Martens [13], presents a technique to find XSLT stylesheets that transform an XML file into HTML by using genetic programming. Martens works on simple XML documents, like the ones shown in his article, and uses the UNIX diff function as the basis for his fitness function. He concludes that genetic programming is useful to obtain solutions to simple examples of the problem, but that it needs unreasonable execution times for complex examples and might not be a suitable method to solve this kind of problem. However, computing has changed a lot in the last seven years, and the time for doing it is probably now, as we attempt to prove in this paper.
Unaware of this effort, and coming from a completely different field, Schmid and Waltermann [15] approached the problem taking into account that XSLT is a functional language, and using functional language program generation techniques on it, in what they call inductive synthesis. First they create a non-recursive program, and then, by identifying recurrent parts, convert it into a recursive program; this is a generalization of the technique used to generate programs in other programming languages such as LISP [4, 16], and used thoroughly since the eighties [5].

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" indent='yes'/>
      <xsl:template match="/">
        <output>
          <xsl:apply-templates select='/html/body/h2'/>
        </output>
      </xsl:template>
      <xsl:template match='h2'>
        <line><xsl:value-of select='.'/></line>
      </xsl:template>
    </xsl:stylesheet>

Figure 3: Another example XSLT logicsheet for extracting h2 tags, in this case using an XPath expression to process just the needed nodes.

A few other authors have approached the general problem of generating XML document transformations knowing the original and target structure of the documents, as represented by their DTDs: Leinonen et al. [12, 11] have proposed semi-automatic generation of transformations for XML documents; user input is needed to define the label association. There are also freeware programs that perform transformations on documents from one XSchema to another. However, they must know both XSchemata in advance, and are not able to accomplish general transformations on well-formed XML documents from examples.
The automatic generation of XSLT logicsheets is also a super-set of the problem of generating wrappers, that is, programs that extract information from websites, such as the one described by Ben Miled et al. in [3]. In fact, HTML is similar in structure to XML (and can actually be XML in the shape of XHTML), but these programs do not generate new data (new tags); they only extract information already existing in web sites. This is what applications such as X-Fetch Wrapper, developed by Republica³, do. The company that marketed it claims that it is able to perform transformations between any two XML formats from examples. However, it is not so clear that transformations are that straightforward: according to a white paper found at their website, it uses a document transformation language.

³ This company no longer exists, and the product seems to have been discontinued.

3 Methodology

XSLT stylesheets have been inserted into tree structures, making them evolve using variation operators. Every XSLT stylesheet is evaluated using a fitness function that is related to the difference between the generated XML and the output XML associated with the example. The solution has been programmed using JEO [2], an evolutionary algorithm library developed at the University of Granada as part of the DREAM project [1], which is available at http://www.dr-ea-m.org together with the rest of the project. All source code for the programs used to run the experiments is available from https://forja.rediris.es/websvn/wsvn/geneura/GeneradorXSLT/ under an open source licence.

The generated XML documents are encapsulated within an XML tag whose name equals the root element of the input XML; each line also uses the tag line, so that we can easily distinguish between intended and unintended (generated by default templates, for instance) output lines.
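The separation between intended and unintended output described above can be illustrated with a minimal Python sketch. It is not taken from the authors' Java code: the function name `split_output` and the sample output string are assumptions made for illustration only. Text wrapped in line tags is treated as intended, and any other text directly under the root (as a default template would emit) as unintended.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def split_output(xml_text: str) -> Tuple[List[str], List[str]]:
    """Separate <line> contents from stray text found directly under the root."""
    root = ET.fromstring(xml_text)
    # Intended output: text of every <line> child of the root element.
    intended = [(el.text or "").strip() for el in root.findall("line")]
    # Unintended output: text before the first child and after each child.
    stray_chunks = [root.text] + [child.tail for child in root]
    unintended = [c.strip() for c in stray_chunks if c and c.strip()]
    return intended, unintended


# Hypothetical stylesheet output for an input whose root element is <book>.
SAMPLE = (
    "<book>Stray heading"
    "<line>First test</line><line>Second test</line>"
    "leftover text</book>"
)

if __name__ == "__main__":
    print(split_output(SAMPLE))
```

Keeping the two kinds of text apart is what allows a line-based comparison with the target document to penalize default-template output separately from the lines a template produced on purpose.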
Next, the structures used for evolution and the operators applied to them are described. These operators work on data structures and on the XPath queries within them. The search space over possible stylesheets is exceedingly large. In addition, the language grammar must be considered in order to avoid generating syntactically wrong stylesheets. Due to this, transformations are applied to predetermined stylesheet structures, which will be described next, along with the operators that will be applied to them.

3.1 Type 1 structure

An example of this structure is shown in figure 4.

    <xsl:template match="/">
      <xsl:apply-templates select="/book"/>
    </xsl:template>
    <xsl:template match="book">
      <xsl:apply-templates select="chapter[2]"/>
      <xsl:apply-templates select="chapter[3]/para[5]"/>
      <xsl:apply-templates select="chapter[2]//line"/>
    </xsl:template>
    <xsl:template match="title">
      <line><xsl:value-of select="."/></line>
    </xsl:template>

Figure 4: Example of XSLT stylesheet of type 1.

• The XSLT logicsheet will have three levels of depth. The first level is the root element <xsl:stylesheet>, which is common to all XSLT stylesheets.
• An undetermined quantity of <xsl:template match=...> instructions hangs from the root element.
• The value of the match attribute for the first template that hangs off the root will be "/". This template and its content will never be modified by applying evolution operators. The only instruction inside this element will be apply-templates, which will have a select attribute whose value will be a "/" followed by the root element name, so that the rest of the templates included in the stylesheet will be processed.
• The values of the match attributes for the rest of the templates will simply be tag names of the input XML.
  Each of these templates will have an undetermined number of children, which will be apply-templates or value-of instructions. These instructions will have select attributes, whose values will be relative XPaths, built over the template path. Those routes may include every possible XPath clause. value-of will be used instead of apply-templates when the XPath is self (.).

This kind of structure is quite unconstrained, and relies heavily on the use of default templates. If an element is not matched, the default template, which includes the text inside the element, is applied. For the example shown in figure 4, default templates will be used for the para and chapter elements, for instance.

3.2 Type 2 structure

An example of this structure is shown in figure 5.

    <xsl:template match="/">
      <xsl:apply-templates select="/book"/>
      <xsl:apply-templates select="/book/title"/>
    </xsl:template>
    <xsl:template match="/book">
      <line><xsl:value-of select="chapter[2]"/></line>
      <line><xsl:value-of select="chapter[3]/para[5]"/></line>
      <line><xsl:value-of select="chapter[2]//line"/></line>
    </xsl:template>
    <xsl:template match="/book/title">
      <line><xsl:value-of select="."/></line>
    </xsl:template>

Figure 5: Example of XSLT stylesheet of type 2.

The main differences with the first one are:

• The value of the match attribute for the first template that hangs off the root will be "/" too, but in this case it will have an indeterminate number of children, all of them apply-templates instructions, whose values for the select attribute will be absolute XPaths in the input XML that include only single slash-separated tag names.
• The values of the match attributes for the other templates that hang from the XML root will be the same values as those of the select attributes of the apply-templates instructions in the first template.
  Therefore, there will be as many template instructions as there are apply-templates instructions in it, and they will be in the same order.
• Every template described in the previous point will have an undetermined number of children, all of them value-of instructions, where the value of the select attribute will be an XPath route relative to the absolute XPath route of the parent template. These routes may include every XPath mechanism that the designed operators allow.
• If the absolute route of a template has the maximum depth level inside the XML structure, its only value-of child will select the self element: ".".

This type of structure is more heavily constrained than Type 1; search is thus easier, since fewer stylesheets can be generated. Being more constrained, however, mutation and crossover are much more disruptive, and the fitness landscape is rougher than before.

3.3 Genetic operators

The operators may be classified into two different types: the first one consists of operators that are common to the two structures and whose purpose is to modify the XPath routes contained in the attributes of the XSLT instructions (especially apply-templates and value-of). Operators in the second group are used to modify the XSLT tree structure and take a different shape for each structure (so that the structure is kept).

In order to ensure the existence of the elements (tags) added to the XPath expressions and XSLT instruction attributes, every time one of them is needed it is randomly selected from the input file. The common operators are:

• XSLTTreeMutatorXPath(Add|Mutate|Remove)Filter: Adds, changes the number of, or removes a cardinal filter on any of the XPath tags that allow it.
  For example:
    /book/chapter → /book/chapter[4]
    /book/chapter[2] → /book/chapter[4]
    /book/chapter[2] → /book/chapter
• XSLTTreeMutatorXPathAddBranch: Adds a new tag to an XPath, chosen randomly from the existing XPaths, observing the hierarchy of the input XML file tree: /book/chapter → /book/chapter/title
• XSLTTreeMutatorXPathSetSelf: Replaces the deepest node tag of an XPath route by the self node.
• XSLTTreeMutatorXPathSetDescendant: Removes one of the intermediate tags from an XPath route, leaving a descendant-type node: /book/chapter/title → /book//title.
• XSLTTreeMutatorXPathRemoveBranch: Removes the deepest element tag of an XPath route, ascending a level in the XML tree. For example: /book/chapter/title → /book/chapter.

Other operators change the DOM structure of the XSLT logicsheet, although not all of them can be applied to all XSLT structural types:

• XSLTTreeCrossoverTemplate: Swaps template instruction sub-trees between the two parents. This is the only crossover-like operator.
• XSLTTreeMutator(Add|Mutate|Remove)Template: Inserts, changes or removes a template. Insertion is performed on the root element, matching a random element. The choice of this random element gives higher priority to the less deep tags. The position of the new template inside the tree will be randomly selected, and its content will be apply-templates or value-of tags whose select attribute contains XPath routes, relative to the XPath route of the parent template, randomly generated using the XPath operators. Change operates on a random node, generating a new sub-tree; and removal eliminates a random template (if there are more than two).
• XSLTTree(Add|Remove)Apply: Adds or removes an xsl:value-of statement to or from a randomly selected template present in the tree.
  The position of the new leaf inside the sub-tree that matches the template will also be randomly selected. The new element is randomly generated from the route contained in its parent template instruction. The Remove operator also deletes the template node if the removed child was the last remaining one, but it is not applied if there is a single template left.
• XSLTTreeMutateApply(1|2): Changes a randomly selected child (1), or creates a relative XPath from the one contained in the parent xsl:template and the XPath of the leaf that is going to be modified (2).
• XSLTTreeSetTemplateNull: Chooses a template sub-tree from the XSLT tree and replaces its content by a single instruction <xsl:value-of select=".">.

3.4 Fitness function

Fitness is related to the difference between the desired and the obtained output, but it has also been designed so that evolution is helped. Instead of using a single aggregative function, as we did in previous papers [19], fitness is now a vector that includes the number of deletions and additions needed to obtain the target output from the obtained output, and the resulting XSLT stylesheet length. The XSLT stylesheet is correct only if the number of deletions and additions is 0; minimizing length helps remove useless statements from it. So, fitness is minimized by comparing individuals as follows. An individual is considered better than another

• if the number of deletions is smaller,
• if the number of additions is smaller, the number of deletions being the same, or
• if the length is smaller, the number of deletions and additions being the same.

Separating and prioritizing the number of deletions helps guide evolution, by trying to find first a stylesheet that includes all elements in the target document, then eliminating unneeded elements while, at the same time, reducing length.
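The hierarchical comparison above maps naturally onto lexicographic tuple ordering. The following Python sketch is an illustration under stated assumptions, not the authors' JEO implementation: it assumes outputs are compared line by line, that a "deletion" is a target line missing from the obtained output (which matches the rationale of first finding a stylesheet that includes all target elements), that an "addition" is an extra line in the obtained output, and that stylesheet length is measured in characters. The names `fitness`, `is_correct` and `better` are hypothetical.

```python
import difflib
from typing import List, Tuple

# (deletions, additions, stylesheet length); smaller is better in every slot.
Fitness = Tuple[int, int, int]


def fitness(target: List[str], obtained: List[str], stylesheet: str) -> Fitness:
    """Build the fitness vector from a line diff of target vs. obtained output."""
    deletions = additions = 0
    matcher = difflib.SequenceMatcher(None, target, obtained, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deletions += i2 - i1   # target lines absent from the obtained output
        if tag in ("insert", "replace"):
            additions += j2 - j1   # obtained lines absent from the target
    # Assumption: length is counted in characters of the stylesheet text.
    return (deletions, additions, len(stylesheet))


def is_correct(f: Fitness) -> bool:
    """A stylesheet is correct only when no deletions or additions remain."""
    return f[0] == 0 and f[1] == 0


def better(a: Fitness, b: Fitness) -> bool:
    """Tuples compare lexicographically: deletions, then additions, then length."""
    return a < b


if __name__ == "__main__":
    target = ["<line>First test</line>", "<line>Second test</line>"]
    missing_one = fitness(target, target[:1], "x" * 50)
    one_extra = fitness(target, target + ["<line>noise</line>"], "x" * 80)
    print(missing_one, one_extra, better(one_extra, missing_one))
```

With this ordering, a longer stylesheet whose output contains every target line plus some noise still beats a shorter one that misses a target line, which is the prioritization the text describes.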
4 Experiments and results

To test the algorithm we have performed several experiments with different XML input files and a single XML output file. The algorithm has been executed thirty times for each input XML. Seven different input files have been used for Type 1, leaving only the hardest ones for Type 2. The same input file was used for several experiments: an RSS feed from a weblog (http://geneura.wordpress.com) and an XHTML file. All input and output files are available from our Subversion repository: https://forja.rediris.es/websvn/wsvn/geneura/GeneradorXSLT/xml/.

Table 1: Operator priorities (used for the roulette wheel that randomly selects the operator to apply) used in the experiments.

    Operator                              Priority
    XSLTTreeMutatorXPathSetSelf           0.10
    XSLTTreeMutatorXPathSetDescendant     0.24 (only Type 1)
    XSLTTreeMutatorXPathRemoveBranch      0.27 (Type 2), 0.39 (Type 1)
    XSLTTreeMutatorXPathAddBranch         0.99
    XSLTTreeMutatorXPathAddFilter         0.45 (Type 2), 0.53 (Type 1)
    XSLTTreeMutatorXPathMutateFilter      0.64 (Type 2), 0.69 (Type 1)
    XSLTTreeMutatorXPathRemoveFilter      0.83
    XSLTTreeCrossoverTemplate             0.11
    XSLTTreeMutatorAddTemplate            0.2
    XSLTTreeMutatorMutateTemplate         0.10
    XSLTTreeMutatorRemoveTemplate         0.12
    XSLTTreeAddApply                      0.1
    XSLTTreeMutateApply1                  0.1
    XSLTTreeMutateApply2                  0.14
    XSLTTreeRemoveApply                   0.1
    XSLTTreeSetTemplateNull               0.03

The computer used to perform the experiments is a Centrino Core Duo at 1.83 GHz with 2 GB RAM, running the Java Runtime Environment 1.6.0.01. The population was 128 for all runs, and the termination condition was set to 200 generations or until a solution was found. Selection was performed via a 5-tournament; 30 experiments were run, with different random seeds, for each template type and input document. The XML and XSLT processors were the default ones included in the JRE standard library.
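The roulette-wheel operator choice mentioned in the caption of Table 1 can be sketched in Python as follows. This is a minimal illustration, not the JEO code: the table does not say whether its values are relative weights or cumulative thresholds, so they are assumed here to be relative weights (Type 1 column); the four XPath operators are simplified string functions that follow the examples of section 3.3 and only handle absolute paths; and the filter position range is arbitrary, whereas the original work draws tags from the input file.

```python
import random
import re
from typing import Dict


def add_filter(xpath: str, position: int) -> str:
    """/book/chapter -> /book/chapter[4]; unchanged if already filtered."""
    return xpath if xpath.endswith("]") else f"{xpath}[{position}]"


def remove_filter(xpath: str) -> str:
    """/book/chapter[2] -> /book/chapter"""
    return re.sub(r"\[\d+\]$", "", xpath)


def remove_branch(xpath: str) -> str:
    """/book/chapter/title -> /book/chapter; a single-step path is kept."""
    head = xpath.rpartition("/")[0].rstrip("/")
    return head or xpath


def set_descendant(xpath: str, index: int = 1) -> str:
    """/book/chapter/title -> /book//title, dropping the step at `index`."""
    steps = xpath.strip("/").split("/")
    if "" in steps or not 0 < index < len(steps) - 1:
        return xpath  # already contains '//' or has no intermediate step
    return "/" + "/".join(steps[:index]) + "//" + "/".join(steps[index + 1:])


# Assumption: Table 1 values (Type 1) read as relative roulette weights.
TYPE1_WEIGHTS: Dict[str, float] = {
    "XSLTTreeMutatorXPathSetDescendant": 0.24,
    "XSLTTreeMutatorXPathRemoveBranch": 0.39,
    "XSLTTreeMutatorXPathAddFilter": 0.53,
    "XSLTTreeMutatorXPathRemoveFilter": 0.83,
}


def roulette_pick(weights: Dict[str, float], rng: random.Random) -> str:
    """Pick a key with probability proportional to its weight."""
    if not weights:
        raise ValueError("no operators to choose from")
    threshold = rng.random() * sum(weights.values())
    running = 0.0
    for name, weight in weights.items():
        running += weight
        if threshold < running:
            return name
    return name  # guard against floating-point round-off


def mutate(xpath: str, rng: random.Random) -> str:
    """Apply one roulette-selected XPath operator to an absolute path."""
    name = roulette_pick(TYPE1_WEIGHTS, rng)
    if name.endswith("SetDescendant"):
        return set_descendant(xpath)
    if name.endswith("RemoveBranch"):
        return remove_branch(xpath)
    if name.endswith("AddFilter"):
        return add_filter(xpath, rng.randint(1, 5))  # arbitrary position range
    return remove_filter(xpath)


if __name__ == "__main__":
    rng = random.Random(42)
    print([mutate("/book/chapter/title", rng) for _ in range(5)])
```

Operators that cannot be applied to a given path return it unchanged in this sketch; a fuller version would presumably retry with another operator or another node.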
The operator rates used in the experiments, which were tuned heuristically, are shown in table 1.

The new fitness function, in general, yielded better results than previously. The algorithm was able to find an adequate XSLT stylesheet within the pre-assigned number of generations in most cases. The breakdown of results per input file is shown in table 2.

Table 2: Number of times, out of 30 experiments, a solution is not found within the predefined number of generations using the type 1 XSLT structure. In general, the files are in increasing complexity order, which is why it gets harder to find a solution in the last examples.

    Input file    Times solution not found
    1             0
    2             1
    3             0
    4             0
    5             3
    6             27
    7             17

When a solution was found, the number of generations and the time used to find it also vary, as shown in figure 6. In general, the exploration/exploitation balance seems to be biased towards exploration. Because the search space is so vast and rough, after a few initial generations that create stylesheets with a small difference from the target, mutations are the main operator at work, as shown in figure 7. This last figure also shows a feature of this type of evolution: every change has a big influence on fitness, since the introduction of a single statement can add several (dozens of) lines to the output. There is no linear relation between the number of mutations needed to reach a solution and the number of insertions/deletions, which also means that a single mutation might have a big influence on fitness, while several mutations might be needed to decrease fitness by a single line.

Some additional experiments have been made using the type 2 structure; in general, problems which are difficult to attack using type 1 are not so difficult using type 2. The same number of experiments (30) has been run for every input/output file combination, but only input files #5, #6 and #7 have been used.
Results are shown in figure 8. Once again, file #6 presents the highest difficulty, but using this structure raises the number of successful experiments to 26 (out of 30); it is always able to find the solution for the other two input files.

[Figure 6 plot, titled "Computational effort"; x-axis: input file (1 to 7); y-axis: generated stylesheets, logarithmic scale.]

Figure 6: Logarithmic boxplot of the number of evaluations needed to find the correct stylesheet using the Type 1 structure. The difference between easy files (the first ones) and difficult files (the last ones) is quite clear; while just a few hundred evaluations, or at most a few thousand, are needed for files number 1 to 4, several thousand, on average, are needed for numbers 5 and 6. Only runs in which a solution was actually found have been considered to compute averages.

[Figure 7 plot, titled "Insertion/Deletion averages"; x-axis: generation; y-axis: Ins/Del.]

Figure 7: Evolution of the average number of insertions (black, line on top) and deletions (red or light gray) for a run on file #6 which was able to find a solution in around 70 generations. The number of deletions decreases in the first few generations but, after that, it proceeds more or less randomly, exploring the search space until the solution is found; the number of insertions, however, decreases a bit after the dip in deletions and then increases slowly.

[Figure 8 plot, titled "Computational effort, Type 2 stylesheet"; x-axis: input file (5 to 7); y-axis: generated stylesheets.]

Figure 8: Boxplot of the number of individuals generated to find the optimum for the Type 2 structure. File #6 presents the maximum difficulty, needing on average around 2000 individuals. Please note that, even while finding the solution more often than with the Type 1 structure, the number of evaluations needed is smaller.
In general, the structure we have come to call Type 2 beats the first one (Type 1) in success rate, number of generations/evaluations needed to achieve it, and running time. The only advantage of Type 1 over Type 2 is that it has fewer constraints and, in some cases, might obtain better results; so, in general, our advice would be to try Type 2 first and, if it does not yield a good result, to try Type 1 as well.

5 Conclusions, discussion, and future work

In this paper we present the results of an evolutionary algorithm designed to search for the XSLT logicsheet that is able to make a particular transformation from one XML document to another. One of the advantages of this application is that the resulting logicsheets can be used directly in a production environment, without the intervention of a human operator; besides, it tackles a real-world problem found in many organizations. It is also open source software, available from the Subversion repository https://forja.rediris.es/websvn/wsvn/geneura/GeneradorXSLT/xml/.

In these initial experiments we have found which kind of XSLT template structure is the most adequate for evolution, namely, one that matches the select attribute in apply-templates with the match attribute in templates, and has an indeterminate number of value-of instructions within each template; that is the one called Type 2. This result is consistent with those found in our previous paper [19]. By constraining evolution this way, we restrict the search space to a more reasonable size, and avoid the high degree of degeneracy of the problem, in which many different structures yield the same result and, if combined, would result in invalid structures. In general, we have also shown that an XSLT logicsheet can be found just from an input/output pair of XML documents for a wide range of examples, some of them particularly difficult.
The experiments have shown that the search space is particularly rough, with mutations in general leading to huge changes in fitness. The hierarchical fitness used is probably the cause of a big loss of diversity at the beginning of the evolutionary search, leading to the need for a higher level of exploration later during the algorithm run. This problem will have to be approached via explicit diversity-preservation mechanisms, or by using a multiobjective evolutionary algorithm instead of the one used now. A deeper understanding of how different operator rates affect the result will also help; for the time being, operator rate tuning has been very shallow, and geared towards obtaining the result. As such, the running times and numbers of evaluations obtained in this paper can be used as a baseline for future versions of the algorithm, or for other algorithms for the same problem. However, there are some questions and issues that will have to be addressed in future papers:

• Using the DTD (associated with an XML file) as a source of information for conversions between XML documents and for restrictions on the possible variations.
• Adding different labels in the XSLT to allow the building of different kinds of documents, such as HTML or WML.
• Considering the use of advanced XML document comparison tools (e.g. XMLdiff⁴).
• Testing evolution with other kinds of tools, such as a chain of SAX filters.
• Obviously, testing different kinds and increasingly complex sets of documents, and using several input and output documents at the same time, to test the generalization capability of the procedure.
• Using the identity transform [17] as another frame for evolution, as an alternative to the types (which we have called 1 and 2) shown here.
The identity transform puts every element found in the input document in the output document; elements can then be selectively eliminated via the addition of single statements.
• Tackling difficult problems from the point of view of a human operator. In general, the XSLT stylesheets found here could have been programmed by a knowledgeable person in around an hour, but in some cases the input/output mapping would not be so obvious at first sight. This will mean, in general, also increasing the number of XSLT statements used in the stylesheet and adding new types of operators.

Footnote 4: Available from Logilab at http://www.logilab.org/859

References

[1] M. Arenas, P. Collet, A. Eiben, M. Jelasity, J. J. Merelo, B. Paechter, M. Preuß, and M. Schoenauer. A framework for distributed evolutionary algorithms. Number 2439 in Lecture Notes in Computer Science, LNCS, pages 665–675. Springer-Verlag, September 2002.
[2] M. G. Arenas, B. Dolin, J.-J. Merelo-Guervós, P. A. Castillo, I. F. de Viana, and M. Schoenauer. JEO: Java Evolving Objects. In W. B. Langdon, E. Cantú-Paz, K. Mathias, R. Roy, D. Davis, R. Poli, K. Balakrishnan, V. Honavar, G. Rudolph, J. Wegener, L. Bull, M. A. Potter, A. C. Schultz, J. F. Miller, E. Burke, and N. Jonoska, editors, Poster Accepted at GECCO 2002, page 991, 2002.
[3] Z. Ben Miled, A. Farooq, M. Mahoui, N. Li, M. Dippold, and O. Bukhres. A wrapper induction application with knowledge base support: A use case for initiation and maintenance of wrappers. In Proceedings - BIBE 2005: 5th IEEE Symposium on Bioinformatics and Bioengineering, volume 2005, pages 65–72, 2005.
[4] A. Biermann. The inference of regular LISP programs from examples. IEEE Transactions on Systems, Man and Cybernetics, 8(8):585–600, 1978.
[5] A. W. Biermann and G. Guiho, editors. Computer Program Synthesis Methodologies. Reidel, Dordrecht, 1983.
[6] T.
Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler. Extensible markup language (XML) 1.0 (second edition). Available from http://www.w3.org/TR/2000/REC-xml-20001006, November 2000.
[7] J. Clark. XSL transformations (XSLT), version 1.0, W3C recommendation 16 November 1999. Available from http://www.w3.org/TR/xslt.html.
[8] J. Clark and S. DeRose. XML path language (XPath), version 1.0, W3C recommendation 16 November 1999. Available from http://www.w3.org/TR/xpath, November 1999.
[9] D. C. Fallside. XML Schema part 0: Primer. Available from http://www.w3.org/TR/xmlschema-0/.
[10] E. R. Harold. XML Bible. IDG Books Worldwide, 1991.
[11] E. Kuikka, P. Leinonen, and M. Penttonen. Towards automating of document structure transformations. In Proceedings of the 2002 ACM Symposium on Document Engineering, pages 103–110, 2002.
[12] P. Leinonen. Automating XML document structure transformations. In Proceedings of the 2003 ACM Symposium on Document Engineering, pages 26–28, 2003.
[13] S. Martens. Automatic creation of XML document conversion scripts by genetic programming. In Genetic Algorithms and Genetic Programming at Stanford, page 269 ff., 2000.
[14] E. T. Ray. Learning XML: creating self-describing data. O'Reilly, January 2001.
[15] U. Schmid and J. Waltermann. Automatic synthesis of XSL-transformations from example documents. In M. Hamza, editor, Artificial Intelligence and Applications Proceedings (IASTED International Conference on Artificial Intelligence and Applications (AIA 2004)), pages 252–257, 2004.
[16] P. D. Summers. A methodology for LISP program construction from examples. J. ACM, 24(1):161–175, 1977.
[17] Wikipedia. Identity transform - Wikipedia, The Free Encyclopedia, 2007. [Online; accessed 24-January-2008]: http://en.wikipedia.org/wiki/Identity_transform.
[18] Wikipedia.
Simple API for XML - Wikipedia, The Free Encyclopedia, 2007. [Online; accessed 21-March-2007].
[19] N. Zorzano, D. Merino, J. L. J. Laredo, J. P. Sevilla, P. Garcia, and J. J. Merelo. Evolving XSLT stylesheets, 2007. http://xxx.arxiv.org/abs/0712.2630.
