Authorship Attribution through Function Word Adjacency Networks

A method for authorship attribution based on function word adjacency networks (WANs) is introduced. Function words are parts of speech that express grammatical relationships between other words but do not carry lexical meaning on their own. In the WA…

Authors: Santiago Segarra, Mark Eisen, Alej

Santiago Segarra, Mark Eisen, and Alejandro Ribeiro

Abstract: A method for authorship attribution based on function word adjacency networks (WANs) is introduced. Function words are parts of speech that express grammatical relationships between other words but do not carry lexical meaning on their own. In the WANs in this paper, nodes are function words and directed edges stand in for the likelihood of finding the sink word in the ordered vicinity of the source word. WANs of different authors can be interpreted as transition probabilities of a Markov chain and are therefore compared in terms of their relative entropies. Optimal selection of WAN parameters is studied and attribution accuracy is benchmarked across a diverse pool of authors and varying text lengths. This analysis shows that, since function words are independent of content, their use tends to be specific to an author and that the relational data captured by function WANs is a good summary of stylometric fingerprints. Attribution accuracy is observed to exceed that achieved by methods that rely on word frequencies alone. Combining WANs with methods that rely on word frequencies alone yields even higher attribution accuracy, indicating that both sources of information encode different aspects of authorial styles.

I. INTRODUCTION

The discipline of authorship attribution is concerned with matching a text of unknown or disputed authorship to one of a group of potential candidates. More generally, it can be seen as a way of quantifying literary style or uncovering a stylometric fingerprint. The most traditional application of authorship attribution is literary research, but it has also been applied in forensics [2], defense intelligence [3], and plagiarism detection [4].
Both the availability of electronic texts and advances in computational power and information processing have boosted the accuracy of, and interest in, computer-based authorship attribution methods [5]–[7]. Authorship attribution dates back more than a century, to a work that proposed distinguishing authors by looking at word lengths [8]. This was later improved by [9], where the average length of sentences was considered as a determinant. A seminal development was the introduction of the analysis of function words to characterize authors' styles [10], which inspired the development of several methods. Function words are words like prepositions, conjunctions, and pronouns which on their own carry little meaning but dictate the grammatical relationships between words. The advantage of function words is that they are content independent and, thus, can carry information about the author that is not biased by the topic of the text being analyzed. Since [10], function words have appeared in a number of papers where the analysis of the frequency with which different words appear in a text plays a central role in one way or another; see, e.g., [11]–[14]. Other attribution methods include the stylometric techniques in [15], the use of vocabulary richness as a stylometric marker [16]–[18] (see also [19] for a critique), the use of stable words defined as those that can be replaced by an equivalent [20], and syntactical markers such as taggers of parts of speech [21].

Supported by NSF CAREER CCF-0952867 and NSF CCF-1217963. The authors are with the Department of Electrical and Systems Engineering, University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA 19104. Email: {ssegarra, maeisen, aribeiro}@seas.upenn.edu. Part of the results in this paper appeared in [1].

In this paper, we use function words to build stylometric fingerprints but, instead of focusing on their frequency of usage, we consider their relational structure.
We encode these structures as word adjacency networks (WANs), which are asymmetric networks that store information on the co-appearance of two function words in the same sentence (Section III). With proper normalization, edges of these networks describe the likelihood that a particular function word is encountered in the text given that we encountered another one. In turn, this implies that WANs can be reinterpreted as Markov chains describing transition probabilities between function words. Given this interpretation, it is natural to measure the dissimilarity between different texts in terms of the relative entropy between the associated Markov chains (Section III-A). Markov chains have also been used as a tool for authorship attribution in [22], [23]. However, the chains in these works represent transitions between letters, not words. Although there is little intuitive reasoning behind the notion that an author's style can be modeled by the usage of individual letters, these approaches generate somewhat positive results.

The classification accuracy of WANs depends on various parameters regarding the generation of the WANs as well as the selection of words chosen as network nodes. We consider the optimal selection of these parameters and develop an adaptive strategy to pick the best network node set given the texts to attribute (Section IV). Using a corpus composed of texts by 21 authors from the 19th century, we illustrate the implementation of our method and analyze the changes in accuracy when modifying the number of candidate authors as well as the length of the texts of known (Section V-A) and unknown (Section V-B) authorship. Further, we analyze how the similarity of styles between two authors influences the accuracy when distinguishing their texts (Section V-C).
We then incorporate authors from the early 17th century into the corpus and analyze how differences in time period, genre, and gender influence the classification rate of WANs (Sections VI-A to VI-C). We also show that WANs can be used to detect collaboration between several authors (Section VI-D). We further demonstrate that our classifier performs better than techniques based on function word frequencies alone (Section VII). Perhaps more importantly, we show that the stylometric information captured by WANs is not the same as the information captured by word frequencies. Consequently, their combination results in a further increase in classification accuracy.

II. PROBLEM FORMULATION

We are given a set of $n$ authors $A = \{a_1, a_2, \ldots, a_n\}$, a set of $m$ known texts $T = \{t_1, t_2, \ldots, t_m\}$, and a set of $k$ unknown texts $U = \{u_1, u_2, \ldots, u_k\}$. We are also given an authorship attribution function $r_T : T \to A$ mapping every known text in $T$ to its corresponding author in $A$, i.e., $r_T(t) \in A$ is the author of text $t$ for all $t \in T$. We further assume $r_T$ to be surjective, which implies that for every author $a_i \in A$ there is at least one text $t_j \in T$ with $r_T(t_j) = a_i$. Denote by $T^{(i)} \subset T$ the subset of known texts written by author $a_i$, i.e.,

$$T^{(i)} = \{\, t \mid t \in T,\ r_T(t) = a_i \,\}. \tag{1}$$

According to the above discussion, it must be that $|T^{(i)}| > 0$ for all $i$, and $\{T^{(i)}\}_{i=1}^{n}$ must be a partition of $T$. In Section III, we use the texts contained in $T^{(i)}$ to generate a relational profile for author $a_i$.

There exists an unknown attribution function $r_U : U \to A$ which assigns each text $u \in U$ to its actual author $r_U(u) \in A$. Our objective is to approximate this unknown function with an estimator $\hat{r}_U$ built with the information provided by the attribution function $r_T$. In particular, we construct word adjacency networks (WANs) for the known texts $t \in T$ and unknown texts $u \in U$.
We attribute texts by comparing the WANs of the unknown texts $u \in U$ to the WANs of the known texts $t \in T$.

In constructing WANs, the concepts of sentence, proximity, and function words are important. Every text consists of a sequence of sentences, where a sentence is defined as an indexed sequence of words between two stopper symbols. We think of these symbols as grammatical sentence delimiters, but this is not required. For a given sentence, we define a directed proximity between two words, parametric on a discount factor $\alpha \in (0, 1)$ and a window length $D$. If we denote by $i(\omega)$ the position of word $\omega$ within its sentence, the directed proximity $d(\omega_1, \omega_2)$ from word $\omega_1$ to word $\omega_2$ when $0 < i(\omega_2) - i(\omega_1) \le D$ is defined as

$$d(\omega_1, \omega_2) := \alpha^{\,i(\omega_2) - i(\omega_1) - 1}. \tag{2}$$

In every sentence there are two kinds of words: function and non-function words [24]. While in (2) the words $\omega_1$ and $\omega_2$ need not be function words, in this paper we are interested only in the case in which both $\omega_1$ and $\omega_2$ are function words. Function words are words that express primarily a grammatical relationship. These words include conjunctions (e.g., and, or), prepositions (e.g., in, at), quantifiers (e.g., some, all), modals (e.g., may, could), and determiners (e.g., the, that). We exclude gender-specific pronouns (he, she) as well as pronouns that depend on narration type (I, you) from the set of function words to avoid biased similarity between texts written using the same grammatical person (see Section III for details). The 30 function words that appear most often in our experiments are listed in Table I.

TABLE I: Most common function words in analyzed texts.
the, and, a, of, to, in, that, with, as, it,
for, but, at, on, this, all, by, which, they, so,
from, no, or, one, what, if, an, would, when, will

The concepts of sentence, proximity, and function words are illustrated in the following example.

Example 1. Define the set of stopper symbols as $\{\,.\ ;\,\}$, let the parameter $\alpha = 0.8$, the window $D = 4$, and consider the text

"A swarm in May is worth a load of hay; a swarm in June is worth a silver spoon; but a swarm in July is not worth a fly."

The text is composed of three sentences separated by the delimiter $\{\,;\,\}$. We then divide the text into its three constituent sentences and highlight the function words (a, in, of, and but):

a swarm in May is worth a load of hay
a swarm in June is worth a silver spoon
but a swarm in July is not worth a fly

The directed proximity from the first a to swarm in the first sentence is $\alpha^0 = 1$, and the directed proximity from the first a to in is $\alpha^1 = 0.8$. The directed proximity to worth or load is 0 because the indices of these words differ by more than $D = 4$.

Define the classification accuracy as the fraction of unknown texts that are correctly attributed. With $\mathbb{I}$ denoting the indicator function, we can write the classification accuracy $\rho$ as

$$\rho(\hat{r}_U) = \frac{1}{k} \sum_{u \in U} \mathbb{I}\{ \hat{r}_U(u) = r_U(u) \}. \tag{3}$$

We use $\rho(\hat{r}_U)$ to gauge the performance of the classifier in Sections IV to VII.

III. FUNCTION WORDS ADJACENCY NETWORKS

As relational structures, we construct WANs for each text. These weighted and directed networks contain function words as nodes. The weight of a given edge represents the likelihood of finding the words connected by this edge close to each other in the text. Formally, from a given text $t$ we construct the network $W_t = (F, Q_t)$, where $F = \{f_1, f_2, \ldots, f_f\}$ is the set of nodes, composed of a collection of function words common to all WANs being compared, and $Q_t : F \times F \to \mathbb{R}_+$ is a similarity measure between pairs of nodes. Methods to select the elements of the node set $F$ are discussed in Section IV.

In order to calculate the similarity function $Q_t$, we first divide the text $t$ into sentences $s_t^h$, where $h$ ranges from 1 to the total number of sentences.
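The sentence splitting and the directed proximity of (2) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `split_sentences` and `directed_proximity`, the regular-expression tokenizer, the lowercasing, and the 0-based positions are assumptions made here and are not specified in the paper. Following Example 1, the proximity is taken to be 0 outside the window.

```python
import re

def split_sentences(text, stoppers=".;"):
    """Split a text into sentences (lists of lowercase words) at stopper symbols."""
    pattern = "[" + re.escape(stoppers) + "]"
    sentences = []
    for chunk in re.split(pattern, text):
        # Assumed tokenizer: runs of letters and apostrophes count as words.
        words = re.findall(r"[a-z']+", chunk.lower())
        if words:
            sentences.append(words)
    return sentences

def directed_proximity(i1, i2, alpha, D):
    """Directed proximity (2) from position i1 to position i2 of one sentence."""
    gap = i2 - i1
    if 0 < gap <= D:
        return alpha ** (gap - 1)
    return 0.0  # outside the window, or not in forward order

text = ("A swarm in May is worth a load of hay; a swarm in June is worth "
        "a silver spoon; but a swarm in July is not worth a fly.")
first = split_sentences(text)[0]
source = first.index("a")  # position of the first "a"
for sink in ("swarm", "in", "worth", "load"):
    print(sink, directed_proximity(source, first.index(sink), alpha=0.8, D=4))
```

With the parameters of Example 1, the loop gives 1.0 for swarm, 0.8 for in, and 0.0 for worth and load, in agreement with the values stated there.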
We denote by $s_t^h(e)$ the word in the $e$-th position within sentence $h$ of text $t$. In this way, we define

$$Q_t(f_i, f_j) = \sum_{h,e} \mathbb{I}\{ s_t^h(e) = f_i \} \sum_{d=1}^{D} \alpha^{d-1}\, \mathbb{I}\{ s_t^h(e+d) = f_j \}, \tag{4}$$

for all $f_i, f_j \in F$, where $\alpha \in (0, 1)$ is the discount factor that decreases the assigned weight as the words are found further apart from each other and $D$ is the window limit within which two words are considered related. The similarity measure in (4) is the sum of the directed proximities from $f_i$ to $f_j$ defined in (2) over all appearances of $f_i$ in which the two words are found at most $D$ positions apart in the same sentence. Since in general $Q_t(f_i, f_j) \neq Q_t(f_j, f_i)$, the WANs generated are directed.

Example 2. Consider the same text and parameters of Example 1. There are four function words, yielding the set $F = \{\text{a}, \text{in}, \text{of}, \text{but}\}$. The matrix representation of the similarity function $Q_t$ is

$$Q_t = \begin{array}{c|cccc} & \text{a} & \text{in} & \text{of} & \text{but} \\ \hline \text{a} & 0 & 3 \times 0.8^{1} & 0.8^{1} & 0 \\ \text{in} & 2 \times 0.8^{3} & 0 & 0 & 0 \\ \text{of} & 0 & 0 & 0 & 0 \\ \text{but} & 1 & 0.8^{2} & 0 & 0 \end{array} \tag{5}$$

The total similarity value from a to in is obtained by summing up the three $0.8^1$ proximity values, one from each sentence. Although the word a appears twice in every sentence, $Q(\text{a}, \text{a}) = 0$ because its appearances are more than $D = 4$ words apart.

Using text WANs, we generate a network $W_c$ for every author $a_c \in A$ as $W_c = (F, Q_c)$, where

$$Q_c = \sum_{t \in T^{(c)}} Q_t. \tag{6}$$

Similarities in $Q_c$ depend on the number and length of the texts written by author $a_c$. This is undesirable since we want to be able to compare relational structures among different authors. Hence, we normalize the similarity measures as

$$\hat{Q}_c(f_i, f_j) = \frac{Q_c(f_i, f_j)}{\sum_{j} Q_c(f_i, f_j)}, \tag{7}$$

for all $f_i, f_j \in F$. In this way, we obtain normalized networks $\hat{P}_c = (F, \hat{Q}_c)$ for each author $a_c$. In (7) we assume that there is at least one positively weighted edge out of every node $f_i$ so that we are not dividing by zero.
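As a rough check on (4), the following Python sketch rebuilds the unnormalized similarities of Example 2 from the text of Example 1. The function name `build_wan`, the nested-dictionary layout, and the regular-expression tokenizer are assumptions made for illustration; the paper does not prescribe an implementation.

```python
import re

def build_wan(text, function_words, alpha, D, stoppers=".;"):
    """Return the similarities of (4) as a nested dict Q[f_i][f_j]."""
    F = list(function_words)
    Q = {fi: {fj: 0.0 for fj in F} for fi in F}
    for chunk in re.split("[" + re.escape(stoppers) + "]", text):
        words = re.findall(r"[a-z']+", chunk.lower())  # assumed tokenizer
        for e, source in enumerate(words):
            if source not in Q:
                continue
            # Look at most D positions ahead, staying inside the sentence.
            for d in range(1, D + 1):
                if e + d >= len(words):
                    break
                sink = words[e + d]
                if sink in Q:
                    Q[source][sink] += alpha ** (d - 1)
    return Q

text = ("A swarm in May is worth a load of hay; a swarm in June is worth "
        "a silver spoon; but a swarm in July is not worth a fly.")
Q = build_wan(text, ["a", "in", "of", "but"], alpha=0.8, D=4)
for f in Q:
    print(f, [round(Q[f][g], 3) for g in Q])
```

Up to floating-point rounding, the printed rows agree with (5): for instance, the entry from a to in is $3 \times 0.8 = 2.4$ and the entry from in to a is $2 \times 0.8^3 = 1.024$. Summing such dictionaries entrywise over the texts of one author would give $Q_c$ in (6).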
If this is not the case for some function word $f_i$, we fix $\hat{Q}_c(f_i, f_j) = 1/|F|$ for all $f_j$.

Example 3. By applying normalization (7) to the similarity function in Example 2, we obtain the following normalized similarity matrix

$$\hat{Q}_t = \begin{array}{c|cccc} & \text{a} & \text{in} & \text{of} & \text{but} \\ \hline \text{a} & 0 & 0.75 & 0.25 & 0 \\ \text{in} & 1 & 0 & 0 & 0 \\ \text{of} & 0.25 & 0.25 & 0.25 & 0.25 \\ \text{but} & 0.61 & 0.39 & 0 & 0 \end{array} \tag{8}$$

Similarity $\hat{Q}_t$ no longer depends on the length of the text $t$ but on the relative frequency of the co-appearances of function words in the text.

Our claim is that every author $a_c$ has an inherent relational structure $P_c$ that serves as an authorial fingerprint and can be used towards the solution of authorship attribution problems. The network $\hat{P}_c = (F, \hat{Q}_c)$ estimates $P_c$ with the available known texts written by author $a_c$.

A. Network Similarity

The normalized networks $\hat{P}_c$ can be interpreted as discrete time Markov chains (MCs) since the similarities out of every node sum up to 1. Thus, the normalized similarity between words $f_i$ and $f_j$ is a measure of the probability of finding $f_j$ in the words following an encounter of $f_i$. In a similar manner, we can build an MC $P_u$ for each unknown text $u \in U$. Since every MC has the same state space $F$, we use the relative entropy $H(P_1, P_2)$ as a dissimilarity measure between the chains $P_1$ and $P_2$. The relative entropy is given by

$$H(P_1, P_2) = \sum_{i,j} \pi(f_i)\, P_1(f_i, f_j) \log \frac{P_1(f_i, f_j)}{P_2(f_i, f_j)}, \tag{9}$$

where $\pi$ is the limiting distribution of $P_1$ and we consider $0 \log 0$ to be equal to 0. The choice of $H$ as a measure of dissimilarity is not arbitrary. In fact, if we denote by $w_1$ a realization of the MC $P_1$, $H(P_1, P_2)$ is proportional to the logarithm of the ratio between the probability that $w_1$ is a realization of $P_1$ and the probability that $w_1$ is a realization of $P_2$.
In particular, when $H(P_1, P_2)$ is null, the ratio is 1, meaning that a given realization of $P_1$ has the same probability of being observed in both MCs [25]. We point out that relative entropy measures have also been used to compare vectors of function word frequencies [26]. This is unrelated to their use here as measures of the relational information captured in function WANs.

Using (9), we generate the attribution function $\hat{r}_U(u)$ by assigning the text $u$ to the author with the most similar relational structure,

$$\hat{r}_U(u) = a_p, \quad \text{where } p = \operatorname*{argmin}_{c} H(P_u, \hat{P}_c). \tag{10}$$

Whenever a transition between words appears in an unknown text but not in a profile, the relative entropy in (10) takes an infinite value for the corresponding author. In practice, we compute the relative entropy in (9) by summing only over the non-zero transitions in the profiles,

$$H(P_1, P_2) = \sum_{i,j \,\mid\, P_2(f_i, f_j) \neq 0} \pi(f_i)\, P_1(f_i, f_j) \log \frac{P_1(f_i, f_j)}{P_2(f_i, f_j)}. \tag{11}$$

Observe that if there is a transition between words that appears often in the text $P_1$ but never in the profile $P_2$, the expression in (11) skips the corresponding relative entropy summand. This is undesirable because the frequent appearance of this transition in the text network $P_1$ is a strong indication that this text was not written by the author whose profile network is $P_2$. The expression in (9) would capture this difference by producing an infinite value for the relative entropy. However, this infinite value is still produced if a transition between words does not appear in the author profile $P_2$ and appears just once in the text $P_1$. In this case, the null contribution to the relative entropy in (11) is more reasonable than the infinite contribution in (9) because the rarity of the transition in both texts is an indication that the text and the profile belong to the same author. Our experiments show that the latter situation is more common than the former.
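The normalization (7), the truncated relative entropy (11), and the attribution rule (10) can be combined into a short Python sketch. It is only a sketch under stated assumptions: matrices are plain lists of rows indexed in a fixed order of $F$, the helper names (`normalize`, `limiting_distribution`, `relative_entropy`, `attribute`) are hypothetical, and the limiting distribution is approximated by power iteration from a uniform start, which presumes an ergodic chain.

```python
import math

def normalize(Q):
    """Row-normalize as in (7); rows without outgoing weight become uniform."""
    n = len(Q)
    P = []
    for row in Q:
        total = sum(row)
        P.append([q / total for q in row] if total > 0 else [1.0 / n] * n)
    return P

def limiting_distribution(P, iterations=1000):
    """Approximate pi by power iteration (assumes the chain is ergodic)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def relative_entropy(P1, P2):
    """Relative entropy (11): summands with P2 = 0 are skipped, 0 log 0 = 0."""
    pi = limiting_distribution(P1)
    H = 0.0
    for i, row in enumerate(P1):
        for j, p1 in enumerate(row):
            if p1 > 0 and P2[i][j] > 0:
                H += pi[i] * p1 * math.log(p1 / P2[i][j])
    return H

def attribute(P_u, profiles):
    """Rule (10): author whose profile minimizes the relative entropy."""
    return min(profiles, key=lambda c: relative_entropy(P_u, profiles[c]))

# Unnormalized matrix of Example 2, rows and columns ordered (a, in, of, but).
Q_example = [[0, 2.4, 0.8, 0], [1.024, 0, 0, 0], [0, 0, 0, 0], [1, 0.64, 0, 0]]
for row in normalize(Q_example):
    print([round(p, 2) for p in row])
```

Applied to the matrix of Example 2, `normalize` reproduces the entries of (8), including the uniform row for of. In `attribute`, `profiles` is assumed to be a dictionary from author labels to normalized matrices.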
Transitions rare enough so as not to appear in a profile are, for the most part, also infrequent in all texts. This is reasonable because rare combinations of function words are properties of the language more than of individual authors. We have also explored the use of Laplace smoothing to avoid infinite entropies (see, e.g., [27, Chapter 13]), but (11) still achieves the best results in practice.

We proceed to specify the selection of function words in $F$ as well as the choice of the parameters $\alpha$ and $D$ after the following remark.

Remark 1. For the relative entropies in (10) to be well defined, the MCs $P_u$ associated with the unknown texts have to be ergodic to ensure that the limiting distributions $\pi$ in (9) and (11) are unique. This is true if the texts that generated $P_u$ are sufficiently long. If this is not true for a particular network, we replace $\pi(f_i)$ with the expected fraction of time a randomly initialized walk spends in state $f_i$. The random initial function word is drawn from a distribution given by the word frequencies in the text.

IV. SELECTION OF FUNCTION WORDS AND WAN PARAMETERS

The classification accuracy of the function WANs introduced in Section III depends on the choice of several variables and parameters: the set of sentence delimiters or stopper symbols, the window length $D$, the discount factor $\alpha$, and the set of function words $F$ defining the nodes of the adjacency networks. In this section, we study the selection of these parameters to maximize classification accuracy.

The selections of stopper symbols and window lengths are not critical. As stoppers we include the grammatical sentence delimiters '.', '?', and '!', as well as semicolons ';', to form the stopper set $\{\,.\ ?\ !\ ;\,\}$. We include semicolons since they are used primarily to connect two independent clauses [24].
In any event, the inclusion or not of the semicolon as a stopper symbol entails a minor change in the generation of WANs due to its infrequent use. As window length we pick $D = 10$, i.e., we consider that two words are not related if they appear more than 10 positions apart from each other. Larger values of $D$ lead to higher computational complexity without an increase in accuracy, since grammatical relations between words more than 10 positions apart are rare.

In order to choose which function words to include when generating the WANs, we present two different approaches: a static methodology and an adaptive strategy. The static approach consists of picking the function words most frequently used in the union of all the texts being considered in the attribution, i.e., all those that we use to build the profiles and those being attributed. By using the most frequent function words we base the attribution on repeated grammatical structures and limit the influence of noise introduced by unusual sequences of words, which are not consistent stylometric markers. In our experiments, we see that selecting a number of function words between 40 and 70 yields optimal accuracy. By way of illustration, we consider in Fig. 1a the attribution of 1,000 texts of length 10,000 words among 7 authors chosen at random from our pool of 19th century authors [28] for a fixed value of $\alpha = 0.75$ and profiles of 100,000 words (see also Section V for a description of the corpus). The solid line in this figure represents the accuracy achieved when using a network composed of the $n$ most common function words in the texts analyzed, for $n$ going from 2 to 100. Accuracy is maximal when we use exactly 50 function words, but the differences are minimal and likely due to random variations for values of $n$ between $n = 42$ and $n = 66$. The flatness of the accuracy curve is convenient because it shows that the selection of $n$ is not critical.
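The static selection described above amounts to a frequency count restricted to a candidate list. A minimal Python sketch follows; the function name `most_common_function_words` and the idea of passing in a precompiled list of candidate function words are assumptions made here, since the paper does not describe how the candidate list is stored.

```python
from collections import Counter

def most_common_function_words(texts, candidate_function_words, n):
    """Return the n candidates used most often in the union of the texts.

    texts: iterable of texts, each given as a list of lowercase words
           (profile texts and texts to attribute alike).
    """
    candidates = set(candidate_function_words)
    counts = Counter()
    for words in texts:
        counts.update(w for w in words if w in candidates)
    return [w for w, _ in counts.most_common(n)]

texts = [["the", "cat", "and", "the", "dog"], ["of", "the", "and"]]
print(most_common_function_words(texts, ["the", "and", "of", "in"], n=2))
```

For the toy input shown, the two selected words are the and and. The resulting list would then serve as the node set $F$ for every WAN in the attribution problem.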
In this particular example we can choose any value between, say, $n = 45$ and $n = 60$ without affecting reliability. In a larger test where we also vary the length of the profiles, the length of the texts attributed, and the number of candidate authors, we find that including 60 function words is empirically optimal.

The adaptive approach still uses the most common function words but adapts the number of function words used to the specific attribution problem. In order to choose the number of function words, we implement repeated leave-one-out cross validation as follows. For every candidate author $a_i \in A$, we concatenate all the known texts $T^{(i)}$ written by $a_i$ and then break up this collection into $N$ pieces of equal length. We build a profile for each author by randomly picking $N - 1$ pieces for each of them. We then attribute the unused pieces between the authors utilizing WANs of $n$ function words for $n$ varying in a given interval $[n_{\min}, n_{\max}]$. We perform $M$ of these cross validation rounds, in which we change the random selection of the $N - 1$ pieces that build the profiles. The value of $n$ that maximizes accuracy across these $M$ trials is selected as the number of nodes for the WANs. We perform attributions using the corresponding $n$-word WANs for the profiles as well as for the texts to be attributed. In our numerical experiments we have found that using $N = 10$, $n_{\min} = 20$, $n_{\max} = 80$, and $M$ varying between 10 and 100, depending on the available computation time, is sufficient to find values of $n$ that yield good performance.

The dashed line in Fig. 1a represents the accuracy obtained by implementing the adaptive strategy with $N = 10$, $n_{\min} = 20$, $n_{\max} = 80$, and $M = 100$ for the same attribution problem considered in the static method, i.e., attribution of 1,000 texts of length 10,000 words among 7 authors for $\alpha = 0.75$ and profiles of 100,000 words.
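The cross validation loop described above can be sketched as follows. The sketch makes several assumptions that are not in the paper: the function name `select_network_size`, the representation of texts as lists of words, the tie-breaking rule, and in particular the `attribute` argument, a hypothetical callable standing in for whatever WAN-based classifier is used.

```python
import random

def select_network_size(pieces, attribute, n_min=20, n_max=80, M=10, seed=0):
    """Pick n in [n_min, n_max] with the best cross validation accuracy.

    pieces:    dict author -> list of N equal-length pieces (lists of words).
    attribute: hypothetical callable (profiles, text, n) -> predicted author,
               where profiles maps each author to one list of words.
    """
    rng = random.Random(seed)
    correct = {n: 0 for n in range(n_min, n_max + 1)}
    for _ in range(M):
        held_out, profiles = {}, {}
        for author, parts in pieces.items():
            k = rng.randrange(len(parts))  # piece left out of the profile
            held_out[author] = parts[k]
            profiles[author] = [w for i, p in enumerate(parts) if i != k for w in p]
        for n in correct:
            for author, text in held_out.items():
                if attribute(profiles, text, n) == author:
                    correct[n] += 1
    # Ties are broken in favor of the smallest n (an arbitrary choice).
    return max(correct, key=correct.get)
```

The defaults mirror the values $n_{\min} = 20$, $n_{\max} = 80$, and $M = 10$ quoted in the text; the number of pieces $N$ is implied by the length of each list in `pieces`.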
The accuracy is very similar to the best correct classification rate achieved by the static method. This is true not just of this particular example but also in general. The static approach is faster because it requires no online training to select the number of words $n$ to use in the WANs. The adaptive strategy is suitable for a wider range of problems because it makes fewer assumptions than the static method about the best structure to differentiate between the candidate authors. E.g., when shorter texts are analyzed, experiments show that the optimal static method uses slightly fewer than 60 words. Likewise, the optimal choice of the number of words in the WANs changes slightly with the time period of the authors, the specific authors considered, and the choice of parameter $\alpha$. These changes are captured by the adaptive approach. We advocate adaptation in general and reserve the static method for rapid attribution of texts or cases when the number of texts available to build profiles is too small for effective cross validation.

(a) Attribution accuracy as a function of the network size (horizontal axis: number of function words; vertical axis: accuracy).
(b) Attribution accuracy as a function of the discount factor $\alpha$ (horizontal axis: $\alpha$; vertical axis: accuracy).
Fig. 1: Both figures present the accuracy for the attribution of 1,000 texts of length 10,000 words among 7 authors chosen at random with 100,000-word profiles. (a) The solid line represents the accuracy achieved for static networks of increasing size. The dashed line is the accuracy obtained by the adaptive method. (b) Accuracy is maximized for values of the discount factor $\alpha$ in the range between 0.70 and 0.85.

To select the decay parameter, we use the adaptive leave-one-out cross validation method for different values of $\alpha$ and study the variation of the correct classification rate as $\alpha$ varies. In Fig. 1b we show the variation of the correct classification rate with $\alpha$ when attributing 1,000 texts of length 10,000 words between 7 authors of the 19th century picked at random from our text corpus [28] using profiles with 100,000 words (see also Section V for a description of the corpus). As in the case of the number of words used in the WANs, there is a wide range of values for which variations are minimal and likely due to randomness. This range lies approximately between $\alpha = 0.7$ and $\alpha = 0.85$. In a larger test where we also vary text and profile lengths as well as the number of candidate authors, we find that $\alpha = 0.75$ is optimal. We found no gains from an adaptive method to choose $\alpha$.

V. ATTRIBUTION ACCURACY

Henceforth, we fix the WAN generation parameters to the optimal values found in Section IV, i.e., the set of sentence delimiters is $\{\,.\ ?\ !\ ;\,\}$, the discount factor is $\alpha = 0.75$, and the window length is $D = 10$. The set of function words $F$ is picked adaptively for every attribution problem by performing $M = 10$ cross validation rounds.

The text corpus used for the simulations consists of authors from two different periods [28]. The first group corresponds to 21 authors spanning the 19th century, both American (such as Nathaniel Hawthorne and Herman Melville) and British (such as Jane Austen and Charles Dickens). For these 21 authors, we have an average of 6.5 books per author, with a minimum of 4 books for Charlotte Bronte and a maximum of 10 books for Herman Melville and Mark Twain. In terms of words, this translates into an average of 560,000 words available per author, with a minimum of 284,000 words for Louisa May Alcott and a maximum of 1,096,000 for Mark Twain.
The second group of authors corresponds to 7 Early Modern English playwrights spanning the late 16th century and the early 17th century, namely William Shakespeare, George Chapman, John Fletcher, Ben Jonson, Christopher Marlowe, Thomas Middleton, and George Peele. For these authors we have an average of 22 plays per author, with a minimum of 4 plays for Peele and a maximum of 47 plays written either completely or partially by Fletcher. In terms of word length, we have an average of 400,000 words per author, with a minimum of 50,000 for Peele and a maximum of 900,000 for Fletcher.

To illustrate authorship attribution with function WANs, we solve an authorship attribution problem with two candidate authors: Mark Twain and Herman Melville. For each candidate author we are given five known texts and are asked to attribute ten unknown texts, five of which were written by Twain while the other five belong to Melville [28]. Every text in this attribution belongs to a different book and corresponds to a 10,000-word extract, i.e., around 25 pages of a midsize paperback edition. The five known texts from each author are used to generate corresponding profiles as described in Section III. Relative entropies in (11) from each of the ten unknown texts to each of the two resulting profiles are then computed. Since relative entropies are not metrics, we use multidimensional scaling (MDS) [29] to embed the two profiles and the ten unknown texts in a 2-dimensional Euclidean metric space with minimum distortion. The result is illustrated in Fig. 2a. Twain's and Melville's profiles are depicted as red and blue filled circles, respectively. Unknown texts are depicted as empty circles, where the color indicates the real author, i.e., red for Twain and blue for Melville. A solid black line composed of points equidistant to both profiles is also plotted. This line delimits the two half planes that result in attribution to one author or the other. From Fig. 2a, we see that the attribution is perfect for these two authors. All red (Twain) empty circles fall in the half plane closer to the filled red circle and all blue (Melville) empty circles fall in the half plane closer to the filled blue circle. We emphasize that the WAN attributions are not based on these Euclidean distances but on the non-metric dissimilarities given by the relative entropies. Since the number of points is small, the MDS distortion is minor and the distances in Fig. 2a are close to the relative entropies. The latter separate the points better, i.e., relative entropies are smaller for texts of the same author and larger for texts of different authors.

(a) MDS representation for two authors.
(b) MDS representation for three authors.
Fig. 2: (a) Perfect accuracy is attained for two candidate authors. Every empty circle falls in the half plane corresponding to the filled circle of its color. (b) One mistake is made for three authors. One green empty circle falls in the region attributable to the blue author.

We also illustrate an attribution between three authors by creating a profile for Jane Austen using five 10,000-word excerpts and adding five 10,000-word excerpts of texts written by Jane Austen to the ten excerpts to attribute from Twain's and Melville's books. We then perform an attribution of the 15 texts to the three profiles constructed. An approximate MDS representation of the relative entropies between texts and profiles is shown in Fig. 2b, where filled circles represent profiles and empty circles represent texts. Different colors are used to distinguish Twain (red), Melville (blue), and Austen (green). We also plot the Voronoi tessellation induced by the three profiles, which specifies the regions of the plane that are attributable to each author. Different from the case in Fig. 2a, attribution is not perfect since one of Austen's texts is mistakenly attributed to Melville.
This is represented in Fig. 2b by the green empty circle that appears in the section of the Voronoi tessellation that corresponds to the blue profile.

Besides the number of authors, the other principal determinants of classification accuracy are the length of the profiles, the length of the texts of unknown authorship, and the similarity of writing styles as captured by the relative entropy dissimilarities between profiles. We study these effects in Sections V-A, V-B, and V-C, respectively.

A. Varying Profile Length

The profile length is defined as the total number of words, function or otherwise, used to construct the profile. To study the effect of varying profile lengths, we fix $\alpha = 0.75$ and $D = 10$, and vary the length of author profiles from 10,000 to 100,000 words in increments of 10,000 words. For each profile length, we attribute texts containing 25,000, 5,000, and 1,000 words, and for each given combination of profile and text length, we consider problems ranging from binary attribution to attribution between ten authors.

To build profiles, we use ten texts of the same length randomly chosen among all the texts written by a given author. The length of each excerpt is such that the ten pieces add up to the desired profile length. E.g., to build a profile of length 50,000 words for Melville, we randomly pick ten excerpts of 5,000 words each among all the texts written by him. For the texts to be attributed, however, we always select contiguous extracts of the desired length. E.g., for texts of length 25,000 words, we randomly pick excerpts of this length written by some author, as opposed to the selection of ten pieces of different origin we do for the profiles. This resembles the usual situation where the profiles are built from several sources but the texts to attribute correspond to a single literary creation.
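The sampling of profiles and of texts to attribute can be sketched as below. The details are assumptions made for illustration: the names `contiguous_excerpt` and `build_profile` are hypothetical, each excerpt is drawn from a randomly chosen text at a random starting position, and excerpts are drawn with replacement so they may overlap. The paper does not state how overlaps between excerpts are handled.

```python
import random

def contiguous_excerpt(words, length, rng):
    """Return a random contiguous run of `length` words from one text."""
    start = rng.randrange(len(words) - length + 1)
    return words[start:start + length]

def build_profile(texts, profile_length, rng, n_excerpts=10):
    """Concatenate n_excerpts equal-length excerpts from an author's texts."""
    excerpt_length = profile_length // n_excerpts
    # Only texts long enough to hold one excerpt can be sampled.
    eligible = [t for t in texts if len(t) >= excerpt_length]
    profile = []
    for _ in range(n_excerpts):
        source = rng.choice(eligible)
        profile.extend(contiguous_excerpt(source, excerpt_length, rng))
    return profile

rng = random.Random(1)
toy_texts = [list(range(100)), list(range(100, 160))]  # stand-ins for word lists
profile = build_profile(toy_texts, profile_length=50, rng=rng)
print(len(profile))
```

With ten excerpts and a target of 50 words, each excerpt holds 5 words and the resulting profile has length 50. A text to attribute would be drawn with a single call to `contiguous_excerpt`.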
For a given profile size and number of authors, several attribution experiments were run by randomly choosing the set of authors among those from the 19th century [28] and randomly choosing the texts forming the profiles. The number of attribution experiments was chosen large enough to ensure that every accuracy value in Tables II-IV is based on the attribution of at least 600 texts.

Authors     10     20     30     40     50     60     70     80     90    100  Rand.
2        0.927  0.964  0.984  0.985  0.981  0.979  0.981  0.986  0.992  0.988  0.500
3        0.871  0.934  0.949  0.962  0.968  0.975  0.982  0.978  0.974  0.978  0.333
4        0.833  0.905  0.931  0.949  0.948  0.964  0.963  0.968  0.969  0.977  0.250
5        0.800  0.887  0.923  0.950  0.945  0.951  0.953  0.961  0.961  0.969  0.200
6        0.760  0.880  0.929  0.932  0.937  0.941  0.948  0.952  0.950  0.973  0.167
7        0.755  0.851  0.909  0.924  0.937  0.943  0.937  0.957  0.960  0.957  0.143
8        0.722  0.841  0.898  0.911  0.932  0.941  0.938  0.947  0.952  0.955  0.125
9        0.683  0.855  0.882  0.905  0.915  0.931  0.932  0.944  0.952  0.955  0.111
10       0.701  0.827  0.882  0.910  0.923  0.923  0.934  0.935  0.943  0.935  0.100

TABLE II: Profile length vs. accuracy for different numbers of authors (text length = 25,000 words). Columns give the number of words in the profile, in thousands; Rand. is the expected accuracy of random attribution.

Authors     10     20     30     40     50     60     70     80     90    100  Rand.
2        0.863  0.930  0.932  0.945  0.928  0.952  0.942  0.907  0.942  0.967  0.500
3        0.821  0.884  0.886  0.890  0.910  0.901  0.943  0.912  0.911  0.914  0.333
4        0.728  0.833  0.849  0.862  0.892  0.867  0.888  0.905  0.882  0.885  0.250
5        0.698  0.819  0.825  0.839  0.862  0.884  0.859  0.865  0.882  0.893  0.200
6        0.673  0.754  0.789  0.798  0.832  0.837  0.863  0.870  0.896  0.878  0.167
7        0.616  0.754  0.806  0.838  0.812  0.848  0.859  0.832  0.873  0.868  0.143
8        0.600  0.720  0.748  0.820  0.805  0.831  0.831  0.854  0.857  0.850  0.125
9        0.587  0.718  0.767  0.781  0.796  0.809  0.833  0.850  0.843  0.847  0.111
10       0.556  0.693  0.737  0.753  0.805  0.827  0.829  0.824  0.843  0.846  0.100

TABLE III: Profile length vs. accuracy for different numbers of authors (text length = 5,000 words). Columns give the number of words in the profile, in thousands; Rand. is the expected accuracy of random attribution.

Authors     10     20     30     40     50     60     70     80     90    100  Rand.
2        0.738  0.788  0.747  0.823  0.803  0.803  0.802  0.800  0.812  0.793  0.500
3        0.599  0.698  0.690  0.737  0.713  0.744  0.724  0.726  0.757  0.698  0.333
4        0.528  0.638  0.640  0.672  0.658  0.663  0.656  0.663  0.651  0.707  0.250
5        0.491  0.561  0.598  0.627  0.686  0.621  0.633  0.661  0.674  0.632  0.200
6        0.469  0.549  0.578  0.593  0.626  0.594  0.598  0.617  0.606  0.582  0.167
7        0.420  0.469  0.539  0.551  0.583  0.564  0.603  0.593  0.583  0.598  0.143
8        0.392  0.454  0.544  0.540  0.572  0.551  0.583  0.589  0.563  0.599  0.125
9        0.385  0.449  0.489  0.528  0.519  0.556  0.551  0.580  0.560  0.576  0.111
10       0.353  0.410  0.466  0.480  0.506  0.536  0.529  0.542  0.556  0.553  0.100

TABLE IV: Profile length vs. accuracy for different numbers of authors (text length = 1,000 words). Columns give the number of words in the profile, in thousands; Rand. is the expected accuracy of random attribution.

The accuracy results of attributing a text of 25,000 words are stated in Table II. This word length is equivalent to around 60 pages of a midsize paperback novel – i.e., a novella, or a few book chapters – or the typical length of a Shakespeare play. In the last column of the table we report the expected accuracy of random attribution between the candidate authors. The difference between the accuracies of the last column and the rest of the table indicates that WANs do carry stylometric information useful for authorship attribution. Overall, attribution of texts with 25,000 words can be done with high accuracy even when attributing among a large number of authors if reasonably large corpora are available to build author profiles with 60,000 to 100,000 words. E.g., for a profile containing 40,000 words, our method achieves an accuracy of 0.985 for binary attributions whereas the corresponding random accuracy is 0.5. As expected, accuracy decreases when the number of candidate authors increases.
E.g., for profiles of 80,000 words, an accuracy of 0.986 is obtained for binary attributions whereas an accuracy of 0.935 is obtained when the pool of candidates contains ten authors. Observe that accuracy does not decrease monotonically as the number of candidate authors increases, owing to the noise introduced by the random selection of authors and texts.

Accuracy increases with longer profiles. E.g., when performing attributions of 25,000 word texts among 6 authors, the accuracy obtained for profiles of length 10,000 is 0.760 whereas the accuracy obtained for profiles of length 60,000 is 0.941. There is a saturation effect concerning the length of the profile that depends on the number of authors being considered. For binary attributions there is no major increase in accuracy beyond profiles of length 30,000. However, when the number of candidate authors is 7, accuracy stabilizes for profiles of length in the order of 80,000 words. There seems to be little benefit in using profiles containing more than 100,000 words, which corresponds to a short novel of about 250 pages.

Correct attribution rates of shorter excerpts containing 5,000 words are shown in Table III for the same profile lengths and numbers of candidate authors considered in Table II. A text of this length corresponds to about 13 pages of a novel – something in the order of the chapter of a book – or an act in a Shakespeare play. When considering these shorter texts, acceptable classification accuracy is achieved except for very short profiles and large numbers of authors, while reliable attribution requires a small number of candidate authors or a large profile. E.g., attribution between three authors with profiles of 70,000 words has an average accuracy of 0.943. While smaller than the corresponding correct attribution rate of 0.982 for texts of length 25,000 words, this is still a respectable number.
To achieve an accuracy in excess of 0.9 for the case of three authors we need a profile of at least 50,000 words.

For very short texts of 1,000 words, which is about the length of an opinion piece in a newspaper, a couple of pages in a novel, or a scene in a Shakespeare play, we can provide indications of authorship but cannot make definitive claims. As shown in Table IV, the best accuracies are for binary attributions, which hover around 0.8 when we use profiles longer than 40,000 words. For attributions between more than 2 authors, maximum correct attribution rates are achieved for profiles containing 90,000 or 100,000 words and range from 0.757 for the case of three authors to 0.556 when considering ten authors. These rates are markedly better than random attribution but not sufficient for definitive statements. The results can be of use as circumstantial evidence in support of attribution claims substantiated by further proof.

B. Varying Text Length

In this section we analyze the effect of text length on attribution accuracy for varying profile lengths and numbers of candidate authors. Using α = 0.75 and D = 10, we consider profiles of length 100,000, 20,000, and 5,000 words and vary the number of candidate authors from two to ten. The text lengths considered are 1,000 words to 6,000 words in 1,000 word increments, 8,000 words, and 10,000 to 30,000 words in 5,000 word increments. The fine resolution for short texts permits estimating the shortest texts that can be attributed accurately. As in Section V-A, for every combination of number of authors and text length, enough independent attribution experiments were performed to ensure that every accuracy value in Tables V-VII is based on at least 600 attributions.

For profiles of length 100,000 words, the results are reported in Table V. As done in Tables II-IV, we state the expected accuracy of random attribution in the last column of the table.
Accuracies reported towards the right end of the table, i.e., 20,000-30,000 words, correspond to the attribution of a dramatic play or around 60 pages of a novel, which we will refer to as long texts. Accuracies for columns in the middle of the table, i.e., 5,000-8,000 words, correspond to an act in a dramatic play or between 12 and 20 pages of a novel, which we will refer to as medium texts. The left columns of this table, i.e., 1,000-3,000 words, correspond to a scene in a play, 2 to 7 pages in a novel, or an article in a newspaper, which we will refer to as short texts. For the attribution of long texts, we achieve a mean accuracy of 0.988 for binary attributions, which decreases to an average accuracy of 0.945 when the number of candidate authors is increased to ten. For medium texts, the decrease in accuracy is not very significant for binary attributions, with a mean accuracy of 0.955, but the accuracy is reduced to 0.856 for attributions among ten authors. The accuracy is decreased further when attributing short texts, with a mean accuracy of 0.894 for binary attributions and 0.700 for the case with ten candidates. This indicates that when profiles of around 100,000 words are available, WANs achieve accuracies over 0.95 for binary attributions of medium to long texts. For short texts, acceptable classification rates are achieved if the number of candidate authors is between two and four.

If we reduce the length of the profiles to 20,000 words, reasonable accuracies are attained for small pools of candidate authors; see Table VI. E.g., for binary attributions, the rate of correct classification varies between 0.812 for texts of 1,000 words and 0.969 for texts with 30,000 words. The first of these numbers means that we can correctly attribute a newspaper opinion piece with accuracy 0.812 if we are given corpora of 20 opinion pieces by the candidate authors.
The second of these numbers means that we can correctly attribute a play between two authors with accuracy 0.969 if we are given corpora of 20,000 words by the candidate authors. Further reducing the profile length to 5,000 words results in classification accuracies that are acceptable only when we consider binary attributions and texts of at least 10,000 words; see Table VII. For shorter texts or larger numbers of candidate authors, WANs can provide supporting evidence but not definitive proof.

Fig. 3: Binary attribution accuracy as a function of the inter-profile dissimilarity. Higher accuracy is attained for attribution between authors that are more dissimilar.

C. Inter-Profile Dissimilarities

Besides the number of candidate authors and the length of the texts and profiles, the correct attribution of a text is also dependent on the similarity of the writing styles of the authors themselves. Indeed, repeated binary attributions between Henry James and Washington Irving with random generation of 100,000 word profiles yield a perfect accuracy of 1.0 on the classification of 400 texts of 10,000 words each. The same exercise when attributing between Grant Allen and Robert Louis Stevenson yields a classification rate of 0.91. This occurs because the stylometric fingerprints of Allen and Stevenson are harder to distinguish than those of James and Irving.

Dissimilarity of writing styles can be quantified by computing the relative entropies between the profiles [cf. (11)]. Since relative entropies are asymmetric, i.e., $H(P_1, P_2) \neq H(P_2, P_1)$ in general, we consider the average of the two relative entropies between two profiles as a measure of their dissimilarity. For each pair of authors, the relative entropy is computed based on the set of function words chosen adaptively to maximize the cross validation accuracy.
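The symmetrization just described can be illustrated with a short sketch. Several things here are assumptions rather than the paper's definitions: equation (11) is not reproduced in this section, so the code uses the standard relative entropy between two Markov chains, weighted by the stationary distribution of the first chain and restricted to transitions that are nonzero in both; the natural logarithm is an arbitrary choice; and the function names are hypothetical.

```python
import math


def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution of a row-stochastic matrix.

    Uses power iteration from the uniform distribution, which assumes the
    chain is ergodic (irreducible and aperiodic).
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def relative_entropy(P1, P2):
    """Relative entropy H(P1, P2) between two Markov chains (assumed form)."""
    n = len(P1)
    pi = stationary_distribution(P1)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Assumption: skip transitions that are zero in either chain.
            if P1[i][j] > 0 and P2[i][j] > 0:
                total += pi[i] * P1[i][j] * math.log(P1[i][j] / P2[i][j])
    return total


def inter_profile_dissimilarity(P1, P2):
    """Average of the two directed relative entropies, hence symmetric."""
    return 0.5 * (relative_entropy(P1, P2) + relative_entropy(P2, P1))


# Two toy 2-state chains standing in for normalized author profiles.
P_a = [[0.9, 0.1], [0.5, 0.5]]
P_b = [[0.6, 0.4], [0.3, 0.7]]
d_ab = inter_profile_dissimilarity(P_a, P_b)
```

On the toy chains the two directed entropies differ, while `inter_profile_dissimilarity` returns the same value in either argument order, which is the property the averaging is meant to provide.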
For the 100,000 word profiles of James and Irving, the inter-profile dissimilarity resulting from the average of relative entropies is 0.184. The inter-profile dissimilarity between Allen and Stevenson is 0.099. This provides a formal measure of similarity of writing styles which explains the higher accuracy of attributions between James and Irving with respect to attributions between Allen and Stevenson.

Authors      1      2      3      4      5      6      8     10     15     20     25     30  Rand.
2        0.840  0.917  0.925  0.938  0.940  0.967  0.958  0.977  0.967  0.989  0.988  0.986  0.500
3        0.789  0.873  0.890  0.919  0.913  0.932  0.936  0.956  0.952  0.979  0.979  0.975  0.333
4        0.736  0.842  0.870  0.902  0.906  0.933  0.937  0.952  0.965  0.970  0.973  0.974  0.250
5        0.711  0.797  0.858  0.874  0.891  0.906  0.924  0.925  0.955  0.971  0.980  0.964  0.200
6        0.690  0.796  0.828  0.886  0.884  0.911  0.919  0.922  0.944  0.957  0.969  0.961  0.167
7        0.633  0.730  0.814  0.855  0.874  0.890  0.910  0.911  0.928  0.947  0.956  0.951  0.143
8        0.602  0.740  0.811  0.846  0.882  0.887  0.915  0.910  0.930  0.944  0.957  0.963  0.125
9        0.607  0.721  0.774  0.826  0.845  0.870  0.889  0.890  0.918  0.948  0.951  0.953  0.111
10       0.578  0.731  0.792  0.816  0.842  0.855  0.872  0.893  0.921  0.933  0.942  0.961  0.100

TABLE V: Text length vs. accuracy for different numbers of authors (profile length = 100,000 words). Columns give the number of words in the texts, in thousands; Rand. is the expected accuracy of random attribution.

Authors      1      2      3      4      5      6      8     10     15     20     25     30  Rand.
2        0.812  0.850  0.903  0.912  0.913  0.912  0.938  0.945  0.918  0.964  0.964  0.969  0.500
3        0.760  0.797  0.858  0.899  0.887  0.918  0.920  0.918  0.919  0.938  0.929  0.928  0.333
4        0.670  0.747  0.813  0.852  0.868  0.887  0.889  0.906  0.918  0.915  0.900  0.913  0.250
5        0.621  0.721  0.749  0.813  0.823  0.819  0.859  0.878  0.876  0.887  0.889  0.893  0.200
6        0.557  0.681  0.754  0.782  0.799  0.831  0.852  0.866  0.871  0.879  0.881  0.872  0.167
7        0.493  0.610  0.674  0.706  0.731  0.770  0.798  0.807  0.828  0.862  0.867  0.858  0.143
8        0.467  0.623  0.675  0.721  0.741  0.769  0.790  0.826  0.822  0.857  0.841  0.857  0.125
9        0.474  0.574  0.656  0.672  0.710  0.734  0.781  0.783  0.813  0.845  0.837  0.841  0.111
10       0.433  0.535  0.612  0.663  0.684  0.706  0.752  0.772  0.836  0.840  0.851  0.848  0.100

TABLE VI: Text length vs. accuracy for different numbers of authors (profile length = 20,000 words). Columns give the number of words in the texts, in thousands; Rand. is the expected accuracy of random attribution.

Authors      1      2      3      4      5      6      8     10     15     20     25     30  Rand.
2        0.672  0.740  0.747  0.707  0.803  0.823  0.788  0.848  0.820  0.802  0.827  0.832  0.500
3        0.547  0.623  0.626  0.653  0.744  0.669  0.712  0.757  0.736  0.764  0.734  0.733  0.333
4        0.452  0.487  0.528  0.597  0.652  0.623  0.623  0.662  0.682  0.661  0.632  0.694  0.250
5        0.403  0.510  0.535  0.538  0.505  0.573  0.618  0.592  0.681  0.606  0.638  0.570  0.200
6        0.372  0.457  0.480  0.485  0.529  0.518  0.545  0.577  0.605  0.631  0.599  0.601  0.167
7        0.349  0.382  0.460  0.469  0.475  0.504  0.522  0.539  0.528  0.568  0.588  0.562  0.143
8        0.302  0.390  0.453  0.440  0.473  0.510  0.465  0.517  0.541  0.530  0.534  0.549  0.125
9        0.296  0.347  0.370  0.427  0.477  0.439  0.485  0.492  0.506  0.530  0.557  0.532  0.111
10       0.254  0.337  0.373  0.405  0.413  0.427  0.455  0.487  0.480  0.460  0.443  0.463  0.100

TABLE VII: Text length vs. accuracy for different numbers of authors (profile length = 5,000 words). Columns give the number of words in the texts, in thousands; Rand. is the expected accuracy of random attribution.

The correlation between inter-profile dissimilarities and attribution accuracy is corroborated by Fig. 3. Each point in this plot corresponds to the selection of two authors at random from our pool of 21 authors from the 19th century. For each pair we select ten texts of 10,000 words each to generate profiles of length 100,000 words. We then attribute ten of the remaining excerpts of length 10,000 words of each of these two authors among the two profiles and record the correct attribution rate as well as the dissimilarity between the random profiles generated.
The process is repeated twenty times for these two authors to produce the average dissimilarity and accuracy that yield the corresponding point in Fig. 3. E.g., consider two randomly chosen authors, for each of whom we have 50 excerpts of 10,000 words available. For each author we select ten random texts to form a profile, and we then attribute 20 out of the remaining 80 excerpts – 10 for each author. After repeating this procedure twenty times we get the average accuracy of attributing 400 texts of length 10,000 words between the two authors.

Besides the positive correlation between inter-profile dissimilarities and attribution accuracies, Fig. 3 shows that classification is perfect for 11 out of 12 instances where the inter-profile dissimilarity exceeds 0.16. Errors are rare for profile dissimilarities between 0.10 and 0.16 since correct classifications average 0.984 and account for at least 0.96 of the attribution results in all but three outliers. For pairs of authors with dissimilarities smaller than 0.1 the average accuracy is 0.942.

VI. META ATTRIBUTION STUDIES

WANs can also be used to study problems other than attribution between authors. In this section we demonstrate that WANs carry information about time periods, the genre of the composition, and the gender of the authors. We also illustrate the use of WANs in detecting collaborations.

A. Time

WANs carry information about the point in time in which a text was written. If we build random profiles of 200,000 words for Shakespeare, Chapman, and Melville and compute the inter-profile dissimilarity as in Section V-C, we obtain a dissimilarity of 0.04 between Shakespeare and Chapman and of 0.17 between Shakespeare and Melville. Since inter-profile dissimilarity is a measure of difference in style, these values are reasonable given that Shakespeare and Chapman were contemporaries but Melville lived more than two centuries after them. To further illustrate this point, in Fig.
4a we plot a two dimensional MDS representation of the dissimilarity between eight authors whose profiles were built with all their available texts in our corpus [28]. Four of the profiles correspond to early 17th century authors – Shakespeare, Chapman, Jonson, and Fletcher – and are represented by blue stars while the other four – Doyle, Melville, Garland, and Allen – correspond to 19th century authors and are represented by red dots. Notice that authors tend to have a smaller distance with their contemporaries and a larger distance with authors from other periods.

Fig. 4: (a) MDS plot for authors of different time periods. Authors from the early 17th century are depicted as blue stars while authors from the 19th century are depicted as red dots. Inter-profile dissimilarities are small within the groups and large between them. (b) Heat map of inter-profile relative entropies. High inter-profile relative entropies are illustrated with warmer colors. Two groups of authors with small inter-profile relative entropies are apparent: the first seven correspond to 17th century authors and the rest to 19th century authors.

                    Marlowe  Chapman
Shakespeare (Com.)     11.6      7.7
Shakespeare (His.)      7.6      9.3

TABLE VIII: Inter-profile dissimilarities (x100) between authors of different genres.

This fact is also illustrated by the heat map of inter-profile relative entropies in Fig. 4b where bluish colors represent smaller entropies. Since heat maps allow the representation of asymmetric data, we directly plot the relative entropies instead of the symmetrized inter-profile dissimilarities. The first 7 rows and columns correspond to authors of the 17th century whereas the remaining 21 correspond to authors of the 19th century, where profiles were built with all the available texts.
Notice that the blocks of blue color along the diagonal are in perfect correspondence with the time period of the authors, verifying that WANs can be used to determine the time in which a text was written. The average entropies among authors of the 17th century and among those of the 19th century are 0.096 and 0.098, respectively, whereas the average entropy between authors of different epochs is 0.273. I.e., the relative entropy between authors of different epochs almost triples that of authors belonging to the same time period.

B. Genre

Even though function words by themselves do not carry content, WANs constructed from a text contain, rather surprisingly, information about its genre. We illustrate this fact in Fig. 5, where we present the relative entropies between 20 pieces of text of length 20,000 words written by Shakespeare, where 10 of them are history plays – e.g., Richard II, King John, Henry VIII – and 10 of them are comedies – e.g., The Tempest, Measure for Measure, The Merchant of Venice. As in Fig. 4b, bluish colors in Fig. 5 represent smaller relative entropies. Two blocks along the diagonal can be distinguished that coincide with the plays of the two different genres. Indeed, if we sequentially extract one text from the group and attribute it to a genre by computing the average relative entropies to the remaining histories and comedies, the 20 pieces are correctly attributed to their genre.

More generally, inter-profile dissimilarities between authors that write in the same genre tend to be smaller than between authors that write in different genres. As an example, in Table VIII we compute the dissimilarity between two Shakespeare profiles – one built with comedies and the other with histories – and two contemporary authors: Marlowe and Chapman. All profiles contain 100,000 words formed by randomly picking 10 extracts of 10,000 words.
Marlowe never wrote a comedy and mainly focused on histories (Edward II, The Massacre at Paris) and tragedies (The Jew of Malta, Dido), while the majority of Chapman's plays are comedies (All Fools, May Day). Genre choice impacts the inter-profile dissimilarity since the comedy profile of Shakespeare is closer to Chapman than to Marlowe and vice versa for the history profile of Shakespeare. The inter-profile dissimilarity between Shakespeare profiles is 6.2, which is still smaller than any dissimilarity in Table VIII. This points towards the conclusion that the identity of the author is the main determinant of the writing style but that the genre of the text being written also contributes to the word choice. In general, two texts of the same author but different genres are more similar than two texts of the same genre but different authors which, in turn, are more similar than two texts of different authors and genres.

Fig. 5: Heat map of relative entropies between 20 Shakespeare extracts. The first 10 texts correspond to history plays while the last 10 correspond to comedy plays. Relative entropies within texts of the same genre are smaller than across genres.

    Sh.   Jon.   Fle.   Mid.   Cha.  Marl.
   19.1   20.0   18.2   20.2   19.5   20.9

TABLE IX: Relative entropies from Two Noble Kinsmen to different profiles (x100).

C. Gender

Word usage can be used for author profiling [30] and, in particular, to infer the gender of an author from the written text. To illustrate this, we divide the 21 authors from the 19th century [28] into females – five of them – and males. We pick a gender at random and pick an excerpt of 10,000 words from any author of the selected gender. We then build two 100,000 word profiles, one containing pieces of texts written by male authors and the other by female authors. In order to avoid bias, we do not include any text of the author from which the text to attribute was chosen in the gender profiles.
We then attribute the chosen text between the two gender profiles. After repeating this procedure 5,000 times, we obtain a mean accuracy of 0.63. Although this accuracy is lower than that of state-of-the-art gender profiling methods [31], the difference with random attribution, i.e., accuracy of 0.5, indicates that WANs carry gender information about the authors.

D. Collaborations

WANs can also be used for the attribution of texts written collaboratively between two or more authors. Since collaboration was a common practice for playwrights in the early 17th century, we consider the attribution of Early Modern English plays [28]. For a given play, we compute its relative entropy to six contemporary authors – Shakespeare, Jonson, Fletcher, Middleton, Chapman, and Marlowe – by generating 50 random profiles of length 80,000 words for each author and averaging the 50 entropies to obtain one representative number. We do not consider Peele in the analysis due to the short total length of available texts.

         Sh.   Jon.   Fle.   Mid.   Cha.  Marl.
Sh.     19.1   19.2   17.9   19.0   19.1   19.3
Jon.    19.2   20.0   18.4   19.5   19.3   19.3
Fle.    17.9   18.4   18.2   18.4   18.2   18.1
Mid.    19.0   19.5   18.4   20.2   19.4   18.9
Cha.    19.1   19.3   18.2   19.4   19.5   19.4
Marl.   19.3   19.3   18.1   18.9   19.4   20.9

TABLE X: Relative entropies from Two Noble Kinsmen to hybrid profiles composed of two authors (x100).

         Sh.   Jon.   Fle.   Mid.   Cha.  Marl.
Sh.     17.6   16.8   17.3   16.7   17.1   18.2
Jon.    16.8   16.8   17.0   16.5   16.7   17.3
Fle.    17.3   17.0   18.7   17.6   17.4   17.9
Mid.    16.7   16.5   17.6   17.6   16.9   17.1
Cha.    17.1   16.7   17.4   16.9   17.5   17.8
Marl.   17.4   17.1   17.6   17.3   17.4   18.1

TABLE XI: Relative entropies from Eastward Ho to hybrid profiles composed of two authors (x100).

When two authors collaborate to write a play, the resulting word adjacency network is close to the profiles of both authors, even though these profiles are built with plays of their sole authorship.
As an example, consider the play Two Noble Kinsmen, which is an accepted collaboration between Fletcher and Shakespeare [32]. In Table IX, we present the relative entropies between the play and the six analyzed authors. Notice that the two minimum entropies correspond to those who collaborated in writing it. Collaboration can be further confirmed by the construction of hybrid profiles, i.e., profiles built containing 40,000 words from each of two different authors. Each entry in Table X corresponds to the relative entropy from Two Noble Kinsmen to a hybrid profile composed by the authors in the row and column of that entry. Notice that the diagonal of Table X corresponds to profiles of sole authors and, thus, coincides with Table IX. The smallest relative entropy in Table X is achieved by the hybrid profile composed by Fletcher and Shakespeare, which is consistent with the accepted attribution of the play.

The attribution between hybrid profiles is not always accurate. For example, consider the play Eastward Ho, which is a collaboration between three authors, two of whom are Chapman and Jonson [32]. If we repeat the above procedure and compute the relative entropies between the play and the different pure profiles, we see that in fact the two smallest entropies are achieved for Jonson and Chapman; see the diagonal in Table XI. However, the smallest entropy in the whole table is achieved by the hybrid profile composed by Jonson and Middleton. The hybrid profile of Jonson and Chapman, the real authors, achieves an entropy of 16.7, which is the second smallest among all profiles in Table XI.

VII. COMPARISON AND COMBINATION WITH FREQUENCY BASED METHODS

Machine learning tools have been used to solve attribution problems by relying on the frequency of appearance of function words [33]. These methods consider the number of times an author uses different function words but, unlike WANs, do not contemplate the order in which the function words appear. The most common techniques include naive Bayes [34, Chapter 8], nearest neighbors (NN) [34, Chapter 2], decision trees (DT) [34, Chapter 14], and support vector machines (SVM) [34, Chapter 7].

In Table XII we report the percentage of errors obtained by different methods when attributing texts of 10,000 words among profiles of 100,000 words for a number of authors ranging from two to ten. For a given number of candidate authors, we randomly pick them from the pool of 19th century authors [28] and attribute ten excerpts of each of them using the different methods. We then repeat the random choice of authors 100 times and average the error rate. For each of the methods based on function word frequencies, we pick the set of parameters and preprocessing that minimize the attribution error rate. E.g., for SVM the error is minimized when considering a polynomial kernel of degree 3 and normalizing the frequencies by text length. For the nearest neighbors method we consider two strategies based on one (1-NN) and three (3-NN) nearest neighbors as given by the l2 metric in Euclidean space. Also, for decision trees we consider two types of split criteria: the Gini Diversity Index (DT-gdi) and the cross-entropy (DT-ce) [35].

Authors N. Bayes     1-NN     3-NN   DT-gdi    DT-ce      SVM      WAN   Voting
2            2.6      3.5      5.2     12.2     12.2      2.7      1.6      0.9
4            6.0      9.2     12.4     25.3     25.5      6.8      4.6      3.3
6            8.1     11.7     15.2     31.9     32.2      7.9      5.3      3.8
8            9.6     15.4     19.2     36.4     37.2     11.1      6.7      5.2
10          10.8     16.7     21.4     42.1     42.1     11.5      8.3      6.0

TABLE XII: Error rates in % achieved by different methods for profiles of 100,000 words and texts of 10,000 words. The WANs achieve the smallest error rate among the methods considered separately. Voting decreases the error even further by combining the relational data of the WANs with the frequency data of other methods.
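The entries of Table XII can be compared through relative error reductions, and the Voting column rests on a majority vote between WANs, naive Bayes, and SVM. The sketch below illustrates both computations. It is not the authors' implementation: the names are hypothetical, and the tie-breaking rule (falling back to the WAN decision when all three classifiers disagree) is an assumption, since the text does not specify one.

```python
from collections import Counter

# Error rates in % copied from Table XII, keyed by number of candidate authors.
ERROR_RATES = {
    2:  {"bayes": 2.6,  "svm": 2.7,  "wan": 1.6, "voting": 0.9},
    4:  {"bayes": 6.0,  "svm": 6.8,  "wan": 4.6, "voting": 3.3},
    6:  {"bayes": 8.1,  "svm": 7.9,  "wan": 5.3, "voting": 3.8},
    8:  {"bayes": 9.6,  "svm": 11.1, "wan": 6.7, "voting": 5.2},
    10: {"bayes": 10.8, "svm": 11.5, "wan": 8.3, "voting": 6.0},
}


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the error of `baseline`."""
    return 1.0 - improved / baseline


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# WANs against the better of naive Bayes and SVM, averaged over the rows.
wan_vs_frequency = mean(
    relative_reduction(min(r["bayes"], r["svm"]), r["wan"])
    for r in ERROR_RATES.values()
)

# Voting against WANs alone, averaged over the rows.
voting_vs_wan = mean(
    relative_reduction(r["wan"], r["voting"]) for r in ERROR_RATES.values()
)


def majority_vote(wan, bayes, svm):
    """Return the author picked by at least two of the three classifiers.

    Assumption: if all three disagree, keep the WAN decision.
    """
    author, count = Counter([wan, bayes, svm]).most_common(1)[0]
    return author if count >= 2 else wan
```

Because the table entries are rounded to one decimal, averages computed this way may differ slightly from percentages derived from unrounded error rates.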
The WANs achieve a lower attribution error than frequency based methods; see Table XII. For binary attributions, naive Bayes and SVM achieve error rates of 2.6% and 2.7%, respectively, and thus outperform nearest neighbors and decision trees. However, WANs outperform the aforementioned methods by obtaining an error rate of 1.6%. This implies a reduction of 38% in the error rate. For 6 authors, WANs achieve an error rate of 5.3%, which outperforms the 7.9% achieved by SVMs and entails a 33% reduction. This trend is consistent across different numbers of candidate authors, with WANs achieving an average error reduction of 29% compared with the best traditional machine learning method.

More important than the fact that WANs tend to outperform methods based on word frequencies is the fact that they carry different stylometric information. Thus, we can combine both methodologies to further increase attribution accuracy. In the last column of Table XII we report the error rate of majority voting between WANs and the two best performing frequency based methods, namely, naive Bayes and SVMs. The error rates are consistently smaller than those achieved by WANs and, hence, by the frequency based methods as well. E.g., for attributions among four authors, voting achieves an error of 3.3% compared to an error of 4.6% for WANs. This corresponds to a 28% reduction in error. Averaging among attributions for different numbers of candidate authors, majority voting entails a reduction of 30% compared with WANs. The combination of WANs and function word frequencies halves the attribution error rate with respect to the current state of the art.

VIII. CONCLUSION

Relational data between function words was used as stylometric information to solve authorship attribution problems. Normalized word adjacency networks (WANs) were used as relational structures.
We interpreted these networks as Markov chains in order to facilitate their comparison using relative entropies. The accuracy of WANs was analyzed for varying numbers of candidate authors, text lengths, and profile lengths, and for different levels of heterogeneity among the candidate authors regarding genre, gender, and time period. The method works best when the corpus of known texts is of substantial length, when the texts being attributed are long, or when the number of candidate authors is small. If long profiles are available (more than 60,000 words, corresponding to 150 pages of a midsize paperback book), we demonstrated very high attribution accuracy for texts longer than a few typical novel chapters even when attributing among a large number of authors, high accuracy for texts as long as a play act or a novel chapter, and reasonable rates for short texts such as newspaper opinion pieces if the number of candidate authors is small. WANs were also shown to classify accurately the time period when a text was written, to acceptably estimate the genre of a piece, and to have some predictive power on the gender of the author. The applicability of WANs to identify multiple authors in collaborative works was also demonstrated.

With regard to existing methods based on the frequency with which different function words appear in the text, we observed that WANs exceed their classification accuracy. More importantly, we showed that WANs and frequencies capture different stylometric aspects, so that combining them is possible and ends up halving the error rate of existing methods.
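
The relative error reductions quoted in Section VII, and the majority-voting rule that combines WANs with naive Bayes and SVMs, can be illustrated with a short Python sketch. The function names `relative_reduction` and `majority_vote` are hypothetical and chosen only for this illustration. The text does not say how a three-way disagreement is resolved, so falling back to the WAN vote is an assumption made here, not necessarily the rule used in the experiments.

```python
from collections import Counter


def relative_reduction(baseline_error, new_error):
    """Fractional reduction of new_error with respect to baseline_error."""
    return (baseline_error - new_error) / baseline_error


def majority_vote(wan_vote, nb_vote, svm_vote):
    """Return the author chosen by at least two of the three classifiers.

    Assumption (not stated in the text): if all three classifiers
    disagree, fall back to the WAN attribution.
    """
    counts = Counter([wan_vote, nb_vote, svm_vote])
    author, count = counts.most_common(1)[0]
    return author if count >= 2 else wan_vote


# Error rates (in percent) quoted in the text.
binary = relative_reduction(2.6, 1.6)       # naive Bayes vs. WANs, two authors
six_authors = relative_reduction(7.9, 5.3)  # SVMs vs. WANs, six authors
voting = relative_reduction(4.6, 3.3)       # WANs vs. majority voting, four authors

print(round(100 * binary), round(100 * six_authors), round(100 * voting))
# prints: 38 33 28
```

With three voters, any author named by two classifiers wins outright, so the fallback only matters when there are more than two candidate authors and every classifier picks a different one.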
