Selecting Attributes for Sport Forecasting using Formal Concept Analysis

Gonzalo A. Aranda-Corral¹, Joaquín Borrego-Díaz² and Juan Galán-Páez²
¹ Department of Information Technology, Universidad de Huelva, Spain
gonzalo.aranda@dti.uhu.es
² Department of Computer Science and Artificial Intelligence, Universidad de Sevilla, Spain
jborrego@us.es, juangalan@us.es

Abstract

To address complex systems, applying pattern recognition to their evolution could play a key role in understanding their dynamics. Global patterns are required to detect emergent concepts and trends, some of them of a qualitative nature. Formal Concept Analysis (FCA) is a theory whose goal is to discover and extract knowledge from qualitative data. It provides tools for reasoning with implication bases (and association rules). Implications and association rules are useful for reasoning on previously selected attributes, providing a formal foundation for logical reasoning. In this paper we analyse how to apply FCA reasoning to increase confidence in sports betting by detecting temporal regularities in data. It is applied to build a knowledge-based system for confidence reasoning.

Introduction

Formal Concept Analysis (FCA) (Ganter & Wille 1999) is a mathematical theory for data analysis that uses formal contexts and concept lattices as key tools. Domains can be formally modelled according to the extent and the intent of each formal concept. In FCA, the basic data structure is a formal context (of a qualitative nature), which represents a set of objects and their properties and is useful both to detect and to describe regularities and structures of concepts. It also provides a sound formalism for reasoning with such structures, mainly the Stem Basis and association rules. It is therefore interesting to consider its application to reasoning with temporal qualitative data in order to discover temporal trends (Aranda-Corral et al. 2011).
In this paper, the application scope of FCA is the challenge of sports betting, specifically the forecasting of soccer league results. Forecasting sport results is a fast-growing research area because of its economic impact on betting markets as well as its potential application to problems with similar behaviour (markets) (Inst. Engineering and Technology 2010). Considering sports betting as a complex system, soccer leagues represent a challenging system with a huge amount of knowledge available through the WWW, and their behaviour is exhaustively analysed every week by journalists, betting companies and supporters.

Roughly speaking, three dimensions have been considered for analysing/synthesizing prediction systems: 1) those which analyse information on teams (endogenous) versus those which analyse results (exogenous); 2) those which exploit quantitative data versus those which exploit qualitative knowledge; and finally, 3) statistic-based ones versus other methods. Usually one works with hybrid models, and purely qualitative and exogenous reasoning systems rarely appear in the literature, although their use is considered for experiments (for example, frugal methods (Goldstein & Gigerenzer 2009) and methods based on the recognition heuristic (Goldstein & Gigerenzer 2002)) or as part of hybrid systems (see e.g. (Min et al. 2008)). Two reasons may justify this point. On the one hand, the transformation from a large quantitative dataset to a qualitative problem requires the selection of an acceptable threshold and the discovery of better relations (see e.g. (Imberman et al. 1999)). On the other hand, a qualitative dataset must be accompanied by some amount of information based on the confidence, trust or probability of these data sets.
The aim of this paper is to describe the research work carried out to select and compute attribute sets related to soccer results within a specific framework, FCA, starting from soccer match results with no previous analysis of any other specific attributes. This task precedes building an Expert System for advising on sports betting which could detect some kinds of regularities in data. Concept lattices, which are computed from attribute values, represent a mathematical structure of relationships among the concepts involved in the sport events selected for study. Since this method is bet-oriented, its performance is evaluated within a confidence-based reasoning system. This system increases the number of hits in soccer match forecasting by discovering temporal trends by means of data mining and association rule reasoning. The analysis of attributes has been used in (Aranda-Corral et al. 2011) to describe a confidence-based (and contextual) reasoning system for forecasting sports betting.

In this paper we analyse the attribute selection problem as a problem of selecting the features that shape the behaviour of the complex system that represents professional soccer leagues. The theoretical framework on which this model is based will be presented in (Aranda-Corral et al. 2011b). Due to the huge amount of information, attribute selection advised by experts is mandatory. In fact, the system can be considered a reasoning model based on bounded rationality and recognition heuristics, focused on features considered important by human experts. Therefore, the system aims to forecast results, but its design is based on bounded rationality models instead of statistical models (although hybrid models will be considered in the future).

Figure 1: FCA based model for prediction of qualitative features of Complex Systems
The system is a first prototype of a more general system that we are building to analyse qualitative features of Complex Systems (see Fig. 1) using FCA. The idea is to isolate qualitative attributes from (past) local interactions among the components of a complex system and to apply FCA tools in order to predict properties of the system's behaviour in the near future.

Background: Formal Concept Analysis

According to R. Wille, FCA (Ganter & Wille 1999) mathematizes the philosophical understanding of a concept as a unit of thought composed of two parts: the extent and the intent. The extent covers all objects belonging to the concept, while the intent comprises all common attributes valid for all the objects under consideration. It also allows the computation of concept hierarchies from data tables. In this section we succinctly present the basic FCA elements (the fundamental reference is (Ganter & Wille 1999)).

A formal context $M = (O, A, I)$ consists of two sets, $O$ (objects) and $A$ (attributes), and a relation $I \subseteq O \times A$. Finite contexts can be represented by a 1-0 table (identifying $I$ with a Boolean function on $O \times A$). See Fig. 2 for an example of a formal context about living beings.

The main goal of FCA is the computation of the concept lattice associated with the context. Given $X \subseteq O$ and $Y \subseteq A$, one defines

$$X' := \{a \in A \mid o\,I\,a \text{ for all } o \in X\}, \qquad Y' := \{o \in O \mid o\,I\,a \text{ for all } a \in Y\}.$$

A (formal) concept is a pair $(X, Y)$ such that $X' = Y$ and $Y' = X$. For example, the concepts of the living beings formal context (Fig. 2, left) are depicted in Fig. 2, right. In Fig. 2, each node is a concept, and its intension (or extension) is formed by the set of attributes (or objects) found along the path to the top (or bottom). E.g., the node tagged with the attribute Legs represents the concept $(\{Cat, Frog\}, \{Legs, Mobility, NeedWater\})$. In this paper we work with logical relations on attributes that are valid in the context.
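The derivation operators and the definition of a formal concept can be illustrated with a minimal Python sketch. The toy context below is hypothetical (it is not the context of Fig. 2, which is not reproduced here), and the names `CONTEXT`, `intent`, `extent` and `concepts` are illustrative assumptions. The brute-force enumeration is only meant for very small contexts; tools such as ConExp use more efficient algorithms.

```python
from itertools import combinations

# Hypothetical toy context (an assumption, not the one in Fig. 2):
# each object is mapped to the set of attributes it satisfies.
CONTEXT = {
    "cat":  {"legs", "mobility", "needs_water"},
    "frog": {"legs", "mobility", "needs_water", "lives_in_water"},
    "fish": {"mobility", "needs_water", "lives_in_water"},
    "reed": {"needs_water", "lives_in_water"},
}
ATTRIBUTES = set().union(*CONTEXT.values())


def intent(objects):
    """X': attributes shared by every object in X (all attributes if X is empty)."""
    result = set(ATTRIBUTES)
    for o in objects:
        result &= CONTEXT[o]
    return result


def extent(attributes):
    """Y': objects that satisfy every attribute in Y."""
    wanted = set(attributes)
    return {o for o, attrs in CONTEXT.items() if wanted <= attrs}


def concepts():
    """Enumerate all formal concepts (X, Y), with X' = Y and Y' = X, by brute force."""
    found = set()
    objs = sorted(CONTEXT)
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            y = intent(subset)   # closing a set of objects always yields an intent
            x = extent(y)        # its extent completes the concept
            found.add((frozenset(x), frozenset(y)))
    return found
```

For instance, in this toy context `extent({"legs"})` contains exactly `cat` and `frog`, and their common intent is `{"legs", "mobility", "needs_water"}`, which mirrors the Legs node discussed in the text.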
Logical expressions in FCA are implications between attributes. An implication is a pair of sets of attributes, written $Y_1 \to Y_2$, which is true with respect to $M = (O, A, I)$ according to the following definition. A subset $T \subseteq A$ respects $Y_1 \to Y_2$ if $Y_1 \not\subseteq T$ or $Y_2 \subseteq T$. We say that $Y_1 \to Y_2$ holds in $M$ ($M \models Y_1 \to Y_2$) if for all $o \in O$, the set $\{o\}'$ respects $Y_1 \to Y_2$. In that case, $Y_1 \to Y_2$ is said to be an implication of $M$.

Definition 1. Let $\mathcal{L}$ be a set of implications and $L$ be an implication.
1. $L$ follows from $\mathcal{L}$ ($\mathcal{L} \models L$) if each subset of $A$ respecting $\mathcal{L}$ also respects $L$.
2. $\mathcal{L}$ is complete if every implication of the context follows from $\mathcal{L}$.
3. $\mathcal{L}$ is non-redundant if for each $L \in \mathcal{L}$, $\mathcal{L} \setminus \{L\} \not\models L$.
4. $\mathcal{L}$ is an (implication) basis for $M$ if $\mathcal{L}$ is complete and non-redundant.

A basis, called the Stem Basis (SB), can be obtained from the pseudo-intents (Guigues & Duquenne 1986):

$$\mathcal{L} = \{Y \to Y'' : Y \text{ is a pseudo-intent}\}$$

Figure 2: Formal context, associated concept lattice and Stem Basis

An SB for the formal context on living beings is provided in Fig. 2 (right). It is important to remark that the SB is only one example of a basis for a formal context. In this paper no specific property of the SB is used, so it can be replaced by any implication basis.

It is possible to extend $\models$ to any propositional formula with propositional variables in $A$ by considering each object $o \in O$ as a valuation $v_o$ on $A$ defined by

$$v_o(a) = 1 \iff (o, a) \in I \quad \text{for each } a \in A.$$

Thus $M \models F$ if and only if $v_o \models F$ holds for every $o \in O$.

The Armstrong rules (Armstrong 1974) provide a formal basis for implicational reasoning:

$$\frac{}{X \to X}, \qquad \frac{X \to Y}{X \cup Z \to Y}, \qquad \frac{X \to Y, \quad Y \cup Z \to W}{X \cup Z \to W}$$

A set of implications is closed if and only if it is closed under these rules (Armstrong 1974). Defining $\vdash_A$ as the proof relation given by the Armstrong rules, it holds that implicational bases are $\vdash_A$-complete:

Theorem 2. Let $\mathcal{L}$ be an implicational basis for $M$, and $L$ an implication.
Then $M \models L$ if and only if $\mathcal{L} \vdash_A L$.

In order to work with formal contexts, stem bases and association rules, the ConExp¹ software has been selected. It is used as a library to build the module which provides the implications (and association rules) to the reasoning module of our system. The reasoning module is a production system based on the one designed for (Aranda-Corral & Borrego-Díaz 2010). Initially it works with the SB, and entailment is based on the following result:

Theorem 3. Let $S$ be a basis for $M$ and $\{A_1, \ldots, A_n\} \cup Y \subseteq A$. The following conditions are equivalent:
1. $S \cup \{A_1, \ldots, A_n\} \vdash_p Y$ ($\vdash_p$ is the entailment with the production system).
2. $S \vdash_A A_1, \ldots, A_n \to Y$.
3. $M \models \{A_1, \ldots, A_n\} \to Y$.

¹ http://sourceforge.net/projects/conexp/

Association rules for a formal context

We can consider a Stem Basis as an adequate production system for reasoning. However, a Stem Basis is designed for entailing true implications only, that is, implications without any exceptions in the object set; it does not cover implications with a low number of counterexamples in the context.

Another, more important question arises when working on predictions. In this case we are interested in methods for selecting one result among all the obtained results (even if they are mutually incoherent), and Theorem 3 does not provide such a method. Therefore, it is better to consider association rules (with confidence) instead of true implications, and the initial production system must be revised to work with confidence.

Research on logical reasoning methods for association rules is a relatively recent and promising line (Balcázar 2010). In FCA, association rules are implications between sets of attributes. Confidence and support are defined as usual.
Recall that the support $supp(X)$ of a set of attributes $X$ is defined as the proportion of objects which satisfy every attribute of $X$, and the confidence of an association rule is $conf(X \to Y) = supp(X \cup Y) / supp(X)$. Confidence can be interpreted as an estimate of the probability $P(Y \mid X)$, the probability of an object satisfying every attribute of $Y$ under the condition that it also satisfies every attribute of $X$. The ConExp software provides association rules (and their confidence) for formal contexts.

Reasoning under contextual selection. Logical Foundations

The model (described in (Aranda-Corral et al. 2011b)) is composed of events (objects) which have a number of properties (attributes). They constitute a universal formal context $\mathbb{M}$ (which we call the monster context, following the tradition in Model Theory). Thus $\mathbb{M}$ can be considered as the global memory from which subcontexts are extracted. Once a specific context is considered, it is also possible to consider background knowledge $\Delta$ (in the form of propositional logic formulas), which is combined with the knowledge extracted from the formal context (Stem Basis or association rules).

Figure 3: Context based reasoning system

Definition 4. Let $\mathbb{M} = (\mathbb{O}, \mathbb{A}, \mathbb{I})$ be the monster context, and let $O$ be a set of objects.
1. A context on $O$ is a context $M = (O_1, A, I)$ where $O \subseteq O_1 \subseteq \mathbb{O}$.
2. A contextual selection on $O$ and $M$ is a map $s : O \to \mathcal{P}(O_1) \times \mathcal{P}(A)$.
3. A contextual KB for an object $o \in O$ w.r.t. a selection $s$ with confidence $\gamma$ is a subset of the association rules with confidence greater than or equal to $\gamma$ of the formal context associated with $s(o) = (s_1(o), s_2(o))$, that is, of the context $M(s(o)) := (s_1(o), s_2(o), I \restriction_{s_1(o) \times s_2(o)})$ (note that when the confidence is 1 the contextual KB is an implicational basis).

Contextual KBs are useful for entailing attributes of an object.
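As an illustration of these definitions, the following minimal Python sketch computes support and confidence on a small context and filters candidate rules by a confidence threshold $\gamma$ inside a selected subcontext. The match data, the attribute names and the function names are hypothetical assumptions; in the system described here the rules themselves are supplied by ConExp, whereas this sketch only scores rules that are given as input.

```python
# Hypothetical context (an assumption): matches as objects, each mapped to the
# attributes it satisfies, including the result attribute "1", "X" or "2".
MATCHES = {
    "m1": {"home_streak", "1"},
    "m2": {"home_streak", "1"},
    "m3": {"home_streak", "X"},
    "m4": {"away_rich", "2"},
    "m5": {"home_streak", "away_rich", "1"},
}


def count(context, attrs):
    """Number of objects satisfying every attribute in attrs."""
    wanted = set(attrs)
    return sum(1 for row in context.values() if wanted <= row)


def support(context, attrs):
    """supp(X): proportion of objects satisfying every attribute of X."""
    return count(context, attrs) / len(context)


def confidence(context, premise, conclusion):
    """conf(X -> Y) = supp(X u Y) / supp(X), computed from counts.

    Returns None when no object satisfies the premise."""
    n_x = count(context, premise)
    if n_x == 0:
        return None
    return count(context, set(premise) | set(conclusion)) / n_x


def contextual_kb(context, objects, attributes, candidate_rules, gamma):
    """Candidate rules whose confidence in the subcontext (objects, attributes) is >= gamma."""
    allowed = set(attributes)
    sub = {o: context[o] & allowed for o in objects}  # restriction of I to the selection
    kb = []
    for premise, conclusion in candidate_rules:
        c = confidence(sub, premise, conclusion)
        if c is not None and c >= gamma:
            kb.append((premise, conclusion, c))
    return kb
```

With this toy data, the rule home_streak → 1 has confidence 3/4 over all five matches but only 2/3 over the first three, so whether it enters the contextual KB for a given $\gamma$ depends on the selected subcontext.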
The reasoning model on $\mathbb{M}$ is argumentative, where the argument is based on KBs extracted from subcontexts (Aranda-Corral et al. 2011b):

Definition 5. Let $L$ be an implication and $\Delta$ a background knowledge. $L$ is said to be a possible consequence of $\mathbb{M}$ under $\Delta$, written $\mathbb{M} \models^{\exists}_{\Delta} L$, if there exists a nonempty subcontext $M$ of $\mathbb{M}$ such that $M \models \Delta \cup \{L\}$.

Note that by Theorem 3, when $\Delta$ is a set of implications, $\models^{\exists}_{\Delta}$ is equivalent to $\vdash^{\exists}_{\Delta}$, which is defined by: $\mathbb{M} \vdash^{\exists}_{\Delta} L$ if there exists a subcontext $M$ of $\mathbb{M}$ with $M \models \Delta$ such that $S \vdash_p L$ (where $S$ is a stem basis for $M$).

The role of attribute selection for formal contexts

Attributes are essential in the contextual selection in order to build good formal contexts. Association rules are extracted from the contexts and are then used by the production system. By means of these association rules and some initial facts about the match we want to forecast, the production system infers the confidence (probability) of each of the three possible results of a match: home team wins, draw or away team wins. Thus attributes constitute one of the most important and sensitive parts of the system. They are sensitive because the accuracy of the inferred results depends on how well they represent the behaviour of the teams.

Confidence-based reasoning system

The reasoning system works on facts of the type $(a, c)$, where $a$ is an attribute and $c$ is the estimated probability that $a$ is true, which we also call confidence (by analogy with the same term for association rules). See (Aranda-Corral et al. 2011) for a more detailed description of the reasoning system.

The system has a module for confidence-based reasoning (Fig. 3). Its inputs for a match Team1 - Team2 are the contextual knowledge base for a given threshold, as the rule set, and the attribute values for the current match (except 1, X, 2), as facts, all of them with a confidence (whose value depends on the reasoning mode, see below).
The production system is executed and the output is a triple $\langle (1, c_1), (X, c_x), (2, c_2) \rangle$ of attribute-confidence pairs for this match. The attribute with the greatest confidence is selected as the prediction. Production system execution is standard, with several modes for computing the confidence of results based on uncertain reasoning in Expert Systems (Giarratano & Riley 2005). Any attribute/fact $a$ is initialized with confidence

$$conf(a) := \frac{|\{o : o\,I\,a\}| + 1}{|O| + 1}$$

Attributes and formal contexts for soccer league

For both selecting data and building contexts, some assumptions on forecasting in soccer league matches have been made. Such decisions can easily be reconsidered and recomputed in the system. First, we consider that the regularity of a team's behaviour depends only on the contextual selection that has been considered. This contextual selection is obtained by taking matches from the last X weeks backwards, starting from the week just before the one we want to forecast. Second, since FCA methods are used to discover regularity features, the model does not consider forecasting exceptions (unexpected results). Therefore, the model can be considered a starting point for a betting expert, who would adjust the attributes according to more personalised criteria.

Figure 4: Concept Lattice for the match Málaga-Sevilla (week 31, season 2009-10)

These attributes have to be computed and used to entail the forecast. This analysis is assisted by ConExp, which is used to compute and analyse the concept lattices associated with the temporal contexts. In order to select the most interesting attributes for the system, starting from an initial configuration, the user can compute the associated concept lattice and check it. In this way, the goodness of the attributes (and thresholds) can be evaluated in order to reconsider the current attribute selection. For example, Fig. 4 shows the concept lattice associated with the contextual selection for the Málaga-Sevilla match. This contextual selection is obtained from a given attribute selection and the matches of the previous 38 weeks. In this concept lattice, the attribute ID1T16 is defined by: 'the budget of team 2 is greater than $\gamma_1$ times the budget of team 1', where $\gamma_1$ is the threshold the expert must estimate.

In the concept lattice we can observe that the biggest concept containing the attributes 'team 2 wins' and ID1T16 covers about 10% of the objects owned by the first attribute; it is therefore suggested to use the second attribute for reasoning with association rules to get a prediction.

The system computes the values of a number of attributes on objects. Experimentally, Boolean combinations of attributes are also possible. Once the temporal context has been computed, the system can build contextual selections by selecting the match and the attribute set. The attributes were selected by considering four kinds of factors: those related to the classification, the history of the teams' matches in the recent past, the results of direct matches, and other unrelated factors, such as the difference between team budgets. Seventeen relevant attributes were selected. The attribute set has three special attributes: Team 1 wins (1), Team 2 wins (2) and draw (X).

With respect to the data source, data are automatically extracted from the RSSSF Archive². Objects are matches and attributes are a list of features, including a temporal stamp (week, year). Data were collected for the past four years. Currently the size of the context is about 300 objects and 18 attributes (although several of them are parametrized, see the section below). Thus, $|I|$ is about 5,100 pairs.

² http://www.rsssf.com

Attribute selection

We have chosen a small set of attributes that offers many possibilities through a few customizable parameters.
When these parameters are set to proper values, the set of attributes will represent the team's logical behaviour. Recall that formal concept analysis works with qualitative attributes, whereas all the team information we work with is quantitative data. Thus it is necessary to convert quantitative attributes into qualitative ones. This task is left to users, who choose a proper threshold for each attribute.

Before choosing the set of base attributes, we carried out an analysis of information about soccer results. The aim was to discover which factors are more influential in team behaviour and which are less influential. First of all, we collected every interesting factor found and, after analysing each one individually, we chose the most suitable ones. The examined factors can be classified in four different categories (see Table 1): those related to the season's classification, those related to the team's previous results, those related to historical direct matches, and other factors. It is worth noting that, to increase the possibilities of the attribute set and considering the Boolean nature of formal context attributes, we have added the option to create new attributes by means of logical combinations of these attributes.

According to the factors considered, the system computes a base set of 18 attributes, which are customizable through some parameters. This lets us obtain a diverse set of attributes. The attributes are specified in Table 2. Four parameters are used:

• Threshold: parameter used to translate quantitative attribute values into qualitative ones.
• Team: recall that in the formal context considered, objects are matches but attributes are properties of teams. This parameter sets the team of the object (match) for which the attribute is considered. It has two possible values: {HOME, AWAY}. Thus, usually, each attribute appears twice in the context, once for the home team and once for the away team.
• Number of Matches: sets the number of past matches to be considered when some attributes are computed, e.g. the ones associated with the team's previous results.
• Kind of matches: sets the type of past matches to be taken into account when computing some attributes, considering the team's home/away condition in those matches. Three possible values: {MATCHES AS HOME TEAM, MATCHES AS AWAY TEAM, ALL MATCHES}.

With these parameters, and the possibility of combining attributes, it is possible to build a detailed attribute set. Note that experiments show that the simplest and most logical attributes give a good representation of team behaviour. Nevertheless, we consider that a versatile attribute set, as described above, was necessary because a huge number of factors can determine the result of a soccer match.

The task of customizing the attribute set is left to users, and it is the most important one in the forecasting process. Thus, basic soccer knowledge is required. The quality of the customization will determine the system's results.

Computing problems

The competition format forces us to take some special situations into account when computing attribute values. In this section we describe the main problems that emerged and how they were fixed. Roughly speaking, these problems concern the initial matches of a season.

Beginning of a new season: week 0

This problem is not hard but, like many others, it is unavoidable, and a solution is essential. It happens when computing an attribute value related to league standings in order to forecast the first week of a season. As no previous week has been played yet, there is no way to build a standings table. When a team remains in the same league as in the previous season, a trivial solution is to take into account its positions and matches in the last weeks of the previous season. If the team played in a higher division last season, it is placed at the first position in the standings. Otherwise, if it played in a lower division, it is placed at the last position.
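To make the four parameters introduced under "Attribute selection" concrete, here is a minimal Python sketch of one parametrized attribute: the number of (not necessarily consecutive) won matches in previous weeks compared with a threshold. The `Match` record, the function name, the parameter names and the default values are hypothetical assumptions, not the system's actual implementation, and season boundaries are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical match record (an assumption for this sketch)."""
    week: int
    home: str
    away: str
    home_goals: int
    away_goals: int


def wins_attribute(history, match, team="HOME", threshold=2,
                   n_matches=5, kind="ALL_MATCHES"):
    """Boolean attribute: the chosen team won more than `threshold` of its
    last `n_matches` previous matches of the given kind."""
    name = match.home if team == "HOME" else match.away
    # Only matches played by the team before the match to forecast.
    past = [m for m in history
            if m.week < match.week and name in (m.home, m.away)]
    if kind == "MATCHES_AS_HOME_TEAM":
        past = [m for m in past if m.home == name]
    elif kind == "MATCHES_AS_AWAY_TEAM":
        past = [m for m in past if m.away == name]
    # Keep the most recent n_matches.
    past = sorted(past, key=lambda m: m.week)[-n_matches:]
    wins = sum(
        1 for m in past
        if (m.home == name and m.home_goals > m.away_goals)
        or (m.away == name and m.away_goals > m.home_goals)
    )
    return wins > threshold  # quantitative value turned into a qualitative one
```

Calling the function once with `team="HOME"` and once with `team="AWAY"` yields the two copies of the attribute that usually appear in the context.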
Missing matches in attribute computation

Another problem, closely related to the previous one, arises when not enough previous matches are available to compute an attribute. The solution is to take the last matches of the previous season as if they were on a continuous timeline. This is not so simple, because some teams were not playing in the same division last season. Indeed, when a team plays in a lower or higher division, the difficulty of the division changes and matches cannot be compared in the same way. Therefore, we need to handle the situation of a team that played in a division different from the current one.

Another troublesome situation in which there are not enough matches for attribute computation is computing results for direct matches between two teams, because there are only a few such matches in the data source.

For these two related situations we offer two solutions. The first is to compute the attribute with a null value, but in this way we give false information to the system: we state that the attribute is not true when, in fact, we do not have enough information to determine it, so a better approach is required. The chosen solution is based on adjusting the attribute's threshold. The value of this threshold is decreased in proportion to the ratio between the number of available matches and the number of required matches. The threshold $\gamma$ is revised by

$$\gamma_{new} = \gamma_{old} \cdot \frac{\text{number of match results available}}{\text{number of match results needed}}$$

When the number of required matches is too high and the number of available matches is low, it may seem that we are again giving false information to the system, but our experience shows that the side effects of this approach are negligible compared with computing attributes with a null value.

Attribute selection vs expert system behaviour

In general terms, the current base attribute set forecasts the most likely result of a match quite well under regular conditions.
Even so, some experiments have been carried out in order to study the behaviour of the attributes.

Strict attributes

An attribute is strict when only a few objects can satisfy it because its threshold is too high. By working with sets of strict attributes, we can be confident that they estimate team behaviour better than other sets. Thus, with strict attributes, we obtain very reliable estimates, but only for very few matches and none for most of the others. On the other hand, using less strict attributes, the system produces less reliable estimates but for a wider scope. So it is essential to find a balance between these two opposite situations: the reliability of the attribute set against the number of matches without information. A good solution could be to build and use different attribute sets, some stricter and others less strict. Thus, the less strict attribute sets are used when the strict ones fail to produce an estimate.

Trends towards the victory of the home team

It is a fact that, in soccer, a victory of the home team is more probable than a victory of the away team. To deal with this, we offer two different approaches: first, modelling team behaviour and, second, computing confidence values. For modelling team behaviour (attribute set customization) it is good practice to use attributes with less demanding thresholds for the home team and more demanding thresholds for attributes related to the away team. Therefore, it will be easier for the home team to satisfy an attribute than for the away team. It is possible to imitate this trend with this approach.

Table 1: Factors considered for selecting/building attributes

Factor | Correlation degree | Used?
Associated to the classification in the league
Team in the first classification level | medium/high | yes
Team in the last classification level | medium/high | yes
Difference between teams' classifications | medium/high | yes
Team was in a different league last year | medium | no
Team scored an important number of goals (in the last matches) | medium/low | no
Associated to previous results of the team
Number of consecutive won matches | high | yes
Number of consecutive lost matches | high | yes
Number of consecutive draws | medium | yes
Number of non-consecutive won matches in previous weeks | high | yes
Number of non-consecutive lost matches in previous weeks | high | yes
Number of non-consecutive draws in previous weeks | medium/high | yes
Points collected in previous weeks | medium/high | yes
Factors related to direct matches (including previous years)
Number of wins in previous direct matches | medium/high | yes
Number of losses in previous direct matches | medium/high | yes
Number of draws in previous direct matches | medium/high | yes
Other factors
Number of red cards collected by the team's players | low | no
Weather on the day and in the city where the match took place | medium | no (hard to parametrize)
Motivation due to the fans' support when playing as home team | high | no (hard to parametrize, subjective)
Team hires a new coach | high | no (only useful when a new coach is hired)
Some players of the team are selected for their national team | medium/low | no (relevant for some nationalities)
Difference between teams' budgets | high | yes
One or more important players of the team are injured | medium | no (hard to collect the data automatically)
Cups collected in recent years | low | no (only for a few teams)

Around 50% of played matches finish with a victory of the home team. This means that the attribute value corresponding to the match result will be 'home team victory' in around 50% of the objects of the formal context. As a consequence, many of the inferred association rules will contain the attribute 'result = home team victory' in their conclusions. Thus, when forecasting a match, the system will in most cases infer 'home team victory' as a consequence of the overestimated confidence value for this result.
It is possible to avoid this effect easily by applying a reduction factor to the confidence of 'home team victory'. This factor is estimated by means of experiments.

Results

Following the process described above, an experiment was run for the Spanish premier soccer league in 2009-10. Attributes were selected according to the experience of an expert, and the contextual KB was computed (Fig. 5 shows a KB fragment for the Málaga-Sevilla match). From this selection, $\vdash^{\exists}$ is computed for each match in each week.

2009-10 season: Experiments with the system show about 58.16% correct forecasts with a contextual selection based on the previous 38 matches of each team. Such a percentage of hits for a qualitative reasoning system may be considered an acceptable result, comparable with the results expected from experts (Goldstein & Gigerenzer 2009; Andersson et al. 2003). Experiments with other contextual selections show an increase in the number of hits of about 7% in the second half of the season. The reason is that data from the first half provide more recent information on teams and past matches.

Table 2: Attributes and parameters

Attribute | Configurable parameters
1) Number of non-consecutive won matches in previous weeks > threshold | <threshold> <Team> <Number of Matches> <Matches>
2) Number of non-consecutive lost matches in previous weeks > threshold | <threshold> <Team> <Number of Matches> <Matches>
3) Number of non-consecutive draws in previous weeks > threshold | <threshold> <Team> <Number of Matches> <Matches>
4) Points collected in previous matches > threshold | <threshold> <Team> <Number of Matches> <Matches>
5) Position in the classification based on previous matches > threshold | <threshold> <Team> <Number of Matches> <Matches>
6) Number of positions over the opponent in the classification based on previous matches > threshold | <threshold> <Team> <Number of Matches> <Matches>
7) Number of positions under the opponent in the classification based on previous matches > threshold | <threshold> <Team> <Number of Matches> <Matches>
8) Number of wins in previous direct matches (including previous leagues) > threshold | <threshold> <Team> <Number of Matches> <Matches>
9) Number of losses in previous direct matches (including previous leagues) > threshold | <threshold> <Team> <Number of Matches> <Matches>
10) Number of draws in previous direct matches (including previous leagues) > threshold | <threshold> <Number of Matches> <Matches>
11) Position in the classification > threshold | <threshold> <Team> <Matches>
12) Number of positions over the opponent in the classification > threshold | <threshold> <Team> <Matches>
13) Number of positions under the opponent in the classification > threshold | <threshold> <Team> <Matches>
14) Number of consecutive won matches > threshold | <threshold> <Team> <Matches>
15) Number of consecutive lost matches > threshold | <threshold> <Team> <Matches>
16) Number of consecutive draws > threshold | <threshold> <Team> <Matches>
17) Team's budget Y times bigger than opponent's budget (Y > threshold) | <threshold> <Team>
18) Team's budget Y times smaller than opponent's budget (Y > threshold) | <threshold> <Team>

Figure 5: KB fragment from Fig. 4

2010-11 season: Following the idea discussed above, we have evaluated the system in the second half of the 2010-11 soccer season. One way to evaluate how good this forecasting system is consists of comparing the number of successes in our pool with the most popular betting selections. These popular selections are collected from the most voted results for each match, published on the website of the state agency that controls soccer pools. Both results are compared in Fig. 6. Our hits are shown in blue and the popular ones in green, and the last seventeen weeks of the 2010-11 season are represented. Note that Spanish soccer pools cover 15 matches.
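The final prediction step (choosing the result with the greatest confidence after applying a reduction factor to 'home team victory') and the hit percentage used in these experiments can be sketched in Python as follows. This is only an illustration under stated assumptions: the multiplicative form of the reduction factor, its value, and the names `predict` and `hit_rate` are hypothetical, since the text only says that the factor is estimated experimentally.

```python
def predict(confidences, home_factor=1.0):
    """Return the result ("1", "X" or "2") with the greatest confidence.

    `confidences` maps each result to its inferred confidence. The home-win
    confidence is first multiplied by `home_factor` (assumed multiplicative;
    1.0 means no correction). Ties go to the first result in the mapping.
    """
    adjusted = dict(confidences)
    adjusted["1"] = adjusted["1"] * home_factor
    return max(adjusted, key=adjusted.get)


def hit_rate(predictions, results):
    """Percentage of matches whose predicted result equals the real one."""
    hits = sum(1 for p, r in zip(predictions, results) if p == r)
    return 100.0 * hits / len(results)
```

For example, with confidences 0.60, 0.20 and 0.55 for 1, X and 2, the uncorrected prediction is a home win, whereas a hypothetical factor of 0.8 lowers the home confidence to 0.48 and the prediction becomes an away win.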
Conclusions and Future Work

The challenge of detecting emergent concepts for reasoning about complex systems represents an exciting research field. Concepts of a qualitative nature are extracted from data considering only partial features of complex system dynamics, that is, a partial understanding of the system. In this paper, FCA is applied to this aim with a specific application. The attribute selection problem for an FCA-based reasoning system for sport forecasting is analysed. In fact, the reasoning system is a computational logic model for bounded rationality. The model is concerned with association rule reasoning and, in its current form, it does not use more sophisticated probability tools (as, for example, (Min et al. 2008)). As stated in (Goldstein & Gigerenzer 1996), the theory of probabilistic mental models assumes that inferences about unknown states of the world are based on probability cues (Brunswik 1955). One can say that the confidence of association rules extracted from subcontexts plays the role of probability cues.

Figure 6: Correct predictions on the last 17 weeks of the 2010-11 season

Figure 7: Comparison of correct predictions over the whole 2010-11 season (percentages)

No statistical approaches have been taken into account, since that was not the aim of this paper. However, a comparative study of our system against the C4.5 classifier has been carried out. For this, two different attribute selections were considered and used for both the C4.5 classifier and our system. The experiment consists of forecasting all matches (380) in the 2010-11 season. In order to estimate each match result, taking the week N as timestamp, previous matches from weeks N − 1 to N − 19 (190 objects) are used to build the contextual selection (or the training set in C4.5). Fig. 7 shows the percentage of correct predictions for our system and the C4.5 classifier, using both attribute selections.
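The sliding window over weeks N − 19 to N − 1 and the use of rule confidence as a probability cue can be sketched as follows. This is only an illustrative sketch: the data layout (one set of attribute names per match), the attribute names and the toy data are assumptions, and the confidence of a rule A → B is taken in the usual association-rule sense, as the fraction of objects satisfying A that also satisfy B.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical layout: each match is (week, set of attributes that hold for it).
Match = Tuple[int, FrozenSet[str]]


def contextual_selection(matches: List[Match], week: int,
                         window: int = 19) -> List[FrozenSet[str]]:
    """Objects of the subcontext: matches played in weeks N-window .. N-1."""
    return [attrs for w, attrs in matches if week - window <= w <= week - 1]


def confidence(context: List[FrozenSet[str]],
               premise: FrozenSet[str],
               conclusion: FrozenSet[str]) -> float:
    """Confidence of the association rule premise -> conclusion in `context`."""
    covered = [attrs for attrs in context if premise <= attrs]
    if not covered:
        return 0.0
    return sum(1 for attrs in covered if conclusion <= attrs) / len(covered)


# Toy data: 25 weeks with 10 matches each. In every week, 4 matches have the
# (hypothetical) attribute 'home_top' and 3 of those also have 'home_win'.
matches: List[Match] = []
for w in range(1, 26):
    for j in range(10):
        attrs = set()
        if j < 4:
            attrs.add("home_top")
        if j < 3:
            attrs.add("home_win")
        matches.append((w, frozenset(attrs)))

ctx = contextual_selection(matches, week=21)  # weeks 2..20
conf = confidence(ctx, frozenset({"home_top"}), frozenset({"home_win"}))
```

With 10 matches per week, a 19-week window gives the 190 objects mentioned in the text; early in the season the window simply contains fewer matches.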
Fig. 7 also shows other columns: 'users' most voted results', 'local team always wins' and two randomly generated ones. The randomly generated columns were built assuming different weights per result: < 1 : 55%, X : 23%, 2 : 22% > and < 1 : 65%, X : 18%, 2 : 17% > were used, where 1, X and 2 stand for the results local team wins, draw and away team wins, respectively, and the percentages are the probabilities of forecasting each result.

It is worth noting that, while the classifier achieves its highest performance (58.68%) when the number of matches increases from 190 to 380, our system reaches its highest performance (59.74%) using only 190 instances. This is because our system uses fast and frugal methods (Goldstein & Gigerenzer 1996), which are designed to achieve acceptable results using as few resources as possible.

The relationship of our proposal with the Recognition Heuristic (Goldstein & Gigerenzer 2002) (roughly speaking, if one of the possibilities is recognized and the other is not, then infer that the recognized object has the higher value with respect to the criterion) is not clear. We may assert that our model recognises trends in contexts. Trends (represented as association rules) can nevertheless be considered a kind of recognition method. The system is based on bounded rationality models instead of statistical models, although hybrid models will be considered in the future.

In the short term, we will continue extending our system so that it can combine the results of two or more attribute sets with different exigency levels. The system will then return a single, more reliable result. In the long term, we aim to extend the model in order to obtain a general system to detect emergent concepts in Complex Systems.

After some real betting experiments during the current season (2010-2011) with one customized attribute set, we have observed another intriguing fact.
Looking at the number of successful predictions per week, we can distinguish groups of consecutive weeks in which the number of correct predictions is under or over the average. Recall that these predictions are the results logically inferred from one customized attribute set. This suggests that it could be possible to find another attribute set, with a different parameter customization, which would complement the correct predictions of the first attribute set. That is, when the first attribute set produces bad forecasts, the second should produce good ones, and vice versa. The reason is that a match does not have only one possible logical result. For example, when one of the top teams of the current ranking plays against one of the bottom teams, the logical result according to ranking criteria would be that the former wins. But if we consider other criteria, such as the first team having lost last week and the second team having won in the last 5 weeks, the result would be different. Future work involves finding these complementary attribute sets and detecting when their behaviour changes during the season, in order to select the proper attribute set to forecast each week.

Finally, we are also analyzing how to find a weight for matches that allows the system to work with matches from different divisions simultaneously. Note that a win in the first division will have a higher weight than a win in the second. This will be really useful at the beginning of the season, because we need to compute attributes related to previous match results, and the teams involved may have played in different divisions last season.

Acknowledgements

Supported by the TIN2009-09492 project of the Spanish Ministry of Science and Innovation, and the Excellence project TIC-6064 of the Junta de Andalucía, cofinanced with FEDER funds.

References

Why Spain will win..., Engineering & Technology, 5 June - 18 June 2010.
J. A. Alonso-Jiménez, G. A. Aranda-Corral, J. Borrego-Díaz, M. M. Fernández-Lebrón, M. J. Hidalgo-Doblado, Extending Attribute Exploration by Means of Boolean Derivatives, Proc. 6th Int. Conf. Concept Lattices and Their Applications (CLA2008), pp. 121-132 (2008).
P. Andersson, M. Ekman, J. Edman, Forecasting the fast and frugal way: A study of performance and information-processing strategies of experts and non-experts when predicting the World Cup 2002 in soccer, Working Paper Series in Business Administration 2003:9, Stockholm School of Economics.
G. A. Aranda-Corral, J. Borrego-Díaz, Reconciling Knowledge in social tagging web services. Proc. 5th Int. Conf. Hybrid AI Systems (HAIS 2010), LNAI, vol. 6077. Springer-Verlag, Berlin, 383-390 (2010).
G. A. Aranda-Corral, J. Borrego-Díaz, J. Galán-Páez, Confidence-Based Reasoning with Local Temporal Formal Contexts. To appear in IWANN 2011, LNCS (2011).
G. A. Aranda-Corral, J. Borrego-Díaz, J. Galán-Páez, Bounded Rationality for Data Reasoning based on Formal Concept Analysis. To appear in DEXA Workshop DALI (2011).
W. Armstrong, Dependency structures of data base relationships. Proc. of IFIP Congress, Geneva, 580-583 (1974).
J. L. Balcázar, Redundancy, Deduction Schemes, and Minimum-Size Bases for Association Rules, Logical Methods in Computer Science 6(2):1-23 (2010).
E. Brunswik, Representative design and probabilistic theory in a functional psychology. Psychological Review, (62):193-217 (1955).
B. Ganter, R. Wille, Formal Concept Analysis - Mathematical Foundations. Springer (1999).
J. C. Giarratano, G. D. Riley, Expert Systems: Principles and Programming. Brooks/Cole Publishing Co (2005).
D. G. Goldstein, G. Gigerenzer, Reasoning The Fast and Frugal Way: Models of Bounded Rationality, Psychological Review 103(4): 650-669 (1996).
D. G. Goldstein, G. Gigerenzer, Models of ecological rationality: the recognition heuristic, Psychological Review 109(1): 75-90 (2002).
D. G. Goldstein, G. Gigerenzer, Fast and frugal forecasting. International Journal of Forecasting 25, 760-772 (2009).
J.-L. Guigues, V. Duquenne, Familles minimales d'implications informatives resultant d'un tableau de donnees binaires. Math. Sci. Humaines 95, 5-18 (1986).
S. P. Imberman, B. Domanski, R. A. Orchard, Using Booleanized Data To Discover Better Relationships Between Metrics. Int. CMG Conference, 530-539 (1999).
B. Min, J. Kim, C. Choe, H. Eom, R. I. McKay, A compound framework for sports results prediction: A football case study. Know.-Based Syst. 21(7):551-562 (2008).
