PREMA: Principled Tensor Data Recovery from Multiple Aggregated Views


Authors: Faisal M. Almutairi, Charilaos I. Kanatsoulis, Nicholas D. Sidiropoulos

Abstract—Multidimensional data have become ubiquitous and are frequently encountered in situations where the information is aggregated over multiple data atoms. The aggregation can be over time or other features, such as geographical location. We often have access to multiple aggregated views of the same data, each aggregated in one or more dimensions, especially when data are collected or measured by different agencies. For instance, item sales can be aggregated temporally, and over groups of stores based on their location or affiliation. However, data mining and machine learning models benefit from detailed data for personalized analysis and prediction. Thus, data disaggregation algorithms are becoming increasingly important in various domains. The goal of this paper is to reconstruct finer-scale data from multiple coarse views, aggregated over different (subsets of) dimensions. The proposed method, called PREMA, leverages low-rank tensor factorization tools to fuse the multiple views and provide recovery guarantees under certain conditions. PREMA can tackle challenging scenarios, such as missing or partially observed data, double aggregation, and even blind disaggregation (without knowledge of the aggregation patterns) using a variant of PREMA called B-PREMA. To showcase the effectiveness of PREMA, the paper includes extensive experiments using real data from different domains: retail sales, crime counts, and weather observations.
Index Terms—Data disaggregation, tensor decomposition, tensor mode product, multidimensional (tensor) data, multiview data

1 INTRODUCTION

Data aggregation is the process of summing (or averaging) multiple data samples from a certain dataset, which results in data resolution reduction and compression. The most common type of aggregation is temporal aggregation. For example, annual income is the aggregate of the monthly salary. Aggregation over other attributes is also common, e.g., data get aggregated geographically (e.g., the population of New York) or according to a defined affiliation (e.g., employees of Company X). The latter is known in economics as contemporaneous aggregation [1]. The different types of aggregation are often combined, e.g., the number of foreigners who visited different US states in 2019 can be aggregated in time, location (states), and affiliation (nationality).

In some cases, it is the data collection process that limits data resolution in the first place, e.g., Store X records item sales only on a monthly basis. Aggregated data also exist for other reasons, the most important being data summarization. In particular, aggregated data enjoy concise representations, which is critical in the era of data deluge. Aggregation also benefits various other purposes, including scalability [2], communication and storage costs [3], and privacy [4]. Aggregated data are common in a wide range of domains, such as economics, health care [5], education [6], wireless communication, signal and image processing, databases [7], and smart grid systems [8].

• F. M. Almutairi and C. I. Kanatsoulis are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA. E-mails: (almut012, kanat003)@umn.edu.
• N. D. Sidiropoulos is with the Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22903, USA. E-mail: nikos@virginia.edu.
The work of C. I. Kanatsoulis and N. D. Sidiropoulos was partially supported by the National Science Foundation under Grants NSF IIS-1704074 and NSF ECCS-1608961.

Unfortunately, the favorable properties of data aggregation come with major shortcomings. A plethora of data mining and machine learning tasks strive for data at finer granularity (disaggregated), so data aggregation is undesirable. Along the same lines, algorithms designed for personalized analysis and accurate prediction benefit significantly from enhanced data resolution. Analysis results can differ substantially when using aggregated versus disaggregated data. In particular, studies in the field of economics show that data aggregation results in information loss and misleading conclusions at the individual level [9], [10]. Furthermore, in supply chain management, researchers have concluded that aggregating sales over time, products, or locations has a negative impact on demand forecasting [11]. On the other hand, disaggregation prior to analysis is very effective in environmental studies [12], and leads to richer findings in learning analytics [13].

The previous discussion reveals a clear trade-off between the need for data aggregation and the benefit of disaggregated data. This has motivated numerous works on developing algorithms for data disaggregation. In general, data disaggregation is an ill-posed inverse problem. To handle it, classic techniques exploit side information or domain knowledge in an attempt to make the problem overdetermined and consequently enhance the disaggregation accuracy. Common prior models, imposed on the target higher-resolution data, involve smoothness, periodicity [14], and non-negativity plus sparsity over a given dictionary [15]. Such prior constraints are invoked when no other information is available about the data to be disaggregated.
Fig. 1: PREMA is effective with real data. (Bar chart of the NDE achieved by Mean, LS, H-Fuse, HomeRun, CSMF, PREMA, and B-PREMA.)

An interesting question arises when a dataset is aggregated over more than one dimension. This is a popular research problem in the area of business and economics going back to the 70's [16], [17]. In this case, temporal and contemporaneous aggregated data are available [18]. For instance, given a country consisting of regions, we are interested in estimating the quarterly gross regional product (GRP) values, given the annual GRP per region (temporal aggregate) and the quarterly national series (contemporaneous aggregate) [19]. Another notable example appears in healthcare, where data are collected by national, regional, and local government agencies, health and scientific organizations, insurance companies, and other entities, and are often aggregated in many dimensions (e.g., temporally, geographically, or by groups of hospitals), often to preserve privacy [5].

Algorithms have been developed to integrate the multiple aggregates in the disaggregation task [16], [17], [18], [19], [20]. The general disaggregation problem is ill-posed, which is clearly undesirable, even with multiple aggregates. Therefore, the majority of these works resort to adopting linear regression models with priors and additional information. However, it is unclear whether these formulations can identify the true disaggregated dataset under reasonable conditions. In this context, identifiability has not received the attention it deserves, likely because guaranteed recovery is considered mission impossible under realistic conditions. With multiview data aggregated in different ways, however, the problem can be well-posed, as we will show in this paper.
Our work is inspired by the following question: Is the disaggregation task possible when the data are: 1) multidimensional, and 2) observed by different agencies via diverse aggregation mechanisms? This is a well-motivated problem due to the ubiquitous presence of data with multiple dimensions (three or more), also known as tensors, in a large number of applications. Note that aggregation often happens in more than one dimension of the same data, as in the previously explained examples. The informal definition of the problem is given as follows:

Informal Problem 1 (Multidimensional Disaggregation).
• Given: two (or more) observations of a multidimensional dataset, each representing a different coarse view of the same data aggregated in one dimension (e.g., temporal and contemporaneous aggregates).
• Recover: the data in higher resolution (disaggregated) in all the dimensions.

We propose PREMA: a framework for fusing the multiple aggregates of multidimensional data. The proposed approach represents the target high-resolution data as a tensor, and models that tensor using the canonical polyadic decomposition (CPD) to reduce the number of unknowns, while capturing correlations and higher-order statistical dependencies across dimensions. PREMA employs a coupled CPD approach and estimates the low-rank factors of the target data to perform the disaggregation task. This way, the originally ill-posed disaggregation problem is transformed into an overdetermined one by leveraging the uniqueness properties of the CPD. PREMA is flexible in the sense that it can perform the disaggregation task on partially observed data, or data with missing entries. This is practically important, as partially observed data appear often in real-world applications.
Our PREMA approach takes into account several well-known challenges that often emerge in real-life databases: the available measurements can have different scales (e.g., mixed monthly and yearly sums), gaps in the timeline (i.e., periods with no value reported), or time overlap (i.e., periods covered by more than one measurement). We also propose a variant of PREMA, called B-PREMA, that handles the disaggregation task in cases where the aggregation pattern is unknown. The proposed framework not only provides a disaggregation algorithm, but also gives insights that can be exploited in creating more accurate data summaries for database applications. Interestingly, our work also provides insights on when aggregation does not preserve anonymity.

We evaluated PREMA on real data from different domains, i.e., retail sales, crime counts, and weather observations. Experimental results show that the proposed algorithm reduces the disaggregation error of the best baseline by up to 67%. Figure 1 shows the Normalized Disaggregation Error (NDE) of PREMA and the baselines with real data of the weekly sales counts of items in different stores of a retail company (CRA dataset, described in Section 4.1). We are given two observations: 1) monthly sales aggregates per store, and 2) weekly sales aggregated over groups of stores (94 stores are geographically divided into 18 areas). PREMA outperforms all the competitors, even when the disaggregation pattern is unknown (B-PREMA), while all the baselines use the aggregation information. The fact that the naive mean (Mean) gives a large error indicates that the data are not smooth and the task is difficult.

In summary, the contributions of our work are:
• Formulation: we formally define the multidimensional data disaggregation task from multiple views, aggregated across different dimensions, and provide an efficient algorithm.
• Identifiability: the considered model can provably transform the original ill-posed disaggregation problem into an identifiable one.
• Effectiveness: PREMA recovers data with large improvement over the competing methods on real data.
• Unknown aggregation: the proposed model works even when the aggregation mechanism is unknown.
• Flexibility: PREMA can disaggregate partially observed data.

Reproducibility: The datasets we use are publicly available; our code is also available online at https://github.com/FaisalAlmutairi/Prema.

TABLE 1: Symbols and Definitions
Symbol      Definition
x, X, X     Vector, matrix, tensor
X_n         Mode-n matricization (unfolding)
‖·‖_F       Frobenius norm of a matrix/tensor
X^T         Transpose of matrix X
vec(·)      Vectorization operator of a matrix/tensor
[[·]]       Kruskal operator, e.g., X ≈ [[A, B, C]]
∘           Outer product
⊗           Kronecker product
⊙           Khatri-Rao product (column-wise Kronecker)
⊛           Hadamard (element-wise) product

Preliminary results of part of this work were presented in the Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2020 [21]. In this journal version, the problem formulation is generalized to handle aggregated data with missing entries. Although accounting for missing entries makes the problem more complicated, our proposed models and careful algorithmic design yield an algorithmic framework that is efficient and comparable to [21] (which does not handle missing entries), both in terms of accuracy and computational complexity. We also provide identifiability proofs, detailed model and complexity analysis, and conduct extensive experiments.

The rest of the paper is structured as follows. We explain the needed background and the related work in Section 2, and introduce our proposed method in Section 3.
Then, we explain our experimental setup in Section 4 and show the experimental results in Section 5. Finally, we summarize conclusions and take-home points in Section 6.

2 BACKGROUND & RELATED WORK

In this section, we review some tensor algebraic tools utilized by the proposed framework, define the disaggregation problem, and provide an overview of the related work. Table 1 summarizes the main symbols and operators used throughout the paper.

2.1 Tensor Algebra

Tensors are multidimensional arrays indexed by three or more indices, (i, j, k, ...). A third-order tensor X ∈ R^{I×J×K} consists of three modes: columns X(:, j, k), rows X(i, :, k), and fibers X(i, j, :). Moreover, X(i, :, :), X(:, j, :), and X(:, :, k) denote the i-th horizontal, j-th lateral, and k-th frontal slabs of X, respectively.

Tensor decomposition (CPD/PARAFAC): The outer product of two vectors (a ∘ b) results in a rank-one matrix. A rank-one third-order tensor X ∈ R^{I×J×K} is an outer product of three vectors: X(i, j, k) = a(i)b(j)c(k), ∀ i ∈ {1, ..., I}, j ∈ {1, ..., J}, and k ∈ {1, ..., K}, i.e., X = (a ∘ b ∘ c), where a ∈ R^I, b ∈ R^J, and c ∈ R^K. The Canonical Polyadic Decomposition (CPD), also known as PARAFAC, of a third-order tensor X ∈ R^{I×J×K} decomposes it into a sum of R rank-one tensors [22], i.e.,

X = Σ_{r=1}^{R} a_r ∘ b_r ∘ c_r    (1)

where R is the tensor rank and represents the minimum number of outer products needed, and a_r ∈ R^I, b_r ∈ R^J, and c_r ∈ R^K. For brevity, we use X = [[A, B, C]] to denote the relationship in (1). A ∈ R^{I×R}, B ∈ R^{J×R}, and C ∈ R^{K×R} are the factor matrices with columns a_r, b_r, and c_r, respectively, i.e., A = [a_1 a_2 ... a_R], and likewise for B and C.

Fig. 2: Illustration of the mode product with (I_u < I), (J_v < J), and (K_w < K).
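As a concrete illustration of the CPD in Eq. (1), the following NumPy sketch (variable names are ours, not from the paper) builds a rank-R tensor from its factor matrices and checks one entry against the scalar definition:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# X = sum_r a_r ∘ b_r ∘ c_r  (Eq. (1)); einsum forms all R outer products at once
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Check one entry against X(i,j,k) = sum_r a_r(i) b_r(j) c_r(k)
i, j, k = 1, 2, 3
assert np.isclose(X[i, j, k], np.sum(A[i] * B[j] * C[k]))
```

The same `einsum` pattern extends directly to higher-order tensors by adding index letters.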
CPD uniqueness: An important property of the CPD is that A, B, C are essentially unique under mild conditions. CPD identifiability is established by the following theorem:

Theorem 1. [23] Let X = [[A, B, C]] with A: I × R, B: J × R, and C: K × R. Assume I ≥ J ≥ K without loss of generality. If R ≤ (1/16)JK, then the decomposition of X in terms of A, B, and C is essentially unique, almost surely, i.e., for almost every (A, B, C) except for a set of Lebesgue measure zero.

Essential uniqueness means that A, B, C are unique up to common column permutation and scaling (scaling a column of one matrix that is compensated by counter-scaling the corresponding column of another matrix). The CPD is also essentially unique even if the tensor is incomplete (has missing entries). Several results have established CPD identifiability of tensors with missing entries, considering fiber-sampled [24], regularly sampled [25], or randomly sampled tensors [26]. The conditions for uniqueness are in general stricter compared to the case where the full tensor is available.

Tensor matricization (unfolding): There are three different ways to unfold (obtain a matrix view of) a third-order tensor X of size I × J × K. First, the mode-3 unfolding is obtained by the vectorization and parallel stacking of the frontal slabs X(:, :, k) as follows [27]:

X_3 = [vec(X(:, :, 1)), ..., vec(X(:, :, K))] ∈ R^{IJ×K}    (2)

Equivalently, we can express X_3 using the CPD factor matrices as X_3 = (B ⊙ A)C^T.
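The unfolding identity X_3 = (B ⊙ A)C^T in Eq. (2) can be verified numerically. The sketch below (our own helper names; `khatri_rao` is an assumed utility, not from the paper) uses column-major vectorization, matching the vec(·) convention of Eq. (2):

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product: column r is kron(P[:, r], Q[:, r])."""
    R = P.shape[1]
    return (P[:, None, :] * Q[None, :, :]).reshape(-1, R)

rng = np.random.default_rng(1)
I, J, K, R = 3, 4, 5, 2
A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Mode-3 unfolding: stack vec(X(:,:,k)) as columns (column-major vec), Eq. (2)
X3 = X.reshape(I * J, K, order='F')

# CPD identity: X_3 = (B ⊙ A) C^T
assert np.allclose(X3, khatri_rao(B, A) @ C.T)
```

The `order='F'` reshape is what makes the NumPy layout agree with the column-major vec(·) used in the paper's unfoldings.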
In the same vein, we may consider horizontal slabs to express the matricization over the first mode

X_1 := [vec(X(1, :, :)), ..., vec(X(I, :, :))] = (C ⊙ B)A^T ∈ R^{JK×I}    (3)

or lateral slabs to obtain the mode-2 unfolding

X_2 := [vec(X(:, 1, :)), ..., vec(X(:, J, :))] = (C ⊙ A)B^T ∈ R^{IK×J}    (4)

Mode product: This is the operation of multiplying a tensor by a matrix in one particular mode; e.g., the mode-1 product of a matrix U ∈ R^{I_u×I} and a tensor X ∈ R^{I×J×K} corresponds to multiplying every column X(:, j, k) of the tensor by U [28]. Similarly, the mode-2 (mode-3) product corresponds to multiplying every row (fiber) of X by a matrix. Applying mode-1, mode-2, and mode-3 products to a third-order tensor X ∈ R^{I×J×K} jointly is represented using the following notation:

Y = X ×_1 U ×_2 V ×_3 W ∈ R^{I_u×J_v×K_w}    (5)

where "×_n" denotes the product over the n-th mode, U ∈ R^{I_u×I}, V ∈ R^{J_v×J}, and W ∈ R^{K_w×K}. Mode-1 multiplication reduces the tensor size in the first dimension if (I_u < I), and similarly for the other modes; see Fig. 2. Moreover, if the rows of U are binary vectors with more than one 1, then each horizontal slab of Y is a sum of the horizontal slabs of X that correspond to the 1's in a particular row of U. In the same vein, V and W could aggregate the lateral and frontal slabs, respectively. The mode product is also reflected in the CPD of the tensor, i.e., if X in the operation in (5) admits X = [[A, B, C]], then Y = [[UA, VB, WC]].
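The fact that mode products carry over to the CPD factors (Y = [[UA, VB, WC]]) is easy to check numerically. In the sketch below (our own toy sizes and aggregation matrices; V is taken as the identity and therefore omitted), each row of the binary matrices sums a group of slabs:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K, R = 4, 4, 6, 2
A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Binary aggregation matrices: each row sums a group of slabs
U = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.]])            # aggregates mode 1: 4 -> 2
W = np.kron(np.eye(3), np.ones((1, 2)))     # aggregates mode 3: 6 -> 3

# Mode products: Y = X x1 U x3 W
Y = np.einsum('ui,ijk,wk->ujw', U, X, W)

# The mode product is reflected in the CPD: Y = [[UA, B, WC]] (V = identity here)
assert np.allclose(Y, np.einsum('ir,jr,kr->ijk', U @ A, B, W @ C))
```

This property is exactly what couples the low-rank factors of the aggregated views to those of the target tensor.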
This reveals the challenge of the task, which is the relationship between the number of equations and the number of unknowns; a detailed analysis follows in the next section.

In the disaggregation problem, we are given a set of measurements y ∈ R^{I_u} aggregated over the dataset x ∈ R^I, and our goal is to find x. This can be cast as a linear inverse problem y = Ux, where U ∈ R^{I_u×I} is a 'fat' aggregation matrix that relates the measurements to the unknown variables. In this work, we consider the case where the target high-resolution data are multidimensional (a tensor). Specifically, let X ∈ R^{I×J×K} be the target high-resolution third-order tensor. In the considered problem, we are given two sets of observations, each aggregated over one or more different dimension(s). This is common when data are reported by different agencies, resulting in multiple views of the same information. The key insight is that the given aggregates can be modeled as mode product(s) of X by an aggregation matrix in particular mode(s). To see this, consider the tensor X ∈ R^{4×2×2}; a simple example of a set of observations aggregated over the first mode can be expressed as U X_1^T = Y_1^T, i.e.,

U = [1 1 0 0; 0 0 1 1] ∈ R^{2×4},
X_1^T = [x_111 x_121 x_112 x_122; x_211 x_221 x_212 x_222; x_311 x_321 x_312 x_322; x_411 x_421 x_412 x_422] ∈ R^{4×(2×2)},
Y_1^T = [y_111 y_121 y_112 y_122; y_211 y_221 y_212 y_222] ∈ R^{2×(2×2)}    (6)

where X_1 and Y_1 are the mode-1 unfoldings of X and Y, respectively. The same idea applies when the aggregation is over the second (third) mode using the mode-2 (mode-3) product, respectively. In practical settings, the number of available aggregated measurements is much smaller than the number of variables (i.e., I_u ≪ I), resulting in an underdetermined, ill-posed problem. This is the major challenge of disaggregation, even when more than one set of aggregates is available.
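The underdetermined nature of the flattened system can be seen directly with the 4 × 2 × 2 example of Eq. (6). The sketch below (our own construction) applies the binary U and counts equations versus unknowns:

```python
import numpy as np

# Target tensor X ∈ R^{4x2x2} and the binary aggregation matrix U from Eq. (6)
rng = np.random.default_rng(3)
X = rng.integers(0, 10, size=(4, 2, 2)).astype(float)
U = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.]])

# Y = X x1 U: each horizontal slab of Y is a sum of horizontal slabs of X
Y = np.einsum('ui,ijk->ujk', U, X)
assert np.allclose(Y[0], X[0] + X[1])
assert np.allclose(Y[1], X[2] + X[3])

# Flattened system: one copy of U per (j,k) column of the unfolding, so the
# full linear map has 8 equations for 16 unknowns: plain least squares is ill-posed.
M = np.kron(np.eye(4), U)
assert M.shape == (8, 16)
assert np.linalg.matrix_rank(M) == 8
```

With rank 8 against 16 unknowns, infinitely many tensors match the aggregate, which is why extra structure (here, the low-rank CPD model) is needed.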
An even more challenging case appears when one of the available observation sets is aggregated over more than one mode/dimension simultaneously (e.g., Y ∈ R^{I_u×J_v×K}, where I_u < I and J_v < J). For instance, sales are reported for categories rather than individual items, and over groups of stores. This is a double aggregation over stores and items, and the proposed method can work under such a challenging scenario. Moreover, the aggregated observations might be partially observed (i.e., Y_1 in (6) has missing entries). This makes the problem more complicated; however, our approach efficiently handles data with missing entries.

2.3 Related Work

Data disaggregation and fusion: The problem of data integration and fusion [29], [30] from multiple sources has attracted the attention of several communities, due to the increasing access to all kinds of data, especially in database applications. A very challenging task in data integration is that of recovering a sequence of events (e.g., a time series) from multiple aggregated reports [15], [31], [32], [33]. A common approach is to formulate the problem as linear least squares, as in (6). In practice, however, the number of available aggregated samples is often significantly smaller than the length of the target series, resulting in an underdetermined system of equations. To resolve this, previous algorithms have resorted to Tikhonov-type regularization of the ill-posed problem to impose some domain knowledge constraints, e.g., smoothness and periodicity [14].

Fusing multiple observations aggregated in different dimensions for disaggregation purposes is a well-studied task in the field of finance and economics [16], [17], [18], [19], [20]. The considered approaches try to exploit linear relations between the target series in high resolution and the available aggregated measurements. However, this results in an underdetermined linear system, even with multiple aggregates.
Therefore, the majority of these works assume linear regression models with priors and additional information. Moreover, it is unclear whether the assumed models are identifiable, i.e., the model is not guaranteed to disaggregate the data.

(Coupled) tensor factorization: Time series analysis, for various applications, is moving towards modern high-dimensional methods. For example, matrix and tensor factorization have been used in demand forecasting [34], mining and information extraction from complex time-stamped series [35], and prediction of unknown locations in spatio-temporal data [36].

Data share common dimension(s) in a wide spectrum of applications. In such cases, coupled factorization techniques are commonly used to fuse the information for various objectives. For example, coupled factorization is often employed to integrate contextual information into the main data [37]. In recommender systems, for instance, we have a (user × item × time) tensor and a (user × features) matrix. In this case, the tensor and the features matrix are coupled in the user mode [38]. Coupled tensor factorization has also been proposed for image processing [39], remote sensing [40], and medical imaging problems [25], [41]. Closest to our work is the approach in [42], which employs a coupled CPD to fuse a hyperspectral image with a multispectral image to produce an image with high spatial and spectral resolution. To our knowledge, this work and its conference version [21] are the first to propose a tensor factorization approach to tackle data disaggregation applications.

Fig. 3: Overview of PREMA.

3 PROPOSED FRAMEWORK: PREMA

Multidimensional data are indexed by multiple indices, e.g., (i, j, k). Therefore, they can naturally be represented as a tensor X ∈ R^{I×J×K}.
The different modes represent the physical dimensions of the data (e.g., time stamps, locations, items, users). For simplicity of exposition, we focus on three-dimensional data in our formulations and algorithms. However, the proposed framework can handle more general cases with data of higher dimensions. In the remainder of this section, we give a detailed description and analysis of PREMA. In particular, we state the problem and explain the proposed model at a high level in Section 3.1, formulate PREMA in Section 3.2, and present the main algorithm in Section 3.3. We discuss the complexity of PREMA in Section 3.4, and identifiability in Section 3.5. Finally, we introduce B-PREMA in Section 3.6, to tackle the disaggregation problem in the case where the aggregation matrices are unknown.

3.1 Problem & Model Overview

Multidimensional aggregation is common when data are collected or released by different agencies, resulting in multiple views of the same dataset. We will explain the concept with the example of retail sales, which we use in the experiments. Estimating retail sales at higher resolution enables accurate forecasting of future demand and planning of economically efficient commerce. There are two sources of data used for this forecasting task: 1) Point of Sale (POS) data at the store level, commonly aggregated in time (temporal aggregate Y_t); and 2) historical orders made to the suppliers by the retailers' Distribution Centers (DC orders), aggregated over their multiple stores (contemporaneous aggregate Y_c). In particular, DC order data are immediately available to the suppliers, whereas the POS data are owned by the retailers. Both DC order and POS data are used to forecast demand, and POS data in particular are vital in predicting future orders [43]. For that reason, many retailers share POS data with their suppliers to assist in forecasting orders and avoid shortage or excess in inventory [44].
In a more restricted scenario, the second source collects information about each category of items rather than each item individually. Oftentimes, data are partially observed, i.e., Y_t and Y_c have missing entries. In this example, not all items are offered in all stores during all the considered time stamps. The question that arises is whether we can fuse these sources to reconstruct high-resolution data in the store, item, and time dimensions. Formally, we are interested in the following:

Problem 1 (Multidimensional Disaggregation).
• Given two aggregated views of three-dimensional data X ∈ R^{I×J×K}: Y_t ∈ R^{I×J×K_w} and Y_c ∈ R^{I_u×J×K} (or Y_c ∈ R^{I_u×J_v×K}), with I_u < I, J_v < J, and K_w < K, and possibly missing entries.
• Recover the original disaggregated multidimensional data X ∈ R^{I×J×K}.

Note that each aggregated view is the result of the mode product of the target data with an aggregation matrix. In particular, Y_t = X ×_3 W, where W ∈ R^{K_w×K} is an aggregation matrix with K_w < K, and Y_c = X ×_1 U, where U ∈ R^{I_u×I} is an aggregation matrix with I_u < I. In the case where one view is jointly aggregated in two dimensions, e.g., sales are aggregated over groups of stores and groups of items, Y_c = X ×_1 U ×_2 V, where V ∈ R^{J_v×J} is an aggregation matrix with J_v < J.

PREMA aims to fuse the different available aggregates in order to estimate the multidimensional data at the desired higher resolution. At a high level, the main idea behind the proposed method is that the target multidimensional data, X ∈ R^{I×J×K}, admit a CPD model. Therefore, they can be well approximated using the CPD factors A, B, C (i.e., X = [[A, B, C]]). Exploiting the low-rank modeling helps in reducing the number of unknown variables, especially if the data are highly correlated.
Then, the CPD representations of the two aggregated observations are

Y_t = [[A, B, WC]]    (7)
Y_c = [[UA, VB, C]]    (8)

PREMA learns the factor matrices A, B, and C by applying a coupled CPD model on the available aggregates with respect to the available observations. Note that up to this point, we have not explained how missing entries in Y_t and Y_c are treated; this will be discussed in the next section. Figure 3 illustrates the high-level picture of our model.

3.2 PREMA: Formulation

If we had the original (disaggregated) data in the tensor X with missing entries, a common way to estimate its CPD factors would be to adopt a least-squares criterion that minimizes the difference between the original tensor X and its CPD [[A, B, C]] with respect to the available (observed) entries. This can be done by adding a weight tensor that masks the available entries, i.e.,

minimize_{A, B, C}  ‖Ω ⊛ (X − [[A, B, C]])‖_F^2    (9)

where Ω is defined as

Ω(i, j, k) = 1 if X(i, j, k) is available, and 0 otherwise.    (10)

Fortunately, many real-life data exhibit low rank due to the correlation between the elements within each dimension (e.g., stores, items, time stamps), i.e., R in (1) is small relative to the size of the tensor.

In the considered disaggregation task, we only have aggregated views of the multidimensional data (i.e., compressed versions of the target tensor X). These aggregated views can have missing elements for various application-specific reasons, such as privacy, lack of data collection, or absence of events. We use the fact that the aggregated tensors share the same factors (up to aggregation), as shown in equations (7) and (8), to jointly decompose Y_t and Y_c by means of coupled tensor factorization.
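The shared-factor structure of Eqs. (7)-(8) can be checked on synthetic data. In the sketch below (our own toy sizes; the item mode is not aggregated, i.e., V is the identity and is omitted), both views are generated from one tensor and matched against their factored forms:

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, K, R = 4, 5, 6, 2
A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
X = np.einsum('ir,jr,kr->ijk', A, B, C)

U = np.kron(np.eye(2), np.ones((1, 2)))   # groups of stores:      4 -> 2
W = np.kron(np.eye(3), np.ones((1, 2)))   # aggregated periods:    6 -> 3

Y_t = np.einsum('ijk,wk->ijw', X, W)      # temporal aggregate,        Y_t = X x3 W
Y_c = np.einsum('ui,ijk->ujk', U, X)      # contemporaneous aggregate, Y_c = X x1 U

# Eq. (7): Y_t = [[A, B, WC]]; Eq. (8) with V = I: Y_c = [[UA, B, C]]
assert np.allclose(Y_t, np.einsum('ir,jr,kr->ijk', A, B, W @ C))
assert np.allclose(Y_c, np.einsum('ir,jr,kr->ijk', U @ A, B, C))
```

Because both views involve the same A, B, C, fitting them jointly constrains far fewer unknowns than fitting X entrywise, which is the basis of the coupled formulation that follows.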
To this end, we obtain the following formulation:

min_{A, B, C} F(A, B, C) := ‖Ω_t ⊛ (Y_t − [[A, B, WC]])‖_F^2 + ‖Ω_c ⊛ (Y_c − [[UA, VB, C]])‖_F^2    (11)

where Ω_t ∈ {0, 1}^{I×J×K_w} and Ω_c ∈ {0, 1}^{I_u×J_v×K} are weight tensors with ones at the indices of the available entries in Y_t and Y_c, respectively, and zeros elsewhere. As a result, the CPD factors A, B, and C are learned with respect to the available data. One could add a regularization parameter λ to control the weight between the two terms; however, we observed that it does not significantly affect the disaggregation performance. Note that if we have additional aggregated observations, we can incorporate them using the same concept. Enforcing non-negativity constraints on the factors seems natural if we are dealing with count data; however, we empirically observed that it does not improve the disaggregation accuracy.

3.3 PREMA: Algorithm

The optimization in (11) is non-convex, and NP-hard in general. To tackle it, we derive a Block Coordinate Descent (BCD) algorithm that updates the three variables in an alternating fashion. Starting from initial factors A^(0), B^(0), and C^(0), at every iteration k ∈ N, we cyclically update each factor while fixing the other two. Each update is a step in the direction of the negative gradient of F with respect to the corresponding factor. To simplify the expressions, let us define Ã = UA, B̃ = VB, and C̃ = WC. The partial derivative of the objective function F w.r.t. A is as follows (the derivations are deferred to Appendix A):

∇_A F = 2 [Ω_t1 ⊛ ((C̃ ⊙ B)A^T − Y_t1)]^T (C̃ ⊙ B) + 2 U^T [Ω_c1 ⊛ ((C ⊙ B̃)Ã^T − Y_c1)]^T (C ⊙ B̃)    (12)

where E_t := Ω_t1 ⊛ ((C̃ ⊙ B)A^T − Y_t1) and E_c := Ω_c1 ⊛ ((C ⊙ B̃)Ã^T − Y_c1) denote the two masked residual terms, and Y_t1, Y_c1, Ω_t1, and Ω_c1 are the mode-1 unfoldings of the corresponding tensors. Similarly, we derive the derivatives of F w.r.t.
B and C using mode-2 and mode-3 unfoldings of the tensors, respectively, and get the following equations:

∇_B F = 2 [ Ω_t2 ⊛ ((C̃ ⊙ A) B^T − Y_t2) ]^T (C̃ ⊙ A) + 2 V^T [ Ω_c2 ⊛ ((C ⊙ Ã) B̃^T − Y_c2) ]^T (C ⊙ Ã),   (13)

∇_C F = 2 W^T [ Ω_t3 ⊛ ((B ⊙ A) C̃^T − Y_t3) ]^T (B ⊙ A) + 2 [ Ω_c3 ⊛ ((B̃ ⊙ Ã) C^T − Y_c3) ]^T (B̃ ⊙ Ã).   (14)

With the above gradient expressions at hand, we have established the update direction for each block (factor), namely the negative gradient of F with respect to that factor:

A = A − α ∇_A F,   (15)
B = B − β ∇_B F,   (16)
C = C − γ ∇_C F.   (17)

We now seek to select the step sizes α, β, and γ. We use exact line search for this task. At every iteration k ∈ N, α is chosen to minimize F along the line {A − α ∇_A F | α ≥ 0}:

argmin_{α ≥ 0} F(A − α ∇_A F)   (18)

Fortunately, in our case, the above optimization can be solved optimally without extra heavy computation. The optimal solution to (18) is as follows (refer to Appendix B for the derivation):

α = max( 0, (e_t^T g_t + e_c^T g_c) / (g_t^T g_t + g_c^T g_c) ),   (19)

where e_t = vec(E_t), e_c = vec(E_c), with E_t and E_c as defined in (12), and

g_t = vec( Ω_t1 ⊛ ((C̃ ⊙ B) ∇_A F^T) ),   (20)
g_c = vec( Ω_c1 ⊛ ((C ⊙ B̃) (U ∇_A F)^T) ).   (21)

Note that E_t and E_c are already computed in (12). We have also computed (C̃ ⊙ B) and (C ⊙ B̃) in (12), which are needed to obtain g_t and g_c, respectively. Thus, the exact line search step only requires:
• Multiplying the transpose of the gradient ∇_A F ∈ R^{I×R} by a K_w J × R matrix in (20) (and U ∇_A F ∈ R^{I_u×R} by a K J_v × R matrix in (21)).
• Computing the inner products in (19).
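The closed-form step size (19) can be checked numerically. The sketch below applies the same rule to a single masked least-squares term F(A) = ||Ω ⊛ (D A^T − M)||_F^2 rather than the full coupled objective; all variable names are illustrative. With E = Ω ⊛ (D A^T − M) and G = Ω ⊛ (D ∇^T), the minimizer of F along the negative gradient is α = max(0, ⟨E, G⟩ / ⟨G, G⟩):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, R = 30, 8, 3
D = rng.standard_normal((m, R))          # plays the role of the Khatri-Rao factor
A = rng.standard_normal((n, R))          # block being updated
M = rng.standard_normal((m, n))          # masked observations
Omega = (rng.random((m, n)) < 0.7).astype(float)

def F(A_):
    """Masked least-squares objective for this single term."""
    return np.sum((Omega * (D @ A_.T - M)) ** 2)

E = Omega * (D @ A.T - M)                # residual on observed entries
grad = 2.0 * (E.T @ D)                   # gradient of F w.r.t. A
G = Omega * (D @ grad.T)                 # change of the residual along -grad
alpha = max(0.0, np.vdot(E, G) / np.vdot(G, G))
```

Along the line A − t·grad the masked residual is E − t·G, so F is a convex quadratic in t and α above is its exact minimizer, matching the structure of (19).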
In a similar fashion, β and γ are obtained by solving the following optimization problems, respectively:

β = argmin_{β ≥ 0} F(B − β ∇_B F)   (22)
γ = argmin_{γ ≥ 0} F(C − γ ∇_C F)   (23)

The solutions are analogous to the case of α, but with mode-2 and mode-3 tensor unfoldings. We provide an illustrative example of deriving the solution to (18) and (22)-(23) in Appendix B. The overall steps of PREMA are summarized in Algorithm 1.

Algorithm 1: PREMA (11)
input: Y_t, Y_c, U, V, W, R
Initialize: A, B, C (refer to Appendix C)
Repeat
  • Update A using (15), (12), and (19)
  • Update B using (16), (13), and (22)
  • Update C using (17), (14), and (23)
Until criterion is met (max. # iterations)
output: A, B, C

We observed empirically that a careful initialization of the factor matrices in Algorithm 1 results in better disaggregation accuracy and substantially reduces the running time (i.e., it reduces the required number of iterations). Thus, we design an initialization method based on CPD. First, we set the missing entries to zero, then perform CPD on one tensor to get initial estimates of two factors. Then, we solve a system of linear equations using the other tensor to obtain an initial estimate of the third factor. For instance, from CPD(Y_t) we get A, B, and C̃. Then, we obtain C by solving the linear system Y_c3 = ((VB) ⊙ (UA)) C^T. This way, we establish an initial guess for A, B, and C. We provide detailed initialization steps in Appendix C.

3.4 PREMA: Complexity Analysis

The complexity of PREMA is determined by the matrix multiplications required to obtain the gradients and the step sizes. The products in the gradient expressions have the dominant computational cost. Therefore, we break down the computational complexity below using the gradient w.r.t.
A in (12); the complexity of computing the gradients w.r.t. B and C is similar. Recall (12):

∇_A F = 2 E_t^T (C̃ ⊙ B) + 2 U^T E_c^T (C ⊙ B̃)

where E_t = Ω_t1 ⊛ ((C̃ ⊙ B) A^T − Y_t1) ∈ R^{JK_w×I} and E_c = Ω_c1 ⊛ ((C ⊙ B̃) Ã^T − Y_c1) ∈ R^{J_vK×I_u}.

1) Computing the two Khatri-Rao products costs O(K_w J R + K J_v R), where R is the rank.
2) The cost of multiplying (C̃ ⊙ B) with A^T, and (C ⊙ B̃) with Ã^T, is O(I J K_w R + I_u J_v K R).
3) The element-wise products (⊛) cost O(nnz(Ω_t) + nnz(Ω_c)).
4) Multiplying E_t^T and E_c^T with the Khatri-Rao products costs O(R (nnz(Ω_t) + nnz(Ω_c))).
5) In the worst case, where U and Ω_c1 have no zeros, the cost of multiplying U^T with E_c^T (C ⊙ B̃) is O(I I_u R).

The dominant cost terms are in the 2nd point above. Thus, the overall complexity is O(I J K_w R + I_u J_v K R). Since R is usually very small relative to the size of the tensors in real data, the complexity is linear in the size of Y_t and Y_c.

3.5 PREMA: Identifiability Analysis

Having introduced the model and the algorithm, we establish the identifiability of the PREMA model. As mentioned earlier, the multidimensional disaggregation task is an ill-posed inverse problem. Imposing a low-rank CPD model on the data results in a tensor disaggregation problem with a unique solution. In other words, the optimal solution of (11) is guaranteed to be unique, under mild conditions, and to identify the original fine-resolution tensor almost surely. For the sake of simplicity, we first assume that Y_t does not have any missing values.

Proposition 1. Let X ∈ R^{I×J×K} be the target tensor to disaggregate, with CPD X = [[A, B, C]] of rank R. Also let Y_t ∈ R^{I×J×K_w} = X ×_3 W and Y_c ∈ R^{I_u×J_v×K} = Ω_c ⊛ (X ×_1 U ×_2 V) be the two aggregated sets of observations.
Assume that A, B, and C are drawn from some absolutely continuous joint distribution with respect to the Lebesgue measure in R^{(I+J+K)R}, and that (A*, B*, C*) is an optimal solution to problem (11). Also assume that the number of observed entries in each frontal slab of Y_c is greater than or equal to R. Then, X̂ = [[A*, B*, C*]] disaggregates Y_t, Y_c to X almost surely if

R ≤ (1/16) min{ I J, I K_w, J K_w, 16 I_u J_v }.

The proof is intuitive and parallels recent results obtained in the hyperspectral imaging literature [42]. Proof sketch: We use Theorem 1 to claim identifiability of Y_t; then the factors A, B can be identified up to common permutation and scaling. The solution for C is obtained by solving an overdetermined linear system of equations using Y_c. This way, the permutation and scaling are preserved, and the target tensor is recovered as X = [[A, B, C]]. In the case where Y_t has missing entries, identifiability depends on the pattern of the missing data. Specifically, the results in [24], [25], [26] can be employed when the available measurements are fiber-, regularly, or randomly sampled, respectively. The conditions are more restrictive than in the case of a fully observed tensor, but they guarantee identifiability of A, B up to common permutation and scaling. The solution for C is the same as in the previous case. The detailed proof is presented in Appendix D.

3.6 B-PREMA: PREMA with Unknown Aggregation

In most practical applications, the aggregation details are known. However, there are cases with limited knowledge of how the data are aggregated, i.e., we do not know (or have only partial knowledge of) U, V, and W.
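The step in the proof sketch that solves an overdetermined linear system for C from Y_c can be illustrated numerically. In the noiseless sketch below (toy sizes; the helper `khatri_rao` and the block aggregation matrices are ours), once A and B are known, least squares recovers C exactly because the Khatri-Rao matrix has full column rank generically:

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Khatri-Rao product of P (m x R) and Q (n x R)."""
    return np.einsum('mr,nr->mnr', P, Q).reshape(P.shape[0] * Q.shape[0], -1)

rng = np.random.default_rng(2)
I, J, K, R = 6, 8, 10, 2
Iu, Jv = 3, 4
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))
U = np.kron(np.eye(Iu), np.ones((1, I // Iu)))   # sum pairs of "stores"
V = np.kron(np.eye(Jv), np.ones((1, J // Jv)))   # sum pairs of "items"
Yc3 = khatri_rao(V @ B, U @ A) @ C.T             # mode-3 unfolding of Y_c
# Overdetermined system (I_u * J_v = 12 rows >= R = 2 columns): solve for C.
C0 = np.linalg.lstsq(khatri_rao(V @ B, U @ A), Yc3, rcond=None)[0].T
```

With missing entries or noise the system is solved in the least-squares sense instead of exactly, which is why the proposition requires at least R observed entries per frontal slab.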
We consider the case where each available view is aggregated in one dimension, and propose the following formulation to get the factors of the disaggregated tensor (A, B, and C):

min_{A, B, C, Ã, C̃}  L(A, B, C, Ã, C̃) := || Ω_t ⊛ (Y_t − [[A, B, C̃]]) ||_F^2 + || Ω_c ⊛ (Y_c − [[Ã, B, C]]) ||_F^2 + μ R(C, C̃)   (24)

where Ã = UA and C̃ = WC are treated as separate variables, since we do not know U and W, and R is a regularization function. This problem is more challenging than (11), as the number of variables has increased while the number of equations remains the same. Another challenge is that, if we omit the regularization term in (24), there is a scaling ambiguity between the factors of the two tensors. Scaling and counter-scaling the factors of the tensor Y_t (or Y_c) changes neither its estimate nor the value of the cost function in (24). For example, scaling A by λ and C̃ by 1/λ does not change the value of Ŷ_t1 = (C̃ ⊙ B) A^T, and as a result gives the same cost value. However, this scaling changes the estimate of the disaggregated tensor X̂_1 = (C ⊙ B) A^T, because the tensor X shares factors with both Y_t and Y_c. To overcome this, we observe that the temporal aggregation W in most aggregated data is non-overlapping and includes all the time ticks². This means that the respective column sums of C and C̃ should be equal. We exploit this observation by choosing the following regularization term for (24):

R(C, C̃) = || 1^T C − 1^T C̃ ||_2^2,

which resolves the scaling ambiguity. To tackle the problem above, we derive a BCD algorithm in the same fashion as Algorithm 1; the steps are summarized in Algorithm 2. We alternate between updating the five variables. In each update, we take a step in the direction of the negative gradient w.r.t. the corresponding variable.
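The column-sum regularizer can be sanity-checked in a few lines. In this sketch (our own toy sizes and names), W is a non-overlapping, full-coverage temporal aggregation, so each column of WC has the same sum as the corresponding column of C; the regularizer vanishes at the true pair but penalizes a counter-scaled pair that fits Y_t equally well:

```python
import numpy as np

rng = np.random.default_rng(5)
K, Kw, R = 12, 4, 3
W = np.kron(np.eye(Kw), np.ones((1, K // Kw)))   # sum every 3 consecutive ticks
C = rng.random((K, R)) + 1.0                     # positive "time" factor
Ct = W @ C                                       # true C_tilde = WC

def reg(C_, Ct_):
    """R(C, C_tilde) = || 1^T C - 1^T C_tilde ||_2^2 (column-sum mismatch)."""
    return np.sum((C_.sum(axis=0) - Ct_.sum(axis=0)) ** 2)
```

Here reg(C, Ct) is zero up to rounding, while reg(C, Ct/λ) for λ ≠ 1 (the counter-scaled solution) is strictly positive, which is exactly the ambiguity the μR(C, C̃) term removes.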
The derivations of the gradients are shown in Appendix A. The step sizes α, ρ, β, γ, and σ are chosen using the exact line search explained in Section 3.3 and Appendix B. To initialize the factors in Algorithm 2, we set the missing entries to zero, then use Tensorlab to compute CPD(Y_c) and obtain Ã, B, and C. To get an initial estimate of C̃, we exploit the fact that, in most real data, the temporal aggregates are sums over consecutive time stamps. Therefore, we sum every w = K/K_w consecutive rows of C. This approximates the temporal aggregation process in an intuitive way, the true aggregation matrix being unknown³.

4 EXPERIMENTAL DESIGN

In this section, we provide a detailed description of the setup used in our experiments. First, we describe the data. Then, we explain the aggregation applied to these data to generate the aggregated views. Last, we present the evaluation metrics and the baselines used for comparison.

4.1 Datasets

We evaluate PREMA using the following public datasets, which are readily available online:

DFF⁴: Retail sales data, called Dominick's Finer Foods (DFF), collected by the James M. Kilts Center, University of Chicago Booth School of Business. DFF used to be a grocery store chain based in the Chicago area until all of its stores were closed. Sales in this dataset are divided into category-specific files. In particular, each file contains the weekly sales (i.e., number of sold units) of items belonging to a specific category (e.g., cheese, cookies, soft drinks, etc.) in about 100 stores. The DFF data contain the geographical locations of the different stores, which we use to aggregate stores into groups.

² Known overlap, e.g., 50%, can be treated similarly, as in this case every atom is counted twice.
³ In the experiments, we make sure that the true temporal aggregation and the estimated one do not align.
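The C̃ initialization described above can be sketched directly; the helper name `sum_consecutive_rows` is ours. Summing every w consecutive rows of C is equivalent to multiplying by the block aggregation matrix kron(I, 1^T):

```python
import numpy as np

def sum_consecutive_rows(C, w):
    """Aggregate C temporally by summing every w consecutive rows."""
    K, R = C.shape
    return C.reshape(K // w, w, R).sum(axis=1)

C = np.arange(12.0).reshape(6, 2)     # K = 6 time ticks, R = 2
Ct0 = sum_consecutive_rows(C, 3)      # initial C_tilde with K_w = 2
```

The result equals np.kron(np.eye(2), np.ones((1, 3))) @ C, i.e., the action of an (unknown) non-overlapping temporal aggregation matrix.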
⁴ https://www.chicagobooth.edu/research/kilts/datasets/dominicks

Algorithm 2: B-PREMA
input: Y_t, Y_c, R, μ
Initialize:
  Ã, B, C ← CPD(Y_c)
  C̃(k_w, :) ← Σ_{k = w(k_w−1)+1}^{w·k_w} C(k, :)
  A ← solve Y_t3 = A (C̃ ⊙ B)^T
Repeat
  • α ← argmin_{α≥0} L(A − α ∇_A L);  A = A − α ∇_A L
  • ρ ← argmin_{ρ≥0} L(Ã − ρ ∇_Ã L);  Ã = Ã − ρ ∇_Ã L
  • β ← argmin_{β≥0} L(B − β ∇_B L);  B = B − β ∇_B L
  • γ ← argmin_{γ≥0} L(C − γ ∇_C L);  C = C − γ ∇_C L
  • σ ← argmin_{σ≥0} L(C̃ − σ ∇_C̃ L);  C̃ = C̃ − σ ∇_C̃ L
Until termination criterion is met (max. # iterations)
output: A, B, C

We create ground-truth three-dimensional tensors using 10 different category-specific datasets. This way, a (stores × items × weeks) tensor is formed for each category. These 10 category-specific datasets are listed as the first group in Table 2 (we use the three-capital-letter acronym for each category in the results). We pick the 50 most popular items from each category. Note that this results in an 'incomplete' tensor, owing to the fact that not all items were offered in all stores, or they were offered only for part of the time in some stores. These tensors have varying statistics (see Table 2), which allows thorough testing and analysis. We also form an additional (stores × items × weeks) tensor that contains items from all the 10 different categories combined, 50 items from each (namely Mixed DFF in Table 2).

Walmart⁵: Historical weekly sales data for 99 different departments in 45 Walmart stores located in different regions. A (stores × departments × weeks) tensor is created from these data. The resulting tensor is complete and has no missing entries. The size of each store (in square feet) is included in the data (we use this information to form groups of stores).

Crime⁶: Reported incidents of crime that occurred in the city of Chicago from 2001 to the present.
Each incident is marked with its beat (police geographical area) and a code indicating the crime type. There are 304 geographical areas and 388 crime types in total. Using this dataset, we form a (locations (by beat) × crime types × months) tensor.

Weather⁷: Daily weather observations from 49 stations in Australia. These observations contain 17 different variables, e.g., min temperature, max temperature, cloud, humidity, wind, etc. We form a (stations (locations) × variables × days) tensor using one year of daily observations.

Table 2 summarizes the different datasets described above, with their size, maximum and average values, Standard Deviation (SD), and percentage of missing entries and zeros. These datasets are the ground truth in our experiments, and are represented by X ∈ R^{I×J×K}.

⁵ https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data
⁶ https://www.kaggle.com/chicago/chicago-crime/activity
⁷ http://www.bom.gov.au/climate/data/

TABLE 2: Summary of datasets and their statistics.
Dataset (X)       Size         Max       Avg       SD        %missing  %zeros
BATh Soap         93×50×266    52        0.79      1.34      44.73     33.37
Bottled JuiCes    93×50×393    12288     13.76     50.08     8.79      9.19
CHEeses           93×50×393    18176     26.65     88.29     8.59      5.51
COOkies           94×50×390    14080     16.00     56.86     9.81      7.57
CRAckers          94×50×382    14080     8.21      29.61     14.21     7.57
Canned SOup       93×50×379    34494     40.46     133.42    8.64      4.54
Fabric SoFteners  93×50×397    7168      5.68      18.84     18.64     27.48
GROoming          93×50×272    232       1.94      2.94      7.66      32.66
Paper ToWels      93×50×389    19712     45.36     117.82    36.72     23.49
Soft DRinks       93×50×391    18944     48.81     155.09    8.58      11.18
Mixed DFF         93×500×230   17610     19.01     71.30     15.30     17.83
Walmart           45×99×143    6.93e+05  1.05e+04  1.99e+04  0         33.84
Crime             304×388×221  325       0.26      1.47      0         91.56
Weather           49×17×365    1038      10.23     95.65     0         93.30

4.2 Aggregation Configuration

The aggregated observations (compressed tensors) that are used as inputs to the disaggregation methods are generated from X following the two practical scenarios described below:

Scenario A: The multidimensional data we aim to disaggregate are represented by X ∈ R^{I×J×K}. Instead of the full tensor X, we are given two aggregated views: 1) a temporally aggregated tensor Y_t = X ×_3 W, i.e., aggregated in the third dimension; and 2) a contemporaneously aggregated tensor Y_c = X ×_1 U, aggregated in the first mode (e.g., the stores/locations dimension). We use the 10 category-specific datasets from DFF and the Walmart data to test this scenario. The stores are aggregated according to their geographical locations in the DFF datasets, and based on their sizes in the Walmart data. We also test this scenario on the Weather data, where the temporal aggregate represents the weather observations averaged over a period of time, and the contemporaneous aggregate is the average of the observations over a geographical region.
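The two views of Scenario A can be simulated from a ground-truth tensor with mode products; the helper `mode_product` and the block aggregation matrices below are illustrative, not from the paper:

```python
import numpy as np

def mode_product(X, M, mode):
    """Multiply tensor X by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(X, mode, 0), axes=1), 0, mode)

rng = np.random.default_rng(7)
I, J, K = 6, 5, 8
X = rng.random((I, J, K))                      # ground-truth tensor
W = np.kron(np.eye(K // 4), np.ones((1, 4)))   # sum every 4 "weeks"
U = np.kron(np.eye(I // 3), np.ones((1, 3)))   # sum groups of 3 "stores"
Yt = mode_product(X, W, 2)                     # temporal aggregate  Y_t = X x_3 W
Yc = mode_product(X, U, 0)                     # contemporaneous aggregate Y_c = X x_1 U
```

Since W and U are non-overlapping sums covering all atoms, both aggregates preserve the total count, which is a quick consistency check on any simulated aggregation.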
Scenario B: In this scenario, two aggregated views of X are given: 1) as in the previous scenario, a temporally aggregated tensor Y_t = X ×_3 W; and 2) a contemporaneously aggregated tensor Y_c = X ×_1 U ×_2 V, aggregated in the first and second dimensions (e.g., sales counts that are jointly aggregated over groups of stores and groups of items). We use the Mixed DFF and Crime data to test this scenario. The stores are aggregated into groups according to their locations in the Mixed DFF data, whereas items are aggregated according to their categories. In the Crime data, locations and types are grouped based on geographical closeness and similarity in crime type, respectively. Note that when V = I, this reduces to Scenario A. Evidently, this scenario is more challenging, since the second observation is aggregated in two modes, i.e., double aggregation, resulting in fewer measurements.

The difficulty of the problem also depends on the aggregation level, i.e., the number of data points (e.g., weeks, items, or stores) in one sum. Fewer aggregated measurements result in a more challenging problem from an "equations versus unknowns" standpoint. We test the disaggregation performance using different aggregation levels for each dimension.

4.3 Evaluation Baselines & Metrics

We evaluate the disaggregation performance of the proposed method using the Normalized Disaggregation Error (NDE = ||X − X̂||_F^2 / ||X||_F^2), where X̂ is the estimated data. The baseline methods are described next. Note that we compare to state-of-the-art approaches in the time series disaggregation literature, as well as methods developed to fuse multiple views of multidimensional data, albeit for different tasks. To the best of our knowledge, our work is the first to perform disaggregation on multidimensional data from multiple views.

Mean: This baseline assumes that the constituent data atoms (entries in X) contribute equally to their aggregated samples.
The final estimate of Mean is the average of the estimates from the temporal and contemporaneous aggregates. For example, suppose the contemporaneous aggregate reports 100 units sold in 10 stores in the first week of January, and the temporal one tells us that 80 units were sold in January (4 weeks) in Store 1. Then, the Mean estimate for week 1 and store 1 is (100/10 + 80/4)/2 = 15.

LS: This baseline is inspired by [19], [20], where a least squares criterion is adopted on the linear relationship between the target time series in high resolution and the available aggregates. The resulting linear system is underdetermined; thus, these works assume a linear regression model between the target series and a set of indicators. In their context, indicators are time series available in high resolution that are expected to display fluctuations similar to those of the target series. For example, the stock price of an oil company is a linear combination of the stock prices of other relevant companies. This assumption requires additional data that are not available in our datasets. Therefore, we resort to the minimum-norm solution of

min_X || vec(Ω_t3^T) ⊛ ( vec(Y_t3^T) − W̃ vec(X_3^T) ) ||_2^2 + || vec(Ω_c3) ⊛ ( vec(Y_c3) − Ũ vec(X_3) ) ||_2^2   (25)

where W̃ = I ⊗ W and Ũ = I ⊗ V ⊗ U.

H-Fuse [14]: This baseline constrains the solution of the LS baseline above to be smooth, i.e., it penalizes large differences between adjacent time ticks.

HomeRun [15]: To circumvent the indeterminacy of the linear system in the time series disaggregation problem, this baseline solves for the disaggregated series in the frequency domain. More specifically, HomeRun searches for the coefficients of the Discrete Cosine Transform (DCT) that represent the target high-resolution series. The key point is that the number of non-negligible DCT coefficients of the time series is much smaller than its length.
In other words, the DCT is used as a sparsifying dictionary to reduce the number of variables. HomeRun also imposes smoothness and non-negativity constraints.

CMTF: Coupled Matricized Tensor Factorization has been widely used to fuse multiple views of multidimensional data in hyperspectral imaging [45], [46]; the work in [45] adds non-negativity constraints. These images are three-dimensional tensors, and the motivation behind these works is to exploit the low rank of the matricized image. We compare to this model because real-world multidimensional data are often well approximated as low rank, as we show empirically. In our notation, CMTF solves

min_{A, B}  || Ω_t3 ⊛ (Y_t3 − A(WB)^T) ||_F^2 + || Ω_c3 ⊛ (Y_c3 − (V ⊗ U) A B^T) ||_F^2   (26)

We solve (26) using a BCD algorithm with exact line search. As with PREMA, a good initialization of the low-rank factors improves the performance of CMTF. To ensure a fair comparison, we initialize using SVD, with missing entries set to zero.

In addition to the above baseline methods, we also test the estimation of the target disaggregated data with the following oracle baseline.

CPD: We fit a CPD model directly to the ground-truth tensor X with respect to the observed entries. We use the Matlab-based package Tensorlab to compute the CPD. Then, we reconstruct X̂ from the learned factors (A, B, C). This baseline also serves as a lower bound on the error of the proposed method PREMA.

5 EXPERIMENTAL RESULTS

In this section, we evaluate the performance of PREMA and B-PREMA in terms of disaggregation accuracy using real data. The two aforementioned aggregation scenarios (see Section 4.2) are considered, with different aggregation levels.
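The evaluation metric from Section 4.3, used in all results that follow, is straightforward to compute; this small sketch (our naming) makes its scale explicit:

```python
import numpy as np

def nde(X, Xhat):
    """Normalized Disaggregation Error: ||X - Xhat||_F^2 / ||X||_F^2."""
    return np.sum((X - Xhat) ** 2) / np.sum(X ** 2)

X = np.ones((2, 3, 4))
perfect = nde(X, X)                    # exact recovery
trivial = nde(X, np.zeros_like(X))     # all-zeros estimate
```

Exact recovery gives NDE = 0, and the trivial all-zeros estimate gives NDE = 1, so values well below 1 indicate that a method recovers real structure.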
In the experiments, we choose the rank R for PREMA (and the CPD baseline) based on Proposition 1, unless stated otherwise. For CMTF, on the other hand, we perform a grid search and show the results for the best R. We run 10 iterations of the CPD step in the initialization of PREMA in Algorithm 1 (or B-PREMA in Algorithm 2) using Tensorlab, then run 10 iterations of the iterative procedure in the algorithms. We set μ = 100 for B-PREMA in Algorithm 2. All experiments were performed using Matlab on a Linux server with an Intel Core i7-4790 CPU 3.60 GHz processor and 32 GB memory.

5.1 Results on Scenario A

Two aggregated views Y_t, Y_c are observed. Table 3 shows the disaggregation error, in terms of NDE, achieved by the proposed method and the baselines on the 10 category-specific datasets from DFF. The proposed methods, PREMA and B-PREMA, along with the CPD oracle, are shown for 3 different ranks (R = 10, R = 25, R = 40). In Y_t, the weekly sales counts are observed on a monthly basis, while in Y_c, the 93 (or 94 for some categories) stores are clustered geographically into 18 areas. This means that the measurements in the temporal aggregate Y_t are about 25% of the original size, and the number of contemporaneously aggregated measurements in Y_c is only 19.35% of the disaggregated data size.

[Fig. 4: PREMA works well with extreme aggregation. NDE under "mod agg" and "high agg" for (a) the FSF dataset and (b) the PTW dataset; methods shown: Mean, LS, H-Fuse, HomeRun, CMTF, PREMA, B-PREMA, CPD (oracle).]

For all datasets in Table 3, except BAT, PREMA markedly outperforms the baselines; to highlight the improvement, the smallest error is shown in bold and the second smallest is underlined in the original table. The naive mean (Mean) is good enough on the BAT dataset because it is smooth (SD = 1.34) and has the largest percentage of missing entries compared to the other datasets. The time series methods, H-Fuse and HomeRun, do not perform well on these datasets because they are designed for smooth and quasi-periodic data, respectively. For example, we noticed that HomeRun improves on the error of the LS and H-Fuse baselines on the CRA data, and found that CRA exhibits more periodicity than the rest of the categories. Comparing PREMA with CPD, we see that PREMA achieves an error very close to that of a CPD of the ground-truth data with the same rank, e.g., on the GRO, PTW, and SDR datasets. Looking at the performance of B-PREMA in the table, we see that the proposed algorithm works remarkably well when the aggregation matrices are unknown. For example, on the GRO data with R = 40, the NDE of B-PREMA is 0.2472, while the NDE of CPD is 0.2284. B-PREMA disaggregates with smaller, or very similar, error compared to the baselines that use the aggregation pattern information (see the results on the CRA, FSF, GRO, and SDR datasets). For all datasets, there is a wide range of R under which the proposed algorithm works similarly well.

Next, we examine the performance when we change the level of aggregation from moderate ("mod agg") to very high ("high agg"). The disaggregation error is shown for two datasets from the DFF data, FSF and PTW, in Figure 4, and for the Walmart and Weather datasets in Figure 5. The aggregation levels in Figure 4 are: 1) monthly measurements (every 4 weeks) in Y_t, with the 93 stores divided geographically into 18 areas ("mod agg"); and 2) quarterly samples (every 12 weeks) in Y_t, with the stores divided into only 9 areas ("high agg"). The rank R for PREMA, B-PREMA, and CPD is set to 40 in this figure.
By comparing the moderate and high aggregation levels in Figure 4, we conclude that PREMA is more robust under aggressive aggregation, where only a few samples are available. With "high agg", the number of aggregated samples is only 8.56% of the original size in the temporal aggregate, and 9.68% in the contemporaneous aggregate. In this case, the NDE of the best baseline is 3.04 (1.68) times the error of PREMA on the FSF (PTW) dataset, respectively. The PTW dataset is more challenging, as it has a relatively high percentage of missing entries (36.72%).

TABLE 3: NDE of the proposed methods and the baselines on the 10 category-specific datasets.

Method              BAT     BJC     CHE     COO     CRA     CSO     FSF     GRO     PTW     SDR
%missing            44.73   8.79    8.59    9.81    14.21   8.64    18.64   7.66    36.72   8.58
SD                  1.34    50.08   88.29   56.86   29.61   133.42  18.84   2.94    117.82  155.09
Mean                0.3284  0.4441  0.3118  0.3596  0.5217  0.3309  0.5609  0.2464  0.2994  0.2860
LS                  0.3328  0.6077  0.4650  0.6224  0.5889  0.4664  0.5982  0.2831  0.4593  0.5420
H-Fuse              0.3411  0.6437  0.4870  0.6414  0.5726  0.4885  0.6451  0.2863  0.4719  0.5644
HomeRun             0.3461  0.6453  0.4818  0.6284  0.5376  0.4856  0.6496  0.2877  0.4662  0.5594
CMTF                0.4254  0.1818  0.1954  0.1783  0.7455  0.1564  0.1930  0.2908  0.2577  0.1633
PREMA, R=10         0.5203  0.1978  0.1756  0.1757  0.2587  0.2057  0.2019  0.3198  0.2844  0.2039
PREMA, R=25         0.5079  0.1684  0.1516  0.1371  0.2624  0.1373  0.1790  0.2581  0.2132  0.1438
PREMA, R=40         0.4972  0.1572  0.1491  0.1318  0.2589  0.1332  0.1747  0.2458  0.1969  0.1348
CPD (oracle), R=10  0.4782  0.0937  0.0723  0.1205  0.0776  0.0776  0.0810  0.2919  0.2356  0.1329
CPD (oracle), R=25  0.4345  0.0586  0.0419  0.0676  0.0518  0.0476  0.0494  0.2448  0.1358  0.0822
CPD (oracle), R=40  0.4109  0.0443  0.0321  0.0532  0.0438  0.0345  0.0399  0.2284  0.1007  0.0605
B-PREMA, R=10       0.5242  0.3012  0.3525  0.2207  0.3080  0.1752  0.2090  0.3156  0.3594  0.2008
B-PREMA, R=25       0.5002  0.3583  0.3553  0.2496  0.2976  0.1756  0.1892  0.2557  0.3758  0.1539
B-PREMA, R=40       0.4914  0.3909  0.3823  0.2942  0.3042  0.1825  0.1846  0.2472  0.3963  0.1620

[Fig. 5: PREMA works with different data. NDE under "mod agg" and "high agg" for (a) the Walmart dataset and (b) the Weather dataset (log scale); methods shown: Mean, LS, H-Fuse, HomeRun, CMTF, PREMA, B-PREMA, CPD (oracle).]

Moreover, with no knowledge of the aggregation pattern, B-PREMA outperforms all baselines that have access to the aggregation information on the FSF data. Although B-PREMA has a larger NDE than Mean and CMTF under "mod agg" on the PTW data, it becomes superior to all baselines when the aggregation level is high.

For the Walmart data in Figure 5 (a), "mod agg" means that weeks are aggregated into months in Y_t and the 45 stores are divided into 15 groups, whereas in "high agg" time is aggregated quarterly (12 weeks) and the stores are clustered into 9 groups. CMTF works slightly better when the aggregation is moderate, owing to the fact that the second mode in the Walmart data is departments, as opposed to items in the DFF data. Departments are less correlated than items from the same category. As a result, the advantage of tensor models over the matricized tensor in capturing higher-order dependencies becomes less clear. However, PREMA is more immune to aggressive aggregation: in the "high agg" case, the NDE of CMTF is 1.71 times the error of PREMA. Even without access to the aggregation information, B-PREMA significantly reduces the error of the baselines. In Figure 5 (b), "mod agg" corresponds to the daily weather measurements averaged into weekly samples, with the 49 stations averaged into 13 groups; in the "high agg" case, the daily measurements are averaged into monthly samples, and the 49 stations are clustered into 7 groups.

⁸ HomeRun is excluded from the results on the Weather data, as it has non-negativity constraints.
PREMA, CMTF, and H-Fuse perform similarly on the Weather data⁹ (which has 93.30% zeros) under moderate aggregation. The size of the second dimension of the Weather data is small (J = 17); thus, the advantage of a tensor model over a matricized tensor model is less clear. H-Fuse works well on these data, as it penalizes large jumps between adjacent time ticks (i.e., days), and weather data are well suited to such a constraint. Nevertheless, PREMA improves on the error of CMTF and H-Fuse when the aggregation level is high. Although B-PREMA does not work as well as on the other data, it still has smaller error than the simple baselines (Mean and LS), especially under aggressive aggregation.

⁹ HomeRun is excluded from the results on the Weather data, as it has non-negativity constraints.

[Fig. 6: PREMA works with double aggregation (Scenario B). NDE for aggregation levels (a)-(c) on (a) the Mixed DFF dataset and (b) the Crime dataset; methods shown: Mean, CMTF, PREMA, CPD (oracle).]

5.2 Results on Scenario B

The contemporaneous aggregate Y_c in this scenario is aggregated in two dimensions: stores and items for the Mixed DFF data, or crime locations and types for the Crime data¹⁰. We test three different aggregation levels on each dataset. The difficulty (i.e., the level of aggregation) increases as we move from case (a) to (c); Figure 6 shows the performance for these three cases. For the Mixed DFF data, the levels are: a) Y_t aggregates weeks into monthly samples, while Y_c groups the 93 stores into 18 areas with no aggregation over the items; b) samples in Y_t have monthly resolution, and Y_c groups the stores into 18 areas and the items into groups of 10; and c) Y_t contains temporal aggregates for each quarter of the year, and Y_c groups stores into 18 areas and items into groups of 25.
One can see that the naive mean fails completely, with its error exceeding 1 in case (c) on the Mixed DFF data in Figure 6 (a). PREMA, in contrast, works well with double aggregation and few available samples.

For the Crime data, the aggregation levels are: a) Y_t aggregates the months into quarterly resolution, while Y_c clusters both the crime locations and types into groups of 5; b) Y_t has quarterly time resolution, and Y_c aggregates both the locations and types into groups of 10; and c) Y_t aggregates the months into bi-yearly resolution, and Y_c groups the crime locations and types into groups of 20. Figure 6 (b) shows the performance at these levels on the Crime data. These data are challenging, as they have 91.56% zero values and a small SD. PREMA reduces the error of Mean significantly. Although CMTF performs slightly better at the first two levels, PREMA becomes superior under extreme aggregation.

¹⁰ LS, H-Fuse, and HomeRun are excluded from this comparison, as they run out of memory.

6 CONCLUSIONS

In this work, we proposed a novel framework, called PREMA, for fusing multiple aggregated views of multidimensional data. The proposed method leverages the properties of tensors in estimating the low-rank factors of the target data at higher resolution. The assumed model provably transforms a highly ill-posed problem into an identifiable one. PREMA works with partially observed data, and can disaggregate effectively even without any knowledge of the aggregation mechanism (B-PREMA). Experimental results on real data show that the proposed algorithm is very effective, even in challenging scenarios, such as data with double aggregation and a high level of aggregation.
The contributions of our work are summarized as follows:

• Formulation: we formally defined the problem of multidimensional data disaggregation from views aggregated in different dimensions.
• Identifiability: the considered tensor model provably converts a highly ill-posed problem to an identifiable one.
• Effectiveness: PREMA reduced the disaggregation error of the competing alternatives by up to 67%.
• Unknown aggregation: B-PREMA works even when the aggregation mechanism is unknown.
• Flexibility: PREMA can perform disaggregation on partially observed data.

APPENDIX A
DERIVATION OF GRADIENT EXPRESSIONS

The terms in (11) and (24) can be divided into two types: 1) the CPD of a tensor, with some aggregation matrices multiplying the factors; and 2) the regularization term R in (24). Because the gradient of a sum is the sum of the gradients, it suffices to derive the gradients for the function below, whose two terms represent the two types listed above. Consider:

$$\min_{A,B,C}\; \underbrace{\| \Omega \circledast \big( \mathcal{X} - [\![\, UA, VB, WC \,]\!] \big) \|_F^2}_{T} \;+\; \underbrace{\| \mathbf{1}^T C - \mathbf{1}^T \widetilde{C} \|_2^2}_{R} \tag{27}$$

where $\Omega$ is as defined in (10), and $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ is our data tensor. Note that all the CPD terms in (11) and (24) have the form of the term $T$, with one (or more) of the aggregation matrices $\{U, V, W\}$ equal to $I$. Thus, the term $T$ generalizes all the CPD terms in our models. Using the mode-3 unfolding, $T$ is equivalent to

$$T = \| \Omega_3 \circledast \big( X_3 - ((VB) \odot (UA)) (WC)^T \big) \|_F^2 \tag{28}$$

Vectorizing the above, we get

$$T = \| S x - S \big( (WC) \odot (VB) \odot (UA) \big) \mathbf{1} \|_F^2 \tag{29}$$

where $x = \mathrm{vec}(X_3)$, and $S \in \{0,1\}^{N \times IJK}$ is a fat matrix with a single 1 in each row that selects the available entries of $x$, where $N = \mathrm{nnz}(\Omega)$. Thus, the role of $S$ with respect to $x$ is analogous to the role of $\Omega$ with respect to the tensor form $\mathcal{X}$.
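The equivalence between the unfolded form (28) and the vectorized form (29) rests on the Khatri-Rao identity $\mathrm{vec}(M N^T) = (N \odot M)\mathbf{1}$, with the selection matrix $S$ acting as a row mask. A small numerical check of both facts, with all sizes illustrative:

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Khatri-Rao product of P (m x R) and Q (n x R)."""
    R = P.shape[1]
    return np.stack([np.kron(P[:, r], Q[:, r]) for r in range(R)], axis=1)

rng = np.random.default_rng(0)
I, J, K, R = 4, 3, 5, 2
M = rng.standard_normal((I * J, R))   # plays the role of (VB) kr (UA)
N = rng.standard_normal((K, R))       # plays the role of WC

# vec(M N^T) = (N kr M) 1, column-major vectorization
X3 = M @ N.T                          # mode-3 unfolding, IJ x K
x = X3.flatten(order='F')
v = khatri_rao(N, M) @ np.ones(R)
print(np.allclose(x, v))              # True

# masked Frobenius norm (28) equals the selected-entries form (29)
X3_data = rng.standard_normal((I * J, K))
Omega3 = rng.random((I * J, K)) < 0.6
T_unf = np.linalg.norm(Omega3 * (X3_data - M @ N.T)) ** 2
sel = Omega3.flatten(order='F')       # row-selection pattern of S
xd = X3_data.flatten(order='F')
T_vec = np.linalg.norm(xd[sel] - v[sel]) ** 2
print(np.allclose(T_unf, T_vec))      # True
```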
Equation (29) is also equivalent to

$$T = \| S x - S \big( I \otimes ((VB) \odot (UA)) \big) (W \otimes I)\, c \|_F^2 \tag{30}$$

where $c = \mathrm{vec}(C^T)$. We show the derivatives of $T$ and $R$ w.r.t. $C$ (the derivatives w.r.t. $A$ and $B$ are obtained similarly, using the mode-1 and mode-2 unfoldings and rotating the factors accordingly). First, we derive the gradient of $T$ w.r.t. $C$:

$$\begin{aligned} \nabla_C T &= 2 (W^T \otimes I)\big(I \otimes ((VB) \odot (UA))^T\big) S^T S \big(I \otimes ((VB) \odot (UA))\big)(W \otimes I)\, c \\ &\quad - 2 (W^T \otimes I)\big(I \otimes ((VB) \odot (UA))^T\big) S^T S x \\ &= 2 \big(I \otimes ((VB) \odot (UA))^T\big)(W^T \otimes I)\, S^T S \Big( \big(I \otimes ((VB) \odot (UA))\big)(W \otimes I)\, c - x \Big) \end{aligned} \tag{31}$$

Using the mode-3 unfolding, the final expression in (31) can be rewritten as

$$\nabla_C T = 2\, W^T \big( \Omega_3 \circledast (\widehat{X}_3 - X_3) \big)^T \big( (VB) \odot (UA) \big) \tag{32}$$

where $\widehat{X}_3 = ((VB) \odot (UA)) (WC)^T$. The gradient above can be computed efficiently by the following steps:

1) Compute $L = \Omega_3 \circledast (\widehat{X}_3 - X_3)$.
2) Compute $M = L^T \big( (VB) \odot (UA) \big)$.
3) Compute $2\, W^T M$.

Next, the derivative of $R$ w.r.t. $C$ is

$$\nabla_C R = 2 \big( \mathbf{1}\mathbf{1}^T C - \mathbf{1}\mathbf{1}^T \widetilde{C} \big) \tag{33}$$

APPENDIX B
DERIVATION OF STEP SIZE EXPRESSIONS

The step sizes in both Algorithm 1 and Algorithm 2 are chosen using exact line search. As mentioned earlier, the function (27) generalizes all the terms in the PREMA and B-PREMA models. Thus, we use (27) to show how to find the step size $\gamma$ associated with updating $C$, as an illustrative example. In this case, exact line search chooses $\gamma$ as the minimizer of

$$\underset{\gamma \geq 0}{\operatorname{argmin}}\; F\big( C - \gamma \nabla_C F \big) \tag{34}$$

where $F = T + R$, with $T$ and $R$ as defined in (27).
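The three-step recipe for $\nabla_C T$ given above avoids ever forming the large Kronecker-structured matrices of (31) explicitly, and is easy to verify against finite differences. A sketch with illustrative sizes and random data (since $T$ is quadratic in $C$, the central difference is essentially exact):

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Khatri-Rao product of P (m x R) and Q (n x R)."""
    R = P.shape[1]
    return np.stack([np.kron(P[:, r], Q[:, r]) for r in range(R)], axis=1)

rng = np.random.default_rng(1)
I, J, K, Kw, R = 4, 3, 6, 2, 2
UA = rng.standard_normal((I, R))       # U A, treated as one matrix here
VB = rng.standard_normal((J, R))       # V B
W = rng.standard_normal((Kw, K))
C = rng.standard_normal((K, R))
X3 = rng.standard_normal((I * J, Kw))  # mode-3 unfolding of the data
Om3 = (rng.random((I * J, Kw)) < 0.7).astype(float)

def T_val(C):
    Xhat3 = khatri_rao(VB, UA) @ (W @ C).T
    return np.linalg.norm(Om3 * (X3 - Xhat3)) ** 2

# Steps 1-3 from the text:
Xhat3 = khatri_rao(VB, UA) @ (W @ C).T
L = Om3 * (Xhat3 - X3)                 # step 1
Mmat = L.T @ khatri_rao(VB, UA)        # step 2
grad = 2.0 * W.T @ Mmat                # step 3: gradient w.r.t. C

# finite-difference sanity check on one entry of C
eps = 1e-6
E = np.zeros_like(C); E[1, 0] = eps
fd = (T_val(C + E) - T_val(C - E)) / (2 * eps)
print(np.allclose(grad[1, 0], fd, atol=1e-4))   # True
```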
Plugging $C - \gamma \nabla_C F$ into (27) and rearranging terms, we get

$$\underset{\gamma \geq 0}{\operatorname{argmin}}\; \Big\| \underbrace{\Omega_3 \circledast \big( X_3 - ((VB) \odot (UA))\, C^T W^T \big)}_{E} + \gamma\, \underbrace{\Omega_3 \circledast \big( ((VB) \odot (UA))\, \nabla_C F^T\, W^T \big)}_{D} \Big\|_F^2 + \Big\| \underbrace{\mathbf{1}^T C - \mathbf{1}^T \widetilde{C}}_{e^T} - \gamma\, \underbrace{\mathbf{1}^T \nabla_C F}_{d^T} \Big\|_2^2 \tag{35}$$

One can see that at the optimal solution to (35) we have:

$$-\mathrm{vec}(E)^T = \gamma\, \mathrm{vec}(D)^T \tag{36}$$
$$e^T = \gamma\, d^T \tag{37}$$

Multiplying (36) by $\mathrm{vec}(D)$ and (37) by $d$, and summing the two resulting equations, we get

$$-\mathrm{vec}(E)^T \mathrm{vec}(D) + e^T d = \gamma \big( \mathrm{vec}(D)^T \mathrm{vec}(D) + d^T d \big) \tag{38}$$

Respecting the non-negativity constraint, the optimal solution is

$$\gamma = \max\left( 0,\; \frac{-\mathrm{vec}(E)^T \mathrm{vec}(D) + e^T d}{\mathrm{vec}(D)^T \mathrm{vec}(D) + d^T d} \right). \tag{39}$$

APPENDIX C
INITIALIZATION ALGORITHM

The initialization steps of Algorithm 1 are as follows:

Set the missing entries in $Y_t$ and $Y_c$ to zeros.
if $V = I$ and $K > I$ then
    $\widetilde{A}, B, C \leftarrow \mathrm{CPD}(Y_c)$;  $A \leftarrow$ solve $Y_{t,1} = ((WC) \odot B) A^T$
else
    $A, B, \widetilde{C} \leftarrow \mathrm{CPD}(Y_t)$;  $C \leftarrow$ solve $Y_{c,3} = ((VB) \odot (UA)) C^T$
end if

Note that the missing entries are set to 0 only in these initialization steps. We use the MATLAB-based package Tensorlab to compute the CPD in the initialization (e.g., $\mathrm{CPD}(Y_c)$).

APPENDIX D
PROOF OF PROPOSITION 1

Let $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ be the target tensor to disaggregate, with CPD $\mathcal{X} = [\![\, A, B, C \,]\!]$ of rank $R$, and let $Y_t \in \mathbb{R}^{I \times J \times K_w} = \mathcal{X} \times_3 W$. Then, under the conditions of Proposition 1, $Y_t$ admits a unique CPD $Y_t = [\![\, A_t, B_t, C_t \,]\!]$. Since it is unique, it holds that:

$$A_t = A \Pi \Lambda_1, \quad B_t = B \Pi \Lambda_2, \quad C_t = W C \Pi \Lambda_3, \tag{40}$$

where $\Pi$ is a permutation matrix and $\Lambda_1, \Lambda_2, \Lambda_3$ are diagonal matrices such that $\Lambda_1 \Lambda_2 \Lambda_3 = I$. In the case where $Y_t$ has missing entries, the conditions under which $[\![\, A_t, B_t, C_t \,]\!]$ is identifiable are stricter and depend on the pattern of misses.
We can use the conditions in [24], [25], [26] for fiber, regular, and random sampling, respectively. So far, the factors $A, B$ have been identified up to column permutation and scaling. What remains to be proven is that:

$$\Omega_c \circledast Y_c = \Omega_c \circledast (\mathcal{X} \times_1 U \times_2 V) = \Omega_c \circledast [\![\, UA, VB, C \,]\!] \tag{41}$$

yields a solution for $C_c$ such that $C_c = C \Pi \Lambda_3$. Equation (41) can be equivalently written as:

$$\Omega_c\, y_c = \Omega_c \big( C \odot (VB) \odot (UA) \big) \mathbf{1} = \Omega_c \big( I \otimes ((VB) \odot (UA)) \big)\, c, \tag{42}$$

where $y_c$ and $c$ are vectorized versions of $Y_c$ and $C^T$, respectively, and $\Omega_c \in \{0,1\}^{L_c \times I_u J_v K}$ is a fat selection matrix that selects the available entries of $y_c$. Now let $\widetilde{A} = UA$ and $\widetilde{B} = VB$. Following [42, Lemma 1], $\widetilde{A}, \widetilde{B}$ are drawn from absolutely continuous non-singular distributions. Also let $P = \widetilde{B} \odot \widetilde{A}$. Since $I_u J_v \geq R$, the determinant of any $R \times R$ submatrix of $P$ is a non-trivial analytic function of $\widetilde{A}, \widetilde{B}$. Therefore, any $R \times R$ minor of $P$ is non-zero almost surely [47, Lemma 3], and any $R$ rows of $P$ are independent. Taking a closer look at the matrix $G = I \otimes ((VB) \odot (UA)) = I \otimes (\widetilde{B} \odot \widetilde{A})$, we observe that it is an $I_u J_v K \times KR$ block-diagonal matrix of the form:

$$G = \begin{bmatrix} P & 0 & \dots & 0 \\ 0 & P & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & P \end{bmatrix} = \begin{bmatrix} G_1 \\ G_2 \\ \vdots \\ G_K \end{bmatrix} \tag{43}$$

Each block $G_k$ corresponds to the $k$-th frontal slab of $Y_c$, and the rows of different $G_k$'s are independent by construction. Since we have assumed that the number of observed entries in each frontal slab is at least $R$, $\Omega_c G$ has full column rank equal to $KR$, and the solution for $c$ in (42) is unique with probability 1. Plugging $A_t, B_t$ into (42), we get:

$$\Omega_c\, y_c = \Omega_c \big( C \odot (V B_t) \odot (U A_t) \big) \mathbf{1} \tag{44}$$
$$\qquad\;\; = \Omega_c \big( C \odot (V B \Pi \Lambda_2) \odot (U A \Pi \Lambda_1) \big) \mathbf{1} \tag{45}$$

Then the unique solution for $C$ satisfies $C_c = C \Pi \Lambda_3$, and $\widehat{\mathcal{X}} = [\![\, A_t, B_t, C_c \,]\!]$ disaggregates $Y_t, Y_c$ to $\mathcal{X}$ almost surely.

REFERENCES

[1] A. Silvestrini and D.
Veredas, "Temporal aggregation of univariate and multivariate time series models: a survey," J. of Econ. Surveys, vol. 22, no. 3, pp. 458-497, 2008.
[2] S. Uludag, K.-S. Lui, K. Nahrstedt, and G. Brewster, "Analysis of topology aggregation techniques for QoS routing," ACM Computing Surveys (CSUR), vol. 39, no. 3, p. 7, 2007.
[3] N. S. Patil and P. Patil, "Data aggregation in wireless sensor network," in IEEE Intl. Conf. on Comput. Intell. and Computing Research, vol. 6, 2010.
[4] E. Shi, H. Chan, E. Rieffel, R. Chow, and D. Song, "Privacy-preserving aggregation of time-series data," in Annual Network & Distributed System Security Symposium (NDSS). Internet Society, 2011.
[5] Y. Park and J. Ghosh, "Ludia: An aggregate-constrained low-rank reconstruction algorithm to leverage publicly released health data," in Proc. of the 20th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining. ACM, 2014, pp. 55-64.
[6] R. H. Ellaway, M. V. Pusic, R. M. Galbraith, and T. Cameron, "Developing the role of big data and analytics in health professional education," Medical Teacher, vol. 36, no. 3, pp. 216-222, 2014.
[7] I. Motakis and C. Zaniolo, "Temporal aggregation in active database rules," in Proc. of the 1997 ACM SIGMOD Intl. Conf. on Mgmt. of Data, ser. SIGMOD '97. New York, NY, USA: ACM, 1997, pp. 440-451.
[8] Z. Erkin, J. R. Troncoso-Pastoriza, R. L. Lagendijk, and F. Pérez-González, "Privacy-preserving data aggregation in smart metering systems: An overview," IEEE Signal Process. Mag., vol. 30, no. 2, pp. 75-86, 2013.
[9] W. A. Clark and K. L. Avery, "The effects of data aggregation in statistical analysis," Geo. Analysis, vol. 8, no. 4, pp. 428-438, 1976.
[10] T. A. Garrett, "Aggregated versus disaggregated data in regression analysis: implications for inference," Economics Letters, vol. 81, no. 1, pp. 61-65, 2003.
[11] Y. Jin, B. D. Williams, M. A. Waller, and A. R.
Hofer, "Masking the bullwhip effect in retail: the influence of data aggregation," Intl. J. of Physical Distrib. & Logistics Mgmt., vol. 45, no. 8, pp. 814-830, 2015.
[12] M. Lenzen, "Aggregation versus disaggregation in input-output analysis of the environment," Econ. Sys. Research, vol. 23, no. 1, pp. 73-89, 2011.
[13] D. Cole, "The effects of student-faculty interactions on minority students' college grades: Differences between aggregated and disaggregated data," J. of the Professoriate, vol. 3, no. 2, 2010.
[14] Z. Liu, H. A. Song, V. Zadorozhny, C. Faloutsos, and N. Sidiropoulos, "H-Fuse: Efficient fusion of aggregated historical data," in Proc. of the 2017 SIAM Int. Conf. on Data Mining, Houston, Texas, USA, April 2017, pp. 786-794.
[15] F. M. Almutairi, F. Yang, H. A. Song, C. Faloutsos, N. Sidiropoulos, and V. Zadorozhny, "HomeRun: scalable sparse-spectrum reconstruction of aggregated historical data," Proc. of the VLDB Endowment, vol. 11, no. 11, pp. 1496-1508, 2018.
[16] N. Rossi et al., "A note on the estimation of disaggregate time series when the aggregate is known," The Review of Econ. and Stats., vol. 64, no. 4, pp. 695-696, 1982.
[17] G. C. Chow and A.-l. Lin, "Best linear unbiased interpolation, distribution, and extrapolation of time series by related series," The Review of Econ. and Stats., pp. 372-375, 1971.
[18] J. M. Pavía-Miralles, "A survey of methods to interpolate, distribute and extrapolate time series," J. of Service Science and Mgmt., vol. 3, no. 04, p. 449, 2010.
[19] J. M. Pavía-Miralles and B. Cabrer-Borrás, "On estimating contemporaneous quarterly regional GDP," J. of Forecasting, vol. 26, no. 3, pp. 155-170, 2007.
[20] T. Di Fonzo, "The estimation of m disaggregate time series when contemporaneous and temporal aggregates are known," The Rev. of Econ. and Stats., pp. 178-182, 1990.
[21] F. M.
Almutairi, C. I. Kanatsoulis, and N. D. Sidiropoulos, "Tendi: Tensor disaggregation from multiple coarse views," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, May 2020, pp. ???-???
[22] F. L. Hitchcock, "The expression of a tensor or a polyadic as a sum of products," Journal of Mathematics and Physics, vol. 6, no. 1-4, pp. 164-189, 1927.
[23] L. Chiantini and G. Ottaviani, "On generic identifiability of 3-tensors of small rank," SIAM J. on Matrix Analys. and App., vol. 33, no. 3, pp. 1018-1037, 2012.
[24] M. Sørensen and L. De Lathauwer, "Fiber sampling approach to canonical polyadic decomposition and application to tensor completion," SIAM Journal on Matrix Analysis and Applications, vol. 40, no. 3, pp. 888-917, 2019.
[25] C. I. Kanatsoulis, X. Fu, N. D. Sidiropoulos, and M. Akçakaya, "Tensor completion from regular sub-Nyquist samples," arXiv preprint arXiv:1903.00435, 2019.
[26] M. Ashraphijuo and X. Wang, "Fundamental conditions for low-CP-rank tensor completion," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 2116-2145, 2017.
[27] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, "Tensor decomposition for signal processing and machine learning," IEEE Transactions on Signal Processing, vol. 65, no. 13, pp. 3551-3582, 2017.
[28] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455-500, 2009.
[29] M. Lenzerini, "Data integration: A theoretical perspective," in Proc. of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Princ. of Database Sys. ACM, 2002, pp. 233-246.
[30] X. L. Dong and F. Naumann, "Data fusion: resolving data conflicts for integration," Proc. of the VLDB Endowment, vol. 2, no. 2, pp. 1654-1655, 2009.
[31] C. Faloutsos, H. V. Jagadish, and N. Sidiropoulos, "Recovering information from summary data," PVLDB, vol. 1, no. 1, pp. 36-45, 1997.
[32] C. Sax and P.
Steiner, "Temporal disaggregation of time series," The R Journal, vol. 5, no. 2, pp. 80-87, 2013.
[33] F. Yang, F. M. Almutairi, H. A. Song, C. Faloutsos, N. D. Sidiropoulos, and V. Zadorozhny, "TurboLift: fast accuracy lifting for historical data recovery," The VLDB Journal, pp. 1-20, 2020.
[34] H.-F. Yu, N. Rao, and I. S. Dhillon, "Temporal regularized matrix factorization for high-dimensional time series prediction," in Advances in Neural Information Processing Systems, 2016, pp. 847-855.
[35] Y. Matsubara, Y. Sakurai, C. Faloutsos, T. Iwata, and M. Yoshikawa, "Fast mining and forecasting of complex time-stamped events," in Proc. of the 18th ACM SIGKDD Intl. Conf. on Knowl. Discovery and Data Mining. ACM, 2012, pp. 271-279.
[36] K. Takeuchi, H. Kashima, and N. Ueda, "Autoregressive tensor factorization for spatio-temporal predictions," in 2017 IEEE Intl. Conf. on Data Mining (ICDM). IEEE, 2017, pp. 1105-1110.
[37] F. M. Almutairi, N. D. Sidiropoulos, and G. Karypis, "Context-aware recommendation-based learning analytics using tensor and coupled matrix factorization," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 5, pp. 729-741, 2017.
[38] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos, "Tensors for data mining and data fusion: Models, applications, and scalable algorithms," ACM Trans. Intell. Syst. Technol., vol. 8, no. 2, pp. 16:1-16:44, Oct. 2016.
[39] C. I. Kanatsoulis, X. Fu, N. D. Sidiropoulos, and W.-K. Ma, "Hyperspectral super-resolution: Combining low rank tensor and matrix structure," in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3318-3322.
[40] ——, "Hyperspectral super-resolution via coupled tensor factorization: Identifiability and algorithms," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 3191-3195.
[41] C. I. Kanatsoulis, N. D. Sidiropoulos, M. Akçakaya, and X.
Fu, "Regular sampling of tensor signals: Theory and application to fMRI," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2932-2936.
[42] C. I. Kanatsoulis, X. Fu, N. D. Sidiropoulos, and W. Ma, "Hyperspectral super-resolution: A coupled tensor factorization approach," IEEE T. on Signal Process., vol. 66, no. 24, pp. 6503-6517, Dec 2018.
[43] B. D. Williams and M. A. Waller, "Creating order forecasts: point-of-sale or order history?" J. of Business Logistics, vol. 31, no. 2, pp. 231-251, 2010.
[44] Y. Jin, B. D. Williams, T. Tokar, and M. A. Waller, "Forecasting with temporally aggregated demand signals in a retail supply chain," Journal of Business Logistics, vol. 36, no. 2, pp. 199-211, 2015.
[45] N. Yokoya, T. Yairi, and A. Iwasaki, "Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion," IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 2, pp. 528-537, 2011.
[46] M. Simões, J. Bioucas-Dias, L. B. Almeida, and J. Chanussot, "A convex formulation for hyperspectral image superresolution via subspace-based regularization," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 6, pp. 3373-3388, 2014.
[47] R. C. Gunning and H. Rossi, Analytic Functions of Several Complex Variables. American Mathematical Soc., 2009, vol. 368.

Faisal M. Almutairi received the B.Sc. degree with first-class honors in electrical engineering from Qassim University (QU), Qassim, Saudi Arabia, in 2012. He received the M.S. degree in electrical engineering and the M.S. degree in industrial engineering from the University of Minnesota (UMN), Minneapolis, MN, USA, in 2016 and 2017, respectively. Faisal is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering (ECE) at UMN, Minneapolis, MN, USA. His research interests include signal processing, data mining, and optimization.
Charilaos I. Kanatsoulis (S'18) is a Ph.D. candidate in the Department of Electrical and Computer Engineering (ECE) at the University of Minnesota (UMN), Twin Cities. He received his Diploma in electrical and computer engineering from the National Technical University of Athens, Greece, in 2014. His research interests include signal processing, machine learning, tensor analysis, and graph mining.

Nicholas D. Sidiropoulos (F'09) received the Diploma degree in electrical engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1988, and the M.S. and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, MD, USA, in 1990 and 1992, respectively. He has served on the faculty of the University of Virginia (UVA), the University of Minnesota, and the Technical University of Crete, Greece, prior to his current appointment as Louis T. Rader Professor and Chair of ECE at UVA. From 2015 to 2017, he was an ADC Chair Professor with the University of Minnesota. His research interests are in signal processing, communications, optimization, tensor decomposition, and factor analysis, with applications in machine learning and communications. He was the recipient of the NSF CAREER award in 1998 and the IEEE Signal Processing Society (SPS) Best Paper Award in 2001, 2007, and 2011. He served as an IEEE SPS Distinguished Lecturer in 2008 and 2009, and as IEEE SPS Vice President from 2017 to 2019. He was the recipient of the 2010 IEEE SPS Meritorious Service Award and the 2013 Distinguished Alumni Award from the University of Maryland, Department of ECE. He became a Fellow of EURASIP in 2014.
