Creating a level playing field for all symbols in a discretization
In time series analysis research there is a strong interest in discrete representations of real-valued data streams. One approach that emerged over a decade ago and is still considered state-of-the-art is the Symbolic Aggregate Approximation (SAX) algorithm. This discretization algorithm was the first symbolic approach that mapped a real-valued time series to a representation guaranteed to lower-bound the Euclidean distance. This paper examines the SAX assumption that the data are highly Gaussian and the use of the standard normal curve to choose the partitions that discretize the data. Generally, and certainly in its canonical form, the SAX approach chooses partitions on the standard normal curve that give each symbol in a finite alphabet an equal probability of occurring. This choice is usually valid because a time series is normalized before the rest of the SAX algorithm is applied. However, the intermediate Piecewise Aggregate Approximation (PAA) step introduces a caveat to this equi-probability assumption. We show in this paper that applying PAA does alter the distribution of the data: the standard deviation shrinks by an amount that depends on the number of points used to create each PAA segment and on the degree of auto-correlation within the series. Data that exhibits statistically significant auto-correlation is less affected by this contraction. As the standard deviation of the data shrinks, the mean remains the same, but the distribution is no longer standard normal, so partitions based on the standard normal curve no longer yield equal symbol probabilities.
💡 Research Summary
The paper investigates a subtle but important flaw in the widely used Symbolic Aggregate Approximation (SAX) pipeline for time‑series discretization. SAX consists of three steps: (1) z‑normalization of the raw series (mean = 0, standard deviation = 1), (2) dimensionality reduction via Piecewise Aggregate Approximation (PAA), and (3) mapping each PAA segment to a symbol by cutting the standard normal distribution into equal‑probability intervals. The authors argue that step 2 fundamentally changes the statistical properties of the data, thereby invalidating the equal‑probability assumption that underlies step 3.
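The three steps above can be sketched in a few lines. This is a minimal illustration of the canonical pipeline, not the authors' code; the function name and parameters are chosen here for clarity. It uses the standard-library `NormalDist` for the N(0,1) quantiles.

```python
import numpy as np
from statistics import NormalDist

def sax(series, n_segments, alphabet_size):
    """Minimal sketch of the canonical SAX pipeline."""
    # Step 1: z-normalization (mean 0, standard deviation 1)
    x = (series - series.mean()) / series.std()
    # Step 2: PAA -- replace each block of consecutive points with its mean
    # (series length is assumed divisible by n_segments for simplicity)
    paa = x.reshape(n_segments, -1).mean(axis=1)
    # Step 3: equal-probability breakpoints from standard normal quantiles
    cuts = [NormalDist().inv_cdf(i / alphabet_size)
            for i in range(1, alphabet_size)]
    # Each segment becomes the index of the interval it falls into
    return np.searchsorted(cuts, paa)

rng = np.random.default_rng(0)
word = sax(rng.standard_normal(120), n_segments=12, alphabet_size=4)
```

Step 3 is exactly where the equal-probability assumption enters: the breakpoints are fixed quantiles of N(0,1), regardless of what step 2 did to the distribution.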
The core observation is that PAA replaces each block of k consecutive points with their mean. Because the variance of a sample mean is σ²/k, the standard deviation of the PAA series shrinks by a factor of √k relative to the original normalized series. Consequently, while the mean remains zero, the distribution becomes narrower than the standard normal. The pre‑computed breakpoints (derived from the N(0,1) quantiles) no longer correspond to equal‑probability regions for the transformed data. In practice, symbols corresponding to the central region become over‑represented, while those at the tails become under‑represented.
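For independent observations this variance reduction can be checked numerically. The sketch below (illustrative, not the paper's experiment code) compares the empirical standard deviation of PAA block means against the theoretical 1/√k for i.i.d. N(0,1) data:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(100_000)  # z-normalized white noise

for k in (2, 5, 10, 20):
    # PAA: mean of each block of k consecutive points
    paa = x[: len(x) - len(x) % k].reshape(-1, k).mean(axis=1)
    # Empirical std of the block means vs. the theoretical 1/sqrt(k)
    print(f"k={k:2d}  empirical={paa.std():.3f}  theory={1/np.sqrt(k):.3f}")
```

For k = 20 the standard deviation falls to roughly 0.22, so most block means land in the central intervals of the fixed N(0,1) breakpoints.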
The authors further explore how the degree of autocorrelation in the original series moderates this effect. When successive observations are highly correlated (e.g., a smooth sinusoid), the block means are close to the original values, so the variance reduction is modest. Conversely, for series with little or no autocorrelation (e.g., white noise), the block means vary much less than the original points, leading to a pronounced contraction of variance.
To substantiate these claims, the paper presents two sets of experiments.
- Simulated data – three synthetic series are generated: (a) pure white noise from N(0,1), (b) a perfect sinusoid, and (c) a sinusoid with added Gaussian noise. For each series, PAA is applied with segment lengths of 1, 2, 5, 10, and 20 points. The white-noise case shows a dramatic drop in standard deviation (from ≈1.0 to ≈0.23) and a corresponding concentration of symbols around the middle of the alphabet. The pure sinusoid, despite being non-Gaussian, exhibits almost no change in variance because its high autocorrelation preserves the original spread. The noisy sinusoid falls in between: variance shrinks to ≈0.61 for segment length 20, and the symbol distribution becomes noticeably skewed.
- Real-world data – twelve publicly available series from the UCI repository (including weather, EEG, robot arm, sunspot, and foreign-exchange rates) are examined. Normality is assessed with the Jarque-Bera test; only one series (a Forex rate) passes at the 5 % level. Autocorrelation functions (ACFs) are plotted, and the series are ordered by how strongly the PAA step affects them. Those with slowly decaying ACFs (similar to a sinusoid) show minimal variance reduction; the Forex series is a prime example. The remaining eleven series, many of which have near-zero autocorrelation, experience substantial standard-deviation shrinkage as the PAA segment size grows, confirming the simulated findings.
Tables in the paper quantify the standard‑deviation reduction for each series and each PAA window size, illustrating that the effect is systematic and not an artifact of a particular dataset.
The authors conclude that the standard SAX pipeline, when applied to low‑autocorrelation series with moderate or large PAA windows, violates its own equal‑probability premise. This has practical implications: the lower‑bounding distance measure (MINDIST) assumes symbol probabilities are uniform; if they are not, distance estimates can become biased, and downstream tasks such as clustering, classification, or anomaly detection may suffer.
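To make the dependence on the breakpoints concrete, here is a hedged sketch of the standard SAX lower-bounding distance (MINDIST) as usually stated in the SAX literature; the function name and 0-indexed symbol convention are choices made here. The breakpoint table `beta` is built from N(0,1) quantiles, so if the PAA output is narrower than standard normal, the table no longer reflects the true gaps between symbol regions:

```python
import numpy as np
from statistics import NormalDist

def mindist(word_a, word_b, n, alphabet_size):
    """Lower-bounding distance between two SAX words of length w,
    for an original series of length n (sketch, 0-indexed symbols)."""
    beta = [NormalDist().inv_cdf(i / alphabet_size)
            for i in range(1, alphabet_size)]
    w = len(word_a)

    def cell(r, c):
        # Equal or adjacent symbols contribute zero; otherwise the
        # distance between the breakpoints separating their regions.
        if abs(r - c) <= 1:
            return 0.0
        return beta[max(r, c) - 1] - beta[min(r, c)]

    return np.sqrt(n / w) * np.sqrt(
        sum(cell(r, c) ** 2 for r, c in zip(word_a, word_b)))
```

Every nonzero cell is a difference of N(0,1) breakpoints, which is why a contracted PAA distribution biases the resulting distance estimates.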
To mitigate the problem, two practical remedies are suggested:
- Re-normalize after PAA – compute the mean and standard deviation of the PAA series and re-scale it to unit variance before applying the standard breakpoints. This restores the equal-probability property at the cost of an extra normalization step.
- Adapt breakpoints to the empirical distribution – rather than using fixed N(0,1) quantiles, derive quantiles from the actual PAA output (e.g., via histogram equalization or quantile estimation). This approach automatically accounts for variance contraction and any residual skewness.
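Both remedies can be sketched in a few lines. This assumes `paa` already holds the PAA-reduced series; the function names are illustrative, not from the paper:

```python
import numpy as np
from statistics import NormalDist

def renormalized_symbols(paa, alphabet_size):
    # Remedy 1: re-scale the PAA series to zero mean and unit variance,
    # then reuse the standard N(0,1) breakpoints.
    z = (paa - paa.mean()) / paa.std()
    cuts = [NormalDist().inv_cdf(i / alphabet_size)
            for i in range(1, alphabet_size)]
    return np.searchsorted(cuts, z)

def empirical_symbols(paa, alphabet_size):
    # Remedy 2: derive breakpoints from the empirical quantiles of the
    # PAA output itself, which also absorbs any residual skewness.
    cuts = np.quantile(paa, np.arange(1, alphabet_size) / alphabet_size)
    return np.searchsorted(cuts, paa)

# Demonstration on white noise with segment length k = 20
rng = np.random.default_rng(7)
paa = rng.standard_normal(20_000).reshape(-1, 20).mean(axis=1)
counts = np.bincount(empirical_symbols(paa, 4), minlength=4)
# Each of the four symbols now covers roughly a quarter of the segments
```

With the fixed N(0,1) breakpoints this white-noise example would concentrate almost all segments in the two central symbols; either remedy restores a roughly uniform symbol distribution.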
The paper does not propose a new discretization algorithm but raises awareness of a hidden source of bias in the canonical SAX implementation and provides guidance for practitioners who need reliable symbolic representations, especially when dealing with weakly autocorrelated or highly compressed time series.