So the heart of the problem, from what I can gather, is the sliding window approach to generating subsequences. This approach necessarily generates "redundantly similar" windows, i.e. each window is most similar to its immediate neighbors, since adjacent windows overlap over a length of window - 1.
The sliding window approach is not adding random noise to the clustering; it is adding redundantly similar data, resulting in almost perfect sine-wave-like cluster centers.
I come from a genomics background, so when I read about sliding windows on time series data, I think of the sliding window approach on regions of genome sequences.
The ultimate point of doing this clustering is to find repetitive sub-sequences in the series. This is also something we do commonly in the genomics field (repeatmasker/repeatmodeler software for example).
Something we can look at in genomes is "k-mer coverage". A k-mer is just a length-k subsequence over the alphabet (A,T,G,C), analogous to a length-k subsequence of a time series.
By scanning the genome, we can tally up how many times each k-mer appears. With this tally, we can determine a k-mer coverage for a region of the genome, which gives us an idea of how many times we see this same region elsewhere in the genome.
Maybe this approach can be adapted somehow? The only problem is that in genomics we have four discrete classes (A,G,T,C), which makes exact matching relatively easy. In time series we have continuous data, which makes matches a lot tougher to find and even to define.
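Here's a minimal sketch of what that adaptation might look like, assuming you first quantize the continuous values into a small discrete alphabet (the function name and bin choices here are my own, just for illustration):

    import numpy as np
    from collections import Counter

    def kmer_coverage(series, k=8, n_bins=4):
        # Quantize the series into discrete symbols (quartile bins here),
        # then tally k-mers exactly as you would on a genome.
        breakpoints = np.quantile(series, np.linspace(0, 1, n_bins + 1)[1:-1])
        symbols = np.digitize(series, breakpoints)    # values in 0..n_bins-1
        kmers = [tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1)]
        counts = Counter(kmers)
        # coverage[i] = how often the k-mer starting at position i
        # appears anywhere in the series
        return np.array([counts[km] for km in kmers])

Positions with coverage well above 1 would then be candidates for repeated motifs, analogous to high k-mer coverage flagging repetitive genomic regions.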
Disclaimer: I am not defending or denying the claims in the paper. I have a philosophical interest in time.[1]
The difference, I think, is that the series in a genome is lexicographic rather than temporal: i.e. which end of a particular strand you read from is irrelevant. A time series, on the other hand, has an inherent ordering. By analogy, a time series is a directed graph; a genome is undirected.
That is, the algorithm for finding a subsequence in a time series can be used for finding a subsequence in a genome. But the meaning of a subsequence is different. It's:
A - G - A - C
versus
C -> A -> G -> A.
Time series data contains implication and causality. Any claim about time series data is implicitly a claim about causality: e.g. we might claim a time series appears to be random. We don't really talk about a genome's base sequence being random, because the causality (other than the trivial case of analogues) lies outside the sequence: the reason for the sequence is external to it.
This is pretty much the main idea behind another paper that Eamonn wrote [0]. The idea is that you can normalize and discretize time series data into a lexicographic representation, then perform all sorts of interesting analyses on it. For example, searching for common subsequences can be done via regular expressions.
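Roughly, the discretization works like this (a sketch of the SAX idea, not the authors' actual code; the breakpoints below are the standard equiprobable N(0,1) cuts for a 4-letter alphabet):

    import re
    import numpy as np

    def sax_word(series, n_segments=32, alphabet="abcd"):
        # z-normalize, reduce with piecewise aggregate approximation (PAA),
        # then map each segment mean to a symbol via N(0,1) breakpoints.
        x = (series - series.mean()) / series.std()
        x = x[: len(x) - len(x) % n_segments]
        paa = x.reshape(n_segments, -1).mean(axis=1)
        breakpoints = [-0.67, 0.0, 0.67]   # equiprobable under N(0,1)
        return "".join(alphabet[np.digitize(v, breakpoints)] for v in paa)

    word = sax_word(np.sin(np.linspace(0, 20, 1024)))
    print(re.findall(r"a+b+c+d+", word))   # e.g. search for rising runs

Once the signal is a string, substring search, regexes, suffix trees, etc. all apply directly.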
"The cluster centers found by STS clustering on any particular run of k-means on stock market dataset are not significantly more similar to each other than they are to cluster centers taken from random walk data! In other words, if we were asked to perform clustering on a particular stock market dataset, we could reuse an old clustering obtained from random walk data, and no one could tell the difference."
"As the sliding window passes by, the datapoint first appears as the rightmost value in the window, then it goes on to appear exactly once in every possible location within the sliding window. So the t_i datapoint contribution to the overall shape is the same everywhere..."
"Another way to look at it is that every value v_i in the mean vector, 1 ≤ i ≤ w, is computed by averaging essentially every value in the original time series; more precisely, from t_i to t_m-w+i . So for a time series of m = 1024 and w = 32, the first value in the mean vector is the average of t[1..993]; the second value is the average of t[2…994], and so forth. Again, the only datapoints not being included in every computation are the ones at the very beginning and at the very end, and their effects are negligible asymptotically."
I wonder if the same effect applies to an exponential sliding window, i.e. weighting more "recent" samples higher than older samples, so that effectively it's an infinite window instead of one with a hard cutoff.
Hmm, good question. I'd say this would simply distribute the contribution of each window member along an exponential (rather than constant) trend. The terms in the mean vector would then be polynomials of the window members. Even if you lined up the weight of t_(x^n) with t_(x^m), it seems to me the other t_i, filtered at exponential distances, wouldn't line up. You could probably arrange it to get split on these nonidentical term weights, and maybe this could work as a (somewhat involved) method of feature selection that didn't degenerate to identical contributions. Interesting idea! - and maybe there are a few papers about it somewhere :)
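For concreteness, here's one way the "infinite window" features could look (a sketch under my own assumptions; the decay rates are arbitrary):

    import numpy as np

    def ewma_features(series, alphas=(0.5, 0.2, 0.05)):
        # One feature vector per time step: a few exponentially weighted
        # moving averages at different decay rates, instead of a hard
        # w-point window. Each sample's contribution decays geometrically
        # rather than being constant inside the window.
        feats = []
        for a in alphas:
            out = np.empty(len(series))
            acc = series[0]
            for i, x in enumerate(series):
                acc = a * x + (1 - a) * acc
                out[i] = acc
            feats.append(out)
        return np.stack(feats, axis=1)   # shape (len(series), len(alphas))

Whether clustering these still degenerates the way hard windows do is exactly the open question; the contributions are no longer identical across positions, but they're still highly correlated between adjacent time steps.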
Exactly the comment I was looking for - I've done a bit of work classifying time series using 1-NN with DTW as the distance metric, and it works quite well.
I was under the impression this work was better known, though. It's surprising not to see it mentioned, considering how long DTW has been around.
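For anyone who hasn't seen it, the classic DTW recurrence is only a few lines (textbook version, not tied to any particular paper; in practice you'd add a Sakoe-Chiba band to keep it fast):

    import numpy as np

    def dtw(a, b):
        # O(len(a) * len(b)) dynamic-programming DTW distance.
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = (a[i - 1] - b[j - 1]) ** 2
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return np.sqrt(D[n, m])

    def predict_1nn(query, train_X, train_y):
        # 1-NN: label of the training series with the smallest DTW distance.
        return train_y[int(np.argmin([dtw(query, x) for x in train_X]))]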
We found the same issue with non-normal distributions in our time series data sets when using SAX (developed by this paper's authors; it assumes normality for dimensionality reduction) and addressed it by using quantiles in the piecewise aggregate approximation step. The quantile breakpoints behaved much better than the "normal" breakpoints.
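In code, the swap is small: derive the breakpoints from empirical quantiles of the training data rather than from fixed N(0,1) cuts (a sketch; the names and the 4-letter default are mine, not SAX's):

    import numpy as np

    def symbolize(paa_values, train_values, alphabet_size=4, use_quantiles=True):
        if use_quantiles:
            # Empirical quantile breakpoints: symbols are equiprobable
            # under the actual data distribution, normal or not.
            qs = np.linspace(0, 1, alphabet_size + 1)[1:-1]
            breakpoints = np.quantile(train_values, qs)
        else:
            # Standard SAX breakpoints, equiprobable only under N(0, 1).
            breakpoints = np.array([-0.67, 0.0, 0.67])
        return np.digitize(paa_values, breakpoints)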
When I last looked into time series clustering, it seemed like other statistical techniques worked better for predicting failure modes. I still think there is merit in motif recognition and rules built around that concept, but I haven't seen very much of interest outside of EKGs.
I think one way to summarize it is: by constructing feature vector components that are parochially based upon the mere position within some data, you introduce correlation and causality effects between the components of your feature vector.
For example, think about using regression instead of clustering. Your design matrix is still going to be S from the paper, where the p-th row of S is the p-th "iteration" of the window (while it is being slid from start to finish of the data).
So column 1 of the p=1 window will trivially be perfectly correlated with column 2 of the p=0 window and column 0 of the p=2 window (they are literally the same data point), at least for a number of rows until that data point has slid out of the window. So in some sense, past observations (earlier columns in S) will "cause" later observations (later columns in S) due to the sliding mechanism.
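You can see the diagonal structure directly (a tiny sketch of my own):

    import numpy as np

    t = np.arange(10.0)
    S = np.lib.stride_tricks.sliding_window_view(t, 4)   # rows = windows
    # The same t_i appears along an anti-diagonal of S: column 1 of the
    # p=1 row equals column 2 of the p=0 row and column 0 of the p=2 row.
    assert S[1, 1] == S[0, 2] == S[2, 0]

So the columns of the design matrix aren't just correlated; they share identical entries along diagonals.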
In regression this is a well-explored problem with multi-collinearity and also with conditional effect sizes when causal relationships are not separated into separate equations or corrected by adding dummy variables for conditional levels of one covariate given others.
It's not at all surprising that something similar would happen when clustering.
Sadly, it's also not that surprising that the entire research community doesn't care too much and is happy to crank out papers with this issue anyway. It's similar to data mining for statistical significance or other problems, and indeed machine learning is not at all a silver bullet to get around those problems.
One approach that might help reduce the problem is to randomly sub-sample windows. It would be a lot of work, and probably computationally costly, but you could in principle devise a sampling scheme where you jointly sample all of the sub-windows at once, in such a way as to ensure certain bounds on the total amount of overlap between them, probably using some sort of Metropolis-like scheme. It's not clear to me if this is worthwhile or not, and I tend to prefer the idea of quantizing the signal in some way to make it a lexicographic problem instead.
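As a much cruder stand-in for that joint Metropolis-style scheme, even naive rejection sampling with an overlap cap illustrates the idea (everything here is my own sketch):

    import numpy as np

    def sample_windows(m, w, n_samples, max_overlap, seed=0, max_tries=10000):
        # Draw window start positions at random, keeping a candidate only
        # if it overlaps every already-kept window by at most max_overlap
        # points. Overlap of windows starting at s and s2 is
        # max(0, w - |s - s2|).
        rng = np.random.default_rng(seed)
        starts = []
        for _ in range(max_tries):
            if len(starts) == n_samples:
                break
            s = int(rng.integers(0, m - w + 1))
            if all(w - abs(s - s2) <= max_overlap for s2 in starts):
                starts.append(s)
        return sorted(starts)

This only bounds pairwise overlap rather than total overlap, which is exactly why a joint scheme would be more involved (and more expensive).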
Also note that the problem is a lot less worrisome in supervised learning, like training a decision tree or SVM on overlapping sub-windows. The reason is that you (presumably) have control over the labels. So you can ensure that two highly correlated sub-samples (like window p and window p+1) get the same label most of the time when it matters (e.g. when you're right on top of the interesting phenomenon you want to classify). Crucially, the reason this helps is that you can ensure not only that two sample windows which differ by a slight amount of sliding get the same label, but also that those windows get the same label as other instances of the same phenomenon elsewhere in the signal, perhaps far away in some other sub-window nowhere near it. It means you are clamping the positive training samples to reflect the signal properties you want, rather than hoping the clustering will intrinsically pick that as the notion of similarity it uses when classifying samples (which, as the paper points out, you have no reason to expect it will).
But it will still suffer somewhat from subjective choices about when, exactly, the window has slid far enough away from the labeled phenomenon to have "rolled off" enough that the label should be the other class (in a binary case). What's bad about this is that however this gets codified into the training set needs to be taken into account as part of the experimental design and accounted for when reporting anything like statistical significance results later on. But because this choice, which is effectively a "roll off" parameter, is not written down or made an explicit part of the model anywhere, most people just ignore it, or don't think about the way it might affect the result.