SAX-VSM, constant time series error #106
Your intuition is somewhat right. The SAX-VSM algorithm works like this:
1. Extract subsequences from each time series using a sliding window.
2. Standardize each subsequence (zero mean, unit variance).
3. Transform each standardized subsequence into a word (PAA, then SAX).
4. Apply numerosity reduction, build a bag of words for each class, and compute the TF-IDF matrix.
5. Classify a new time series by the highest cosine similarity between its word histogram and the TF-IDF vector of each class.
The issue comes from step 2: when a subsequence is constant, not only can it not be standardized (zero variance), but the choice of the symbol would also be subjective. For instance, for a constant subsequence of length 4, any symbol could arguably be assigned to it. So, basically, as long as there is a constant run in the time series that is at least as long as the window size, this error is raised.
Thank you for the reply! It's a very interesting point you make about the constant subsequences; I hadn't noticed it mentioned in the related works for the method. Do you perhaps know how the issue is dealt with in the saxpy package made by one of the developers of the SAX-VSM method? When I just run a SAX transformation from that package I do get words like 'aaa' and 'ccc', but are you saying that they are the same and/or arbitrarily chosen? So these constant patterns could never be important features that discriminate between classes? Additionally, I'm confused about how exactly step 2 works. If we standardize each subsequence locally, then how can we say that 'abc' in one subsequence means the same as 'abc' in another one? I thought that one word corresponds to one pattern and that's why they are treated as the same feature, but in reality the raw numbers behind the 'abc' are different depending on the subsequence? I'm working with the SAX method to see if I can extract interpretable patterns from my time series. I thought that if I discover that 'abc' is an important feature for classification, I would then be able to go back to the original data and see what this 'abc' means in numbers; is this not the case?
The original paper states that, if the standard deviation of a subsequence is too low, the standardization step is skipped (section IV.1). So in this case, I guess that they are using the raw values of the subsequences, and you can get different words depending on the constant value of the subsequences (e.g., [-1, -1, ..., -1] would be transformed into "aaa" while [1, 1, ..., 1] would be transformed into "ccc"). They also assume that the input time series is standardized (zero mean, unit variance).
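As a rough illustration of that behaviour, here is a minimal NumPy/SciPy sketch that skips the PAA step and uses the usual standard-normal breakpoints (the helper name is only for illustration):

```python
import numpy as np
from scipy.stats import norm

# Breakpoints for 3 symbols, taken from the standard normal distribution
bin_edges = norm.ppf([1 / 3, 2 / 3])  # roughly [-0.43, 0.43]
alphabet = np.array(['a', 'b', 'c'])

def to_word(subsequence):
    # No standardization here, as for (almost) constant subsequences in the paper
    return ''.join(alphabet[np.digitize(subsequence, bin_edges)])

print(to_word(np.full(3, -1.)))  # 'aaa'
print(to_word(np.full(3, 1.)))   # 'ccc'
```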
Could you give me the code you're using with saxpy? It would help me dig into their code to see what is implemented.
The assumption behind the standardization of each subsequence is that the local mean and variance do not matter. You can get a pattern by transforming a word back into a vector (although you lose some information, because all the values falling in the same bin are transformed into the same letter), but this pattern is independent of its mean and its variance. If you really want a single pattern, you should have a look at shapelet-based algorithms. A shapelet is simply a real-valued vector, and the distance between a time series and a shapelet is defined as the minimum squared Euclidean distance between the shapelet and all the subsequences of the same length extracted from the time series.
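A minimal sketch of that distance, assuming a plain NumPy implementation (the function name is only illustrative):

```python
import numpy as np

def shapelet_distance(x, shapelet):
    """Minimum squared Euclidean distance between a shapelet and all the
    subsequences of the time series with the same length."""
    # Requires NumPy >= 1.20 for sliding_window_view
    windows = np.lib.stride_tricks.sliding_window_view(x, len(shapelet))
    return np.min(np.sum((windows - shapelet) ** 2, axis=1))

x = np.sin(np.linspace(0, 10, 100))
shapelet = np.array([0., 0.5, 1., 0.5, 0.])
print(shapelet_distance(x, shapelet))
```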
I really appreciate your informative answer! I'm a Master's student doing a project with the SAX transformation, so I'm just learning about this method. Indeed, I'd missed the bit about not applying the normalization, and it makes sense in light of the points you've already mentioned. But in the pyts implementation of SAX-VSM, this step is not happening? As in, if the normalization can't be applied, the error is thrown instead of the normalization being skipped? I also had a light-bulb moment and realized that saxpy (https://github.com/seninp/saxpy) allows the user to specify this threshold
Sorry, I'm not entirely sure I understand. What do you mean by transforming a SAX word into a vector? Since the letters are essentially bins of the normal distribution, they correspond to a range of values, not a single value?
That's an oversight on my part; I also did not pay attention to this when I wrote the implementation, and I want to fix it! The normalization step could definitely be skipped for the subsequences that have a standard deviation below a threshold. I looked at the code in saxpy for the z-normalization and it seems that they still center the subsequence even if they don't divide it by its standard deviation. So I don't get (for the moment) how you could get words like "aaa" or "ccc" if the subsequences are centered: the mean of a centered subsequence is zero, and its standard deviation is low, so all the values must be close to zero, and you should get a word with only the "middle" symbol ("bbb" if you use 3 symbols). And it does not really match what is written in the original paper. By sharing the code, I meant which function in saxpy you used.
Yes, they correspond to a range of values, and an approximation would simply consist in taking the center of the interval (or any real number in the interval). In the end, if you really want to find a real-valued pattern, you have to make some kind of approximation to go back to real numbers from symbols/bins.
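For instance, a rough way to go back from symbols to numbers is to use the quantile at the middle of each bin as its representative value (a sketch, assuming the standard-normal bins):

```python
import numpy as np
from scipy.stats import norm

n_bins = 3
alphabet = 'abc'
# Representative value of each bin: the quantile at the middle of the bin
centers = norm.ppf((np.arange(n_bins) + 0.5) / n_bins)  # roughly [-0.97, 0., 0.97]

word = 'abc'
approx_pattern = np.array([centers[alphabet.index(symbol)] for symbol in word])
print(approx_pattern)  # an increasing pattern: roughly [-0.97, 0., 0.97]
```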
So I did some research and the latest release of saxpy on PyPI is from March 11, 2018, which corresponds to the following source code. In this version, if the standard deviation of the subsequence is below a given threshold, the subsequence is neither centered nor scaled. That's why you can get words like "aaa", "bbb" and "ccc" for (almost) constant subsequences, the symbol depending on the values of the subsequence. With the current version (master branch), the subsequences are always centered, and I only get "bbb" words. Since the original paper introducing SAX-VSM was published in 2013 and the description in the paper matches the latest release on PyPI, I think that I will implement SAX-VSM this way (and possibly add a parameter to allow for centering subsequences with a standard deviation below the threshold).
Good catch! The time series I'm using is rather long to share, and when I try to make it smaller/downsample it I don't get the same patterns anymore (unsurprisingly), so it's hard to share reproducible code. Yet I observed that I can get multiple constant SAX words if I have NaNs in my time series. I'm not quite sure how saxpy deals with them. If I drop them, I do seem to get no more than one constant SAX word. I'm using SAX for time series classification. I'm working with multivariate time series, and inspired by you I also took a closer look at the implementation.
SAX-VSM applies numerosity reduction, a technique that only keeps a single instance of identical consecutive words.
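A toy illustration of numerosity reduction (not the exact pyts internals):

```python
# Words obtained from consecutive sliding windows
words = ['abc', 'abc', 'abc', 'acb', 'acb', 'abc']

# Keep a word only if it differs from the previous one
reduced = [word for i, word in enumerate(words) if i == 0 or word != words[i - 1]]
print(reduced)  # ['abc', 'acb', 'abc']
```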
SAX-VSM is not well known for its high predictive performance. Its upside is mostly scalability in terms of the number of training time series and the number of classes. The training phase is relatively simple (computing the TF-IDF matrix), and the inference phase is also simple (highest cosine similarity). Moreover, to make IDF useful, you need a lot of classes: for 2 classes, the document frequency can only take 2 values (1 or 2). So I'm not very surprised that it does not perform well. I don't know if you have some requirements about the classifier that you want to use (interpretability, runtime complexity, etc.), but there are probably other relevant classifiers for your use case that may be worth trying out.
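To see the effect with 2 classes, here is the smoothed IDF formula used by scikit-learn's TfidfVectorizer (used here only as an illustration), with one document per class:

```python
import numpy as np

n_documents = 2  # one document per class
for df in (1, 2):  # a word appears in either one or both classes
    idf = np.log((1 + n_documents) / (1 + df)) + 1  # smooth_idf formula
    print(df, round(idf, 3))
# df=1 -> 1.405, df=2 -> 1.0: the IDF weight can only take two values
```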
Looking at the examples in the docstring of the relevant class should help. Multivariate time series are unfortunately poorly supported in pyts, mainly because the algorithms, as described in their original papers, almost always work on univariate time series only. Nonetheless, if you want to extend a transformer or a classifier in pyts to multivariate data in a simple way, there are two convenient classes: MultivariateTransformer and MultivariateClassifier.
So, if you would like to use SAX-VSM for multivariate time series in pyts, you would have two options:
I will try to work on fixing the standardization issue in my implementation of SAX-VSM ASAP.
Your help is very much appreciated! For my problem, interpretability is key, hence SAX was a method I considered, as I could then look at the patterns that are the most informative for classification and locate/visualize them in the original time series. Also, from domain knowledge I know that features from each individual time series won't be enough for class attribution, but rather a combination of them. The underlying problem is understanding why these series belong to one class or the other. Also, as you mention, many methods are developed for the univariate case; this huge bag-of-words approach seemed to be a relatively straightforward way to circumvent that.
There's usually a trade-off between predictive performance and interpretability: more complex models might perform better at the cost of being harder to interpret. Also, I would rather interpret (with much caution) a complex model that performs well than a simple model that performs mediocrely. Obviously this is your project, but if I were you, I would first focus on the predictive performance in order to have a rough estimation of how well you can predict the target of interest (class) given the input data (multivariate time series), without paying attention to the interpretability of the function mapping the input to the target. When this is done, I would compare the predictive performance of the different models and decide which model to pick (based on their predictive performance and interpretability). Finally, I would focus on the interpretability of this model. Concerning the multivariate point, a simple approach, if you use an algorithm that extracts information followed by a standard machine learning classifier, is to concatenate the information independently extracted for each feature and let the classifier deal with all the information at once. Among the implemented algorithms that extract information, each variable (feature in machine learning terms) is relatively easy to interpret. Based on this, a general approach to evaluate the impact of a feature is permutation feature importance, which simply consists in shuffling a feature and seeing the difference in predictive performance (a short sketch is given below; you need to take care of correlated features, as mentioned at the bottom of the page). The issue with constant subsequences in SAX-VSM should be fixed soon (I hope that I can merge #108 today). Edit: the PR is merged and the fix is available on the main branch.
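A hedged sketch of permutation feature importance with scikit-learn (the data and estimator below are placeholders; in practice X would be the concatenated extracted features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the extracted features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean.argsort()[::-1][:5])  # most important features first
```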
Thanks for the fix! So now indeed the SAX-VSM doesn't throw that error anymore. However, when I tried to use the approach you suggested:
I still got the same error about the constant samples. This is my code, maybe I missed something?

from pyts.multivariate.transformation import MultivariateTransformer
from pyts.bag_of_words import BagOfWords

transformer = MultivariateTransformer(BagOfWords(window_size=20, word_size=3, n_bins=3), flatten=True)
X_new = transformer.fit_transform(X)

151 def _check_constant(self, X):
152     if np.any(np.max(X, axis=1) - np.min(X, axis=1) == 0):
153         raise ValueError("At least one sample is constant.")
154
155 def _compute_bins(self, X, n_samples, n_bins, strategy):

ValueError: At least one sample is constant.

Alternatively, maybe it is possible to provide the bag-of-words representation to SAX-VSM directly, so that it only performs the classification of already transformed signals?

EDIT. I also tried with the threshold parameter, no change.

transformer = MultivariateTransformer(BagOfWords(window_size=20, word_size=3, n_bins=3, threshold_std=0.05), flatten=True)
Did you reinstall the package with the fix? With something like:
If you can use the Lines 306 to 334 in 77d9635
The following code works for me, could you try it?

from pyts.datasets import load_basic_motions
from pyts.multivariate.transformation import MultivariateTransformer
from pyts.bag_of_words import BagOfWords
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)
# Make a time series constant
X_train[0] = 0.
# Transform the dataset of multivariate time series
transformer = MultivariateTransformer(
BagOfWords(window_size=20, word_size=3, n_bins=3), flatten=True
)
X_train_bow = transformer.fit_transform(X_train)
X_test_bow = transformer.transform(X_test)
Yes, I installed it as you outlined; my current version is
What is the shape of your input data X?
In my code snippet,
My X.shape was EDIT. I provide you the first 2 samples, so the data would be (2, 7, 51), maybe you manage to reproduce the error? First two samplesdata = np.array([[[ 4.83199997e+01, 4.81500015e+01, 4.80499992e+01, 4.81699944e+01, 4.79900017e+01, 4.77200050e+01, 4.77500000e+01, 4.76300011e+01, 4.74599991e+01, 4.74750023e+01, 4.74500008e+01, 4.73199997e+01, 4.73600006e+01, 4.72200012e+01, 4.70099983e+01, 4.68899994e+01, 4.69033356e+01, 4.68800011e+01, 4.69000015e+01, 4.69750023e+01, 4.68300056e+01, 4.67900009e+01, 4.67599983e+01, 4.66800003e+01, 4.67700005e+01, 4.70200005e+01, 4.65900002e+01, 4.66599998e+01, 4.66399994e+01, 4.67299995e+01, 4.68000031e+01, 4.67500000e+01, 4.68600006e+01, 4.67266655e+01, 4.65000000e+01, 4.65800018e+01, 4.65499992e+01, 4.67500000e+01, 4.66899986e+01, 4.66699982e+01, 4.68499985e+01, 4.67400017e+01, 4.67599983e+01, 4.68499985e+01, 4.66749992e+01, 4.66699982e+01, 4.67000008e+01, 4.68499985e+01, 4.68199997e+01, 4.69099998e+01, 4.69599991e+01], [-3.50000000e+00, -3.50000000e+00, -3.50000000e+00, -3.50000000e+00, -3.50000000e+00, -3.40000010e+00, -3.40000010e+00, -2.59999990e+00, -2.50000000e+00, -2.29999995e+00, -2.20000005e+00, -1.89999998e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -1.79999995e+00, -2.59999990e+00, -3.20000005e+00, -3.20000005e+00, -3.20000005e+00, -3.09999990e+00, -3.00000000e+00, -3.00000000e+00, -2.70000005e+00, -2.70000005e+00, -2.50000000e+00, -2.50000000e+00, -2.29999995e+00, -2.69999599e+00, -2.94998217e+00, -2.20000005e+00, -2.20000005e+00, -2.29999995e+00, -2.09999990e+00, -1.89999998e+00, -1.79999995e+00, -1.89999998e+00, -1.89999998e+00, -1.79999995e+00, -1.89999998e+00, -1.79999995e+00, -1.89999998e+00, -1.89999998e+00, -1.89999998e+00, -2.00000000e+00, -2.20000005e+00], [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00], [-3.76166612e-01, -3.10367614e-01, -4.65889990e-01, -2.14665994e-01, -2.50550002e-01, -3.04383397e-01, -1.24935642e-01, -3.94109130e-01, -1.78770006e-01, -7.11008534e-02, -2.86440015e-01, -1.78771287e-01, -3.52108553e-02, -3.22329998e-01, -1.06991708e-01, -1.42879993e-01, -1.30915329e-01, -7.10987151e-02, -1.42879993e-01, -8.90428573e-02, -7.10978583e-02, 3.65700014e-02, 7.24597871e-02, -2.50544876e-01, -7.10999966e-02, 1.86276734e-02, -1.78769365e-01, 7.24600032e-02, 6.80641737e-04, 7.24600032e-02, 6.78930432e-04, 3.65667902e-02, -3.52097861e-02, -1.30916759e-01, 1.44236580e-01, 1.08350001e-01, 1.08356416e-01, 1.44238934e-01, 1.26294672e-01, -3.52091454e-02, 7.24610686e-02, 7.24600032e-02, 1.08350001e-01, 1.80130005e-01, 7.24600032e-02, 7.24600032e-02, 3.65700014e-02, 7.24600032e-02, 7.24600032e-02, 3.65700014e-02, 1.44240007e-01], [ 3.23685735e-01, 5.86882949e-01, 
3.59580010e-01, 3.95464867e-01, 3.65700014e-02, 3.95466805e-01, 2.87805140e-01, 5.39029121e-01, 6.10809982e-01, 3.05747986e-01, 2.16020003e-01, 1.80128068e-01, 1.80129781e-01, 2.51910001e-01, 1.98077142e-01, 2.87800014e-01, 3.23690563e-01, 3.95470440e-01, 5.03139973e-01, 5.03140867e-01, 3.05747688e-01, 3.59580010e-01, 2.16020644e-01, 3.65725681e-02, 2.51910001e-01, 2.16014653e-01, -1.78769365e-01, 2.16020003e-01, 3.41633409e-01, 2.87800014e-01, 2.87802130e-01, 3.95470649e-01, 1.44239575e-01, 1.68166474e-01, 1.44243419e-01, 3.59580010e-01, 3.59577447e-01, 2.51911074e-01, 2.51911938e-01, 2.69856274e-01, 4.13418740e-01, 3.23689997e-01, 3.59579563e-01, 3.23689997e-01, 2.16023430e-01, 2.16020003e-01, 6.79999997e-04, 1.44240007e-01, 2.16020003e-01, 3.59580010e-01, 4.31360006e-01], [ 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02, 2.52000000e+02], [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],
Could you copy-paste the full traceback? I'm pretty confident that the error is raised here: Line 350 in 77d9635
but just to be sure. Could you try applying the transformation to each feature separately?
I just tried to apply it to each feature separately and the error still occurs. This is the original error message from using the MultivariateTransformer:

Full error message
If the error occurs on each feature, let's focus on a single feature (univariate case). If you look at the source code, you can see that:
A possible explanation would be that, after the first two steps (StandardScaler + PAA), a subsequence whose standard deviation is above the threshold has been transformed into a constant subsequence. However, if a subsequence has a positive standard deviation, it is not constant, and even after the first two steps (StandardScaler + PAA), a non-constant subsequence should not be transformed into a constant subsequence. Here is a code snippet. Could you try it with a dataset of univariate time series (any feature that raises the error)?

import numpy as np
from pyts.approximation import PiecewiseAggregateApproximation, SymbolicAggregateApproximation
from pyts.bag_of_words import BagOfWords
from pyts.datasets import load_gunpoint
from pyts.preprocessing import StandardScaler
from pyts.utils import windowed_view
# Toy dataset
X, _, _, _ = load_gunpoint(return_X_y=True)
# Parameters
window_size= 20
word_size = 3
n_bins = 3
window_step = 1
threshold_std = 0.01
# Compute the number of windows
n_samples, n_timestamps = X.shape
n_windows = (n_timestamps - window_size + window_step) // window_step
# Standardize each time series independently
X_scaled = StandardScaler().transform(X)
# Extract all the subsequences
X_window = windowed_view(X_scaled, window_size).reshape(n_samples * n_windows, window_size)
# Identify subsequences whose standard deviation is below the threshold
idx = np.std(X_window, axis=1) < threshold_std
# Standardize each subsequence
X_window_above_thresh_scaled = StandardScaler().transform(X_window[~idx])
# Apply PAA to each standardized subsequence
X_window_above_thresh_scaled_paa = PiecewiseAggregateApproximation(
window_size=None, output_size=word_size
).transform(X_window_above_thresh_scaled)
# Check if a PAA standardized subsequence is constant
if np.any(np.max(X_window_above_thresh_scaled_paa, axis=1) - np.min(X_window_above_thresh_scaled_paa, axis=1) == 0):
raise ValueError("At least one subsequence is constant.")
# Apply SAX
X_window_above_thresh_scaled_paa_sax = SymbolicAggregateApproximation(
n_bins=n_bins
).transform(X_window_above_thresh_scaled_paa)
For 3 of my features I get the constant sample error and for 2 others I get this error:
Could you help me interpret the error? If the quantiles are equal, doesn't that mean it's constant?
Which strategy are you using? This warning is raised when you compute the bin edges using the quantiles of the time series and back-to-back quantiles are equal. Imagine that you use 4 bins, so the 5 bin edges are q_0, q_25, q_50, q_75 and q_100, where q_N is the N-th percentile of the time series. Now imagine that the time series takes a given value many times. This means that you will have two back-to-back bin edges that are equal, and thus no value will be binned into this bin, which means that a symbol would never occur in the word (for instance no "a" or no "b"). In particular, this issue is likely to occur if you have long subsequences in which the time series is constant.
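A toy illustration of how repeated values lead to equal back-to-back quantiles:

```python
import numpy as np

# A subsequence in which one value occurs many times
x = np.array([0., 0., 0., 0., 0., 0., 0., 1., 2., 3.])

# Bin edges for 4 bins using the quantiles of the subsequence
edges = np.percentile(x, [0, 25, 50, 75, 100])
print(edges)  # [0. 0. 0. 0.75 3.] -> three equal edges, so two of the four bins are empty
```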
I just copy-pasted your code and provided my data as X. Since I don't see that parameter explicitly stated, it should take the default value. And in my previous experiments I have not used this parameter either. Do you get the error if you try with the 2-sample array I provided?
But why is this a problem? As in, all the letters (bins) don't need to appear in one window, right? So you could have something like 'ccb' for a word length of 3 and an alphabet size of 3, for example. Or is this just a message to let the user know? This maybe leads into the additional question I posed at the beginning: why is having, say, an alphabet size of 5 and a word length of 3 not all right?
My bad, I didn't understand that you ran the code that I provided; I thought that you were still referring to your code. In that snippet, SAX uses the quantiles of the time series to compute the bin edges, hence the warning. Could you identify which subsequences are constant? I'm still surprised that you can get constant subsequences from non-constant subsequences with the StandardScaler + PAA transformation. With something like this:

# Standardize each subsequence
X_window_scaled = StandardScaler().transform(X_window)
# Apply PAA to each standardized subsequence
X_window_scaled_paa = PiecewiseAggregateApproximation(
window_size=None, output_size=word_size
).transform(X_window_scaled)
# Identify constant scaled PAA subsequences
idx_constant = np.where(X_window_scaled_paa.std(axis=1) < 1e-30)[0]
print(idx_constant)
print(X_window[idx_constant].std(axis=1))
I don't get the error. Do you get it?
Usually you don't want to have more bins than the number of unique values, because you would have some bins that are not used. The objective of binning is to summarize information in a smaller space (you don't look at the exact value, but only at the range in which the value falls). If you have more bins than the number of values, you may amplify the noise in the signal. Finally, the number of possible words is n_bins ** word_size. It was shown in a study (I think the one introducing SAX) that the histogram of a standardized time series is similar to the standard normal distribution, which means that, if you use the quantiles of the standard normal distribution, you should have (in expectation) roughly the same number of values falling inside each bin, meaning that, in a word, each symbol should have roughly the same frequency. This is why the quantiles of the standard normal distribution are usually used in SAX. So you actually expect to have the same number of symbols in each word. What you are looking at is the ordering of the symbols.
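For reference, a small sketch of how the number of possible words grows with the alphabet size and the word size:

```python
from itertools import product

n_bins, word_size = 5, 3
alphabet = [chr(97 + i) for i in range(n_bins)]  # ['a', 'b', 'c', 'd', 'e']
vocabulary = [''.join(word) for word in product(alphabet, repeat=word_size)]
print(len(vocabulary))  # 125 == n_bins ** word_size
```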
Alright, reproducible code! (Hopefully.)

import numpy as np
from pyts.bag_of_words import BagOfWords
bow = BagOfWords(window_size=20, word_size=3, n_bins=3, threshold_std=0.01)
X = np.array([[-2. , -2.29999995, -2.29999995, -2.4000001 , -2.5 ,
-2.5 , -2.5 , -2.5999999 , -2.5 , -2.5 ,
-2.5 , -2.5 , -2.5 , -2.5 , -2.5 ,
-2.5 , -2.5 , -2.5 , -2.5 , -2.5 ,
-2.5 , -2.5 , -2.5 , -2.5 , -2.5 ,
-2.5 , -2.5 , -2.5 , -2.5 , -2.5 ,
-2.5 , -2.5 , -2.5 , -2.5 , -2.5 ,
-2.5 , -2.5 , -2.5 , -2.5 , -2.5 ,
-2.5 , -2.5 , -2.5 , -2.5 , -2.5 ,
-2.5 , -2.79999995, -2.9000001 , -3.29999995, -3.4000001 ,
-3.9000001 ]])
bow.fit_transform(X)

This gives me the constant time series error.

Thank you for the second point about the bins. I understand that some bins might not be used if the alphabet size is greater than the word length, but my thinking was that this could indicate a steeper change in the signal and be an interesting pattern. Is this wrong? In one of the SAX papers (I don't recall which one off the top of my head) they seemed to recommend an alphabet size of 4-6, saying that it's alright for most applications, but then by what you said it implies that the word size has to be at least that as well? Additionally, the saxpy implementation allows the alphabet size to be larger than the word length. I had tried 5 letters and a word length of 3 for my classification problem. So I then got words like
Thank you very much for the reproducible code!
My intuition was right but my conclusion was wrong: a non-constant subsequence can be transformed into a constant subsequence after applying PAA:

# Before PAA
[ 3.16227766, 0. , 0. , 0. , -3.16227766,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ]
# After PAA with word_size=3
[ 0. , 0. , 0. ]

I should probably get rid of this constant sample check because it's a bit of an edge case that leads to too many issues...
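This can be reproduced directly (a small sketch, assuming the same PAA parameters as in the snippets above):

```python
import numpy as np
from pyts.approximation import PiecewiseAggregateApproximation

x = np.zeros((1, 20))
x[0, 0], x[0, 4] = 3.16227766, -3.16227766  # the subsequence shown above

paa = PiecewiseAggregateApproximation(window_size=None, output_size=3)
print(paa.transform(x))  # [[0. 0. 0.]] -> constant output from a non-constant input
```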
I see what you mean. Since the standardization is performed on the whole subsequence (before PAA and not after), having a number of bins larger than the word size is not an issue (but I guess it's better to have a number of bins not larger than the window size, so that each bin can have at least one value from the subsequence). Applying PAA indicates in which bin the mean value of each segment falls, which could indicate, for instance, whether there's a plateau around the minimum. I will try to fix this tomorrow (which should consist in applying the same fix that I added to SAXVSM).
The fix (#110) has been merged into the main branch.
Thanks @johannfaouzi - I came across this error using 0.11 and installed from main (ff746ef) and it's working for me!
Yes, thank you, this seems to work for me as well :) What would then be the syntax to use the multivariate transformer with SAX-VSM?

from pyts.multivariate.transformation import MultivariateTransformer
from pyts.bag_of_words import BagOfWords
from pyts.classification import SAXVSM
from sklearn.pipeline import make_pipeline
bow = BagOfWords(window_size=20, word_size=3, n_bins=5)
transformer = MultivariateTransformer(bow, flatten=True)
sax_vsm = SAXVSM(window_size=20, word_size=3, n_bins=5, strategy='normal')
clf = make_pipeline(transformer, sax_vsm)
clf.fit(X, y)

I also noticed that just this:

transformer = MultivariateTransformer(BagOfWords(window_size=20, word_size=3, n_bins=5), flatten=True)
new = transformer.fit_transform(X)

gives me slightly different results compared to using sax_via_window from saxpy with the same parameters:

saxw = sax_via_window(signal, win_size=20, paa_size=3, alphabet_size=5,
                      nr_strategy='exact', z_threshold=0.01)

Namely, from
SAXVSM is implemented as a classifier and works on raw time series, so unfortunately, if you want to perform the remaining steps yourself, you would have to do it with your own code (feel free to have a look at the source code of SAXVSM). How do you plan to work on the multivariate case (not only for TF but also IDF)? Do you consider different features as different documents? I don't mind helping you a bit with the remaining code, but I would need more information on how you want to handle the multivariate case. Since SAXVSM is a classifier, you can also use it directly with MultivariateClassifier:

from pyts.multivariate.classification import MultivariateClassifier
from pyts.classification import SAXVSM
from pyts.datasets import load_basic_motions
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)
clf = MultivariateClassifier(
SAXVSM(window_size=20, word_size=6, n_bins=6, strategy='normal'),
)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
I tested it out and also noticed this, but I didn't notice it when I set a very low value for the threshold parameter.
I don't know how you compute the word counts and the TF-IDF matrix but I would suggest using the tools made available in scikit-learn: sklearn.feature_extraction.text.TfidfVectorizer and sklearn.feature_extraction.text.CountVectorizer. SAX-VSM considers that each class is a document, not that each sample is a document:
So you need to merge the bag of words for all the samples that belong to the same class (or equivalently sum the histograms, i.e. rows, of your matrix for all the samples that belong to the same class). The first approach is currently used in pyts: pyts/pyts/classification/saxvsm.py Lines 178 to 179 in ff746ef
Here is an example of an implementation of the algorithm, matching your description, to handle the multivariate case:

import numpy as np
import pandas as pd
from pyts.bag_of_words import BagOfWords
from pyts.datasets import load_basic_motions
from sklearn.base import clone
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Toy dataset
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)
###################
# T R A I N I N G #
###################
le = LabelEncoder()
y_train_ind = le.fit_transform(y_train)
n_features = X_train.shape[1]
bow_list = [clone(BagOfWords(window_size=20, word_size=4, n_bins=4, strategy='normal'))
for _ in range(n_features)]
# Create an empty list that will be transformed into a DataFrame with the TF-IDF matrix
df_tfidf = []
# Create a list to save all the instances of TfidfVectorizer
tfidf_list = []
for i, bow in enumerate(bow_list):
# Get the BoW transformation for this feature
X_bow = bow.transform(X_train[:, i])
# Group bags of words for samples belonging to the same class
X_class = [' '.join(X_bow[y_train_ind == classe])
for classe in range(le.classes_.size)]
# Create an instance of TfidfVectorizer
tfidf = TfidfVectorizer(
norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True
)
# Get TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(X_class).A
tfidf_list.append(tfidf)
# Get dictionary for this feature
dictionary = {value: key for key, value in tfidf.vocabulary_.items()}
dictionary = [word + f'_{i + 1}'
for word in np.vectorize(dictionary.get)(np.arange(tfidf_matrix.shape[1]))]
df_tfidf.append(pd.DataFrame(data=tfidf_matrix, columns=dictionary, index=le.classes_))
# Concatenate each TF-IDF matrix
df_tfidf = pd.concat(df_tfidf, axis=1)
#####################
# I N F E R E N C E #
#####################
# Create an empty list that will be transformed into a DataFrame with the
# histograms for all the test time series
df_test = []
for i, (bow, tfidf) in enumerate(zip(bow_list, tfidf_list)):
# Get the BoW transformation for this feature
X_bow = bow.transform(X_test[:, i])
# Create an instance of CountVectorizer
vectorizer = CountVectorizer(vocabulary=tfidf.vocabulary_)
# Get the histogram (word count) for each test time series
X_histogram = vectorizer.transform(X_bow).A
df_test.append(pd.DataFrame(data=X_histogram))
# Concatenate each histogram
df_test = pd.concat(df_test, axis=1)
df_test.columns = df_tfidf.columns
# Compute cosine similarity scores
cosine_similarity_scores = cosine_similarity(df_test, df_tfidf)
# Return the class with the highest cosine similarity
y_pred = le.classes_[cosine_similarity_scores.argmax(axis=1)]
# Compute the accuracy score
accuracy_score(y_test, y_pred)
Thanks again for your help!
Ahh, yes, thank you for this! I was summing up the frequencies and was wondering if that's all right. And if I want to have just one huge BOW for each sample representing the TF-IDF score, would something like this work?

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from pyts.multivariate.transformation import MultivariateTransformer
from pyts.bag_of_words import BagOfWords
X_new = transformer.fit_transform(X_train_val)
features_tfidf = []
for i in range(X_new.shape[1]):
tfidf = TfidfVectorizer(
norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True
)
feature = X_new[:, i]
tfidf_matrix = tfidf.fit_transform(feature).A
dictionary = {value: key for key, value in tfidf.vocabulary_.items()}
pattern_names = [word + f'_{i}'
for word in np.vectorize(dictionary.get)(np.arange(tfidf_matrix.shape[1]))]
df_idf = pd.DataFrame(tfidf_matrix, columns=pattern_names)
features_tfidf.append(df_idf)
dfx_idf = pd.concat(features_tfidf, axis=1)

I want to eventually set up cross-validation and grid search for the word length and bin count of the BOW transformation. I know how it can be done relatively easily with scikit-learn ML models. I suspect that it would be trickier with the SAX-VSM implementation you shared. But if I had something like the transformation above, would it be possible to plug it into a scikit-learn pipeline and grid search?
I think I'm close to getting my grid search working; the main issue I'm having is that, when using the RF classifier, each split needs to have the same number of features. But since BOW only returns the patterns it found, and not a summary over all possible words, the pipeline fails when new features appear, as the classifier isn't expecting them. To circumvent this, I tried to infer the word size and number of bins from the output of BOW. Word size is easy to do, and so I thought was the number of bins. My idea was to create a feature for all the possible words to create a consistent input to the classifier and, if they're not found, simply set them to 0. This still didn't work; it was complaining about the number of features that the classifier is getting versus what it is actually expecting. Do you know of any way I could make this kind of pipeline feasible? Code:

from pyts.multivariate.transformation import MultivariateTransformer
from pyts.bag_of_words import BagOfWords
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import itertools
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
def bow_to_matrix(X):
features_tfidf = []
for i in range(X.shape[1]):
tfidf = TfidfVectorizer(
norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True
)
feature = X[:, i]
tfidf_matrix = tfidf.fit_transform(feature).A
dictionary = {value: key for key, value in tfidf.vocabulary_.items()}
pattern_names = [word + f'_{i}'
for word in np.vectorize(dictionary.get)(np.arange(tfidf_matrix.shape[1]))]
df_idf = pd.DataFrame(tfidf_matrix, columns=pattern_names)
## trying to infer number of bins and word size from the transformation
unique_patterns_found = [word for word in np.vectorize(dictionary.get)(np.arange(tfidf_matrix.shape[1]))]
word_size = len(unique_patterns_found[0])
alphabet_size = len(set(''.join(unique_patterns_found)))
alphabet = bytearray(range(97,123)).decode("utf-8")
alphabet = alphabet[:alphabet_size]
alpha = [x for x in alphabet]
bow_all = [p for p in itertools.product(alpha, repeat=word_size)]
bow_all = [''.join(x) for x in bow_all]
patterns_all = [f'{x}_{i}' for x in bow_all]
patterns_not_found = [x for x in patterns_all if x not in pattern_names]
df_idf[patterns_not_found] = 0
df_idf = df_idf[sorted(df_idf.columns)]
features_tfidf.append(df_idf)
dfx_idf = pd.concat(features_tfidf, axis=1)
return dfx_idf
transformer = MultivariateTransformer(BagOfWords(), flatten=True)
tf_idf_transformer = FunctionTransformer(func=bow_to_matrix)
rf = RandomForestClassifier()
steps = [('SAX_transform', transformer), ('tf_idf_transform', tf_idf_transformer), ('rf', rf)]
pipeline = Pipeline(steps)
parameter_grid = [{'SAX_transform__estimator__word_size': [3,4,5,6],
'SAX_transform__estimator__n_bins': [3,4,5,6],
'SAX_transform__estimator__window_size': [12, 20]}]
grid_search = GridSearchCV(pipeline, parameter_grid, cv=5, scoring='recall', verbose=3)
grid_search.fit(X_train_val, y_train_val)
Yeah, there's nothing wrong with this approach.
Yes, it is definitely possible and a great idea. Pipeline tools from scikit-learn are awesome and can prevent a lot of issues when doing "hand-made" cross-validation with a lot of steps. A pipeline is a chain of estimators such that:
When you call
When you call
So you need to create a transformer. I find it simpler to create a new class that incorporates both BagOfWords and TfidfVectorizer.
You can provide the vocabulary to TfidfVectorizer so that the set of features is always the same. Here is an example in which I create a BagOfWordsTfidfTransformer class:

from itertools import product
import numpy as np
from pyts.bag_of_words import BagOfWords
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
class BagOfWordsTfidfTransformer(BaseEstimator, TransformerMixin):
def __init__(self, window_size=20, word_size=4, n_bins=4, strategy='normal',
norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True):
self.window_size = window_size
self.word_size = word_size
self.n_bins = n_bins
self.strategy = strategy
self.norm = norm
self.use_idf = use_idf
self.smooth_idf = smooth_idf
self.sublinear_tf = sublinear_tf
def fit(self, X, y=None):
"""Fit the transformer.
Parameters
----------
X : array, shape = (n_samples, n_features, n_timestamps)
Dataset of multivariate time series
y : array, shape = (n_samples,)
Target values (ignored)
Returns
-------
self : object
"""
le = LabelEncoder()
y_train_ind = le.fit_transform(y_train)
n_features = X.shape[1]
# Create a BagOfWords instance for each feature
bow_list = [
clone(BagOfWords(window_size=self.window_size, word_size=self.word_size,
n_bins=self.n_bins, strategy=self.strategy, alphabet=None))
for _ in range(n_features)
]
# Compute the vocabulary for each feature
alphabet = np.array([chr(i) for i in range(97, 97 + self.n_bins)])
vocabulary = [''.join(word) for word in product(alphabet, repeat=self.word_size)]
# Create a list to save all the instances of TfidfVectorizer
tfidf_list = []
for i, bow in enumerate(bow_list):
# Get the BoW transformation for this feature
X_bow = bow.transform(X[:, i])
# Create an instance of TfidfVectorizer
tfidf = TfidfVectorizer(
norm=self.norm, use_idf=self.use_idf,
smooth_idf=self.smooth_idf, sublinear_tf=self.sublinear_tf,
vocabulary=vocabulary
)
# Fit the instance
tfidf.fit(X_bow)
tfidf_list.append(tfidf)
# Save the instances
self._bow_list = bow_list
self._tfidf_list = tfidf_list
# Save the vocabulary
vocabulary_list = [
[''.join(word) + f'_{i}' for word in vocabulary]
for i in range(n_features)
]
self.vocabulary_ = np.concatenate(vocabulary_list)
return self
def fit_transform(self, X, y=None):
"""Fit the transformer and return the transformation.
Parameters
----------
X : array, shape = (n_samples, n_features, n_timestamps)
Dataset of multivariate time series
y : array, shape = (n_samples,)
Target values (ignored)
Returns
-------
X_new : array, shape = (n_samples, n_words)
TF-IDF matrix.
"""
le = LabelEncoder()
y_train_ind = le.fit_transform(y_train)
n_features = X.shape[1]
# Create a BagOfWords instance for each feature
bow_list = [
clone(BagOfWords(window_size=self.window_size, word_size=self.word_size,
n_bins=self.n_bins, strategy=self.strategy))
for _ in range(n_features)
]
# Compute the vocabulary for each feature
alphabet = np.array([chr(i) for i in range(97, 97 + self.n_bins)])
vocabulary = [''.join(word) for word in product(alphabet, repeat=self.word_size)]
# Create a list to save all the TF-IDF matrices
tfidf_matrix = []
# Create a list to save all the instances of TfidfVectorizer
tfidf_list = []
for i, bow in enumerate(bow_list):
# Get the BoW transformation for this feature
X_bow = bow.transform(X[:, i])
# Create an instance of TfidfVectorizer
tfidf = TfidfVectorizer(
norm=self.norm, use_idf=self.use_idf,
smooth_idf=self.smooth_idf, sublinear_tf=self.sublinear_tf,
vocabulary=vocabulary
)
# Fit the instance and transform the input
tfidf_matrix.append(tfidf.fit_transform(X_bow).A)
tfidf_list.append(tfidf)
# Save the instances
self._bow_list = bow_list
self._tfidf_list = tfidf_list
# Save the vocabulary
vocabulary_list = [
[''.join(word) + f'_{i}' for word in vocabulary]
for i in range(n_features)
]
self.vocabulary_ = np.concatenate(vocabulary_list)
# Return the TF-IDF matrix
return np.concatenate(tfidf_matrix, axis=1)
def transform(self, X, y=None):
"""Return the transformation.
Parameters
----------
X : array, shape = (n_samples, n_features, n_timestamps)
Dataset of multivariate time series
y : array, shape = (n_samples,)
Target values (ignored)
Returns
-------
X_new : array, shape = (n_samples, n_words)
TF-IDF matrix.
"""
# Create a list to save all the TF-IDF matrices
tfidf_matrix = []
for i, (bow, tfidf) in enumerate(zip(self._bow_list, self._tfidf_list)):
# Get the BoW transformation for this feature
X_bow = bow.transform(X[:, i])
# Transform the BoW into a TF-IDF matrix
tfidf_matrix.append(tfidf.transform(X_bow).A)
# Return the TF-IDF matrix
return np.concatenate(tfidf_matrix, axis=1)

This class can be used with scikit-learn tools such as pipelines and grid search with cross-validation:

from pyts.datasets import load_basic_motions
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Toy dataset
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)
# Define instances
transformer = BagOfWordsTfidfTransformer(word_size=6)
rfc = RandomForestClassifier(random_state=42)
# Create the pipeline
pipe = Pipeline(steps=[('bowtfidf', transformer), ('rfc', rfc)])
param_grid = {'bowtfidf__window_size': [16, 20, 24], 'bowtfidf__n_bins': [3, 4, 5],
'rfc__criterion': ['gini', 'entropy']}
# Create the grid search with cross-validation
clf = GridSearchCV(pipe, param_grid, cv=5)
# Fit the instance (grid search with cross-validation on the training set)
clf.fit(X_train, y_train)
# Evaluate the performance on the test set
clf.score(X_test, y_test)
You may add sklearn.feature_selection.VarianceThreshold in the pipeline to remove features (words) that are constant (zero variance), which should decrease the number of features and speed up fitting the classifier. But it does not change the maximum memory usage, because it would only be done after the transformation.
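For instance (a sketch reusing the BagOfWordsTfidfTransformer defined above; the parameter values are only examples):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ('bowtfidf', BagOfWordsTfidfTransformer(word_size=6)),
    ('variance', VarianceThreshold(threshold=0.0)),  # drop constant (zero-variance) words
    ('rfc', RandomForestClassifier(random_state=42)),
])
```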
Actually, the issue only came from the way you were calling it. Here is a version without the fixed vocabulary:

from itertools import product
import numpy as np
from pyts.bag_of_words import BagOfWords
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
class BagOfWordsTfidfTransformer(BaseEstimator, TransformerMixin):
def __init__(self, window_size=20, word_size=4, n_bins=4, strategy='normal',
norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True):
self.window_size = window_size
self.word_size = word_size
self.n_bins = n_bins
self.strategy = strategy
self.norm = norm
self.use_idf = use_idf
self.smooth_idf = smooth_idf
self.sublinear_tf = sublinear_tf
def fit(self, X, y=None):
"""Fit the transformer.
Parameters
----------
X : array, shape = (n_samples, n_features, n_timestamps)
Dataset of multivariate time series
y : array, shape = (n_samples,)
Target values (ignored)
Returns
-------
self : object
"""
le = LabelEncoder()
y_train_ind = le.fit_transform(y_train)
n_features = X.shape[1]
# Create a BagOfWords instance for each feature
bow_list = [
clone(BagOfWords(window_size=self.window_size, word_size=self.word_size,
n_bins=self.n_bins, strategy=self.strategy, alphabet=None))
for _ in range(n_features)
]
# Create a list to save all the instances of TfidfVectorizer
tfidf_list = []
for i, bow in enumerate(bow_list):
# Get the BoW transformation for this feature
X_bow = bow.transform(X[:, i])
# Create an instance of TfidfVectorizer
tfidf = TfidfVectorizer(
norm=self.norm, use_idf=self.use_idf,
smooth_idf=self.smooth_idf, sublinear_tf=self.sublinear_tf,
)
# Fit the instance
tfidf.fit(X_bow)
tfidf_list.append(tfidf)
# Save the instances
self._bow_list = bow_list
self._tfidf_list = tfidf_list
# Save the vocabulary
vocabulary_list = [
[word + f'_{i}' for word in np.vectorize(
{value: key for key, value in tfidf.vocabulary_.items()}.get
)(np.arange(len(tfidf.vocabulary_)))]
for i, tfidf in enumerate(tfidf_list)
]
self.vocabulary_ = np.concatenate(vocabulary_list)
return self
def fit_transform(self, X, y=None):
"""Fit the transformer and return the transformation.
Parameters
----------
X : array, shape = (n_samples, n_features, n_timestamps)
Dataset of multivariate time series
y : array, shape = (n_samples,)
Target values (ignored)
Returns
-------
X_new : array, shape = (n_samples, n_words)
TF-IDF matrix.
"""
le = LabelEncoder()
y_train_ind = le.fit_transform(y_train)
n_features = X.shape[1]
# Create a BagOfWords instance for each feature
bow_list = [
clone(BagOfWords(window_size=self.window_size, word_size=self.word_size,
n_bins=self.n_bins, strategy=self.strategy))
for _ in range(n_features)
]
# Create a list to save all the TF-IDF matrices
tfidf_matrix = []
# Create a list to save all the instances of TfidfVectorizer
tfidf_list = []
for i, bow in enumerate(bow_list):
# Get the BoW transformation for this feature
X_bow = bow.transform(X[:, i])
# Create an instance of TfidfVectorizer
tfidf = TfidfVectorizer(
norm=self.norm, use_idf=self.use_idf,
smooth_idf=self.smooth_idf, sublinear_tf=self.sublinear_tf,
)
# Fit the instance and transform the input
tfidf_matrix.append(tfidf.fit_transform(X_bow).A)
tfidf_list.append(tfidf)
# Save the instances
self._bow_list = bow_list
self._tfidf_list = tfidf_list
# Save the vocabulary
vocabulary_list = [
[word + f'_{i}' for word in np.vectorize(
{value: key for key, value in tfidf.vocabulary_.items()}.get
)(np.arange(len(tfidf.vocabulary_)))]
for i, tfidf in enumerate(tfidf_list)
]
self.vocabulary_ = np.concatenate(vocabulary_list)
# Return the TF-IDF matrix
return np.concatenate(tfidf_matrix, axis=1)
def transform(self, X, y=None):
"""Return the transformation.
Parameters
----------
X : array, shape = (n_samples, n_features, n_timestamps)
Dataset of multivariate time series
y : array, shape = (n_samples,)
Target values (ignored)
Returns
-------
X_new : array, shape = (n_samples, n_words)
TF-IDF matrix.
"""
# Create a list to save all the TF-IDF matrices
tfidf_matrix = []
for i, (bow, tfidf) in enumerate(zip(self._bow_list, self._tfidf_list)):
# Get the BoW transformation for this feature
X_bow = bow.transform(X[:, i])
# Transform the BoW into a TF-IDF matrix
tfidf_matrix.append(tfidf.transform(X_bow).A)
# Return the TF-IDF matrix
return np.concatenate(tfidf_matrix, axis=1)
This time a bit of a delay before my next enormous thank you! :)
EDIT. I managed to call the best model from grid_search.
It's probably due to the huge increase in the number of features, which makes fitting the classifier much longer. Which classifier are you using? Random forest? The algorithmic complexity of the provided transformer itself does not increase that much with the parameters:

from pyts.datasets import load_basic_motions
# Toy dataset
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)
%%timeit
transformer = BagOfWordsTfidfTransformer(n_bins=3, word_size=3)
transformer.fit_transform(X_train, y_train)
# 58.8 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
transformer = BagOfWordsTfidfTransformer(n_bins=6, word_size=5)
transformer.fit_transform(X_train, y_train)
# 85.5 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The second dimension is the number of words that are present in at least one time series. The number of possible words in the univariate case is n_bins ** word_size.
It may depend on the data type of your array. If your background is more focused on mathematics than computer science (as is mine), you need to understand how your data is represented. Float64 is also known as double-precision floating-point format and is the most precise (thus requiring the most memory) of the standard float formats in a NumPy array. At the bottom of this page, you can find a nice summary of the different data types in NumPy. For a given data type, you can find out the size (number of bytes) of a single element using the itemsize attribute:

np.array([0], dtype='int8').itemsize # 1
np.array([0], dtype='uint8').itemsize # 1
np.array([0], dtype='int16').itemsize # 2
np.array([0], dtype='uint16').itemsize # 2
np.array([0], dtype='float16').itemsize # 2
np.array([0], dtype='int32').itemsize # 4
np.array([0], dtype='uint32').itemsize # 4
np.array([0], dtype='float32').itemsize # 4
np.array([0], dtype='int64').itemsize # 8
np.array([0], dtype='uint64').itemsize # 8
np.array([0], dtype='float64').itemsize # 8

Since a byte corresponds to 8 bits, you can see that the number at the end of the data type indicates the number of bits required to save a single element. Back to your error: that being said, there are several solutions to your issue:
Which code exactly are you using? Normally I think that you should get the same results, but it requires a bit of work to reorder the columns (which should not have a big impact on the performance of your classifier, even for bagging models like random forest):

import numpy as np
from pyts.bag_of_words import BagOfWords
from pyts.datasets import load_basic_motions
from pyts.multivariate.transformation import MultivariateTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a function that appends the feature number as a suffix
def append_suffix(x, n_features):
liste = []
for j in range(n_features):
liste.extend([word + f'_{j}' for word in x[j].split(' ')])
return ' '.join(liste)
# Toy dataset
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)
# Transformation using BagOfWordsTfidfTransformer
transformer = BagOfWordsTfidfTransformer(
window_size=20, word_size=4, n_bins=4, strategy='normal',
norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True
)
X_new_train = transformer.fit_transform(X_train, y_train)
X_new_test = transformer.transform(X_test)
# Hand-made transformation
bow = MultivariateTransformer(
BagOfWords(window_size=20, word_size=4, n_bins=4, strategy='normal')
)
tfidf = TfidfVectorizer(norm=None, use_idf=True, smooth_idf=True, sublinear_tf=True)
X_bow_train = bow.fit_transform(X_train)
X_bow_train = [append_suffix(X_bow_train[i], X_train.shape[1]) for i in range(X_train.shape[0])]
X_tfidf_train = tfidf.fit_transform(X_bow_train).A
# Reorder the columns to match the order of the columns of BagOfWordsTfidfTransformer
# First order on the feature (suffix), then on the word without the suffix
vocabulary = np.vectorize(
{value: key for key, value in tfidf.vocabulary_.items()}.get
)(np.arange(X_tfidf_train.shape[1]))
ind = np.where(vocabulary[None, :] == transformer.vocabulary_[:, None])[1]
assert np.all(vocabulary[ind] == transformer.vocabulary_)
X_tfidf_train = X_tfidf_train[:, ind]
X_bow_test = bow.transform(X_test)
X_bow_test = [append_suffix(X_bow_test[i], X_test.shape[1]) for i in range(X_test.shape[0])]
X_tfidf_test = tfidf.transform(X_bow_test).A[:, ind]
# Check that the arrays are equal
assert np.all(X_tfidf_train == X_new_train)
assert np.all(X_tfidf_test == X_new_test)
Ahh, your input about dimensions and array sizes makes sense (indeed, my background is not computer science :) ). I also happen to be dealing with an imbalanced classification problem, so now that I have a baseline I wanted to use the pipeline from imbalanced-learn to run some experiments. Now I have a choice of when to upsample the minority class: before or after the BOW-TFIDF transformation. If I just do oversampling with replacement, I don't think it matters whether I resample before or after, because the same instance is going to be transformed the same way. If I arrange the pipeline like this, with the oversampling before the transformer:

from sklearn.model_selection import cross_validate
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
sampler = RandomOverSampler(random_state=0, sampling_strategy=0.2)
pipe_imb = ImbPipeline(steps=[('oversample', sampler), ('bowtfidf', transformer), ('rfc', rfc)])
scoring = ['accuracy', 'recall', 'precision']
scores = cross_validate(pipe_imb, X_train_val, y_train_val, scoring=scoring, cv=5, verbose=5)

but for simple oversampling, as I say, I think it doesn't matter:

pipe_imb = ImbPipeline(steps=[('bowtfidf', transformer), ('oversample', sampler), ('rfc', rfc)])

This works. But if I wanted to generate slightly different learning instances (something akin to SMOTE but for time series; I don't even know yet if it exists :D), I should do it before the transformation. Do you have some experience with this? Slightly off topic: are you a teacher/professor somewhere? Your explanations are very to-the-point!
If you just perform oversampling, it doesn't matter as you said. Also I don't think that you need oversampling if you use a
There's no such tool as far as I know. Back to time series: if you don't want to bother with complex black-box deep learning approaches, I would say that you could investigate the discrete Fourier transform as the latent space:
This should work for a dataset of univariate time series (but I don't know how well it performs). For multivariate time series, it's probably trickier because of the correlation between the features. A simple solution would be to use the same convex combination for all the features. It may be possible to achieve this by applying SMOTE to each univariate time series dataset and setting the same random state for each of them. Edit: SMOTE computes the nearest neighbors of the minority samples, so you could get different neighbors for different features. Maybe an approach would be to stack the Fourier coefficient columns (as you do with the words), use SMOTE on this 2D dataset to generate the new Fourier coefficients for all the features at once, then go back to multivariate time series.
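A rough sketch of that idea for a dataset of univariate time series, using imbalanced-learn's SMOTE on the Fourier coefficients (the helper name is only illustrative, and this is not a validated augmentation method):

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def smote_in_fourier_space(X, y, random_state=0):
    """Oversample univariate time series by applying SMOTE in the Fourier domain."""
    # Latent space: real and imaginary parts of the real FFT coefficients
    coeffs = np.fft.rfft(X, axis=1)
    Z = np.hstack([coeffs.real, coeffs.imag])
    # Generate synthetic minority samples in the latent space
    Z_res, y_res = SMOTE(random_state=random_state).fit_resample(Z, y)
    # Back to the time domain
    n = Z_res.shape[1] // 2
    X_res = np.fft.irfft(Z_res[:, :n] + 1j * Z_res[:, n:], n=X.shape[1], axis=1)
    return X_res, y_res
```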
Thank you! I'm a postdoctoral researcher working on something totally different and my work on machine learning for time series is a side project (there's a bit more information on my website: https://johannfaouzi.github.io). I would like to become an associate professor at a university in the near future :)
Just a few obvious comments on resampling: it should only be performed on the training set, which means that:
I'm going to assume that you are using the quantiles of the standard normal distribution to define the bin edges (i.e., the 'normal' strategy). If the option to skip the standardization of (almost) constant subsequences is enabled, then different constant subsequences may lead to different words, in order to take into account the constant value. For instance, with 3 symbols, the bin edges are approximately -0.43 and 0.43.
So, with the skipping enabled, different constant values can fall into different bins and thus yield different words. If this option is disabled, then all the subsequences are standardized. Constant subsequences will be transformed into a vector full of zeros (dividing by the standard deviation is skipped), so in this case all the constant subsequences will be mapped to the same word, made of the single "middle" symbol ("bbb" with 3 symbols). Let me know if this is clearer or not!
Right indeed you are! This makes sense, I remember now :) Lastly, I wanted to visualise the relevant patterns I find with the bag-of-words approach in the original time series.
The location of the symbols is wrong. I think that you may have misunderstood how the subsequences are extracted. Nonetheless, updating the code to get what you're looking for was a great idea! If you look at your time series, you can see that it is (almost) constant starting at time point 11. Therefore, all the extracted subsequences whose start index is greater than or equal to 11 will be transformed into the same word. I recently submitted a review paper on time series classification and coded a function to illustrate another algorithm (pyts.transformation.BOSS) that is very similar to what you're doing (except that the words are extracted using pyts.approximation.SymbolicFourierApproximation instead of PAA+SAX). The resulting figure is displayed below, with the following caption:
Here is an example with BagOfWords:

import matplotlib.pyplot as plt
import numpy as np
from pyts.bag_of_words import BagOfWords
from pyts.datasets import load_basic_motions
from sklearn.feature_extraction.text import CountVectorizer
# Display stuff
x_coord_caption_subfigure = 0.00
y_coord_caption_subfigure = 1.04
# Data
X, _, _, _ = load_basic_motions(return_X_y=True)
X = X[:, 0, :]
# Parameters
window_size = 10
window_step = 1
word_size = 2
n_bins = 5
n_windows = (X.shape[1] - window_size + window_step) // window_step
fig, axes = plt.subplots(4, 1, figsize=(12, 8), gridspec_kw={'hspace': 0.5})
# Plot raw time series
idx = 0
axes[0].plot(X[idx])
xlim = axes[0].get_xlim()
ylim = axes[0].get_ylim()
axes[0].text(x_coord_caption_subfigure, y_coord_caption_subfigure,
'(a)', fontweight='bold', va='bottom', ha='left', transform=axes[0].transAxes)
# Plot subsequences
def offset(x_sub, i):
return - (i % (window_size + 5)) / 4 + (1.0052278 - x_sub.max())
for i in range(n_windows):
x_sub = X[idx, i:i + window_size]
axes[1].plot(np.arange(i, i + window_size), x_sub + offset(x_sub, i), linewidth=0.5)
axes[1].set_ylim((ylim[0] - 2, ylim[1]))
axes[1].set_axis_off()
axes[1].text(x_coord_caption_subfigure, 0.95,
'(b)', fontweight='bold', va='bottom', ha='left', transform=axes[1].transAxes)
# Plot SAX words and illustrate numerosity reduction
bow = BagOfWords(window_size=window_size, word_size=word_size, n_bins=n_bins,
window_step=window_step, numerosity_reduction=False)
X_bow = bow.transform(X)
X_word = np.asarray([X_bow[i].split(' ') for i in range(X.shape[0])])
not_equal_idx = np.c_[np.full(X.shape[0], True), X_word[:, 1:] != X_word[:, :-1]]
x_word = X_word[idx]
x_bow = X_bow[idx]
for i in range(n_windows):
alpha = 1. if not_equal_idx[idx, i] else 0.2
fontweight = 'bold' if not_equal_idx[idx, i] else 'normal'
axes[2].text(i, 1.2 - (i % (window_size + 5)) / 10, x_word[i],
color=f'C{i}', alpha=alpha, fontsize=6, fontweight=fontweight)
axes[2].set_xlim(xlim)
axes[2].set_axis_off()
axes[2].text(x_coord_caption_subfigure, 1.26,
'(c)', fontweight='bold', va='bottom', ha='left', transform=axes[2].transAxes)
# Plot histogram
bow.set_params(numerosity_reduction=True)
X_bow_with_numerosity = bow.transform(X)
count = CountVectorizer()
X_hist = count.fit_transform(X_bow_with_numerosity).A
vocabulary = np.array(sorted(count.vocabulary_.keys()))
axes[3].bar(np.arange(np.sum(X_hist[idx] > 0)), X_hist[idx][X_hist[idx] > 0])
axes[3].set_xticks(np.arange(np.sum(X_hist[idx] > 0)))
axes[3].set_xticklabels(vocabulary[X_hist[idx] > 0])
axes[3].text(x_coord_caption_subfigure, y_coord_caption_subfigure,
'(d)', fontweight='bold', va='bottom', ha='left', transform=axes[3].transAxes)
plt.show() which gives the following figure: |
Description
When running SAX-VSM on my timeseries I get the following error:
At least one sample is constant.
I tried filtering out all the constant time series with
X = X[np.where(~(np.var(X, axis=1) == 0))[0]]
to no avail
I tried fitting the model on 1 non-constant array and still got the error. I think that the issue is that this error is thrown when the SAX approximation would give the same symbol for the whole window, meaning that the window is treated as constant. E.g., for a word size of 3, if the SAX transform would yield 'aaa', then this error appears. Could that be the case?
Steps/Code to Reproduce
Versions
NumPy 1.20.1
SciPy 1.6.1
Scikit-Learn 0.24.1
Numba 0.53.1
Pyts 0.11.0
Additionally an error:
'n_bins' must be greater than or equal to 2 and lower than or equal to min(word_size, 26)
If n_bins represents the alphabet size, why should it be lower than or equal to the word size?
Doesn't that mean that situations like alphabet = {a, b, c, d, e} and word size = 3 are impossible? (Which shouldn't be the case.)