WEASEL+MUSE large number of features #102

Open
wwjd1234 opened this issue Jun 14, 2021 · 3 comments

@wwjd1234

Description

When using WEASELMUSE for multivariate time series classification, the transformer returns a very large number of features (650,000). Also, for some examples, every count in the histogram is zero. Is this expected?

Steps/Code to Reproduce

I used my own data set; the result, X_weasel, was an ndarray of shape 1500 x 650000.
The 1500 makes sense, as this is the number of examples I had, but the 650000 seems large.
I used the code below. When I run the same code on the basic motions data set loaded in the example, I get similar results: a large number of features, and some examples with all zeros, so there is nothing to plot when I draw the histogram.

    import numpy as np
    from pyts.multivariate.transformation import WEASELMUSE
    transformer = WEASELMUSE(strategy='uniform', word_size=4, window_sizes=np.arange(5, 70), sparse=False)
    X_weasel = transformer.fit_transform(X_train, y_train)

Versions

NumPy 1.20.3
SciPy 1.6.3
Scikit-Learn 0.24.2
Numba 0.53.1
Pyts 0.11.0

@johannfaouzi
Owner

Actually it's not that surprising, because this algorithm (in a nutshell) mainly consists of extracting many features and keeping only the best ones. Also, if the window size is very large compared to the number of time points, the algorithm can only extract a very small number of subsequences for this window size. Each subsequence is transformed into a word and the algorithm counts the words, so for such a window size the number of non-zero values is very small while the number of features is very large.
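
The counting argument above can be sketched in a few lines. This is only an illustration: the series length of 100 is hypothetical, the helper `n_windows` is not part of pyts, and the exact windowing (step between windows) used by the library may differ. It also assumes an alphabet of 4 symbols, so that words of length 4 can take 4**4 = 256 distinct values.

```python
def n_windows(n_timestamps: int, window_size: int, step: int) -> int:
    """Number of subsequences produced by a sliding window (hypothetical helper)."""
    if window_size > n_timestamps:
        return 0
    return (n_timestamps - window_size) // step + 1


n_timestamps = 100          # hypothetical series length
n_possible_words = 4 ** 4   # assumed alphabet of 4 symbols, word_size=4

for window_size in (5, 20, 40, 69):
    overlapping = n_windows(n_timestamps, window_size, step=1)
    disjoint = n_windows(n_timestamps, window_size, step=window_size)
    # Each subsequence yields one word, so at most this many of the
    # possible word counts can be non-zero for this window size.
    print(f"window={window_size}: {overlapping} overlapping or {disjoint} "
          f"non-overlapping subsequences vs {n_possible_words} possible words")
```

Whatever the step, the largest windows yield far fewer subsequences than there are possible words, which is consistent with mostly-zero histograms.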

You have two main approaches to decrease the number of features:

  • Obtaining an array with fewer features by changing the values of some arguments: decreasing the word size (word_size), the size of the alphabet (n_bins) or the number of windows considered (window_sizes), or increasing the threshold for the chi2 statistics (chi2_threshold). In particular, I think that considering every window size between two bounds is not necessary, because you will probably extract very correlated features.
  • Decreasing the number of features by using an estimator from scikit-learn to perform feature selection. There are several approaches that are well described in the scikit-learn documentation. There is also another tutorial in which correlated features are removed, which could be relevant.
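
The second approach could look roughly like the sketch below. It does not call pyts: `X_counts` is a synthetic stand-in for the transformer output (sparse, non-negative word counts), the labels are random, and the choice of `VarianceThreshold` followed by `SelectKBest` with `k=50` is an arbitrary assumption, one of several selectors scikit-learn offers.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

rng = np.random.default_rng(0)
n_samples, n_features = 100, 2000

# Synthetic stand-in for the WEASEL+MUSE output: mostly-zero word counts.
X_counts = sparse.csr_matrix(
    rng.poisson(0.02, size=(n_samples, n_features)).astype(np.float64)
)
y = rng.integers(0, 2, size=n_samples)  # random binary labels

# Step 1: drop constant columns (here, the words that never occur).
X_nonconstant = VarianceThreshold().fit_transform(X_counts)

# Step 2: keep the k features with the highest chi2 score (k is arbitrary here).
X_selected = SelectKBest(chi2, k=50).fit_transform(X_nonconstant, y)

print(X_counts.shape, X_nonconstant.shape, X_selected.shape)
```

Keeping the matrix sparse (sparse=True in the transformer) should also help with memory when most counts are zero.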

Hope this helps you a bit.

@wwjd1234
Author

Thank you for the explanation. I was able to reduce the size of the feature space by adjusting the values as you mentioned: decreasing the word size and setting two values for the window size instead of a range. This does, however, affect the classification accuracy of the model. I found that when I keep the 650,000 features I get excellent accuracies, but lower ones otherwise.

@johannfaouzi
Owner

Great if it's working well with the first set of values for the hyper-parameters.

I don't know if it's necessary to mention it, but it's mandatory to perform cross-validation to evaluate a model: it's really easy to overfit any machine learning algorithm on a data set of 1,500 samples and 650,000 features.
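
To illustrate the overfitting risk, here is a minimal sketch on synthetic data in which the labels are unrelated to the features by construction. The sizes (60 samples, 5,000 features), the selector and the classifier are all assumptions chosen to keep the example small; they are not the settings from this issue.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples, n_features = 60, 5000

# Random non-negative counts and balanced labels that carry no real signal.
X = rng.poisson(0.5, size=(n_samples, n_features)).astype(np.float64)
y = np.tile([0, 1], n_samples // 2)

# Feature selection sits inside the pipeline, so it is refit on each
# training fold and never sees the held-out samples.
model = make_pipeline(SelectKBest(chi2, k=200), LogisticRegression(max_iter=1000))

train_accuracy = model.fit(X, y).score(X, y)             # evaluated on the training data
cv_accuracy = cross_val_score(model, X, y, cv=5).mean()  # evaluated on held-out folds

print(f"training accuracy: {train_accuracy:.2f}")
print(f"5-fold cross-validated accuracy: {cv_accuracy:.2f}")
```

With many more features than samples, the training accuracy can look excellent even though there is nothing to learn, while the cross-validated accuracy stays much lower. With real data, any supervised step fitted with the labels (such as a transformer called with y_train) would presumably belong inside the pipeline as well.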
