how to speedup or use full memory? #264

Open
hanzigs opened this issue Dec 16, 2019 · 16 comments

@hanzigs

hanzigs commented Dec 16, 2019

Hi, I am running
algorithms.SOFM(n_inputs=79, n_outputs=2, learning_radius=1, step=0.1, shuffle_data=True, weight='sample_from_data',verbose=True)
The data array size is 4.3 GB.
Running on a Windows 10 machine with 32 GB RAM.
Set to 200 epochs, each epoch takes 25 minutes, so this may take about 3.5 days to complete.
It's not using the full memory though: while running, RAM usage is just 14-15 GB, with 16 GB still sitting free.
May I know how to speed this up or use the full memory or all cores, please?
Thanks

@itdxer
Owner

itdxer commented Dec 16, 2019

Hi, did you consider reducing the number of input samples? For example, you can randomly sample 5% of your data. In addition, your sample doesn't have to be fixed: for each epoch you can draw a new sample.

In some sense, it's just a sophisticated way of reducing the number of epochs. A smaller sample should approximate your dataset, and the obtained map should generalise to the larger set.

@itdxer itdxer self-assigned this Dec 16, 2019
@hanzigs
Author

hanzigs commented Dec 16, 2019

Thanks for that. May I know how to set a new sample for each epoch? Do you mean in sofm.train(df, epochs=200), where df is split into small batches?

If I split df, then I have to build multiple models, right? I have separate test data, and it would be hard to predict with multiple models.
Thanks

@itdxer
Owner

itdxer commented Dec 16, 2019

You need to do the sampling yourself.

Solution 1 (one fixed sample reused for every epoch)

data_sample = randomly_sample(data)
for _ in range(200):
    sofm.train(data_sample, epochs=1)

Solution 2 (a new sample drawn for each epoch)

for _ in range(200):
    data_sample = randomly_sample(data)
    sofm.train(data_sample, epochs=1)
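The `randomly_sample` helper in the two solutions is not defined in the thread. A minimal sketch of what it could look like with NumPy is below; the `fraction` and `random_seed` parameters are assumptions made for illustration, not part of neupy.

```python
import numpy as np


def randomly_sample(data, fraction=0.05, random_seed=None):
    """Return a random subset of rows from `data`, drawn without replacement.

    Hypothetical helper: `fraction` is the share of rows to keep and
    `random_seed` makes the draw reproducible when it is not None.
    """
    rng = np.random.default_rng(random_seed)
    n_samples = max(1, int(len(data) * fraction))
    indices = rng.choice(len(data), size=n_samples, replace=False)
    return data[indices]


# Example with synthetic data standing in for the real array.
data = np.random.default_rng(0).normal(size=(1000, 79))
data_sample = randomly_sample(data, fraction=0.05, random_seed=1)
print(data_sample.shape)
```

Passing the same `random_seed` returns the same rows, which matters if training has to be restarted from a saved model.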

@hanzigs
Author

hanzigs commented Dec 16, 2019

Oh Yes, Thanks for that, Will try

@sujithgangaraju

sujithgangaraju commented Dec 17, 2019

I have the same issue when training with a large array of data.
The length of my data (np.ndarray) is around 4000, and training takes a long time.

cluster_ranges = [387, 388]  # I have around 130 elements and loop over them
for i_clusters in cluster_ranges:
    sofm = algorithms.SOFM(n_outputs=i_clusters, **self.sofm_dict)
    sofm.train(data, epochs=100)
    # ...then append the silhouette score to an array

The above is my sample code. The loop with the training takes a very long time to finish, which is killing performance.

Can someone help me optimize and speed up the algorithm in the above scenario?

@itdxer
Owner

itdxer commented Dec 17, 2019

Hi @sujithgangaraju, do you have to run it for 100 epochs? Can you use some sort of convergence criterion to avoid training for that many epochs?

For example,

epsilon = 0.1 # you should pick the right value

for epoch in range(100):
    sofm.train(data, epochs=1)
    training_errors = sofm.errors.train
    
    if epoch >= 1 and abs(training_errors[-1] - training_errors[-2]) < epsilon:
        break  # stop training
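A single small change between two epochs can be noise, so one possible variation is to stop only after several consecutive small changes. The sketch below is an illustration, not part of neupy: the `has_converged` name and the `patience` parameter are assumptions.

```python
def has_converged(errors, epsilon=0.1, patience=3):
    """Return True when the last `patience` consecutive changes in the
    error are all smaller than `epsilon` in absolute value."""
    errors = list(errors)
    if len(errors) < patience + 1:
        return False  # not enough history yet
    recent = errors[-(patience + 1):]
    return all(abs(b - a) < epsilon for a, b in zip(recent, recent[1:]))


# Intended use inside the training loop (sofm and data come from the thread):
#
# for epoch in range(100):
#     sofm.train(data, epochs=1)
#     if has_converged(sofm.errors.train, epsilon=0.1, patience=3):
#         break
```

As with the snippet above, `epsilon` has to be chosen for the scale of your own error values.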

Also, is it possible to reduce the number of cluster counts that you need to test? In your example there is very little difference between 387 and 388, so the difference in silhouette scores should be essentially random. In addition, instead of doing a grid search you can do a more intelligent hyperparameter search. For example, you can use TPE (Python library that supports TPE: https://github.com/hyperopt/hyperopt). Check this article to learn more about TPE: http://neupy.com/2016/12/17/hyperparameter_optimization_for_neural_networks.html

In addition, you can distribute training across multiple machines. Each machine can train on its own unique subset of cluster_ranges.

I hope it helps

@sujithgangaraju

sujithgangaraju commented Dec 17, 2019

cluster_ranges

Yes, I have to run through 100 epochs. The cluster ranges are calculated from the min and max cluster counts, so I loop over all sequential values.

In my scenario, sofm.train takes almost an hour to finish for the first element alone (387).

Can we speed this up with the same cluster ranges and 100 epochs? I tried the above code snippet over a range of 100 epochs, but I don't see a difference in performance.

@itdxer
Owner

itdxer commented Dec 18, 2019

@sujithgangaraju

the cluster ranges which are calculating from min and max cluster. so all sequential ranges are looping over

When you have a large number of clusters, adding or removing one or two clusters shouldn't show any effect. I think in that case the increase between tested cluster sizes should follow an exponential function. Any difference between 387 and 388 won't be significant unless you have a very large dataset (for example, > 1M rows), and even then you would need to run your experiments many times. And even if the difference is statistically significant, I doubt that it will have any practical significance.
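One way to read the exponential spacing suggested above is to test cluster counts that grow by a constant ratio instead of by one. The sketch below is only an illustration; the function name and its parameters are assumptions.

```python
def geometric_cluster_sizes(min_clusters, max_clusters, n_candidates):
    """Return up to `n_candidates` cluster counts between the two bounds,
    spaced by a constant ratio (geometric progression) rather than by 1."""
    if n_candidates < 2 or min_clusters >= max_clusters:
        return [min_clusters]
    ratio = (max_clusters / min_clusters) ** (1.0 / (n_candidates - 1))
    # Rounding can produce duplicates for small values, so use a set.
    sizes = {round(min_clusters * ratio ** i) for i in range(n_candidates)}
    return sorted(sizes)


cluster_ranges = geometric_cluster_sizes(2, 512, 9)
print(cluster_ranges)
```

With bounds 2 and 512 and nine candidates this yields the powers of two from 2 to 512, so nine models are trained instead of several hundred.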

can we speed up with same cluster ranges and with 100 epochs?

Also, as I said before, you can run training on different machines using different subsets of cluster ranges. In this way you will parallelise your training. It might be important to shuffle your cluster sizes before distributing them across machines. For example, if the first machine processes cluster sizes [2, 3, 4] and the other one sizes [5, 6, 7], then clearly the second machine will spend more time on training, simply because each of its cluster sizes is larger than those on the first machine.
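A small sketch of the shuffle-then-split idea follows, using only the standard library. The function name, the round-robin assignment, and the fixed seed are assumptions chosen for illustration.

```python
import random


def split_across_workers(cluster_sizes, n_workers, seed=0):
    """Shuffle the cluster sizes, then deal them out round-robin so that
    no worker receives only the largest (slowest) configurations."""
    shuffled = list(cluster_sizes)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split on every machine
    return [shuffled[i::n_workers] for i in range(n_workers)]


assignments = split_across_workers(range(2, 8), n_workers=2)
for worker_id, sizes in enumerate(assignments):
    print(worker_id, sizes)
```

Because the seed is fixed, every machine can compute the same split locally and pick its own slice by worker index.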

I tried with the above code snippet over 100 epochs range but I dont see a difference in performance.

What convergence curve do you get when you plot error differences after each epoch?

Also, you can do batch training, but you need to make sure that your batches are reproducible during training. Maybe you can do something like this:

for epoch in range(100):
    data_sample = randomly_sample(data, random_seed=epoch)
    sofm.train(data_sample, epochs=1)

Other than that, I don't think there is anything you can do to speed up the algorithm without rewriting the code or extending the main logic with some heuristics.

@sujithgangaraju

Thanks for the inputs. I will let you know if I need any help.

@hanzigs
Author

hanzigs commented Dec 18, 2019

Hi @itdxer
From the solution

for _ in range(20):
     data_sample = randomly_sample(data)
     for epoch in range(200):
          sofm.train(data_sample, epochs=1)

Is it a good decision to pickle the model after 100 epochs, load it back again, and continue training? Will that cause overfitting?

I understand that Keras model.save stores the information needed to restart training. Can the same be done with sofm? Here, with sampling,
I save sofm and data_sample every time the inner for loop finishes.
Say that on the 10th outer loop some interruption stops the run (like a MemoryError).

Here I have 2 questions:

  1. Can I restart the loop from the saved sofm, removing the already used data_sample from the original data?
  2. After completing the full run of 20 outer loops, if I find the error could still go down, can I restart from the last saved sofm, starting from the same data again (no new data)?

I am saving sofm using pickle.dump.

Last one: is there a way to pass validation data and get the validation error on each epoch?
Thanks

@itdxer
Owner

itdxer commented Dec 19, 2019

say on 10th outer loop some interruption stopped the run (like memoryError),

That shouldn't happen. Is that because of SOFM? Does it fail when you're using some other model? For example, k-means from scikit-learn.

Can i restart the loop from the saved sofm and removing the data_sample from original data?

Yes, but you need to make sure that the datasets you use during training are reproducible. For example, if you use seed=10 to sample a subset of data for the 10th epoch, then when you save your model after the 10th epoch and load it back, you need to use the dataset with seed=11 for training, since technically you will be training it for the 11th epoch.

Hopefully my previous explanation answers the second question as well.
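One way to keep the seed and the epoch in step across restarts is to store the next epoch number next to the pickled model. The sketch below uses only the standard library; the function names and the checkpoint layout are assumptions, and `sofm` and `randomly_sample` in the trailing comments stand for the objects discussed in this thread.

```python
import os
import pickle


def save_checkpoint(path, model, next_epoch):
    """Pickle the model together with the epoch that should run next."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump({"model": model, "next_epoch": next_epoch}, f)
    # Rename only after a complete write, so an interruption
    # cannot leave a half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path, make_model):
    """Return (model, next_epoch); start from scratch if no file exists."""
    if not os.path.exists(path):
        return make_model(), 0
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["model"], state["next_epoch"]


# Intended use (hypothetical names from the thread):
#
# sofm, start_epoch = load_checkpoint("sofm.pkl", build_sofm)
# for epoch in range(start_epoch, 200):
#     data_sample = randomly_sample(data, random_seed=epoch)
#     sofm.train(data_sample, epochs=1)
#     save_checkpoint("sofm.pkl", sofm, next_epoch=epoch + 1)
```

Since the seed is derived from the epoch number, a restarted run draws the same sample for epoch 11 that an uninterrupted run would have drawn.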


Validation could be done in the following way

import numpy as np
# assumes sofm.weight has shape (n_inputs, n_outputs)
test_cluster_indices = sofm.predict(x_test).argmax(axis=1)
cluster_centers = sofm.weight[:, test_cluster_indices].T
test_error = np.abs(x_test - cluster_centers).mean()

Note that training and test errors won't be comparable in that case since training error is being accumulated during the training, so that's some sort of running error estimate over the epoch (weights are being updated N times, where N is the number of samples).
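The same kind of validation error can be sketched without the model object, given only a matrix of cluster centers. The version below assigns each sample to its nearest center by Euclidean distance, which is an assumption about the distance in use; `centers` is assumed to hold one center per row.

```python
import numpy as np


def mean_abs_quantization_error(x, centers):
    """Mean absolute difference between each sample and its nearest center.

    x: array of shape (n_samples, n_features)
    centers: array of shape (n_clusters, n_features)
    """
    # Pairwise Euclidean distances, shape (n_samples, n_clusters).
    # For very large x, compute this in chunks to limit memory use.
    distances = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    nearest = distances.argmin(axis=1)
    return np.abs(x - centers[nearest]).mean()


x_test = np.array([[0.0, 0.0], [10.0, 10.0]])
centers = np.array([[1.0, 1.0], [9.0, 9.0]])
print(mean_abs_quantization_error(x_test, centers))
```

Evaluating this on the same held-out array after every epoch gives a curve that is comparable across epochs, unlike the running training error.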

@hanzigs
Author

hanzigs commented Dec 19, 2019

Thanks @itdxer
The MemoryError is not because of sofm. I run multiple kernels in JupyterLab and the data occupies a lot of RAM, so sometimes I run into a MemoryError. I can control that.

And thanks for the validation suggestions.

Regarding reloading sofm, I got it.
Considering no interruption:
for every outer loop, if I save the data_sample (I will get 20 files) and wait until one full cycle of both loops finishes,
then reload all 20 data_samples and run the inner loop again,
am I going to continue from epochs 201 to 400?
Thanks

@itdxer
Owner

itdxer commented Dec 23, 2019

Am i going to get started from 201 to 400 epochs?

It depends on the number of epochs, since the model stored in each file has been trained for a different number of epochs. The first model went from the 1st to the 200th epoch, which means that you have to start from epoch 201. That won't be true for the second model, since it was trained until epoch 400, which means that after a restart you have to start from epoch 401.

@hanzigs
Author

hanzigs commented Dec 27, 2019

Hi @itdxer
I have 5 million records.
I have tried both batch training (data sampling) and full data: a batch is 500k records per epoch for 100 epochs, full is all 5 million records. Batch takes 2 min per epoch and full takes 18 min per epoch.
It uses only 5% of the CPU and 12 GB of my 32 GB RAM.
Here is the error curve:
[image: error curve]

After 4 batches the error keeps increasing, so I'm not happy with the model.
So I'm running the full data, 160 cycles so far.

I need something to increase speed, either by using all cores or some other way. Would it be possible to have an n_jobs parameter in the train function to use all cores?

I don't have GPU capacity.

Any suggestions for a speed-up?

@itdxer
Owner

itdxer commented Dec 27, 2019

After 4 batches the error keeps increasing, so not happy with the model

It looks like each increase and decrease spans a fixed number of learning cycles, around 100 cycles wide (which is probably the same as 100 epochs in your case). If you're using a different batch for each 100 epochs, then these errors are not comparable, and an increase or decrease in your graph doesn't mean that the model is getting better or worse. You can see that the error jumps only when you change your training batch. But when you run your training with a fixed batch your error is increasing. To make the error curve more reliable, you need to fix a test or validation set and evaluate your error on that fixed set after each iteration; otherwise it's hard to interpret the error curve.
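A fixed validation set can be carved out once, before any batches are drawn, so that every point on the error curve is measured on the same rows. This is a minimal NumPy sketch; the function name, the default fraction, and the seed are assumptions.

```python
import numpy as np


def split_train_validation(data, validation_fraction=0.1, random_seed=0):
    """Split rows once into a training pool and a fixed validation set."""
    rng = np.random.default_rng(random_seed)
    indices = rng.permutation(len(data))
    n_validation = max(1, int(len(data) * validation_fraction))
    validation = data[indices[:n_validation]]
    training_pool = data[indices[n_validation:]]
    return training_pool, validation


# Synthetic stand-in for the real dataset.
data = np.random.default_rng(0).normal(size=(1000, 79))
training_pool, validation = split_train_validation(data, 0.1)
print(training_pool.shape, validation.shape)

# Training batches are then sampled from training_pool only, and the error
# is evaluated on `validation` after each epoch.
```

Keeping the validation rows out of every training batch also avoids measuring the error on data the map has just been fitted to.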

Need something to increase speed either by using all cores or something, is that possible by having n_jobs parameter in train function to use all cores.

There is no way of doing it now, maybe there will be some option in the future

@hanzigs
Author

hanzigs commented Dec 27, 2019

Actually, the batch run follows the recommended solution, and this is unsupervised data.

for _ in range(10):
    data_sample = randomly_sample(data, 500000)  # 500k rows per batch
    for epoch in range(100):
        sofm.train(data_sample, epochs=1)
        training_errors = sofm.errors.train
