- Documentation - Table of Contents
- Sign Up Process
- Getting Started with SSH
- Getting Started with the Power Platform
- Add NetmindMixins to your Training Code
- Import a Training Job from Google Colab
- Export Trained Models' Checkpoints to Google Colab or any other Machine for Inference
- Binding your MetaMask Wallet to your NetMind Account
In order to use any NetMind service, you need to create an account first. You can register with your Google or Microsoft account, or with any email address. After login, click on the icon on the top right of the screen and select "Console" from the dropdown menu (see image below).
You will be directed to your personal page.
Follow these steps to rent a server with GPUs.
After signing in, select "Console" from the dropdown menu. You will be directed to your personal page. Select "Rent GPU" from the dropdown on the left, and then click on "Rent new machine".
Fill in the form which will appear. At the end you will be asked to choose a password which you will use to SSH. Click "Rent machine" once the form is complete.
You should now see a dashboard with your newly rented machine. If you do not see the IP Address and Port, you may need to manually refresh the browser page. You can copy the SSH command by clicking on the button as shown below.
ssh root@alb.protago-dev.com -p 10362
If you're on Linux or macOS, you can usually paste this command directly into your terminal in order to connect. On Windows, you can SSH using the PuTTY software, available from this link.
Our product also supports IDE connection (for example via VSCode).
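For example, to connect from VSCode via the Remote-SSH extension, you could add an entry like the following to your ~/.ssh/config file. This is only a sketch using the example address and port shown above; substitute the details of your own rented machine.

Host netmind-gpu
    HostName alb.protago-dev.com
    User root
    Port 10362

You can then connect with ssh netmind-gpu or pick the host from VSCode's Remote-SSH host list.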
NOTES
- If any failure or technical issue is detected on the server, the machine status on the dashboard might show as "Offline". In that case a "restart" button should appear; please restart the machine. If the problem persists, contact our support at hello@netmind.ai.
- When you request a machine, you will be able to specify a set rental time up to 7 days. We will send you a reminder to renew the lease when you have less than one day remaining. If you do not renew, upon expiry the machine will be made available for anyone to rent and all your data will be permanently and securely removed from the server.
- It is not possible to retrieve data after the rental is terminated and the machine has been released back to our pool of available servers. We suggest you create a local backup of any relevant data before terminating the rental. If you're on a Linux or macOS system, this can be achieved via Secure Copy with the `scp` command. The example below, run from your local machine, shows how to download the `/data/example.zip` file from the remote machine into your home directory. You will need to adjust the port and IP address accordingly. On Windows, you can use WinSCP.
scp -P 10362 root@alb.protago-dev.com:/data/example.zip ~/example.zip
After signing in, select "Console" from the dropdown menu. You will be directed to your personal page. Select "Workspaces" from the dropdown on the left, and then select "create a new workspace", assign your workspace a name and press "Confirm". A workspace should now appear under your Workspaces. For example, a workspace named "test" will look like this.
Click on your workspace box and then click on "Create Training job", as shown below. You should then see the following page. Assign your training job a name and fill in the fields as required. You will need to upload your training code and, if applicable, your data. In order for your code to train successfully on our platform, you need to make some minor modifications to your codebase by adding the `NetmindMixins` library. The required modifications depend on whether you're implementing a custom training loop or using the HuggingFace `Trainer` class. Details are given in the section below.
We also support uploading a training job directly from Google Colab. For a step-by-step guide, see the section Import a Training Job from Google Colab at the end.
Here we share some examples and general guidance on how to train machine learning models on our platform using the `NetmindMixins` library. `NetmindMixins` is a proprietary library developed by NetMind.AI to facilitate the training of machine learning models across our distributed platform ecosystem, the NetMind Power platform. It is already pre-installed in all our environments, so you need not worry about installation. Please specify any other library required for your project in a `requirements.txt` file and add it to the root folder of your project structure. This ensures every listed library is installed before your code is run. If the file has a different name, it will not be recognized by our system.
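As an illustration, a minimal requirements.txt might look like the following; the libraries listed here are hypothetical examples, so include whatever your own code actually imports.

torch
transformers
datasets
scikit-learn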
Below are the instructions and general principles to follow in order to add our `NetmindMixins` framework to your existing training code. This is necessary to run your code successfully on our platform, as it allows us to monitor essential functions such as training progress, estimate the training time, save checkpoints to the cloud and bill accounts appropriately, and distribute the workload across multiple machines. Detailed general instructions are given below; if you want to see specific examples of how they are applied in practice, please see the following example files (available in this repository) and notebooks (links below).
Files
- `story_custom_trainer.py`: in this file, we show how to apply our `NetmindMixins` wrapper to a custom training function.
- `story_hf_trainer_data_parallel.py`: in this file we use the `transformers` `Trainer` to do the training and use our `NetmindMixins` wrapper to implement data parallelism.
- `story_hf_trainer_model_parallel.py`: in this file we use the `transformers` `Trainer` to do the training and use our `NetmindMixins` wrapper to implement model parallelism.
Notebooks
| Type | Python File Name | Colab |
|---|---|---|
| Custom Trainer | story_custom_trainer.py | |
| HuggingFace Trainer | story_hf_trainer.py | |
| Model Parallel | story_hf_trainer_model_parallel.py | |
NOTE: If you convert a Google Colab notebook to a python file, remember to remove library install lines such as `!pip install ...` and `!git clone ...`, and anything else which is not valid python code.
The instructions given below vary based on whether you are implementing a custom training loop using the PyTorch framework, or using the HuggingFace `Trainer` class.
The two examples should be generalizable and adaptable to the vast majority of use cases. However, if you feel they do not apply or cannot be applied to your codebase, feel free to join our Discord support channel or reach out at hello@netmind.ai and a member of our team will aim to provide more tailored guidance.
In the UI, you will be asked to upload your code and your dataset separately. The dataset upload supports a much larger size limit than the code upload. Both your code and dataset folders should be uploaded as zip files, unless the code consists of a single python file, in which case it can be uploaded directly. If your project has a multi-folder structure, we ask that the "entry file" (the script run when training your model) be placed in the root folder of the project, as in the example layout below. This file can, of course, still import modules located in other folders.
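As an illustration, a zipped multi-folder project might be laid out as follows (all file names here are hypothetical; train.py would be the entry file):

my_project/
├── train.py            # entry file, placed in the root folder
├── requirements.txt
└── utils/
    └── preprocessing.py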
In order to access the dataset from your python file, you should define a `data` training argument in your entry file; your data folder can then be accessed via `training_args.data`. See the example below.
import argparse

def setup_args():
    parser = argparse.ArgumentParser()
    # --data will point to the dataset folder you uploaded
    parser.add_argument(
        '--data', type=str, required=False, help=''
    )
    return parser.parse_known_args()[0]

training_args = setup_args()
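If you want to check this argument parsing locally before uploading, one possible sketch (using a hypothetical local path) is:

import sys

# simulate the argument that the Power platform passes automatically (hypothetical local path)
sys.argv += ["--data", "./my_local_data"]

training_args = setup_args()
print(training_args.data)  # -> ./my_local_data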
The Power platform will then automatically pass the data argument to your code, which can be accessed through `training_args.data`. Below are two examples showing how you can load your data via `training_args.data` (each codebase will vary, so you will need to tailor this to your code).
- If the folder you uploaded has a structure that can be loaded using the HuggingFace datasets library, you can load it in this way
from datasets import load_dataset
data = load_dataset(training_args.data)
- If your uploaded data folder contains files within subfolders which you need to access, for example the `data/SQuAD/train-v1.1.json` file, you can load it in this way
import os
import json

with open(os.path.join(training_args.data, "SQuAD/train-v1.1.json"), "r") as f:
    train = json.load(f)
These are the required import statements
from NetmindMixins.Netmind import nmp, NetmindOptimizer, NetmindDistributedModel
and the library is initialized as follows
nmp.init(use_ddp=True)
This initialization line should be placed in your code before the training loop.
NOTE: please do not call `torch.distributed.init_process_group` after `nmp.init`, as it's already called within `nmp.init` and doing this twice will raise an error.
Set `use_ddp=True` to use data parallelism and `use_ddp=False` to use model parallelism.
At the moment, we support either model parallelism or data parallelism, but not both at the same time. We leave it to you, as the expert in your model, to decide which technique is best. If in doubt, we recommend starting with data parallelism; if you then encounter a memory error (for example, your model cannot fit within the GPU RAM even when using low batch sizes), switch to model parallelism.
If you intend to run the model using data parallelism with `use_ddp=True` as shown above, you also need to move the model, and any relevant tensors (such as input ids), to device number zero, as shown below
import torch
device = torch.device("cuda:0")
model.to(device)
We will take care of distributing your training across multiple devices. This step does not apply if you are using model parallelism.
After you have instantiated your machine learning model and optimizer in the usual way (which will depend on the specifics of your codebase), you should wrap both the model and the optimizer in our model and optimizer classes, as shown below.
- If you are using distributed data parallelism
ddp_model = NetmindDistributedModel(
torch.nn.parallel.DistributedDataParallel(model)
)
optimizer = NetmindOptimizer(optimizer)
- If you are not using distributed data parallelism (for example you're using distributed model parallelism instead, or no parallelism)
ddp_model = NetmindDistributedModel(model)
optimizer = NetmindOptimizer(optimizer)
Before you run the training loop, you should add the following call to initialize the training progress bar
nmp.init_train_bar(max_steps=1000)
where `max_steps` controls the total number of training steps your model will run for. If you do not want to set a maximum number of steps, we suggest setting this parameter to a very large number, larger than the length of your training dataset multiplied by the number of training epochs, so that it will never be reached regardless of the batch size chosen.
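As a rough sketch of such an upper bound (`train_data` and `num_train_epochs` are placeholders for whatever your own code defines):

# the total number of steps can never exceed dataset length times epochs, whatever the batch size
upper_bound = len(train_data) * num_train_epochs
nmp.init_train_bar(max_steps=upper_bound)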
Inside the training loop you will need to call `nmp.step`, which is required to allow us to monitor the training progress. Without it, we cannot estimate the total training time, and you cannot monitor metrics such as the loss via our user interface. Where exactly you place this call depends on your code structure, but in general it should go at the end of each training step, after you have calculated the loss. If you have a standard structure with two `for` loops, one over epochs and one over training steps, your code should look like this
for epoch in range(training_args.num_train_epochs):
    ...  # YOUR CODE
    model.train()
    total_loss = 0
    for train_step, batch in enumerate(train_data):
        optimizer.zero_grad()
        ...  # YOUR CODE, INCLUDING THE FORWARD PASS
        optimizer.step()
        scheduler.step()  # You may or may not have a scheduler
        # REQUIRED ADDITION BELOW
        monitor_metrics = {
            "loss": loss.item(),
            "Learning rate": scheduler.get_last_lr()[0]  # remove if you don't have a scheduler
        }
        nmp.step(monitor_metrics)
WARNING: If you have both a training and an evaluation loop in your code, please make sure `nmp.step` is added to the training loop. You can also call `nmp.evaluate` after the evaluation loop (if you have one), as shown below.
NOTE: while `nmp.step` needs to be placed inside the training loop, at the end of the innermost loop after each training step, `nmp.evaluate` needs to be placed outside the evaluation loop, at the end. Adding `nmp.evaluate` is optional, but it will help us calculate the evaluation metrics and update the UI.
for train_step, batch in enumerate(valid_data):
    ...  # YOUR CODE, INCLUDING THE FORWARD PASS

training_stats[f"epoch_{epoch + 1}"] = {
    "training_loss": average_training_loss_per_batch,
    "valid_loss": average_validation_loss_per_batch,
    "valid_accuracy": average_validation_accuracy_per_epoch,
    "training_time": training_time,
    "valid_time": valid_time
}
nmp.evaluate(training_stats[f"epoch_{epoch + 1}"])
NOTE: The variables shown above, such as `average_training_loss_per_batch`, are for illustration purposes only; you should adapt them to your code.
At the very end of the training code (after all epochs have finished, outside all training loops), you must add this call once:
nmp.finish_training()
Your trained model checkpoints will then become available for you to download from the NetMind Power platform UI once the training is complete.
In the UI, you will be asked to upload your code and your dataset separately. The dataset upload supports a much larger size limit than the code upload. Both your code and dataset folders should be uploaded as zip files, unless the code consists of a single python file, in which case it can be uploaded directly. If your project has a multi-folder structure, we ask that the "entry file" (the script run when training your model) be placed in the root folder of the project. This file can, of course, still import modules located in other folders.
In your code you will need to define a `ModelTrainingArguments` class which inherits from the HuggingFace `transformers.TrainingArguments` class, as shown below.
from dataclasses import dataclass
from transformers import TrainingArguments, HfArgumentParser

@dataclass
class ModelTrainingArguments(TrainingArguments):
    data: str = None
    max_steps: int = 1000
    ...

parser = HfArgumentParser(ModelTrainingArguments)
training_args = parser.parse_args_into_dataclasses(return_remaining_strings=True)[0]
This class sets the training arguments to be passed to the `Trainer`. As shown above, you should set `data` to `None`, and set `max_steps` as required; the latter controls the total number of training steps your model will run for. If you do not want to set a maximum number of steps, we suggest setting this parameter to a very large number, larger than the length of your training dataset multiplied by the number of training epochs, so that it will never be reached regardless of the batch size chosen. Any other training argument you wish to pass to the `Trainer` class should be set within the `ModelTrainingArguments` class above, as in the sketch below.
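For example, a sketch of overriding a few standard `transformers.TrainingArguments` fields inside the same dataclass (the values shown are purely illustrative):

from dataclasses import dataclass
from transformers import TrainingArguments

@dataclass
class ModelTrainingArguments(TrainingArguments):
    data: str = None
    max_steps: int = 1000
    # any standard TrainingArguments field can be overridden here; values are illustrative
    per_device_train_batch_size: int = 8
    learning_rate: float = 5e-5
    weight_decay: float = 0.01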
If you uploaded a dataset, you can access it from your code through `training_args.data`. Below are two examples showing how you can load your data via `training_args.data` (each codebase will vary, so you will need to tailor this to your code).
- If the folder you uploaded has a structure that can be loaded using the HuggingFace datasets library, you can load it in this way
from datasets import load_dataset
data = load_dataset(training_args.data)
- If your uploaded data folder contains files within subfolders which you need to access, for example the `data/SQuAD/train-v1.1.json` file, you can load it in this way
import os
import json

with open(os.path.join(training_args.data, "SQuAD/train-v1.1.json"), "r") as f:
    train = json.load(f)
These are the required import statements
from NetmindMixins.Netmind import nmp, NetmindTrainerCallback
and the library is initialized as follows
nmp.init(use_ddp=True)
This initialization line should be placed in your code before instantiating the HuggingFace `Trainer` class (see below).
NOTE: please do not call `torch.distributed.init_process_group` after `nmp.init`, as it's already called within `nmp.init` and doing this twice will raise an error.
Set `use_ddp=True` to use data parallelism and `use_ddp=False` to use model parallelism.
At the moment, we support either model parallelism or data parallelism, but not both at the same time. We leave it to you, as the expert in your model, to decide which technique is best. If in doubt, we recommend starting with data parallelism; if you then encounter a memory error (for example, your model cannot fit within the GPU RAM even when using low batch sizes), switch to model parallelism.
If you need to implement your own training and/or evaluation metrics, you should create a class inheriting from `NetmindTrainerCallback` and adapt it to your use case, as shown in the example below.
import transformers

class CustomTrainerCallback(NetmindTrainerCallback):
    def __init__(self):
        super().__init__()

    '''
    Add custom training metrics
    '''
    def on_step_end(
        self, args: transformers.TrainingArguments,
        state: transformers.TrainerState,
        control: transformers.TrainerControl, **kwargs
    ):
        kwargs["custom_metrics"] = {}
        return super().on_step_end(args, state, control, **kwargs)

    '''
    Add custom evaluation metrics
    '''
    def on_evaluate(
        self, args: transformers.TrainingArguments,
        state: transformers.TrainerState,
        control: transformers.TrainerControl, **kwargs
    ):
        kwargs["custom_metrics"] = {}
        return super().on_evaluate(args, state, control, **kwargs)
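For instance, a minimal sketch of reporting the `Trainer`'s current global step as a custom training metric; the metric name is arbitrary, and `state.global_step` is a standard `transformers.TrainerState` field:

    def on_step_end(
        self, args: transformers.TrainingArguments,
        state: transformers.TrainerState,
        control: transformers.TrainerControl, **kwargs
    ):
        # report any values you track yourself; here we log the current global step
        kwargs["custom_metrics"] = {"global_step": state.global_step}
        return super().on_step_end(args, state, control, **kwargs)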
If you do not need to add any custom metrics on training or evaluation, you can skip the step above and use `NetmindTrainerCallback` directly when instantiating the `Trainer` (see explanation below).
When instantiating the HuggingFace `Trainer`, you will need to set `args=training_args` (see above for how `training_args` was defined) and add `callbacks=[CustomTrainerCallback]` as an argument of your `Trainer` class, as shown below. If you didn't define a `CustomTrainerCallback` class, you can set this argument to `callbacks=[NetmindTrainerCallback]` instead. After the training is done, you also need to call `nmp.finish_training`. See the example below
import transformers
from transformers import Trainer

trainer = Trainer(
    model=...,
    tokenizer=...,
    args=training_args,
    train_dataset=...,
    eval_dataset=...,
    optimizers=...,
    # any other Trainer arguments your code needs
    callbacks=[CustomTrainerCallback]
)
trainer.remove_callback(transformers.trainer_callback.PrinterCallback)
trainer.remove_callback(transformers.trainer_callback.ProgressCallback)

trainer.train()
nmp.finish_training()
Your trained model checkpoints will then become available for you to download from the NetMind Power platform UI once the training is complete.
We now also allow you to import a training job directly from Google Colab to the NetMind Power platform. You can do this by following these steps:
- Find your training code file on Google Colab.
- Modify your code by adding the `NetmindMixins` library as explained in this section.
- Share your Google Colab file with NetMind Power. In order to do this, you will need to make your code visible to "Anyone with the link" (see below).
- Follow the same steps as in the Getting Started section in order to create a new training job. However, when you get to the final interface, under "Training type" you should select "Using Google Colab URL" as shown below.
- Under "Training file", simply paste the URL link provided by Google Colab for your notebook
- Fill in the rest of the fields as required and submit the training job.
NOTE: If you choose to upload your own data locally, you should follow the instructions in the Uploading Datasets section and access the data from your code via `training_args.data`.
After you have trained your model using our Power Platform, if you wish to do inference, you can download your checkpoints to Google Colab, your local machine or any machine with an internet connection. To do so, please follow the steps below.
- On your Workspace, copy the ID of the training job you want to download the checkpoints for. You can do this by clicking on the squared icon below. An example ID would be `2356f5c2-7f25-436f-a4b2-759e9e12a01b`.
- On your Console, use the left panel to navigate to the "API keys" tab and then click on "Create new API Key". In the "Name" field, assign a unique name to your key and click "Create secret key" when done. An API key will then be generated, for example `INIYBEXPK68U1CLH`.
- Open the environment you want to download the checkpoints to. This could be Google Colab, your local machine or any machine with an internet connection. Run the script below, replacing the job ID and API key with your own. Note that this script will install NetmindMixins in your python environment.
import os
os.system("pip install git+https://github.com/protagolabs/NetMind-Mixin-Runtime@py310")

from NetmindMixins.Netmind import nmp

model_path = nmp.download_checkpoint(
    job_id="2356f5c2-7f25-436f-a4b2-759e9e12a01b",  # replace with your job ID
    token="INIYBEXPK68U1CLH"  # replace with your API key
)
if not model_path:
    raise Exception("download failed")

print(f"Checkpoints successfully downloaded at: {model_path}")
If you wish to volunteer computing power to the NetMind Power platform by connecting your idle GPUs to the network, you will receive rewards in the form of NMT (NetMind Token), the native utility token of the NetMind Chain. More details are
provided in sections 6 and 7 of our White Paper. In order to enable us to send you your rewards, you will need to create a MetaMask wallet and bind it to your NetMind Power account.
MetaMask is an online cryptocurrency wallet that allows you to store tokens (in this case NMT) securely; it provides a browser extension that allows you to easily access your wallet and tokens directly within the browser. The supported browsers are Chrome, Firefox, Brave, Edge and Opera. To download the extension for your specific browser, visit this page and select your browser of choice.
After you have created a MetaMask account and wallet, follow the steps below to bind it to your NetMind Power account.
- Login to the NetMind Power account (see here for sign-up instructions)
- Select "Account" from the dropdown menu (see image below).
- Click “Bind wallet” and follow the instructions to bind the wallet (see image below). Make sure your MetaMask wallet is already connected to the browser you're using to login to NetMind Power, otherwise this will not work.
- A pop-up should appear, and your wallet number should be automatically recognised. Check that the wallet address is correct and click "OK" (image below on the left). If for some reason this doesn't happen, enter your wallet address manually in the box which will appear (image below on the right).
NOTE: Make sure the wallet address you bound to your NetMind Power account is the same as the wallet address shown on your MetaMask plugin (the two boxes in red in the image below should show the same number).
NOTE: You can only bind one wallet address to each NetMind Power account. You cannot undo this action and you cannot bind a different wallet to the same NetMind Power account. If you need to change MetaMask wallet, please create a new NetMind Power account and bind it to your new wallet.