This repository contains the code for ProtoNN (a kNN-based algorithm) implemented in TensorFlow for large-scale multi-label learning. It also includes a script for running the training on multiple GPUs.
Note: some modifications have been made to improve run-time and performance on large-scale datasets. For more details about ProtoNN, please refer to the paper "ProtoNN: Compressed and Accurate kNN for Resource-scarce Devices". To reproduce the results in the original paper, please use the official code provided by the authors.
Unlike multi-class or binary classification, extreme multi-label (XML) algorithms tag data points with a subset of labels (rather than just a single label) from an extremely large label set. XML problems usually deal with a large number of labels (10^3 to 10^6 labels) and a large number of dimensions and training points.
For datasets, check: XML-repository
- TensorFlow
- FAISS
- NumPy
- SciPy
- EasyDict
See the IPython notebook to run the code on the Eurlex-4k dataset. To change the parameters, modify the config file.
To run on a new dataset:
- Create a new folder for the dataset and place two separate files, train_data.mat and test_data.mat, in it. Each file must contain two variables: X with shape (num instances, num features) and Y with shape (num instances, num labels).
- Create a config file in the cfgs folder with the required parameters.
- For a single GPU, adapt eurlex_train.py into train.py (import the correct config file). For training on multiple GPUs, adapt eurlex_multigpu_train.py into train.py instead. Then run:
python train.py
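
The data-file layout described in the steps above can be produced with `scipy.io.savemat`. The following is a minimal sketch, not part of this repository: the folder name `my_dataset`, the helper `save_split`, and the toy sizes are hypothetical, and storing X and Y as sparse matrices is an assumption (the training scripts may expect a different dtype or a dense array, so check how they load the files).

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from scipy.io import loadmat, savemat


def save_split(path, X, Y):
    """Write one split as a .mat file holding the variables X and Y.

    Hypothetical helper: X is (num instances, num features) and
    Y is (num instances, num labels), as the README requires.
    """
    if X.shape[0] != Y.shape[0]:
        raise ValueError("X and Y must have the same number of rows")
    savemat(path, {"X": X, "Y": Y})


# Toy sizes, for illustration only.
num_train, num_test, num_features, num_labels = 8, 4, 5, 3
rng = np.random.default_rng(0)

# Hypothetical dataset folder, created in a temporary location.
dataset_dir = os.path.join(tempfile.mkdtemp(), "my_dataset")
os.makedirs(dataset_dir)

for file_name, n in [("train_data.mat", num_train), ("test_data.mat", num_test)]:
    # Random features and a random multi-hot label matrix
    # (a 1 in column j means label j is assigned to that instance).
    X = sp.csr_matrix(rng.random((n, num_features)))
    Y = sp.csr_matrix(rng.integers(0, 2, size=(n, num_labels)).astype(np.float64))
    save_split(os.path.join(dataset_dir, file_name), X, Y)

# Read one file back to confirm both variables have the expected shapes.
loaded = loadmat(os.path.join(dataset_dir, "train_data.mat"))
print(loaded["X"].shape, loaded["Y"].shape)
```

For a real dataset, replace the random matrices with your own feature and label matrices and point `dataset_dir` at the folder created in the first step.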