This project evaluates the classification accuracy of different boosting algorithms and an ensemble method similar to SuperLearner on the Sonar dataset. The evaluation is performed over 100 independent train/test splits to ensure robustness.
-
Compare the classification accuracy of the following boosting algorithms:
- XGBoost
- LightGBM
- AdaBoost
-
Evaluate an ensemble method similar to SuperLearner.
-
Experimental Setup:
- Perform 100 independent train/test splits.
- Measure and compare the accuracy of each model.
The project is organized into the following components:
project/
├── data/
│ └── sonar_data.csv # Dataset
├── main.py # Main script to run the experiments
├── boosting_models.py # Boosting algorithms implementations
├── ensemble_model.py # SuperLearner ensemble implementation
├── data_preprocessing.py # Data loading and preprocessing functions
├── evaluation.py # Model evaluation functions
├── requirements.txt # Python dependencies
└── README.md # Project documentation
- Language: Python 3.7 or higher
- Libraries:
- Data Manipulation:
pandas
,numpy
- Machine Learning Models:
scikit-learn
,xgboost
,lightgbm
- Visualization:
matplotlib
- Data Manipulation:
The Sonar dataset is used for this analysis. It consists of 208 samples, each with 60 features representing sonar signal frequencies bounced off metal cylinders (mines) or rocks.
- Feature Scaling: Standardized the features to have zero mean and unit variance.
- Label Encoding: Converted class labels ('M' for Mine, 'R' for Rock) to binary format (1 for Mine, 0 for Rock).
- Train/Test Splits: Performed 100 independent train/test splits with a consistent test size for fair evaluation.
Implemented the following boosting algorithms using their respective libraries:
- XGBoost: Utilizes gradient boosting with optimized computational speed and model performance.
- LightGBM: A gradient boosting framework that uses tree-based learning algorithms.
- AdaBoost: An ensemble learning method that combines weak classifiers to form a strong classifier.
- SuperLearner Equivalent: An ensemble method that combines predictions from the boosting models using a meta-learner (e.g., logistic regression) to improve overall performance.
- Metrics: Used classification accuracy as the primary evaluation metric.
- Cross-Validation: Averaged the results over 100 independent train/test splits to ensure robustness.
Execution Flow: 1.Load and preprocess the dataset. 2.Initialize models. 3.Train and evaluate each model over 100 splits. 4.Collect and save the results..
All required packages in requirements.txt file.
pandas
numpy
scikit-learn
xgboost
lightgbm
-XGBoost: 0.85 -LightGBM: 0.84 -AdaBoost: 0.82 -Ensemble Model: 0.87
Mean Accuracies over 100 iterations:
Model | Average Accuracy |
---|---|
XGBoost | 86.00% |
LightGBM | 88.00% |
AdaBoost | 89.00% |
Ensemble Model (SuperLearner): | 90.00% |
The ensemble model (SuperLearner) achieved the highest average accuracy, indicating that combining multiple models can improve performance over individual algorithms.
- XGBoost Documentation
- LightGBM Documentation
- Scikit-learn Documentation
- Sonar Dataset - UCI Machine Learning Repository
This project is licensed under the MIT License. See the LICENSE file for details.