DataVolt is a modular toolkit designed to automate and streamline data engineering pipelines. It provides reusable, extensible components for data loading, preprocessing, feature engineering, and model training. With DataVolt, you can save time and effort by leveraging prebuilt tools that handle repetitive and complex data engineering tasks.
Built for technical users, DataVolt is ideal for:
- Data Scientists looking to simplify their preprocessing workflows.
- Machine Learning Engineers who need robust pipelines for consistent data transformations.
- Data Engineers aiming to optimize data ingestion and transformation pipelines.
Key features:
- Reusable Components: Prebuilt modules for data loading, cleaning, scaling, encoding, and more.
- Extensibility: Custom hooks for user-defined preprocessing and model training steps.
- Modular Design: Each file serves a specific purpose, ensuring flexibility and clarity.
- Integration Ready: Seamlessly integrate with cloud storage, SQL databases, or ML frameworks.
- Automated Pipelines: Chain tasks together into a functional pipeline to minimize manual coding.
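The pipeline-chaining idea can be sketched in a few lines of plain Python. This is purely illustrative and not DataVolt's actual implementation; the `SimplePipeline` class below is a hypothetical stand-in that just applies a list of callables in order:

```python
# Conceptual sketch of pipeline chaining -- not DataVolt's real API.
class SimplePipeline:
    def __init__(self, steps):
        self.steps = steps  # list of callables applied in order

    def run(self, data):
        for step in self.steps:
            data = step(data)  # output of each step feeds the next
        return data

# Chain two small transformations: drop missing rows, then scale values.
pipeline = SimplePipeline([
    lambda rows: [r for r in rows if r is not None],
    lambda rows: [r * 2 for r in rows],
])

print(pipeline.run([1, None, 3]))  # [2, 6]
```

DataVolt's `PreprocessingPipeline` builds on the same chaining principle with reusable step objects, as shown in the Quick Start below.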
You can install DataVolt from PyPI:

```shell
pip install datavolt
```

Alternatively, with uv:

```shell
uv pip install datavolt
```
The toolkit is organized into modular folders:
```
DataVolt/
├── loaders/                    # Modules for data ingestion
│   ├── csv_loader.py           # Load CSV files
│   ├── sql_loader.py           # Load data from SQL databases
│   ├── s3_loader.py            # Fetch data from S3 buckets
│   └── custom_loader.py        # Base class for custom loaders
├── preprocess/                 # Preprocessing modules
│   ├── cleaning.py             # Data cleaning utilities
│   ├── encoding.py             # Encoding categorical variables
│   ├── scaling.py              # Data scaling and normalization
│   ├── feature_engineering.py  # Feature engineering tools
│   └── pipeline.py             # Orchestrates preprocessing steps
├── model/                      # Model lifecycle management
│   ├── trainer.py              # Automates model training
│   ├── evaluator.py            # Evaluates model performance
│   ├── hyperparameter.py       # Hyperparameter optimization
│   └── model_export.py         # Export trained models
├── ext/                        # Extensions and utilities
│   ├── logger.py               # Logging utilities
│   ├── custom_step.py          # Hooks for custom pipeline steps
│   └── neptune_integration.py  # Logs experiments using Neptune.ai
└── README.md                   # Project documentation
```
## Quick Start Guide
Choose a loader module based on your data source:

```python
from datavolt.loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="data.csv")
data = loader.load()
```
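For sources the built-in loaders do not cover, `loaders/custom_loader.py` exposes a base class to subclass. Its exact interface is not documented here, so the following is a hypothetical sketch of the pattern; both the `BaseLoader` base class and the `JSONLinesLoader` subclass are invented for illustration:

```python
import json
from abc import ABC, abstractmethod

# Sketch of the custom-loader pattern; the real base class in
# datavolt.loaders.custom_loader may define a different interface.
class BaseLoader(ABC):
    @abstractmethod
    def load(self):
        """Return the dataset as a list of records."""

class JSONLinesLoader(BaseLoader):
    """Hypothetical loader for newline-delimited JSON files."""
    def __init__(self, file_path):
        self.file_path = file_path

    def load(self):
        with open(self.file_path) as f:
            return [json.loads(line) for line in f if line.strip()]
```

A loader written this way plugs into the rest of the workflow exactly like the built-in ones: call `load()` and pass the result to a preprocessing pipeline.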
Use the preprocessing modules to clean and transform your data:

```python
from datavolt.preprocess.pipeline import PreprocessingPipeline
from datavolt.preprocess.scaling import StandardScaler
from datavolt.preprocess.encoding import OneHotEncoder

pipeline = PreprocessingPipeline([
    StandardScaler(),
    OneHotEncoder(),
])
preprocessed_data = pipeline.run(data)
```
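Steps supplied to the pipeline implement a common interface, and `ext/custom_step.py` provides hooks for adding your own. The real interface may differ; as a self-contained illustration, here is a hypothetical `MinMaxStep` with a single `run` method:

```python
# Hypothetical custom step; the interface expected by
# PreprocessingPipeline and ext/custom_step.py may differ.
class MinMaxStep:
    """Rescale a list of numbers to the range [0, 1]."""
    def run(self, data):
        lo, hi = min(data), max(data)
        span = hi - lo or 1  # avoid division by zero on constant data
        return [(x - lo) / span for x in data]

step = MinMaxStep()
print(step.run([2, 4, 6]))  # [0.0, 0.5, 1.0]
```

A step like this would be listed alongside the built-in ones when constructing the pipeline.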
Pass the preprocessed data into the model training module:

```python
from datavolt.model.trainer import ModelTrainer

trainer = ModelTrainer(model="random_forest", parameters={"n_estimators": 100})
trained_model = trainer.train(preprocessed_data, labels)
```
Evaluate the model and save it for deployment:

```python
from datavolt.model.evaluator import Evaluator
from datavolt.model.model_export import ModelExporter

evaluator = Evaluator()
metrics = evaluator.evaluate(trained_model, test_data, test_labels)
print(metrics)

exporter = ModelExporter()
exporter.save(trained_model, "models/random_forest.pkl")
```
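`model/hyperparameter.py` covers hyperparameter optimization. Its API is not shown here, but the core idea can be sketched as a standalone grid search; the `grid_search` helper and the toy scoring function below are illustrative, not DataVolt code:

```python
from itertools import product

# Illustrative exhaustive grid search -- datavolt.model.hyperparameter
# may expose a different interface.
def grid_search(score_fn, grid):
    """Return the parameter combination with the highest score."""
    best_params, best_score = None, float("-inf")
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function: prefers more estimators and a max_depth of 5.
score = lambda p: p["n_estimators"] / 100 - abs(p["max_depth"] - 5)
best, _ = grid_search(score, {"n_estimators": [50, 100], "max_depth": [3, 5]})
print(best)  # {'n_estimators': 100, 'max_depth': 5}
```

In practice the scoring function would wrap training and evaluation (e.g. cross-validated metrics from `Evaluator`) rather than a closed-form expression.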
DataVolt addresses key challenges in the modern data engineering landscape:
- Reusability: Standardizes and modularizes workflows to prevent redundant code.
- Consistency: Ensures uniform data transformations across projects.
- Efficiency: Reduces the time spent on routine preprocessing and model setup tasks.
- Scalability: Adapts easily to different data sources and project scales.
In a machine learning workflow, DataVolt can:
- Load large datasets from cloud storage.
- Clean and preprocess them for feature selection.
- Automate model training and hyperparameter tuning.
- Track experiment metrics and export production-ready models.
Contributions are welcome! To contribute:
- Fork the repository.
- Create a feature branch.
- Commit your changes and submit a pull request.
DataVolt is licensed under the MIT License. See the LICENSE file for details.
For questions, issues, or feature requests, please open a GitHub issue or contact me at allanw.mk@gmail.com.