Skip to content

Vivianstar/mlops_project

Repository files navigation

mlops_project

This directory contains an ML project based on the default Databricks MLOps Stacks, defining a production-grade ML pipeline for automated retraining and batch inference of an ML model on tabular data. The "Getting Started" docs can be found at https://docs.databricks.com/dev-tools/bundles/mlops-stacks.html.

See the full pipeline structure below. The MLOps Stacks README contains additional details on how ML pipelines are tested and deployed across each of the dev, staging, prod environments below.

MLOps Stacks diagram

Code structure

This project contains the following components:

Component Description
ML Code Example ML project code, with unit tested Python modules and notebooks
ML Resources as Code ML pipeline resources (training and batch inference jobs with schedules, etc) configured and deployed through databricks CLI bundles
CI/CD GitHub Actions workflows to test and deploy ML code and resources

contained in the following files:

mlops_project        <- Root directory. Both monorepo and polyrepo are supported.
│
├── mlops_project       <- Contains python code, notebooks and ML resources related to one ML project. 
│   │
│   ├── requirements.txt        <- Specifies Python dependencies for ML code (for example: model training, batch inference).
│   │
│   ├── databricks.yml          <- databricks.yml is the root bundle file for the ML project that can be loaded by databricks CLI bundles. It defines the bundle name, workspace URL and resource config component to be included.
│   │
│   ├── training                <- Training folder contains Notebook that trains and registers the model with feature store support.
│   │
│   ├── feature_engineering     <- Feature computation code (Python modules) that implements the feature transforms.
│   │                              The output of these transforms get persisted as Feature Store tables. Most development
│   │                              work happens here.
│   │
│   ├── validation              <- Optional model validation step before deploying a model.
│   │
│   ├── monitoring              <- Model monitoring, feature monitoring, etc.
│   │
│   ├── deployment              <- Deployment and Batch inference workflows
│   │   │
│   │   ├── batch_inference     <- Batch inference code that will run as part of scheduled workflow.
│   │   │
│   │   ├── model_deployment    <- As part of CD workflow, deploy the registered model by assigning it the appropriate alias.
│   │
│   │
│   ├── tests                   <- Unit tests for the ML project, including the modules under `features`.
│   │
│   ├── resources               <- ML resource (ML jobs, MLflow models) config definitions expressed as code, across dev/staging/prod/test.
│       │
│       ├── model-workflow-resource.yml                <- ML resource config definition for model training, validation, deployment workflow
│       │
│       ├── batch-inference-workflow-resource.yml      <- ML resource config definition for batch inference workflow
│       │
│       ├── feature-engineering-workflow-resource.yml  <- ML resource config definition for feature engineering workflow
│       │
│       ├── ml-artifacts-resource.yml                  <- ML resource config definition for model and experiment
│       │
│       ├── monitoring-resource.yml           <- ML resource config definition for quality monitoring workflow
│
├── .github                     <- Configuration folder for CI/CD using GitHub Actions.  The CI/CD workflows deploy ML resources defined in the `./resources/*` folder with databricks CLI bundles.
│
├── docs                        <- Contains documentation for the repo.
│
├── cicd.tar.gz                 <- Contains CI/CD bundle that should be deployed by deploy-cicd.yml to set up CI/CD for projects.

Using this repo

The table below links to detailed docs explaining how to use this repo for different use cases.

This project comes with example ML code to train, validate and deploy a regression model to predict NYC taxi fares. If you're a data scientist just getting started with this repo for a brand new ML project, we recommend adapting the provided example code to your ML problem. Then making and testing ML code changes on Databricks or your local machine. Follow the instructions from the project README.

When you're ready to deploy production training/inference pipelines, ask your ops team to follow the MLOps setup guide to configure CI/CD and deploy production ML pipelines.

After that, follow the ML pull request guide and ML resource config guide to propose, test, and deploy changes to production ML code (e.g. update model parameters) or pipeline resources (e.g. use a larger instance type for model training) via pull request.

Role Goal Docs
Data Scientist Get started writing ML code for a brand new project project README
MLOps / DevOps Set up CI/CD for the current ML project MLOps setup guide
Data Scientist Update production ML code (e.g. model training logic) for an existing project ML pull request guide
Data Scientist Modify production model ML resources, e.g. model training or inference jobs ML resource config guide

Setting up CI/CD

This stack comes with a workflow to set up CI/CD for projects that can be found in

.github/workflows/deploy-cicd.yml.

To set up CI/CD for projects that were created through MLOps Stacks with the Project_Only parameter, run the above mentioned workflow, specifying the project_name as a parameter. For example, for the monorepo case:

  1. Setup your repository by initializing MLOps Stacks via Databricks CLI with the CICD_and_Project or CICD_Only parameter.
  2. Follow the MLOps Setup Guide to setup authentication and get the repo ready for CI/CD.
  3. Create a new project by initializing MLOps Stacks again but this time with the Project_Only parameter.
  4. Run the deploy-cicd.yml workflow with the project_name parameter set to the name of the project you want to set up CI/CD for.

NOTE: This project has already been initialized with an instantiation of the above workflow, so there's no need to run it again for project mlops_project.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages