From Data Science to Production: MLOps CI/CD Pipeline with Real-time Accuracy Monitoring

The Problem

Your data science team trained an amazing ML model. Accuracy on test data: 94%.

You deploy it to Kubernetes. It runs great for 2 weeks.

Then accuracy silently drops to 72%.

Nobody notices for 10 days. By the time you catch it, the model has made 50,000 bad predictions.

This is the MLOps problem in a nutshell: models decay, and without monitoring, you won’t know until it’s too late.

The Reality: Why Models Fail in Production

It’s Not the Training Code That Breaks

Your training pipeline works fine. You can rerun it on the same data, get the same model. The code is solid.

The problem is data drift. In production, your input data changes:

User behavior shifts seasonally
Feature distributions morph over time
Edge cases you never saw in training appear in production
The world changes, but your model doesn’t

The Traditional Approach (Manual and Fragile)

Most teams handle ML deployment like this:

Data scientist trains a model locally
“It’s ready” - they export a pickle file or ONNX model
Engineer manually packages it into a Docker image
It gets deployed to Kubernetes (maybe)
It runs for a while
Something breaks (nobody’s monitoring)
Someone manually checks logs
Maybe they retrain, maybe they downgrade the model
Repeat 6-8 three months later

The problem: Every step is manual. No automation. No feedback loop. No recovery.

Why This Breaks at Scale

No versioning: Which model version is running? What data trained it? Unknown.
No testing: Did you test the model against a held-out dataset? Just hope it works.
No monitoring: Is accuracy degrading? You won’t know until someone complains.
No retraining: When drift happens, you manually trigger retraining (if you remember).
No governance: Audit trail? Compliance? Good luck explaining to regulators.

The Solution: MLOps with Kubernetes + Prometheus + Automated Monitoring

MLOps is the engineering discipline that brings DevOps practices to machine learning.

Instead of manually managing model deployment, you:

Automate the entire pipeline - data validation → training → testing → deployment
Monitor model performance - accuracy, precision, recall, and data drift in real-time
Trigger retraining automatically - when drift is detected or on schedule
Test before deployment - validate model accuracy matches thresholds
Deploy safely - staged rollouts, canary deployments, instant rollback

Our Approach: Kubernetes-Native MLOps

We built a production ML system on Kubernetes with these components:

1. Automated Training Pipeline (Orchestration)

Using Kubernetes CronJobs + custom training containers:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ml-retraining-pipeline
spec:
  schedule: "0 2 * * *"  # Runs daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: trainer
            image: myrepo/ml-trainer:latest
            env:
            - name: TRAINING_DATASET
              value: "s3://ml-data/training/"
            - name: MODEL_REGISTRY
              value: "http://mlflow:5000"

What it does:

Fetches fresh data from S3 (or your data warehouse)
Validates data quality (schema, statistical properties, anomalies)
Trains the model from scratch
Compares new model accuracy against baseline
If accuracy passes threshold: register in MLflow Model Registry
If accuracy fails: alert, don’t deploy

2. Model Registry (Version Control for Models)

Using MLflow Model Registry:

Every trained model is registered with metadata
Tracks training dataset version, hyperparameters, metrics
Maintains full lineage (which code, which data, which environment)
Enables rollback to previous model versions

Model: fraud-detector
├── Version 1 (staging) - acc: 91.2%
├── Version 2 (staging) - acc: 93.1%
└── Version 3 (production) - acc: 93.1% (deployed 2 days ago)

3. Staged Deployment (Testing Before Production)

Data Flow: Training Data → Validation → Staging → Production

Staging Environment:
- Receives 10% of live traffic (shadow mode)
- Model makes predictions but doesn't affect users
- Prometheus compares staging accuracy vs baseline

If staging accuracy meets threshold (93%+) → Approve for production

4. Real-time Accuracy Monitoring (Prometheus)

This is the critical piece. You need to track model accuracy in production:

# Prometheus config for ML metrics
global:
  scrape_interval: 15s

scrape_configs:
- job_name: 'ml-inference'
  static_configs:
  - targets: ['ml-service:8000']

Your ML service exposes metrics:

from prometheus_client import Counter, Gauge, Histogram

# Counters
predictions_total = Counter('ml_predictions_total', 'Total predictions', ['model', 'outcome'])
correct_predictions = Counter('ml_correct_predictions_total', 'Correct predictions', ['model'])

# Gauges
model_accuracy = Gauge('ml_model_accuracy', 'Current model accuracy', ['model'])
data_drift_score = Gauge('ml_data_drift_score', 'Feature distribution drift', ['model'])

# In your inference code:
prediction = model.predict(features)
actual_label = get_ground_truth(request_id)  # From async feedback loop

if prediction == actual_label:
    correct_predictions.labels(model=model_name).inc()

predictions_total.labels(model=model_name, outcome='correct' if prediction == actual_label else 'incorrect').inc()

5. Automated Alerts on Drift

Prometheus alerting rules:

groups:
- name: ml_alerts
  interval: 30s
  rules:
  - alert: ModelAccuracyDegraded
    expr: ml_model_accuracy < 0.90  # Alert if accuracy drops below 90%
    for: 5m
    annotations:
      summary: "Model accuracy dropped to {{ $value }}"
      action: "Check data drift. Consider retraining."

  - alert: DataDriftDetected
    expr: ml_data_drift_score > 0.15
    for: 10m
    annotations:
      summary: "Feature distribution shifted significantly"
      action: "Trigger emergency retraining"

When accuracy drops below 90%, you:

Get notified immediately
Automatically trigger retraining
Compare new model vs baseline
If better: deploy to staging
If worse: rollback to previous version

Practical Example: Credit Risk Model

The Setup

You have a credit risk model that predicts loan default probability.

Training: 100K historical loans, 2% default rate
Production: Running on Kubernetes, serving 500 requests/day
Accuracy in test set: 94%

Week 1: Everything Works

Model accuracy: 93.8% ✓
Data drift: 0.05 (normal)
All systems green

Week 3: Silent Failure

Economy shifts: unemployment rises
Default patterns change
Model still predicts based on old economic patterns
Model accuracy: 68% (unnoticed for 3 days)
1,200 bad predictions made

Without MLOps: You find out when the business notices the loan portfolio is worse than expected.

With MLOps:

Day 1, hour 3: Prometheus alert fires
Alert triggers automated retraining
New model trained on last week’s data (with new economic patterns)
New model accuracy: 89% (not perfect, but better)
Model deployed to staging automatically
Accuracy verified in shadow mode
Model promoted to production
Total downtime: 2 hours instead of 72 hours

The MLOps Maturity Progression

Level 0: Manual Everything (Typical Today)

Train model locally
Export pickle/ONNX manually
Docker image built manually
Deployed manually to Kubernetes
Zero monitoring
Zero retraining automation

Level 1: Automated Training (Easy Starting Point)

Training pipeline triggers on schedule (CronJob)
Data validation in pipeline
Model automatically registered in MLflow
Still manual deployment decision

Level 2: Full CI/CD (Our Approach)

Automated training on schedule + drift detection
Automated testing (accuracy threshold checks)
Automated deployment to staging (shadow mode)
Real-time accuracy monitoring (Prometheus)
Automated promotion to production if staging passes
Automated retraining on drift detection
Instant rollback if accuracy drops

Implementation Checklist

To build this yourself on Kubernetes:

1. Monitoring & Metrics

Prometheus deployed on cluster
ML service exposes Prometheus metrics (predictions, accuracy, drift)
Grafana dashboards for visualization
Alert rules for accuracy degradation and drift

2. Model Training Pipeline

CronJob for regular retraining
Data validation step (schema, distributions)
Model training containerized
Accuracy testing before registration

3. Model Registry

MLflow deployed
Training pipeline registers models automatically
Metadata tracked (dataset version, hyperparameters, metrics)

4. Staged Deployment

Two ML service deployments (staging + production)
Traffic mirroring or shadow mode for staging
Automated promotion based on accuracy threshold

5. Automated Retraining

Prometheus alerts trigger Kubernetes Job
Or: CronJob checks drift score, triggers if needed
New model compared against baseline
Auto-rollback if model degrades

Key Lessons

Monitor accuracy, not just infrastructure - CPU and memory are fine. Your model might be terrible.
Data drift is silent - Models don’t throw errors. They just gradually get worse.
Staging is critical - Test model accuracy on 10% of real traffic before going 100%
Automate the loop - Without automation, you’ll miss drift because nobody’s monitoring 24/7
Reproducibility matters - Keep full lineage: training data version, code version, hyperparameters, metrics
Speed matters - From alert to retraining to deployment should be minutes, not days

What We’re Not Covering (Yet)

Feature stores (Tecton, Feast) - manage features consistently across training/serving
Advanced drift detection (Evidently AI, WhyLabs) - detect distribution shift automatically
Online learning - retraining on single examples as feedback arrives
Causal inference - understanding why accuracy changed
Multi-armed bandits - A/B testing different models in production

These are Level 3 optimizations. Start with the basics above.

Next Steps

Deploy Prometheus on your Kubernetes cluster
Instrument your ML service with prediction accuracy metrics
Set up alerts for accuracy drops
Automate training with CronJobs
Add staging environment for model testing
Monitor for the first week to catch issues

After one month, you’ll have caught issues that would have been silent for weeks in the old system.

References & Tools

Kubernetes CronJob: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
MLflow Model Registry: https://mlflow.org/docs/latest/registry.html
Prometheus: https://prometheus.io/
ML Monitoring: Evidently AI, WhyLabs, Arize (enterprise options)
Feature Stores: Tecton, Feast, Hopsworks

Menu