Scaling MySQL Beyond Kubernetes: When Backups Become The Blocker

We ran MySQL in Kubernetes for years. It worked great.

Then the data grew.

From GB to hundreds of GB to TB. And suddenly, Kubernetes wasn’t the problem—backups were.

This is the story of why we moved MySQL off Kubernetes, but not because Kubernetes failed. It failed because the backup tool was never designed for databases at our scale.

The Years It Worked

Kubernetes was fine for MySQL when:

Data was small (single-digit GB)
Backups were quick
Recovery was rare
Kasten (the Kubernetes backup tool) could handle snapshots

We had everything infrastructure engineers want:

Infrastructure as code (YAML files)
Automatic failover (StatefulSets)
Self-healing (pod restarts)
Version-controlled infrastructure

It wasn’t elegant, but it worked. We weren’t thinking about moving it.

The Scaling Problem

Data grew. Fintech platforms have that problem—transactional data multiplies.

We moved from:

Single GB backups (minutes to complete, hours to verify recovery)
To 100GB backups (hours to complete, restore became risky)
To TB backups (overnight backups, recovery windows measured in days)

Suddenly, backup operations became the critical path for the entire infrastructure.

Why Kasten Failed at Scale

Kasten is designed for Kubernetes workloads. It’s great for:

Restoring a lost pod
Backing up application state
Snapshot-based recovery

It’s terrible for:

Large, consistent database exports
Incremental backups
Point-in-time recovery
Hot backup management

The root cause: Kasten doesn’t understand databases.

It treats MySQL the same as it treats stateless services—snapshots and restore. But databases need:

Consistent, logical backups (not raw snapshots)
Incremental backups (TB of data can’t be full-backed daily)
Recovery verification (you need to test that backups actually work)
Point-in-time recovery (financial systems need this)

When backups hit TB scale, Kasten couldn’t keep up. Recovery testing took days. Backup failures happened often. The team spent cycles babysitting backups instead of shipping features.

The Decision: Move to VMs

Not because Kubernetes was broken. Because the backup requirement broke Kubernetes.

The solution: Move MySQL to Debian VMs and implement proper database backups.

This sounds like a step backward. It’s not. It’s recognizing that databases need specialized infrastructure.

The Implementation: XtraBackup + S3

We didn’t just move the database. We built a backup solution designed for MySQL at scale.

Three components:

Backup script - Full and incremental backups with S3 sync
Restore script - Automatic decompression, decryption, preparation
Analysis tool - Backup chain verification and integrity checking

Key features:

Automatic tool detection (MySQL vs Percona vs MariaDB vs Galera)
Incremental backup chains that track relationships
Compression and encryption built-in
Smart retention (never deletes a chain with recent incrementals)
Point-in-time recovery (restore up to any incremental in the chain)
Local or S3 storage modes
Dry-run support for safety

Workflow:

Sunday:   Full backup + cleanup of orphaned chains
Mon-Sat:  Incremental backups every 6 hours
Weekly:   Analyze all backup chains
Monthly:  Verify integrity of recent backups

Result:

TB-scale MySQL with reliable backups. Recovery testing works. The team sleeps.

What Changed Operationally

Before (Kubernetes):

Backup operations were uncertain
Recovery testing took days
Backup failures were common
Team spent time debugging Kasten
Scaling meant hoping backups still worked

After (VMs + XtraBackup):

Backups complete reliably every 6 hours
Recovery testing takes hours, not days
Backup failures are rare (and when they happen, they’re database problems, not infrastructure)
Team maintains simple shell scripts, not Kubernetes operators
Scaling means adding more incremental backups, not rearchitecting

The Real Lesson

Kubernetes is great for compute. It’s not a data platform.

When people say “run everything on Kubernetes,” they usually mean “run stateless services on Kubernetes.”

Databases are different. They’re stateful, they’re I/O sensitive, and they have specialized operational requirements.

For databases:

Use managed services (RDS, Cloud SQL) if available
Use dedicated infrastructure (VMs with proper backup tooling) if not
Don’t use container orchestration platforms designed for stateless services

This isn’t a failure of Kubernetes. It’s recognizing the right tool for the job.

What We Learned

Backup strategy determines infrastructure choices
- For large databases, backup design is the primary constraint
- Kubernetes has no good answer for this
- Database-native tools (XtraBackup) exist for a reason
Scaling exposes wrong choices
- At GB scale, Kubernetes for MySQL is fine
- At TB scale, it’s untenable
- If you find yourself fighting your infrastructure, you have the wrong tool
Operational simplicity matters
- Simple scripts > complex operators
- Database-native tools > infrastructure-generic tools
- Team can maintain shell scripts; Kubernetes expertise isn’t required
Cloud-native ≠ Kubernetes
- Cloud-native means using cloud resources effectively
- Sometimes that’s Kubernetes, sometimes it’s managed services, sometimes it’s VMs
- Choose the tool that solves your problem with the least operational overhead

The Trade-off

What we gave up:

Cluster elasticity (VMs are fixed-size)
The feeling of being “modern” (VMs feel old-fashioned)

What we got:

Infrastructure as code via Ansible (VMs fully automated and reproducible)
Reliable backups that actually work
Point-in-time recovery capability (validated with full recovery testing for compliance)
Team doesn’t spend cycles on infrastructure
Operational simplicity
Significantly fewer production incidents related to data loss

The trade-off was worth it.

Menu