MongoDB Sharded Cluster Database - Data Protection

Modern, Fast, & Application-Consistent Data Protection for MongoDB Sharded Clusters Powered by NDB Time Machine and MongoDB Ops Manager

By Saravana Selvaraj, Staff Engineer
and Anand Chandak, Group Product Manager NDB

MongoDB sharded clusters power mission-critical applications across industries, from FinTech to e-commerce to global SaaS platforms. These environments demand not just scale but enterprise-class data protection: fast backups, consistent restores, low overhead, and full automation across a distributed database architecture.

To meet these needs, Nutanix Database Service (NDB) extends its powerful advanced data-protection framework, Time Machine, to support MongoDB sharded clusters. This integration combines the deep MongoDB native intelligence of MongoDB Ops Manager with the snapshot performance and storage efficiencies of the Nutanix platform, delivering a solution designed for modern, large-scale MongoDB workloads. 

This blog explores how NDB orchestrates application-consistent online backups via cluster-consistent snapshots, coarse-grained oplog catchups, and cluster-wide coordinated point-in-time restores at seconds granularity.

Advanced Data Protection for MongoDB Sharded Clusters

Sharded MongoDB databases are distributed by design—multiple shards, each composed of replica sets, alongside a config‑server replica set. Protecting such an environment requires:

  • Coordinated, cluster-wide consistency
  • High‑performance snapshotting with no application downtime
  • Continuous ingestion of oplogs for PITR (Point‑In‑Time Restore) at second-level granularity
  • The ability to recover every node across every shard of the sharded cluster to the same logical moment

Note: Traditional backup tools like mongodump aren’t designed for this complexity.

Nutanix NDB and MongoDB Ops Manager integration

First, Ops Manager is onboarded into NDB, and then Time Machine is enabled to provide the integrated backup and restore capabilities.

Reference: To learn how to onboard MongoDB Ops Manager into Nutanix NDB, refer here

MongoDB Ops Manager exposes third-party-based backup APIs that allow storage and data-protection vendors to integrate their own snapshot technologies while Ops Manager maintains MongoDB-level consistency.

Third-party-based APIs provide a means for backup vendors like Nutanix to coordinate the overall backup and restore process. This model requires the vendor to trigger and orchestrate the complete snapshot and restore workflows, while MongoDB Ops Manager controls the MongoDB-native operations required to achieve consistency across shards.

A well-defined, state-transition-driven model helps both integrating parties coordinate throughout the process, which leads to consistent behavior and consistent results.
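To illustrate what a state-transition-driven contract can look like, here is a minimal sketch that uses only the backup states named in this post (PENDING, READY, FINISHING, FINISHED). The transition table and the `advance` helper are hypothetical and are not the Ops Manager API. In particular, the READY to FINISHING step is an assumption made to connect the two transitions the post describes.

```python
from enum import Enum


class BackupState(Enum):
    PENDING = "PENDING"
    READY = "READY"
    FINISHING = "FINISHING"
    FINISHED = "FINISHED"


# Assumed linear ordering. Only PENDING -> READY and FINISHING -> FINISHED
# are named in this post; READY -> FINISHING is an illustrative guess.
ALLOWED = {
    BackupState.PENDING: {BackupState.READY},
    BackupState.READY: {BackupState.FINISHING},
    BackupState.FINISHING: {BackupState.FINISHED},
    BackupState.FINISHED: set(),
}


def advance(current: BackupState, target: BackupState) -> BackupState:
    """Return target if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Because each party checks the current state before acting, neither side can, for example, take storage snapshots before the other has reported that it is ready.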

NDB leverages this integration to achieve:

  • MongoDB-aware data consistency through backup cursors
  • Distributed snapshot orchestration through NDB
  • Near-instant, size-independent snapshot capture using Nutanix’s patented snapshot technology
  • Space-efficient backup and policy-based retention on the Nutanix storage fabric

This partnership enables NDB to capture backup data quickly and in strict coordination with MongoDB’s internal state.

Backup cursors are a MongoDB feature exposed through Ops Manager that provide a consistent view of data in each replica set.

NDB Time Machine Capabilities for MongoDB Sharded Clusters

NDB Time Machine consists of three core protection pillars:

  • Snapshot Backups – Full cluster‑consistent recovery points
  • Log Catchups – Continuous oplog ingestion
  • Restores (Snapshot-based & PITR) – Distributed, coordinated recovery across all shards at cluster level

1. Snapshot Backups – Near-Instantaneous & Cluster Consistent

Snapshot backups represent complete, consistent restore points of the entire sharded cluster: all shards plus the config server. The process is designed to be fully online, so that MongoDB transactions continue throughout.

Nutanix snapshots leverage a redirect-on-write algorithm that makes data protection lightweight by design. When a snapshot is triggered, Nutanix creates a new vDisk with read/write access. Both the original snapshotted vDisk and the new vDisk reference the same underlying block map, with zero new data blocks created at snapshot time. Because this is predominantly a metadata operation, snapshot creation is near-instantaneous and largely independent of dataset size: snapshots of a 500 GB dataset and a 50 TB dataset complete in comparable time. The I/O overhead is minimal, with negligible impact on the running MongoDB workload during the snapshot window. For deeper technical details on the underlying mechanism, refer to the Snapshots and Clones section of the Nutanix Bible.
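The idea behind redirect-on-write can be shown with a toy model. This is a conceptual sketch only, not Nutanix's implementation: the `VDisk` and `Store` classes and their methods are hypothetical. Taking a snapshot creates a new, empty vDisk that points back at the frozen one, so no data blocks are copied. Later writes land in fresh blocks that only the new vDisk references.

```python
class VDisk:
    """Toy vDisk: maps logical block numbers to physical block ids."""

    def __init__(self, parent=None):
        self.parent = parent   # frozen vDisk this one was created from
        self.block_map = {}    # only blocks written since the snapshot

    def lookup(self, lbn):
        # Walk back through the snapshot chain until the block is found.
        disk = self
        while disk is not None:
            if lbn in disk.block_map:
                return disk.block_map[lbn]
            disk = disk.parent
        raise KeyError(lbn)


class Store:
    def __init__(self):
        self.blocks = {}       # physical block id -> data
        self._next_id = 0

    def write(self, vdisk, lbn, data):
        # Redirect-on-write: new data always goes to a fresh physical block.
        pid = self._next_id
        self._next_id += 1
        self.blocks[pid] = data
        vdisk.block_map[lbn] = pid

    def read(self, vdisk, lbn):
        return self.blocks[vdisk.lookup(lbn)]

    def snapshot(self, vdisk):
        # Metadata-only: the caller treats `vdisk` as read-only from now on
        # and continues writing to the returned vDisk.
        return VDisk(parent=vdisk)
```

In this model the cost of `snapshot` does not depend on how many blocks the vDisk holds, which is the property the paragraph above describes.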

The data node backed up for each shard and for the config server is known as the Eligible & Available (EA) node, chosen according to the configured backup policy. The backup policy lets the user specify which data node to snapshot: Primary or Secondary.
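As a rough illustration of EA node selection, the sketch below picks, for one replica set, the first reachable node whose role matches the policy. The `Node` type and `pick_ea_node` function are hypothetical. Whether NDB falls back to another role when no matching node is available is not stated here, so this sketch simply returns None in that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    name: str
    role: str          # "PRIMARY" or "SECONDARY"
    available: bool    # reachable and healthy


def pick_ea_node(nodes: list[Node], preferred_role: str) -> Optional[Node]:
    """Return the first available node with the preferred role, or None."""
    for node in nodes:
        if node.available and node.role == preferred_role:
            return node
    return None
```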

The expanded workflow is explained in phases below.

Snapshot Workflow Overview

Phase 1 — Initiate

A snapshot may be triggered by a schedule, or on demand.
When initiated:

  • NDB randomly selects a Mongos server (since a sharded cluster can have one or more Mongos Servers) and orchestrates the entire operation via NDB Agent
  • NDB identifies every snapshot as a full backup, helping enable independent recoverability
  • A single available Mongos server is enough to manage the snapshot of the entire sharded cluster

Phase 2 — Discover

The orchestrating NDB Agent on Mongos:

  • Identifies Eligible & Available (EA) nodes for every shard and config server
  • Performs discovery in parallel, significantly reducing runtime
  • Requests Ops Manager to create a snapshot handle, representing this unified backup event

Phase 3 — Start

  • NDB Agent on Mongos instructs Ops Manager to initiate the backup
  • Ops Manager, via the MongoDB Agents on every data‑bearing node and config server, opens backup cursors and marks the start of a consistent backup window.
  • NDB Agent on Mongos waits for Ops Manager to transition the backup state from PENDING → READY
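Waiting for a state change like this is typically a bounded polling loop. The sketch below is a generic version under that assumption. `wait_for_state` and its `get_state` callback are hypothetical names: the callback stands in for whatever call reports the current backup state, and the timeout values are placeholders.

```python
import time
from typing import Callable


def wait_for_state(get_state: Callable[[], str], target: str,
                   timeout_s: float = 300.0, interval_s: float = 5.0) -> None:
    """Poll get_state() until it returns target, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == target:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"state is {state!r}, expected {target!r}")
        time.sleep(interval_s)
```

Using a monotonic clock for the deadline keeps the timeout correct even if the wall clock is adjusted while the loop is running.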

Phase 4 — Snapshot

This is where Nutanix’s infrastructure shines:

  • NDB triggers parallel snapshots across data and journal disks of all EA nodes
  • Each shard, as well as the config server, receives its own snapshot
  • NDB consolidates these into a single logical snapshot group representing the backup for the entire cluster
  • Snapshot capture time is largely independent of data size, even at multi-TB scale
  • Snapshots are stored on the Nutanix Distributed Storage Fabric, benefiting from storage efficiency
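The fan-out and consolidation in this phase can be sketched with a thread pool. This is an illustration under assumptions, not NDB's code: `snapshot_cluster` and the `take_snapshot` callback are hypothetical, and the callback stands in for the storage-level snapshot call on one EA node.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def snapshot_cluster(ea_nodes: dict[str, str],
                     take_snapshot: Callable[[str], str]) -> dict[str, str]:
    """Snapshot every EA node in parallel; return {replica_set: snapshot_id}.

    ea_nodes maps a replica-set name (a shard or the config server) to its
    EA node. Any failure propagates, so a partial group is never returned.
    """
    if not ea_nodes:
        return {}
    with ThreadPoolExecutor(max_workers=len(ea_nodes)) as pool:
        futures = {rs: pool.submit(take_snapshot, node)
                   for rs, node in ea_nodes.items()}
        return {rs: fut.result() for rs, fut in futures.items()}
```

The returned dictionary plays the role of the logical snapshot group: one entry per shard and one for the config server, treated as a single unit.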

Phase 5 — Finalize

  • NDB Agent on Mongos instructs Ops Manager to finalize the backup
  • Ops Manager, through the MongoDB Agents, closes all backup cursors and completes the backup workflow
  • Backup state transitions from FINISHING → FINISHED

Phase 6 — Conclude

  • NDB catalogs snapshot metadata and Ops Manager backup metadata in its repository
  • The snapshot is now ready for restores or PITR workflows

Resiliency: Every task is designed to be idempotent and re-entrant, so retries behave predictably. Configurable retry counts at each interaction point maximize the chance of success in environments involving multiple subsystems, any of which is vulnerable to failure.
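A retry wrapper is only safe when the wrapped task is idempotent, which is why the two properties go together. The sketch below shows a generic bounded retry. The `with_retries` name and its defaults are hypothetical, and the real retry policy is governed by NDB configuration that is not shown here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(task: Callable[[], T], attempts: int = 3,
                 delay_s: float = 1.0) -> T:
    """Run an idempotent task, retrying on any exception up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            time.sleep(delay_s)
    raise AssertionError("unreachable")
```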

2. Log Catchups – Continuous Oplogs Protection

Oplog backups enable restores to any precise second within the retention window.

Time Machine coordinates this with Ops Manager to achieve consistency and reliability. 

The workflow is described in detail below:

Log Catchup Workflow Overview

Phase 1 — Initiate

  • Triggered via schedule or on demand
  • A Mongos server randomly selected by NDB orchestrates the process via the NDB Agent
  • NDB is designed to prevent concurrent oplog backups from running
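One common way to keep two runs of the same job from overlapping is a non-blocking lock: the second caller fails fast instead of queuing. The sketch below shows that pattern inside a single process. It is an assumption about the general technique, not a description of NDB's mechanism, and a distributed system would need cluster-wide coordination. The `single_oplog_backup` name is hypothetical.

```python
import threading
from contextlib import contextmanager

_oplog_backup_lock = threading.Lock()


@contextmanager
def single_oplog_backup():
    """Allow one oplog backup at a time; raise if one is already running."""
    if not _oplog_backup_lock.acquire(blocking=False):
        raise RuntimeError("an oplog backup is already in progress")
    try:
        yield
    finally:
        _oplog_backup_lock.release()
```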

Phase 2 — Discover

  • NDB Agent discovers EA nodes for oplogs capture across all shards and the config server
  • If the EA nodes have changed, NDB updates Ops Manager’s preferred nodes
  • Ops Manager creates an oplogs snapshot handle

Phase 3 — Start

  • NDB Agent on Mongos instructs Ops Manager to start the oplogs backup
  • NDB waits until the backup state moves from PENDING → READY
  • Ops Manager provides a list of consistent oplogs to copy for each shard/config server

Phase 4 — Protect Oplogs

  • NDB copies only oplogs validated by Ops Manager
  • Oplogs are stored in the Nutanix Object Store, an S3-compatible object storage service that is an integral part of the Nutanix distributed data fabric. When WORM retention policies are enabled on the Nutanix Object Store, they can also protect the oplogs
  • Each shard produces a unique oplogs group, helping promote shard‑wise consistency for point-in-time restores, at seconds granularity
  • All shard oplogs are backed up to the Nutanix Object Store in parallel for optimal performance

Phase 5 — Finalize

  • NDB Agent on Mongos instructs Ops Manager to finalize the oplogs backup
  • Ops Manager instructs backup agents to delete copied oplogs from local nodes
  • Backup state transitions to FINISHED

Phase 6 — Conclude

  • NDB stores oplogs metadata and associated directory details in its repository
  • Oplogs are now available for PITR

3. Restore Operations – Cluster‑Wide Recovery & PITR to the second

Restore operations in sharded environments require precise orchestration—every shard must be restored to the same logical point, even though they are backed up independently.

NDB Time Machine’s in-place restore workflow supports:

  • Snapshot‑based full backup restores
  • Point-In-Time Restores (PITR) at seconds granularity, using snapshots + oplogs

The restore process is best described in phases:

Restore Workflow Overview

Phase 1 — Initiate

  • NDB pauses Time Machine, helping prevent new backups from executing during recovery
  • The restore request lands on a randomly selected Mongos server, which orchestrates the workflow via NDB Agent
  • Prerequisite: all shards and nodes of the cluster must be online and reachable, as required for any third-party-managed restore using Ops Manager

Phase 2 — Prepare Ops Manager

  • NDB extracts snapshot metadata needed for the restore
  • NDB creates a restore handle in Ops Manager using a third-party API, providing snapshot metadata as the key
  • For PITR
    • The nearest valid snapshot is automatically selected
    • Required per‑shard oplogs ranges are identified down to the second
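The PITR planning step can be sketched as follows. Because oplogs are replayed forward, this sketch assumes that the "nearest valid snapshot" is the latest snapshot taken at or before the requested time. The `plan_pitr` function is hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def plan_pitr(snapshot_times: list[datetime],
              target: datetime) -> tuple[datetime, datetime]:
    """Return (base_snapshot_time, replay_until) for a point-in-time restore.

    snapshot_times must be sorted ascending. The target is truncated to whole
    seconds, and the base is the latest snapshot at or before it. Oplog entries
    after the base and up to replay_until are then required for each shard.
    """
    target = target.replace(microsecond=0)
    idx = bisect_right(snapshot_times, target)
    if idx == 0:
        raise ValueError("no snapshot at or before the requested time")
    return snapshot_times[idx - 1], target
```

For example, with snapshots at 00:00, 06:00, and 12:00, a restore request for 07:00:30 would use the 06:00 snapshot and replay oplogs from 06:00 up to 07:00:30.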

Phase 3 — Prepare Data Nodes

Ops Manager requires that all nodes of a shard recover from identical data.

So NDB:

  • Determines where each shard’s nodes are located across Nutanix clusters
  • If nodes/replicas of a shard span multiple Nutanix clusters, NDB performs Snapshot Replication to promote uniform data availability
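The placement check above amounts to a set difference per shard: the clusters that host the shard's nodes, minus the cluster that already holds the snapshot. A minimal sketch follows, with hypothetical names (`plan_replication`, `placement`, `snapshot_home`) and under the assumption that each shard's snapshot initially resides on exactly one Nutanix cluster.

```python
def plan_replication(placement: dict[str, dict[str, str]],
                     snapshot_home: dict[str, str]) -> dict[str, set[str]]:
    """Return {shard: clusters that still need a copy of the snapshot}.

    placement maps shard -> {node name: Nutanix cluster hosting that node}.
    snapshot_home maps shard -> the cluster where its snapshot was taken.
    Shards whose nodes all sit on the snapshot's cluster are omitted.
    """
    plan = {}
    for shard, nodes in placement.items():
        missing = set(nodes.values()) - {snapshot_home[shard]}
        if missing:
            plan[shard] = missing
    return plan
```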

Phase 4 — Start

  • NDB Agent on Mongos triggers Ops Manager to begin the restore
  • Ops Manager transitions the restore job state from INITIAL → COPY_FILES

Phase 5 — Restore Snapshot & Oplogs

  • NDB restores the correct snapshot to every shard node and config server
  • For PITR
    • Shard-specific oplogs are copied and applied with second-level precision
  • Restored datasets are now aligned at a cluster-consistent point in time

Phase 6 — Recover

  • NDB Agent on Mongos instructs Ops Manager to perform MongoDB-native recovery
  • Ops Manager, through the MongoDB Agents on the target nodes, runs the native MongoDB recovery workflow.
  • Ops Manager completes the recovery and transitions state from RECOVERY_IN_PROGRESS → COMPLETED

Phase 7 — Conclude

  • NDB marks the restore as complete
  • Database returns to READY state
  • Time Machine is resumed
  • A post‑restore snapshot is automatically triggered

Note: When the nodes of a shard are hosted across multiple Nutanix clusters, NDB uses Nutanix snapshot-based replication to make identical data available on every node with a quick turnaround time.

Restore Behavior – Forward‑Only Design

If a restore operation fails:

  • The database enters state RESTORE_FAILED
  • Time Machine stays PAUSED
  • No new backups occur until the issue is corrected and restore is completed

This design helps reduce the risk of restore operations corrupting backup chains, much like forward-only database upgrades.
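The forward-only behavior can be modeled as a small state holder: a failed restore leaves the database in RESTORE_FAILED with backups paused, and only a later successful restore clears it. This is a simplified sketch using the state names from this post. The `TimeMachine` class and its methods here are hypothetical and are not NDB's API.

```python
class TimeMachine:
    """Toy model of forward-only restore gating."""

    def __init__(self):
        self.db_state = "READY"
        self.paused = False

    def restore(self, do_restore) -> None:
        self.paused = True                    # no backups during recovery
        try:
            do_restore()
        except Exception:
            self.db_state = "RESTORE_FAILED"  # stay paused, no rollback
            raise
        self.db_state = "READY"
        self.paused = False                   # resume only after success

    def can_backup(self) -> bool:
        return not self.paused and self.db_state == "READY"
```

Note that the failure path never un-pauses: the only way forward is to fix the underlying issue and complete a restore.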

Time Machine Capabilities and Health

The health of a MongoDB sharded cluster database is the combined health of all its shards and config servers. Both the database and Time Machine provide a single view of the entire MongoDB sharded cluster, so Time Machine capabilities and operations apply to the cluster as a whole.

Summary

NDB extends its powerful Time Machine framework to MongoDB sharded clusters, delivering consistent, storage‑efficient, and application‑aware data protection. 

By integrating with MongoDB Ops Manager’s third-party backup APIs, NDB orchestrates cluster-wide snapshots, oplog backups, and PITR workflows while Ops Manager enables MongoDB-native coordination across shards and config servers.

Snapshots are near-instantaneous and captured in parallel across the cluster. Log catchups provide continuous protection via oplog extraction, storage, and retention on the Nutanix object store.

Restore workflows combine NDB’s snapshot intelligence with Ops Manager’s recovery engine to enable point‑in‑time or snapshot‑based recovery across all shards. Unlike other engines, sharded cluster restores are forward‑only—failure places the database in a RESTORE_FAILED state until the issue is resolved. Once Time Machine is enabled, all backup, log, and restore operations operate at a unified cluster scope, giving users a single, consistent operational experience for large‑scale MongoDB deployments.

Feature Availability: The integrated solution for application-consistent backups and restores of MongoDB sharded cluster databases is available from NDB release 2.10 and MongoDB Ops Manager release 8.0.19 onwards.

©2026 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product and service names mentioned are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned are for identification purposes only and may be the trademarks of their respective holder(s).