Introduction
Your data science team just got the green light on a multi-million dollar project: building a custom Generative AI model to revolutionize your customer support. They have a new, GPU-packed cluster spinning up in a dedicated analytics VPC. They just need one thing to get started: the data.
All of it. The 2PB of customer interaction logs, call center audio recordings, and chat histories you’ve been diligently storing in your primary object store for the last five years.
"No problem," you think. "This is what replication is for."
You log in to your storage console, create a new replication rule from your production bucket to the new AI training bucket, and click 'Save'. You message the team lead: "Done. The data is on its way."
An hour later, a frustrated message comes back: "The bucket is empty. What's going on? We can't start the training job."
You check the console. Replication is active, but "0 objects" have been moved. Then you discover the critical gap: your object store only replicates new objects as they arrive. It has no built-in mechanism to clone the 2PB of historical data your AI team actually needs.
The high-priority AI initiative takes a back seat while you manually write, test, and babysit a massive data-copying script that is likely to be slow, unreliable, and impossible to scale.
This historical sync gap is a persistent roadblock not just for the world of AI, but for many core business functions:
- The Brownfield DR Plan: Adding a new disaster recovery site to a 5-year-old bucket.
- Data Lake Hydration: Moving existing business data for new analytics initiatives.
- The Failback: Resyncing data to a rebuilt primary datacenter after a failover.
In all these cases, standard replication leaves years of existing data behind. Bridging this gap usually requires complex scripts, external command-line tools, or manual processes that create operational friction and fall short of the integrated experience modern enterprises expect.
The Solution
This is where the Nutanix Objects Storage solution changes the conversation. We treat historical sync not as a separate project, but as a seamless extension of your standard protection policy. Once you establish a replication rule for your new data, you are given a simple, one-click option to sync the entire bucket, eliminating the operational friction of manual scripts.
This integrated approach stands in stark contrast to many public cloud solutions. Often, replicating existing data isn't a simple checkbox but a separate administrative event using Batch Replication tools that forces you into a disjointed workflow:
- Operational Complexity: You cannot simply turn on protection; you must manage separate manifest files and batch jobs just to tell the system what to copy.
- High Latency: You often experience significant delays for an asynchronous inventory report just to get started.
- Cost overhead: You are charged extra fees per job and per object just to bridge the gap in the standard offering.
Nutanix Objects Storage eliminates this complexity. We believe historical sync should not be a separate, paid project but a native property of your policy. Whether it is an initial backlog or a disaster recovery failback, the system handles the complexity. There are no manifests to generate, no extra fees to pay, and no waiting to get started.
Under the Hood: The Architecture of Scale
This simple, one-click experience is the culmination of an architecture designed from the ground up to handle the immense scale of modern object storage. Manually scripting a 2PB data copy is problematic, but synchronizing petabytes of data with billions of objects across a live production system is an entirely different class of engineering challenge. A brute-force approach, in which a single process scans the entire bucket keyspace, simply won't scale.
To solve this, we leveraged the core architecture of Nutanix Objects Storage, in particular its distributed metadata layout. Instead of relying on a single, monolithic index, a bucket's object metadata is divided into partitions that are distributed across the cluster. The solution to scaling Historical Sync rests on two architectural pillars built upon this foundation: a partition-based design and a resilient, self-throttling execution engine.
Partition-based design: In Nutanix Objects Storage, object metadata is distributed across the cluster into smaller, independent chunks called partitions. Each partition manages a specific range of object keys. Sync leverages this native architecture by breaking the potentially monumental task of a bucket scan into several smaller, parallelizable tasks, i.e., one for each partition. This transforms a slow, serial process into a parallel operation that scales with the cluster.
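To make the partition-based fan-out concrete, here is a minimal Python sketch. The partition layout, the `replicated` bookkeeping, and the function names are all illustrative assumptions, not the Nutanix Objects API; the point is simply that each partition can be scanned independently and in parallel.

```python
# Hypothetical sketch of a partition-parallel bucket scan.
# Data structures here are stand-ins, not the real metadata layout.
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition):
    """Scan one metadata partition; return keys still needing replication."""
    return [key for key in partition["keys"] if key not in partition["replicated"]]

def bucket_sync(partitions, max_workers=4):
    """Fan the bucket scan out across partitions so it scales with the cluster."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(scan_partition, partitions)
    # Flatten the per-partition results into the overall replication backlog.
    return [key for part in results for key in part]

partitions = [
    {"keys": ["a1", "a2"], "replicated": {"a2"}},  # key range a*
    {"keys": ["b1"], "replicated": set()},         # key range b*
]
print(bucket_sync(partitions))  # ['a1', 'b1']
```

Because each partition owns a disjoint key range, no coordination is needed between scanners; adding nodes adds scan throughput.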
Bucket Sync Engine: To ensure this background task coexists smoothly with live user traffic, we implemented a sophisticated, self-throttling scan mechanism using a two-cursor approach.
For each partition scan, we use two cursors:
- The Right Cursor (Cr - The Producer): This cursor moves forward through the partition's object list, checking if each object needs to be replicated and queuing it up. It is the "producer" of replication work.
- The Left Cursor (Cl - The Consumer): This cursor follows behind Cr, moving forward only after it confirms that an object has been successfully replicated to the destination.
This design brilliantly solves two problems at once:
- Resource Throttling: We enforce a maximum distance between the two cursors. If the producer (Cr) gets too far ahead of the consumer (Cl), it automatically pauses. This creates natural backpressure, preventing the replication queue from becoming bloated and ensuring the background sync doesn't starve foreground operations of resources.
- Resilience and Crash Recovery: The positions of both cursors for every partition scan are frequently checkpointed to a persistent metadata map. If a node crashes, the scan restarts from the last saved cursor positions and resumes exactly where it left off. This eliminates the need to restart a multi-hour or multi-day sync from scratch, making the feature robust and reliable enough for enterprise workloads.
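The two-cursor mechanics above can be simulated in a few lines of Python. This is a toy, single-threaded model, not the actual implementation: `MAX_DISTANCE`, the `replicate` and `checkpoint` callbacks, and the in-flight queue are all assumed names, but the invariant they illustrate is real, as Cr never runs more than a bounded distance ahead of Cl, and both cursors are checkpointed so a restart can resume mid-scan.

```python
# Illustrative simulation of the two-cursor partition scan.
# Names and data structures are assumptions, not the Nutanix internals.
MAX_DISTANCE = 3  # backpressure limit between producer (Cr) and consumer (Cl)

def sync_partition(keys, replicate, checkpoint, cl=0, cr=0):
    """Scan one partition, resuming from checkpointed cursors (cl, cr)."""
    in_flight = list(range(cl, cr))  # work queued but not yet confirmed
    while cl < len(keys):
        # Producer: advance Cr while within the allowed cursor distance.
        while cr < len(keys) and cr - cl < MAX_DISTANCE:
            in_flight.append(cr)   # queue replication work for keys[cr]
            cr += 1                # Cr pauses here once the gap hits the limit
        # Consumer: advance Cl once the oldest queued object is replicated.
        idx = in_flight.pop(0)
        replicate(keys[idx])
        cl = idx + 1
        checkpoint(cl, cr)         # persist both cursors for crash recovery
    return cl, cr

done = []
sync_partition(["k1", "k2", "k3", "k4", "k5"],
               replicate=done.append,
               checkpoint=lambda cl, cr: None)
print(done)  # ['k1', 'k2', 'k3', 'k4', 'k5']
```

Passing the last checkpointed `(cl, cr)` back into `sync_partition` models crash recovery: the scan picks up from the saved cursors instead of rescanning the partition from key zero.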
Summary
By combining partition-based parallelism with a resilient, self-throttling execution engine, Nutanix Objects Storage transforms the challenge of historical data synchronization into a safe, scalable, and efficient one-click operation.
Finally, we prioritized protecting your live data. Since a 2PB backlog might take days to transfer, Nutanix Objects Storage is designed to run both the Historical Sync (background) and Streaming Replication (inline) engines simultaneously.
Crucially, our metadata design differentiates between these two streams, allowing live replication to continue effectively alongside the backfill. This intelligence enables smart prioritization: if an object is modified by a user while the historical sync is running, the new write is treated as a fresh event, ensuring your most recent data is protected without waiting for the full historical queue to clear.
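A toy model makes the dual-stream prioritization easier to see. The queues, version tags, and function names below are illustrative assumptions, not the product's metadata design; what they show is the behavior described above: a live write jumps ahead of the historical backlog, and a later backfill pass for the same key does not clobber the newer copy.

```python
# Toy model of the dual-stream design: "live" writes are replicated ahead
# of the historical backfill. Queue mechanics here are illustrative only.
from collections import deque

live, backlog = deque(), deque(["old1", "old2", "old3"])
replicated_versions = {}

def user_write(key, version):
    """A user modifies an object mid-sync: treat it as a fresh live event."""
    replicated_versions.pop(key, None)   # new version supersedes any old copy
    live.append((key, version))

def next_replication():
    """The live stream always drains before the historical backfill."""
    if live:
        key, version = live.popleft()
        replicated_versions[key] = version
        return key
    if backlog:
        key = backlog.popleft()
        # Backfill must not overwrite a newer copy already replicated live.
        replicated_versions.setdefault(key, "v0")
        return key
    return None

user_write("old2", "v1")                 # modified while the sync is running
order = [next_replication() for _ in range(4)]
print(order)  # ['old2', 'old1', 'old2', 'old3']
```

The live write to `old2` is replicated first, and when the backfill later reaches `old2` it leaves the newer `v1` copy in place, so recent data is protected without waiting for the historical queue to clear.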
Moving petabytes of historical data shouldn't require a team of engineers and a library of custom scripts. By integrating Historical Sync directly into the core engine, Nutanix Objects Storage simplifies a complex operational challenge into a straightforward background task.
So the next time your AI team asks for all 2PB of the logs, you won't have to scramble. You can simply check a box, click save, and tell them: "Done. The data is on its way."
©2026 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product and service names mentioned are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned are for identification purposes only and may be the trademarks of their respective holder(s).