Introduction
Fine-tuning large language models (LLMs), especially those with 100B+ parameters, requires significant computational resources. On the Nutanix Cloud Platform (NCP), we implemented a distributed training environment that uses Fully Sharded Data Parallel (FSDP) to fine-tune a 405B-parameter model, Llama-3.1-405B-Instruct (https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct).
Infrastructure Setup
Our internal test setup used a two-node distributed cluster. Each compute node had dual AMD EPYC processors and 8x NVIDIA RTX Pro 6000 Blackwell GPUs, for a total of 16 GPUs and 7.8 TB of RAM. The nodes communicated over a 100 GbE bonded interconnect, and a 28 TB shared NFS storage pool gave both nodes a consistent view of the data.
We fine-tuned with LoRA (Low-Rank Adaptation), targeting approximately 400M trainable parameters (~0.1% of the base model). Training used the English conversations in the OpenAssistant/oasst2 dataset. The distributed strategy was FSDP with activation checkpointing enabled to reduce memory usage across the 16-GPU cluster.
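The figures above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it takes the parameter counts and GPU count from the text, assumes weights are held in bf16 (2 bytes per parameter), and ignores activations, LoRA optimizer state, and framework overhead. The function names are hypothetical.

```python
# Back-of-the-envelope sizing for the setup described above.
BASE_PARAMS = 405e9        # base model parameters (from the text)
LORA_PARAMS = 400e6        # approximate trainable LoRA parameters (from the text)
NUM_GPUS = 16              # 2 nodes x 8 GPUs (from the text)
BYTES_PER_PARAM_BF16 = 2   # bf16 stores each parameter in 2 bytes


def trainable_fraction(trainable: float, base: float) -> float:
    """Trainable parameters as a fraction of the base model size."""
    return trainable / base


def sharded_weight_gb_per_gpu(params: float, bytes_per_param: int, gpus: int) -> float:
    """Per-GPU share of the weights under full sharding, in decimal GB."""
    return params * bytes_per_param / gpus / 1e9


if __name__ == "__main__":
    frac = trainable_fraction(LORA_PARAMS, BASE_PARAMS)
    shard = sharded_weight_gb_per_gpu(BASE_PARAMS, BYTES_PER_PARAM_BF16, NUM_GPUS)
    print(f"trainable fraction: {frac:.3%}")
    print(f"bf16 weight shard per GPU: {shard:.1f} GB")
```

Under these assumptions the trainable fraction comes out at roughly 0.1%, and each GPU holds about 50.6 GB of sharded bf16 weights before any activations or adapter state are counted.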
Topology
Three hosts on the 10.21.133.0/24 subnet, all running Ubuntu with kernel 6.8.0-107-generic.
Per-compute-node hardware (identical)
Network
Interconnect: 100 GbE on bond0 (100,000 Mbps, full-duplex) between all three hosts.
Storage (28TB Pooled)
Persistence: /etc/fstab entries on all three hosts, with x-systemd.requires= dependencies so mergerfs waits for both bricks.
Slurm
Software Stack
Distributed-training posture
- FSDP (FULL_SHARD) across all 16 GPUs via accelerate config config/accelerate_fsdp.yaml, TRANSFORMER_BASED_WRAP on LlamaDecoderLayer, SHARDED_STATE_DICT, bf16 mixed precision, FSDP activation checkpointing enabled, CPU offload off, CPU-RAM-efficient loading on.
- NCCL: bond0 (TCP), IB disabled, P2P enabled within node (PCIe).
- Rendezvous: Slurm-picked head node (first hostname of SLURM_NODELIST), port 29500, static rendezvous, one accelerate launch per node driven by srun.
- Data path: HF cache at /pool/hf_cache, model weights at /pool/models/, checkpoints at /pool/checkpoints/sft-<jobid>/, logs at /pool/logs/, W&B run dir inside the output dir.
- Experiment tracking: W&B (rajat-ghosh11/llama405b-oasst2-sft for the OASST2 run; rajat-ghosh11/llama8b-lima-sft for earlier LIMA jobs).
- Model push: enabled via HF Hub adapter-only upload at end of training, gated by HF_REPO_ID + HF_TOKEN in .env.
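The NCCL and rendezvous settings in the list above could be assembled per node roughly as follows. This is a minimal sketch, not the launcher used for the run: the hostnames, the training script name train_sft.py, and the helper function names are hypothetical, and the hostname list is assumed to have been expanded from SLURM_NODELIST beforehand (for example with scontrol show hostnames). The NCCL environment variables and accelerate launch flags are standard ones.

```python
import os
from typing import Dict, List


def nccl_env(iface: str = "bond0") -> Dict[str, str]:
    """NCCL settings for TCP sockets on the bonded NIC, no IB, intra-node P2P on."""
    return {
        "NCCL_SOCKET_IFNAME": iface,  # route NCCL socket traffic over bond0
        "NCCL_IB_DISABLE": "1",       # turn off the InfiniBand/RoCE transport
        "NCCL_P2P_DISABLE": "0",      # keep peer-to-peer enabled within a node
    }


def launch_command(
    hostnames: List[str],
    node_rank: int,
    gpus_per_node: int = 8,
    port: int = 29500,
    config_file: str = "config/accelerate_fsdp.yaml",
    script: str = "train_sft.py",  # hypothetical script name
) -> List[str]:
    """Build one `accelerate launch` invocation for the node with the given rank."""
    if not 0 <= node_rank < len(hostnames):
        raise ValueError("node_rank must index into hostnames")
    head = hostnames[0]  # head node = first hostname in the allocation
    return [
        "accelerate", "launch",
        "--config_file", config_file,
        "--num_machines", str(len(hostnames)),
        "--num_processes", str(len(hostnames) * gpus_per_node),  # total across nodes
        "--machine_rank", str(node_rank),
        "--main_process_ip", head,
        "--main_process_port", str(port),
        "--rdzv_backend", "static",
        script,
    ]


if __name__ == "__main__":
    nodes = ["node7", "node8"]  # hypothetical hostnames
    env = {**os.environ, **nccl_env()}
    for rank in range(len(nodes)):
        print(" ".join(launch_command(nodes, rank)))
```

In a Slurm job, each srun task would run the command for its own rank with the merged environment, so both nodes agree on the same head node and port.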
Notable limitations / bottlenecks
- No NVLink, no IB → all cross-GPU traffic is PCIe (intra-node) or TCP-over-100-GbE (inter-node). A single 405B FSDP all-gather per layer moves ~50 GB across the wire → steady-state per-step time ~40 min.
- Single NFS server for /shared: if node 7 goes down, node 8 loses access to the shared data and the distributed training job exits.
- No accounting (sacct disabled) — no historical job metrics beyond what each run logs to W&B + stdout.
- Controller (node 6) is not in the GPU partition — can submit, can't run GPU jobs itself.
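To put the interconnect bottleneck in perspective, the sketch below estimates how long the quoted ~50 GB all-gather takes on a 100 GbE link. It takes the 50 GB figure from the text at face value and treats the link as ideal, so the result is a lower bound. The per-step estimate relies on two labeled assumptions that are not from the source: 126 decoder layers, and two all-gathers per layer per step (one in the forward pass, one in the backward pass). The function names are hypothetical.

```python
def transfer_seconds(payload_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Seconds to move payload_gb (decimal gigabytes) over a link_gbps (gigabit/s) link.

    efficiency is the fraction of line rate actually achieved (1.0 = ideal).
    """
    if payload_gb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("payload must be >= 0, link rate > 0, efficiency in (0, 1]")
    return payload_gb * 8 / (link_gbps * efficiency)


def step_comm_minutes(per_gather_s: float, layers: int, gathers_per_layer: int) -> float:
    """Total all-gather time per training step, in minutes."""
    return per_gather_s * layers * gathers_per_layer / 60


if __name__ == "__main__":
    per_gather = transfer_seconds(payload_gb=50, link_gbps=100)  # figures from the text
    # Assumptions (not from the source): 126 layers, 2 all-gathers per layer per step.
    per_step = step_comm_minutes(per_gather, layers=126, gathers_per_layer=2)
    print(f"per all-gather: {per_gather:.1f} s at line rate")
    print(f"per step (communication only): {per_step:.1f} min at line rate")
```

At ideal line rate, one 50 GB transfer takes 4 seconds. Real TCP throughput is lower than line rate, and gradient reduce-scatter and compute add further time, so the efficiency argument and the layer counts should be adjusted to match measurements.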
Fine-Tuning Details
Training Performance
Evaluation Performance
Calibration against a Well-Known Benchmarking Tool
GPU Utilization
Summary
The training run completed in 8 hours and 24 minutes (8.4 hours), close to our projection of 8.8 hours. We achieved a final evaluation loss of 1.139 and a token accuracy of approximately 70.5%. These results show that the Nutanix Cloud Platform can support high-demand distributed deep learning workloads such as LoRA fine-tuning of a 405B-parameter model on a two-node, 16-GPU cluster.
©2026 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product and service names mentioned are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned are for identification purposes only and may be the trademarks of their respective holder(s).
This content reflects an experiment in a test environment. Results, benefits, savings, or other outcomes described depend on a variety of factors including use case, individual requirements, and operating environments, and this publication should not be construed as a promise or obligation to deliver specific outcomes.