Introduction

Fine-tuning large language models (LLMs), especially those with 100B+ parameters, requires significant computational resources. On the Nutanix Cloud Platform (NCP), we implemented a distributed training environment that uses Fully Sharded Data Parallel (FSDP) to fine-tune a 405B-parameter model (https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct).

Infrastructure Setup

Our internal test setup used a two-node distributed cluster. Each compute node featured dual AMD EPYC processors and 8x NVIDIA RTX Pro 6000 Blackwell GPUs, for a total of 16 GPUs and 7.8 TB of RAM. Inter-node communication ran over a 100 GbE bonded interconnect, and a 28 TB shared NFS storage pool kept data consistent across nodes.

We used a LoRA (Low-Rank Adaptation) approach with approximately 400M trainable parameters (~0.1% of the base model). Training used the English conversations in the OpenAssistant/oasst2 dataset. The distributed strategy was FSDP with activation checkpointing enabled to reduce memory usage across the 16-GPU cluster.

Figure 1: Hardware and network topology for a two-node distributed deep learning cluster. Each compute node features dual AMD EPYC processors, 3.9 TB of DDR5 RAM, and 8x NVIDIA RTX Pro 6000 Blackwell GPUs divided across two NUMA domains. The control plane relies on slurmctld and slurmd for job dispatching, whereas the data plane utilizes a 100 GbE bonded interconnect for high-speed NCCL/FSDP collective communications. A 28 TB shared NFS storage pool ensures consistent data access across the cluster.

Topology

Three hosts on the 10.21.133.0/24 subnet, all running Ubuntu with kernel 6.8.0-107-generic:

| Role | Hostname | IP | Notes |
| --- | --- | --- | --- |
| Controller / login | rno-it-gpu0006-1 | 10.21.133.31 | No GPUs; Slurm submission + NFS client |
| Compute + NFS server | rno-it-gpu0007-1 | 10.21.133.35 | Hosts /shared (14 TB NVMe) |
| Compute | rno-it-gpu0008-1 | 10.21.133.39 | Hosts /shared-b (14 TB NVMe) |

Per-compute-node hardware (identical)

| Component | Spec |
| --- | --- |
| CPU | 2× AMD EPYC 9575F (64c each) → 128 physical cores, 256 threads, 2 NUMA nodes |
| RAM | 3.9 TB DDR5 per node |
| GPU | 8× NVIDIA RTX PRO 6000 Blackwell Server (96 GB VRAM each) |
| Driver | 580.126.09, VBIOS 98.02.81.00.01 |
| GPU interconnect | PCIe only, no NVLink. Topology shows PIX (same switch, GPU pairs) / NODE (same NUMA) / SYS (cross-NUMA). Peer bandwidth is PCIe Gen5 |
| GPU → NUMA | GPUs 0–3 on NUMA 0 (cores 0–63, 128–191); GPUs 4–7 on NUMA 1 (cores 64–127, 192–255) |
| NICs | 5 interfaces; bond0 is the active one |
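
The GPU-to-NUMA mapping above matters on a PCIe-only node, because a worker process scheduled on the wrong socket reaches its GPU across the inter-socket link. The following is a minimal sketch of pinning each worker to the cores local to its GPU. It assumes the core ranges listed in the table and the `LOCAL_RANK` environment variable set by the launcher; the source does not state whether the run described here applied such pinning, and the helper names are hypothetical.

```python
import os

# GPU-to-NUMA layout taken from the hardware table:
# GPUs 0-3 sit on NUMA 0, GPUs 4-7 on NUMA 1.
NUMA_CORES: dict[int, list[int]] = {
    0: list(range(0, 64)) + list(range(128, 192)),
    1: list(range(64, 128)) + list(range(192, 256)),
}
GPUS_PER_NUMA = 4
GPUS_PER_NODE = 8


def cores_for_gpu(local_gpu: int) -> list[int]:
    """Return the CPU threads that share a NUMA node with the given GPU."""
    if not 0 <= local_gpu < GPUS_PER_NODE:
        raise ValueError(f"expected a local GPU index in 0..7, got {local_gpu}")
    return NUMA_CORES[local_gpu // GPUS_PER_NUMA]


def pin_to_local_numa(local_gpu: int) -> None:
    """Restrict the calling process to the cores local to its GPU (Linux only)."""
    # Intersect with the currently allowed set so the call does not fail
    # when a scheduler cpuset exposes fewer cores than the full node.
    allowed = os.sched_getaffinity(0)
    target = allowed & set(cores_for_gpu(local_gpu))
    if target:
        os.sched_setaffinity(0, target)


if __name__ == "__main__" and hasattr(os, "sched_setaffinity"):
    rank = int(os.environ.get("LOCAL_RANK", "0"))
    pin_to_local_numa(rank)
    print(f"local rank {rank} may run on {len(os.sched_getaffinity(0))} threads")
```

Calling `pin_to_local_numa` once at the start of each worker, before any CUDA initialization, would be the natural place for it in a training script.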

Network

Interconnect: 100 GbE on bond0 (100,000 Mbps, full-duplex) between all three hosts.

Storage (28 TB Pooled)

| Mount | Backing | Size | Purpose |
| --- | --- | --- | --- |
| /shared | 10.21.133.35:/data/shared (node 7 NVMe, XFS on vg_nvme-lv_scratch) via NFSv4 | 14 TB | Original shared pool (4.8 TB used: models + envs) |
| /shared-b | 10.21.133.39:/data (node 8 NVMe, XFS) via NFSv4 on node 7 + controller; bind-mounted on node 8 | 14 TB | Newly added (empty) |
| /pool | mergerfs FUSE union of /shared + /shared-b, policy category.create=mfs | 28 TB | Unified namespace: new writes routed to most-free brick |
| per-node root | /dev/mapper/ubuntu--vg-ubuntu--lv | 877 GB | |

Persistence: /etc/fstab entries on all three hosts, with x-systemd.requires= dependencies so mergerfs waits for both bricks.
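
The "most-free brick" routing of the /pool union can be illustrated with a few lines of Python. This is only a conceptual sketch of the selection rule, not mergerfs's actual implementation; the function names are hypothetical, and the example figures are the sizes and usage reported in the storage table.

```python
import shutil


def pick_most_free(free_bytes_by_brick: dict[str, int]) -> str:
    """Return the brick path with the most free space (the 'mfs' idea)."""
    if not free_bytes_by_brick:
        raise ValueError("no bricks given")
    return max(free_bytes_by_brick, key=free_bytes_by_brick.get)


def probe_free_space(bricks: list[str]) -> dict[str, int]:
    """Measure free bytes on each mounted brick path."""
    return {path: shutil.disk_usage(path).free for path in bricks}


# Figures from the storage table: /shared is 14 TB with 4.8 TB used,
# /shared-b is 14 TB and empty.
TB = 10**12
example = {"/shared": round((14 - 4.8) * TB), "/shared-b": 14 * TB}
print(pick_most_free(example))  # /shared-b
```

On the cluster itself, `pick_most_free(probe_free_space(["/shared", "/shared-b"]))` would report which brick a new file is likely to land on under this policy.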

Slurm

| Item | Value |
| --- | --- |
| Version | slurm-wlm |
| Controller | Running on rno-it-gpu0007-1 (slurmctld UP) |
| Partition | gpu (default), nodes rno-it-gpu000{7,8}-1 |
| Resources | 512 CPU threads total, 7.8 TB RAM, 16 GPUs (gpu:rtx_pro_6000:8 per node) |
| Time limit | Unlimited (jobs default to infinite unless #SBATCH --time= is set; the training sbatch script sets 48 h) |
| Accounting | Disabled; sacct not available |

Software Stack

| Package | Version |
| --- | --- |
| Python | 3.12.13 |
| PyTorch | 2.11.0+cu128 (CUDA 12.8) |
| transformers | 5.5.4 |
| trl | 1.2.0 |
| peft | 0.19.1 |
| accelerate | 1.13.0 |
| datasets | 4.8.4 |
| bitsandbytes | 0.49.2 |
| wandb | 0.26.0 |

Distributed-training posture

  • FSDP (FULL_SHARD) across all 16 GPUs via accelerate config config/accelerate_fsdp.yaml, TRANSFORMER_BASED_WRAP on LlamaDecoderLayer, SHARDED_STATE_DICT, bf16 mixed precision, FSDP activation checkpointing enabled, CPU offload off, CPU-RAM-efficient loading on.
  • NCCL: bond0 (TCP), IB disabled, P2P enabled within node (PCIe).
  • Rendezvous: Slurm-picked head node (first hostname of SLURM_NODELIST), port 29500, static rendezvous, one accelerate launch per node driven by srun.
  • Data path: HF cache at /pool/hf_cache, model weights at /pool/models/, checkpoints at /pool/checkpoints/sft-<jobid>/, logs at /pool/logs/, W&B run dir inside the output dir.
  • Experiment tracking: W&B (rajat-ghosh11/llama405b-oasst2-sft for the OASST2 run; rajat-ghosh11/llama8b-lima-sft for earlier LIMA jobs).
  • Model push: enabled via HF Hub adapter-only upload at end of training, gated by HF_REPO_ID + HF_TOKEN in .env.
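
The rendezvous scheme in the list above (head node is the first hostname, port 29500, static rendezvous, one `accelerate launch` per node) can be sketched as a small command builder. This is an illustrative sketch rather than the launcher used for the run: it assumes the Slurm node list has already been expanded into plain hostnames (for example with `scontrol show hostnames`), and the script name `train_sft.py` and the function name are hypothetical.

```python
import shlex

GPUS_PER_NODE = 8
RDZV_PORT = 29500


def build_launch_command(hostnames: list[str], this_host: str) -> list[str]:
    """Build the per-node `accelerate launch` argv for a static rendezvous."""
    head = hostnames[0]  # first hostname in the expanded node list
    return [
        "accelerate", "launch",
        "--config_file", "config/accelerate_fsdp.yaml",
        "--num_machines", str(len(hostnames)),
        # --num_processes is the total across all machines, not per node.
        "--num_processes", str(len(hostnames) * GPUS_PER_NODE),
        "--machine_rank", str(hostnames.index(this_host)),
        "--main_process_ip", head,
        "--main_process_port", str(RDZV_PORT),
        "--rdzv_backend", "static",
        "train_sft.py",  # hypothetical training script name
    ]


nodes = ["rno-it-gpu0007-1", "rno-it-gpu0008-1"]
print(shlex.join(build_launch_command(nodes, "rno-it-gpu0008-1")))
```

Each node would run the same builder with its own hostname, so the only argument that differs between the two launches is `--machine_rank`.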

Notable limitations / bottlenecks

  • No NVLink, no IB → all cross-GPU traffic is PCIe (intra-node) or TCP-over-100-GbE (inter-node). A single 405B FSDP all-gather per layer moves ~50 GB across the wire → steady-state per-step time ~40 min.
  • Single NFS server for /shared: if node 7 goes down, node 8 loses access to the original shared data and the distributed training job exits.
  • No accounting (sacct disabled) — no historical job metrics beyond what each run logs to W&B + stdout.
  • Controller (node 6) is not in the GPU partition — can submit, can't run GPU jobs itself.
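
To put the interconnect limitation in perspective, the sketch below works out a rough memory and bandwidth budget. It is a back-of-envelope estimate under stated assumptions: a nominal 405e9 parameters stored in bf16, an even shard across 16 GPUs, and the 100 Gb/s line rate of bond0 with no protocol overhead. The last figure is only a lower bound on the time for one node to receive the half of the weights held by the other node; it says nothing about how NCCL actually schedules the collectives.

```python
PARAMS = 405e9        # nominal parameter count (assumption: rounded)
BYTES_PER_PARAM = 2   # bf16
NUM_GPUS = 16
NUM_NODES = 2
VRAM_GB = 96          # per GPU, from the hardware table
LINK_GBPS = 100       # bond0 line rate, from the network section


def weights_gb() -> float:
    """Total size of the bf16 base weights in decimal gigabytes."""
    return PARAMS * BYTES_PER_PARAM / 1e9


def shard_gb_per_gpu() -> float:
    """Weights resident on each GPU when fully sharded."""
    return weights_gb() / NUM_GPUS


def min_seconds_per_full_gather() -> float:
    """Lower bound: time for one node to receive the weights it does not hold."""
    remote_gb = weights_gb() * (NUM_NODES - 1) / NUM_NODES
    link_gb_per_s = LINK_GBPS / 8  # bits to bytes
    return remote_gb / link_gb_per_s


print(f"bf16 weights:         {weights_gb():.0f} GB")
print(f"aggregate VRAM:       {NUM_GPUS * VRAM_GB} GB")
print(f"weight shard per GPU: {shard_gb_per_gpu():.3f} GB of {VRAM_GB} GB")
print(f"min time per gather:  {min_seconds_per_full_gather():.1f} s")
```

Because FSDP gathers parameters at least once in the forward pass and again in the backward pass, the real per-step communication cost is a multiple of this bound.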

Fine-Tuning Details

| Setting | Value |
| --- | --- |
| Base model | meta-llama/Llama-3.1-405B-Instruct |
| Dataset | OpenAssistant/oasst2 |
| Language filter | en only |
| Preprocessing | tree-walk → pick rank-0 assistant reply at each branch, keep longest path per tree |
| Train conversations | 5,125 |
| Eval conversations | 275 |
| Format | HF chat template ({"messages":[{"role","content"}]}) |
| Max seq length | 2048 |
| Packing | True (TRL packs conversations to fill context) |
| Tokenizer | AutoTokenizer from the base model; pad_token = eos_token if unset |
| Chat template | Llama-3.1 official (installed if missing) |
| Optimizer | adamw_torch (PyTorch AdamW) |
| Learning rate (peak) | 5e-5 |
| LR scheduler | cosine |
| Warmup ratio | 0.03 (3% of steps) |
| Weight decay | 0.01 |
| Seed | 42 |
| Epochs | 1 |
| Per-device batch size | 1 |
| Gradient accumulation | 2 |
| Effective batch size | 32 (1 per device × 2 accumulation steps × 16 GPUs) |
| Total optimizer steps | 36 (5,125 conversations packed into 2,048-token sequences, divided by the effective batch size of 32, × 1 epoch) |
| Adapter type | LoRA |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| LoRA bias | None |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Task type | CAUSAL_LM |
| Trainable parameters | ~400M (~0.1% of the base) |
| Precision | bf16 |
| TF32 | enabled |
| Distributed strategy | FSDP |
| FSDP wrap policy | TRANSFORMER_BASED_WRAP on LlamaDecoderLayer |
| FSDP state-dict type | SHARDED_STATE_DICT |
| FSDP backward prefetch | BACKWARD_PRE |
| FSDP CPU-efficient loading | enabled |
| FSDP use_orig_params | True |
| FSDP offload_params | False |
| Activation checkpointing | enabled (fsdp_activation_checkpointing: true) |
| gradient_checkpointing (HF) | False (FSDP handles it) |
| DDP find unused params | False |
| Attention impl | sdpa |
| Launcher | accelerate launch via Slurm srun, one launch process per node |
| Eval strategy | every EVAL_STEPS=2 steps |
| Logging steps | every step (LOGGING_STEPS=1) + logging_first_step=True |
| Save strategy | every SAVE_STEPS=20 steps |
| save_total_limit | 3 |
| save_only_model | False |
| Reporter | wandb (report_to=["wandb"]) |
| Adapter export | extracted from FSDP .distcp shards → adapter_model.safetensors (1.62 GB) |
| Merged model | 405B + LoRA fused on CPU (3.9 TB host RAM), sharded 191× @ 5 GB bf16 |
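
The ~400M trainable-parameter figure and the 1.62 GB adapter file can be cross-checked from the LoRA settings listed above. The sketch below counts the parameters a rank-32 adapter adds to the four attention projections. The model dimensions (hidden size 16384, 126 layers, 8 key/value heads of dimension 128) are assumptions taken from the public Llama 3.1 405B configuration rather than from this write-up, and storing the adapter in 32-bit floats is likewise an assumption.

```python
# Assumed Llama 3.1 405B dimensions (public model config, not stated in this post).
HIDDEN = 16384
LAYERS = 126
KV_DIM = 8 * 128  # 8 key/value heads x head_dim 128 (grouped-query attention)
RANK = 32         # LoRA rank from the settings above


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """A LoRA pair adds A (rank x d_in) plus B (d_out x rank) weights."""
    return rank * (d_in + d_out)


per_layer = (
    lora_params(HIDDEN, HIDDEN)    # q_proj
    + lora_params(HIDDEN, KV_DIM)  # k_proj
    + lora_params(HIDDEN, KV_DIM)  # v_proj
    + lora_params(HIDDEN, HIDDEN)  # o_proj
)
total = per_layer * LAYERS

print(f"trainable parameters: {total:,}")                  # 404,619,264
print(f"share of 405e9 base:  {total / 405e9:.4%}")
print(f"size at 4 bytes each: {total * 4 / 1e9:.2f} GB")   # 1.62 GB
```

Under these assumptions the count lands at about 405M parameters, in line with the ~400M and 1.62 GB values reported above.
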

Final Score

  • Wall time: 8 h 24 m
  • Final train loss: 1.111 (from wandb-summary.json)
  • Final eval loss: 1.139
  • Token accuracy: ~70.5% (train/eval)
  • IFEval (on merged): prompt-strict 70.98%, prompt-loose 74.49%, inst-strict 78.30%, inst-loose 81.06%
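
The optimizer settings in the fine-tuning details (peak learning rate 5e-5, cosine schedule, warmup ratio 0.03, 36 optimizer steps, per-device batch 1 with gradient accumulation 2 on 16 GPUs) can be made concrete with a short sketch. It uses the standard linear-warmup plus cosine-decay formula and assumes the warmup step count is rounded up; the trainer used in the run may round differently, and the function names are hypothetical.

```python
import math

PEAK_LR = 5e-5
TOTAL_STEPS = 36
WARMUP_RATIO = 0.03


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Sequences contributing to each optimizer step."""
    return per_device * grad_accum * num_gpus


def lr_at(step: int, total: int = TOTAL_STEPS,
          warmup_ratio: float = WARMUP_RATIO, peak: float = PEAK_LR) -> float:
    """Linear warmup to the peak, then cosine decay to zero."""
    warmup = math.ceil(total * warmup_ratio)  # assumption: rounded up
    if step < warmup:
        return peak * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size(per_device=1, grad_accum=2, num_gpus=16))  # 32
for step in (0, 1, 2, 19, 36):
    print(f"step {step:2d}: lr = {lr_at(step):.2e}")
```

With only 36 steps, the 3% warmup amounts to two steps under this rounding, so nearly the whole run is spent on the cosine decay.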

Training Performance

Figure 2. Training Performance

Evaluation Performance

Figure 3. Evaluation Performance

Calibration against a Well-Known Benchmarking Tool (Link)

Figure 4. Calibration against a Well-Known Benchmarking Tool. Projected time: 8.8 hours. Actual time taken: 8 h 24 m.

GPU Utilization

Figure 5. GPU Utilization (%)
Figure 6. GPU Memory Allocated (Bytes)

Summary

The training run completed in 8 hours and 24 minutes, about 4.5% under our projected 8.8 hours. We achieved a final evaluation loss of 1.139 and a token accuracy of approximately 70.5%. These results indicate that the Nutanix Cloud Platform can support demanding distributed deep learning workloads.

 

©2026 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product and service names mentioned are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned are for identification purposes only and may be the trademarks of their respective holder(s).

This content reflects an experiment in a test environment. Results, benefits, savings, or other outcomes described depend on a variety of factors including use case, individual requirements, and operating environments, and this publication should not be construed as a promise or obligation to deliver specific outcomes.