What is multi-node LLM fine-tuning on Nutanix Cloud Platform?

Multi-node LLM fine-tuning on Nutanix Cloud Platform is a distributed training approach that enables organizations to efficiently fine-tune large language models with 100B+ parameters across multiple compute nodes. It leverages Fully Sharded Data Parallel (FSDP) technology to manage complex models like Llama 3.1-405B across 16 or more GPUs.

What infrastructure do you need for distributed LLM fine-tuning?

A distributed LLM fine-tuning setup requires high-performance compute nodes with dual EPYC processors, RTX Pro 6000 Blackwell GPUs (8 per node minimum), DDR5 RAM (3.9 TB per node), 100 GbE bonded interconnect for high-speed communication, and shared NFS storage pools (28 TB pooled recommended) to ensure consistent data access.

What is Fully Sharded Data Parallel (FSDP) and why is it important?

Fully Sharded Data Parallel (FSDP) is a distributed training strategy that shards model parameters, gradients, and optimizer states across multiple GPUs and nodes. It is critical for fine-tuning large language models because each GPU holds only a fraction of the model at a time, which makes it possible to train models that would otherwise exceed individual GPU memory capacity.
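
To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how full sharding divides bf16 weights across the 16 GPUs in this setup. It counts only the base-model weights and assumes a nominal 405 billion parameters and decimal gigabytes; gradients, optimizer state, activations, and temporarily all-gathered layers add to the real footprint.

```python
# Rough estimate of per-GPU weight memory under full sharding.
# Assumptions: nominal 405e9 parameters, bf16 (2 bytes each), 1 GB = 1e9 bytes.
NUM_PARAMS = 405e9
BYTES_PER_PARAM = 2   # bf16
NUM_GPUS = 16         # two nodes x 8 GPUs
GPU_VRAM_GB = 96      # per-GPU VRAM in this cluster


def sharded_weight_gb(num_params, bytes_per_param, num_gpus):
    """Return (total_gb, per_gpu_gb) for evenly sharded weights."""
    total_gb = num_params * bytes_per_param / 1e9
    return total_gb, total_gb / num_gpus


total_gb, per_gpu_gb = sharded_weight_gb(NUM_PARAMS, BYTES_PER_PARAM, NUM_GPUS)
print(f"unsharded weights: {total_gb:.0f} GB")
print(f"per-GPU shard:     {per_gpu_gb:.1f} GB of {GPU_VRAM_GB} GB")
```

Under these assumptions the weights alone (about 810 GB) exceed any single 96 GB GPU, while a 1/16 shard (about 50.6 GB) fits on one.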

What is Low-Rank Adaptation (LoRA) and how does it help?

Low-Rank Adaptation (LoRA) is a technique that freezes the base model weights and trains small low-rank matrices added to selected weight matrices. For a large model like Llama 3.1-405B, LoRA can cut the trainable parameters from hundreds of billions to approximately 400M (about 0.1% of the base model), significantly lowering computational requirements.
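
The ~400M figure can be sanity-checked with simple arithmetic. The sketch below counts LoRA parameters for rank 32 on the q_proj, k_proj, v_proj, and o_proj attention projections, which is the configuration used in this run. The architecture numbers (hidden size 16384, 126 layers, 128 attention heads, 8 key/value heads) are assumptions taken from the publicly described Llama 3.1 405B configuration and should be checked against the model's config.json.

```python
# Count LoRA trainable parameters for attention projections.
# Assumed architecture values (verify against the model's config.json):
HIDDEN = 16384
LAYERS = 126
HEADS = 128
KV_HEADS = 8
RANK = 32  # LoRA rank used in this run


def lora_params(in_features, out_features, rank):
    """A LoRA adapter adds A (rank x in) and B (out x rank) matrices."""
    return rank * (in_features + out_features)


head_dim = HIDDEN // HEADS        # 128
kv_dim = KV_HEADS * head_dim      # 1024 (grouped-query attention)

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, kv_dim, RANK)  # k_proj
    + lora_params(HIDDEN, kv_dim, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
total = per_layer * LAYERS
print(f"{total:,} trainable parameters")          # 404,619,264
print(f"{total / 405e9:.2%} of a nominal 405B base")
```

With these assumed dimensions the count comes to roughly 405M parameters, in line with the ~400M (~0.1%) figure quoted above.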

How is GPU memory optimized in multi-node fine-tuning clusters?

GPU memory is optimized through several techniques: FSDP distributes model parameters across GPUs, activation checkpointing discards intermediate activations during forward passes and recomputes them during backward passes, and careful mapping of GPUs to NUMA domains ensures efficient data access patterns and reduced cross-NUMA communication overhead.

What role does the interconnect play in distributed training?

The interconnect, such as a 100 GbE bonded connection, handles high-speed NCCL and FSDP collective communication between nodes. It is critical for synchronizing gradients, sharing model parameters, and maintaining training efficiency. PCIe Gen5 GPU interconnects provide peer bandwidth within nodes to optimize intra-node communication.

What dataset and training configurations were used for the Llama 3.1-405B fine-tuning?

The fine-tuning used the OpenAssistant/oasst2 dataset, focusing on English conversations. The training setup included a two-node distributed cluster with 16 GPUs total, FSDP with activation checkpointing, LoRA for efficient parameter updates, and Slurm for job scheduling and task dispatch across the compute cluster.

Introduction

Fine-tuning large language models (LLMs), especially those with 100B+ parameters, requires significant computational resources. On the Nutanix Cloud Platform (NCP), we implemented a distributed training environment that uses Fully Sharded Data Parallel (FSDP) to fine-tune a 405B-parameter model, Llama-3.1-405B-Instruct (https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct).

Infrastructure Setup

Our internal test setup used a two-node distributed cluster. Each compute node had dual AMD EPYC processors and 8x NVIDIA RTX Pro 6000 Blackwell GPUs, for a total of 16 GPUs and 7.8 TB of RAM across the cluster. A 100 GbE bonded interconnect handled inter-node communication, and a 28 TB shared NFS storage pool gave both nodes a consistent view of the data. We used LoRA (Low-Rank Adaptation), with approximately 400M trainable parameters (~0.1% of the base model), and trained on the English conversations in the OpenAssistant/oasst2 dataset. The distributed strategy was FSDP with activation checkpointing enabled to reduce memory usage across the 16-GPU cluster.

Figure 1: Hardware and network topology for a two-node distributed deep learning cluster. Each compute node features dual AMD EPYC processors, 3.9 TB of DDR5 RAM, and 8x NVIDIA RTX Pro 6000 Blackwell GPUs divided across two NUMA domains. The control plane relies on slurmctld and slurmd for job dispatching, whereas the data plane utilizes a 100 GbE bonded interconnect for high-speed NCCL/FSDP collective communications. A 28 TB shared NFS storage pool ensures consistent data access across the cluster.

Topology

Three hosts on the 10.21.133.0/24 subnet, all running Ubuntu with kernel 6.8.0-107-generic:

Role	Hostname	IP	Notes
Controller / login	rno-it-gpu0006-1	10.21.133.31	No GPUs; Slurm submission + NFS client
Compute + NFS server	rno-it-gpu0007-1	10.21.133.35	Hosts /shared (14 TB NVMe)
Compute	rno-it-gpu0008-1	10.21.133.39	Hosts /shared-b (14 TB NVMe)

Per-compute-node hardware (identical)

Component	Spec
CPU	2× AMD EPYC 9575F (64c each) → 128 physical cores, 256 threads, 2 NUMA nodes
RAM	3.9 TB DDR5 per node
GPU	8× NVIDIA RTX PRO 6000 Blackwell Server (96 GB VRAM each)
Driver	580.126.09, VBIOS 98.02.81.00.01
GPU Interconnect	PCIe only, no NVLink — topology shows PIX (same switch, GPU pairs) / NODE (same NUMA) / SYS (cross-NUMA). Peer bandwidth is PCIe Gen5
GPU → NUMA	GPUs 0–3 on NUMA 0 (cores 0–63,128–191) ; GPUs 4–7 on NUMA 1 (cores 64–127,192–255)
NICs	5 interfaces; bond0 is the active one
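
The GPU-to-NUMA mapping in the table can be encoded directly, which is useful when pinning each training process to the cores closest to its GPU. The following is a minimal sketch under the layout listed above; the helper names numa_node_for_gpu and cores_for_gpu are hypothetical, and whether the actual run pinned processes this way is not stated here.

```python
# Map a local GPU index to its NUMA node and CPU cores, per the table above.
NUMA_CORES = {
    0: list(range(0, 64)) + list(range(128, 192)),    # GPUs 0-3
    1: list(range(64, 128)) + list(range(192, 256)),  # GPUs 4-7
}


def numa_node_for_gpu(gpu_index):
    """Return the NUMA node (0 or 1) that hosts the given local GPU."""
    if not 0 <= gpu_index <= 7:
        raise ValueError(f"expected a GPU index in 0..7, got {gpu_index}")
    return 0 if gpu_index < 4 else 1


def cores_for_gpu(gpu_index):
    """Return the logical CPU ids local to the given GPU."""
    return NUMA_CORES[numa_node_for_gpu(gpu_index)]


for gpu in (0, 5):
    cores = cores_for_gpu(gpu)
    print(f"GPU {gpu}: NUMA {numa_node_for_gpu(gpu)}, {len(cores)} logical CPUs")
```

On Linux, a process could then restrict itself to those cores with os.sched_setaffinity(0, cores_for_gpu(local_rank)), where local_rank is the process's GPU index on the node.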

Network

Interconnect: 100 GbE on bond0 (100,000 Mbps, full-duplex) between all three hosts.

Storage (28 TB Pooled)

Mount	Backing	Size	Purpose
/shared	10.21.133.35:/data/shared (node 7 NVMe, XFS on vg_nvme-lv_scratch) via NFSv4	14 TB	Original shared pool (4.8 TB used - models + envs)
/shared-b	10.21.133.39:/data (node 8 NVMe, XFS) via NFSv4 on node 7 + controller; bind-mounted on node 8	14 TB	Newly added (empty)
/pool	mergerfs FUSE union of /shared + /shared-b, policy category.create=mfs	28 TB	Unified namespace: new writes routed to most-free brick
per-node-root	/dev/mapper/ubuntu--vg-ubuntu--lv	877 GB

Persistence: /etc/fstab entries on all three hosts, with x-systemd.requires= dependencies so mergerfs waits for both bricks.
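
The "most free space" create policy used for /pool can be illustrated in a few lines. This is only a conceptual sketch of the selection rule, not how mergerfs implements it (the real policy applies additional filters); the function name pick_most_free and the injectable free_bytes callback are hypothetical, and the free-space numbers are illustrative values derived from the table above.

```python
import shutil


def pick_most_free(branches, free_bytes=None):
    """Return the branch path with the most free space.

    free_bytes is an optional callable (path -> bytes free); by default
    the real filesystem is queried with shutil.disk_usage.
    """
    if not branches:
        raise ValueError("at least one branch is required")
    if free_bytes is None:
        free_bytes = lambda path: shutil.disk_usage(path).free
    return max(branches, key=free_bytes)


# Illustrative free space in TB: /shared has 4.8 of 14 TB used, /shared-b is empty.
example_free_tb = {"/shared": 14 - 4.8, "/shared-b": 14}
target = pick_most_free(["/shared", "/shared-b"], free_bytes=example_free_tb.get)
print(f"new file would be created on: {target}")
```

With the usage figures from the table, new writes land on /shared-b until its free space drops below that of /shared.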

Slurm

Version	slurm-wlm
Controller	Running on rno-it-gpu0007-1 (slurmctld UP)
Partition	gpu (default), nodes rno-it-gpu000{7,8}-1
Resources	512 CPU threads total, 7.8 TB RAM, 16 GPUs (gpu:rtx_pro_6000:8 per node)
Time Limit	unlimited (jobs default to infinite unless #SBATCH --time= is set; the sbatch script for this run sets 48 h)
Accounting	Disabled — sacct not available

Software Stack

Package	Version
Python	3.12.13
PyTorch	2.11.0+cu128 (CUDA 12.8)
transformers	5.5.4
trl	1.2.0
peft	0.19.1
accelerate	1.13.0
datasets	4.8.4
bitsandbytes	0.49.2
wandb	0.26.0

Distributed-training posture

FSDP (FULL_SHARD) across all 16 GPUs via accelerate config config/accelerate_fsdp.yaml, TRANSFORMER_BASED_WRAP on LlamaDecoderLayer, SHARDED_STATE_DICT, bf16 mixed precision, FSDP activation checkpointing enabled, CPU offload off, CPU-RAM-efficient loading on.
NCCL: bond0 (TCP), IB disabled, P2P enabled within node (PCIe).
Rendezvous: Slurm-picked head node (first hostname of SLURM_NODELIST), port 29500, static rendezvous, one accelerate launch per node driven by srun.
Data path: HF cache at /pool/hf_cache, model weights at /pool/models/, checkpoints at /pool/checkpoints/sft-<jobid>/, logs at /pool/logs/, W&B run dir inside the output dir.
Experiment tracking: W&B (rajat-ghosh11/llama405b-oasst2-sft for the OASST2 run; rajat-ghosh11/llama8b-lima-sft for earlier LIMA jobs).
Model push: enabled via HF Hub adapter-only upload at end of training, gated by HF_REPO_ID + HF_TOKEN in .env.
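
The NCCL and rendezvous settings above translate into a small set of environment variables. The sketch below shows one way to assemble them; the function build_launch_env is hypothetical, and it takes an already-expanded hostname list (for example the output of scontrol show hostnames "$SLURM_NODELIST") rather than parsing Slurm's compressed node list itself. NCCL_SOCKET_IFNAME, NCCL_IB_DISABLE, and NCCL_P2P_DISABLE are standard NCCL variables, but the exact variables set in the actual run are an assumption.

```python
def build_launch_env(hostnames, port=29500, iface="bond0"):
    """Build environment variables for a static rendezvous over TCP.

    hostnames: expanded node names in Slurm order; the first is the head node.
    """
    if not hostnames:
        raise ValueError("hostnames must contain at least one node")
    return {
        "MASTER_ADDR": hostnames[0],   # Slurm-picked head node
        "MASTER_PORT": str(port),      # static rendezvous port
        "NCCL_SOCKET_IFNAME": iface,   # send NCCL TCP traffic over the bond
        "NCCL_IB_DISABLE": "1",        # no InfiniBand in this cluster
        "NCCL_P2P_DISABLE": "0",       # keep intra-node PCIe peer-to-peer on
    }


env = build_launch_env(["rno-it-gpu0007-1", "rno-it-gpu0008-1"])
for key, value in env.items():
    print(f"export {key}={value}")
```

Each node would export these variables before its own accelerate launch invocation, so that both launchers agree on the head node and port.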

Notable limitations / bottlenecks

No NVLink, no IB → all cross-GPU traffic is PCIe (intra-node) or TCP-over-100-GbE (inter-node). A single 405B FSDP all-gather per layer moves ~50 GB across the wire → steady-state per-step time ~40 min.
Single NFS server for /shared: if node 7 goes down, node 8 loses access to the original shared data and distributed training exits.
No accounting (sacct disabled) — no historical job metrics beyond what each run logs to W&B + stdout.
Controller (node 6) is not in the GPU partition — can submit, can't run GPU jobs itself.

Fine-Tuning Details

Base Model	meta-llama/Llama-3.1-405B-Instruct
Dataset	OpenAssistant/oasst2
Language filter	en only
Preprocessing	tree-walk → pick rank-0 assistant reply at each branch, keep longest path per tree
Train conversations	5,125
Eval conversations	275
Format	HF chat template ({"messages":[{"role","content"}]})
Max seq length	2048
Packing	True (TRL packs conversations to fill context)
Tokenizer	AutoTokenizer from the base model; pad_token = eos_token if unset
Chat template	Llama-3.1 official (installed if missing)
Optimizer	adamw_torch (PyTorch AdamW)
Learning rate (peak)	5e-5
LR scheduler	cosine
Warmup ratio	0.03 (3% of steps)
Weight decay	0.01
Seed	42
Epochs	1
Per-device batch size	1
Gradient accumulation	2
Effective batch size	32
Total optimizer steps	36 (packing merges the 5,125 conversations into 2,048-token sequences before batching, so the count is well below 5,125 / 32 ≈ 160; 1 epoch)
Adapter type	LoRA
LoRA rank	32
LoRA alpha	64
LoRA dropout	0.05
LoRA bias	None
Target Modules	q_proj, k_proj, v_proj, o_proj
Task Type	CAUSAL_LM
Trainable Parameters	~400M (~0.1% of the base)
Precision	bf16
TF32	enabled
Distributed Strategy	FSDP
FSDP wrap policy	TRANSFORMER_BASED_WRAP on LlamaDecoderLayer
FSDP state-dict type	SHARDED_STATE_DICT
FSDP backward prefetch	BACKWARD_PRE
FSDP CPU-efficient loading	enabled
FSDP use_orig_params	True
FSDP offload_params	False
Activation checkpointing	enabled (fsdp_activation_checkpointing: true)
gradient_checkpointing (HF)	False (FSDP handles it)
DDP find unused params	False
Attention impl	sdpa
Launcher	accelerate launch via Slurm srun, one process per node
Eval strategy	every EVAL_STEPS=2 steps
Logging steps	every step (LOGGING_STEPS=1) + logging_first_step=True
Save strategy	every SAVE_STEPS=20 steps
save_total_limit	3
save_only_model	False
Reporter	wandb (report_to=["wandb"])
Adapter export	extracted from FSDP .distcp shards → adapter_model.safetensors (1.62 GB)
Merged model	405B + LoRA fused on CPU (3.9 TB host RAM), sharded 191× @ 5 GB bf16
Final Score	Wall time: 8 h 24 m; final train loss: 1.111 (from wandb-summary.json); final eval loss: 1.139; token accuracy: ~70.5% (train/eval); IFEval (on merged model): prompt-strict 70.98%, prompt-loose 74.49%, inst-strict 78.30%, inst-loose 81.06%
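
Several of the hyperparameters in the table are linked by simple arithmetic. The sketch below derives the effective batch size from the per-device batch size, gradient accumulation, and GPU count, and evaluates a linear-warmup plus cosine-decay learning-rate schedule over the 36 optimizer steps. The schedule formula is the commonly used one and the warmup-step rounding (ceiling) is an assumption; the exact behavior depends on the library version used for the run.

```python
import math


def effective_batch_size(per_device, grad_accum, world_size):
    """Sequences consumed per optimizer step across all GPUs."""
    return per_device * grad_accum * world_size


def warmup_steps(total_steps, warmup_ratio):
    """Warmup length, assuming the ratio is rounded up to a whole step."""
    return math.ceil(total_steps * warmup_ratio)


def cosine_lr(step, total_steps, warmup, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS, PEAK_LR = 36, 5e-5
batch = effective_batch_size(per_device=1, grad_accum=2, world_size=16)
warmup = warmup_steps(TOTAL_STEPS, warmup_ratio=0.03)
print(f"effective batch size: {batch}")   # 32
print(f"warmup steps: {warmup}")          # 2
for step in (0, 1, 2, 19, 36):
    print(f"step {step:2d}: lr = {cosine_lr(step, TOTAL_STEPS, warmup, PEAK_LR):.2e}")
```

With only 36 optimizer steps, a 3% warmup amounts to one or two steps, so almost the entire run is spent on the cosine decay.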

Training Performance

Figure 2. Training Performance

Evaluation Performance

Figure 3. Evaluation Performance

Calibration against a Well-Known Benchmarking Tool

Figure 4. Calibration against a Well-Known Benchmarking Tool. Projected Hours: 8.8 Hours. Actual Time Taken: 8h 24m

GPU Utilization

Figure 5. GPU Utilization (%)

Figure 6. GPU Memory Allocated (Bytes)

Summary

The training run completed in 8 hours and 24 minutes, about 5% under our projected 8.8 hours. We achieved a final evaluation loss of 1.139 and a token accuracy of approximately 70.5%. These results indicate that the Nutanix Cloud Platform can support demanding distributed deep learning workloads in this test configuration.

©2026 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product and service names mentioned are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned are for identification purposes only and may be the trademarks of their respective holder(s).

This content reflects an experiment in a test environment. Results, benefits, savings, or other outcomes described depend on a variety of factors including use case, individual requirements, and operating environments, and this publication should not be construed as a promise or obligation to deliver specific outcomes.
