Why AI Training Stalls: A GPU Cluster Troubleshooting Guide

Ehsan Ghasisin Expert Picks
06/30/2026 8:11am 14 minute read

Key takeaways

• AI training stalls are rarely caused by failing GPUs. The accelerators are usually idle, waiting on storage, the network fabric, or a single slow node.

• Three symptom patterns cover most cases: stalls near epoch completion (storage/checkpointing), random timeouts (network fabric), and one node slower than the rest (hardware, NUMA, or thermal).

• Diagnose by symptom, not by guesswork. Measure tail latency (p99), retransmission rates, and per-node variance before you touch the GPUs.

• When diagnosis confirms a fabric bottleneck, the fix is hardware: deeper-buffer switches, clean optics, and a non-blocking or rail-optimized topology.

Introduction

You have completed the hard part of standing up an AI cluster: the budget is approved, the GPUs are installed, the nodes are racked, and the frameworks are configured. Everything looks solid, and yet training jobs still stall, time out, or behave unpredictably. Most of the time the cause is not one dramatic fault but a small inefficiency in storage, the network fabric, or a single node that only surfaces under the synchronized, bursty traffic distributed training generates.

This is a symptom-driven playbook for GPU cluster troubleshooting — how to diagnose why a job stalls, times out, or runs slow, across the storage, network, and node layers. It is written for clusters that are already running and misbehaving; if you are still choosing your fabric, switching, and topology, start with our pillar guide, GPU Cluster Networking for On-Prem AI. Two concepts explain most of what follows: checkpointing and epochs.

What Is Checkpointing in AI Training?

Imagine writing a 500-page novel. You would not type for ten hours and then shut down without saving; you would hit Save every hour. Checkpointing is the Save button for AI training. It periodically writes the current state of a training job to storage so the job can resume from that point instead of starting over.

A training run may last hours, days, or weeks. If power fails, a node crashes, or a job times out, everything since the last checkpoint is lost; with no checkpoint at all, the run starts over from scratch.

What a checkpoint usually contains

Model weights — what the network has learned so far.
Optimizer state — the internal parameters (such as momentum and variance estimates) that guide how learning continues.
Training progress — the epoch and step counters that act as a bookmark.
Sometimes scheduler and RNG state — so a resumed run reproduces the same learning-rate curve and data ordering.
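
A minimal, framework-free sketch of those four pieces, using only the Python standard library. The field names, the toy values, and the save_checkpoint / load_checkpoint helpers are illustrative assumptions rather than any framework's API; a real job would serialize tensors through its framework's own checkpoint routines.

```python
import os
import pickle
import random
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint to a temp file, then rename it into place.

    Renaming within one directory replaces the target in a single step,
    so a crash mid-write does not leave a half-written file under the
    final name.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to storage before renaming
    os.replace(tmp_path, path)


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy stand-ins for the four parts of a checkpoint (hypothetical values).
checkpoint = {
    "model_weights": {"layer1.weight": [0.12, -0.53], "layer1.bias": [0.01]},
    "optimizer_state": {"momentum": [0.0, 0.0], "variance": [1e-8, 1e-8]},
    "progress": {"epoch": 3, "step": 12000},
    "rng_state": random.getstate(),
}

ckpt_dir = tempfile.mkdtemp()
ckpt_path = os.path.join(ckpt_dir, "ckpt_epoch3.pkl")
save_checkpoint(ckpt_path, checkpoint)
restored = load_checkpoint(ckpt_path)

# Resuming restores the bookmark and the random stream.
random.setstate(restored["rng_state"])
```

The fsync call is the part that makes storage latency visible to the job: the function does not return until the storage layer confirms the write.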

Why checkpointing matters for troubleshooting

Checkpointing causes sudden spikes in storage traffic. During such an event, the GPUs pause or slow down while many nodes try to write very large files at the same time, and a storage layer that was comfortable under steady load can suddenly fall behind. That is why so many jobs freeze near epoch completion, slow down on a regular 30- to 60-minute cadence, or show inconsistent step times. The problem is usually not the GPUs — it is the storage layer struggling to absorb that synchronized write burst.

Modern note
Synchronous checkpointing blocks training until the write completes. Asynchronous and sharded checkpointing (for example, PyTorch Distributed Checkpoint) overlap the write with computation and split state across ranks, which shortens the pause dramatically. If you are still seeing long synchronous stalls, the checkpoint method itself may be worth revisiting alongside the storage hardware.
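
The difference between the two approaches can be sketched with a background thread. This is a toy illustration of the idea only, not the PyTorch Distributed Checkpoint API: slow_write, checkpoint_sync, and checkpoint_async are hypothetical helpers, and the 0.2-second sleep stands in for storage latency.

```python
import copy
import os
import pickle
import tempfile
import threading
import time


def slow_write(path, state):
    """Stand-in for a slow storage write (the sleep simulates latency)."""
    time.sleep(0.2)
    with open(path, "wb") as f:
        pickle.dump(state, f)


def checkpoint_sync(path, state):
    slow_write(path, state)  # training is blocked for the whole write


def checkpoint_async(path, state):
    snapshot = copy.deepcopy(state)  # short pause: copy the state in memory
    writer = threading.Thread(target=slow_write, args=(path, snapshot))
    writer.start()  # the write now overlaps with training
    return writer


state = {"weights": [0.0, 0.0], "step": 0}
out_dir = tempfile.mkdtemp()

start = time.perf_counter()
checkpoint_sync(os.path.join(out_dir, "sync.pkl"), state)
sync_pause = time.perf_counter() - start

start = time.perf_counter()
writer = checkpoint_async(os.path.join(out_dir, "async.pkl"), state)
async_pause = time.perf_counter() - start

state["step"] += 1  # training continues while the write is in flight
writer.join()       # wait for the writer before exiting or overwriting

with open(os.path.join(out_dir, "async.pkl"), "rb") as f:
    saved = pickle.load(f)
```

The snapshot is what keeps the saved state consistent: the training loop mutates state after the call returns, but the file still holds step 0. The cost is the extra memory for the copy.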

What Are Epochs?

Think of epochs like studying for an exam. The first read of the material builds a basic understanding, a second pass improves retention, and a third encourages deeper mastery. In AI training, an epoch is one complete pass through the entire training dataset.

Why multiple epochs are needed

Models rarely learn everything in a single pass. Each additional epoch lets the model improve accuracy, reduce prediction error, and adjust its internal weights. Crucially for troubleshooting, checkpoints and synchronization barriers tend to land at epoch boundaries — which is exactly why so many stalls appear near the end of an epoch.

Fast Diagnosis: Match the Symptom to the Likely Cause

Resist the urge to run benchmarks or reboot nodes first. Start by identifying the symptom. The table below is an emergency triage chart for the three most common failure patterns in distributed training. When a job stalls, rely on pattern recognition rather than guesswork.

Symptom	Likely cause	First check
Stalls near completion	Checkpointing bottleneck; storage latency spike	Measure write latency during a checkpoint
Timeouts / random failures (NCCL timeout)	Fabric instability, packet loss, InfiniBand/RoCE misconfiguration	Check retransmits, NCCL logs, NCCL_SOCKET_IFNAME / NCCL_IB_HCA
One node slower than the rest	Hardware imbalance, NUMA, thermal throttling, version drift	Compare per-node utilization (nvidia-smi); run DCGM diagnostics

The rest of this guide works through each pattern in turn.

Pattern 1: Stalls Near Completion

Many jobs run normally for hours, then slow or freeze near the end of an epoch. The GPUs are usually not failing; they are waiting for checkpointing — the process of saving model state to storage — which lands at epoch boundaries or just before completion. Checkpoint writes are large (tens to hundreds of gigabytes, terabytes for the biggest models) and synchronized across nodes, so if storage cannot keep up, the GPUs sit idle waiting for the writes to confirm. These pauses are intermittent, which makes them hard to catch with standard monitoring.

The tell-tale signature: training slows on a regular cadence matching the checkpoint interval (often every 30 to 60 minutes), GPU utilization falls while I/O wait rises, and the stall disappears if checkpointing is disabled (a diagnostic test, not a fix). The section "Storage: Why Checkpointing Becomes the Bottleneck" below covers the storage mechanics and the exact metrics to measure.
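
One way to test for that regular cadence is to pull the slow steps out of a step-time log and look at the gaps between them. The sketch below assumes you already have per-step durations and start times; the find_stall_cadence helper, the 3x-median threshold, and the synthetic data are all illustrative assumptions.

```python
from statistics import median


def find_stall_cadence(step_times, step_timestamps, slow_factor=3.0):
    """Return (stall_timestamps, gaps) for steps much slower than the median."""
    typical = median(step_times)
    stalls = [
        ts for ts, duration in zip(step_timestamps, step_times)
        if duration > slow_factor * typical
    ]
    gaps = [later - earlier for earlier, later in zip(stalls, stalls[1:])]
    return stalls, gaps


# Synthetic log: 1-second steps, with a 45-second stall every 1800th step.
step_times = []
step_timestamps = []
clock = 0.0
for step in range(7200):
    duration = 45.0 if step % 1800 == 1799 else 1.0
    step_timestamps.append(clock)
    step_times.append(duration)
    clock += duration

stalls, gaps = find_stall_cadence(step_times, step_timestamps)

# Treat the stalls as periodic if the gaps agree to within 10 percent.
regular = len(gaps) >= 2 and max(gaps) - min(gaps) <= 0.1 * median(gaps)
```

If the gaps are regular and close to the configured checkpoint interval, storage is the first suspect; irregular gaps point toward the fabric or a straggler instead.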

Pattern 2: Timeouts and Random Failures

Random timeouts during training are usually a sign of instability somewhere in the cluster rather than a software bug. Distributed training depends on constant communication between nodes, so even brief latency spikes or packet loss can interrupt a collective operation. These failures look inconsistent because they only surface under heavy, synchronized load.

The symptom you will actually see: an NCCL timeout

On NVIDIA clusters this almost always surfaces as an NCCL (NVIDIA Collective Communications Library) watchdog timeout — a log line resembling "Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE) ran for 1800000 milliseconds before timing out." It looks like a software crash, but the timeout is usually the messenger, not the cause: a collective such as all-reduce stalled because one rank never received the data it was waiting for. The job aborts after the default timeout (often 30 minutes) elapses. Raising the timeout hides the symptom; the real fix is finding why the fabric, a node, or the configuration starved that collective.
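
When scanning many rank logs, it can help to extract the collective type and elapsed time from each watchdog line. The pattern below is a sketch built from the sample line quoted above; the exact log format varies between framework versions, so treat the regular expression and the parse_watchdog_line helper as assumptions to adapt.

```python
import re

# Matches the collective name and the elapsed time in the watchdog message.
WATCHDOG_RE = re.compile(
    r"WorkNCCL\(.*?OpType=(?P<op>\w+).*?\) ran for (?P<ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Return (collective, elapsed_ms), or None if the line does not match."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    return match.group("op"), int(match.group("ms"))


line = (
    "Watchdog caught collective operation timeout: "
    "WorkNCCL(OpType=ALLREDUCE) ran for 1800000 milliseconds before timing out."
)
parsed = parse_watchdog_line(line)
op, elapsed_ms = parsed
elapsed_minutes = elapsed_ms / 60_000
```

For the sample line this yields ALLREDUCE and 30 minutes, which matches the timeout mentioned above. Comparing which ranks logged the timeout, and which did not, often narrows the search to the rank that never arrived at the collective.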

Common triggers

Distributed training produces large volumes of synchronized collective traffic (all-reduce, all-gather, all-to-all) between GPUs. When the fabric cannot consistently absorb those communication spikes, instability spreads across the cluster.

Packet loss in east-west (node-to-node) traffic.
Buffer congestion in leaf and spine switches when many flows converge at once (incast).
Inconsistent latency between nodes.
Interface and version misconfiguration — a wrong NCCL_SOCKET_IFNAME (NCCL binding to the management NIC instead of the high-speed fabric) or, on InfiniBand, an NCCL_IB_HCA pointed at the wrong adapter. Mismatched driver, CUDA, NCCL, or firmware versions across nodes produce the same timeout symptom while looking like a network fault.

Fabric type matters: InfiniBand vs. RoCE

Most production AI clusters run remote direct memory access (RDMA) over one of two fabrics, and the troubleshooting path differs between them:

InfiniBand — a purpose-built lossless fabric with credit-based flow control. Check subnet manager health, link errors, and congestion-control settings; on a timeout, confirm the correct HCA via NCCL_IB_HCA.
RoCEv2 (RDMA over Converged Ethernet) — RDMA on standard Ethernet, which only behaves well when the network is tuned for lossless transport. Misconfigured Priority Flow Control (PFC) or Explicit Congestion Notification (ECN / DCQCN), or an MTU left at 1500 instead of 9000, is a frequent and easily overlooked cause of stalls and retransmissions.

Identifying which fabric you run, and confirming its lossless settings are correct, often resolves "random" timeouts faster than any application-level change.

What to check first

Retransmission rates on network interfaces.
Switch buffer utilization and discard counters.
NCCL debug logs (set NCCL_DEBUG=INFO), confirming that NCCL_SOCKET_IFNAME / NCCL_IB_HCA bind to the fabric NICs.
Driver, CUDA, NCCL, and firmware versions, which should be identical across every node (nvidia-smi reports the driver and CUDA versions; the NCCL_DEBUG=INFO log prints the NCCL version).
On RoCE fabrics, PFC pause frames and ECN marking counters.
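
The environment-variable part of that checklist can be automated with a small pre-launch check. This is a sketch under stated assumptions: check_nccl_env is a hypothetical helper, the device names (eth0, ens1f0, mlx5_0) are placeholders for your own, and the matching covers only plain comma-separated prefix values, flagging the ^ and = forms for manual review.

```python
def _matches_any(value, known_names):
    """True if any comma-separated entry is a prefix of a known device name."""
    prefixes = [entry.split(":")[0] for entry in value.split(",") if entry]
    return any(name.startswith(p) for p in prefixes for name in known_names)


def check_nccl_env(env, fabric_ifaces, fabric_hcas):
    """Return a list of human-readable warnings about NCCL settings."""
    warnings = []
    for var, known in (("NCCL_SOCKET_IFNAME", fabric_ifaces),
                       ("NCCL_IB_HCA", fabric_hcas)):
        value = env.get(var)
        if value is None:
            warnings.append(f"{var} is unset; NCCL will choose for itself")
        elif value[:1] in ("^", "="):
            warnings.append(f"{var}={value} uses ^ or = syntax; review by hand")
        elif not _matches_any(value, known):
            warnings.append(f"{var}={value} does not match fabric devices {known}")
    if env.get("NCCL_DEBUG", "").upper() not in ("INFO", "TRACE"):
        warnings.append("NCCL_DEBUG is not INFO or TRACE; logs will lack detail")
    return warnings


# Hypothetical node: NCCL is bound to the management NIC by mistake.
fabric_ifaces = ["ens1f0", "ens1f1"]
fabric_hcas = ["mlx5_0", "mlx5_1"]
bad_env = {"NCCL_SOCKET_IFNAME": "eth0", "NCCL_IB_HCA": "mlx5_0:1"}
good_env = {"NCCL_SOCKET_IFNAME": "ens1f", "NCCL_IB_HCA": "mlx5",
            "NCCL_DEBUG": "INFO"}

bad_warnings = check_nccl_env(bad_env, fabric_ifaces, fabric_hcas)
good_warnings = check_nccl_env(good_env, fabric_ifaces, fabric_hcas)
```

On a real node you would pass os.environ as env and run the check on every rank, since one node with a different setting is enough to stall a collective.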

What engineers often miss

A network can pass basic health checks and still fail under AI workloads, because AI traffic is high-bandwidth, highly synchronized, and burst-heavy. Traditional monitoring, which samples averages over seconds, does not always catch the microbursts that actually cause the drops.
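
A worked example makes the gap between averages and microbursts concrete. The numbers are assumptions chosen for illustration: a 100 Gb/s egress port, a hypothetical 16 MB port buffer, light background traffic, and a 5-millisecond incast burst arriving from several senders at once.

```python
LINK_MB_PER_MS = 12.5   # 100 Gb/s = 12.5 GB/s = 12.5 MB per millisecond
BUFFER_MB = 16.0        # hypothetical per-port buffer

# MB offered to one egress port in each millisecond of a one-second window.
offered = [1.0] * 1000
for ms in range(500, 505):
    offered[ms] = 40.0  # incast: several senders converge for 5 ms

# What a once-per-second average would report.
average_utilization = sum(offered) / (LINK_MB_PER_MS * len(offered))

# What the switch actually experiences, millisecond by millisecond.
queue = 0.0
dropped = 0.0
for mb in offered:
    queue = max(0.0, queue + mb - LINK_MB_PER_MS)
    if queue > BUFFER_MB:
        dropped += queue - BUFFER_MB  # overflow beyond the buffer is lost
        queue = BUFFER_MB
```

Under these assumptions the one-second average is below 10 percent utilization, yet 121.5 MB overflows the buffer during the burst. A dashboard that samples averages shows a healthy link while the job sees drops and retransmissions.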

For background on the collective-communication library these clusters rely on, see NVIDIA's NCCL documentation.

Pattern 3: One Node Slower Than the Others

In distributed training, every node is expected to process its share at nearly the same speed. If one server slows down — because of hardware, thermal, storage, or network issues — the entire job waits for it at the next synchronization barrier. Even a small per-node gap compounds into a large slowdown across a big cluster.

Why this is critical

Synchronous training assumes uniform performance. Collective operations such as all-reduce complete only when the slowest participant finishes, so one underperforming node creates a ripple effect that gates every other GPU — the classic "straggler" problem.
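
The arithmetic behind the straggler effect is short enough to sketch. The sync_step helper and the node timings are hypothetical; the only modeling assumption is the one stated above, that a synchronous step ends when the slowest node finishes.

```python
def sync_step(node_compute_seconds):
    """Step time and cluster efficiency for one synchronous step."""
    step_time = max(node_compute_seconds)        # everyone waits for the slowest
    busy = sum(node_compute_seconds)             # useful GPU-seconds
    capacity = step_time * len(node_compute_seconds)
    return step_time, busy / capacity


# 63 healthy nodes at 1.00 s per step, one node 25 percent slower.
nodes = [1.00] * 63 + [1.25]
step_time, efficiency = sync_step(nodes)
```

With one node 25 percent slower, every step takes 1.25 seconds and roughly a fifth of the cluster's GPU time is spent waiting.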

Common causes

PCIe lane misconfiguration (for example, a card negotiating fewer lanes than expected).
NUMA misalignment between the GPU, its NIC, and the CPU handling its data.
Thermal throttling — a GPU that throttles to stay inside its power or temperature envelope becomes a permanent straggler. Hot-aisle nodes and failing fans are common triggers, and the slowdown persists silently rather than crashing the job.
Driver, firmware, or version drift: a single node running a mismatched NVIDIA driver, NCCL, CUDA, or VBIOS/firmware version can underperform or fail collectives while looking like a hardware fault. Version mismatches that nobody checked before launch are a frequent source of GPU-cluster issues.
Silent hardware degradation — rising ECC memory errors or an intermittent link can drag one node down before it fails outright; DCGM diagnostics surface these early.
Background processes consuming CPU, memory bandwidth, or I/O.
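
Version drift is the easiest of these causes to check mechanically once the versions are collected in one place. The sketch below assumes an inventory you have gathered yourself (for example from nvidia-smi and NCCL logs); find_version_drift is a hypothetical helper, the version strings are made-up examples, and it assumes the most common version on the cluster is the intended one.

```python
from collections import Counter


def find_version_drift(inventory):
    """Map each drifting node to {component: (found, expected)}.

    inventory has the shape {node: {component: version}}.
    """
    components = sorted({c for versions in inventory.values() for c in versions})
    drift = {}
    for component in components:
        counts = Counter(v.get(component) for v in inventory.values())
        expected, _ = counts.most_common(1)[0]  # majority version wins
        for node, versions in inventory.items():
            found = versions.get(component)
            if found != expected:
                drift.setdefault(node, {})[component] = (found, expected)
    return drift


# Hypothetical inventory: node03 carries an older NCCL build.
inventory = {
    "node01": {"driver": "550.54.15", "cuda": "12.4", "nccl": "2.21.5"},
    "node02": {"driver": "550.54.15", "cuda": "12.4", "nccl": "2.21.5"},
    "node03": {"driver": "550.54.15", "cuda": "12.4", "nccl": "2.19.3"},
    "node04": {"driver": "550.54.15", "cuda": "12.4", "nccl": "2.21.5"},
}
drift = find_version_drift(inventory)
```

Running a check like this before every launch costs seconds and removes one whole class of straggler from the investigation.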

Quick validation steps

Compare GPU utilization across nodes with nvidia-smi and look for the outlier.
Run NVIDIA DCGM diagnostics to check power, memory (ECC), PCIe, and thermal health.
Check CPU affinity and NUMA alignment for the GPU and its NIC.
Confirm driver, NCCL, CUDA, and firmware versions match the rest of the cluster.
Monitor temperature and power limits for throttling.
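
Looking for the outlier can be done by eye on a handful of nodes, but a robust statistic scales better. The sketch below uses the modified z-score (median and median absolute deviation), which a single bad node cannot skew the way it skews a mean. The find_outliers helper, the 3.5 cutoff, and the utilization figures are illustrative assumptions.

```python
from statistics import median


def find_outliers(per_node, threshold=3.5):
    """Return nodes whose value is far from the cluster median."""
    values = list(per_node.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        # Every typical node reports the same value; anything else stands out.
        return {node: v for node, v in per_node.items() if v != med}
    return {
        node: v for node, v in per_node.items()
        if 0.6745 * abs(v - med) / mad > threshold
    }


# Hypothetical average GPU utilization (percent) per node over one run.
utilization = {
    "node01": 97, "node02": 96, "node03": 98, "node04": 97,
    "node05": 95, "node06": 97, "node07": 71, "node08": 96,
}
outliers = find_outliers(utilization)
```

Here only node07 is flagged. The same function works for step time, temperature, or power draw, as long as healthy nodes are expected to report similar values.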

Pro Tip
If one node is consistently slower, isolate it and run a standalone single-node benchmark. If it still lags on its own, you have found your culprit — and you have ruled out the fabric as the cause.

Storage: Why Checkpointing Becomes the Bottleneck

Checkpointing deserves a closer look because it is the single most common hidden cause of performance stalls. Training is not continuous: it periodically stops computing and starts saving, and that save is where storage shows its limits.

The three pressures checkpointing puts on storage

In a typical run across many GPUs and nodes, checkpointing triggers a synchronized spike in writes. All ranks try to persist large state at once, which creates three distinct pressures:

Burst saturation — storage tuned for steady throughput chokes on a sudden, synchronized spike.
Metadata congestion — creating or updating many files in parallel overwhelms filesystem metadata operations.
I/O synchronization — GPUs cannot proceed until writes confirm, so storage latency converts directly into GPU idle time.
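
A back-of-the-envelope estimate shows how quickly the burst turns into idle time. All inputs are assumptions for illustration: a 70-billion-parameter model, 12 bytes of saved state per parameter (FP32 weights plus two FP32 Adam moments), 20 GB/s of sustained aggregate write throughput, and a synchronous checkpoint every 30 minutes.

```python
def checkpoint_stall_seconds(num_params, bytes_per_param, write_gb_per_s):
    """Checkpoint size in GB and the best-case synchronous write time."""
    size_gb = num_params * bytes_per_param / 1e9
    return size_gb, size_gb / write_gb_per_s


size_gb, stall_s = checkpoint_stall_seconds(70e9, 12, 20.0)

# Fraction of wall-clock time lost if this happens every 30 minutes.
overhead = stall_s / (30 * 60)
```

That works out to an 840 GB checkpoint and a 42-second pause per checkpoint, and it is a best case: it assumes storage sustains its full throughput under a synchronized burst, with no metadata congestion and no tail-latency spikes.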

If the timing matches the checkpoint cadence, the cause is almost certainly storage — not a failed drive, but storage that cannot absorb burst writes at low latency. Measure these three metrics during a checkpoint event, before investigating anything else:

Metric	What to look for
Write latency (p99)	Sustained tail-latency spikes during checkpoints point to trouble; calibrate the threshold to your storage class (parallel NVMe filesystems should stay far lower than a general-purpose NFS tier).
Parallel write throughput	Compare aggregate multi-node throughput with single-stream throughput multiplied by the number of writers; a large shortfall reveals a fabric or storage-controller limit.
Metadata operation time	Slow file-create or stat calls degrade checkpointing even when raw bandwidth looks fine.
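
To see why the table asks for p99 rather than an average, consider a small set of write latencies. The sketch uses the nearest-rank percentile definition; the percentile helper and the latency samples (980 fast writes, 20 slow ones) are illustrative assumptions.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical write latencies in milliseconds during one checkpoint.
latencies_ms = [4] * 980 + [2500] * 20

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

The median is 4 ms and the mean is about 54 ms, both of which look tolerable, while p99 is 2500 ms. Because every rank waits for its slowest write, the p99 figure is the one the training job actually feels.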

What to Measure Before Blaming the GPUs

GPUs are blamed first because they are the most visible and most expensive part of the cluster, yet they are usually idle — waiting on storage, networking, or a slow peer — rather than computing. Before concluding the accelerators are at fault, check five things across the pipeline:

GPU utilization — with nvidia-smi, though note it only shows that a kernel is running, not useful work; pair it with model-FLOPs-utilization (MFU) where possible.
Network — packet loss, retransmits, and tail latency, not just average throughput.
Storage — IOPS under load, write latency during checkpoints, and queue depth.
CPU and memory — preprocessing saturation, memory bandwidth, and NUMA alignment.
Per-node variance — the slowest node sets the pace, so hunt for outliers rather than trusting cluster averages.

Mistakes to Avoid

Even experienced engineers fall into predictable traps that waste hours and point at the wrong root cause. Avoid these three:

Looking only at average bandwidth. Storage might average 10 GB/s and still stall training because p99 write latency spikes to seconds during checkpoints. Always measure the slowest 1% of operations.
Troubleshooting AI traffic like ordinary server traffic. Enterprise traffic is asynchronous and burst-tolerant. AI training is tightly synchronized: one slow node or dropped packet delays every GPU. Ping tests and average-throughput graphs will not reveal the problem.
Assuming the GPUs are at fault. Most stalls trace back to storage checkpoint bursts, network packet loss, or a single misconfigured node. Before blaming GPUs, measure write latency, retransmission rates, and per-node utilization variance.

From Diagnosis to Hardware

When the evidence points to the fabric (high packet loss, switch buffer congestion, or inconsistent latency), the answer is usually hardware, not another round of driver or NCCL tuning. If a fabric bottleneck is confirmed, consider:

Switching. Deeper-buffer switches absorb synchronized traffic bursts without dropping packets, which is why shallow buffers are a leading cause of fabric congestion in AI clusters. Buffer depth is not the whole story, though — it works alongside congestion control (ECN / DCQCN on RoCE, credit-based control on InfiniBand), so size buffers and tune congestion management together rather than treating either alone as a silver bullet.
Optics. Faulty or low-quality transceivers introduce bit errors that trigger retransmissions. Replace optics showing corrected errors or signal degradation.
Topology. Move from an oversubscribed leaf-spine to a non-blocking or rail-optimized design. Every hop removed and every point of oversubscription eliminated reduces tail latency.
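
Oversubscription is easy to quantify for a single leaf switch: divide the bandwidth facing the servers by the bandwidth facing the spine. The oversubscription helper and the port counts below are hypothetical examples, not a recommendation for any specific switch.

```python
from fractions import Fraction


def oversubscription(downlinks, downlink_gbps, uplinks, uplink_gbps):
    """Ratio of server-facing to spine-facing bandwidth on one leaf switch."""
    return Fraction(downlinks * downlink_gbps, uplinks * uplink_gbps)


# 32 server ports and 8 uplinks, all at 400 Gb/s: a 4:1 oversubscribed leaf.
oversubscribed = oversubscription(32, 400, 8, 400)

# 32 server ports and 32 uplinks at the same speed: 1:1, non-blocking.
non_blocking = oversubscription(32, 400, 32, 400)
```

Any ratio above 1:1 means that when every server transmits at once, as happens during a collective, some traffic has to queue or drop at the leaf.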

For how to size buffers, choose between InfiniBand and RoCE, and design a non-blocking fabric from the start, see our pillar guide, GPU Cluster Networking for On-Prem AI. A note on scope: the backend GPU fabric itself runs on specialized NVIDIA, Broadcom, and Arista silicon that NDI does not resell, and we will tell you so. Where we help is everything around that fabric: the cabling it depends on, including DAC / twinaxial cables, plus the front-end, management, and out-of-band switching every cluster still needs. Each unit ships factory-sealed with a tracking number and is backed by engineering support.

Summary

AI training stalls are rarely caused by GPU hardware failure. They almost always trace to storage checkpoint spikes, network instability, or a single slow node — so diagnose by symptom: checkpoint write latency for stalls near completion, retransmissions and NCCL logs for random timeouts, and per-node variance (via nvidia-smi and DCGM) for a straggler. Measure tail latency, retransmission rates, and per-node variance before touching the GPUs, and when the evidence points to the fabric, the fix is hardware: switching, optics, or topology.

FAQs

1. Why does AI training stall near the end of an epoch?

Stalls near epoch boundaries are usually caused by checkpointing. At the end of an epoch the job writes large model and optimizer state to storage, and many nodes do this at once. If the storage layer cannot absorb that synchronized write burst at low latency, the GPUs sit idle waiting for the writes to finish, which appears as a stall.

2. Are AI training stalls usually caused by the GPUs?

No. In most cases the GPUs are healthy but idle, waiting on storage, the network fabric, or a single slow node. Before replacing or blaming accelerators, measure checkpoint write latency, network retransmission rates, and per-node utilization variance to find what the GPUs are actually waiting on.

3. What is checkpointing in AI training?

Checkpointing is the periodic saving of a training job's state — model weights, optimizer state, and progress counters — to persistent storage. It works like an auto-save: if a node crashes or a job times out, training resumes from the last checkpoint instead of starting over. The trade-off is a burst of heavy write traffic each time a checkpoint is taken.

4. What network problems cause distributed training timeouts?

Common culprits are packet loss in east-west traffic, switch buffer congestion from incast, and inconsistent latency between nodes. On RoCE fabrics, misconfigured Priority Flow Control or ECN is a frequent cause. Check retransmission rates, switch buffer and discard counters, and NCCL debug logs to confirm.

5. How do I diagnose whether it's the GPU, the network, or storage?

Work by symptom. Stalls on a regular cadence near epoch boundaries point to storage and checkpointing — measure p99 write latency. Random timeouts (NCCL errors) point to the network fabric — check retransmissions, switch discards, and NCCL logs. One consistently slow node points to hardware, NUMA, thermal, or version drift — compare per-node utilization with nvidia-smi and run DCGM diagnostics. Measure all three before replacing any hardware.
