AI at scale: Architecting Scalable, Deployable and Resilient
Infrastructure
Pratik Mishra
AMD
September 20, 2025
Alluxio AI/ML Meet-up
San Francisco, CA
Agenda
Disclaimer: Please refer to the Copyrights and Disclaimer in the presentation. We have tried to cite the most relevant sources. We (the authors and the associated organization) bear no responsibility for the accuracy of the content or its claims; it should be viewed as personal viewpoints/opinions intended to foster open discussion.
• AI Deployments and Challenges
• Infrastructure, Reliability, and Foundation Model Training
• Conclusion
SDC’24, FMS’25, SDC’25
AI Infrastructure: Deployments and Challenges
How to train a dragon model?
[Pipeline figure, reconstructed:]
• Data & the 5Vs → Data Storage: stream bulk "objects" to clouds/data-centers
• Data Ingestion → Data Preparation: ETL into training-accessible formats, annotation, indexing, etc.
• Training ("the magic", on GPUs): model set-up (training strategies), execution (run training), persistence (save/load checkpoints), validation & monitoring
• Foundation Model Deployment (on GPUs): deploy the FM for downstream tasks such as fine-tuning, post-training, and inference endpoints
• Tasks & Users (UXI): prompts, agent interactions, etc.
AI developer priorities: fast model convergence, efficient algorithm design, and rapid deployment to accelerate time-to-market.
But think deeper: maximize GPU utilization, minimize stalls, optimize throughput, and reduce latency to drive "real" ROI.
AI Tech Stack: a 100,000-ft bird's-eye view
[Stack diagram, reconstructed top to bottom:]
• AI Developers and Applications: pre-training, fine-tuning, inference, post-training, agents
• Training & Inference Frameworks (PyTorch, TensorFlow, vLLM, SGLang)
• Distributed AI Compute Managers (Ray, Spark, etc.)
• Model Deployment (k8s, slurm) & Container Orchestrators
• Data Storage and Management (multi-modal data): ingestion, processing, labeling, archive, data lake, VectorDB, over file/block/object stores
• Compute Infrastructure (GPU, Networks, Memory, Local Storage): GPU, CPU, NIC/DPU, frontend + backend networks
• CSPs and/or on-prem infrastructure
What do developers need to care about?
• The (highly simplified) AI tech stack
• Access to tools, infrastructure, and deployments
• Most importantly, access to SOTA GPUs
On top of all that, ecosystems with closed stacks limit innovation and flexibility and raise the barrier to entry.
Sovereign AI Case Study:
Motif Technologies' Multi-modal Training with the AMD Ecosystem
Motif Technologies Training Infrastructure powered by AMD
Motif Technologies (South Korea) runs multi-modal AI workloads on AMD Instinct MI250
GPUs using AMD-optimized Docker containers with SkyPilot orchestration.
Motif Technologies: AMD Developer Cloud with MI300X
Disclaimer: The performance metrics and results presented are based on partner-provided data and have not been independently verified by AMD. These figures are shared as-is and may vary depending on system configuration, workload characteristics, and optimization levels. AMD makes no
representations or warranties regarding the accuracy or completeness of third-party performance claims.
AI for ALL:
A democratized platform with an open and optimized AI ecosystem and access to SOTA AMD GPUs fosters innovation, especially for startups, researchers, and emerging markets.
Motif 2.6B on 1x MI250 vs. 1x MI300X: ~5x throughput gains on a single MI300X, with larger batches, etc.
Motif Kernel: https://huggingface.co/Motif-Technologies/activation
Call to Action!
Built by developers, for developers.
• AMD is building for you; come build on us.
• Commitment to an open AI ecosystem
• Coverage of the full AI lifecycle
• Industry-leading GPU technology
AI Infrastructure: Reliability and Scalability
AI Training Infra Reliability 101: Metrics
• Training Goodput = actual progress made / total time
• Effective Training Time Ratio (ETTR) = actual training time / total time
• Model FLOPs Utilization (MFU) = FLOPs the model utilizes / peak HW FLOPs available
• Mean Time Between Failures (MTBF) = total time / number of failures
Achieving high training goodput and maximizing model FLOPs utilization to improve the
Effective Training Time Ratio remains a significant and ongoing challenge.
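For concreteness, a minimal sketch (Python, with hypothetical numbers) of how these four metrics fall out of raw job telemetry; the function and its arguments are assumptions for illustration, not an existing API:

```python
def training_metrics(total_hours, productive_hours, steps_completed,
                     achieved_tflops, peak_tflops, num_failures):
    """Compute the four reliability/efficiency metrics defined above."""
    goodput = steps_completed / total_hours          # actual progress / total time
    ettr = productive_hours / total_hours            # actual training time / total time
    mfu = achieved_tflops / peak_tflops              # model FLOPs / peak HW FLOPs
    mtbf_hours = total_hours / max(num_failures, 1)  # total time / # of failures
    return goodput, ettr, mfu, mtbf_hours

# Hypothetical week-long run: 168 h wall-clock, 120 h of useful training,
# 40k optimizer steps, 350 of 1000 peak TFLOP/s achieved, 12 failures.
print(training_metrics(168, 120, 40_000, 350, 1000, 12))
# -> goodput ~238 steps/h, ETTR ~0.71, MFU 0.35, MTBF 14 h
```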
Failures and Training Efficiency?
Reliability and Training Efficiency @scale
As AI deployments grow in scale, MTBF decreases significantly. Resiliency is therefore core to achieving training efficiency and increasing training goodput and ETTR.
[Figure: projected MTBF (log scale, normalized minutes) vs. number of accelerators, with ticks spanning years, ~3-6 months, <1-3 months, <24 hours, <30 minutes, and <5 minutes as deployments grow from node to rack, cluster, and data-center scale. Projections of AI training system failures at scale are not specific to any accelerator; failures arise across millions to billions of components in the SW & HW stacks of the data-center hierarchy.]

MTBF ∝ 1 / (number of accelerators)
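A minimal sketch of this scaling law, assuming independent component failures so that fleet-level MTBF is per-accelerator MTBF divided by accelerator count; the numbers are illustrative, not measurements:

```python
def fleet_mtbf_hours(per_accelerator_mtbf_hours: float, num_accelerators: int) -> float:
    """MTBF ∝ 1/N under the independent-failure assumption."""
    return per_accelerator_mtbf_hours / num_accelerators

# Illustrative: if a single accelerator fails once every ~5 years (~43,800 h),
# a 100,000-accelerator fleet sees a failure roughly every 26 minutes.
print(fleet_mtbf_hours(43_800, 100_000) * 60, "minutes")
```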
Fault Tolerance, Training Efficiency and Checkpointing
• Fault tolerance, resiliency, and recovery are of utmost importance for the training-efficiency metrics discussed earlier
• Checkpointing is the critical fault-tolerance mechanism: periodically persist training snapshots to enable recovery via rollback in the event of failure
• Checkpoints also serve hardware refresh, resource re-balancing, post-training, concurrent evaluation, improving accuracy, etc.
• Checkpointing is the storage community's poster-child AI use case
With scale and ever-lowering MTBFs, checkpointing frequency, size, and complexity increase significantly, imposing a heavy data-center tax (GPU underutilization).
Fault Tolerance Tax: Checkpointing
[Figure: training timeline alternating progress intervals (T_progress_1 … T_progress_n) with checkpoint saves (T_chkpt_save_1 … T_chkpt_save_n); on failure, add lost iterations (T_itr_lost), recovery (T_recovery), and checkpoint load (T_chkpt_load).]
FT_overhead = T_chkpt_save + T_itr_lost + T_recovery + T_chkpt_load
ETTR = 1 - FT_overhead (with FT_overhead expressed as a fraction of total time)
• Achieving optimal ETTR at data-center scale is the "real" challenge
• Without optimization, systems may spend more time managing failures than actually training
• Trade-off: excessive checkpointing increases the data-center tax, while infrequent checkpointing increases risk (cost)
• The data-center tax spans compute, network, and storage
Therefore, to achieve optimal ETTR (and goodput), reliability mechanisms must strike a balance between performance, scalability, and cost-effectiveness.
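One standard way to reason about this trade-off (not from the slides, but the classic Young/Daly first-order approximation) picks the checkpoint interval that balances save cost against expected rework after a failure:

```python
import math

def optimal_checkpoint_interval_secs(chkpt_save_secs: float, mtbf_secs: float) -> float:
    """Young/Daly optimum: interval = sqrt(2 * save_cost * MTBF).
    Shorter MTBF or cheaper saves push toward more frequent checkpoints."""
    return math.sqrt(2 * chkpt_save_secs * mtbf_secs)

# Illustrative: 60 s to persist a checkpoint and a 30-minute cluster MTBF
# suggest checkpointing roughly every 7.7 minutes.
print(optimal_checkpoint_interval_secs(60, 30 * 60) / 60, "minutes")
```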
Checkpoint (save) = Serialization + Persistence
SDC’24, SDC’25, FMS’25
Synchronous checkpointing: the main training thread waits until the checkpoint is persisted
• Short, periodic, bursty writes
• Over-subscribes front-end NICs and storage infrastructure
• GPUs stall until training can resume
Asynchronous checkpointing: the main training thread is alleviated from IO persistence (see the sketch below)
• Overlaps IO with computation
• Reduces peak pressure on network and storage by "buffering"
• Still not truly asynchronous (IO verbs!)
Existing implementations need further optimization to reduce at-scale overheads. Reliable, unified memory + storage tiering is essential for masking I/O and communication overheads with computation. Example: local NVMe → PFS → object store, or combinations.
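A minimal sketch of the asynchronous pattern in PyTorch, assuming a single-process job: snapshot tensors to host memory (a brief blocking copy), then persist from a background thread so IO overlaps the next training steps. Illustrative only, not any specific library's API:

```python
import threading
import torch

def async_checkpoint(model: torch.nn.Module, step: int, path: str) -> threading.Thread:
    # Brief blocking phase: clone parameters off the GPU so the training
    # loop can keep mutating the live weights while IO proceeds.
    snapshot = {
        "step": step,
        "model": {k: v.detach().cpu().clone() for k, v in model.state_dict().items()},
        # Optimizer state would be snapshotted the same way.
    }
    # Background persistence: torch.save runs off the critical path.
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    return writer  # join() before the next save to bound in-flight buffers
```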
Checkpoint (load) = Loading + Deserialization
• Loading checkpoints is mission-critical: all GPUs simultaneously load their states to resume training
• Massive IO amplification compared to saves
• Deserialization overheads are massive
• Concurrent loading can destabilize the entire infrastructure
• Loads also serve downstream tasks: post-training, inference, etc.
• Optimizations:
• GPU-GPU network-aware checkpoint loading
• Metadata optimizations (unpickling) and file formats (see the sketch below)
• Predictive storage tiering
Efficient, fault-tolerant checkpoint loading at scale requires GPU-storage path optimizations and topology-aware strategies to sustain robust infrastructure and high MFU.
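As one concrete example of the file-format point, a hedged sketch: memory-mapped loading keeps eager deserialization off the critical path and pages tensors in on demand, softening the read storm when many ranks restore at once. torch.load accepts mmap=True for zipfile-format checkpoints in recent PyTorch (2.1+); the path and model are stand-ins:

```python
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for the real model

# Memory-map the checkpoint instead of eagerly deserializing every tensor
# into host RAM; pages fault in on demand as tensors are copied into the model.
state = torch.load("/mnt/ckpt/step_40000.pt", map_location="cpu",
                   mmap=True, weights_only=True)
model.load_state_dict(state["model"])
```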
Data Movement: The Necessary Evil!
The goal is to maximize GPU utilization while reducing the impact of data entropy. Large amounts of data must move within and across nodes, servers, racks, and even data-centers, in all directions (E-W, N-S).
Fault-Tolerance and Reliability in Cloud AI
Collaborators: Tian Xia (PhD student) and Zhifei Li (visiting research student), with Dr. Ion Stoica
Sky Computing Lab, UC Berkeley
AI Training in the Cloud
Training interruptions are common (as discussed earlier):
• VM failures due to HW or SW faults in allocated servers
• VM preemptions/re-allocations to different locations (servers, regions, etc.)
Emerging use case: spot instances
• Significant cost-effectiveness across regions and clouds
• Particularly useful for offline training jobs
• However, preemptions can happen at any moment
How do we recover efficiently, retaining the cost savings while striking a balance between performance and scalability across cloud networks?
Slide credit: Tian Xia, UC Berkeley
Spot-Training Resumption: Checkpoint Migration
Checkpoint migration enables spot-instance recovery by overlapping instance startup with checkpoint transfer and loading across regions or geographic boundaries.
Lots of dynamically moving parts: which location to resume in, data egress costs, and moving and loading the checkpoints. How do we achieve high training throughput and ETTR on a tight cost and time budget? (A hedged sketch follows below.)
Slide credit: Tian Xia, UC Berkeley
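A minimal, hypothetical cost model for choosing where to resume; every name and number here is an assumption for illustration, not the collaborators' actual system. Startup overlaps transfer, so the effective resume delay is max(startup, transfer) plus load time:

```python
def pick_resume_region(regions, ckpt_gb, budget_usd):
    """Pick the region minimizing resume latency under an egress-cost budget."""
    best = None
    for r in regions:
        transfer_s = ckpt_gb * 8 / r["gbps"]          # checkpoint transfer time
        egress_usd = ckpt_gb * r["egress_usd_per_gb"]
        latency_s = max(r["startup_s"], transfer_s) + r["load_s"]
        if egress_usd <= budget_usd and (best is None or latency_s < best[0]):
            best = (latency_s, r["name"], egress_usd)
    return best

regions = [
    {"name": "us-east", "gbps": 10, "egress_usd_per_gb": 0.02, "startup_s": 90, "load_s": 60},
    {"name": "eu-west", "gbps": 5,  "egress_usd_per_gb": 0.05, "startup_s": 60, "load_s": 60},
]
# 500 GB checkpoint under a $15 egress budget -> ("us-east", ~460 s, $10)
print(pick_resume_region(regions, ckpt_gb=500, budget_usd=15))
```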
Need for a Unified Global Storage System
A unified geo-distributed storage system can reduce the north-south data-entropy tax while optimizing compute, network, and storage utilization, balancing infrastructure constraints for GPU-accelerated AI workloads.
Conclusion
Unlocking the full potential of GPU-accelerated AI requires overcoming key barriers.
The community must unite to innovate and strike a balance between performance, scalability, and cost
with an open AI ecosystem—building an inclusive AI future for all.
Thank you!
Pratik Mishra | AMD
COPYRIGHT AND DISCLAIMER
©2025 Advanced Micro Devices, Inc. All rights reserved.
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for
identification purposes only and may be trademarks of their respective companies.
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product differences between differing manufacturers, software changes, BIOS
flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD
assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes
from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND
ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY
DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT
WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM
THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.