AI at scale: Architecting Scalable, Deployable and Resilient
Infrastructure
Pratik Mishra
AMD
September 20, 2025
Alluxio AI/ML Meet-up
San Francisco, CA
Agenda
Disclaimer: Please refer to the Copyrights and Disclaimer in the presentation. We have tried to cite the most relevant sources. We (the authors and the associated organization) bear no responsibility for the accuracy of the content or its claims; it should be viewed as personal viewpoints/opinions intended to foster open discussion.
• AI Deployments and Challenges
• Infrastructure, Reliability, and Foundation Model Training
• Conclusion
SDC’24, FMS’25, SDC’25
AI Infrastructure: Deployments and Challenges
How to train a dragon model?
[Pipeline figure, reconstructed:]
• Data & the 5Vs → Data Storage: stream bulk "objects" to clouds/data-centers
• Data Ingestion → Data Preparation: ETL into training-accessible formats, annotation, indexing, etc.
• Training ("the magic", on GPUs): model set-up (training strategies), execution (run training), persistence (save/load checkpoints), validation & monitoring
• Foundation Model Deployment (on GPUs): deploy the FM for downstream tasks such as fine-tuning, post-training, and inference endpoints
• Tasks & Users (UXI): prompts, agent interactions, etc.
AI developer priorities: fast model convergence, efficient algorithm design, and rapid deployment to accelerate time-to-market.
But think deeper: maximize GPU utilization, minimize stalls, optimize throughput, and reduce latency to drive "real" ROI.
AI Tech Stack: a 100,000-ft bird's-eye view
[Stack diagram, reconstructed top to bottom:]
• AI Developers and Applications: pre-training, fine-tuning, inference, post-training, agents
• Training & Inference Frameworks (PyTorch, TensorFlow, vLLM, SGLang)
• Distributed AI Compute Managers (Ray, Spark, etc.)
• Model Deployment (k8s, slurm) & Container Orchestrators
• Data Storage and Management (multi-modal data): ingestion, processing, labeling, archive, data lake, VectorDB, over file/block/object stores
• Compute Infrastructure (GPU, Networks, Memory, Local Storage): GPU, CPU, NIC/DPU, frontend + backend networks
• CSPs and/or on-prem infrastructure
What do developers need to care about?
• The (highly simplified) AI tech stack
• Access to tools, infrastructure, and deployments
• Most importantly, access to SOTA GPUs
On top of all that, ecosystems with closed stacks limit innovation and flexibility and raise the barrier to entry.
Sovereign AI Case Study:
Motif Technologies' Multi-modal Training with the AMD Ecosystem
Motif Technologies Training Infrastructure powered by AMD
Motif Technologies (South Korea) runs multi-modal AI workloads on AMD Instinct MI250
GPUs using AMD-optimized Docker containers with SkyPilot orchestration.
Motif Technologies: AMD Developer Cloud with MI300X
Disclaimer: The performance metrics and results presented are based on partner-provided data and have not been independently verified by AMD. These figures are shared as-is and may vary depending on system configuration, workload characteristics, and optimization levels. AMD makes no
representations or warranties regarding the accuracy or completeness of third-party performance claims.
AI for ALL:
A democratized platform with an open and optimized AI ecosystem and access to SOTA AMD GPUs fosters innovation, especially for startups, researchers, and emerging markets.
Motif 2.6B on 1x MI250 vs. 1x MI300X: ~5x throughput gains on a single MI300X, with larger batches, etc.
Motif Kernel: https://huggingface.co/Motif-Technologies/activation
Call to Action!
Built by developers, for developers.
• AMD is building for you; come build on us.
• Commitment to an open AI ecosystem
• Coverage of the full AI lifecycle
• Industry-leading GPU technology
AI Infrastructure: Reliability and Scalability
AI Training Infra Reliability 101: Metrics
• Training Goodput = actual progress made / total time
• Effective Training Time Ratio (ETTR) = actual training time / total time
• Model FLOPs Utilization (MFU) = FLOPs the model utilizes / peak HW FLOPs available
• Mean Time Between Failures (MTBF) = total time / number of failures
Achieving high training goodput and maximizing model FLOPs utilization to improve the
Effective Training Time Ratio remains a significant and ongoing challenge.
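For concreteness, a minimal sketch (Python, with hypothetical numbers) of how these four metrics fall out of raw job telemetry; the function and its arguments are assumptions for illustration, not an existing API:

```python
def training_metrics(total_hours, productive_hours, steps_completed,
                     achieved_tflops, peak_tflops, num_failures):
    """Compute the four reliability/efficiency metrics defined above."""
    goodput = steps_completed / total_hours          # actual progress / total time
    ettr = productive_hours / total_hours            # actual training time / total time
    mfu = achieved_tflops / peak_tflops              # model FLOPs / peak HW FLOPs
    mtbf_hours = total_hours / max(num_failures, 1)  # total time / # of failures
    return goodput, ettr, mfu, mtbf_hours

# Hypothetical week-long run: 168 h wall-clock, 120 h of useful training,
# 40k optimizer steps, 350 of 1000 peak TFLOP/s achieved, 12 failures.
print(training_metrics(168, 120, 40_000, 350, 1000, 12))
# -> goodput ~238 steps/h, ETTR ~0.71, MFU 0.35, MTBF 14 h
```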
Failures and Training Efficiency?
Reliability and Training Efficiency @scale
As AI deployments grow in scale, MTBF decreases significantly. Resiliency is therefore core to achieving training efficiency and increasing training goodput and ETTR.
[Figure: projected MTBF (log scale, normalized minutes) vs. number of accelerators, with ticks spanning years, ~3-6 months, <1-3 months, <24 hours, <30 minutes, and <5 minutes as deployments grow from node to rack, cluster, and data-center scale. Projections of AI training system failures at scale are not specific to any accelerator; failures arise across millions to billions of components in the SW & HW stacks of the data-center hierarchy.]

MTBF ∝ 1 / (number of accelerators)
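A minimal sketch of this scaling law, assuming independent component failures so that fleet-level MTBF is per-accelerator MTBF divided by accelerator count; the numbers are illustrative, not measurements:

```python
def fleet_mtbf_hours(per_accelerator_mtbf_hours: float, num_accelerators: int) -> float:
    """MTBF ∝ 1/N under the independent-failure assumption."""
    return per_accelerator_mtbf_hours / num_accelerators

# Illustrative: if a single accelerator fails once every ~5 years (~43,800 h),
# a 100,000-accelerator fleet sees a failure roughly every 26 minutes.
print(fleet_mtbf_hours(43_800, 100_000) * 60, "minutes")
```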
Fault Tolerance, Training Efficiency and Checkpointing
• Fault tolerance, resiliency, and recovery are of utmost importance for the training-efficiency metrics discussed earlier
• Checkpointing is the critical fault-tolerance mechanism: periodically persist training snapshots to enable recovery via rollback in the event of failure
• Checkpoints also serve hardware refresh, resource re-balancing, post-training, concurrent evaluation, improving accuracy, etc.
• Checkpointing is the storage community's poster-child AI use case
With scale and ever-lowering MTBFs, checkpointing frequency, size, and complexity increase significantly, imposing a heavy data-center tax (GPU underutilization).
Fault Tolerance Tax: Checkpointing
[Figure: training timeline alternating progress intervals (T_progress_1 … T_progress_n) with checkpoint saves (T_chkpt_save_1 … T_chkpt_save_n); on failure, add lost iterations (T_itr_lost), recovery (T_recovery), and checkpoint load (T_chkpt_load).]
FT_overhead = T_chkpt_save + T_itr_lost + T_recovery + T_chkpt_load
ETTR = 1 - FT_overhead (with FT_overhead expressed as a fraction of total time)
• Achieving optimal ETTR at data-center scale is the "real" challenge
• Without optimization, systems may spend more time managing failures than actually training
• Trade-off: excessive checkpointing increases the data-center tax, while infrequent checkpointing increases risk (cost)
• The data-center tax spans compute, network, and storage
Therefore, to achieve optimal ETTR (and goodput), reliability mechanisms must strike a balance between performance, scalability, and cost-effectiveness.
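One standard way to reason about this trade-off (not from the slides, but the classic Young/Daly first-order approximation) picks the checkpoint interval that balances save cost against expected rework after a failure:

```python
import math

def optimal_checkpoint_interval_secs(chkpt_save_secs: float, mtbf_secs: float) -> float:
    """Young/Daly optimum: interval = sqrt(2 * save_cost * MTBF).
    Shorter MTBF or cheaper saves push toward more frequent checkpoints."""
    return math.sqrt(2 * chkpt_save_secs * mtbf_secs)

# Illustrative: 60 s to persist a checkpoint and a 30-minute cluster MTBF
# suggest checkpointing roughly every 7.7 minutes.
print(optimal_checkpoint_interval_secs(60, 30 * 60) / 60, "minutes")
```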
Checkpoint (save) = Serialization + Persistence
SDC’24, SDC’25, FMS’25
Synchronous checkpointing: the main training thread waits until the checkpoint is persisted
• Short, periodic, bursty writes
• Over-subscribes front-end NICs and storage infrastructure
• GPUs stall until training can resume
Asynchronous checkpointing: the main training thread is alleviated from IO persistence (see the sketch below)
• Overlaps IO with computation
• Reduces peak pressure on network and storage by "buffering"
• Still not truly asynchronous (IO verbs!)
Existing implementations need further optimization to reduce at-scale overheads. Reliable, unified memory + storage tiering is essential for masking I/O and communication overheads with computation. Example: local NVMe → PFS → object store, or combinations.
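A minimal sketch of the asynchronous pattern in PyTorch, assuming a single-process job: snapshot tensors to host memory (a brief blocking copy), then persist from a background thread so IO overlaps the next training steps. Illustrative only, not any specific library's API:

```python
import threading
import torch

def async_checkpoint(model: torch.nn.Module, step: int, path: str) -> threading.Thread:
    # Brief blocking phase: clone parameters off the GPU so the training
    # loop can keep mutating the live weights while IO proceeds.
    snapshot = {
        "step": step,
        "model": {k: v.detach().cpu().clone() for k, v in model.state_dict().items()},
        # Optimizer state would be snapshotted the same way.
    }
    # Background persistence: torch.save runs off the critical path.
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    return writer  # join() before the next save to bound in-flight buffers
```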
Checkpoint (load) = Loading + Deserialization
• Loading checkpoints is mission-critical: all GPUs simultaneously load their states to resume training
• Massive IO amplification compared to saves
• Deserialization overheads are massive
• Concurrent loading can destabilize the entire infrastructure
• Loads also serve downstream tasks: post-training, inference, etc.
• Optimizations:
• GPU-GPU network-aware checkpoint loading
• Metadata optimizations (unpickling) and file formats (see the sketch below)
• Predictive storage tiering
Efficient, fault-tolerant checkpoint loading at scale requires GPU-storage path optimizations and topology-aware strategies to sustain robust infrastructure and high MFU.
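As one concrete example of the file-format point, a hedged sketch: memory-mapped loading keeps eager deserialization off the critical path and pages tensors in on demand, softening the read storm when many ranks restore at once. torch.load accepts mmap=True for zipfile-format checkpoints in recent PyTorch (2.1+); the path and model are stand-ins:

```python
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for the real model

# Memory-map the checkpoint instead of eagerly deserializing every tensor
# into host RAM; pages fault in on demand as tensors are copied into the model.
state = torch.load("/mnt/ckpt/step_40000.pt", map_location="cpu",
                   mmap=True, weights_only=True)
model.load_state_dict(state["model"])
```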
Data Movement: The Necessary Evil!
The goal is to maximize GPU utilization while reducing the impact of data entropy. Large amounts of data must move within and across nodes, servers, racks, and even data-centers, in all directions (E-W, N-S).
Fault-Tolerance and Reliability in Cloud AI
Collaborators: Tian Xia (PhD student) and Zhifei Li (visiting research student), with Dr. Ion Stoica
Sky Computing Lab, UC Berkeley
AI Training in the Cloud
Training interruptions are common (as discussed earlier):
• VM failures due to HW or SW faults in allocated servers
• VM preemptions/re-allocations to different locations (servers, regions, etc.)
Emerging use case: spot instances
• Significant cost-effectiveness across regions and clouds
• Particularly useful for offline training jobs
• However, preemptions can happen at any moment
How do we recover efficiently, retaining the cost savings while striking a balance between performance and scalability across cloud networks?
Slide credit: Tian Xia, UC Berkeley
Spot-Training Resumption: Checkpoint Migration
Checkpoint migration enables spot-instance recovery by overlapping instance startup with checkpoint transfer and loading across regions or geographic boundaries.
Lots of dynamically moving parts: which location to resume in, data egress costs, and moving and loading the checkpoints. How do we achieve high training throughput and ETTR on a tight cost and time budget? (A hedged sketch follows below.)
Slide credit: Tian Xia, UC Berkeley
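A minimal, hypothetical cost model for choosing where to resume; every name and number here is an assumption for illustration, not the collaborators' actual system. Startup overlaps transfer, so the effective resume delay is max(startup, transfer) plus load time:

```python
def pick_resume_region(regions, ckpt_gb, budget_usd):
    """Pick the region minimizing resume latency under an egress-cost budget."""
    best = None
    for r in regions:
        transfer_s = ckpt_gb * 8 / r["gbps"]          # checkpoint transfer time
        egress_usd = ckpt_gb * r["egress_usd_per_gb"]
        latency_s = max(r["startup_s"], transfer_s) + r["load_s"]
        if egress_usd <= budget_usd and (best is None or latency_s < best[0]):
            best = (latency_s, r["name"], egress_usd)
    return best

regions = [
    {"name": "us-east", "gbps": 10, "egress_usd_per_gb": 0.02, "startup_s": 90, "load_s": 60},
    {"name": "eu-west", "gbps": 5,  "egress_usd_per_gb": 0.05, "startup_s": 60, "load_s": 60},
]
# 500 GB checkpoint under a $15 egress budget -> ("us-east", ~460 s, $10)
print(pick_resume_region(regions, ckpt_gb=500, budget_usd=15))
```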
Need for a Unified Global Storage System
A unified geo-distributed storage system can reduce the north-south data-entropy tax while optimizing compute, network, and storage utilization, balancing infrastructure constraints for GPU-accelerated AI workloads.
Conclusion
Unlocking the full potential of GPU-accelerated AI requires overcoming key barriers.
The community must unite to innovate and strike a balance between performance, scalability, and cost
with an open AI ecosystem—building an inclusive AI future for all.
Thank you!
Pratik Mishra | AMD
COPYRIGHT AND DISCLAIMER
©2025 Advanced Micro Devices, Inc. All rights reserved.
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for
identification purposes only and may be trademarks of their respective companies.
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product differences between differing manufacturers, software changes, BIOS
flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD
assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes
from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND
ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY
DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT
WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM
THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.