Data centre networks and the challenges of scaling AI clusters

Techgoondu
Published: July 15, 2025 | Last updated: July 15, 2025, 5:38 PM

In association with Keysight Technologies

By Emily Yan

AI is evolving at an unprecedented pace, driving an urgent need for more powerful and efficient data centres. In response, nations and companies are ramping up investments into AI infrastructure.

According to Forbes, AI spending by the Big Tech sector will exceed US$250 billion in 2025, with the bulk going towards infrastructure. By 2029, global investments in AI infrastructure, including data centres, networks, and hardware, are expected to reach US$423 billion.

However, rapid AI innovation also puts unprecedented strain on data centre networks. For instance, Meta’s recent paper on the Llama 3 405B model shows its training cluster required over 700TB of memory and 16,000 Nvidia H100 graphics processing units (GPUs) during the pre-training phase. Epoch AI estimates that by 2030, AI models will need 10,000 times more computational power than today’s leading models.

The rise of AI clusters

An AI cluster is a large, highly interconnected network of computing resources that handles AI workloads. Unlike traditional computing clusters, AI clusters are optimised for tasks such as AI model training, inference, and real-time analytics. They rely on thousands of GPUs, high-speed interconnects, and low-latency networks to meet the intensive computational and data throughput requirements of AI.

Building AI clusters

An AI cluster, at its core, functions like a mini network. Building an AI cluster involves connecting the GPUs to form a high-performance computing network where data can flow seamlessly between them. Robust network connectivity is essential, as distributed training relies on the coordination of thousands of GPUs over extended periods.
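
To make this concrete, below is a minimal sketch of how a distributed training job coordinates GPUs over the cluster network, using PyTorch's DistributedDataParallel with the NCCL backend. The model, batch size, and launch details are illustrative assumptions, not the setup of any particular cluster.

# Minimal distributed data-parallel training sketch (assumptions: torchrun launcher,
# NCCL backend, a stand-in linear model). Every backward pass all-reduces gradients
# across all GPUs, which is the traffic the cluster fabric must carry without stalling.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train():
    dist.init_process_group(backend="nccl")        # join the training job
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun for each GPU process
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)   # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])             # wraps gradient all-reduce
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 4096, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        loss.backward()                            # NCCL all-reduce happens here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    train()

Launched with torchrun (for example, torchrun --nproc_per_node=8 train.py on each node), a single slow or stalled GPU holds up this all-reduce for every other participant, which is why the network matters as much as the GPUs themselves.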

Key components of AI clusters

AI clusters consist of multiple essential components, as shown in Figure 1.

Figure 1: AI data centre cluster
  • Compute nodes act as the brain of the AI cluster, with thousands of GPUs connecting to top-of-rack switches. As problem complexity increases, so does the need for GPUs.
  • High-speed interconnects such as Ethernet enable rapid data transfer between compute nodes.
  • Networking infrastructure includes the network hardware and protocols that support data communications between many thousands of GPUs over extended periods.

Scaling AI clusters

AI clusters scale to meet the growing AI workloads and complexities. Until recently, network bandwidth, latency, and other factors had limited AI clusters to around 30,000 GPUs. However, xAI’s Colossus supercomputer project shattered this barrier by scaling to over 100,000 Nvidia H100 GPUs – a breakthrough made possible by advancements in networking and memory technologies.

Key scaling challenges

As AI models grow to trillions of parameters, scaling AI clusters involves myriad technical and financial hurdles.

Network challenges

GPUs excel at performing math calculations in parallel. However, when thousands – or even hundreds of thousands – of GPUs work together on the same task in an AI cluster, a single GPU that lacks the data it needs or encounters delays stalls every other GPU.

Such prolonged packet latency or packet loss caused by a congested network can trigger packet retransmissions, significantly increasing job completion time (JCT) and leaving millions of dollars’ worth of GPUs sitting idle.
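
As a rough, purely illustrative calculation (every figure below is an assumption, not a number from the article), even a modest increase in JCT adds up quickly at cluster scale:

# Back-of-the-envelope cost of network-induced stalls (all inputs assumed).
gpus = 16_000                  # e.g. a Llama 3 405B-scale training cluster
hourly_rate_per_gpu = 2.50     # assumed US$ per GPU-hour
baseline_jct_hours = 24 * 30   # assumed 30-day training job
stall_fraction = 0.10          # assumed 10% of time lost to congestion and retransmits

extra_hours = baseline_jct_hours * stall_fraction
idle_cost = gpus * hourly_rate_per_gpu * extra_hours
print(f"Extra job time: {extra_hours:.0f} hours")
print(f"Cost of stalled GPUs: US${idle_cost:,.0f}")   # about US$2.9 million with these inputs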

Additionally, AI workloads generate a dramatic rise in east-west traffic (data moving between nodes within the data centre), potentially leading to network congestion and latency issues if the traditional network infrastructure isn’t optimised for these loads.

Interconnect challenges

As AI clusters expand, traditional interconnects may struggle to provide the necessary throughput. To avoid bottlenecks, organisations must upgrade to higher-speed interconnects, such as 800G or even 1.6T solutions.

However, deploying and validating such high-speed links to meet the rigorous requirements of AI workloads is no easy feat. The high-speed serial paths must be carefully tuned and tested for optimal signal integrity, low bit error rates, and reliable forward error correction (FEC) performance. Any instability in these paths can degrade reliability and slow down AI training, so companies need highly accurate and efficient test systems to validate them before deployment.
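
A quick back-of-the-envelope sketch illustrates why the jump to 800G and 1.6T matters. The model size, gradient precision, and overhead figure below are assumptions chosen for illustration only:

# Time to move one full set of gradients for a large model at different link rates.
params = 405e9                        # parameters (405B-class model)
bytes_per_param = 2                   # assumed bf16 gradients
payload_bits = params * bytes_per_param * 8

for label, bits_per_sec in [("400G", 400e9), ("800G", 800e9), ("1.6T", 1.6e12)]:
    effective = bits_per_sec * 0.9    # assumed ~90% usable after protocol/FEC overhead
    seconds = payload_bits / effective
    print(f"{label}: {seconds:.1f} s per full gradient exchange over one link")

Halving or quartering that per-exchange time across thousands of training iterations is where faster interconnects pay back, provided the links actually run error-free at their advertised rate.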

Financial challenges

The total cost of scaling AI clusters goes well beyond the expense of GPUs. Organisations must factor in power, cooling, networking equipment, and broader data centre infrastructure. However, accelerating AI workloads through better interconnects and optimised network performance can shorten training cycles and free up resources for additional tasks. Each day saved on training can translate into significant cost reductions, making the financial stakes as high as the technical ones.

Validation challenges

Optimising an AI cluster’s network performance requires testing and benchmarking the performance of both the network fabric and the interconnects between GPUs. However, validating these components and systems is challenging because of the intricate relationships among hardware, architectural design, and dynamic workload characteristics.
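
One vendor-neutral way to benchmark the interconnect between GPUs is a timed all-reduce, in the spirit of tools such as nccl-tests. The sketch below uses PyTorch's distributed API with an arbitrary message size and iteration count (assumptions for illustration, not a prescribed methodology):

# Timed all-reduce to estimate bus bandwidth between GPUs (inputs assumed).
import os, time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

size_bytes = 1 << 28                                    # 256 MB message
x = torch.empty(size_bytes // 4, dtype=torch.float32, device=f"cuda:{local_rank}")

for _ in range(5):                                      # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

n = dist.get_world_size()
bus_bw = (size_bytes / elapsed) * (2 * (n - 1) / n) / 1e9   # ring all-reduce bus bandwidth, GB/s
if dist.get_rank() == 0:
    print(f"all-reduce: {elapsed * 1e3:.2f} ms, ~{bus_bw:.1f} GB/s bus bandwidth")
dist.destroy_process_group()

Comparing the measured figure against the fabric's theoretical rate helps flag congested or mis-tuned links before a full training job is committed to the cluster.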

There are three common validation issues.

No 1. Lab deployment constraints

The high cost of AI hardware, limited equipment availability, and the need for specialised network engineers make full-scale replication impractical. Additionally, lab environments often have space, power, and thermal constraints that differ from real-world data centre conditions.

No 2. Impact on production system

Testing on a production system can be disruptive, potentially affecting critical AI operations.

No 3. Complex AI workloads

The diverse nature of AI workloads and data sets – varying significantly in size and communication patterns – makes it difficult to reproduce issues and benchmark consistently.

As AI reshapes the data centre landscape, future-proofing network infrastructure is crucial to staying ahead of rapidly evolving technologies and standards. Keysight’s advanced emulation solutions provide a critical advantage by enabling comprehensive validation of network protocols and operational scenarios before deployment. 

Emily Yan is product marketing manager at Keysight Technologies. Before joining Keysight, she worked on AI and big data marketing across multiple industries. Yan has an MPA degree from Columbia University and bachelor’s degrees from the University of California, Berkeley, in applied mathematics and economics.

TAGGED: data centre, data centre networks, GPU, Keysight Technologies, LLM, Nvidia, sponsored, xAI
