AWS offers a glimpse of its AI networking infrastructure
Amazon Web Services has been significantly bolstering its network to handle the increased demands of its AI-based applications and services.
In a recent blog post, Prasad Kalyanaraman, vice president of infrastructure services at AWS, offered a glimpse into what it takes to optimize the service provider’s global network to handle AI workloads.
Amazon has been using AI and machine learning (ML) for more than 25 years to drive functions such as shopping recommendations and packaging decisions, according to Kalyanaraman, while customers have been accessing AI and ML services through AWS. Today, AI is a multibillion-dollar business for AWS. “Over 100,000 customers across industries – including adidas, New York Stock Exchange, Pfizer, Ryanair, and Toyota – are using AWS AI and ML services to reinvent experiences for their customers,” Kalyanaraman wrote. “Additionally, many of the leading generative AI models are trained and run on AWS.”
AWS built its own Ethernet-based architecture that relies on its custom-built Elastic Fabric Adapter (EFA) network interface, which uses technology known as scalable reliable datagram (SRD), a network transport protocol designed by AWS. According to an IEEE-published abstract about the protocol:
“We built a new network transport protocol, scalable reliable datagram (SRD), designed to utilize modern commodity multitenant datacenter networks (with a large number of network paths) while overcoming their limitations (load imbalance and inconsistent latency when unrelated flows collide). Instead of preserving packets order, SRD sends the packets over as many network paths as possible, while avoiding overloaded paths. To minimize jitter and to ensure the fastest response to network congestion fluctuations, SRD is implemented in the AWS custom Nitro networking card.”
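SRD itself runs in AWS's custom Nitro hardware and is not published as code, but the core idea described in the abstract, spraying packets across many network paths while steering away from loaded ones and leaving reordering to the receiver, can be sketched in a few lines. The Python below is a conceptual illustration only; the path names, load metric, and selection heuristic are hypothetical and are not AWS's implementation.

```python
import random

# Conceptual sketch only: illustrates the multipath "packet spraying" idea
# behind SRD, not the real protocol (which is implemented in the Nitro card).
# Path names, the load metric, and the selection heuristic are hypothetical.

class MultipathSender:
    def __init__(self, paths):
        # Rough per-path load estimate (a real transport would use RTT and
        # congestion feedback rather than bytes sent).
        self.load = {path: 0 for path in paths}

    def pick_path(self):
        # Prefer the least-loaded path; break ties randomly so traffic
        # naturally spreads over as many paths as possible.
        least = min(self.load.values())
        candidates = [p for p, load in self.load.items() if load == least]
        return random.choice(candidates)

    def send(self, packets):
        # Packets are sprayed across paths with no attempt to preserve order;
        # the receiver (or the layer above) reassembles them, as SRD does.
        for seq, payload in enumerate(packets):
            path = self.pick_path()
            transmit(path, seq, payload)        # hypothetical transmit hook
            self.load[path] += len(payload)     # crude load accounting

def transmit(path, seq, payload):
    print(f"path={path} seq={seq} bytes={len(payload)}")

sender = MultipathSender(paths=["path-A", "path-B", "path-C"])
sender.send([f"chunk-{i}".encode() for i in range(6)])
```

The point of the design, per the abstract, is that by not preserving packet order the sender is free to use every available path, which smooths out the load imbalance and latency spikes that occur when unrelated flows collide on a single path.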
Building its own network architecture, including its own NICs and routers, has a number of upsides for AWS, according to Kalyanaraman.
“Our approach is unique in that we have built our own network devices and network operating systems for every layer of the stack – from the Network Interface Card, to the top-of-rack switch, to the data center network, to the internet-facing router and our backbone routers. This approach not only gives us greater control over improving security, reliability, and performance for customers, but also enables us to move faster than others to innovate,” Kalyanaraman wrote.
For example, AWS recently delivered a new network optimized for generative AI workloads – and it did it in seven months.
“Our first generation UltraCluster network, built in 2020, supported 4,000 graphics processing units, or GPUs, with a latency of eight microseconds between servers. The new network, UltraCluster 2.0, supports more than 20,000 GPUs with 25% latency reduction. It was built in just seven months, and this speed would not have been possible without the long-term investment in our own custom network devices and software,” Kalyanaraman wrote.
Known internally as the “10p10u” network, the UltraCluster 2.0, introduced in 2023, delivers tens of petabits per second of throughput, with a round-trip time of less than 10 microseconds. “The new network results in at least 15% reduction in time to train a model,” Kalyanaraman wrote.
Cooling tactics, chip designs aim for energy efficiency
Another infrastructure priority at AWS is to continuously improve the energy efficiency of its data centers. Training and running AI models can be extremely energy-intensive.
“AI chips perform mathematical calculations at high speed, making them critical for ML models. They also generate much more heat than other types of chips, so new AI servers that require more than 1,000 watts of power per chip will need to be liquid-cooled. However, some AWS services utilize network and storage infrastructure that does not require liquid cooling, and therefore, cooling this infrastructure with liquid would be an inefficient use of energy,” Kalyanaraman wrote. “AWS’s latest data center design seamlessly integrates optimized air-cooling solutions alongside liquid cooling capabilities for the most powerful AI chipsets, like the NVIDIA Grace Blackwell Superchips. This flexible, multimodal cooling design allows us to extract maximum performance and efficiency whether running traditional workloads or AI/ML models.”
For the last several years, AWS has been designing its own chips, including AWS Trainium and AWS Inferentia, with a goal of making it more energy-efficient to train and run generative AI models. “AWS Trainium is designed to speed up and lower the cost of training ML models by up to 50 percent over other comparable training-optimized Amazon EC2 instances, and AWS Inferentia enables models to generate inferences more quickly and at lower cost, with up to 40% better price performance than other comparable inference-optimized Amazon EC2 instances,” Kalyanaraman wrote.
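Customers typically reach these chips through dedicated EC2 instance families such as Trn1 (Trainium) and Inf2 (Inferentia2). As a rough illustration, assuming boto3 credentials are configured, requesting a Trainium-backed instance might look like the sketch below; the AMI ID and region are placeholders (in practice you would use a Deep Learning AMI with the Neuron SDK preinstalled).

```python
import boto3

# Minimal sketch: requesting a Trainium-backed EC2 training instance.
# The AMI ID below is a placeholder, and the region and instance size
# are examples, not recommendations.

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="trn1.32xlarge",      # Trainium-based training instance
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```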
AWS’ third-generation AI chip, Trainium2, will be available later this year. “Trainium2 is designed to deliver up to 4 times faster training than first-generation Trainium chips and will be able to be deployed in EC2 UltraClusters of up to 100,000 chips, making it possible to train foundation models and large language models in a fraction of the time, while improving energy efficiency up to 2 times,” Kalyanaraman wrote.
In addition, AWS works with partners including Nvidia, Intel, Qualcomm, and AMD to offer accelerators in the cloud for ML and generative AI applications, according to Kalyanaraman.
Related news: AWS launches 400 Gbps Dedicated Connections
Earlier this month, AWS announced that its private, high-bandwidth Direct Connect service now offers native 400 Gbps dedicated connections between AWS and customer data centers or colocation facilities.
“Native 400 Gbps connections provide higher bandwidth, without the operational overhead of managing multiple 100 Gbps connections in a link aggregation group. The increased capacity delivered by 400 Gbps connections is particularly beneficial to applications that transfer large-scale datasets, such as for machine learning and large language model training or advanced driver assistance systems for autonomous vehicles,” AWS stated.
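As a rough sketch of how such a connection might be requested programmatically, the boto3 call below assumes the new 400 Gbps tier is ordered through the same create_connection API used for existing 1, 10, and 100 Gbps dedicated connections; the location code and connection name are placeholders.

```python
import boto3

# Minimal sketch: requesting a dedicated Direct Connect connection.
# The location code and connection name are placeholders; the "400Gbps"
# bandwidth string assumes the new tier is exposed the same way as the
# existing dedicated-connection bandwidths.

dx = boto3.client("directconnect", region_name="us-east-1")

connection = dx.create_connection(
    location="EqDC2",                # placeholder Direct Connect location code
    bandwidth="400Gbps",             # new native 400 Gbps dedicated connection
    connectionName="ml-training-dx", # example name
)
print(connection["connectionId"], connection["connectionState"])
```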