Google Cloud WAN: A Network Built for, and Powered by, AI
How much does AI change the network? The short answer: a lot.
AI is one of the next evolutions in technology, and it brings its own set of workload challenges. That raises an issue many IT operations teams don't see coming until it's too late: the network.
There were 229 product and customer announcements at Google Cloud Next. With all of the pageantry, glitz, and glamour around the AI announcements, you may have missed this one on plain old layer 2 networking.
But it was kind of a big deal in its own little foundational way. Let's face it: WANs are not particularly sexy. They are the plumbing.
But they matter. A lot.
Early in my career, I did core routing and switching work. Networks were simple and flat then. They aren't so simple and flat anymore. So let's talk about the challenges networks handling AI workloads face today, and why Google is rearchitecting its approach to AI traffic.
Google Cloud WAN will offer dedicated, private connections to Google Cloud and will enable cross-cloud connectivity with other public cloud services, including AWS, Oracle Cloud Infrastructure and Azure.
Cloud WAN leverages Google’s network, which is built for application optimization. Google announced at Next that Cloud WAN provides up to 40% faster performance compared to the public internet, and up to a 40% savings in total cost of ownership (TCO) over a customer-managed WAN solution.
Cloud WAN addresses two primary use cases. The first is the need for high-performance connectivity between geographically dispersed data centers. The second revolves around connecting branch and campus environments over Google's Premium Tier network.
There are 3 product components supporting the first use case:
Cloud Interconnect: Provides private, dedicated, low-latency connections between Google Cloud regions and on-prem data centers, letting you connect your data center to Google Cloud from over 159 locations around the globe.
Cross-Cloud Interconnect: Provides multicloud connectivity between Google Cloud and other public cloud environments, including AWS, Azure, and OCI, and is available in 21 locations.
Cross-Site Interconnect: Provides dedicated point-to-point layer 2 private 10/100G connectivity optimized for connecting applications between data centers. It's currently in preview in select countries, with expansion to additional edge locations planned in the coming year.
The second use case is built around Google’s Premium Tier network. This offering serves as a powerful cloud on-ramp, helping to securely connect branch offices and campuses to public cloud resources, SaaS applications, and the internet.
Let’s dig in and cover a few things about the Google Network that you may not already know.
Google’s Network Chops
The word robust comes to mind when I think about Google's backbone network. This network spans over 2 million miles of lit fiber, including 33 subsea cable investments, with 202 network edge locations and more than 3,000 media content delivery network (CDN) locations across the globe. It connects 42 Google Cloud regions and 127 zones. While networking may not be the first thing that comes to mind when you think of Google, the organization drinks much of its own champagne in the networking space. Google has innovated out of necessity over the last 25 years. In fact, when Google started planning and then building software-defined networks more than a decade ago, the term did not even exist yet.
Bikash Koley, Vice President, Global Networking and Infrastructure, writes that “There have been several fundamental inflection points for the Google network over the last 25 years, leading to three distinct networking eras…Internet Era, Streaming Era, and the Cloud Era.” It seems we've just entered a new networking era: the AI Era.
Let’s talk about what’s different about how we network today in this era of AI. There are 4 key differences:
📍 The WAN is now the LAN: Google trains its largest foundation models across multiple campuses, and even multiple metros, to pool together large numbers of TPUs. These applications have unique traffic patterns such as elephant flows: extremely large, long-lived flows, typically single TCP connections, measured over a network link. A common rule of thumb classifies a flow as an elephant when it carries more than 1 GB within a 10-second window. Suffice it to say, this type of high-volume traffic can cause performance issues.
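That rule of thumb is simple rate arithmetic. Here is a minimal, hypothetical sketch of classifying flows against it; the flow records and threshold constants are illustrative assumptions, not any real telemetry API:

```python
# Hypothetical sketch: classify "elephant" flows using the rule of thumb
# above (more than 1 GB observed within a 10-second window).

ELEPHANT_BYTES = 1 * 1024**3   # 1 GB threshold (illustrative)
WINDOW_SECONDS = 10

def is_elephant(bytes_seen: int, window_s: float) -> bool:
    """True if the flow's observed rate exceeds 1 GB per 10 seconds."""
    if window_s <= 0:
        return False
    return bytes_seen / window_s >= ELEPHANT_BYTES / WINDOW_SECONDS

# Invented example flows: (bytes observed, observation window in seconds)
flows = {
    "tpu-gradient-sync": (12 * 1024**3, 10.0),  # 12 GB in 10 s -> elephant
    "ssh-admin-session": (2 * 1024**2, 10.0),   # 2 MB in 10 s  -> mouse
}
for name, (nbytes, secs) in flows.items():
    print(name, "elephant" if is_elephant(nbytes, secs) else "mouse")
```

In practice, operators tune the threshold per link speed; the point is only that elephant detection is a rate test, not deep inspection.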
📍 AI demands zero impact from any outages — AI applications, particularly those involving training, fine-tuning, and inferencing, are extremely sensitive to network outages, as they rely heavily on GPU/TPU resources and require continuous connectivity. A network outage can disrupt these intensive processes, potentially leading to significant delays and disruptions.
📍 AI requires a heightened need for security and control— Securing AI traffic on an enterprise network requires a layered approach, combining traditional security measures with AI-specific protections. This includes network segmentation, zero trust principles, encryption, continuous monitoring, and regular audits.
📍 Operational excellence: Google has been approaching the network as a software problem for more than 10 years. In fact, Google has put the concept of Network Site Reliability Engineering (NRE) in play as a policy as much as a job title. NRE aims to align network reliability with the service-level objectives, agreements, and goals of the IT organization and business.
As a job title, network site reliability engineer is an IT operations role that applies engineering principles to measure and automate the reliability of a network. It's a specialization of Site Reliability Engineering (SRE) that focuses on the network infrastructure. NRE becomes crucial with AI traffic because AI systems, with their complex architectures and high-volume data processing, demand robust, reliable, and performant networks. SRE principles help ensure that the underlying infrastructure can handle the demands of AI workloads, preventing outages and optimizing performance.
SRE is essential for AI traffic traversing the WAN because it provides optimized performance, better security and threat mitigation, efficient resource utilization, and enhanced scalability and flexibility.
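To make the SRE connection concrete: reliability targets are typically expressed as service-level objectives (SLOs), and the arithmetic that turns an availability SLO into an allowed-downtime "error budget" is simple. A minimal illustration, with example SLO values rather than Google's actual targets:

```python
# Illustrative SRE arithmetic: translate an availability SLO into an
# "error budget" of allowed downtime over a 30-day window.

def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - slo)

for slo in (0.999, 0.9999, 0.99999):
    print(f"{slo:.5f} -> {downtime_budget_minutes(slo):.2f} min / 30 days")
```

At "four nines" (99.99%), the budget is roughly 4.3 minutes per month, which is why "beyond-9s reliability" for always-on AI training traffic pushes teams toward automation rather than manual incident response.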
Guided by the changing needs of AI traffic, Google has built its next-generation global network on foundational networking advancements. Google believes that performance, security, and efficiency are table stakes.
When the Google engineers went about re-imagining their approach to the WAN, they came to the whiteboard with 4 design principles:
• Exponential scalability
• Beyond-9s reliability
• Intent-driven programmability
• Autonomous network
Using these design principles, here are the architectural changes that Google put in place.
📍 Multi-shard network: Google is moving beyond traditional vertical scaling limitations to elastic, horizontal scalability with its multi-shard network architecture. Sharding is a technique most often seen in databases and blockchains: data, or the network itself, is divided into smaller, independent units called shards, which are distributed across multiple servers or nodes to increase scalability, performance, and transaction processing. The approach lends itself well to traditional networking.
📍 Multi-shard isolation, region isolation, and protective reroute: Each of Google's network shards has its own control plane, data plane, and management plane, and operates independently of the other shards. That isolation contains failures to a single shard, while the sharding itself enables horizontal scaling, so the system can handle larger datasets and more users without performance degradation.
It’s likely that these networks will see improved performance. By distributing data and workloads across multiple shards, queries can be processed faster and the overall system becomes more responsive. Sharding also allows for the parallel processing of transactions or queries, leading to increased throughput and efficiency.
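As a loose analogy for how traffic might map onto independent shards, here is a toy model of hash-based shard assignment with failover; this is purely illustrative and not Google's actual implementation:

```python
# Toy model: deterministically assign flows to independent network shards,
# failing over when a shard is marked unhealthy. Shard names are invented.
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c", "shard-d"]

def pick_shard(flow_id: str, healthy: set) -> str:
    """Map a flow to a healthy shard via a stable hash of its ID."""
    candidates = [s for s in SHARDS if s in healthy] or SHARDS
    h = int(hashlib.sha256(flow_id.encode()).hexdigest(), 16)
    return candidates[h % len(candidates)]

healthy = set(SHARDS)
flow = "10.0.0.1->10.0.9.9:443"
print(pick_shard(flow, healthy))      # deterministic for a given flow

healthy.discard("shard-b")            # simulate one shard failing
print(pick_shard(flow, healthy))      # traffic lands on a surviving shard
```

A production system would use consistent hashing so that losing one shard reshuffles as few flows as possible; the toy modulo scheme above just shows the isolation idea, that the loss of one shard leaves the others serving traffic.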
📍Fully intent-driven, fine-grained programmability: Google has built a highly programmable network with SDN controllers, standard APIs, and universal network models such as the Multi-Abstraction-Layer Topology representation, or MALT. MALT modeling allows for a comprehensive understanding of the network’s structure and functionality. This means MALT can show both high-level design intents and low-level physical details, enabling various network management phases like design, deployment, and analysis.
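To make the multi-abstraction-layer idea concrete, here is a hypothetical, heavily simplified sketch of a MALT-style topology graph: named entities at different abstraction layers, connected by typed relationships. The entity kinds and relationship names below are illustrative, not the real MALT schema:

```python
# Simplified, hypothetical MALT-style model: entities span abstraction
# layers (design intent down to physical detail) and are linked by typed
# relationships such as containment.
from collections import defaultdict

entities = {
    "pod1":       "ABSTRACT_POD",    # high-level design intent
    "switch1":    "PACKET_SWITCH",   # lower-level device
    "switch1/p1": "PORT",            # physical detail
}

relationships = defaultdict(list)    # kind -> list of (src, dst) edges
relationships["CONTAINS"] += [("pod1", "switch1"), ("switch1", "switch1/p1")]

def children(name: str) -> list:
    """Expand an entity one level down its CONTAINS edges."""
    return [dst for src, dst in relationships["CONTAINS"] if src == name]

print(children("pod1"))       # the pod's contained devices
print(children("switch1"))    # the switch's ports
```

The point of the layered model is that design tools can operate on the abstract entities while deployment and analysis tools walk the same graph down to physical detail.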
📍 Autonomous network: Over the last decade, Google has transformed its network, moving from event-driven to machine-driven to now autonomous operations. This journey is fueled by ML, which provides Google with actionable intelligence. Inspired by Google DeepMind's work with graph neural networks (GNN) for accurate arrival-time predictions in Google Maps, Google used GNNs to create a digital twin of its network. This like-for-like topology of a production environment allows for testing, validation, and assurance.
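The digital-twin idea can be illustrated with a toy model: copy the production topology graph, apply a hypothetical change such as a link cut, and predict the effect before touching production. All node names and latencies below are invented for the example, and the real system uses GNN-based prediction rather than simple shortest paths:

```python
# Toy "digital twin": a graph copy of a production topology used to test a
# link failure offline. Topology and latencies are invented for the example.
import copy
import heapq

prod = {  # adjacency: node -> {neighbor: latency_ms}
    "iowa":     {"oregon": 18, "virginia": 31},
    "oregon":   {"iowa": 18, "virginia": 59},
    "virginia": {"iowa": 31, "oregon": 59},
}

def shortest_ms(graph, src, dst):
    """Dijkstra over the graph to predict end-to-end latency."""
    queue, seen = [(0, src)], set()
    while queue:
        d, node = heapq.heappop(queue)
        if node == dst:
            return d
        if node in seen:
            continue
        seen.add(node)
        for nbr, w in graph.get(node, {}).items():
            heapq.heappush(queue, (d + w, nbr))
    return None

twin = copy.deepcopy(prod)                 # the "digital twin"
del twin["iowa"]["virginia"]               # simulate cutting the direct link
del twin["virginia"]["iowa"]

print(shortest_ms(prod, "iowa", "virginia"))  # 31 ms direct
print(shortest_ms(twin, "iowa", "virginia"))  # 77 ms rerouted via oregon
```

Because the twin is a faithful copy of production topology, the predicted reroute cost can be validated before any real maintenance window.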
#GoogleCloudNext #GoogleCloudWAN
Read more: https://cloud.google.com/blog/products/networking/connect-globally-with-cloud-wan-for-the-ai-era