Microsoft, Meta & OpenAI Back New Optical Interconnect Standard for AI Scale-Up

by Chief Editor

The Rise of Optical Interconnects: A New Era for AI Infrastructure

As AI models grow exponentially in size and complexity, the demand for faster, more efficient data transfer within AI clusters is reaching a critical point. Traditional copper-based interconnects are nearing their limits, prompting hyperscalers like Microsoft, Meta, and OpenAI to collaborate on a new solution: optical interconnects. This week, these tech giants joined forces with AMD, Broadcom, and Nvidia to establish the Optical Compute Interconnect (OCI) Multi-Source Agreement (MSA) group, signaling a significant shift in AI infrastructure development.

Why Optical Interconnects Now?

AI clusters already rely heavily on optical interconnects for scale-out connectivity – connecting multiple servers together. Now, however, the need for optics is extending to scale-up connectivity – the links within a single system that tie processors and accelerators together. This transition is driven by the insatiable appetite for bandwidth of modern AI workloads.

The OCI MSA aims to define an open optical connectivity specification, enabling hyperscalers to utilize optical cables instead of copper to connect more accelerators at higher speeds and with predictable power consumption. This standardization is crucial for fostering a robust ecosystem and avoiding vendor lock-in.

Standardizing on a Common Optical Layer

The consortium will focus on developing a common optical physical layer (PHY) supporting various protocols. This includes UALink, used by AMD and Broadcom, and NVLink, utilized by Nvidia. The initial specification will be based on Non-Return-to-Zero (NRZ) signaling and wavelength-division multiplexing (WDM), starting at 200 Gb/s per direction and scaling to 800 Gb/s per fiber. Future roadmaps anticipate even higher speeds, potentially reaching 3.2 Tb/s per fiber and beyond.
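The arithmetic behind those bandwidth figures can be sketched simply: a fiber's aggregate rate is the per-wavelength lane rate multiplied by the number of WDM channels. The four-wavelength split below is an illustrative assumption – the MSA has not yet published lane counts or channel plans.

```python
# Illustrative WDM bandwidth arithmetic. The lane counts used here are
# assumptions for illustration; the OCI MSA spec details are not yet public.

def fiber_bandwidth_gbps(lane_rate_gbps: float, num_wavelengths: int) -> float:
    """Aggregate one-direction bandwidth of a single fiber carrying
    num_wavelengths WDM channels, each running at lane_rate_gbps."""
    return lane_rate_gbps * num_wavelengths

# Starting point: 200 Gb/s per direction on a single wavelength.
print(fiber_bandwidth_gbps(200, 1))  # 200.0

# Scaling to 800 Gb/s per fiber could, for example, come from
# multiplexing four 200 Gb/s wavelengths onto the same fiber.
print(fiber_bandwidth_gbps(200, 4))  # 800.0
```

Reaching the roadmap's 3.2 Tb/s per fiber would then require some combination of more wavelengths and faster per-lane signaling.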

This standardized approach will support different form factors, including pluggable optical modules, on-board optics, and co-packaged optics (CPO) integrated directly with compute silicon, offering flexibility for system designers.

A Hyperscaler-Driven Approach

The OCI MSA differs from traditional industry consortia in a key aspect: it’s driven by hyperscalers rather than independent hardware vendors. This approach reflects the growing influence of large cloud providers in shaping the future of AI infrastructure. While organizations like JEDEC and the Ultra Ethernet Consortium typically unite a broader range of companies, the OCI MSA’s focused membership allows for faster alignment and quicker development cycles.

The group’s focus is also narrow – specifically targeting short-reach links within scale-up domains. This contrasts with broader standardization efforts that aim to encompass entire technology stacks.

Implications for the AI Landscape

The OCI MSA has the potential to significantly impact the AI landscape by:

  • Reducing Costs: Standardization can drive down the cost of optical interconnects through economies of scale.
  • Increasing Performance: Higher bandwidth and lower latency will enable faster AI training and inference.
  • Enhancing Flexibility: A common optical layer allows for greater interoperability between different processors and interconnect protocols.
  • Accelerating Innovation: Simplified system integration and reduced development risk will accelerate the deployment of new AI hardware.

Nvidia’s Gilad Shainer emphasized that the OCI MSA will equip “best-in-class compute with state-of-the-art optics,” delivering the scale and performance needed for the next generation of AI.

FAQ: Optical Interconnects and AI

Q: What are optical interconnects?
A: Optical interconnects leverage light to transmit data, offering higher bandwidth and lower latency compared to traditional copper-based interconnects.

Q: What is the OCI MSA?
A: The Optical Compute Interconnect Multi-Source Agreement group is a collaboration between hyperscalers and hardware vendors to define an open specification for optical interconnects in AI systems.

Q: Why are optical interconnects important for AI?
A: AI models require massive amounts of data to be transferred quickly and efficiently. Optical interconnects provide the bandwidth needed to support these demanding workloads.

Q: What is the difference between scale-out and scale-up connectivity?
A: Scale-out connectivity connects multiple servers, while scale-up connectivity connects components within a single server.

Q: Which companies are involved in the OCI MSA?
A: AMD, Broadcom, Meta, Microsoft, Nvidia, and OpenAI are founding members of the OCI MSA.

Did you know? Microsoft has invested over $13 billion in OpenAI since 2019, securing exclusive rights to host OpenAI’s models on Azure.

Stay informed about the latest advancements in AI and data center technology. Follow Tom’s Hardware on Google News or add us as a preferred source to receive our updates directly in your feed.
