Oracle has announced the general availability of bare metal instances in Oracle Cloud Infrastructure (OCI) built on AMD's latest GPU, the Instinct™ MI355X. The instances offer a significant increase in memory capacity and bandwidth over the previous generation. According to the announcement, Oracle is the first hyperscaler to publicly offer the MI355X and remains the only provider with both the MI355X and the MI300X in its catalog.
The MI355X's CDNA 4 architecture brings substantial improvements over the previous-generation MI300X. Each GPU now has 288 GB of HBM3e memory, a 50% increase, and 8 TB/s of memory bandwidth, 51% more than its predecessor. It also supports FP4, FP6, and FP8 precisions, with approximately 2.5x the FP8/FP16 performance of CDNA 3.
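The generation-over-generation percentages above can be checked with a few lines of arithmetic. This is a minimal sketch: the MI300X baseline values (192 GB of HBM and 5.3 TB/s) are assumptions taken from AMD's public specifications rather than from this article, and the names `MI300X`, `MI355X`, and `percent_increase` are illustrative.

```python
# Sanity check of the quoted memory and bandwidth gains.
# ASSUMPTION: the MI300X baseline (192 GB, 5.3 TB/s) comes from AMD's
# public spec sheet; the article itself only gives the MI355X figures.
MI300X = {"hbm_gb": 192, "bandwidth_tb_per_s": 5.3}
MI355X = {"hbm_gb": 288, "bandwidth_tb_per_s": 8.0}


def percent_increase(new: float, old: float) -> float:
    """Return the relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


memory_gain = percent_increase(MI355X["hbm_gb"], MI300X["hbm_gb"])
bandwidth_gain = percent_increase(
    MI355X["bandwidth_tb_per_s"], MI300X["bandwidth_tb_per_s"]
)

print(f"Memory capacity:  +{memory_gain:.0f}%")
print(f"Memory bandwidth: +{bandwidth_gain:.0f}%")
```

Under those baseline assumptions, the results round to the 50% and 51% figures quoted in the text.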
Each server pairs the GPUs with fifth-generation AMD EPYC CPUs providing 128 cores, 3 TB of DDR5 RAM, and 61.44 TB of local storage, double the previous capacity. The configuration is complemented by a 400 Gbps front-end network and liquid-cooled racks that scale up to 64 GPUs per rack. For distributed training, cluster network connectivity reaches 3,200 Gbps.
The BM.GPU.MI355X.8 instance, now available to request in OCI, provides eight AMD Instinct™ MI355X accelerators with an aggregate GPU memory of 2.3 TB. With pricing starting at $8.60/hour, the instances target demanding workloads such as LLM training, real-time inference, and HPC applications such as digital twins and genomics.
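To make the aggregate memory figure concrete, the sketch below multiplies out the per-GPU capacity and estimates how much of it a model's weights would occupy at different precisions. It is a rough illustration only: the 400-billion-parameter model is hypothetical, the estimate covers weights alone (no KV cache, activations, or optimizer state), and decimal units (1 TB = 1,000 GB) are assumed.

```python
# Aggregate HBM on one BM.GPU.MI355X.8 instance and a weights-only
# footprint estimate at several precisions.
GPUS_PER_INSTANCE = 8
HBM_PER_GPU_GB = 288
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP6": 6, "FP4": 4}


def aggregate_hbm_gb(gpus: int = GPUS_PER_INSTANCE,
                     per_gpu_gb: int = HBM_PER_GPU_GB) -> int:
    """Total GPU memory across the instance, in GB."""
    return gpus * per_gpu_gb


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Memory needed for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM[precision] / 8 / 1e9


total_gb = aggregate_hbm_gb()
print(f"Aggregate HBM: {total_gb} GB (~{total_gb / 1000:.1f} TB)")

HYPOTHETICAL_PARAMS = 400e9  # illustrative model size, not from the article
for precision in BITS_PER_PARAM:
    needed = weight_footprint_gb(HYPOTHETICAL_PARAMS, precision)
    share = needed / total_gb * 100
    print(f"{precision}: {needed:.0f} GB of weights ({share:.0f}% of instance HBM)")
```

Eight GPUs at 288 GB each give 2,304 GB, which is the 2.3 TB quoted above; lower precisions shrink the weight footprint proportionally, leaving more room for everything else a real workload needs.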
These instances are part of the OCI Supercluster Zettascale ecosystem, which can scale up to 131,072 GPUs. Oracle describes it as the largest AI cloud 'supercomputer', citing its high-efficiency, ultra-low-latency RDMA network. With the MI355X, Oracle expects significant improvements in time-to-train and efficiency, with a threefold increase in computational power.
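For a sense of scale, the maximum cluster size can be translated into instance and rack counts. This is a back-of-the-envelope sketch that assumes the entire cluster is built from 8-GPU instances and fully populated 64-GPU racks, which the article does not state explicitly.

```python
# Rough sizing of a maximum-scale cluster.
# ASSUMPTION: every node is an 8-GPU instance and every rack holds the
# full 64 GPUs; real deployments may be laid out differently.
MAX_GPUS = 131_072
GPUS_PER_INSTANCE = 8
GPUS_PER_RACK = 64

instances, leftover_gpus = divmod(MAX_GPUS, GPUS_PER_INSTANCE)
racks, leftover_rack_gpus = divmod(MAX_GPUS, GPUS_PER_RACK)

print(f"Instances: {instances:,}")
print(f"Racks:     {racks:,}")
print(f"Instances per rack: {GPUS_PER_RACK // GPUS_PER_INSTANCE}")
```

Both divisions come out even, since 131,072 is a power of two (2^17).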
Oracle continues to bet on an open ecosystem, with support for ROCm™ and standard frameworks such as PyTorch and TensorFlow. The company aims to ease migration from CUDA to ROCm without complex rewrites. Customers such as Absci and Seekr are already using these instances to accelerate generative-AI drug discovery and the training of advanced AI models, respectively.
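One reason such migrations can avoid rewrites is that ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` interface used on NVIDIA hardware, so device-agnostic code often runs unchanged. The sketch below is illustrative rather than an Oracle or AMD migration procedure; `describe_backend` is a hypothetical helper, and the import is guarded only so the snippet still runs on machines without PyTorch installed.

```python
# Detect which GPU backend a PyTorch installation targets.
try:
    import torch
except ImportError:  # keep the sketch runnable without PyTorch
    torch = None


def describe_backend() -> str:
    """Return 'rocm', 'cuda', 'cpu', or 'torch-missing'."""
    if torch is None:
        return "torch-missing"
    if not torch.cuda.is_available():
        return "cpu"
    # ROCm builds set torch.version.hip to a version string;
    # CUDA builds leave it as None.
    if getattr(torch.version, "hip", None):
        return "rocm"
    return "cuda"


backend = describe_backend()
print(f"Detected backend: {backend}")

if torch is not None:
    # The "cuda" device string addresses AMD GPUs on ROCm builds as well.
    device = torch.device("cuda" if backend in ("rocm", "cuda") else "cpu")
    x = torch.ones(2, 2, device=device)
    print((x @ x).sum().item())
```

Code that calls custom CUDA kernels or vendor-specific libraries directly would still need porting; this only covers the framework-level path.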
With this release, Oracle reaffirms its commitment to expanding its cloud capabilities, providing powerful tools for large-scale development and fostering the adoption of industrialized AI.
More information and references in Cloud News.


