>768 GB of DDR5-5200. The 12 memory controllers on the IO die provide a 768-bit memory bus, so the setup provides just under 500 GB/s of theoretical bandwidth
I know it's a server, but I'd be so ready to use all of that as a RAM disk. Crazy amount at a crazy high speed. Even 1% would be enough just to play around with something.
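The quoted figure is easy to verify; a quick sketch of the peak-bandwidth arithmetic, with the bus width and transfer rate taken from the quote above:

    def peak_bw_gbs(bus_width_bits: int, transfers_per_s: float) -> float:
        """Theoretical peak DRAM bandwidth: bytes per transfer x transfer rate."""
        return bus_width_bits / 8 * transfers_per_s / 1e9

    # 12 channels x 64 bits = 768-bit bus; DDR5-5200 does 5.2e9 transfers/s
    print(peak_bw_gbs(768, 5.2e9))  # 499.2 GB/s -> "just under 500 GB/s"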
I have been waiting for Netflix, which uses FreeBSD, to serve video at 1600 Gb/s. They announced their 800 Gb/s record in 2021, and back then they were limited by CPU and memory bandwidth. With 500 GB/s that is pretty much no longer a limit.
> Crazy amount at a crazy high speed
That's 300GB/s slower than my old Mac Studio (M1 Ultra). Memory speeds in 2025 remain thoroughly unimpressive outside of high-end GPUs and fully integrated systems.
The server systems have that much memory bandwidth per socket. Also, that generation supports DDR5-6400 but they were using DDR5-5200. Using the faster stuff gets you 614GB/s per socket, i.e. a dual socket system with DDR5-6400 is >1200GB/s. And in those systems that's just for the CPU; a GPU/accelerator gets its own.
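Quick check of those per-socket numbers, reusing the same peak-bandwidth formula as above:

    per_socket = 768 / 8 * 6.4e9 / 1e9  # 768-bit bus at DDR5-6400
    print(per_socket, 2 * per_socket)   # 614.4 GB/s per socket, 1228.8 GB/s dual-socket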
The M1 Ultra doesn't have 800GB/s because it's "integrated"; it simply has the equivalent of 16 channels of (LP)DDR5-6400, which it could have whether the memory was soldered or not. And none of the more recent Apple chips have any more than that.
It's the GPUs that use integrated memory, i.e. GDDR or HBM. That actually gets you somewhere -- the RTX 5090 has 1.8TB/s with GDDR7, the MI300X has 5.3TB/s with HBM3. But that stuff is also more expensive which limits how much of it you get, e.g. the MI300X has 192GB of HBM3, whereas normal servers support 6TB per socket.
And it's the same problem with Apple even though there's no great reason for it to be. The 2019 Intel Xeon Mac Pro supported 1.5TB of RAM -- still in slots -- but the newer ones barely reach a third of that at the top end.
For our devs' build servers we use roughly this setup as a RAM disk. It's amazing - build times are lightning fast (compared to HDD/SSD).
I'm interested in... why? What are you building where loading data from disk is so lopsided vs CPU load from compiling, or network load/latency? (One 200ms of "is this the current git repo?" is a heck of a lot of NVMe latency... and it's going to be closer to 2s than 200ms.)
I'm running the same setup - our larger builders have two 32-core Epycs with 2TB RAM. We were doing this type of setup almost two decades ago at a different company, and at this one for over a decade now - back then it was the only option for speed.
Nowadays NVMes might indeed be able to get close - but we'd probably still need to span multiple SSDs (reducing the cost savings), and the developers there are incredibly sensitive to build times. If a 5 minute build suddenly takes 30 seconds more we have some unhappy developers.
Another reason is that it'd eat SSDs like candy. Current enterprise SSDs have something like a 10000 TBW rating, which we'd exceed in the first month. So we'd either get cheap consumer SSDs and replace them every few days, or enterprise SSDs and replace them every few months - or stick with the RAM setup, which over the life of the build system will be cheaper than constantly buying SSDs.
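To put rough numbers on that claim - every figure below is a made-up assumption for illustration, not a number from the thread:

    # Hypothetical CI fleet; all figures are assumptions for illustration only.
    gb_written_per_build = 200   # temporaries, object files, final artifacts
    builds_per_day = 2_000       # across the whole fleet
    days = 30

    tb_per_month = gb_written_per_build * builds_per_day * days / 1_000
    print(f"{tb_per_month:,.0f} TB written per month")
    # 12,000 TB -> blows past a 10000 TBW rating in under a month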
> If a 5 minute build suddenly takes 30 seconds more we have some unhappy developers
They sound incredibly spoiled. Where should I send my CV?
You don't really want that. I'm keeping my sanity there just because my small company is running their CI and testing as contractor.
They indeed are quite spoiled - and that's not necessarily a good thing. Part of the issue is that our CI was good and fast enough that at some point a lot of the new hires never bothered to figure out how to build the code - so for quite a few the workflow is "commit to a branch, push it, wait for CI, repeat". And as they often just work on a single problem the "wait" is time lost for them, which leads to the unhappiness if we are too slow.
> Current enterprise SSDs have something like a 10000 TBW rating, which we'd exceed in the first month
Wow. What’s your use case?
Same as the one earlier in the thread: Build servers, nicely loaded. A build generates a ridiculous amount of writes for stuff that just gets thrown out after the build.
We actually did try SSDs about 15 years ago, and had a lot of dead SSDs in a very short time. After that we went with estimating data written; it's cheaper. While SSD durability has increased a lot since then, everything else got faster as well - so SSDs would last a bit longer now (back then replacement was a weekly thing), but still nowhere near the point where it'd be a sensible thing to do.
> I'm interested in... why? What are you building that loading data from disk is so lopsided vs CPu load from compiling (...)
This has been the basic pattern for ages, particularly with large C++ projects. C++ builds, especially since the introduction of multi-CPU and multi-core systems, turn into IO-bound workflows, especially during linking.
Creating RAM disks is one of the most basic, low-effort strategies to improve build times, and I think it was the main driver for a few commercial RAM drive apps.
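On Linux today, tmpfs gives you this for free; a minimal sketch, assuming /dev/shm is tmpfs-backed (the default on most distros) - the source path, build command, and artifact name are placeholders:

    import os
    import shutil
    import subprocess
    import tempfile

    # /dev/shm is tmpfs on most Linux distros, so the whole build stays in RAM.
    with tempfile.TemporaryDirectory(dir="/dev/shm") as build_dir:
        # Stage the (placeholder) source tree into the RAM disk.
        shutil.copytree("/path/to/source", build_dir, dirs_exist_ok=True)
        # Run the build there; the intermediate writes never touch disk.
        subprocess.run(["make", "-j", str(os.cpu_count())], cwd=build_dir, check=True)
        # Copy only the final artifact back to persistent storage.
        shutil.copy2(os.path.join(build_dir, "app"), "/path/to/artifacts/")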
Why do we need commercial RAM drive apps when Linux has tmpfs - or is this a historical thing?
Historical, but there were also a bunch of physical RAM drives - RAMsan, for example, sold DRAM-based (battery-backed) appliances connected by Fibre Channel; they were used for all kinds of tasks, but often as very fast scratch space for databases. Some VAXen had a "RAM disk" card that was IIRC used as an NFS cache on some Unix variants. Etc., etc.
Still odd. The OS should be able to manage the memory and balance performance more efficiently than that. There's no reason to preallocate memory in hardware.
It was often used to supplement available memory in cheaper or more flexible ways. For example, many hardware solutions allowed connecting more RAM than the main bus could otherwise access, or at a lower cost than main memory (due to differences in the interfaces required, adding battery backup, etc.).
The RAMsan line, for example, started in 2000 with a 64GB DRAM-based SSD with up to 15 1Gbit FC interfaces, providing a shared SAN SSD for multiple hosts (very well utilized by some of the beefier clustered SQL databases like Oracle RAC) - and the company itself had been providing specialized high-speed DRAM-based SSDs since 1978.
> One 200ms of "is this the current git repo?" is a heck of a lot of NVMe latency... and it's going to be closer to 2s than 200ms
I don't know where you're buying your NVMe drives, but mine usually respond within a hundred microseconds.
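That claim is straightforward to test; a rough sketch timing uncached 4 KiB reads on Linux (the device path is a placeholder, O_DIRECT bypasses the page cache, and reading a raw device needs root):

    import mmap
    import os
    import statistics
    import time

    DEV = "/dev/nvme0n1"  # placeholder raw device; requires root
    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)  # O_DIRECT needs aligned buffers/offsets
    buf = mmap.mmap(-1, 4096)  # mmap returns a page-aligned buffer

    samples = []
    for i in range(1000):
        # Jump around to scattered 4 KiB-aligned offsets to defeat readahead.
        os.lseek(fd, (i * 9973 % 250_000) * 4096, os.SEEK_SET)
        t0 = time.perf_counter_ns()
        os.readv(fd, [buf])
        samples.append(time.perf_counter_ns() - t0)
    os.close(fd)

    print(f"median 4 KiB read: {statistics.median(samples) / 1000:.0f} µs")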
For the ROS ecosystem you're often building dozens or hundreds of small CMake packages, and those configure steps are very IO-bound - it's a ton of "does this file exist", "what's in this file", "compile this tiny test program", etc.
I assume the same would be true for any project that is configure-heavy.
12 memory channels per CPU, and DDR5-6400 may be supported (for reference - I found incorrect specs in Epyc CPU retail listings some weeks ago); see https://www.amd.com/en/products/processors/server/epyc/9005-...
I have 1TB of RAM on my home server. It's 2666 though...
> I have 1TB of RAM on my home server. It's 2666 though...
this kit? https://www.newegg.com/nemix-ram-1tb/p/1X5-003Z-01930
Wow. I tried to tap but the Newegg app has an unskippable 5 second ad for something I didn’t read. What a shame. My fault for having their app installed I guess.
It's roughly $3/GB.
Bro, just $3?
Even better, you could use it for inference - with that much RAM you could load just about any model.
Indeed. I wonder what a system like that would cost (at consumer available prices)?
From what I can find here in Norway the CPU would be $3800, mobo around $2000, and one stick of 64 GB 6400 MHz registered ECC runs about $530, so about $6400 for the full 768 GB. Couldn't find any kits for those.
So just those components would be just over $12k.
That's just from regular consumer shops, and includes 25% VAT. Without the VAT it's about $9800.
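A quick check of the arithmetic, using the retail prices above:

    cpu, mobo, dimm = 3800, 2000, 530  # USD, Norwegian retail, incl. 25% VAT
    dimms = 768 // 64                  # twelve 64 GB sticks

    total = cpu + mobo + dimms * dimm
    print(total)                 # 12160 -> "just over $12k"
    print(round(total / 1.25))   # 9728  -> "about $9800" without VAT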
The problem for consumers is that just about all the shops that sell such gear, and that you might get a deal from, are geared towards companies and not interested in dealing with consumers, due to consumer protection laws.
The best deal on these high-end servers for consumers is to find a large local server reseller, meaning a company that buys used datacenter equipment in bulk and then resells it. It's not always used or old equipment, either.
True, though at least here that'll be older stuff, and seems almost exclusively Intel parts.
I found a used server with 768 GB DDR4 and dual Intel Gold 6248 CPUs for $4200 including 25% VAT.
That's a complete 2U server, the CPUs are a bit weak but not too bad all in all.
Those are extremely uniform latencies. It seems like on these CPUs most of the benefit from NUMA-aware thread pools will come from reduced contention - synchronizing small subsets of cores - rather than from actual memory affinity.
NUMA is only useful if you have multiple sockets, because then you have several I/O dies and you want your workload 1) to be close to the I/O device and 2) to avoid crossing the socket interconnect. Within the same socket all CPUs share the same I/O die, hence the uniform latency.
Well, all of the memory hangs off the IO die. I remember AMD docs outright recommending that you configure the processor to hide NUMA nodes from the workload, as trying to optimize for them might not even do anything for a lot of workloads.
That AMD slide (in the conclusion) claims their switching fabric has some kind of bypass mode to improve latency when utilisation is low.
So they have been really optimising that IO die for latency.
NUMA is already workload-sensitive - you need to benchmark your exact workload to know whether it's worth enabling - and this change is probably going to make it even less worthwhile. It sounds like you'd need a workload that really pushes total memory bandwidth for NUMA to be worth it.
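Before benchmarking, you can at least see how many NUMA nodes the firmware exposes and their reported distances; a small sketch reading sysfs (Linux-only):

    from pathlib import Path

    # Each NUMA node the firmware exposes appears as /sys/devices/system/node/nodeN.
    for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        cpus = (node / "cpulist").read_text().strip()
        distances = (node / "distance").read_text().split()
        print(f"{node.name}: cpus {cpus}, distances {distances}")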
The first picture has a typo on its left-hand side.
It says 16 cores per die with up to 16 Zen 5 dies per chip. For Zen 5 it's 8 cores per die and 16 dies per chip, giving a total of 128 cores.
For Zen 5c it's 16 cores per die and 12 dies per chip, giving a total of 192 cores.
Weirdly, it's correct on the right side of the image.
Is it true that EPYC doesn't use a program counter, in the sense that for some operations the next instruction address is taken from the second operand?
EPYC runs x64 code. In it, jump instructions work exactly as you describe.