yatrios 9 months ago

I find this new path pretty fascinating. Have there been any recent advancements on the signal integrity issues when partitioning these designs? To me, chiplets still seem very proof-of-concept, and I'm not sure how feasible they are in large-scale designs. Would someone care to clarify?

  • lizknope 9 months ago

    I've been in integrated circuit physical design for almost 30 years.

    What signal integrity issue are you referring to? For on-chip nets we have SI issues from cross-coupling capacitance. For the last 25+ years the routers have tried to move these nets apart and jump layers to avoid long cross-coupled nets. The RC extraction tools have supported extraction of cross-coupled nets, and the delay calculators and static timing analysis tools analyze victim/aggressor net coupling and filter out irrelevant nets if they don't switch in the same timing windows.
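
    To make that last point concrete, here's a toy sketch of timing-window filtering (not any real tool's API, just the idea): an aggressor net only matters for a victim if their switching windows can overlap.

        # Toy illustration of the aggressor filtering done during delay calculation.
        # Windows are (earliest, latest) switching times in ps within a clock cycle.
        def windows_overlap(victim, aggressor):
            # An aggressor can only inject coupling noise into the victim's delay
            # if the two nets can switch at the same time.
            v_start, v_end = victim
            a_start, a_end = aggressor
            return a_start <= v_end and v_start <= a_end

        victim_window = (300, 450)
        aggressors = {"net_a": (100, 250), "net_b": (400, 500), "net_c": (460, 600)}

        # Keep only aggressors whose switching windows overlap the victim's;
        # the rest are ignored for this victim.
        relevant = {n: w for n, w in aggressors.items()
                    if windows_overlap(victim_window, w)}
        print(relevant)  # {'net_b': (400, 500)}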

    Chiplets are about combining multiple chips in the same package. I had a Pentium Pro from 1996 that did that. In the last 5 years chip packaging technology has continued to advance, and we are now stacking dies and more within the same package.

    SI between chips is not a new issue. There are tools to analyze that as well.

  • trynumber9 9 months ago

    AMD has been shipping billions of dollars' worth of "chiplet" GPGPUs under the names MI300A and MI300X.

    So I think they're beyond the experimental phase.

    • latchkey 9 months ago

      As someone deploying production supercomputers using those chips, I couldn't agree with you more.

      • Numerlor 9 months ago

        From their die size and power usage relative to their performance compared to Nvidia, it's clear they didn't hit the goals they wanted with Navi 31 and 32, and it's definitely because of the chiplet design.

        I don't have any data-center experience, but on the consumer side the Navi 31-based 7900 XTX is also a bit of a temperamental GPU, though I don't know how much of that is the silicon and how much is software.

        Though it is clear that some form of chiplets will have to be used, as building large portions of chips on cutting-edge nodes with higher defect rates will only become more expensive as time goes on, even more so for the parts of the chip that just don't scale down with newer nodes anymore anyway.

        • latchkey 9 months ago

          I previously ran 150,000 AMD GPUs for an ethereum mining operation. We ran them on the edge of crashing, each individually tuned to its highest clock and lowest voltage.

          I can definitively say they are ALL snowflakes. Every single one. Wide variance within each chip model and across batches as well. We had them OEM-placed onto the boards too; combine that with the OEM and their batches, and there was variance there as well. Then it went down even to the datacenter and the PSUs, and how clean the power was.

          I actually started to collect data on where each die was cut from the wafer, but never got a chance to process it and correlate it with performance. There was a running theory that the edge chips were not as good.
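
          If I ever got around to it, the analysis would have been roughly this (a minimal sketch; the column names and numbers are made up, and it assumes the per-card logs and wafer die coordinates were already joined into one table):

              # Hypothetical sketch: does distance from the wafer center correlate
              # with the achieved hashrate? Columns and values are made up for illustration.
              import pandas as pd

              cards = pd.DataFrame({
                  "die_x":    [-3, 0, 2, 4, -4],               # die coordinates on the wafer
                  "die_y":    [ 1, 0, -2, 3,  4],
                  "hashrate": [61.2, 63.5, 62.8, 58.9, 59.1],  # MH/s at the tuned setpoint
              })

              # Edge dies are the ones far from the wafer center.
              cards["radius"] = (cards["die_x"] ** 2 + cards["die_y"] ** 2) ** 0.5

              # A clearly negative correlation would support the "edge chips are worse" theory.
              print(cards["radius"].corr(cards["hashrate"]))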

          • alephnerd 9 months ago

            What was the rate of variance you were seeing? I'm curious how it compares to other similar setups. I haven't been technically in the weeds for 6-7 years, so I'm kind of curious.

            That said, HPC is hard, and even minor hardware/OEM failures can have a massive downstream impact.

            Also you seem to be downvoted for no reason. At that size of a cluster, I'd be surprised if you didn't have a subset of flawed GPUs.

            > There was a running theory that the edge chips were not as good

            My gut would agree with that sentiment, simply because engineering is hard and it takes time to stabilize QA (e.g. look at how long it took to stabilize hard disk QA, let alone a bleeding-edge GPU architecture).

            • Numerlor 9 months ago

              Again, only consumer-side knowledge, but there's clearly a very wide degree of variance in Navi 31 just from all the models it's used in, as I doubt they're purposefully cutting down the silicon that much unless the parts were already unusable for the more expensive GPUs.

              It goes from the 7900 GRE and 7900 XT up to the 7900 XTX, with roughly a fifth of the compute disabled on the lowest tier compared to the highest. Then, from what I know from overclocking communities, different 7900 XTXs also vary by over 10% on the same 550 W BIOS. It's probably even more apparent when going over that power limit, but that requires hardware modding, and people aren't doing that on low-tier bins, so there isn't as much data.

            • latchkey 9 months ago

              Yea, don't know about the downvotes. ¯\_(ツ)_/¯

              We had a minimum expected hashrate and power profile depending on the card model and batch. The variance would be in the 0-20% range.
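
              Conceptually the acceptance check was just this (a toy sketch; the model name, thresholds, and readings are made up, the real ones varied by model and batch):

                  # Conceptual sketch of the per-card acceptance check; the model name,
                  # thresholds, and readings below are made up for illustration.
                  BASELINE = {"model_a/batch_1": {"min_hashrate": 30.0, "max_watts": 120.0}}

                  def card_ok(model_batch, hashrate, watts):
                      spec = BASELINE[model_batch]
                      return hashrate >= spec["min_hashrate"] and watts <= spec["max_watts"]

                  # A card well under the expected hashrate gets flagged for re-tuning or RMA.
                  print(card_ok("model_a/batch_1", hashrate=25.5, watts=118.0))  # False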

              For what I'm doing now with my company, which is more bleeding edge... we just deployed a cluster of 128 MI300x (8 GPUs per chassis). A good 50% of the chassis had 1-2 issues, either in the GPU or the baseboard, on delivery.

  • wtallis 9 months ago

    Chiplet is a term applied to any of several fairly different packaging techniques. AMD's consumer Ryzen desktop and Epyc server processors use an organic substrate with chip-to-chip distances on the order of a centimeter, which is relatively bad for the performance and power consumption of the interconnect, but it's cheap and has been working for them for seven years and counting. At the other end of the spectrum are techniques like die stacking using TSVs: more expensive, but very high performance and low power.

    There are plenty of chiplet-based designs that meet whatever definition of "large scale" you may have had in mind. Intel's done datacenter processors consisting of four CPU chiplets (~400mm^2 each) in a square layout with silicon bridge interconnects between adjacent dies (as opposed to mounting everything on one large piece of silicon), and each of the four CPU chiplets is also connected to an 8-high TSV stack of HBM DRAM (Sapphire Rapids Xeon Max, currently used in the second-fastest supercomputer). That's way more silicon per socket than any monolithic die can provide, barring something like Cerebras's wafer-scale packaging.

    Intel's Meteor Lake laptop processor family, launched at the end of last year, has across its various configurations: CPU chiplets of two different sizes, GPU chiplets of two different sizes, a common SoC chiplet, and an optional extra IO chiplet, for a total of six active chiplets designed for that generation, plus two sizes of passive base die those chiplets are mounted on, plus two or three sizes of dummy silicon filling in gaps at the edges of the arrangement of the active tiles. That's a lot of tapeouts across four different fab processes, all for one generation of consumer processors, and it's been shipping in volume for basically the whole year.

    • adrian_b 9 months ago

      I agree with what you have said, except that AMD's experience with packaging techniques for multiple chips is way longer than "seven years".

      The first multi-chip AMD CPU was Opteron Magny-Cours, in 2010. It had up to 12 cores in two 6-core chips.

      Zen 1 Epyc and Threadripper CPUs had a multi-chip partitioning more similar to Opteron Magny-Cours than to the following AMD CPUs, from Zen 2 to Zen 5.

      Intel's multi-chip Xeons also have a partitioning like that of Magny-Cours and Zen 1, except that their peripheral interfaces are extracted into separate chiplets made with a different lower-resolution CMOS process.

      The Magny-Cours/Zen 1/Xeon partitioning, where the memory controllers are distributed over the compute tiles, has the advantage of lower memory latency, but the disadvantage of non-uniform memory access (NUMA).
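
      To make the NUMA point tangible: on Linux you can see that partitioning directly by reading the topology the kernel exposes through sysfs (a minimal sketch, Linux-only paths):

          # Minimal sketch: list NUMA nodes and the CPUs that belong to them (Linux sysfs).
          # On a Zen 1 Epyc each compute die typically shows up as its own NUMA node;
          # on the later IO-die designs a whole socket can present as a single node.
          from pathlib import Path

          for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
              cpulist = (node / "cpulist").read_text().strip()
              print(f"{node.name}: CPUs {cpulist}")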