by neilmovva on 4/9/2024, 5:20:39 PM
A bit surprised that they're using HBM2e, which is what Nvidia A100 (80GB) used back in 2020. But Intel is using 8 stacks here, so Gaudi 3 achieves comparable total bandwidth (3.7TB/s) to H100 (3.4TB/s) which uses 5 stacks of HBM3. Hopefully the older HBM has better supply - HBM3 is hard to get right now!
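Back-of-envelope on the bandwidth math (the per-stack pin rates below are my assumptions, not from the announcement):

    # Aggregate bandwidth = stacks x pin rate x pins per stack.
    # Assumed pin rates: HBM2e ~3.6 Gb/s, H100's HBM3 ~5.2 Gb/s (below peak spec).
    def agg_bw_tbs(stacks, gbps_per_pin, pins=1024):
        return stacks * gbps_per_pin * pins / 8 / 1000  # Gb/s -> TB/s

    print(agg_bw_tbs(8, 3.6))  # Gaudi 3: ~3.7 TB/s from 8 HBM2e stacks
    print(agg_bw_tbs(5, 5.2))  # H100: ~3.3 TB/s from 5 HBM3 stacks

So more, slower stacks really does get Intel to parity on aggregate bandwidth.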
The Gaudi 3 multi-chip package also looks interesting. I see 2 central compute dies, 8 HBM die stacks, and then 6 small dies interleaved between the HBM stacks - curious to know whether those are also functional, or just structural elements for mechanical support.
by kylixz on 4/9/2024, 6:11:15 PM
This is a bit snarky, but will Intel actually keep this product line alive for more than a few years? Having been bitten by building products around some of their non-x86 offerings, where they killed off good IP and then failed to support it… I'm skeptical.
I truly do hope it is successful so we can have some alternative accelerators.
by riskable on 4/9/2024, 5:17:24 PM
> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator
WHAT‽ It's basically got the equivalent of a 24-port, 200-gigabit switch built into it. How does that make sense? Can you imagine stringing 24 Cat 8 cables between servers in a single rack? Wait: how do you even decide where those cables go? Do you buy 24 Gaudi 3 accelerators and run cables directly between every single one of them so they can all talk 200-gigabit Ethernet to each other?
Also: if you've got that many Cat 8 cables coming out the back of the thing, how do you even access it? You'll have to unplug half of them (better keep track of which was connected to what port!) just to be able to grab the shell of the device in the rack. 24 ports are usually enough to take up the majority of horizontal space in a rack, so maybe this thing requires a minimum of 2-4U just to use it? That would make more sense, but it doesn't help in the density department.
I'm imagining a lot of orders for "a gradient" of colors of cables so the data center folks wiring the things can keep track of which cable is supposed to go where.
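For what it's worth, here's a quick count under one plausible reading (everything beyond the 24-port figure is my guess): with 8 accelerators in a node and each pair joined by 3 of those 200GbE links, every card spends 21 ports on the in-node mesh and keeps 3 for scale-out.

    # Links needed for a full mesh of n accelerators, k links per pair.
    def full_mesh_links(n, k=1):
        return n * (n - 1) // 2 * k

    # 8 cards, 3 links per pair: each card burns 7 * 3 = 21 of its 24 ports
    # inside the chassis, leaving 3 per card to leave the box.
    print(full_mesh_links(8, 3))  # 84 internal links, no switch required

That would at least explain the odd port count: the mesh lives inside the chassis, not across the rack.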
by sairahul82 on 4/9/2024, 5:25:14 PM
Can we expect the price of the 'Gaudi 3 PCIe' card to be reasonable enough to put in a workstation? That would be a game changer for local LLMs.
by rileyphone on 4/9/2024, 5:09:22 PM
128GB in one chip seems important with the rise of sparse architectures like MoE. Hopefully these are competitive with Nvidia's offerings, though in the end they will be competing for the same fab space as Nvidia if I'm not mistaken.
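Rough capacity math for one popular MoE (my numbers, and this ignores activations and KV cache):

    # Weights-only footprint: parameter count x bytes per parameter.
    params_billions = 46.7   # Mixtral-8x7B total parameters (approx.)
    bytes_per_param = 2      # bf16
    print(params_billions * bytes_per_param)  # ~93 GB -> fits in 128 GB

So a Mixtral-class model squeezes onto a single card without quantization, which two 80GB parts can't do without splitting.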
by kaycebasques on 4/9/2024, 5:28:09 PM
Wow, I very much appreciate the use of the 5 Ws and H [1] in this announcement. Thank you Intel for not subjecting my eyes to corp BS
by latchkey on 4/9/2024, 5:17:13 PM
> the only MLPerf-benchmarked alternative for LLMs on the market
I hope to work on this for AMD MI300x soon. My company just got added to the MLCommons organization.
by yieldcrv on 4/9/2024, 5:27:46 PM
Has anyone here bought an AI accelerator to run their AI SaaS service from home, serving customers directly, instead of trying to make a profit on top of OpenAI or Replicate?
Seems like an okay $8,000 - $30,000 investment, and bare metal server maintenance isn’t that complicated these days.
by 1024core on 4/9/2024, 5:05:03 PM
> Memory Boost for LLM Capacity Requirements: 128 gigabytes (GB) of HBMe2 memory capacity, 3.7 terabytes (TB) of memory bandwidth ...
I didn't know "terabytes (TB)" was a unit of memory bandwidth...
by throwaway4good on 4/9/2024, 8:36:27 PM
Worth noting that it is fabbed by TSMC.
by InvestorType on 4/9/2024, 11:48:47 PM
This appears to be manufactured by TSMC (or Samsung). The press release says it will use a 5nm process, which is not on Intel's roadmap.
"The Intel Gaudi 3 accelerator, architected for efficient large-scale AI compute, is manufactured on a 5 nanometer (nm) process"
by geertj on 4/9/2024, 7:52:15 PM
I wonder if someone knowledgeable could comment on oneAPI vs CUDA. I feel like if Intel is going to be a serious competitor to Nvidia, software and hardware are going to be equally important.
by einpoklum on 4/9/2024, 9:20:59 PM
If your metric is memory bandwidth or memory size, then this announcement gives you some concrete information. But suppose my metric for performance is matrix-multiply-add (or just matrix-multiply) throughput. What MMA primitives does Gaudi offer (i.e. which type combinations and matrix dimension combinations), and how many such ops per second does it sustain in practice? The linked page says "64,000 in parallel", but that does not actually tell me much.
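Right, "64,000 in parallel" is meaningless without a clock and a datatype. The conversion you'd want looks like this (with a made-up clock, since the announcement gives neither):

    # FLOP/s = parallel MACs x 2 (one multiply + one add) x clock frequency.
    macs = 64_000        # "64,000 in parallel", per the announcement
    clock_hz = 1.6e9     # hypothetical 1.6 GHz; Intel doesn't disclose it
    print(macs * 2 * clock_hz / 1e12, "TFLOP/s")  # ~205 TFLOP/s at this clock

Without the clock, the supported shapes, and the datatype, the headline number can't be compared to anything.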
by alecco on 4/9/2024, 6:25:11 PM
Gaudi 3 has PCIe 4.0 (vs. PCIe 5.0 on the H100, which offers twice the per-lane bandwidth). Probably not a deal-breaker, but it's strange for Intel (of all vendors) to lag behind on PCIe.
by ancharm on 4/9/2024, 6:48:11 PM
Is the scheduling / bare-metal software open source through oneAPI? Can a link be posted showing it if so?
by cavisne on 4/10/2024, 2:28:19 AM
Is there an equivalent to this reference for Intel Gaudi?
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
by AnonMO on 4/9/2024, 6:16:40 PM
It's crazy that Intel can't manufacture its own chips at the moment, but it looks like that might change in the coming years as new fabs come online.
by colechristensen on 4/9/2024, 5:18:22 PM
Anyone have experience and suggestions for an AI accelerator?
Think prototype consumer product with total cost preferably < $500, definitely less than $1000.
by MrYellowP on 4/10/2024, 8:42:09 AM
That's amusing. :D
by sandGorgon on 4/10/2024, 4:40:01 AM
>Intel Gaudi software integrates the PyTorch framework and provides optimized Hugging Face community-based models – the most-common AI framework for GenAI developers today. This allows GenAI developers to operate at a high abstraction level for ease of use and productivity and ease of model porting across hardware types.
What is the programming interface here? This is not CUDA, right? So how is this being done?
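Not CUDA; Gaudi has its own SynapseAI stack underneath, and its PyTorch bridge exposes the card as an "hpu" device. Roughly like this on Gaudi 2 today (assuming Gaudi 3 keeps the same interface):

    import torch
    import habana_frameworks.torch.core as htcore  # Habana's PyTorch bridge

    device = torch.device("hpu")             # Gaudi appears as the "hpu" device
    model = torch.nn.Linear(128, 64).to(device)
    x = torch.randn(8, 128, device=device)
    y = model(x)
    htcore.mark_step()                       # flush the lazy-mode graph to the device
    print(y.shape)

The high-level code stays plain PyTorch; the device string and the graph-flush call are the visible differences.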
by chessgecko on 4/9/2024, 6:58:32 PM
I feel a little misled by the speedup numbers. They compare lower-batch-size H100/H200 numbers against higher-batch-size Gaudi 3 numbers for throughput (which improves substantially as batch size increases). There may be some inference scenarios where Gaudi 3 is genuinely better, but it's really hard to tell from the numbers in the paper.
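A toy model of why batch size dominates these comparisons (the cost numbers are invented; the shape of the curve is the point):

    # Per-step latency ~ fixed weight-streaming cost + per-token compute cost.
    def tokens_per_sec(batch, weights_ms=20.0, per_token_ms=0.5):
        return batch / (weights_ms + per_token_ms * batch) * 1000

    for b in (1, 16, 128):
        print(b, round(tokens_per_sec(b)))
    # 1 -> 49, 16 -> 571, 128 -> 1524: a ~30x "speedup" from batch size alone,
    # so cross-chip comparisons at different batch sizes say very little.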
by andersa on 4/9/2024, 5:45:06 PM
Price?
by amelius on 4/9/2024, 9:08:37 PM
Missing in these pictures are the thermal management solutions.
by KeplerBoy on 4/9/2024, 9:13:02 PM
Vector floating-point performance comes in at 14 TFLOP/s for FP32 and 28 TFLOP/s for FP16.
Not the best of times for stuff that doesn't fit matrix processing units.
by mpreda on 4/9/2024, 5:50:59 PM
How much does one such card cost?
by metadat on 4/10/2024, 12:16:04 AM
> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator
How much does a single 200Gbit active optical (or passive copper) cable cost? Probably a few hundred dollars apiece... making even the cabling for each card Very Expensive. Never mind the network switches themselves...
Simultaneously impressive and disappointing.
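Back-of-envelope with assumed street prices (my numbers; shop around):

    ports = 24
    for cable, usd in [("200G DAC, short copper", 150), ("200G AOC, optical", 600)]:
        print(cable, ports * usd)   # $3,600 or $14,400 of cabling per card
    # ...and that's before paying for the 200GbE switch ports on the other end.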
by YetAnotherNick on 4/9/2024, 5:35:18 PM
So now hardware companies have stopped reporting FLOP/s numbers and instead report an arbitrary unit of parallel operations per second.
by m3kw9 on 4/9/2024, 7:06:38 PM
Can you run CUDA on it?
by brcmthrowaway on 4/9/2024, 6:46:32 PM
Does this support Apple silicon?
by whalesalad on 4/9/2024, 5:38:33 PM
One nice thing about this (and the new offerings from AMD) is that they use the Open Accelerator Module (OAM) interface, which standardizes the connector used to mount them on baseboards, similar to Nvidia's SXM connections, which use MegArray connectors to their baseboards.
With Nvidia, the SXM connection pinouts have always been held proprietary and confidential. For example, P100s and V100s have standard PCIe lanes connected to one of the two sides of their MegArray connectors, and if you knew that pinout you could literally build PCIe cards with SXM2/3 sockets to repurpose those now-obsolete chips (this has been done by one person).
There are thousands, maybe tens of thousands, of P100s you could pick up for literally <$50 apiece these days, which technically gives you more Tflops/$ than anything on the market. But they are useless: their interface was never made open, it hasn't been openly reverse-engineered, and the OEM baseboards (Dell and Supermicro, mainly) are still hideously expensive outside China.
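The Tflops/$ point checks out, at least on paper (prices assumed):

    # FP32 TFLOPS per dollar, assumed street prices.
    cards = {
        "P100 SXM2, used": (10.6, 50),      # ~10.6 FP32 TFLOPS, ~$50 used
        "H100, new":       (67.0, 30_000),  # ~67 FP32 TFLOPS, assumed price
    }
    for name, (tflops, usd) in cards.items():
        print(name, round(tflops / usd, 4))
    # ~0.21 vs ~0.002: two orders of magnitude, if only you could plug it in.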
I'm one of those people who find 'retro-super-computing' a cool hobby, so open interfaces like OAM mean these devices may actually have a life for hobbyists in 8-10 years, instead of being sent straight to the bins due to secret interfaces and obfuscated backplane specifications.