Forget LINPACK and friends. Jack Dongarra is going to need to switch to the new metric for supercomputers: kilograms of H100 GPUs (about 3,300, give or take a few grams, for this system).
> For use by startup investments of Nat Friedman and Daniel Gross
> Reach out if you want access
I'm confused by the last two bullet points. Is this website only meant to be used by these "startup investments" or can anyone fill out the linked form?
Can the creators explain in more detail: how is this different from (for example) the OpenAI cluster that MSFT built in Azure? Is it hosted with an existing cloud provider or in its own data center? Which data center? Who admins the system, and is there an SRE team in case it goes down during training? And can you attempt to run the same benchmarks that Top500 uses to determine your double-precision flops, and give that number in addition to your "10 exaflops" (which I believe is single precision)?
as an ex-supercomputer nerd (where the fastest system in the world finally reached over 1 exaflops of double precision), it seems awfully weird to call FP8 "flops". There's nothing truly wrong with it (since "flops" is a fairly poorly defined term), but it makes it clear that ML supercomputers are very different beasts from classic supercomputers. It also makes me wonder if/when the classic folks will try to make more codes work correctly at lower precision (for example, in molecular dynamics).
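For a sense of scale, here's a rough back-of-envelope sketch. The GPU count is my own guess at the cluster size, and the per-GPU peaks are NVIDIA's published H100 SXM datasheet numbers (FP8 tensor with sparsity, FP64 tensor), so treat it as illustrative rather than the creators' actual accounting:

```python
# Rough sketch: how "10 exaflops" depends entirely on which precision you count.
num_gpus = 2500                 # assumed cluster size, give or take
fp8_tensor_tflops = 3958        # H100 SXM FP8 tensor peak (with sparsity), per datasheet
fp64_tensor_tflops = 67         # H100 SXM FP64 tensor peak, per datasheet

fp8_total_eflops = num_gpus * fp8_tensor_tflops / 1e6    # convert TFLOPS -> EFLOPS
fp64_total_eflops = num_gpus * fp64_tensor_tflops / 1e6

print(f"FP8 peak:  ~{fp8_total_eflops:.1f} EFLOPS")   # ~9.9, i.e. the "10 exaflops" headline
print(f"FP64 peak: ~{fp64_total_eflops:.2f} EFLOPS")  # ~0.17, well under Frontier's ~1.1 EFLOPS Rmax
```

Same hardware, roughly a 60x gap between the FP8 marketing number and what a Top500-style FP64 run would even theoretically allow.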
@nat would you be interested in presenting about this at The AI Conference in SF?
Your take on hardware-enabled investment would be interesting.
We also have folks like Hugging Face's GPT Research Lead, Langchain and LlamaIndex founders, Cerebras's CEO and many more speaking. It's a builder-heavy audience.
AI Grant, back in 2018, offered $2,500 and got all sorts of skeptical folks doubting Nat's and Daniel's motives (they will steal IP! there's some gotcha here!): https://news.ycombinator.com/item?id=16760736
They offer 100x that at $250,000 per team now, plus this humongous GPU cluster. Way to start small and work your way up to this. Amazing execution.
April 2018 was a more nascent time for AI: BERT wasn't open sourced until November 2018 and GPT-2 wasn't open sourced until February 2019, both of which kicked off the AI boom.
Y'all could totally eat Meta's lunch and train an open LLM with all the innovations that have come since LLaMA's release. Other startups are trying, but they all seem bottlenecked by training time/resources.
This could be where the next Stable Diffusion 1.5 comes from.