NVidia chips can do this job. (EDIT: The job of training a neural network, I should say. I don't think that's enough to get self-driving, but Dojo ain't anything but a faster NN training chip anyway.)
The question is if they are saving money compared to buying an off the shelf DGX system. I really doubt it.
Presumably Tesla, at the time they decided to pursue this option, thought they'd potentially have a competitive advantage with in-house designs.
It's entirely possible that's just their hubris showing; time will tell if this was the right decision. After seeing the NVidia presentation announcing their latest datacenter-scale AI hardware, I'd be surprised if Tesla's in-house design turns out to be anything more than a massive cost center compared to buying something from NVidia.
But sometimes you do things that appear irrational in part to keep your talented engineers from seeking work elsewhere. Just look at NASA's SLS: how much of that is a jobs program, in part to prevent hordes of talented folks from building rockets for competing nations?
But once Tesla is designing chips for their in-vehicle inference needs, they need to keep those people interested and the large-scale training side is arguably more interesting to DIY.
The biggest product of Tesla is its stock; there are no ifs and buts about it. This must change soon, since that mad money is barely enough to eke out a profitable quarter.
> After seeing the NVidia presentation announcing their latest datacenter-scale AI hardware
Did we watch the same presentation? NVidia knocked it out of the park.
Thread block cluster is obviously amazing. Routing between SMs / compute units will be far faster with this level of software abstraction, and it will be exceptionally easy to write code for. NVidia always impresses me with their advanced software techniques and clear understanding of the fundamental model of SIMD compute.
------
Ignoring those software details... the important thing is GH100 will be TSMC 4nm, which is 1.5 nodes ahead of the 7nm Dojo. A significant process advantage, representing 60+% less power usage and 300% the transistor density of the older 7nm tech.
Even if NVidia's GPU had issues, there's something to be said about just being a raw process node (or 1.5 nodes) ahead.
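Taking the parent's figures at face value, the compound effect of that node gap is easy to put numbers on. A quick back-of-envelope sketch — note the 60% power and 300% density figures are the claims from the comment above, not vendor specs:

```python
# Back-of-envelope scaling using the claimed figures for TSMC 4nm
# vs. 7nm (illustrative assumptions from the parent comment, not
# vendor-published numbers).
power_per_transistor = 1 - 0.60   # "60+% less power usage"
density_ratio = 3.0               # "300% the transistor density"

# Same die area -> ~3x the transistor budget; same power budget ->
# roughly 1/0.4 = 2.5x the switching activity you can afford.
transistors_per_mm2 = density_ratio
perf_per_watt = 1 / power_per_transistor

print(f"transistor budget: {transistors_per_mm2:.1f}x")  # 3.0x
print(f"perf per watt:     {perf_per_watt:.1f}x")        # 2.5x
```

Crude, since real designs trade density against clocks and yield, but it shows why being 1.5 nodes behind is hard to engineer around.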
> Did we watch the same presentation? NVidia knocked it out of the park.
Perhaps I worded it poorly, I agree with you.
My meaning was that, vs. NVidia's latest tech, Tesla's in-house datacenter NN hardware could end up being nothing more than a huge cost center, without even offering an advantage over what NVidia could sell them.
But like I said, if you have a staff of folks capable of building such things you have to keep them satisfied with practicing their craft or they leave.
> NASA's SLS, how much of that is a jobs program in part to prevent hordes of talented folks building rockets for competing nations
Zero. NASA has no problem with those engineers working for another nation, as long as it isn't Russia or North Korea and co. And that wouldn't happen anyway.
Those people would more likely work at one of the huge number of space startups, or just go to the usual suspects: ULA, Blue Origin, SpaceX, and so on.
You're making the totally wrong assumption that SLS has anything to do with rational thought. It really doesn't.
Tesla is very vertically integrated. This is just how they operate. You can make the argument that they shouldn't be so vertically integrated, but it has worked for them thus far.
So... no? They are clearly leveraging the NVidia ecosystem right now. Maybe they have ambitions to get off of NVidia, but they're going about it in a rather asinine fashion. There are probably half a dozen groups trying to make a faster systolic matrix multiplication unit for the deep learning crowd; Tesla probably should have worked with those groups and/or bought one out, for example.
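For context on what a "systolic matrix multiplication unit" does: it's a grid of multiply-accumulate cells where operands stream through in lockstep. A minimal software model of the output-stationary dataflow (the hardware does all PEs in parallel each cycle; this Python sketch just replays the timing):

```python
def systolic_matmul(a, b):
    """Model of an n x n output-stationary systolic array.

    Each PE (i, j) holds one accumulator. Rows of A stream in from the
    left and columns of B from the top, one element per cycle, skewed
    so that a[i][k] and b[k][j] arrive at PE (i, j) on cycle i + j + k.
    """
    n = len(a)
    acc = [[0] * n for _ in range(n)]
    # Last useful cycle is (n-1) + (n-1) + (n-1) = 3n - 3.
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n):
                k = t - i - j  # which operand pair reaches PE (i, j) now
                if 0 <= k < n:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

The appeal for deep learning hardware is that every cycle, every cell does a multiply-accumulate with only nearest-neighbor data movement — no global memory traffic inside the array.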
Sure. You only have to design a chip, design an assembly language, design a compiler, design the kernels, design a parallelization framework, design a server system to load-balance tasks, and then rework the pytorch/tensorflow code to use your new faster custom primitives that no one else has.
-----
Except step 1, "design a chip," is already something on the order of hundreds of megabucks of investment.
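Even the last, cheapest-sounding item on that list — reworking framework code to call your custom primitives — means threading a dispatch shim through every call site. A toy sketch of the pattern, with all names (`dojo_matmul`, `accelerator_available`) hypothetical stand-ins rather than any real Tesla or framework API:

```python
# Toy dispatch layer: route matmuls to a custom accelerator kernel
# when one is present, fall back to a reference implementation
# otherwise. Purely illustrative; real frameworks do this via
# custom-op registration rather than a hand-rolled shim.

def reference_matmul(a, b):
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def accelerator_available():
    return False  # pretend the custom chip isn't plugged in

def dojo_matmul(a, b):  # hypothetical custom kernel entry point
    raise NotImplementedError("custom hardware kernel goes here")

def matmul(a, b):
    # Every matmul in the model has to route through this shim --
    # that's the "rework the pytorch/tensorflow code" step above.
    if accelerator_available():
        return dojo_matmul(a, b)
    return reference_matmul(a, b)

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

And this shim is the easy part; the hard part is making the custom path numerically match and actually beat the tuned vendor kernels it replaces.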