It seems like if Anthropic released a super cool and useful _free_ utility (a compiler, for example) that was better than existing counterparts or solved a problem that hadn’t been solved before[0], and just casually said “Here is this awesome thing that you should use every day. By the way, our language model made this.”, it would be incredible advertising for them.
But they instead made a blog post about how it would cost you twenty thousand dollars to recreate a piece of software that they do not, with a straight face, actually recommend that you use in any capacity other than as a toy.
[0] I am categorically not talking about anything AI related or anything that is directly a part of their sales funnel. I am talking about a piece of software that just efficiently does something useful. GCC is an example, Everything by voidtools is an example, Wireshark is an example, etc. Claude is not an example.
They made a blog post about it because it's an amazing test of the abilities of the models to deliver a working C compiler, even with lots of bugs and serious caveats, for $20k of tokens, without a human babysitting it.
I'd challenge anyone who is negative about this to try to achieve what they did by hand, with the same restrictions (e.g. generating full SSA form instead of just directly emitting code, and being capable of compiling Linux), and log their time doing it.
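(For anyone unfamiliar with the term: SSA, static single assignment, is an intermediate representation in which every value is assigned exactly once, with control-flow merges handled by phi nodes. A rough illustration of what that restriction means, using invented names rather than anything from the blog post:)

  /* Source: */
  int clamp_double(int x) {
      int y = x + 1;
      if (y > 10)
          y = y * 2;
      return y;
  }

  /* Roughly corresponding SSA form (illustrative pseudo-IR, not the
     project's actual output). Each name is defined exactly once, and
     the phi node selects a value based on which branch executed:
       entry:
         y1 = add x0, 1
         c1 = cmpgt y1, 10
         br c1, then, join
       then:
         y2 = mul y1, 2
         br join
       join:
         y3 = phi [y1, entry], [y2, then]
         ret y3
  */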
Having written several compilers, I'll say with some confidence that not many developers would succeed. Far fewer would succeed fast enough to compete with the $20k cost. Fewer still would do that and deliver decent-quality code.
Now notice the part where they've done this experiment before. This is the first time it succeeded. Give it another model iteration or two, and expect quality to increase, and price to drop.
>Every agent would hit the same bug, fix that bug, and then overwrite each other's changes. Having 16 agents running didn't help because each was stuck solving the same task.
>The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC
The blog post used the word autonomous a lot, which I suppose is true if Nicholas Carlini is not a human being but in fact a Claude agent.
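(For context, the approach in those quotes is differential testing: treat a trusted compiler as an oracle, build the same source with both compilers, run both binaries, and flag any divergence in behaviour. A minimal sketch of the idea, assuming a hypothetical ./mycc driver with a GCC-style command line; this is not the author's actual harness:)

  /* difftest.c - toy differential test using GCC as a known-good oracle.
     Assumes a hypothetical compiler under test, invoked as
     "./mycc -o out in.c". */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Run a shell command, capture its stdout, return its exit status. */
  static int run(const char *cmd, char *out, size_t n) {
      FILE *p = popen(cmd, "r");
      if (!p) return -1;
      size_t used = fread(out, 1, n - 1, p);
      out[used] = '\0';
      return pclose(p);
  }

  int main(int argc, char **argv) {
      if (argc != 2) {
          fprintf(stderr, "usage: %s test.c\n", argv[0]);
          return 2;
      }
      char cmd[512], want[4096], got[4096];

      /* Build and run the reference binary with the trusted compiler. */
      snprintf(cmd, sizeof cmd, "gcc -O0 -o ref.bin %s && ./ref.bin", argv[1]);
      if (run(cmd, want, sizeof want) != 0) {
          fprintf(stderr, "oracle build/run failed; test case is invalid\n");
          return 2;
      }

      /* Build and run the same source with the compiler under test. */
      snprintf(cmd, sizeof cmd, "./mycc -o test.bin %s && ./test.bin", argv[1]);
      if (run(cmd, got, sizeof got) != 0) {
          printf("FAIL: compiler under test crashed or failed to build\n");
          return 1;
      }

      /* Any difference in observable output is a bug in the new compiler. */
      if (strcmp(want, got) != 0) {
          printf("FAIL: output mismatch\n");
          return 1;
      }
      printf("OK\n");
      return 0;
  }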
>I'd challenge anyone who is negative about this to try to achieve what they did by hand, with the same restrictions (e.g. generating full SSA form instead of just directly emitting code, and being capable of compiling Linux), and log their time doing it.
Why would anyone do that? My point was: why does the company _not_ make a useful tool? I feel like that is a much more interesting topic of discussion than “why aren’t people who aren’t impressed by this spending their time trying to make this company look good?”
>This is the new floor.
Aside from the notion that they maybe intentionally set out to create the least useful or valuable output from their tooling (i.e. ‘the floor’), when they did not say that they did that, my question was “Why do they not make something genuinely useful?”. Marketing speak and imaginary engineers failing at made-up challenges does not answer that question.
> The blog post used the word autonomous a lot, which I suppose is true if Nicholas Carlini is not a human being but in fact a Claude agent.
Nothing in the article suggests it did not autonomously do the work.
> Why would anyone do that?
Because a lot of naysayers here pretend as if this is somehow trivial.
> My point was: why does the company _not_ make a useful tool?
Useful to whom? This is a researcher testing the limits of the models. Knowing those limits is highly useful to Anthropic. And it's highly useful to lots of others too, like me, as a means of understanding the capabilities of these models.
What, exactly, would such a tool that'd somehow make the people dismissing this change their minds look like? Because I don't think anything would. They could produce lots of useful tools if they aimed lower than testing the limits of the model. But it would not achieve what they set out to do, and it would not tell us anything useful.
I produce "useful tools" with Claude every day. That's not interesting. Anyone who actually uses these tools properly will develop a good understanding of the many things that can be achieved with them.
Most of us can't spend $20k figuring out where the limits are, however.
> I feel like that is a much more interesting topic of discussion than “why aren’t people who aren’t impressed by this spending their time trying to make this company look good?”
This is a ridiculous misrepresentation of the point. The point is that the people who aren't impressed by this very clearly and obviously do not have an understanding of the complexity of what they achieved, and are making ignorant statements about it.
> Aside from the notion that they maybe intentionally set out to create the least useful or valuable output from their tooling (i.e. ‘the floor’)
Again, you're either entirely failing to understand, or wilfully misrepresenting, what I said. No, their goal was not to "set out to create the least useful or valuable output". Their goal was to test the limits of what the model can achieve. They did that.
That has far higher value than not testing the limits. Lots and lots of people are building tools with Claude without testing the limits. We would not learn anything from that.
> my question was “Why do they not make something genuinely useful?”
Because that wasn't the purpose. The purpose was to test the limits of what the model can achieve. That you struggle to understand why what they achieved was massively impressive does not change that.
> Nothing in the article suggests it did not autonomously do the work.
I don’t know how to respond to that, other than to ask you to quote the part of the blog post where the author describes the language model running into a problem it could not fix, and then details how he manually intervened to fix that problem, when you elaborate on your definition of “nothing” in that sentence.
>Every agent would hit the same bug, fix that bug, and then overwrite each other's changes. Having 16 agents running didn't help because each was stuck solving the same task.
>The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC
As for:
> Because a lot of naysayers here pretend as if this is somehow trivial.
This is an answer to “why do you want someone to do that?” You have already established that you would like that to happen. It doesn’t answer “why would a real human being (who is not you), who isn’t impressed by the compiler that doesn’t work, put their time into making Anthropic look good?”
For example, “I will pay a naysayer $20,000 to try”, “I know a guy who will pay a naysayer to try this, succeed or fail”, or “I will give a naysayer a bunch of hardware to play with in exchange for attempting this” would all be motivation to do work for Anthropic without getting paid by Anthropic. Saying “I want you to do that because I think you’ll feel bad and waste your time” and then getting no takers isn’t really an assault on the naysayers’ decision not to do work for Anthropic without getting paid by Anthropic.
As for this:
> What, exactly, would such a tool that'd somehow make the people dismissing this change their minds look like?
That’s a good question, but I would say the bare minimum would be “useful”.
It is pretty common for tech companies to release free, useful software: for example PyTorch, React, and Hack/HHVM from Meta.
Or Chromium from Google. Chromium is a good example: there’s a decent chance that you’re using a Chromium-based browser to read this. There’s also a ton of other stuff; golang comes to mind as another example.
Or if you want stuff made by a business that’s a fraction of Anthropic’s valuation, there’s Campfire and Writebook by 37signals. https://once.com/
> Because that wasn't the purpose.
I know that. That was the premise of my question.
I saw that they put a bunch of resources into making something that is not useful, and asked why they did not put a bunch of resources into something that was useful. Surely they could make something that is both useful and makes their model look good?
To me, it seems like the obvious answer would be either that they can’t make something useful:
> Their goal was to test the limits of what the model can achieve. They did that.
Or that they don’t want to:
> Because that wasn't the purpose.
I was asking if anyone had any substantive knowledge or informed opinion about whether it was one or the other, but it seems like you’re saying it’s… both? They don’t want to make and release a useful tool, and also they cannot make and release a useful tool, because this compiler, which is not useful, is the limit of what their model can achieve.
Like you want us all to know that they cannot and do not want to make any sort of useful tool. That is your clearly-stated opinion about their desires and capabilities. And also you want these “naysayers”, who are not you, to put their time and effort into… also not making something useful? To prove… what?
> I don’t know how to respond to that, other than to ask you to quote the part of the blog post where the author describes the language model running into a problem it could not fix, and then details how he manually intervened to fix that problem, when you elaborate on your definition of “nothing” in that sentence.
I suggest you re-read that and pay attention to how they describe addressing these things by fixing the harness rather than solving the problems.
You conveniently quoted the part that doesn't support your claim.
> This is an answer to “why do you want someone to do that?” You have already established that you would like that to happen. It doesn’t answer “why would a real human being (who is not you), who isn’t impressed by the compiler that doesn’t work, put their time into making Anthropic look good?”
My bad for assuming the naysayers care about learning something or understanding the technology, rather than looking for excuses to ignorantly bash it.
> It is pretty common for tech companies to release free, useful software: for example PyTorch, React, and Hack/HHVM from Meta.
You entirely failed to address my question.
> I saw that they put a bunch of resources into making something that is not useful, and asked why they did not put a bunch of resources into something that was useful. Surely they could make something that is both useful and makes their model look good?
They made something that made their model look good to the people who are actually likely to want to use their model. Aka, the customers actually providing the vast majority of their revenue.
> I was asking if anyone had any substantive knowledge or informed opinion about whether it was one or the other, but it seems like you’re saying it’s… both? They don’t want to make and release a useful tool, and also they cannot make and release a useful tool, because this compiler, which is not useful, is the limit of what their model can achieve.
No, I've said there's no value in it for them to spend money on tools that'd just get dismissed and that at the same time wouldn't provide useful data to their actual customers.
> Like you want us all to know that they cannot and do not want to make any sort of useful tool.
I've said nothing of the sort. Stop lying about what I've said.
> That is your clearly-stated opinion about their desires and capabilities.
Another lie.
> And also you want these “naysayers”, who are not you, to put their time and effort into… also not making something useful? To prove… what?
I'd like them to stop being blatantly intellectually dishonest, to stop making trite, unjustified claims because they don't understand why this project had value, and to actually try to learn and inform themselves.
It's naive to assume there's any honest curiosity lurking behind these shallow dismissals, I know, but I try to think the best of people until they prove their intent.
> it's an amazing test of the abilities of the models to deliver a working C compiler, even with lots of bugs and serious caveats,
If I deliver my work with lots of bugs and serious caveats, I will be sacked.
Stretching the definition of "working", especially at a time when code quality is going down, does not help.
As far as I understand from the comments, Anthropic released a "compiler" that translates C code to some assembly, which might or might not be valid input for a linker.
They claimed they were able to compile the Linux kernel (which version? which config?) and boot it (was the boot successful? were all devices correctly initialized? is userland running without problems?)
At the moment it really looks like a political farce with no real outcome except some promises.