This appears to be part of an embargoed news blitz from a few news organizations (Verge and Bloomberg posted the same news at the same time), which is an interesting PR deviation from OpenAI posting it on their blog and having it go viral. The news isn't on their official blog at all currently.
Yeah, I found that interesting. Flubbed embargo times make sense, but could it also be that they're letting the news organizations have first dibs to build a little goodwill with the industry?
Is OpenAI still making any changes to make ChatGPT better, more accurate, and more correct, or are they only focused on making it cheaper by giving us weaker/dumber responses faster these days? I cancelled my subscription recently because I didn't think GPT-4 was much better than what I get for free from Claude or Gemini.
Going out on a limb here that they're not focusing the entire company on one objective.
Adding to this, reducing cost means they've reduced compute and improved quality per unit of computation.
That matters! It matters because these systems currently require an embarrassing amount of energy to run. It matters that these models could become portable to the point where they run locally on consumer electronics rather than being locked into remote compute and HTTP calls.
> Is OpenAI still making any changes to make ChatGPT better and more accurate and correct
On a few prompts recently it just answered me instead of refusing, to my surprise, so it feels like it's been getting better. Unfortunately I have no hard data on that; it's more of a feeling, so it's hard to prove to someone else that it's actually true.
Then again, ChatGPT-4o was recently found to be unable to do 9.11 - 9.9 correctly unless pressed, so there's still a long way to go.
That is interesting for sure, but the subsequent questioning "Why did you get it wrong the first time?" and asking it to explain itself seems like it shows a misunderstanding about how these technologies actually work. Please stop treating these tools like they are conscious and actually understand what they are spitting back to you! /rant
Actually, given that models are trained on data generated by humans, the ideal way to get decent responses is to treat them as such. Many times when I'm aware that a model gave an incorrect response, or I'm unsure and skeptical, a simple "Are you sure?" or "Explain your answer" does wonders.
I've done the "spreadsheets are all you need" exercise. The questioning isn't based on a misunderstanding that this is autocomplete with an insane dataset; given a black box, how else do you want it to explain how it works?
LLMs are text completion engines and can't really do math. Even if it happened to know the correct answer to 9.11 - 9.9, it could still fall flat on 9.12 - 9.9.
> LLMs are text completion engines and can't really do math
It blows my mind that there are people on HN that don't know this.
LLMs are basically Markov Chains on a massive dose of steroids. It looks at the last $context_window tokens and decides what the next token should be. They just work differently in that a neural network creates more "fuzzy" matching than a Markov Chain, which is purely statistical.
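To make the analogy concrete, here's a minimal bigram Markov chain over a toy corpus (the corpus is made up for illustration); the "fuzzy" part of an LLM replaces this literal frequency table with a neural network:

```python
import random
from collections import defaultdict

# Toy corpus, made up for illustration
corpus = "the cat sat on the mat the cat ate the fish".split()

# Build bigram counts: for each token, record every token that followed it
chain = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    chain[prev].append(nxt)

def generate(start, length, seed=0):
    random.seed(seed)
    out = [start]
    for _ in range(length):
        followers = chain.get(out[-1])
        if not followers:
            break
        # Purely statistical next-token pick, weighted by observed frequency
        out.append(random.choice(followers))
    return " ".join(out)

print(generate("the", 5))
```

An LLM does the "what followed before?" lookup with learned weights over a huge context window instead of an exact table, which is what lets it generalize to token sequences it never saw verbatim.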
Yes, but ChatGPT will happily go browsing the web if needed or appropriate. I guess there's a reason they haven't hooked up a symbolic math package for when math is encountered, but it feels weird that it's not a feature yet.
If it doesn't already (doesn't seem like it does last time I checked), it could potentially recognize that as a mathematical equation and feed that portion of the query into a calculation engine, and then include that in its answer.
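As a sketch of what that routing could look like (the regex and the `Decimal` evaluation here are purely illustrative assumptions, not anything OpenAI has described):

```python
import re
from decimal import Decimal

# Hypothetical router: spot a simple two-operand arithmetic expression in a
# query and compute it exactly, instead of letting the model guess the digits.
EXPR = re.compile(r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)")

def try_calculator(query):
    m = EXPR.search(query)
    if m is None:
        return None  # no arithmetic found; fall through to the LLM
    a, b = Decimal(m.group(1)), Decimal(m.group(3))
    op = m.group(2)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    return a / b

print(try_calculator("what is 9.11 - 9.9?"))  # -> -0.79
```

Using `Decimal` rather than floats matters here: `9.11 - 9.9` in binary floating point gives a result with a tiny error, while decimal arithmetic returns exactly -0.79.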
Q: Write parameterized Python that gives the answer to 9.11 - 9.9 and 9.12 - 9.9. Run the Python to show the answer to each pair of parameters.
A: The results for the parameter pairs are as follows:
- 9.11 - 9.9 = -0.79
- 9.12 - 9.9 = -0.78 [>_]
The [>_] is a link. Clicking it opens a pop-up window, showing:
def calculate_difference(a, b):
    return a - b
# Define the pairs of parameters
pairs = [(9.11, 9.9), (9.12, 9.9)]
# Calculate and print the results for each pair
results = {f"{a} - {b}": calculate_difference(a, b) for a, b in pairs}
results
// Note, if I ask it simply "compute the answer to 9.11 - 9.9", I get nonsense such as a negative number, and no Python. It doesn't "automagically" write code, it has to be nudged. If you see "Analyzing", it's fired up a sandboxed Python container.
I tried asking it to compare 9.11 to 9.9 using a python sandbox and it wrote the python code and then told me the result of the code would be 9.11 > 9.9 without actually running it. lol.
It is interesting that one of the best ways to deal with math problems in LLMs is to have them write Python code to solve the problem. They're good at writing that kind of fairly straightforward Python, and Python is good at doing math accurately. It does mean you need to safely implement a sandboxed Python interpreter to run the calculations, though.
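A crude sketch of that loop; note that a subprocess with a timeout is nowhere near a real sandbox (production systems use containers/VMs with no network access and strict resource limits):

```python
import subprocess
import sys

def run_generated_code(code, timeout=5.0):
    """Run model-generated Python in a separate process with a timeout.

    NOTE: this is NOT real sandboxing; it only isolates the host
    interpreter's state and bounds runtime. Treat it as illustrative.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

# e.g. code the model might emit for the arithmetic question
print(run_generated_code("print(9.11 - 9.9)"))
```

The output also shows why the model's prose answers go wrong: plain float subtraction prints a value like -0.79 plus floating-point noise, so even the "correct" path needs rounding before it's shown to a user.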
I suspect model improvement costs exponentially more. The fact that it's been over a year since GPT-4's release and 4o is only marginally better, combined with OpenAI focusing on other things (Sora, mini models, voice, etc.), suggests a "better and more accurate and correct" model is actually really hard.
If request latency is your top concern and you don’t need it to be as smart as GPT-4, why are you using OpenAI and not a model hosted by Groq or similar?
Groq’s rate limits are so bad they’re only good enough for one person using it very rarely with an under 4k context size. It’s just a marketing stunt at the moment.
There might be some changes, but OpenAI has consistently shown they have way more polished stuff sitting and getting ready to go out, based on the quality of the product or update on the day of the launch.
If "AI" tools are eventually going to have to be "free" (as in beer) to compete, I shudder to think of what companies like OpenAI will have to extract from users to please investors...
they could found a charity, with the aim of creating open AI models. they could call it OpenAI to draw attention that it's not a proprietary commercial technology.
Then they could have their largest benefactor pledge $100 million to fund them. Then he could suddenly insist they are woefully falling behind Google on developing the tech and demand the organization be handed over to his sole control to help them catch up. After being told he's wrong, he could get upset and renege on his pledge after giving only 10% of his commitment. Left scrambling and running out of funds, they'd realize that depending on donations at their scale isn't working and that they need a way to earn a profit, or a partnership, to self-fund.
Oh and after pulling the pledged money, that other party could go buy a social media site and use it to whine about how they aren't a non-profit anymore... while knowing they're the cause of it.
There are many ways to extract value using a commodity: in OpenAI's case (and a few other LLM providers), the current strategy appears to be lock-in with additional services.
My somewhat limited understanding of LLMs (and ML in general) is that you would end up using a LOT of bandwidth, not to mention that training would be insanely slow, since each layer of the neural network requires the results of the previous layer, and then needs to send its results back.
The greatest limiter for training isn't raw computing power, but memory and bandwidth. If your model is 400B parameters, you need 800 GB to store it, assuming fp16. Then you need another 800 GB for calculating gradients. Sharding all this out means transferring a lot of data. If your device only stores 1B of the 400B parameters, that means downloading 2 GB of data, doing your share of the work, then uploading 2 GB of results. Even with gigabit internet, you'll spend an order of magnitude more time transferring data than actually processing it.
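A quick back-of-envelope check of those numbers (a sketch with assumed figures: fp16 weights, a gigabit link, symmetric up/down):

```python
# Back-of-envelope: sharded training over consumer internet (assumed figures)
params_total = 400e9      # 400B parameters
bytes_per_param = 2       # fp16
shard_params = 1e9        # 1B parameters held by one device
link_bps = 1e9            # gigabit link, bits per second

model_gb = params_total * bytes_per_param / 1e9        # 800 GB just for weights
shard_bytes = shard_params * bytes_per_param           # 2 GB per shard
transfer_bytes = 2 * shard_bytes                       # download weights + upload gradients
transfer_s = transfer_bytes * 8 / link_bps             # seconds per round trip

print(f"full model: {model_gb:.0f} GB")
print(f"per-shard transfer: {transfer_bytes / 1e9:.0f} GB -> {transfer_s:.0f} s on gigabit")
```

That's ~32 seconds of pure transfer per shard per step before any computation happens, which is where the "order of magnitude more time transferring than processing" estimate comes from.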
At that point, it'd be faster to train on a standard-specced PC that had to constantly page out most of the model.
Sounds like a really interesting technical challenge.
Couple of issues I can see: (1) most devices out there would probably be mobile, so no NVIDIA/CUDA for you; (2) even with binding to, say, Apple Silicon, you might still be memory limited, i.e. can you fit the entire net on a single mobile GPU; (3) network latency?
I would think that you'd need a model format that could be updated in different places at the same time and easily merged at various stages. Because right now, adding all that network latency only slows things down.
I'm sure there's a method to permit it, but it doesn't seem like anyone's worked it out yet.
the problem is internode bandwidth and latency. supercomputers are super because they have insanely fast bandwidth between the nodes so the GPUs can talk to each other as if they were local.
I think enough people have already demonstrated that they essentially don't care. À la “facebook knows you're pregnant before you do” – and now so will “Open”“AI”.
Looking at https://openai.com/index/gpt-4o-mini-advancing-cost-efficien... , gpt-4o-mini is better than gpt-3.5 but worse than gpt-4o, as was expected. gpt-4o-mini is cheaper than both, however. Independent third-party performance benchmarks will help.
Claude 3.5 Sonnet is so much better in every test I've made (coding and everyday mundane stuff) that it beats me why anyone would choose ChatGPT (I'm using free versions only).
Good to know, although this is targeting cheaper API use for specific applications in which a second-tier model is sufficient. Note however that according to the LMSYS Leaderboard, GPT-4o rates slightly higher than Claude 3.5 Sonnet.
There's a market for small models with large context windows, for cases where there's little need for reasoning (summarization, search, etc.). 4o-mini is probably better and cheaper to run than 3.5.
It means they don't want anyone releasing much stronger models than theirs, or they'll undercut you by an order of magnitude on price and outperform you by 5%. Hardly a recipe for long-term consumer benefit.
OpenAI’s strategy has been bizarre since at least last November, when they launched custom GPTs, then had the boardroom coup.
Since the launch of Claude 3 Opus, and then Claude 3.5 Sonnet, they have been significantly behind Anthropic in terms of the general intelligence of their models. And instead of deploying something on par or better, they are making demos of video generation (Sora) or audio-to-audio models, not releasing anything.
GPT-4o is quite bad at coding, often getting stuck in a loop, and “fixing” buggy code by rewriting it without any changes.
GPT-4o is speculated to be a distillation of a larger model, and now GPT-4o-mini is an even dumber smaller model. But what’s the point?
Who is actually using small/fast/cheap/dumb models in production apps? Most real apps require higher reliability than even the biggest/slowest/priciest/smartest models can provide today. For the use case of transformers that has taken off, aiding students and knowledge workers in one-off tasks like writing code and prose, most users want smarter, more reliable outputs, even at the expense of speed and cost.
GPT-4o-mini seems like a move to increase margins, not make customers happier. That, like demoing products without launching them, is what big old slow corporations do, not how world-leading startups operate.
Since Claude 3.5 Sonnet was released I can't go back to GPT anymore. It sounds too "robotic" and is overly verbose: it explains every little detail I don't want to know and is still far worse than Claude. OpenAI really has to step up their game if they don't want to fall behind. In fact, GPT-4 got worse back in November; the best version is still the one from June 2023, but it's only available in the API.
Sonnet is great, but also suggest exploiting custom instructions in the ChatGPT UI. Here's a snippet from mine:
Extremely concise, formal. As short as possible. Assume I am an industry expert in any topic we discuss. Answer assuming I have the highest level of intellect possible, and do not require explication regardless of the sophistication of the topic. In cases where one approach among many is superior, offer an opinionated argument in favor of that approach.
I don't think there are any VPSs that can do that in a way that is even remotely performant or a good value compared to something like an LLM inference provider or serverless GPU. I would look into together.ai and RunPod for that type of thing.
But let me know if you find something. I just don't think something tiny like phi-3, which could run on a VPS, although great for its size, is at all comparable to this stuff in terms of ability.
True, you could run it at home on a server though.
My AI server takes about 60W idle and 300-350W while running a query in llama3. At a kWh price of 0.15€ that ends up at about 7-10€ a month if it's not loaded too heavily. Not bad IMO.
The server could be more energy optimized though. But that would cost me also.
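For what it's worth, the arithmetic roughly checks out (a sketch; the split of idle vs. load hours is my assumption):

```python
# Monthly electricity cost for a home AI server (assumed duty cycle)
idle_w, load_w = 60, 325      # watts; 325 is the midpoint of the 300-350 range
price_eur_kwh = 0.15
hours_month = 730
load_hours = 30               # assumption: roughly an hour of queries per day

idle_kwh = idle_w / 1000 * (hours_month - load_hours)
load_kwh = load_w / 1000 * load_hours
cost = (idle_kwh + load_kwh) * price_eur_kwh
print(f"~{cost:.2f} EUR/month")
```

At that duty cycle the total lands near 8 EUR/month, inside the quoted 7-10 EUR range; idle draw dominates, so cutting the 60 W baseline would save more than optimizing the query path.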
Gemma 27b? Command R (32b) would be the ideal middle ground but won't fit in a 16gb card. There are a handful of 12gb options like the new Mistral, though. I doubt a 12b offers much improvement over a 7b when compared to a 70b. Seems like an entirely different class.
You probably want to limit yourself if you do have a 16gb card because you still need to fit the context window in memory too.
About 100€ for the PC (some hardware was surplus) and 300€ for the GPU, which was a nice 16gb model with HBM2. Pretty nice for an educational project IMO. I'd much rather do something like this than spend money on a course.
I'm not finding any direct sources from OpenAI, but here's this snippet from a Reuters article [1]
> Priced at 15 cents per million input tokens and 60 cents per million output tokens, the GPT-4o mini is more than 60% cheaper than GPT-3.5 Turbo, OpenAI said.
> It currently outperforms the GPT-4 model on chat preferences and scored 82% on Massive Multitask Language Understanding (MMLU), OpenAI said.
...
> The GPT-4o mini model's score compared with 77.9% for Google's Gemini Flash and 73.8% for Anthropic's Claude Haiku, according to OpenAI.
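Taking the quoted 4o-mini prices at face value, here's a quick cost comparison on a hypothetical workload; the GPT-3.5 Turbo prices ($0.50/$1.50 per million tokens) are my assumption from public pricing at the time, not from the article:

```python
# USD per million tokens; 4o-mini prices from the quoted article,
# GPT-3.5 Turbo prices assumed from public pricing at the time
prices = {
    "gpt-4o-mini": {"in": 0.15, "out": 0.60},
    "gpt-3.5-turbo": {"in": 0.50, "out": 1.50},
}

def cost(model, in_tokens, out_tokens):
    p = prices[model]
    return (in_tokens * p["in"] + out_tokens * p["out"]) / 1e6

# Hypothetical monthly workload: 10M input + 2M output tokens
for m in prices:
    print(m, f"${cost(m, 10e6, 2e6):.2f}/month")
```

On that input-heavy mix the saving works out to roughly two thirds, consistent with the "more than 60% cheaper" claim; the exact figure depends on your input/output ratio since the two rates dropped by different amounts.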
For some more context: We don't know the size of 4o-mini but Mistral's just released NeMo 12B scores 68% on the MMLU. [2]
The quality-to-price graph suggests GPT-3.5 was the worst; now 4o-mini has edged out all the others in that lower league. It supposedly gets you Flash/Llama-3-70B-tier quality at around Llama-3-8B prices.