This appears to be part of an embargoed news blitz from a few news organizations (Verge and Bloomberg posted the same news at the same time), which is an interesting PR deviation from OpenAI posting it on their blog and having it go viral. The news isn't on their official blog at all currently.
Yeah, I found that interesting. Flubbed embargo times make sense, but could it also be that they're letting the news organizations have first dibs to build a little goodwill with the industry?
Is OpenAI still making any changes to make ChatGPT better, more accurate, and more correct, or are they only focused on making it cheaper by giving us weaker/dumber responses faster these days? I cancelled my subscription recently because I didn't think GPT-4 was much better than what I get for free from Claude or Gemini.
Going out on a limb here that they're not focusing the entire company on one objective.
Adding to this, reducing cost means they've reduced compute and improved quality per unit of computation.
That matters! It matters because these systems currently require an embarrassing amount of energy to run. It matters that these models could become portable to the point where they run locally on consumer electronics rather than being locked into remote compute and HTTP calls.
> Is OpenAI still making any changes to make ChatGPT better and more accurate and correct
On a few prompts recently it just answered me instead of refusing, to my surprise, so it feels like it's been getting better. Unfortunately I have no hard data on that; it's more of a feeling, so it's hard to prove to someone else that it's actually true.
Then again, ChatGPT-4o was recently found to be unable to do 9.11 - 9.9 correctly unless pressed, so there's still a long way to go.
That is interesting for sure, but the subsequent questioning "Why did you get it wrong the first time?" and asking it to explain itself seems like it shows a misunderstanding about how these technologies actually work. Please stop treating these tools like they are conscious and actually understand what they are spitting back to you! /rant
Actually, given that models are trained on data generated by humans, the ideal way to get decent responses is to treat them as such. Many times when I'm aware that a model gave an incorrect response, or I'm unsure and skeptical, a simple "Are you sure?" or "Explain your answer" does wonders.
I've done the "spreadsheets are all you need" exercise. The questioning isn't based on a misunderstanding that this is autocomplete with an insane dataset; given a black box, how else do you want it to explain how it works?
LLMs are text completion engines and can't really do math. Even if it happened to know the correct answer to 9.11 - 9.9, it could still fall flat on 9.12 - 9.9.
> LLMs are text completion engines and can't really do math
It blows my mind that there are people on HN that don't know this.
LLMs are basically Markov Chains on a massive dose of steroids. It looks at the last $context_window tokens and decides what the next token should be. They just work differently in that a neural network creates more "fuzzy" matching than a Markov Chain, which is purely statistical.
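To make the analogy concrete, here's a minimal bigram Markov chain over a toy corpus (the corpus is made up for illustration); the "fuzzy" part of an LLM replaces this literal frequency table with a neural network:

```python
import random
from collections import defaultdict

# Toy corpus, made up for illustration
corpus = "the cat sat on the mat the cat ate the fish".split()

# Build bigram counts: for each token, record every token that followed it
chain = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    chain[prev].append(nxt)

def generate(start, length, seed=0):
    random.seed(seed)
    out = [start]
    for _ in range(length):
        followers = chain.get(out[-1])
        if not followers:
            break
        # Purely statistical next-token pick, weighted by observed frequency
        out.append(random.choice(followers))
    return " ".join(out)

print(generate("the", 5))
```

An LLM does the "what followed before?" lookup with learned weights over a huge context window instead of an exact table, which is what lets it generalize to token sequences it never saw verbatim.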
Yes, but ChatGPT will happily go browsing the web if needed or appropriate. I guess there's a reason they haven't hooked up a symbolic math package for when math is encountered, but it feels weird that it's not a feature yet.
If it doesn't already (doesn't seem like it does last time I checked), it could potentially recognize that as a mathematical equation and feed that portion of the query into a calculation engine, and then include that in its answer.
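As a sketch of what that routing could look like (the regex and the `Decimal` evaluation here are purely illustrative assumptions, not anything OpenAI has described):

```python
import re
from decimal import Decimal

# Hypothetical router: spot a simple two-operand arithmetic expression in a
# query and compute it exactly, instead of letting the model guess the digits.
EXPR = re.compile(r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)")

def try_calculator(query):
    m = EXPR.search(query)
    if m is None:
        return None  # no arithmetic found; fall through to the LLM
    a, b = Decimal(m.group(1)), Decimal(m.group(3))
    op = m.group(2)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    return a / b

print(try_calculator("what is 9.11 - 9.9?"))  # -> -0.79
```

Using `Decimal` rather than floats matters here: `9.11 - 9.9` in binary floating point gives a result with a tiny error, while decimal arithmetic returns exactly -0.79.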
Q: Write parameterized Python that gives the answer to 9.11 - 9.9 and 9.12 - 9.9. Run the Python to show the answer to each pair of parameters.
A: The results for the parameter pairs are as follows:
- 9.11 - 9.9 = -0.79
- 9.12 - 9.9 = -0.78 [>_]
The [>_] is a link. Clicking it opens a pop-up window, showing:
def calculate_difference(a, b):
    return a - b
# Define the pairs of parameters
pairs = [(9.11, 9.9), (9.12, 9.9)]
# Calculate and print the results for each pair
results = {f"{a} - {b}": calculate_difference(a, b) for a, b in pairs}
results
// Note, if I ask it simply "compute the answer to 9.11 - 9.9", I get nonsense such as a negative number, and no Python. It doesn't "automagically" write code, it has to be nudged. If you see "Analyzing", it's fired up a sandboxed Python container.
I tried asking it to compare 9.11 to 9.9 using a python sandbox and it wrote the python code and then told me the result of the code would be 9.11 > 9.9 without actually running it. lol.
It is interesting that one of the best ways to deal with math problems in LLMs is to have them write Python code to solve the problem. They're good at writing that kind of fairly straightforward Python, and Python is good at doing math accurately. It does mean you need to safely implement a sandboxed Python interpreter to run the calculations, though.
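A crude sketch of that loop; note that a subprocess with a timeout is nowhere near a real sandbox (production systems use containers/VMs with no network access and strict resource limits):

```python
import subprocess
import sys

def run_generated_code(code, timeout=5.0):
    """Run model-generated Python in a separate process with a timeout.

    NOTE: this is NOT real sandboxing; it only isolates the host
    interpreter's state and bounds runtime. Treat it as illustrative.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

# e.g. code the model might emit for the arithmetic question
print(run_generated_code("print(9.11 - 9.9)"))
```

The output also shows why the model's prose answers go wrong: plain float subtraction prints a value like -0.79 plus floating-point noise, so even the "correct" path needs rounding before it's shown to a user.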
I suspect model improvement costs exponentially more. The fact that it's been over a year since GPT-4's release and 4o is only marginally better, combined with OpenAI focusing on other things (Sora, mini models, voice, etc.), suggests a "better and more accurate and correct" model is actually really hard.
If request latency is your top concern and you don’t need it to be as smart as GPT-4, why are you using OpenAI and not a model hosted by Groq or similar?
Groq’s rate limits are so bad they’re only good enough for one person using it very rarely with an under 4k context size. It’s just a marketing stunt at the moment.
There might be some changes, but OpenAI has consistently shown they have way more polished stuff sitting and getting ready to go out, based on the quality of the product or update on the day of the launch.
If "AI" tools are eventually going to have to be "free" (as in beer) to compete, I shudder to think of what companies like OpenAI will have to extract from users to please investors...
they could found a charity, with the aim of creating open AI models. they could call it OpenAI to draw attention that it's not a proprietary commercial technology.
Then they could have their largest benefactor pledge $100 million to fund them. Then he could suddenly insist they are woefully falling behind Google on developing the tech and demand the organization be handed over to his sole control to help them catch up. After being told he's wrong, he could get upset and renege on his pledge after giving only 10% of his commitment. Left scrambling and running out of funds, they'd realize that depending on donations at their scale isn't working and that they need a way to earn a profit, or a partnership, to self-fund.
Oh and after pulling the pledged money, that other party could go buy a social media site and use it to whine about how they aren't a non-profit anymore... while knowing they're the cause of it.
There are many ways to extract value using a commodity: in OpenAI's case (and a few other LLM providers), the current strategy appears to be lock-in with additional services.
My somewhat limited understanding of LLMs (and ML in general) is that you would end up using a LOT of bandwidth, not to mention that training would be insanely slow, since each layer of the neural network requires the results of the previous layer, and then needs to send its results back.
The greatest limiter for training isn't raw computing power, but memory and bandwidth. If your model is 400B parameters, you need 800 GB to store it, assuming fp16. Then you need another 800 GB for calculating gradients. Sharding all this out means transferring a lot of data. If your device only stores 1B of the 400B parameters, that means downloading 2 GB of data, doing your share of the work, then uploading 2 GB of results. Even with gigabit internet, you'll spend an order of magnitude more time transferring data than actually processing it.
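A quick back-of-envelope check of those numbers (a sketch with assumed figures: fp16 weights, a gigabit link, symmetric up/down):

```python
# Back-of-envelope: sharded training over consumer internet (assumed figures)
params_total = 400e9      # 400B parameters
bytes_per_param = 2       # fp16
shard_params = 1e9        # 1B parameters held by one device
link_bps = 1e9            # gigabit link, bits per second

model_gb = params_total * bytes_per_param / 1e9        # 800 GB just for weights
shard_bytes = shard_params * bytes_per_param           # 2 GB per shard
transfer_bytes = 2 * shard_bytes                       # download weights + upload gradients
transfer_s = transfer_bytes * 8 / link_bps             # seconds per round trip

print(f"full model: {model_gb:.0f} GB")
print(f"per-shard transfer: {transfer_bytes / 1e9:.0f} GB -> {transfer_s:.0f} s on gigabit")
```

That's ~32 seconds of pure transfer per shard per step before any computation happens, which is where the "order of magnitude more time transferring than processing" estimate comes from.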
At that point, it'd be faster to train on a standard-specced PC that had to constantly page out most of the model.
Sounds like a really interesting technical challenge.
Couple of issues I can see: (1) most devices out there would probably be mobile, so no NVIDIA/CUDA for you; (2) even with binding to, say, Apple Silicon, you might still be memory limited, i.e. can you fit the entire net on a single mobile GPU; (3) network latency?
I would think that you'd need a model format that could be updated in different places at the same time and easily merged at various stages. Because right now, adding all that network latency only slows things down.
I'm sure there's a method to permit it, but it doesn't seem like anyone's worked it out yet.
the problem is internode bandwidth and latency. supercomputers are super because they have insanely fast bandwidth between the nodes so the GPUs can talk to each other as if they were local.
I think enough people have already demonstrated that they essentially don't care. À la “facebook knows you're pregnant before you do” – and now so will “Open”“AI”.
Looking at https://openai.com/index/gpt-4o-mini-advancing-cost-efficien... , gpt-4o-mini is better than gpt-3.5 but worse than gpt-4o, as was expected. gpt-4o-mini is cheaper than both, however. Independent third-party performance benchmarks will help.
Claude 3.5 Sonnet is so much better in every test I've made (coding and everyday mundane stuff) that it beats me why anyone would choose ChatGPT (I'm using free versions only).
Good to know, although this is targeting cheaper API use for specific applications in which a second-tier model is sufficient. Note however that according to the LMSYS Leaderboard, GPT-4o rates slightly higher than Claude 3.5 Sonnet.
There's a market for small models with large context windows, for cases where there's little need for reasoning (summarization, search, etc.). 4o-mini is probably better and cheaper to run than 3.5.
It means they don't want anyone releasing much stronger models than theirs, or they'll undercut you by an order of magnitude on price and outperform you by 5%. Hardly a recipe for long-term consumer benefit.
OpenAI’s strategy has been bizarre since at least last November, when they launched custom GPTs, then had the boardroom coup.
Since the launch of Claude 3 Opus, and then Claude 3.5 Sonnet, they have been significantly behind Anthropic in terms of the general intelligence of their models. And instead of deploying something on par or better, they are making demos of video generation (Sora) or audio-to-audio models, not releasing anything.
GPT-4o is quite bad at coding, often getting stuck in a loop, and “fixing” buggy code by rewriting it without any changes.
GPT-4o is speculated to be a distillation of a larger model, and now GPT-4o-mini is an even dumber smaller model. But what’s the point?
Who is actually using small/fast/cheap/dumb models in production apps? Most real apps require higher reliability than even the biggest/slowest/priciest/smartest models can provide today. For the use case of transformers that has taken off, aiding students and knowledge workers in one-off tasks like writing code and prose, most users want smarter, more reliable outputs, even at the expense of speed and cost.
GPT-4o-mini seems like a move to increase margins, not make customers happier. That, like demoing products without launching them, is what big old slow corporations do, not how world-leading startups operate.
Since Claude 3.5 Sonnet was released I can't go back to GPT anymore. It sounds too "robotic" and is overly verbose: it explains every little detail I don't want to know and is still far worse than Claude. OpenAI really has to step up their game if they don't want to fall behind. In fact, GPT-4 got worse back in November; the best version is still the one from June 2023, but it's only available in the API.
Sonnet is great, but also suggest exploiting custom instructions in the ChatGPT UI. Here's a snippet from mine:
Extremely concise, formal. As short as possible. Assume I am an industry expert in any topic we discuss. Answer assuming I have the highest level of intellect possible, and do not require explication regardless of the sophistication of the topic. In cases where one approach among many is superior, offer an opinionated argument in favor of that approach.
I don't think there are any VPSs that can do that in a way that is even remotely performant or a good value compared to something like an LLM inference provider or serverless GPU. I would look into together.ai and RunPod for that type of thing.
But let me know if you find something. I just don't think something tiny like phi-3, which could run on a VPS, although great for its size, is at all comparable to this stuff in terms of ability.
True, you could run it at home on a server though.
My AI server takes about 60W idle and 300-350W while running a query in llama3. At a kWh price of 0.15€ that ends up at about 7-10€ a month if it's not loaded too heavily. Not bad IMO.
The server could be more energy optimized though. But that would cost me also.
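For what it's worth, the arithmetic roughly checks out (a sketch; the split of idle vs. load hours is my assumption):

```python
# Monthly electricity cost for a home AI server (assumed duty cycle)
idle_w, load_w = 60, 325      # watts; 325 is the midpoint of the 300-350 range
price_eur_kwh = 0.15
hours_month = 730
load_hours = 30               # assumption: roughly an hour of queries per day

idle_kwh = idle_w / 1000 * (hours_month - load_hours)
load_kwh = load_w / 1000 * load_hours
cost = (idle_kwh + load_kwh) * price_eur_kwh
print(f"~{cost:.2f} EUR/month")
```

At that duty cycle the total lands near 8 EUR/month, inside the quoted 7-10 EUR range; idle draw dominates, so cutting the 60 W baseline would save more than optimizing the query path.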
Gemma 27b? Command R (32b) would be the ideal middle ground but won't fit in a 16gb card. There are a handful of 12gb options like the new Mistral, though. I doubt a 12b offers much improvement over a 7b when compared to a 70b. Seems like an entirely different class.
You probably want to limit yourself if you do have a 16gb card because you still need to fit the context window in memory too.
About 100€ for the PC (some hardware was surplus) and 300€ for the GPU, which was a nice 16gb model with HBM2. Pretty nice for an educational project IMO. I'd much rather do something like this than spend money on a course.
I'm not finding any direct sources from OpenAI, but here's this snippet from a Reuters article [1]
> Priced at 15 cents per million input tokens and 60 cents per million output tokens, the GPT-4o mini is more than 60% cheaper than GPT-3.5 Turbo, OpenAI said.
> It currently outperforms the GPT-4 model on chat preferences and scored 82% on Massive Multitask Language Understanding (MMLU), OpenAI said.
...
> The GPT-4o mini model's score compared with 77.9% for Google's Gemini Flash and 73.8% for Anthropic's Claude Haiku, according to OpenAI.
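Taking the quoted 4o-mini prices at face value, here's a quick cost comparison on a hypothetical workload; the GPT-3.5 Turbo prices ($0.50/$1.50 per million tokens) are my assumption from public pricing at the time, not from the article:

```python
# USD per million tokens; 4o-mini prices from the quoted article,
# GPT-3.5 Turbo prices assumed from public pricing at the time
prices = {
    "gpt-4o-mini": {"in": 0.15, "out": 0.60},
    "gpt-3.5-turbo": {"in": 0.50, "out": 1.50},
}

def cost(model, in_tokens, out_tokens):
    p = prices[model]
    return (in_tokens * p["in"] + out_tokens * p["out"]) / 1e6

# Hypothetical monthly workload: 10M input + 2M output tokens
for m in prices:
    print(m, f"${cost(m, 10e6, 2e6):.2f}/month")
```

On that input-heavy mix the saving works out to roughly two thirds, consistent with the "more than 60% cheaper" claim; the exact figure depends on your input/output ratio since the two rates dropped by different amounts.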
For some more context: We don't know the size of 4o-mini but Mistral's just released NeMo 12B scores 68% on the MMLU. [2]
The quality-to-price graph suggests GPT-3.5 was the worst; now 4o-mini has edged out all the others in that lower league. It supposedly gets you Flash/Llama-3-70B-tier quality at around Llama-3-8B prices.