Hacker News
Overcoming the limits of current LLMs (seanpedersen.github.io)
119 points by sean_pedersen on July 18, 2024 | 106 comments


LLMs don't only hallucinate because of mistaken statements in their training data. It comes hand-in-hand with the model's ability to remix, interpolate, and extrapolate answers to questions that aren't directly answered in the dataset. For example, if I ask ChatGPT a legal question, it might cite as precedent a case that doesn't exist at all (but which seems plausible, being interpolated from cases that do exist). It's not necessarily because it drew that case from a TV episode. It works the same way GPT-3 wrote news releases that sounded convincing, matching the structure and flow of real articles.

Training only on factual data won't solve this.

Anyway, I can't help but feel saddened sometimes to see our talented people and investment resources being drawn into developing these AI chatbots. These problems are solvable, but are we really making a better world by solving them?


100% I think the author is really misunderstanding the issue here. "Hallucination" is a fundamental aspect of the design of Large Language Models. Narrowing the distribution of the training data will reduce the LLM's ability to generalize, but it won't stop hallucinations.


I agree that a perfectly consistent dataset won't completely stop statistical language models from hallucinating, but it will reduce it. I think it is established that data quality is more important than quantity. Bullshit in -> bullshit out, so a focus on data quality is good and needed IMO.

I am also saying LLM output should cite sources and give confidence scores (which reflect how far the output is inside or outside the training distribution).


I think the problem is you need an extremely large quantity of data just to get the machine to work in the first place. So much so that there may not be enough to get it working on just "quality" data.


How would confidence scores work? Multiple passthroughs and a % attached to each statement according to how often it appeared in the generated result?

If so, building this could be quite complex depending on the domain. In the legal field even one simple word that is changed can have large consequences.
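
Something like this minimal sketch, perhaps (assuming a hypothetical generate() call that returns one sampled completion)? Sample the same prompt several times and report how often the answers agree:

  from collections import Counter

  def generate(prompt: str) -> str:
      """Hypothetical LLM call returning one sampled completion."""
      raise NotImplementedError

  def answer_with_confidence(prompt: str, n_samples: int = 10):
      # Sample the same prompt several times at non-zero temperature.
      answers = [generate(prompt) for _ in range(n_samples)]
      # The most frequent answer is the candidate; the fraction of samples
      # agreeing with it serves as a crude confidence score.
      best, count = Counter(answers).most_common(1)[0]
      return best, count / n_samples

Though as noted, exact string agreement is a blunt instrument; in a field like law, semantically equivalent answers would need a fuzzier comparison step.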


What’s a non-statistical language model?

And I think looking to the training data for sources is a little silly - that’s the training data for intuitive language use, not true statements about the world. If you haven’t checked it out yet, two terms you’d love are “RAG” and “Manuel De Landa”


Most sentences in the world are not about truth or falsity. Training on a high quality corpus isn’t going to fix ‘hallucination’. The complete separation of facts from sentences is what makes LLMs powerful.


These problems are solvable, but are we really making a better world by solving them?

When you ask yourself that question -- and you do ask yourself that, right? -- what's your answer?


I do, all the time! My answer is "most likely not". (I assumed that answer was implied by my expressing sadness about all the work being invested in them.) This is why, although I try to keep up-to-date with and understand these technologies, I am not being paid to develop them.


I mean, come on man, _most_ of what tech people work on in general is also useless. FAANG has a retainer on talent whether they are spending their time wisely or not.

LLM stuff is not in a different echelon


AI noob here, but instead of training and fine-tuning the LLM itself, don't more specific and targeted embeddings paired with the model help alleviate hallucination, where you incorporate semantic search context along with the question?
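
Roughly what I have in mind (a minimal sketch, with hypothetical embed() and generate() calls standing in for any particular vendor API):

  import numpy as np

  def embed(text: str) -> np.ndarray:
      """Hypothetical embedding call returning a unit-length vector."""
      raise NotImplementedError

  def generate(prompt: str) -> str:
      """Hypothetical LLM completion call."""
      raise NotImplementedError

  def answer_with_context(question: str, documents: list[str], k: int = 3) -> str:
      # Embed the corpus and the question, then pick the k nearest documents.
      doc_vectors = [embed(d) for d in documents]
      q = embed(question)
      scores = [float(q @ v) for v in doc_vectors]  # cosine similarity for unit vectors
      top = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:k]
      context = "\n\n".join(documents[i] for i in top)
      prompt = ("Answer using only the context below. If the context does not "
                "contain the answer, say so.\n\n"
                f"Context:\n{context}\n\nQuestion: {question}")
      return generate(prompt)

My understanding is that this reduces hallucination but doesn't eliminate it: the model can still misread the retrieved context or ignore the instruction.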


One of the main factors that makes LLMs popular today is that scaling up the models is a simple and (relatively) inexpensive matter of buying compute capacity and scraping together more raw text to train them. Without large and highly diverse training datasets to construct base models, LLMs cannot produce even the superficial appearance of good results.

Manually curating "tidy", properly-licensed and verified datasets is immensely more difficult, expensive, and time-consuming than stealing whatever you can find on the open internet. Wolfram Alpha is one of the more successful attempts in that curation-based direction (using good-old-fashioned heuristic techniques instead of opaque ML models), and while it is very useful and contains a great deal of factual information, it does not conjure appealing fantasies of magical capabilities springing up from thin air and hands-off exponential improvement.


> properly-licensed and verified datasets is immensely more difficult, expensive

Arguably the bigger problem is that many of those datasets e.g. WSJ articles are proprietary and can be exclusively licensed like we've seen recently with OpenAI.

So we end up in a situation where competition is simply not possible.


> Arguably the bigger problem is that many of those datasets e.g. WSJ articles are proprietary and can be exclusively licensed like we've seen recently with OpenAI.

> So we end up in a situation where competition is simply not possible.

Exactly, and Technofeudalism advances a little more into a new feud.

OpenAI is trying to create its moat by shoring up training data, probably attempting to keep competitors from training on the same datasets they've been licensing, at least for a while. Training data is the only possible moat for LLMs; models seem to be advancing quite well across different companies, but as mentioned here, a tidy training dataset is the actual gold.


It's landmines no matter how you approach the problem.

If you treat the web as a free-for-all and you scrape freely, you get sued by the content platforms for copyright or term of service violation.

If you license the content, you let the highest bidder get the content.

No matter what happens, capital wins.


the irony is that if large media providers aren't represented in the training sets, my comments on internet forums over the decades will be over-represented, which is kind of great, really.


Right? I always ask people - Hypothetically if someone created a superintelligent AI that took over the world, wouldn’t you WANT it to share your opinions and morals?

Every tiny bit of text you write is a vote in the election of our future AI overlords.


It’s not unethical if people in positions of privilege and power do it to maintain their rightful position of privilege and power.


Please don't post in the flamewar style to HN. It degrades discussion and we're trying to go in the opposite direction here, to the extent that is possible on the internet.

https://news.ycombinator.com/newsguidelines.html


> ...manually curate a high-quality (consistent) text corpus based on undisputed, well curated wikipedia articles and battle tested scientific literature.

This rests on the mistaken assumption that science is about objective truth.

It is mistaking the map for the territory. Scientific models are intended to be useful, not perfect.

Statistical learning vs. symbolic learning is about existential quantification vs. universal quantification, respectively.

All models are wrong, some are useful; this applies even to the most unreasonably accurate ones like QFT and GR.

Spherical cows, no matter how useful, are hotly debated outside the didactic half-truths of low-level courses.

The corpus that the above seeks doesn't exist in academic circles, only in popular science, where people don't see that practical, useful models are far more important than 'correct' ones.


We can't develop a universally coherent data set because what we understand as "truth" is so intensely contextual that we can't hope to cover the amount of context needed to make the things work how we want, not to mention the numerous social situations where writing factual statements would be awkward or disastrous.

Here are a few examples of statements that are not "factual" in the sense of being derivable from a universally coherent data set, and that nevertheless we would expect a useful intelligence to be able to generate:

"There is a region called Hobbiton where someone named Frodo Baggins lives."

"We'd like to announce that Mr. Ousted is transitioning from his role as CEO to an advisory position while he looks for a new challenge. We are grateful to Mr. Ousted for his contributions and will be sad to see him go."

"The earth is round."

"Nebraska is flat."


> We can't develop a universally coherent data set because

Yet every child seems to manage, when raised by a small village, over a period of about 18 years. I guess we just need to give these LLMs a little more love and attention.


And then you go out into the real world, talk to real adults, and discover that the majority of people don't have a coherent mental model of the world, and have completely ridiculous ideas that aren't anywhere near an approximation of the real physical world.


> and discover that the majority of people don't have a coherent mental model of the world

"Coherent" is doing a lot of lifting here. All humans have highly flawed models, and we've been culturally conditioned to grade on a curve to hide the problem from ourselves.


Or maybe hundreds of millions of years of evolutionary pressure to build unbelievably efficient function approximation.


You're right. We don't really know how to handle uncertainty and fuzziness in logic properly (to avoid logical contradictions). There have been many mathematical attempts to model uncertainty (just to name a few: probability, Dempster-Shafer theory, fuzzy logic, non-monotone logics, etc.), but they all suffer from some kind of paradox.

At the end of the day, none of these theoretical techniques prevailed in the field of AI, and we ended up with, empirically successful, neural networks (and LLMs specifically). We know they model uncertainty but we have no clue how they do it conceptually, or whether they even have a coherent conception of uncertainty.

So I would posit that the problem isn't that we don't have the technology, but rather that we don't understand what we want from it. I have yet to see a coherent theory of how humans use language to express uncertainty that would encompass a broad (if not complete) range of how people actually use it. Without that, you can't define what a hallucination of an LLM is. Maybe it's making a joke (some believe the point of a joke is to highlight a subtle logical error of some sort), because, you know, it read a lot of them and concluded that's what humans do.

So AI eventually prevailed (over humans) in fields where we were able to precisely define the goal. But what is our goal vis-a-vis human language? What do we want AI to answer to our prompts? I think we are stuck at the lack of a definition of that.


Man it seems like the ship has sailed on "hallucination" but it's such a terrible name for the phenomenon we see. It is a major mistake to imply the issue is with perception rather than structural incompetence. Why not just say "incoherent output"? It's actually descriptive and doesn't require bastardizing a word we already find meaningful to mean something completely different.


> Why not just say "incoherent output"?

Because the biggest problem with hallucinations is that the output is usually coherent but factually incorrect. I agree that "hallucination" isn't the best word for it... perhaps something like "confabulation" is better.


And we use "hallucination" because in the ancient times, when generative AI meant image generation, models would "hallucinate" extra fingers etc.

The behavior of text models is similar enough that the wording stuck, and it's not all that bad.


"hallucination" was coined in the context of text generating RNNs. Specifically in this blog post by Karpathy in 2015: https://karpathy.github.io/2015/05/21/rnn-effectiveness/


That was a misnomer—hallucination refers to perception, not generation. Completely misled an entire generation of people.


I appreciated a post on here recently that likened AI hallucination to 'bullshitting'. It's coherent, even plausible output without any regard for the truth.


More true to say that all output is bullshitting, not just the ones we call hallucinations. Some of it is true, some isn't. The model doesn't know or care.


While I have absolutely no issues with the word "shit" in popular terms, I'd normally like to reserve it for situations where there's actually intended malice like in "enshittification".

Rather than just an imperfect technology as we have here.

Many people object to the term enshittification for foul-mouthing reasons but I think it covers it very well because the principle it covers is itself so very nasty. But that's not at all the case here.


"Bullshitting" isn't a new piece of jargon, it's a common English word of many decades vintage, and is being used in its dictionary sense here.


I think it's a pretty good name for the phenomenon -- maybe the only problem with the term is that what models are doing is 100% hallucination all the time -- it's just that when the hallucinations are useful we don't call them hallucinations -- so maybe that is a problem with the term (not sure if that's what you are getting at).

But there's nothing at all different about what the model is doing between these cases -- the models are hallucinating all the time and have no ability to assess when they are hallucinating "right" or "wrong" or useful/non-useful output in any meaningful way.


They aren't hallucinating in any way comparable to humans, which implies a delusion in perception. You're describing the quality of output by using a word used to describe the quality of input.


I prefer “confabulate” to describe this phenomenon.

: to fill in gaps in memory by fabrication

> In psychology, confabulation is a memory error consisting of the production of fabricated, distorted, or misinterpreted memories about oneself or the world.

It’s more about coming up with a plausible explanation in the absence of a readily-available one.


This is not what hallucination means in the pre-LLM machine learning literature.


"Hallucinations" implies that someone isn't of sound mental state. We can argue forever about what that means for a LLM and whether that's appropriate, but I think it's absolutely the right attitude and approach to be taking toward these things.

They simply do not behave like humans of sound minds, and "hallucinations" conveys that in a way that "confabulations" or even "bullshit" does not. (Though "bullshit" isn't bad either.)


I disagree with this take because LLMs are, always, hallucinating. When they get things right it’s because they are lucky. Yes, yes, it’s more complicated than that, but the essence of LLMs is that they are very good at being lucky. So good that they will often give you better results than random search engine clicks, but not good enough to be useful for anything important.

I think calling the times they get things wrong "hallucinations" is largely an advertising trick, so that they can sort of fit LLMs into how all IT is sometimes “wonky” and sell their fundamentally flawed technology more easily. I also think it works extremely well.


But the point is, isn't hallucinating about having malformed, altered or out of touch input rather than producing inaccurate output yourself?

It is the memory pathways leading them astray. It could be thought of as a memory system that, at a certain point, can no longer be fully sure whether the connections it holds come from what it was actually trained on or were created accidentally.


> isn't hallucinating about having malformed, altered or out of touch input rather than producing inaccurate output yourself?

I suppose so, in the sense that someone could simply be lying about pink elephants instead of seeing them. However it's hard to argue that the machine knows the "right" answer and is (intelligently?) deceiving us.

> It is the memory pathways leading them astray.

I don't think it's a "memory" issue as much as a "they don't operate the way we like to think they do" issue.

Suppose a human is asked to describe different paintings on the wall of an art gallery. Sometimes their statements appear valid and you nod along, and sometimes the statements are so wrong that it alarms you, because "this person is hallucinating."

Now consider how the entire situation is flipped by finding out one additional fact... They're actually totally blind.

Is it a lie? Is it a hallucination? Does it matter? Either way you must dramatically re-evaluate what their "good" outputs really mean and whether they can be used.


To me it's more like, imagine that you have read a lot of books throughout your life, but then someone comes in and asks a question from you and you try to answer from memory, but you get beaten when you say something like "I don't know", and you get rewarded if you answer accurately. You do get beaten if you answer inaccurately, but eventually you learn that if you just say something, you might just be accurate and you will not get beaten. So you just always learn to answer to the best of your knowledge, while never saying that you specifically don't know, because it decreases chances of getting beat up. You are not intentionally lying, you are just hoping that whatever you say is accurate to the best you can do according to the neural connections you've built up in your brain.

Like you ask me for a birthdate of some obscure political figure from history? I'm going to try to feel out what period in history the name might feel like to me and just make my best guess based on that, then say some random year and a birthdate. It just has the lowest odds of being beaten. Was I hallucinating? No, I was just trying to not get beaten.


> When they get things right it’s because they are lucky.

This is transparently wrong. It gets so many things right in a response that the few things it gets wrong are tremendously frustrating. I think people underestimate how much correct "knowledge about the world" is expressed in a typical chat gpt response and focus only on the parts that are incorrect.

If it were wrong about _everything_ at rates no better than chance, we wouldn't even be having this conversation because nobody would be using them.


To offer a satirical analogy: "Lastly, I want to reassure investors and members of the press that we take these concerns very seriously: Hindenburg 2 will contain only normal and unreactive hydrogen gas, and not the rare and unusual explosive kind, which is merely a temporary hurdle in this highly dynamic and growing field."

Edit: In retrospect, perhaps a better analogy would involve gasoline, as its explosive nature is what's actively being exploited in normal use.


Yes (to the edit), an analogy with making planes safer by only using non-flammable fuels is perfect.


I expect most people have already filled in the blanks, but for completeness:

"Lastly, I want to reassure investors and members of the press that we take these concerns very seriously: The Ford Pinto-II will only contain only normal and stable gasoline, and not the rare and unusual burning kind, which is merely a temporary hurdle in this highly dynamic and explos--er--fast growing field."


I don't really immediately link "Hallucinations" with "Unsound mind" - most people I know have experienced auditory hallucinations - often things like not sure if the doorbell went off, or if someone said their name.

And I couldn't find a single one of my friends who hadn't experienced "phantom vibration syndrome".

Both I'd say are "Hallucinations", without any real negative connotation.


Hallucinations implies that they do behave like a human mind. Why else would you use the word if you were not trying to draw this parallel?


We're stuck with metaphors for human behavior because the way LLMs operate is so alien and counterintuitive, yet similar enough to human behavior, that we haven't yet developed suitable language to describe it. "Hallucination" gets the point across in general terms, at least.


"Sound" minds for humans is graded on a curve, and this trick is not acknowledged, or popular.


How about "dream-reality confusion (DRC)" ?


There is no dream-reality separation in an LLM, or really any conception of dreams or reality, so I don't think the term makes sense. Hallucination works fine to describe the phenomenon. LLMs work by coalescing textual information. LLM hallucinations occur due to faulty or inappropriate coalescence of information, which is similar to what occurs with actual hallucinations.


"Incoherence" seems like a far more natural fit for what you're describing than a human with a sensory delusion or psychosis.


Bullshit is the most descriptive one.

LLMs don't do it because they are out of their right mind. They do it because every single answer they say is invented caring only about form, and not correctness.

But yeah, that ship has already sailed.


The problem with "incoherent output" is that it isn't describing the phenomenon at all. There have been cases where LLM output has been incoherent, but modern LLM hallucinations are usually coherent and well-constructed, just completely fabricated.


Do you have an example? Your statement seems trivially contradictory: how do you know it's fabricated without incoherence showing you this? Isn't fabrication the entire point of generative AI?


It seems that your insistence upon using incoherence is based upon a misunderstanding of the word. Coherence does not mean to be factual, but to be logically ordered. Also, the word "fabricated" is generally used colloquially to describe made up information without factual basis.


Why is "incoherent output" better? When an LLM hallucinates, it coherently and confidently lies to you. I think "hallucination" is the perfect word for this.


Hallucination is one single word. Even if it's not perfect it's great as a term. It's easy to remember and people new to the term already have an idea of what it entails. And the term will bend to cover what we take it to mean anyway. Language is flexible. Hallucination in an LLM context doesn't have to be exactly the same as in a human context. All that matters is that we're aligned on what we're talking about. It's already achieved this purpose.


Hallucination perfectly describes the phenomenon.


On a literal level, hallucinations are perceptual.

But "hallucination" was already (before LLMs) being used in a figurative sense, i.e. for abstract ideas that are made up out of nothing. The same is also true of other words that were originally visual, like "illusion" and "mirage".


It’s used incorrectly. Hallucination has (or used to have) a very specific meaning in machine learning. All hallucinations are errors but not all errors are hallucinations.


I think calling it hallucination is because of our tendency to anthropomorphize things.

Humans hallucinate. Programs have bugs.


The point is that this isn't a bug.

It's inherent to how LLMs work and is expected although undesired behaviour.


The article suggests a useful line of research: train an LLM to detect logical fallacies and then see if that can be bootstrapped into something useful, because it's pretty clear that all the issues with LLMs stem from the lack of logical capabilities. If an LLM were capable of logical reasoning, then it would be obvious when it was generating made-up nonsense instead of referencing existing sources of consistent information.


> If an LLM was capable of logical reasoning

The prompt interfaces and smartphone apps were, from the beginning, and still are, ongoing training for the next iteration; they provide massive RLHF for further improvement of already quite heavily RLHFed models.

Whatever tokens they're extracting from all the interactions, the most valuable are those from metadata, like "correct answer in one shot", or "correct answer in three shots".

The inputs and potentially the outputs can be gibberish, but the metadata can be mostly accurate given some implicit/explicit human feedback (the thumbs up, the "thanks" replies from users, maybe).

The RLHF refinement extracted from having the models face the entire human population, prompted continuously, 24x7x365, in all languages, about all the topics of interest to human society, must be incredible. If you can extract even a small percentage of definitely "correct answers" from the total prompts answered, it should be massive compared to the few thousand dedicated QA/RLHF workers used in the initial training iterations.

That was GPT-2/3/4, the initial iterations of training. Now that the models have evolved into more powerful (mathematical) entities, you can use them to train the next models, which is almost certainly happening.

My bet is on one of two scenarios:

- The scaling thing is working spectacularly: they've seen linear improvement in blue/green deployments across the world plus real-time RLHF. Maybe it is going a bit slowly, but the improvements justify waiting a bit longer for a more powerful, refined model, with far better answers even from the previous datasets (now more deeply mined by the new models and the new massive RLHF data). If in a year they have a 20x GPT-4, Claude, Gemini, whatever, they could be "jumping" to the next 40x version a lot faster, since they'd have the most popular, most prompted model in the market (in the world).

- The scaling stuff has already sunk: they have seen the numbers and it doesn't add up, or they've seen diminishing returns coming. This is being firmly denied by everyone, on the record and off the record.


I think we should start smaller and make them able to count first.


Yeah, you can train an LLM to recognize the vocabulary and grammatical features of logical fallacies... Except the nature of fallacies is that they look real on that same linguistic level, so those features aren't distinctive for that purpose.

Heck, I think detecting sarcasm would be an easier goal, and still tricky.


> Except the nature of fallacies is that they look real on that same linguistic level, so those features aren't distinctive for that purpose

Well that's actually good news. With a large enough labelled dataset of actually-sound and fallacious text with similar grammatical features you should be able to train a discriminator to distinguish between them using some other metric. Good luck with getting that data set though.
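
As a deliberately crude baseline (toy examples only; TF-IDF plus logistic regression is a placeholder, not a claim that surface features would actually separate the classes):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  # Toy placeholder examples; assembling this corpus at scale is the hard part.
  texts = [
      "All men are mortal. Socrates is a man. Therefore Socrates is mortal.",
      "My neighbour got rich after buying this charm, so the charm causes wealth.",
      "If it rains the ground gets wet. It rained. Therefore the ground is wet.",
      "Everyone I know likes this policy, therefore it must be correct.",
  ]
  labels = [0, 1, 0, 1]  # 0 = sound, 1 = fallacious

  clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
  clf.fit(texts, labels)
  print(clf.predict(["Millions of people believe it, so it has to be true."]))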


The Entscheidungsproblem tends to rear its ugly head.

Remember NP is equivalent to second-order logic with existential quantification over relations (Fagin's theorem), e.g. there exists a relation Y such that for all x some first-order condition holds.

And that only gets you to truthy Trues, co-NP is another problem.

ATP (automated theorem proving) is hard, and while we get lucky with some constrained problems like type inference, which is pathological in its runtime but decidable, Presburger arithmetic is about the highest form we know is decidable.

It is a large reason CS uses science and falsification vs proofs.

Gödel and the difference between semantic and syntactic completeness is another rat hole.


> you should be able to train a discriminator to distinguish between them using some other metric

Not when the better metrics are likely alien/incompatible to the discriminator's core algorithm!

Then it's rather inconvenient news, because it means you have to develop something separate and novel.

As the other poster already mentioned, if we can't even get them to reliably count how many objects are being referred to, how do you expect them to also handle logical syllogisms?


My biggest problem with them is that I can't quite get it to behave like I want it to. I built myself a "therapy/coaching" telegram bot (I'm healthy, but like to reflect a lot, no worries). I even built a self-reflecting memory component that generates insights (sometimes spot on, sometimes random af). But the more I use it, the more I notice that neither the memory nor the prompt matters much. I just can't get it to behave like a therapist would. So in other words: I can't find the inputs to achieve a desirable prediction from the SOTA LLMs. And I think that's a bigger problem for them not to be a shallow hype.


>I just can't get it to behave like a therapist would

  import time
  import random

  SESSION_DURATION = 50 * 60
  start_time = time.time()
    
  while True:
    current_time = time.time()
    elapsed_time = current_time - start_time
    
    if elapsed_time >= SESSION_DURATION:
        print("Our time is up. That will be $150. See you next week!")
        break
    
    _ = input("")
    print(random.choice(["Mmm hmm", "Tell me more", "How does that make you feel?"]))
    
    time.sleep(1)  

Thank me later!


haha, good one! although I'm German and it was free for me when I did it. I just had the best therapist. $150 a session is insane!


> One could spin this idea even further and train several models with radically different world views by curating different training corpi that represent different sets of beliefs / world views.

You can get good results by combining different models in chat, or even the same model with different parameters. The model usually gives up on hallucinations when challenged. Sometimes it pushes back and provides an explanation with sources.

I have a script that puts models into dialog, moderates discussion and takes notes. I run this stuff overnight, so getting multiple choices speeds up iteration.
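
The core of that script is quite small. A minimal sketch, with a hypothetical complete(model, messages) call standing in for whatever chat API each model exposes:

  def complete(model: str, messages: list[dict]) -> str:
      """Hypothetical chat-completion call for a given model."""
      raise NotImplementedError

  def debate(question: str, model_a: str, model_b: str, rounds: int = 3) -> list[str]:
      # Keep a shared transcript that both models see on every turn.
      transcript = [f"Question: {question}"]
      for _ in range(rounds):
          for model in (model_a, model_b):
              reply = complete(model, [
                  {"role": "system",
                   "content": "Challenge any claim that lacks a source; concede when convinced."},
                  {"role": "user", "content": "\n".join(transcript)},
              ])
              transcript.append(f"{model}: {reply}")
      # A final 'moderator' call could summarise points of agreement and disagreement here.
      return transcript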


In my mind LLMs are already fatally compromised. Proximity matching via vector embeddings that offer no guarantees of completeness or correctness has already surrendered the essential advantage of technological advances.

Imagine a dictionary where the words are only mostly in alphabetical order. If you look up a word and don't find it, you can't be certain it's not in there. It's as useful as asking someone else, or several other people, but its value as a reference is zero, and there's no shortage of other people on the planet.


> Proximity matching via vector embeddings that offer no guarantees of completeness or correctness have already surrendered the essential advantage of technological advances.

On the contrary, it's arguably the breakthrough that allowed us to model concepts and meaning in computers. A sufficiently high-dimensional embedding space can model arbitrary relationships between embedded entities, which allows each of them to be defined in terms of its associations to all the others. This is more or less how we define concepts too, if you dig down into it.

> Imagine a dictionary where the words are only mostly in alphabetical order. If you look up a word and don't find it, you can't be certain it's not in there.

It's already the case with dictionaries. Dictionaries have mistakes, words out of order; they get outdated, and most importantly, they're descriptive. If a word isn't in it, or isn't defined in particular way, you cannot be certain it doesn't exist or doesn't mean anything other than the dictionary says it does.

> It's as useful as asking someone else, or several other people

Which is very useful, because it saves you the hassle of dealing with other people. Especially when it's as useful as asking an expert, which saves you the effort of finding one. Now scale that up to being able to ask about whole topics of interest, instead of single words.

> it's value as a reference is zero

Obviously. So is the value of asking even an expert for an immediate, snap answer, and going with that.

> and there's no shortage of other people on the planet

Again, dealing with people is stupidly expensive in time, energy and effort, starting with having to find the right people. LLM is just a function call away.


Technology advances by supplanting human mechanisms, not by amplifying or cheapening them. A loom isn't a more nimble hand, it's a different mechanical approach to weaving. Wheels and roads aren't better legs, they're different conveyances. LLMs as a replacement for dealing with people but offering only the same certainty aren't an advance.

LLMs do math by trying to match an answer to a prompt. Mathematica does better than that.


Wheels and roads do the same thing as legs in several major use cases, only they do it better. Same with jet engines and flapping wings. Same with loom vs. hand, and same with LLMs vs. people.

> LLMs do math by trying to match an answer to a prompt. Mathematica does better than that.

Category error. Pencil and paper or theorem prover are better at doing complex math than snap judgment of an expert, but an expert using those tools according to their judgement is the best. LLMs compete with snap judgement, not heavily algorithmic tasks.

Still, it's a somewhat pointless discussion, because the premise behind your argument is that LLMs aren't a big breakthrough, which is in disagreement with facts obvious to anyone who hasn't been living under a rock for the past year.


I'm playing around with LangChain and LangGraph (https://www.langchain.com/) and it seems like these enable just the sort of mechanisms mentioned.


Does anyone really believe that having a good corpus will remove hallucinations?

Is this article even written by a person? Hard to know; they have a real blog with real articles, but stuff like this reads strangely. Maybe it's just not a native English speaker?

> Hallucinations are certainly the toughest nut to crack and their negative impact is basically only slightly lessened by good confidence estimates and reliable citations (sources).

> The impact of contradictions in the training data.

(was this a prompt header you forgot to remove?)

> LLM are incapable of "self-inspection" on their training data to find logical inconsistencies in it but in the input context window they should be able to find logical inconsistencies.

Annnnyway...

Hallucinations cannot be fixed by a good corpus in a non-deterministic (ie. temp > 0) LLM system where you've introduced a random factor.

Period. QED. If you think it can, do more reading.

The idea that a good corpus can significantly improve the error rate is an open question, but the research I've seen tends to fall on the side of "to some degree, but curating a 'perfect' dataset like that, of a sufficiently large size, is basically impossible".

So, it's a pipe dream.

Yes, if you could have a perfect corpus, absolutely, you would get a better model.

...but how do you plan to get that perfect corpus of training data?

If it was that easy, the people spending millions and millions of dollars making LLMs would have, I guess, probably come up with a solution for it. They're not stupid. If you could easily do it, it would already have been done.

my $0.02:

This is a dead end of research, because it's impossible.

Using LLMs which are finetuned to evaluate the output of other LLMs, and using multi-sample / voting to reduce the incidence of hallucinations that make it past the API barrier, is both actively used and far, far more effective.

(ie. it doesn't matter if your LLM hallucinates 1 time in 10; if you can reliably detect that 1 instance, sample again, and return a non-hallucination).
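
A minimal sketch of that loop, assuming hypothetical generate() and looks_hallucinated() calls (the latter standing in for a fine-tuned judge model or a voting scheme):

  def generate(prompt: str) -> str:
      """Hypothetical sampling call to the primary LLM."""
      raise NotImplementedError

  def looks_hallucinated(prompt: str, answer: str) -> bool:
      """Hypothetical verifier: a fine-tuned judge model or a majority vote."""
      raise NotImplementedError

  def answer(prompt: str, max_attempts: int = 5) -> str | None:
      # If the model hallucinates 1 time in 10 and the verifier catches it,
      # resampling drives the rate of hallucinations that escape much lower.
      for _ in range(max_attempts):
          candidate = generate(prompt)
          if not looks_hallucinated(prompt, candidate):
              return candidate
      return None  # give up rather than return a flagged answer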

Other solutions... I'm skeptical; most of the ones I've seen haven't worked when you actually try to use them.


I've seen such articles more and more recently. In the past, when people had a vague idea, they had to do research before writing. During this process, they often realized some flaws and thoroughly revised the idea or gave up writing. Nowadays, research can be bypassed with the help of eloquent LLMs, allowing any vague idea to turn into a write-up.


Thank you. It seems largely ignored that LLMs still sample from a set of tokens based on estimated probability and the given temperature - but not on factuality or the "confidence estimate" described in the article. RAG etc. only move the estimated probabilities in a more factually grounded direction, but do not change the sampling itself.
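
To make that concrete, a minimal sketch of what temperature sampling actually does: it reshapes the model's token distribution, and nothing in the procedure knows whether a token is factual.

  import numpy as np

  def sample_next_token(logits: np.ndarray, temperature: float = 0.8) -> int:
      # Softmax over temperature-scaled logits: lower temperature sharpens the
      # distribution, higher temperature flattens it. Factuality never enters.
      scaled = logits / temperature
      probs = np.exp(scaled - scaled.max())
      probs /= probs.sum()
      return int(np.random.choice(len(probs), p=probs))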


I wonder to what extent is hallucination a result of a "must answer" bias?

When sampling data all over the internet, your data set only represents people who did write, did respond to questions - with no representation of what they didn't. Add to that confidently wrong people - people who respond to questions on, say, StackOverflow, even if they're wrong - and suddenly you have a data set that prefers replying with bullshit, because there's no data from the people who didn't know the answer and wrote nothing.

Inherently there's no representation in the datasets of "I don't know" null values.

LLMs are forced to reply, in contrast, so they "bullshit" a response that sounds right even though not answering or saying you don't know would be more appropriate - because no-one does that on the internet.

I always assumed this was a big factor, but am I completely off the mark?


I wrote up this blog post in 30 mins, that's why it reads a little rough. I could not find explicit research on the impact of contradicting training data, only on the general need for high-quality training data.

Maybe it is a pipe dream to drastically improve on hallucinations by curating a self-consistent data set, but I am still interested in how much it actually impacts the quality of the final model.

I described one possible way to create such a self-consistent data set in this very blog post.


It's obvious that you can't solve hallucinations by curating the dataset when you think about arithmetic.

It's trivial to create a corpus of True Maths Facts and verify that they're correct. But an LLM (as they're currently structured) will never generalise to new mathematical problems with 100% success rate because they do not fundamentally work like that.


As I understand it, the Phi models are trained on much more selective training data; the TinyStories research was one of the starts of that. They used GPT-4 to make stories, encyclopedia-like training data, and code for Phi to learn from, which probably helps with logical structuring too. I think they did add in some real web data as well, but it was fairly selective.

Maybe something between Cyc and Google's math and geometry LLM's could help.


We know high-quality data can help, as evidenced by the Phi models. However, this alone can never eliminate hallucination because data can never be both consistent and complete. Moreover, hallucination is an inherent flaw of intelligence in general if we think of intelligence as (lossy) compression.


I do feel like we've reached a local maximum with the current state of LLMs, and researchers need to find something completely different to hit a new maximum (whether that is the global maximum or not, we'll know when we hail our new AI overlords).


I'm surprised he didn't mention the way we are solving the issue at Amazon. It's not a secret at this point: giving LLMs hands, i.e. agentic systems that run code or do things that get feedback in a loop, dramatically reduces hallucinations.
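
A minimal sketch of that kind of loop (hypothetical generate() call; here the grounding signal is simply whether the generated code runs):

  import subprocess
  import tempfile

  def generate(prompt: str) -> str:
      """Hypothetical LLM call returning Python source code."""
      raise NotImplementedError

  def write_code_with_feedback(task: str, max_iterations: int = 4) -> str | None:
      prompt = f"Write a Python script that does the following:\n{task}"
      for _ in range(max_iterations):
          code = generate(prompt)
          with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
              f.write(code)
              path = f.name
          # A real version would sandbox this and handle timeouts properly.
          result = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
          if result.returncode == 0:
              return code  # ran cleanly; a test suite would be an even stronger check
          # Feed the error back so the next attempt can correct itself.
          prompt = f"{prompt}\n\nYour previous attempt failed with:\n{result.stderr}\nFix it."
      return None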


The thing is we probably can't build AGI: https://www.lycee.ai/blog/why-no-agi-openai


Despite its title, this article merely seems to argue that LLMs will not themselves scale into AGI.


This is almost a year old, thoughts on it today?


LLMs still do not reason or plan. And nothing in their architecture, training, post-training points toward real reasoning as scaling continues. Thinking does not happen one token at a time.


I don't get why some people seem to think the only way to use an LLM is for next-token prediction, or that AGI has to be built using LLMs alone.

You want planning? You can do Monte Carlo tree search and use an LLM to evaluate which node to explore next. You want verifiable reasoning? You can ask it to generate code (an approach used by a recent AI olympiad winner and many previous papers).

What is even "planning" - finding desirable/optimal solutions to some constraint satisfaction problems? Is the LLM-based Minecraft bot Voyager not doing some kind of planning?

LLMs have their limitations. Then augment them with external data sources and code interpreters, and give them ways to interact with a real-world or simulated environment.
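
A minimal sketch of the "LLM inside a search loop" idea (hypothetical propose_actions() and score_state() calls; real MCTS adds rollouts and visit counts, this is just greedy best-first search to show the division of labour):

  import heapq

  def propose_actions(state: str) -> list[str]:
      """Hypothetical LLM call: suggest candidate next actions from this state."""
      raise NotImplementedError

  def score_state(state: str, goal: str) -> float:
      """Hypothetical LLM call: how promising is this state for reaching the goal?"""
      raise NotImplementedError

  def plan(start: str, goal: str, budget: int = 50) -> list[str] | None:
      # Best-first search where the LLM proposes and scores; the search loop,
      # not the LLM, supplies the planning structure.
      frontier = [(-score_state(start, goal), start, [])]
      for _ in range(budget):
          if not frontier:
              break
          _, state, path = heapq.heappop(frontier)
          if goal in state:  # crude goal test for illustration
              return path
          for action in propose_actions(state):
              next_state = f"{state}\n{action}"
              heapq.heappush(frontier,
                             (-score_state(next_state, goal), next_state, path + [action]))
      return None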


The problem is that every time you ask the LLM to evaluate what to do next, it will return a wrong answer X% of the time. Multiply that X across the number of steps and you have a system that is effectively useless. X today is ~5%, so a 20-step chain comes out right only about 0.95^20 ≈ 36% of the time.

I do think LLMs could be used to assist in building a world model that could be a foundation for an AGI/agent system. But it won't be the major part.


absolutely


As the other reply has said, the article points to limitations of LLMs, but that doesn't preclude synthesizing a system of multiple components that uses LLMs. To the extent that I'm bearish on AI capabilities, I'll note that program synthesis / compression / general inductive reasoning which we expect intelligent agents to do is a computationally very hard problem.


I wish he went into how to improve confidence scores, though I guess training on better data to begin with should improve results and thus confidence.


Q: is hallucination a milestone towards consciousness?

Given how inevitable it is, it seems to me that it might be.


There has been steady improvement since the release of chat gpt into the wild, which is still only less than two years ago (easy to forget). I've been getting a lot of value out of chat gpt 4o, like lots of other people. I find with each model generation my dependence on this stuff for day to day work goes up as the soundness of its answers and reasoning improve.

There are still lots of issues and limitations but it's a very different experience than with gpt 3 early on. A lot of the smaller OSS models are a bit of a mixed bag in terms of hallucinations and utility. But they can be useful if you apply some skills. Half the success is actually learning to prompt these things and learning to spot when it starts to hallucinate.

One thing I find useful is to run ideas by it in kind of a socratic mode where I try to get it to flesh out brain farts I have for algorithms or other kinds of things. This can be coding related topics but also non technical kinds of things. It will get some things wrong and when you spot it, you can often get a better answer simply by pointing it out and maybe nudging it in a different direction. A useful trick with code is to also let it generate tests for its own code. When the tests fail to run, you can ask it to fix it. Or you can ask it for some alternative implementation of the same thing. Often you get something that is 95% close to what you asked for and then you can just do the remaining few percent yourself.

Doing TDD with an LLM is a power move. Good tests are easy enough to understand and once they pass, it's hard to argue with the results. And you can just ask it to identify edge cases and add more tests for those. LLMs take a lot of the tediousness out of writing tests. I'm a big picture kind of guy and my weakness is skipping unit tests to fast forward to having working code. Spelling out all the stupid little assertions is mindnumbingly stupid work that I don't have to bother with anymore. I just let AI generate good test cases. LLMs make TDD a lot less tedious. It's like having a really diligent junior pair programmer doing all the easy bits.

And if you apply SOLID principles to your own code (which is a good thing in any case), a lot of code is self-contained enough that you can easily fit it in a small file that fits into the context window of ChatGPT (which is quite large these days). So, a thing I often do is just gather relevant code, copy-paste it, and then tell it to make some reasonable assumptions about missing things and make some modifications to the code. Add a function that does X; how would I need to modify this code to address Y; etc. I also get it to iterate on its own code. And a neat trick is to ask it to compare its solution to other solutions out there and then get it to apply some of the same principles and optimizations.

One thing with RAG is that we're still underutilizing LLMs for this. It's a lot easier to get an LLM to ask good questions than it is to get it to provide the right answers. With RAG, you can use good old information retrieval to answer the questions. IMHO limiting RAG to just vector search is a big mistake. It actually doesn't work that well for structured data, and you could just ask the model to query some API based on a specification, or use some SQL, XPath, or whatever query language. And why ask just 1 question? Maybe engage in a dialog where it zooms in on the solution via querying and iteratively coming up with better questions until the context has all the data needed to come up with the answer.
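
A minimal sketch of the "let the model write the query" variant (hypothetical generate() call; sqlite3 stands in for whatever structured store sits behind the API):

  import sqlite3

  def generate(prompt: str) -> str:
      """Hypothetical LLM call."""
      raise NotImplementedError

  def answer_from_database(question: str, db_path: str, schema: str) -> str:
      # Ask the model for a query over the known schema instead of doing similarity search.
      sql = generate(f"Schema:\n{schema}\n\nWrite one SQLite SELECT statement that "
                     f"retrieves the data needed to answer: {question}")
      # A real version would validate and sandbox the generated SQL before running it.
      with sqlite3.connect(db_path) as conn:
          rows = conn.execute(sql).fetchall()
      # Then answer strictly from the retrieved rows.
      return generate(f"Question: {question}\n\nQuery results: {rows}\n\n"
                      "Answer using only these results.")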

If you think about it, this is how most knowledge workers address problems themselves. They are not oracles of wisdom that know everything but merely aggregators and filters of external knowledge. A good knowledge worker / researcher / engineer is one that knows how to ask the right questions in order to come up with an iterative process that converges on a solution.

Once you stop using LLMs as one shot oracles that give you an answer given a question, they become a lot more useful.

As for AGI, a human AI enhanced by AGI is a powerful combination. I kind of like the vision behind neuralink where the core idea is basically improving the bandwidth between our brains and external tools and intelligence. Using a chat bot is a low bandwidth kind of thing. I actually find it tedious.


This is very close to my use case with Claude 3.5, and I used to only write tests when I was forced to, now it is part of the routine to double check everything while improving the codebase. I also really enjoy the socratic discussions when thinking about new ideas. What it says is mostly generic Wikipedia quality but this is useful when I am exploring domains where I have knowledge gaps.


Plausible idea which needs a big training budget. Was it funded?


I came here thinking I will learn how to make LLMs better. But leaving with more complicated questions:

1. Do I want LLMs to be trained with licensed data, that's arguably well curated. Or, do I want LLM to scrape the web because it is more democratic in opinions?

2. If hallucination is not about training data but how LLM uses that data to extrapolate info that's not directly present in training data - can we teach it this skill to make better choices?

3. It's easy to define good data for facts. How to define good data for subjective topics?

4. For subjective topics, is it better to have separate LLMs trained with each theme of opinions or one big LLM with a mix of all opinions?

5. Is using LLM to improve its own training data truly helpful as the author claims? If yes - is this recursion method better or it's better to use multiple LLMs together?

Dang! If I interview for a position that requires knowledge of AI - every question they ask will be answered with more questions. smh!



