I agree that a perfectly consistent dataset won't completely stop statistical language models from hallucinating, but it will reduce it. I think it is established that data quality matters more than quantity. Bullshit in -> bullshit out, so a focus on data quality is good and needed IMO.

I am also saying LM output should cite sources and give confidence scores (which reflect how much the output is in or out of the training distribution).
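One cheap proxy for that is the model's own token probabilities: the further a completion sits from the training distribution, the lower its average log-probability tends to be. A minimal sketch, assuming a Hugging Face causal LM ("gpt2" here is only a stand-in):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def sequence_confidence(text: str) -> float:
        # Score a piece of generated text by its average per-token likelihood.
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)  # out.loss = mean negative log-likelihood
        # Geometric-mean token probability, between 0 and 1.
        return float(torch.exp(-out.loss))

    print(sequence_confidence("The capital of France is Paris."))

It's a crude signal (it conflates fluency with truth), but it's the kind of number that could be surfaced next to an answer.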



I think the problem is you need an extremely large quantity of data just to get the machine to work in the first place. So much so that there may not be enough to get it working on just "quality" data.


How would confidence scores work? Multiple passes, with a percentage attached to each statement according to how often it appears across the generated results? (Something like the sketch below.)

If so, building this could be quite complex depending on the domain. In the legal field, changing even a single word can have large consequences.
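For the simple case, the multiple-pass idea might look roughly like this; generate_answer is a hypothetical call into whichever model you're using:

    from collections import Counter

    def self_consistency(prompt, generate_answer, n=10):
        # Sample the same prompt n times (with temperature > 0) and attach a
        # score to each distinct answer based on how often it shows up.
        answers = [generate_answer(prompt) for _ in range(n)]
        counts = Counter(a.strip().lower() for a in answers)
        return {answer: count / n for answer, count in counts.items()}

    # e.g. {"paris": 0.9, "lyon": 0.1} -> 90% of samples agreed on "paris"

Deciding what counts as "the same statement" is exactly where this gets hard in a domain like law, as you say.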


What’s a non-statistical language model?

And I think looking to the training data for sources is a little silly - that’s the training data for intuitive language use, not true statements about the world. If you haven’t checked them out yet, two terms you’d love are “RAG” and “Manuel De Landa”
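For anyone unfamiliar, the basic shape of RAG is: retrieve relevant passages first, then have the model answer from those passages and cite them, rather than from whatever it absorbed during training. A toy sketch (naive keyword-overlap retrieval; llm is a hypothetical generate function):

    def retrieve(query, corpus, k=2):
        # corpus: {doc_id: text}. Rank documents by keyword overlap with the query.
        q = set(query.lower().split())
        scored = sorted(corpus.items(),
                        key=lambda kv: len(q & set(kv[1].lower().split())),
                        reverse=True)
        return scored[:k]

    def answer_with_sources(query, corpus, llm):
        # Prepend the retrieved passages and ask the model to cite their ids.
        passages = retrieve(query, corpus)
        context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
        prompt = (f"Answer using only the passages below and cite their ids.\n"
                  f"{context}\n\nQuestion: {query}\nAnswer:")
        return llm(prompt)

In practice the retriever is usually an embedding index rather than keyword overlap, but the point is the same: the sources come from the retrieval step, not from the training set.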



