Does anyone know the correlation between our ability to parse PDFs and the quality of our LLMs' training datasets?

If a lot of scientific papers have been PDFs that hitherto converted badly to text/tokens, can we expect better parsing to bring major gains in training and therefore better outputs?


