Does anyone know the correlation between our abilities to parse PDF's and the qu...

Does anyone know the correlation between our abilities to parse PDF's and the quality of our LLM's training datasets?

If a lot of scientific papers have been pdf's and hitherto had bad conversions to text/tokens, can we expect to see major gains in our training and therefore better outputs?