> Arguably the bigger problem is that many of those datasets e.g. WSJ articles a... | Hacker News


piva00 on July 18, 2024 | on: Overcoming the limits of current LLMs

> Arguably the bigger problem is that many of those datasets e.g. WSJ articles are proprietary and can be exclusively licensed like we've seen recently with OpenAI.

> So we end up with in a situation where competition is simply not possible.

Exactly, and Technofeudalism advances a little more into a new feud.

OpenAI is trying to create its moat by shoring up training data, probably attempting to keep competitors from training on the same datasets it has been licencing, at least for a while.

Training data is the only possible moat for LLMs. Models seem to be advancing quite well across different companies, but as mentioned here, a tidy training dataset is the actual gold.

gessha on July 18, 2024

It's landmines no matter how you approach the problem.

If you treat the web as a free-for-all and scrape freely, you get sued by the content platforms for copyright or terms-of-service violations.

If you license the content, you let the highest bidder get the content.

No matter what happens, capital wins.
