RLHF (reinforcement learning from human feedback) is to a large extent about resolving that ambiguity by simply polling people for their subjective judgement.
I've worked on an RLHF project for one of the larger model providers, and the instructions given to the reviewers were very clear: even when there was no objectively correct answer, they were still required to choose the best one. There were of course disagreements at the margins, but groups of people do tend to converge on the broad strokes.
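For concreteness, here is a minimal sketch of how those pairwise judgements are typically turned into a training signal: a reviewer picks the better of two responses, and a reward model is fit so the preferred response scores higher. This is the standard Bradley-Terry formulation, not any particular provider's pipeline, and `reward_model` and the argument names are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Bradley-Terry loss over a pairwise human preference.

    `reward_model(prompt, response)` is assumed to return a scalar reward;
    `chosen` is the response the reviewer preferred, `rejected` the other one.
    """
    r_chosen = reward_model(prompt, chosen)      # score for the preferred answer
    r_rejected = reward_model(prompt, rejected)  # score for the other answer
    # -log sigmoid(r_chosen - r_rejected) is minimized when the preferred
    # response consistently receives the higher reward.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The point of the pairwise setup is exactly the one above: reviewers are never asked "is this answer correct?", only "which of these two is better?", which is a question people can answer even when there is no objective ground truth.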