> Human brains somehow are able to learn from all forms of sensory input and world interaction in ways that pay off in word tests. (Even then, would the information processed by an average human match the GPT-3 training corpus?)
Yes, humans absolutely learn across modalities. I haven't seen much work attempting this in neural networks, but cross-modal prediction is known to work well.
> Even then, would the information processed by an average human match the GPT-3 training corpus?
I'd be surprised if it didn't. The visual cortex processes around 8.75 megabits per second[1]. Assuming the eyes are open around 16 hours a day, that's about 63 GB/day of information from the eyes alone.
Assuming 500 billion words in the GPT-3 training set, 5 characters per word on average, and 1-byte characters, that's a 2.5 TB training set, or roughly 40 days of visual input for a human.
Now the two aren't directly comparable, but humans gather a lot of information.
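The back-of-envelope arithmetic above can be checked directly; the bitrate, waking hours, word count, and bytes-per-word figures are the rough assumptions from the text, not measured values:

```python
# Back-of-envelope comparison: human visual input vs. a GPT-3-scale corpus.
VISUAL_BITRATE_BPS = 8.75e6       # visual cortex throughput, ~8.75 Mbit/s [1]
WAKING_SECONDS = 16 * 3600        # eyes open ~16 hours/day

bytes_per_day = VISUAL_BITRATE_BPS * WAKING_SECONDS / 8
print(f"visual input: {bytes_per_day / 1e9:.0f} GB/day")        # ~63 GB/day

WORDS = 500e9                     # assumed ~500 billion words in the corpus
BYTES_PER_WORD = 5                # ~5 one-byte characters per word

corpus_bytes = WORDS * BYTES_PER_WORD
print(f"corpus: {corpus_bytes / 1e12:.1f} TB")                  # ~2.5 TB
print(f"days of viewing: {corpus_bytes / bytes_per_day:.0f}")   # ~40 days
```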
If this explanation is true, it suggests a very important experiment: develop AI models that train on video plus (much less than 500 billion words of) text, then evaluate them on the same test problems GPT-3 is evaluated on.
It's not entirely obvious how to do this, though. There needs to be a common training representation, and that's pretty hard.
In video, the sequence of frames is important, but each frame is an image; in text, the sequence of characters is important. Maybe something that accepts sequences of bytes, where a span of bytes may be an image or may be a single character, could work. But a unified representation of images and words for training is probably the first step toward that.
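A toy sketch of this "unified byte sequence" idea: represent both text and image frames as one stream of byte tokens, with special tokens marking modality boundaries. The token values and function names here are illustrative assumptions, not any existing model's scheme:

```python
# Hypothetical unified byte-token representation for mixed text/video data.
TEXT_START = 256    # special tokens sit above the 0-255 byte range
IMAGE_START = 257

def encode_text(s: str) -> list[int]:
    """UTF-8 bytes, prefixed with a text-modality marker."""
    return [TEXT_START] + list(s.encode("utf-8"))

def encode_image(pixels: bytes) -> list[int]:
    """Raw pixel bytes (e.g. one downsampled grayscale frame)."""
    return [IMAGE_START] + list(pixels)

def encode_video(frames: list[bytes]) -> list[int]:
    """A video is just its frames, concatenated in temporal order."""
    seq: list[int] = []
    for frame in frames:
        seq += encode_image(frame)
    return seq

# Mixed-modality example: a caption followed by two tiny two-pixel "frames".
example = encode_text("hi") + encode_video([bytes([0, 255]), bytes([128, 64])])
print(example)  # [256, 104, 105, 257, 0, 255, 257, 128, 64]
```

A sequence model with a 258-entry vocabulary could then train on text, video, or interleaved examples without changing its input format.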
Based on the current state of the art and the research problems that still need solving, it's probably three years before we're in a position to contemplate the kind of training suggested here.
Another very important aspect is that humans interact with the world they are observing. It isn’t just passive processing of data. Training a model on video may help over text alone, but the interactivity is still missing.
[1] https://www.newscientist.com/article/dn9633-calculating-the-...