With data augmentation, we're effectively injecting additional information about what sorts of transformations of the data the model should be insensitive to. The additional information comes from our (hopefully) well-informed human decisions about how to augment the data. By doing this, we can reduce the tendency for the model to pick up dependencies on patterns that are useful in the context of the (very small) training dataset, but which don't work well on new data that isn't in the training set.
It's less "generating more information" and more "presenting the same information in new ways". An ideal model wouldn't need augmented data, but augmentation is what works well with current architectures. It may be that practical constraints mean we never move away from augmentation, just as we'll never move back to single-hidden-layer networks, even though in theory one hidden layer is enough to approximate any continuous function.
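To make the "same information, new ways" point concrete, here is a minimal sketch in NumPy (the `augment` function and its specific transforms are hypothetical choices, not a standard recipe): each transform encodes an invariance we've decided the model should have, and one training example yields many augmented views.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly transformed copy of `image` (an H x W array).

    Each transform encodes a human-chosen invariance: a horizontally
    flipped or slightly shifted photo of a cat still shows a cat.
    """
    out = image
    if rng.random() < 0.5:        # horizontal flip: left/right invariance
        out = out[:, ::-1]
    shift = int(rng.integers(-2, 3))  # small translation: position invariance
    out = np.roll(out, shift, axis=1)
    return out

rng = np.random.default_rng(0)
image = np.arange(16).reshape(4, 4)

# One original sample becomes many distinct views, multiplying the
# effective size of a tiny training set without adding new information.
views = [augment(image, rng) for _ in range(8)]
```

Note that every view contains exactly the pixels of the original, just rearranged, which is the sense in which no new information is created.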