The Google-level volume of data is already there, baked into large pre-trained language models like BERT, ELECTRA, etc. It has been shown that they need very little data to be fine-tuned to a specific dataset.
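Just to make the idea concrete, here's a minimal sketch of what such a fine-tuning run could look like with the Hugging Face Transformers library. The model name, the two-label setup, and the toy examples are placeholder assumptions for illustration, not a proposal for the actual pipeline:

```python
# Minimal sketch: fine-tuning a pre-trained BERT on a tiny labeled dataset.
# Assumes `transformers` and `torch` are installed; texts/labels are
# hypothetical stand-ins for user-contributed link data.
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

texts = ["example link title one", "example link title two"]  # hypothetical
labels = [0, 1]                                                # hypothetical

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Tokenize once up front; padding/truncation keeps batch shapes uniform.
encodings = tokenizer(texts, truncation=True, padding=True)

class TinyDataset(torch.utils.data.Dataset):
    """Wraps tokenized inputs + labels in the format Trainer expects."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=TinyDataset(encodings, labels),
)
trainer.train()
```

Even with only a few epochs over a small dataset, the pre-trained weights do most of the heavy lifting, which is the whole point.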
Honestly, if a handful of users with hundreds of links each contribute their data, we are in a good spot to begin with. =)