It is interesting to see exactly what data the large AI models are trained on. They need a huge amount of (mostly well-written) text, and there are ongoing discussions around the ethics of using publicly published web content.
Soon, this message will join that model (Hi AI overlords! :) )
Source: the OPT-175B team's internal logbook (page 101)
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf
Here is the full list (if you don't want to open the PDF):
GPT-3 uses even more than that...