• by joegibbs on 3/21/2024, 5:45:23 AM

    The datasets are so huge that I don't think there's any way to make sure everything in the training data is safe. GPT-3, for example, was trained on 45TB of data, and even using LLMs to classify all of that would be too expensive: GPT-3.5 Turbo is priced at $1/million tokens, and a token is roughly 4 bytes, so running it over GPT-3's own training data would cost somewhere in the tens of millions of dollars. You could use other methods, but they're less effective and stuff would still slip through - and having people review it all is even less feasible.
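
    A rough sketch of that back-of-envelope estimate, taking the figures above (45TB of raw data, ~4 bytes per token, $1 per million tokens) as the assumptions:

        # back-of-envelope: cost of classifying GPT-3's raw training data with an LLM
        dataset_bytes = 45e12            # ~45 TB of raw training data
        bytes_per_token = 4              # rough average for English text
        price_per_million_tokens = 1.0   # assumed $1 per million tokens

        tokens = dataset_bytes / bytes_per_token         # ~11 trillion tokens
        cost = tokens / 1e6 * price_per_million_tokens   # ~$11 million
        print(f"~{tokens:.2e} tokens, ~${cost:,.0f}")

    That lands at roughly $11 million, i.e. on the order of tens of millions once you account for output tokens and retries.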