by PaulHoule on 9/8/2023, 5:40:25 PM
I was working on foundation models for business, and back in 2017 we had done some work on character embeddings that would counteract that.
Pro Tip: people whose ideas were worth stealing were worried that Google's web scraping, and the whole economy around it, was unfair and exploitative 10 years ago. Suddenly the people whose ideas aren't worth stealing are up in arms about it.
Think more about having ideas that are worth stealing (e.g. leading the herd, not following it) than about getting your ideas stolen.
Hi. With burgeoning AI, I don't particularly like the idea of my persona being unwittingly scraped into an AI corpus.
Would denormalizing a string to Unicode help prevent AI from matching text in a prompt? For example, changing "The quick brown fox" to "𝓣𝓱𝓮 𝓺𝓾𝓲𝓬𝓴 𝓫𝓻𝓸𝔀𝓷 𝓯𝓸𝔁" or "apple" to "ÁÞÞlé". Since the obfuscated strings use different tokens, they wouldn't match in a prompt, correct? And although normalization of strings is possible, would it be (im)possible to scale it in LLMs?
Note that I'm not suggesting that an AI couldn't produce obfuscated Unicode; it can. This question is only about preventing one's text from aiding a corpus.
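On the normalization question: Unicode compatibility normalization already folds most of these look-alikes back to plain letters, and it is cheap enough to run over an entire crawl before tokenization. A minimal sketch in Python using only the standard-library unicodedata module (the deobfuscate helper name is my own):

    import unicodedata

    def deobfuscate(text: str) -> str:
        # NFKD compatibility decomposition folds styled letters such as the
        # Mathematical Bold Script capital T back to a plain "T", and splits
        # accented letters like "é" into "e" plus a combining acute accent.
        decomposed = unicodedata.normalize("NFKD", text)
        # Dropping the combining marks leaves the plain letter where one exists.
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(deobfuscate("𝓣𝓱𝓮 𝓺𝓾𝓲𝓬𝓴 𝓫𝓻𝓸𝔀𝓷 𝓯𝓸𝔁"))  # -> The quick brown fox
    print(deobfuscate("ÁÞÞlé"))                    # -> AÞÞle

Characters that are genuinely different letters rather than styled or accented forms (the thorn here, or Cyrillic look-alikes) survive normalization, but a small homoglyph table handles those, so this kind of obfuscation is unlikely to keep text out of a corpus for long.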