• by tc4v on 8/25/2024, 11:00:16 AM

    This seems unlikely because LLMs don't produce high-quality code; they produce average code. So they don't contribute to a better dataset, they contribute to a narrower dataset clustered around the average. LLMs tend to self-poison, not self-improve. There is a good chance this has already started because of the huge amount of ChatGPT code that has been put on GitHub since late 2022. Maybe it can be avoided if the LLM authors use some quality filter to discard 80% of the dataset.
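
    A minimal sketch of that kind of quality filter, assuming "quality" can be approximated by a cheap heuristic (the scoring function here is illustrative, not any vendor's actual filter):

        import ast

        def quality_score(sample: str) -> float:
            """Toy heuristic: reward samples that parse as valid Python and
            have a reasonable length. A real filter might combine lint scores,
            test pass rates, or perplexity under a reference model."""
            try:
                ast.parse(sample)
            except SyntaxError:
                return 0.0
            lines = sample.count("\n") + 1
            return 1.0 if 3 <= lines <= 200 else 0.5

        def filter_top_fraction(samples: list[str], keep: float = 0.2) -> list[str]:
            """Keep only the highest-scoring fraction of the corpus (e.g. the top 20%)."""
            scored = sorted(samples, key=quality_score, reverse=True)
            cutoff = max(1, int(len(scored) * keep))
            return scored[:cutoff]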

  • by sitkack on 8/24/2024, 6:09:35 PM

    They don't need that much data.

    They operate in a higher dimensional space.

    You can fine-tune a model trained on JS/Python and teach it Lua with little trouble. If you have a proper Rosetta stone mapping your language to one that is well represented in the training corpus, it isn't an issue.
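
    A minimal sketch of that "Rosetta" idea, assuming a hypothetical parallel corpus of Python/Lua pairs and a generic prompt/completion JSONL format (the field names are not tied to any particular training stack):

        import json

        # Hypothetical parallel corpus: the same routine written in a
        # well-represented language (Python) and in the target language (Lua).
        PAIRS = [
            (
                "def add(a, b):\n    return a + b",
                "local function add(a, b)\n    return a + b\nend",
            ),
        ]

        # Emit prompt/completion records for supervised fine-tuning.
        with open("lua_finetune.jsonl", "w") as f:
            for py_src, lua_src in PAIRS:
                record = {
                    "prompt": "Translate to Lua:\n" + py_src + "\n",
                    "completion": lua_src,
                }
                f.write(json.dumps(record) + "\n")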

  • by VirusNewbie on 8/24/2024, 9:48:53 PM

    I was wondering if you could go the other way: could the statistical knowledge of what most people want when they type XYZ be used to design more powerful languages that are even less verbose?

    I don't really know but I hope someone answers this question!

  • by mikewarot on 8/24/2024, 6:53:04 PM

    I've been avoiding trying out CoPilot with Pascal code because I believe this to be true.

    Perhaps it's time to challenge that assumption.