by sidkshatriya on 1/28/2025, 10:06:55 AM
In short, I think it is possible for DeepSeek to have achieved this amazing outcome without mass accessing 4o outputs. In any case, OpenAI probably monitors usage of its API strictly to make sure its proprietary sauce stays proprietary and that clients follow its T&Cs. The volume of data required at every training stage for a model like this probably could not fly under the radar.
But why do I think that DeepSeek did this themselves? I think DeepSeek's remarkable decision to release its full model in the open shows how confident and proud of the work they are. This model is by all accounts a good one. If large-scale distilling had happened (hundreds of thousands of training outputs taken from closed models, where this is expressly disallowed by those models' terms), I don't think DeepSeek would have been so open with its models and technical papers. They would have walled it off and just provided a chat interface. This genuinely seems like a labour of love. You can deeply inspect the model, and I'm sure OpenAI experts and others are all over it since the weights are freely available.
Another point: the field of deep learning has a vast literature by now and a lot of openly available supervised training datasets. There is enough training data available for both the pre-training and post-training stages if you can gather a large team. DeepSeek has a large team, so they would have been able to obtain enough training data legitimately.
Lastly, DeepSeek's technical papers describe how they bootstrapped this model. Training data for R1 was also synthetically generated by previous DeepSeek models (R1-Zero / V3 / both?). I can imagine that synthetic data generation with light human curation would have been entirely possible to do legitimately at scale.
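As a rough illustration of what such a pipeline can look like, here is a minimal sketch of rejection-sampling-style synthetic data generation: sample several candidate answers from an existing model, keep only those that pass an automatic check, and hand the survivors to light human review. The `generate_candidates` and `passes_check` functions are hypothetical stand-ins, not DeepSeek's actual tooling.

```python
import random

# Hypothetical stand-ins for a real generator model and an automatic verifier.
def generate_candidates(prompt: str, n: int = 8) -> list[str]:
    # In a real pipeline this would sample n completions from an existing model.
    return [f"candidate answer {i} for: {prompt}" for i in range(n)]

def passes_check(prompt: str, answer: str) -> bool:
    # In a real pipeline: exact match against a known result, unit tests for
    # code, a rule-based checker for math, etc. Randomized here for illustration.
    return random.random() < 0.3

prompts = ["prove the sum of two even numbers is even", "sort a list in O(n log n)"]
curated = []
for prompt in prompts:
    for answer in generate_candidates(prompt):
        if passes_check(prompt, answer):
            curated.append({"prompt": prompt, "answer": answer})

# The surviving pairs would then get light human curation before being used
# as supervised fine-tuning data.
print(f"kept {len(curated)} synthetic training examples")
```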
Finally, I feel we should resist the urge to assume that if something came out of China it must have been ripped off from the US, involved large-scale industrial espionage, etc. By now China legitimately produces a lot of cutting-edge products and original science. Without specific proof shown by experts I would hesitate to adopt that viewpoint.
Also consider this: China is the largest producer of electricity in the world, and the cost of nearly every material input is quite competitive. Labor costs are also comparatively low (few deep learning engineers there earn million-dollar-plus salaries).
DeepSeek's papers also describe how they adopted various innovations to keep GPU costs low (multi-head latent attention to reduce GPU memory, use of FP8 instead of FP16/FP32, etc.). All of this would have kept the costs down.
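A back-of-envelope sketch of those two cost levers, with all dimensions being illustrative assumptions rather than DeepSeek's published configuration: multi-head latent attention caches a small per-token latent instead of full per-head keys and values, and FP8 halves the bytes per stored element relative to FP16 (and roughly doubles tensor-core throughput on GPUs that support it).

```python
# Illustrative dimensions only; not DeepSeek's actual configuration.
n_layers   = 60          # transformer layers (assumed)
n_heads    = 128         # attention heads (assumed)
head_dim   = 128         # per-head dimension (assumed)
latent_dim = 512         # compressed per-token KV latent in MLA (assumed)
seq_len    = 32_768      # context length (assumed)
bytes_fp16, bytes_fp8 = 2, 1

# 1) Multi-head latent attention: instead of caching full keys and values
#    for every head, cache one low-rank latent per token and reconstruct
#    K/V from it at attention time.
mha_cache = n_layers * seq_len * n_heads * head_dim * 2 * bytes_fp16  # K and V
mla_cache = n_layers * seq_len * latent_dim * bytes_fp16
print(f"KV cache, standard MHA: {mha_cache / 2**30:6.1f} GiB per sequence")
print(f"KV cache, MLA latent  : {mla_cache / 2**30:6.1f} GiB per sequence")

# 2) FP8 training: storing weights/activations in 8 bits instead of 16
#    roughly halves memory traffic on top of the cache savings above.
print(f"Bytes per element, FP16 vs FP8: {bytes_fp16} vs {bytes_fp8}")
```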
On a closing note, this model represents a gift to humanity, not unlike Google's and OpenAI's research papers. When I stop thinking of this model merely as "Chinese made" and instead (also) as "human made", my heart fills with admiration for this achievement. This model can be owned by humanity just like the Linux kernel is.
I have quite a rudimentary knowledge of AI, but from what I know, training off another model's outputs is a form of knowledge distillation. If DeepSeek had done that on 4o data, could that be the reason why their base model training costs were so low?
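For concreteness, "training off another model's outputs" in its simplest textbook form is the classic soft-label distillation loss sketched below: the student is trained to match the teacher's softened output distribution. This is the generic Hinton-style formulation, not a claim about what DeepSeek or OpenAI actually did; the tensors are random placeholders standing in for real model logits.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then minimise KL divergence.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradients keep a magnitude comparable to a hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

batch, vocab = 4, 32_000
student_logits = torch.randn(batch, vocab, requires_grad=True)
teacher_logits = torch.randn(batch, vocab)  # would come from the teacher model

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(f"distillation loss: {loss.item():.4f}")
```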