by rahimnathwani on 3/26/2025, 11:33:44 PM
by aktsvigun on 4/1/2025, 8:16:10 AM
I'm curious how they evaluate the responses in the first place. This is the part that replaces human annotation (which seems to be the cornerstone of their method), yet no detail is provided.
They're distilling a reasoning model, using a Llama model as a base, but they're using RL instead of SFT.
I'm curious:

1. How do they determine 'closely aligned'?
2. How does the performance of this RL approach compare with SFT using the same base model and same dataset?
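For what it's worth, the contrast behind question 2 comes down to two different training signals. A toy sketch of that difference (every distribution, the sampled token, and the reward stub below are made up for illustration; this is not the paper's setup):

```python
import math

# Toy next-token distributions at one output position.
teacher_probs = {"yes": 0.7, "no": 0.2, "maybe": 0.1}   # reasoning teacher
student_probs = {"yes": 0.5, "no": 0.3, "maybe": 0.2}   # base (e.g. Llama) student

# SFT-style distillation signal: per-token cross-entropy of the
# student against the teacher's output distribution.
sft_loss = -sum(p * math.log(student_probs[tok])
                for tok, p in teacher_probs.items())

# RL-style signal: sample a whole response, score it with a reward
# function (a hypothetical stub here), and weight the log-probability
# of the sampled token by that reward (REINFORCE-style).
def reward(response):          # hypothetical reward stub
    return 1.0 if response == "yes" else 0.0

sampled = "yes"                # pretend the student sampled this token
rl_signal = reward(sampled) * math.log(student_probs[sampled])

print(round(sft_loss, 3), round(rl_signal, 3))  # 0.887 -0.693
```

The point of the sketch: SFT pushes the student toward the teacher's token distribution directly, while RL only sees a scalar reward on sampled outputs — which is why comparing the two on the same base model and dataset (the commenter's question) is a meaningful ablation.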