by jasonjmcghee on 2/20/2025, 5:42:40 AM
by jboggan on 2/20/2025, 5:53:52 AM
Not sure why they included a hallucination as one of their first examples:
"Please recommend me three famous movies"
"The Empire Strikes Back (1980) - Directed by George Lucas"
by billconan on 2/20/2025, 2:24:55 AM
It doesn't seem to support variable-length input and output, does it?
The paper seems to use EOS padding to create fixed-length input/output.
So is there a maximum output length?
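If that reading is right, the padded length effectively acts as the maximum output length. A hypothetical sketch of the idea (not the paper's code; pad_to_fixed_length, max_len, and eos_id are made-up names), assuming a standard tokenizer with an EOS id:

    import torch

    def pad_to_fixed_length(token_ids: list[int], max_len: int, eos_id: int) -> torch.Tensor:
        # Truncate anything longer than max_len (so max_len is effectively the maximum output length),
        # then right-pad with EOS tokens so every sequence has the same fixed length.
        ids = token_ids[:max_len]
        ids = ids + [eos_id] * (max_len - len(ids))
        return torch.tensor(ids)

    # A 5-token answer padded to a fixed length of 8 (eos_id = 3):
    print(pad_to_fixed_length([11, 42, 7, 99, 3], max_len=8, eos_id=3))
    # tensor([11, 42,  7, 99,  3,  3,  3,  3])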
by flowerthoughts on 2/20/2025, 7:39:57 AM
Masking looks interesting for sequences that can't be lossy. If an image squishes a pixel here or there, it won't be noticed, but if a sentence lacks room for "if", that sounds bad.
Does this force the model to encode a high-level answering strategy? (AFAIU, there's no reordering during sampling.) Or does it mean a masking model of a certain size is more prone to making things up that fit the blank space?
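For intuition, here's a minimal sketch of confidence-based unmasking over a fixed-length canvas (not LLaDA's actual sampler; model, mask_id, and tokens_per_step are illustrative assumptions). Tokens are only ever committed in place, so there's no reordering during sampling:

    import torch

    def sample_masked(model, length: int, mask_id: int, tokens_per_step: int = 8) -> torch.Tensor:
        x = torch.full((1, length), mask_id)              # start from an all-[MASK] canvas
        while (x == mask_id).any():
            logits = model(x)                             # (1, length, vocab_size)
            conf, pred = logits.softmax(-1).max(-1)       # per-position confidence and argmax token
            masked = x == mask_id
            conf = conf.masked_fill(~masked, -1.0)        # only compete over still-masked slots
            k = min(tokens_per_step, int(masked.sum()))
            idx = conf.topk(k, dim=-1).indices[0]
            x[0, idx] = pred[0, idx]                      # commit tokens in place; positions never move
        return x

Since the canvas length is fixed before sampling starts, whatever "planning" happens has to show up in which positions get unmasked first and what fills the remaining space.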
As always, you kind of need to play with the model to see how well it actually works, since benchmarks can be misleading (e.g. phi-2).
But at face value, a new architectural approach with the same capacity (8B), trained on a dataset with 1/6th the tokens, being competitive with llama3-8b is exciting.