• by jasonjmcghee on 2/20/2025, 5:42:40 AM

    As always, you kind of need to play with the model to see how well it actually works, since benchmarks can be misleading (e.g. phi-2).

    But at face value, a new architectural approach with the same capacity (8b), trained on a dataset with 1/6th the tokens, being competitive with llama3-8b is exciting.

  • by jboggan on 2/20/2025, 5:53:52 AM

    Not sure why they included a hallucination as one of their first examples:

    "Please recommend me three famous movies"

    "The Empire Strikes Back (1980) - Directed by George Lucas"

    (The Empire Strikes Back was directed by Irvin Kershner, not George Lucas.)

  • by billconan on 2/20/2025, 2:24:55 AM

    It doesn't seem to support variable-length input and output, does it?

    The paper seems to use EOS padding to create fixed-length input/output.

    So is there a maximum output length? (Rough sketch of the padding below.)
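
    For illustration, EOS padding to a fixed generation length might look like the sketch below (placeholder token ids and lengths, not the paper's actual code); the practical effect is that whatever length is chosen acts as the maximum output length:

      EOS_ID = 2        # assumed EOS token id, not from the paper
      GEN_LEN = 128     # assumed fixed generation length

      def pad_to_fixed(token_ids, gen_len=GEN_LEN, eos_id=EOS_ID):
          # Truncate, then right-pad with EOS so every example is exactly gen_len tokens.
          out = token_ids[:gen_len]
          return out + [eos_id] * (gen_len - len(out))

      # A short answer becomes mostly EOS padding:
      print(pad_to_fixed([101, 57, 89, 2], gen_len=8))   # [101, 57, 89, 2, 2, 2, 2, 2]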

  • by flowerthoughts on 2/20/2025, 7:39:57 AM

    Masking looks interesting for sequences that can't be lossy. If an image squishes a pixel here or there, it won't be noticed, but if a sentence lacks room for "if", that sounds bad.

    Does this force the model to encode a high-level answering strategy? (AFAIU, there's no reordering during sampling.) Or does it mean a masking model of a certain size is more prone to making things up that fit the blank space?
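
    For reference, the sampling loop in masked diffusion LMs typically looks like the sketch below (placeholder names and shapes, not necessarily this paper's exact sampler): every answer slot starts masked, the model predicts all of them each round, and only the most confident slots get committed, so positions are fixed up front but the fill order is confidence-driven rather than left-to-right.

      import torch

      MASK_ID = 0      # placeholder mask token id
      GEN_LEN = 64     # fixed answer length, chosen up front
      STEPS = 8        # number of unmasking rounds

      def sample(model, prompt_ids):
          # model: callable returning [batch, seq_len, vocab] logits (placeholder)
          # prompt_ids: LongTensor of prompt token ids (placeholder)
          # Answer slots start fully masked; their positions never move.
          x = torch.cat([prompt_ids, torch.full((GEN_LEN,), MASK_ID)])
          for step in range(STEPS):
              logits = model(x.unsqueeze(0))[0]          # [seq_len, vocab]
              conf, pred = logits.softmax(-1).max(-1)
              masked = x == MASK_ID
              # Commit only the most confident masked slots this round;
              # the rest stay masked and get re-predicted next round.
              done = int((~masked).sum()) - len(prompt_ids)
              quota = GEN_LEN * (step + 1) // STEPS - done
              if quota > 0:
                  conf = conf.masked_fill(~masked, float("-inf"))
                  idx = conf.topk(quota).indices
                  x[idx] = pred[idx]
          return x

    With a loop like this, the "blank space" is whatever GEN_LEN was picked up front, which is where the worry about filling it with plausible-looking content comes from.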