• by cpldcpu on 6/9/2024, 3:41:44 PM

    The quantization approach is basically identical to the 1.58bit LLM paper:

    https://arxiv.org/abs/2402.17764

    The main addition of the new paper seems to be the implementation of optimized and fused kernels using triton, as seen here:

    https://github.com/ridgerchu/matmulfreellm/blob/master/mmfre...

    This is quite useful, as it should make training this type of LLM much more efficient.

    So this is a ternary-weight LLM using quantization-aware training (QAT). The activations are quantized to 8 bits. The matmul is still there, but it is multiplying the 8-bit activations by ternary weights.
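
    To make that concrete, here is a rough sketch of what the inner loop amounts to (illustrative only, nothing like their fused triton kernel): with ternary weights the per-element "multiply" collapses into an add, a subtract, or a skip.

        # Illustrative ternary-weight matvec: every "multiply" is really
        # an add, a subtract, or a skip.
        import numpy as np

        def ternary_matvec(w_ternary, x_int8):
            # w_ternary: (out, in) array with entries in {-1, 0, +1}
            # x_int8:    (in,) int8 activations, accumulated in int32
            x = x_int8.astype(np.int32)
            y = np.zeros(w_ternary.shape[0], dtype=np.int32)
            for i, row in enumerate(w_ternary):
                acc = 0
                for w, a in zip(row, x):
                    if w == 1:
                        acc += a      # add instead of multiply
                    elif w == -1:
                        acc -= a      # subtract instead of multiply
                    # w == 0: contributes nothing
                y[i] = acc
            return y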

    Quantization-aware training with low-bit weights seems to reduce overfitting through an intrinsic tendency to regularize. However, model capacity should also be reduced compared to a model with the same number of weights and more bits per weight. It's quite possible that this only becomes apparent after the models have been trained on a significant number of tokens, as LLMs seem to be quite sparse.

    Edit: In addition to the QAT they also changed the model architecture to use a linear transformer to reduce reliance on multiplications in the attention mechanism. Thanks to logicchains for pointing this out.

  • by buildbot on 6/9/2024, 4:39:07 AM

    Wow - This seems at first read to be really impressive work. They got scaling laws up to a reasonable size, 2.7B, and also ran a few downstream tasks. It would be interesting to see how a comparable model trained by someone else does on those tasks, to check their scores.

    They get real memory savings (61%!?) during training, and at inference too.

    On top of all that, they then go build an FPGA core which is programmed with a custom assembler. And their code is posted and works seamlessly with huggingface transformers?! Absolutely going to test this out.

  • by jph00 on 6/9/2024, 4:01:04 AM

    There was another matmul-free language model paper released a year ago FYI:

    https://arxiv.org/abs/2305.17190

  • by naasking on 6/9/2024, 6:05:09 PM

    I feel like all of these reductions of transformers to binary or ternary weights are essentially constructing an implicit decision tree, where each stage of the process answers a question with yes/no/I don't know, and "I don't know" invokes a continuation for further processing with more context.

  • by WithinReason on 6/9/2024, 11:14:53 AM

    Not sure if it's fair to call binary multiplication "multiplication free"; you can express any multiplication as a sequence of additions/subtractions.
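
    To illustrate the point with a toy example (my own sketch, unrelated to the paper's kernels): an ordinary integer multiply is itself just shift-and-add under the hood.

        # Multiply two non-negative integers using only shifts and adds.
        def mul_by_shift_add(a, b):
            acc = 0
            while b:
                if b & 1:
                    acc += a   # add the shifted partial product
                a <<= 1        # shift left instead of multiplying by 2
                b >>= 1
            return acc

        assert mul_by_shift_add(13, 11) == 143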

  • by throwaway71271 on 6/9/2024, 11:00:07 AM

    the github link in the paper: https://github.com/ridgerchu/matmulfreellm

    It is super easy to try out: the 2.7B, 1.3B, and 0.37B models are on huggingface, and the generate.py example just works if you have triton 2.2 installed.
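
    Roughly like this (the mmfreelm import and the exact model id are from memory of the repo and HF page, so double-check against their generate.py):

        # Rough sketch; see the repo's generate.py for the canonical version.
        import mmfreelm  # assumed package name; registers the model class with transformers
        from transformers import AutoModelForCausalLM, AutoTokenizer

        name = "ridger/MMfreeLM-2.7B"  # model id as I remember it from huggingface
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name).cuda().half()

        inputs = tokenizer("In a shocking finding, ", return_tensors="pt").to("cuda")
        out = model.generate(**inputs, max_new_tokens=32)
        print(tokenizer.decode(out[0], skip_special_tokens=True))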

  • by amluto on 6/9/2024, 8:44:34 AM

    One thing I didn’t figure out from just the paper: how does one train these parameters that are not even approximately real numbers? Specifically, most of the parameters are ternary (i.e. -1, 0, or 1). The approximate gradient discussed in the paper will (I think) give some real gradient on each parameter, and that can be further processed by the learning rate schedule, but the result is still a real number g_i for each parameter a_i. Normally one would update a_i to a_i + g_i, but with these ternary parameters, a_i + g_i isn’t ternary!

    So what’s the extra trick to make the model stay quantized? Does one evaluate the gradients on a whole bunch of training inputs, add them up, apply some randomness, and then re-quantize the model? Or is it something else?
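
    My guess, based on how prior work like BitNet handles this (not something I could confirm from the paper): keep a latent full-precision "shadow" copy of every weight, quantize it to ternary on the forward pass, pass the gradient straight through to the latent copy on the backward pass, and let the optimizer update only the latent copy, so nothing ever adds a real-valued gradient to a ternary value directly. A rough PyTorch-style sketch of that straight-through-estimator scheme:

        import torch
        import torch.nn.functional as F

        class TernaryLinear(torch.nn.Module):
            # Latent full-precision weights; the forward pass only ever
            # sees their ternary quantization.
            def __init__(self, in_f, out_f):
                super().__init__()
                self.weight = torch.nn.Parameter(torch.randn(out_f, in_f) * 0.02)

            def forward(self, x):
                w = self.weight
                scale = w.abs().mean().clamp(min=1e-5)            # absmean scale
                w_q = torch.round(w / scale).clamp(-1, 1) * scale
                # straight-through estimator: forward uses w_q, but the
                # gradient flows to w as if quantization weren't there
                w_ste = w + (w_q - w).detach()
                return F.linear(x, w_ste)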

  • by sva_ on 6/9/2024, 12:42:42 PM

    The FPGA in question, the Intel FPGA PAC D5005, seems to cost around $8k.

  • by PaulHoule on 6/9/2024, 11:50:52 AM

    This is why the NPU built into your processor could quickly become a liability instead of a benefit.

  • by hisoka44 on 6/9/2024, 1:33:57 PM

    Has anyone tried binary Hopfield networks like this, at an LLM-like massive scale?

  • by nuz on 6/9/2024, 11:35:56 AM

    Oh this is by the inventor of RWKV, cool

  • by gabesullice on 6/9/2024, 4:58:31 AM

    Reminds me of geohot's interview: https://youtu.be/wE1ZoMGIZHM