by tveita on 10/24/2024, 10:10:07 PM
So SpinQuant learns a rotation for activations and weights that, to my understanding, "smears" the outlier weights out so you don't get extreme values in any one weight.
Random anecdote warning - in the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest-neighbour search over a decent number of high-dimensional vectors.
I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates. Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.
Somewhere along the way I found this paper [1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.
As it turns out, the "random rotation" baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.
[1] https://ieeexplore.ieee.org/abstract/document/6296665 / https://slazebni.cs.illinois.edu/publications/ITQ.pdf
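For the curious, a minimal numpy sketch of that random-rotation baseline (all names and parameter choices here are mine, not from the paper): rotate with a random orthogonal matrix, quantize each vector to one bit per dimension, then scan by Hamming distance to collect candidates for exact re-ranking.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 128, 10_000
    data = rng.normal(size=(n, d)).astype(np.float32)

    # A random orthogonal rotation: QR decomposition of a Gaussian matrix.
    R, _ = np.linalg.qr(rng.normal(size=(d, d)))

    # One bit per dimension: keep only the sign of each rotated coordinate,
    # packed into bytes (128 dims -> 16 bytes per vector).
    codes = np.packbits((data @ R > 0).astype(np.uint8), axis=1)

    def candidates(query, k=10):
        qcode = np.packbits((query @ R > 0).astype(np.uint8))
        # XOR + popcount gives the Hamming distance to every stored code;
        # a linear scan through RAM like this is fast, as noted above.
        dists = np.unpackbits(codes ^ qcode, axis=1).sum(axis=1)
        return np.argsort(dists)[:k]  # candidate set for exact re-ranking

    print(candidates(rng.normal(size=d)))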
by nisten on 10/24/2024, 8:20:45 PM
It's pretty interesting that the new SpinQuant method did not manage to beat good old NF4 QLoRA training (Tim Dettmers really cooked with that one).
Really appreciate that Meta published both the results and the model quants, and didn't just make some BS claim about a new SOTA quant like most other big companies would've done.
by theanonymousone on 10/24/2024, 8:47:57 PM
May I ask if anyone has successfully used the 1B and 3B models in production, and if so, in what use cases? I seem to be failing even at seemingly simple tasks such as word translation or zero-shot classification. For example, they seem to ignore instructions to only write a response and no explanation, which makes it impossible to use them in a pipeline :/
by formalsystem on 10/24/2024, 10:37:44 PM
Hi, I'm Mark. I work on torchao, which was used for the quantization-aware training and ARM kernels in this blog post. If you have any questions about quantization or performance more generally, feel free to ask!
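For readers unfamiliar with quantization-aware training, here is a generic illustration of the core trick (a toy straight-through fake-quant sketch of my own, not torchao's actual API): the forward pass sees quantized weights, while the backward pass pretends rounding never happened, so the model learns weights that survive low-bit rounding.

    import torch

    def fake_quant(w, bits=4):
        qmax = 2 ** (bits - 1) - 1            # 7 for int4
        scale = w.abs().max() / qmax
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
        # Straight-through estimator: forward returns the dequantized
        # weights, backward treats the rounding as identity.
        return w + (q * scale - w).detach()

    w = torch.randn(8, 8, requires_grad=True)
    loss = fake_quant(w).square().sum()
    loss.backward()            # gradients flow as if no rounding happened
    print(w.grad.shape)        # torch.Size([8, 8])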
by philipkglass on 10/24/2024, 8:06:57 PM
These quantized models show much less degradation compared to a "vanilla post-training quantization", but there are a bunch of PTQ schemes that people have already applied to Llama models [1]. I didn't see any details about the vanilla PTQ they used as a baseline. Has it been written about elsewhere?
by yuvalr1 on 10/25/2024, 4:12:05 PM
Looking at how to deploy the 1B and 3B Llama models on Android for inference. Some posts online recommend using Termux (an amazing app) to get a Linux-like shell and then installing things as if it were Linux, using ollama for example. However, this forces you into a manual installation process, and most people don't know what Termux is and would be afraid to install it from F-Droid.
Can someone recommend a way to deploy Llama to Android without Termux, ideally something that could be fully implemented inside an app?
I'm currently looking into compiling llama.cpp for Android and bundling it inside an app. Is that a viable path? Would love to hear from anyone who has tried something similar.
by cmsj on 10/25/2024, 9:01:20 AM
It really bugs me that every time I see posts about new models, there is never any indication of how much VRAM one needs to actually run them.
by ed on 10/24/2024, 10:29:00 PM
Oh cool! I’ve been playing with quantized Llama 3B for the last week (4-bit SpinQuant). The code for SpinQuant has been public for a bit.
It’s pretty adept at most natural language tasks (“summarize this”) and performance on iPhone is usable. It’s even decent at tool use once you get the chat template right (see the sketch below).
But it struggles with JSON and HTML syntax (correctly escaping characters), and isn’t great at planning, which makes it a bad fit for most agentic uses.
My plan was to let Llama communicate with more advanced AIs, using natural language to offload tool use to them, but very quickly Llama goes rogue and starts doing things you didn’t ask it to, like trying to delete data.
Still - the progress Meta has made here is incredible and it seems we’ll have capable on-device agents in the next generation or two.
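On the chat template mentioned above, here is a rough sketch of the Llama 3.x chat format (the helper function is hypothetical; the special tokens follow Meta's published template): each turn is wrapped in role header tokens and closed with <|eot_id|>, and the prompt ends with an open assistant header so the model generates the reply.

    def format_llama3_chat(messages):
        out = "<|begin_of_text|>"
        for m in messages:
            out += (f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
                    f"{m['content']}<|eot_id|>")
        # Leave the assistant header open so the model writes the response.
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
        return out

    print(format_llama3_chat([
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize this: ..."},
    ]))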
by Evidlo on 10/25/2024, 3:56:25 AM
Why don't they actually say what the size of the model is in GB?
That and average inference times on common hardware are what I'm curious about.
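Rough arithmetic gets you most of the way there. A back-of-envelope sketch (my own rule of thumb, not figures from the post): weight storage is roughly parameter count times bits per weight, ignoring embeddings and runtime overhead.

    # size_in_bytes ≈ n_params * bits_per_weight / 8
    def approx_size_gb(n_params, bits):
        return n_params * bits / 8 / 1e9

    print(f"{approx_size_gb(1e9, 4):.2f} GB")  # 4-bit 1B model, ~0.5 GB
    print(f"{approx_size_gb(3e9, 4):.2f} GB")  # 4-bit 3B model, ~1.5 GB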
by itsTyrion on 11/3/2024, 6:19:20 PM
Wait, so I can get incorrect information and text summaries with things added or cut off even faster, and on mobile now? That's amazing.
by nikolayasdf123 on 10/25/2024, 1:50:59 AM
What's your opinion on LlamaStack?
For me it has been nothing short of a bad experience: it is way over-engineered, the quality is poor, it just plain does not work, and the maintainers are questionable. I would rather call HuggingFace Python code for inference, or anything else.
Is ExecuTorch any better?
by Tepix on 10/25/2024, 6:07:31 AM
From TFA:
> At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B
No, you did not. There is no source (in this case: training data) included. Stop changing the meaning of "open source", Meta!
by justanotheratom on 10/24/2024, 9:36:31 PM
Any pointers on how to fine-tune this on my dataset, then package and run it in my Swift iOS app?
by behnamoh on 10/24/2024, 9:17:44 PM
Does anyone know why quantization is the most common method for speeding up inference? I keep hearing about all sorts of new methods, but nearly none of them are implemented in practice (except for flash attention).
by EliBullockPapa on 10/24/2024, 8:45:35 PM
Anyone know a nice iOS app to run these locally?
by arnaudsm on 10/24/2024, 8:06:12 PM
How do these compare to the original quants on Ollama, like q4_K_S?
by newfocogi on 10/24/2024, 7:44:00 PM
TLDR: Quantized versions of Llama 3.2 1B and 3B models with "competitive accuracy" to the original versions (meaning some degraded performance; plots included in the release notes).
by mmaunder on 10/24/2024, 9:02:44 PM
[flagged]