by simonw on 4/20/2025, 2:14:22 PM
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.
I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.
Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
Last night I had it write me a complete plugin for my LLM tool. It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
by Samin100 on 4/20/2025, 8:10:38 PM
I have a few private “vibe check” questions and the 4-bit QAT 27B model got them all correct. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at DeepMind is reading this — Gemma 3 27B is the single most impressive open source model I have ever used. Well done!
by diggan on 4/20/2025, 1:29:20 PM
The first graph compares the "Elo Score" of various models at "native" BF16 precision, and the second compares VRAM usage between native BF16 precision and their QAT models. But since this method is about quantizing while also maintaining quality, isn't the obvious graph missing: the one comparing quality between BF16 and QAT? The text doesn't seem to talk about it either, yet it's basically the topic of the blog post.
by mark_l_watson on 4/20/2025, 7:00:40 PM
Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32G memory Mac.
gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also.
I still like Gemini 2.5 Pro and o3 for brainstorming or working on difficult problems, but for routine work it (simply) makes me feel good to have everything open source/weights running on my own system.
When I bought my 32GB Mac a year ago, I didn't expect to be this happy running gemma3:27b-it-qat with open-codex locally.
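For anyone who wants to try the same setup, this is roughly what it looks like through the Ollama Python client. A minimal sketch; it assumes you've already run "ollama pull gemma3:27b-it-qat" and that the Ollama server is listening on its default port:

    import ollama  # pip install ollama

    # Model tag assumed to match what Ollama publishes as gemma3:27b-it-qat.
    response = ollama.chat(
        model="gemma3:27b-it-qat",
        messages=[
            {"role": "user", "content": "Write a Python function that flattens a nested list."},
        ],
    )
    print(response["message"]["content"])  # or response.message.content on newer client versions

Nothing fancy, but it's a quick way to sanity-check that the QAT model is being served before wiring up other tools on top of it.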
by mekpro on 4/20/2025, 3:45:59 PM
Gemma 3 is way, way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its size: even though it can run fast thanks to MoE, the model is too large, which limits its audience to the small percentage of enthusiasts with enough GPU VRAM. Gemma 3, meanwhile, is widely usable across all hardware sizes.
by trebligdivad on 4/20/2025, 2:31:02 PM
It seems pretty impressive - I'm running it on my CPU (16-core AMD 3950X) and it's very, very impressive at translation, and the image description is very impressive as well. I'm getting about 2.3 tokens/s with it (compared to under 1/s on the Calme-3.2 I was previously using). It does tend to be a bit chatty unless you tell it not to be: it'll give you a 'breakdown' of pretty much everything unless you tell it not to, so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.
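For reference, a minimal sketch of that kind of translation setup with llama-cpp-python on CPU; the GGUF filename, context size, and thread count are placeholders, not necessarily the exact config described above:

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Local GGUF path, context size, and thread count are illustrative placeholders.
    llm = Llama(model_path="gemma-3-27b-it-q4_0.gguf", n_ctx=4096, n_threads=16)

    result = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "Translate the input to English, only output the translation."},
            {"role": "user", "content": "Je pense, donc je suis."},
        ],
    )
    print(result["choices"][0]["message"]["content"])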
by manjunaths on 4/21/2025, 6:58:54 AM
I am running this on a 16 GB AMD Radeon 7900 GRE in a machine with 64 GB of RAM, using ROCm and llama.cpp on Windows 11. I can use Open WebUI or the native GUI for the interface. It is made available via an internal IP to all members of my home.
It runs at around 26 tokens/sec at FP16 (FP8 is not supported by the Radeon 7900 GRE).
I just love it.
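For anyone wanting to reproduce this kind of home setup, here's a rough sketch of calling a llama.cpp server (llama-server exposes an OpenAI-compatible endpoint) from another machine on the LAN; the IP, port, and model name are placeholders, not the actual values used here:

    import requests

    # IP, port, and model name are placeholders for your own llama-server instance.
    resp = requests.post(
        "http://192.168.1.50:8080/v1/chat/completions",
        json={
            "model": "gemma-3-27b-it",
            "messages": [{"role": "user", "content": "Explain photosynthesis at a 9th-grade level."}],
        },
        timeout=300,
    )
    print(resp.json()["choices"][0]["message"]["content"])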
For coding, QwQ 32B is still king. But with a 16 GB VRAM card it gives me ~3 tokens/sec, which is unusable.
I tried to make Gemma 3 write a PowerShell script with a terminal GUI interface and it ran into dead ends and finally gave up. QwQ 32B performed a lot better.
But for most general purposes it is great. My kid's been feeding it his school textbooks and asking it questions. It is better than anything else currently.
Somehow it is more "uptight" than Llama or the Chinese models like Qwen. Can't put my finger on it; the Chinese models seem nicer and more talkative.
by behnamoh on 4/20/2025, 1:45:02 PM
This is what local LLMs need—being treated like first-class citizens by the companies that make them.
That said, the first graph is misleading about the number of H100s required to run DeepSeek r1 at FP16. The model is FP8.
by mythz on 4/20/2025, 1:46:15 PM
The speed gains are real: after downloading the latest QAT gemma3:27b, eval performance is now 1.47x faster on Ollama, up from 13.72 to 20.11 tok/s (on A4000s).
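If you want to reproduce that kind of measurement yourself, a small sketch against Ollama's local HTTP API; the prompt is arbitrary, and eval_count / eval_duration are the fields Ollama reports in its response:

    import requests

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gemma3:27b", "prompt": "Explain QAT in one paragraph.", "stream": False},
        timeout=600,
    ).json()

    # eval_duration is in nanoseconds; eval_count is the number of generated tokens.
    print(f"{r['eval_count'] / (r['eval_duration'] / 1e9):.2f} tok/s")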
by porphyra on 4/20/2025, 5:31:42 PM
It is funny that Microsoft has been peddling "AI PCs" and Apple has been peddling "made for Apple Intelligence" for a while now, when in fact models usable on consumer GPUs are only barely starting to be a thing, and even then only on extremely high-end cards like the 3090.
by emrah on 4/20/2025, 12:22:06 PM
Available on ollama: https://ollama.com/library/gemma3
by technologesus on 4/21/2025, 3:14:31 AM
Just for fun, I created a new personal benchmark for vision-enabled LLMs: playing Minecraft. I used JSON structured output in LM Studio to create basic controls for the game. Unfortunately, no matter how hard I proompted, gemma-3-27b QAT is not really able to understand simple Minecraft scenarios. It would say things like "I'm now looking at a stone block. I need to break it" when it is looking out at the horizon in the desert.
Here is the JSON schema: https://pastebin.com/SiEJ6LEz System prompt: https://pastebin.com/R68QkfQu
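For anyone who wants to try something similar, here's a stripped-down sketch of a structured-output call against LM Studio's local OpenAI-compatible server. The schema below is a simplified stand-in for illustration, not the one in the pastebin above, and the model identifier is assumed:

    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    # Simplified stand-in schema; the real one (pastebin above) defines more controls.
    schema = {
        "name": "minecraft_action",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "observation": {"type": "string"},
                "action": {"type": "string",
                           "enum": ["move_forward", "turn_left", "turn_right", "mine_block"]},
            },
            "required": ["observation", "action"],
            "additionalProperties": False,
        },
    }

    resp = client.chat.completions.create(
        model="gemma-3-27b-it-qat",  # whatever identifier LM Studio shows for the loaded model
        messages=[{"role": "user", "content": "Scene: flat desert, horizon ahead, no blocks in reach. What do you do?"}],
        response_format={"type": "json_schema", "json_schema": schema},
    )
    print(resp.choices[0].message.content)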
by miki123211 on 4/20/2025, 4:13:36 PM
What would be the best way to deploy this if you're maximizing for GPU utilization in a multi-user (API) scenario? Structured output support would be a big plus.
We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.
I would normally say vLLM, but the blog post notably does not mention vLLM support.
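For what it's worth, here's a rough sketch of what vLLM's offline batching API would look like if it does support Gemma 3 (which, as noted, the post doesn't confirm); the model ID and settings are illustrative:

    from vllm import LLM, SamplingParams  # pip install vllm

    # Illustrative settings; vLLM batches concurrent requests, which is what you
    # want when maximizing GPU utilization across many users.
    llm = LLM(model="google/gemma-3-27b-it", max_model_len=8192)
    params = SamplingParams(temperature=0.2, max_tokens=512)

    prompts = [
        "Summarize the data residency clauses in this contract: ...",
        "List the parties and effective date in this agreement: ...",
    ]
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)

For an actual multi-user API you'd run vLLM's OpenAI-compatible server instead, which also offers guided/structured decoding; whether these QAT checkpoints specifically are supported is the open question.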
by wtcactus on 4/20/2025, 1:33:11 PM
They keep mentioning the RTX 3090 (with 24 GB VRAM), but the model is only 14.1 GB.
Shouldn’t it fit a 5060 Ti 16GB, for instance?
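Back-of-the-envelope, the weights file isn't the whole story: you also need room for the KV cache, activations, and runtime overhead. A rough sketch, with attention-shape numbers that are purely illustrative assumptions (not Gemma 3's actual config):

    # All shape numbers below are illustrative assumptions, not Gemma 3's real config.
    weights_gb = 14.1                      # the 4-bit QAT checkpoint
    n_layers, n_kv_heads, head_dim = 48, 16, 128
    ctx_len, kv_bytes = 8192, 2            # fp16 KV cache entries

    # K and V, per layer, per KV head, per token
    kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bytes / 1e9
    overhead_gb = 1.5                      # activations + runtime buffers, rough guess

    print(f"KV cache ~{kv_cache_gb:.1f} GB, total ~{weights_gb + kv_cache_gb + overhead_gb:.1f} GB")

With numbers like these you land somewhere around 19 GB, which is why a 24 GB card is the comfortable recommendation; a 16 GB card may still work with a short context or a quantized KV cache.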
by casey2 on 4/22/2025, 12:22:40 AM
I don't get the appeal. For LLMs to be useful at all, you at least need to be in the dozen-exabit range per token, zettabit/s if you want something usable.
There is really no technological path towards supercomputers that fast on a human timescale, or even in 100 years.
The thing that makes LLMs useful is their ability to translate concepts from one domain to another. Overfitting on choice benchmarks, even a spread of them, will lower their usefulness in every general task by destroying information that is encoded in the weights.
Ask Gemma to write a five-paragraph essay on any niche topic and you will get plenty of statements that have an extremely small likelihood of existing in relation to the topic, but a high likelihood of existing in related, more popular topics. ChatGPT less so, but still at least one per paragraph. I'm not talking about factual errors or common oversimplifications; I'm talking about completely unrelated statements. What you're asking about is largely outside its training data, of which a 27 GB model gives you what, a few hundred gigs? Seems like a lot, but you have to remember that there is a lot of stuff you probably don't care about that many people do. Stainless steel and Kubernetes are going to be well represented; your favorite media, probably not; anything relatively current, definitely not. Which sounds fine, until you realize that people who care about stainless steel and Kubernetes likely care about some much more specific aspect which isn't going to be represented, and you are back to the same problem of low usability.
This is why I believe that scale is king and that both data and compute are the big walls. Google has YouTube data, but they are only using it in Gemini.
by umajho on 4/20/2025, 2:01:11 PM
I am currently using the Q4_K_M quantized version of gemma-3-27b-it locally. I previously assumed that a 27B model with image input support wouldn't be very high quality, but after actually using it, the generated responses feel better than those from my previously used DeepSeek-R1-Distill-Qwen-32B (Q4_K_M), and its recognition of images is also stronger than I expected. (I thought the model could only roughly understand the concepts in the image, but I didn't expect it to be able to recognize text within the image.)
Since this article publishes the optimized Q4 quantized version, it would be great if it included more comparisons between the new version and my currently used unoptimized Q4 version (such as benchmark scores).
(I deliberately wrote this reply in Chinese and had gemma-3-27b-it Q4_K_M translate it into English.)
by Havoc on 4/21/2025, 12:52:47 AM
Definitely my current fav. Also interesting that for many questions the response is very similar to the Gemini series; they must be sharing training data pretty directly.
by 999900000999 on 4/20/2025, 4:22:23 PM
Assuming this can match Claude's latest, and full-time usage (as in, you have a system that's constantly running code without any user input), you'd probably save $600 to $700 a month. A 4090 is only $2K, so you'd see an ROI within 90 days.
I can imagine this will serve to drive prices for hosted LLMs lower.
At this level, any company that produces even a nominal amount of code should be running LLMs on-prem (or on AWS if you're in the cloud).
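The payback math, using the rough numbers above:

    gpu_cost = 2000            # rough 4090 price
    monthly_savings = 650      # midpoint of the $600-700/month estimate
    print(f"break-even in ~{gpu_cost / monthly_savings * 30:.0f} days")  # ~92 days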
by piyh on 4/20/2025, 7:22:20 PM
Meta Maverick is crying in the shower getting so handily beat by a model with 15x fewer params
by jarbus on 4/20/2025, 1:31:43 PM
Very excited to see these kinds of techniques, I think getting a 30B level reasoning model usable on consumer hardware is going to be a game changer, especially if it uses less power.
by Alifatisk on 4/20/2025, 2:25:35 PM
Aside from it being lighter than the other models, is there anything else the Gemma model is specifically good at, or does better than the other models?
by holografix on 4/20/2025, 1:27:44 PM
Could 16 GB of VRAM be enough for the 27B QAT version?
by api on 4/20/2025, 3:32:00 PM
When I see 32B or 70B models performing similarly to 200+B models, I don't know what to make of it. Either the latter contain more breadth of information but we have managed to distill similar latent capabilities into smaller models, or the larger models are just less efficient, or the tests are not very good.
by yuweiloopy2 on 4/21/2025, 1:26:34 PM
Been using the 27B QAT model for batch processing 50K+ internal documents. The 128K context is game-changing for our legal review pipeline. Though I wish the token generation was faster - at 20tps it's still too slow for interactive use compared to Claude Opus.
by ece on 4/21/2025, 12:39:17 PM
On Hugging Face: https://huggingface.co/collections/google/gemma-3-qat-67ee61...
by briandear on 4/20/2025, 5:04:03 PM
The normal Gemma models seem to work fine on Apple silicon with Metal. Am I missing something?
by justanotheratom on 4/20/2025, 2:23:36 PM
Anyone packaged one of these in an iPhone App? I am sure it is doable, but I am curious what tokens/sec is possible these days. I would love to ship "private" AI Apps if we can get reasonable tokens/sec.
by punnerud on 4/20/2025, 9:43:35 PM
Just tested the 27B, and it’s not very good at following instructions and is very limited on more complex code problems.
Mapping from one JSON document with a lot of plain text into a new structure fails every time.
Ask it to generate SVG, and the result is very simple and almost too dumb.
Nice that it doesn't need a huge amount of RAM, and it performs OK on smaller languages, from my initial tests.
by CyberShadow on 4/20/2025, 3:23:14 PM
How does it compare to CodeGemma for programming tasks?
by gigel82 on 4/21/2025, 3:17:39 AM
FWIW, the 27b Q4_K_M takes about 23 GB of VRAM with 4k context and 29 GB with 16k context, and runs at ~61 t/s on my 5090.
by perching_aix on 4/20/2025, 3:19:08 PM
This is my first time trying to locally host a model - gave both the 12B and 27B QAT models a shot.
I was both impressed and disappointed. Setup was piss easy, and the models are great conversationalists. I have a 12 gig card available and the 12B model ran nice and swift.
However, they're seemingly terrible at actually assisting with stuff. I tried something very basic: asked for a PowerShell one-liner to get the native block size of my disks. It ended up hallucinating fields, then sent me off into the deep end: first elevating to admin, then using WMI, then bringing up IOCTLs. Pretty unfortunate. Not sure I'll be able to put it to actual meaningful use as a result.
by btbuildem on 4/20/2025, 1:57:51 PM
Is 27B the largest QAT Gemma 3? Given these size reductions, it would be amazing to have the 70B!
by gitroom on 4/21/2025, 5:30:41 PM
Nice, loving the push with local models lately. Always makes me wonder though: do you think privacy wins out over speed and convenience in the long run, or do people just stick with whatever's quickest?
by noodletheworld on 4/20/2025, 1:33:15 PM
?
Am I missing something?
These have been out for a while; if you follow the HF link you can see, for example, the 27b quant has been downloaded from HF 64,000 times over the last 10 days.
Is there something more to this, or is it just a follow-up blog post?
(Is it just that Ollama finally has partial support - no images, right? Or something else?)
by XCSme on 4/20/2025, 2:50:27 PM
So how does 27b-it-qat (18GB) compare to 27b-it-q4_K_M (17GB)?
by mattfrommars on 4/21/2025, 2:15:09 AM
Anyone had success using Gemma 3 QAT models on Ollama with Cline? They just don't work as well compared to Gemini 2.0 Flash provided via the API.
by anshumankmr on 4/21/2025, 8:24:35 AM
My trusty RTX 3060 is gonna have its day in the sun... though I have run a bunch of 7B models fairly easily on Ollama.
by cheriot on 4/20/2025, 8:29:11 PM
Is there already a Helium for GPUs?
by rob_c on 4/20/2025, 2:01:36 PM
Given how long it took between this being released and this community picking up on it... lol