by adam_arthur on 9/7/2023, 3:32:58 PM
by logicchains on 9/7/2023, 3:17:04 PM
Pretty amazing that in such a short span of time we went from people being amazed at how powerful GPT-3.5 was upon its release to people being able to run something equivalently powerful locally.
by regularfry on 9/7/2023, 3:10:56 PM
4-bit quantised model, to be precise.
When does this guy sleep?
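For anyone curious what the 4-bit quantisation mentioned above actually involves: roughly, the float weights are stored as small integers with a shared scale per block, trading a little accuracy for a large reduction in memory. Below is a toy sketch of that idea in Python; it illustrates block-wise 4-bit quantization in general and is not the exact ggml/GGUF Q4 layout.

    import numpy as np

    def quantize_q4(weights, block_size=32):
        """Toy block-wise 4-bit quantization: one float scale per block,
        one integer in [-7, 7] per weight (held in int8 here; a real
        format would pack two 4-bit values per byte)."""
        blocks = weights.reshape(-1, block_size)
        scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
        scales[scales == 0] = 1.0                      # avoid division by zero
        q = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
        return q, scales

    def dequantize_q4(q, scales):
        """Reconstruct approximate float weights from integers and scales."""
        return (q * scales).reshape(-1)

    w = np.random.randn(1024).astype(np.float32)
    q, s = quantize_q4(w)
    print("max abs error:", float(np.abs(w - dequantize_q4(q, s)).max()))

The real formats use carefully chosen block sizes and scale encodings, but the memory/accuracy trade-off is the same.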
by sbierwagen on 9/7/2023, 3:18:07 PM
The screenshot shows a working set size of 147,456 MB, so he's using the Mac Studio with 192 GB of RAM?
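A rough back-of-envelope supports that reading. Assuming roughly 4.5 bits per weight for a block-wise 4-bit quantization (4-bit values plus per-block scales; the exact figure depends on the GGUF variant, so treat it as an assumption), the weights alone for a 180B-parameter model come to about 94 GiB:

    # ~4.5 bits/weight is an assumed figure for a 4-bit block format,
    # not something taken from the screenshot.
    params = 180e9
    bits_per_weight = 4.5
    weights_gib = params * bits_per_weight / 8 / 1024**3
    print(f"~{weights_gib:.0f} GiB just for the quantized weights")  # ~94 GiB

The rest of the 147,456 MB (~144 GB) working set would be KV cache, compute buffers, and whatever else is resident; either way, it clearly doesn't fit in a 64 GB machine.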
by m3kw9 on 9/7/2023, 4:43:33 PM
OpenAI's moat will soon largely be UX. Anyone can do plugins, code, etc., but once LLMs become commodified, the best UX wins with everyday users. Just look at standalone digital cameras vs. the mobile phone cameras from Apple.
by homarp on 9/7/2023, 3:16:28 PM
https://www.reddit.com/r/LocalLLaMA/comments/16bynin/falcon_... has some more data, like sample answers at various levels of quantization
and https://huggingface.co/TheBloke/Falcon-180B-Chat-GGUF if you want to try
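If you do want to try a GGUF model locally, here is a minimal sketch using the llama-cpp-python bindings. The filename and settings are placeholders, Falcon-architecture support depends on the llama.cpp version you build against, and a 180B model at 4-bit still needs on the order of 100+ GB of RAM, so substitute a smaller GGUF just to test the flow.

    from llama_cpp import Llama

    # Hypothetical local filename; any GGUF file you have downloaded works here.
    llm = Llama(
        model_path="falcon-180b-chat.Q4_K_M.gguf",
        n_ctx=2048,     # context window
        n_threads=4,    # CPU threads, matching the system_info line in the video
    )
    out = llm("User: Say hello in one sentence.\nAssistant:", max_tokens=64)
    print(out["choices"][0]["text"])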
by doctoboggan on 9/7/2023, 5:17:45 PM
Georgi is doing so much to democratize LLM access; I am very thankful he is doing it all on Apple silicon!
by pella on 9/7/2023, 3:20:11 PM
Is this an M2 Ultra with 192 GB of unified memory, or the standard version with 64 GB of unified memory?
by Havoc on 9/7/2023, 10:42:47 PM
Great progress, but I also can't help but feel a sense of apprehension on the access front.
An M2 Ultra, while consumer tech, is affordable to only a fairly small % of the world's population.
by ViktorBash on 9/7/2023, 6:07:48 PM
It's refreshing to see how fast open LLMs are advancing in terms of the models available. A year ago I thought that, aside from the novelty of it, running LLMs locally would be nowhere close to something like OpenAI's closed models in terms of utility.
As more and more models become open and can be run locally, the precedent gets stronger (which is good for the end consumer, in my opinion).
by randomopining on 9/7/2023, 3:40:21 PM
Are there any actual use cases for running this stuff on a local computer? Or are most of these models really suited to running on remote clusters?
by two_in_one on 9/8/2023, 4:58:07 AM
Just wondering: what are local LLMs used for today? So far they look more like a promise than a practical tool.
by tiffanyh on 9/7/2023, 4:11:05 PM
system_info: n_threads = 4 / 24
Am I seeing correctly in the video that this ran on only 4 threads?
by growt on 9/7/2023, 5:48:01 PM
So how much RAM did the machine have?
by rvz on 9/7/2023, 3:25:47 PM
Totally makes sense to use C++- or Rust-based AI models for inference instead of the over-bloated networks run on Python, with their sub-optimal inference and fine-tuning costs.
Minimal-overhead or zero-cost abstractions around deep learning libraries implemented in those languages give some hope that people like ggerganov are not afraid of the 'don't roll your own deep learning library' dogma, and now we can see the results: DL on the edge and local AI are the future of efficiency in deep learning.
We'll see, but Python just can't compete on speed at all, which is why Modular's Mojo compiler is another project that solves the problem properly, with almost 1:1 familiarity with Python.
Even a linear growth rate of average RAM capacity would obviate the need to run current SOTA LLMs remotely in short order.
Historically average RAM has grown far faster than linear, and there really hasn't been anything pressing manufacturers to push the envelope here in the past few years... until now.
It could be that LLM model sizes keep increasing such that we continue to require cloud consumption, but I suspect the sizes will not increase as quickly as hardware for inference.
Given how useful GPT-4 is already, maybe one more iteration would unlock the vast majority of practical use cases.
I think people will be surprised that consumers ultimately end up benefitting far more from LLMs than the providers do. There's not going to be much moat or differentiation to defend margins... more of a race to the bottom on pricing.