• by yjftsjthsd-h on 10/11/2024, 4:04:11 PM

    If your goal is

    > I want to peel back the layers of the onion and other gluey-mess to gain insight into these models.

    Then this is great.

    If your goal is

    > Run and explore Llama models locally with minimal dependencies on CPU

    then I recommend https://github.com/Mozilla-Ocho/llamafile which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.
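    For illustration, once a llamafile is running in server mode it exposes an OpenAI-compatible HTTP API locally, so you can script against it with nothing but the standard library. A rough sketch (the port 8080 and the /v1/chat/completions path are the usual defaults, not something specific to this thread):

      import json, urllib.request

      # Assumes a llamafile server is already running locally; the local
      # server largely ignores the model name, so any label works.
      payload = {
          "model": "local",
          "messages": [{"role": "user", "content": "Explain RMSNorm in one sentence."}],
      }
      req = urllib.request.Request(
          "http://localhost:8080/v1/chat/completions",
          data=json.dumps(payload).encode(),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          print(json.loads(resp.read())["choices"][0]["message"]["content"])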

  • by littlestymaar on 10/11/2024, 4:44:48 PM

    With the same mindset, but without even PyTorch as a dependency, there's a straightforward CPU implementation of Llama/Gemma in Rust: https://github.com/samuel-vitorino/lm.rs/

    It's impressive to realize how little code is needed to run these models at all.
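    To make that concrete, the forward pass of a Llama-style decoder layer really is just a handful of matrix multiplies. Below is a hedged, NumPy-only sketch of one layer (single head, no KV cache or grouped-query attention, random weights) rather than anything taken from lm.rs itself:

      import numpy as np

      def rms_norm(x, w, eps=1e-5):
          # Scale by root-mean-square instead of mean/variance; no bias term.
          return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps) * w

      def rope(x, pos):
          # Rotary position embeddings: rotate channel pairs by a
          # position-dependent angle (base frequency varies by Llama version).
          d = x.shape[-1]
          inv_freq = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))
          ang = pos[:, None] * inv_freq[None, :]
          cos, sin = np.cos(ang), np.sin(ang)
          x1, x2 = x[..., 0::2], x[..., 1::2]
          out = np.empty_like(x)
          out[..., 0::2] = x1 * cos - x2 * sin
          out[..., 1::2] = x1 * sin + x2 * cos
          return out

      def layer(x, p):
          # RMSNorm -> causal self-attention with RoPE -> residual,
          # then RMSNorm -> SwiGLU feed-forward -> residual.
          seq, dim = x.shape
          pos = np.arange(seq)
          h = rms_norm(x, p["attn_norm"])
          q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
          q, k = rope(q, pos), rope(k, pos)
          scores = q @ k.T / np.sqrt(dim) + np.triu(np.full((seq, seq), -1e9), 1)
          att = np.exp(scores - scores.max(-1, keepdims=True))
          att /= att.sum(-1, keepdims=True)
          x = x + (att @ v) @ p["wo"]
          h = rms_norm(x, p["ffn_norm"])
          g = h @ p["w1"]
          return x + ((g / (1 + np.exp(-g))) * (h @ p["w3"])) @ p["w2"]

      # Smoke test with random weights: hidden size 64, FFN size 128.
      rng = np.random.default_rng(0)
      dim, ffn = 64, 128
      p = {"attn_norm": np.ones(dim), "ffn_norm": np.ones(dim)}
      for name, shape in [("wq", (dim, dim)), ("wk", (dim, dim)), ("wv", (dim, dim)),
                          ("wo", (dim, dim)), ("w1", (dim, ffn)), ("w3", (dim, ffn)),
                          ("w2", (ffn, dim))]:
          p[name] = 0.02 * rng.normal(size=shape)
      print(layer(rng.normal(size=(8, dim)), p).shape)  # (8, 64)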

  • by Ship_Star_1010 on 10/11/2024, 6:31:55 PM

    PyTorch has a native LLM solution that supports all the Llama models, on CPU, MPS, and CUDA: https://github.com/pytorch/torchchat. I'm getting 4.5 tokens a second with Llama 3.1 8B at full precision, CPU only, on my M1.

  • by I_am_tiberius on 10/11/2024, 8:51:51 PM

    Does anyone know what the easiest way to fine-tune a model locally is today?

  • by tcdent on 10/11/2024, 6:48:45 PM

    > from llama_models.llama3.reference_impl.model import Transformer

    FYI, this just imports the Llama reference implementation and patches the device.

    There are more robust implementations out there.
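    For anyone curious what "patches the device" likely boils down to, here's a speculative sketch (not the repo's actual code) of wrapping the quoted reference implementation for CPU-only use; the checkpoint-loading details are assumptions and left commented out:

      import torch
      from llama_models.llama3.reference_impl.model import Transformer  # the import quoted above

      # Force the reference implementation (written with CUDA in mind) onto CPU.
      torch.set_default_device("cpu")
      torch.set_default_dtype(torch.bfloat16)  # Llama 3 weights ship in bf16

      # Building the model would then look roughly like:
      #   model_args = ...  # populated from the checkpoint's params.json
      #   model = Transformer(model_args)
      #   state = torch.load("consolidated.00.pth", map_location="cpu", weights_only=True)
      #   model.load_state_dict(state, strict=False)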

  • by anordin95 on 10/8/2024, 1:45:14 AM

    Peel back the layers of the onion and other gluey-mess to gain insight into these models.

  • by klaussilveira on 10/12/2024, 6:23:58 PM

    Fast enough for RPI5 ARM?