• by yjftsjthsd-h on 10/11/2024, 4:04:11 PM

    If your goal is

    > I want to peel back the layers of the onion and other gluey-mess to gain insight into these models.

    Then this is great.

    If your goal is

    > Run and explore Llama models locally with minimal dependencies on CPU

    then I recommend https://github.com/Mozilla-Ocho/llamafile which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.
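    For illustration, once a llamafile is running in server mode it exposes an OpenAI-compatible HTTP API locally, so you can script against it with nothing but the standard library. A rough sketch (the port 8080 and the /v1/chat/completions path are the usual defaults, not something specific to this thread):

      import json, urllib.request

      # Assumes a llamafile server is already running locally; the local
      # server largely ignores the model name, so any label works.
      payload = {
          "model": "local",
          "messages": [{"role": "user", "content": "Explain RMSNorm in one sentence."}],
      }
      req = urllib.request.Request(
          "http://localhost:8080/v1/chat/completions",
          data=json.dumps(payload).encode(),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          print(json.loads(resp.read())["choices"][0]["message"]["content"])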

  • by littlestymaar on 10/11/2024, 4:44:48 PM

    With the same mindset, but without even PyTorch as a dependency, there's a straightforward CPU implementation of Llama/Gemma in Rust: https://github.com/samuel-vitorino/lm.rs/

    It's impressive to realize how little code is needed to run these models at all.
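    To make that concrete, the forward pass of a Llama-style decoder layer really is just a handful of matrix multiplies. Below is a hedged, NumPy-only sketch of one layer (single head, no KV cache or grouped-query attention, random weights) rather than anything taken from lm.rs itself:

      import numpy as np

      def rms_norm(x, w, eps=1e-5):
          # Scale by root-mean-square instead of mean/variance; no bias term.
          return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps) * w

      def rope(x, pos):
          # Rotary position embeddings: rotate channel pairs by a
          # position-dependent angle (base frequency varies by Llama version).
          d = x.shape[-1]
          inv_freq = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))
          ang = pos[:, None] * inv_freq[None, :]
          cos, sin = np.cos(ang), np.sin(ang)
          x1, x2 = x[..., 0::2], x[..., 1::2]
          out = np.empty_like(x)
          out[..., 0::2] = x1 * cos - x2 * sin
          out[..., 1::2] = x1 * sin + x2 * cos
          return out

      def layer(x, p):
          # RMSNorm -> causal self-attention with RoPE -> residual,
          # then RMSNorm -> SwiGLU feed-forward -> residual.
          seq, dim = x.shape
          pos = np.arange(seq)
          h = rms_norm(x, p["attn_norm"])
          q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
          q, k = rope(q, pos), rope(k, pos)
          scores = q @ k.T / np.sqrt(dim) + np.triu(np.full((seq, seq), -1e9), 1)
          att = np.exp(scores - scores.max(-1, keepdims=True))
          att /= att.sum(-1, keepdims=True)
          x = x + (att @ v) @ p["wo"]
          h = rms_norm(x, p["ffn_norm"])
          g = h @ p["w1"]
          return x + ((g / (1 + np.exp(-g))) * (h @ p["w3"])) @ p["w2"]

      # Smoke test with random weights: hidden size 64, FFN size 128.
      rng = np.random.default_rng(0)
      dim, ffn = 64, 128
      p = {"attn_norm": np.ones(dim), "ffn_norm": np.ones(dim)}
      for name, shape in [("wq", (dim, dim)), ("wk", (dim, dim)), ("wv", (dim, dim)),
                          ("wo", (dim, dim)), ("w1", (dim, ffn)), ("w3", (dim, ffn)),
                          ("w2", (ffn, dim))]:
          p[name] = 0.02 * rng.normal(size=shape)
      print(layer(rng.normal(size=(8, dim)), p).shape)  # (8, 64)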

  • by Ship_Star_1010 on 10/11/2024, 6:31:55 PM

    PyTorch has a native LLM solution that supports all the Llama models, on CPU, MPS, and CUDA: https://github.com/pytorch/torchchat. I'm getting 4.5 tokens a second with Llama 3.1 8B at full precision, CPU only, on my M1.

  • by I_am_tiberius on 10/11/2024, 8:51:51 PM

    Does anyone know what the easiest way to fine-tune a model locally is today?

  • by tcdent on 10/11/2024, 6:48:45 PM

    > from llama_models.llama3.reference_impl.model import Transformer

    FYI, this just imports the Llama reference implementation and patches the device.

    There are more robust implementations out there.
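    For anyone curious what "patches the device" likely boils down to, here's a speculative sketch (not the repo's actual code) of wrapping the quoted reference implementation for CPU-only use; the checkpoint-loading details are assumptions and left commented out:

      import torch
      from llama_models.llama3.reference_impl.model import Transformer  # the import quoted above

      # Force the reference implementation (written with CUDA in mind) onto CPU.
      torch.set_default_device("cpu")
      torch.set_default_dtype(torch.bfloat16)  # Llama 3 weights ship in bf16

      # Building the model would then look roughly like:
      #   model_args = ...  # populated from the checkpoint's params.json
      #   model = Transformer(model_args)
      #   state = torch.load("consolidated.00.pth", map_location="cpu", weights_only=True)
      #   model.load_state_dict(state, strict=False)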

  • by anordin95 on 10/8/2024, 1:45:14 AM

    Peel back the layers of the onion and other gluey-mess to gain insight into these models.

  • by klaussilveira on 10/12/2024, 6:23:58 PM

    Fast enough for RPI5 ARM?