Running Deepseek LLM on Raspberry Pi 4
Introduction
This post walks through running DeepSeek inference on a Raspberry Pi 4 using llama.cpp.
Set up llama.cpp
llama.cpp is a lightweight and fast inference engine for running LLMs on CPUs. Its build documentation provides instructions for building the project in various configurations, such as adding CUDA support or building on Android. Here I will build the project for Raspberry Pi 4 in the most basic way.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build llama.cpp using CMake:
cmake -B build
cmake --build build --config Release -j 8 # Build with 8 threads
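If the build finishes without errors, the llama-cli binary should end up under build/bin (assuming the default CMake layout used above). A quick sanity check:
# Verify the binary exists and prints its build info
ls build/bin/llama-cli
./build/bin/llama-cli --version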
Download the DeepSeek distill model
We will use the distill model with 8B parameters. On the model's Hugging Face page, click Files and versions near the top to download variants with different levels of quantization. I tested the F16 and Q2_K models: F16 is the full-precision model, while Q2_K is quantized to 2 bits. The F16 model takes 2-3 minutes to load, and token inference never starts before the llama process gets stuck; the Q2_K model loads much faster, but its inference speed is still only about 1 token/second.
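As a sketch, the GGUF file can also be fetched from the command line with huggingface-cli instead of the browser. The repository owner and file names below are placeholders, so substitute the actual ones shown on the Hugging Face page:
# Install the Hugging Face CLI (if not already available)
pip install -U "huggingface_hub[cli]"
# Download the 2-bit quantized GGUF into the models/ directory
# (repository owner is a placeholder; check the model page for the real one)
huggingface-cli download <repo-owner>/DeepSeek-R1-Distill-Llama-8B-GGUF \
  DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf --local-dir models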
Run the model
In the top-level directory of the llama.cpp repository, run the following command to start the model:
./build/bin/llama-cli -m models/DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf
The performance documentation provides tips on configuring multiple threads and GPUs to improve performance. More runtime option flags can be listed by passing --help to llama-cli.
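For example, here is a minimal sketch that pins the thread count to the Pi 4's four cores and limits the response length; the flag names come from llama-cli --help, and the prompt is just an illustration:
# Use 4 threads (one per Pi 4 core), limit output to 64 tokens
./build/bin/llama-cli -m models/DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf \
  -t 4 -n 64 -p "Explain what a Raspberry Pi is in one sentence."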