Running LLMs locally with ollama

With the release of more and more new and improved generative AI models, along with their changing infrastructure and newly added components, it is easy to be overwhelmed by the requirements and setup instructions of newer models. Newcomers especially have a hard time understanding the landscape and getting something to run for evaluation - which is exactly where ollama comes in.

What is ollama?

What exactly ollama is can be hard to grasp at first, because it is an entire ecosystem of features. It comes with command-line tools, an online library of ready-to-use, preconfigured models and process management capabilities to help you run, query and even serve locally running models over a REST API. If you are familiar with Docker, you'll feel right at home in seconds, as many of its principles and much of its syntax are reused by ollama, albeit in a different context.

To get started, all you need to do is download and install the latest release from their website.
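
On Linux, the download usually boils down to the official one-line install script; at the time of writing it looks like this (double-check the download page for the current command):

curl -fsSL https://ollama.com/install.sh | sh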

Picking the right model

Before running a model locally, you'll need to find one that suits your needs and works on your hardware. Start by browsing the ollama library and pick a model that you want to run locally. There will be many different versions of that model available, with the default usually being the 7b variant. 7b means 7 billion parameters and is considered fairly small in the world of AI models. More parameters, like 32b or 70b, will provide much better quality, but also use much more memory.

The size of each variant is displayed in the drop-down menu and serves as a rough estimate of how much space the model needs at runtime. Keep in mind that models in the ollama library use the GGUF format, which can be split across devices, so the available space is your graphics card memory plus your system memory. The actual memory consumption of a model can be slightly higher than its file size, so keep some headroom for safety.

For example, small models like orca-mini should be fine on systems with older graphics cards and as little as 4 GB of memory, so it might be a good starting point for testing. Don't worry if you aren't sure about resource requirements at first glance; ollama will give you a warning if you try to run a model that is beyond your computer's abilities.

Running a model

All you need to do to run a model locally is open a terminal and type

ollama run orca-mini

The command will download the model if it is not present locally, then start it and provide a prompt to interact with it:

ollama run orca-mini
>>> Send a message (/? for help)

Any message you type will be fed to the model in the background and the output written back in real time. You can quit your active session by typing /bye.

If you only want an answer to a single prompt instead of an entire chat session, you can feed it to the model during the run command:

ollama run orca-mini "Write a poem about horses"

This time, it will simply output the answer and exit. You can further control the output with the --format flag, for example to receive JSON instead of plain text:

ollama run orca-mini "Write a poem about horses" --format json

See ollama help run for a list of options.
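
Because the run command also reads from standard input when you pipe text into it, it fits nicely into shell scripts. A small sketch, assuming a prompt.txt file containing your question:

cat prompt.txt | ollama run orca-mini > answer.txt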

Viewing and stopping models

When running a model, ollama will keep it loaded in the background for a while (5 minutes by default), because loading a model can be slow and users often want more than one interaction with it in a short period of time. You can view the models running in the background with

ollama ps

The output provides a list, one line per running model:

NAME               ID             SIZE     PROCESSOR         UNTIL            
orca-mini:latest   2dbd9f439647   3.4 GB   52%/48% CPU/GPU   4 minutes from now   

You may notice that the PROCESSOR column shows different values for CPU and GPU. Your results may differ, but if your graphics card (GPU) doesn't have enough memory to fit the entire model, some parts of it will be offloaded to system RAM and processed by the CPU, using as much of your hardware as needed.

To stop a model running in the background like this, all you need is its name from the first column (everything before the colon):

ollama stop orca-mini

The model would also shut down automatically when left idle, but you may want to stop it earlier if you want to run a different model right away.
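
If the default five minutes doesn't fit your workflow, newer ollama versions also let you override the keep-alive duration per invocation via the run command's --keepalive flag, which takes a duration string:

ollama run orca-mini --keepalive 30m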

Managing local models

As you have already seen, when running a model locally, ollama will download the model if it is missing. You can also do this manually using the pull command if you prefer:

ollama pull orca-mini

Once downloaded, models are kept on your local disk by default so you don't have to fetch them again in the future. You can view them with

ollama list

The list only shows you what is stored locally, not all models that exist in the ollama library.
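
The output uses the same tabular style as ollama ps, with a MODIFIED column instead of the runtime information - roughly like this:

NAME               ID             SIZE     MODIFIED
orca-mini:latest   2dbd9f439647   2.0 GB   2 hours ago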

If you are done with a model and don't want to use it anymore, you can remove it again (removing orca-mini in this example):

ollama rm orca-mini

Removing models you don't need anymore ensures your disk doesn't fill up unexpectedly over time with unused model data.

Setting parameters in a chat session

When running an interactive chat session with an AI model like

ollama run orca-mini

you can tweak some parameters of how inference works during the session. Typing /set parameter on its own shows a list of all available options:

Available Parameters:
 /set parameter seed <int>            Random number seed
 /set parameter num_predict <int>     Max number of tokens to predict
 /set parameter top_k <int>           Pick from top k num of tokens
 /set parameter top_p <float>         Pick token based on sum of probabilities
 /set parameter min_p <float>         Pick token based on top token probability * min_p
 /set parameter num_ctx <int>         Set the context size
 /set parameter temperature <float>   Set creativity level
 /set parameter repeat_penalty <float> How strongly to penalize repetitions
 /set parameter repeat_last_n <int>   Set how far back to look for repetitions
 /set parameter num_gpu <int>         The number of layers to send to the GPU
 /set parameter stop <string> <string> ...  Set the stop parameters

Depending on your AI model of choice, some of these may be fixed (like the context size or the penalties). Most commonly, you will want to adjust the temperature parameter, which controls how creative the model's responses will be. Sane values range from 0.0 (very precise) to 2.0 (very creative), with 1.0 representing a balance between the two.

Changing the num_gpu parameter adjusts how many of the model's layers are processed by the GPU instead of the CPU, with 0 keeping the entire model on the CPU. In most cases you shouldn't touch this setting, but if your graphics card is underused, you may be able to get better speeds by adjusting it upwards.
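
As a concrete example, the following commands inside a session would make responses more deterministic and force CPU-only inference:

/set parameter temperature 0.2
/set parameter num_gpu 0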

Using specific model versions

Much like in Docker, you can also run more specific versions of a model. The previous command had no tag provided, so it fell back to the default variant (tagged latest). If you want a different variant, specify its tag after the model name, separated by a colon:

ollama run orca-mini:13b

If you want to be even more detailed, you can pick a tag for a specific quantization method and quality as well:

ollama run orca-mini:13b-q3_K_M

Refer to the orca-mini tags on the ollama library website to see a full list of available tags to use.

Don't worry if you don't understand what quantization is or how model parameters work, because ollama will keep most of that complexity hidden from users until they need it.

If you are familiar with Hugging Face, you can easily use any GGUF-formatted model from there as well! Just click the "Use this model" button on the top right, select the Ollama option, copy the command and run it:

ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:IQ4_XS

The model will work just like any other from ollama's library, except with a slightly longer name, of course.

Inspecting models

A lot of metadata is available for LLMs; it is downloaded along with the model when fetching it from ollama's library:

ollama show orca-mini

which shows:

 Model
   architecture       llama   
   parameters         3.4B    
   context length     2048    
   embedding length   3200    
   quantization       Q4_0    

 Parameters
   stop   "### System:"     
   stop   "### User:"       
   stop   "### Response:"   

 System
   You are an AI assistant that follows instruction extremely well. Help as much as you can.   

More specific information is also available by using flags (see ollama help show for a list):

ollama show orca-mini --modelfile

this shows the modelfile used to construct the model for ollama:

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM orca-mini:latest

FROM /usr/share/ollama/.ollama/models/blobs/sha256-66002b78c70a22ab25e16cc9a1736c6cc6335398c7312e3eb33db202350afe66
TEMPLATE "{{- if .System }}
### System:
{{ .System }}
{{- end }}

### User:
{{ .Prompt }}

### Response:
"
SYSTEM You are an AI assistant that follows instruction extremely well. Help as much as you can.
PARAMETER stop ### System:
PARAMETER stop ### User:
PARAMETER stop ### Response:

If you know Docker, this will look very familiar, as it is obviously inspired by the Dockerfile syntax.
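
The Modelfile is also the starting point for building your own variants: dump it to a file, adjust it (for example, change FROM to orca-mini:latest as the generated comment suggests and tweak the SYSTEM prompt), then build a new local model from it with ollama create. A minimal sketch, with my-orca as a made-up name:

ollama show orca-mini --modelfile > Modelfile
ollama create my-orca -f Modelfile
ollama run my-orca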

Serving models over a REST HTTP API

In order to interact with models from code, most developers rely on APIs to make requests and receive responses. Ollama comes with this feature included, and simply running

ollama serve

starts an HTTP API on port 11434 by default. On Linux, the command is already registered as a service for you, so start it with systemctl instead:

systemctl start ollama

The API allows interacting with any locally available model through request parameters, but you may want to adjust how long models are kept alive (left running after they were last used) to avoid sudden latency spikes in API responses. The API aims for compatibility with that of OpenAI, so their default API clients can be used with the ollama API as well, allowing developers to easily port applications to the new, locally available endpoint.
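
As a quick sanity check (assuming the default port and the orca-mini model from earlier), a completion request with curl could look like this - the optional keep_alive field controls how long the model stays loaded after the request:

curl http://localhost:11434/api/generate -d '{
  "model": "orca-mini",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "keep_alive": "10m"
}'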
