Understanding the different sizes of AI models

Using AI models is fairly simple through tools like ollama or open-webui, but with the ease of deployment comes a new problem: choosing a model variant. This guide will help you make the most of your hardware.

Why size matters

When running models, there are three factors to balance: inference speed, hardware cost and output quality. Any change away from the original model will reduce quality, although not always predictably - reducing size reduces quality, but not linearly. Inference speed is mostly driven by parameter count and computational complexity/hardware support (for example, using better-supported numeric types), while hardware cost is typically bound to VRAM (GPU memory). Fitting a model entirely into VRAM allows for the fastest inference, but GPUs are expensive and memory-limited.

Balancing these three considerations is highly subjective and often varies between use cases, so understanding which model formats, optimizations and compression methods are available is vital to choosing the best fit for your needs.

Parameter count & distillation

A model's parameter count directly limits its ability to retain information, but having many learnable weights doesn't mean they are used efficiently. When initially training a new model, using very large parameter counts helps retain more information and patterns from the training dataset. Running these large models during inference, however, is expensive, since hundreds of billions (or even trillions!) of parameters are involved in every computation and the model takes up extreme amounts of VRAM.

Luckily, training a smaller model to mimic the answers of a larger model retains a lot of its output quality and capabilities while requiring much less power and memory during inference. This process is called distillation, and it is the reason why most LLMs come in several variants of differing size. Strictly speaking, these variants aren't the same model, since each parameter count is its own model trained to mirror the teacher, but all models within the same family turn out largely similar because they share the same teacher model.
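To make the mimicry concrete, here is a minimal sketch of a typical distillation loss, assuming PyTorch; the tensor shapes, names and temperature value are illustrative, not taken from any specific model.

```python
# Minimal distillation loss sketch (assumes PyTorch is installed).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both output distributions with a temperature, then push the
    # student's distribution towards the teacher's via KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 "token positions" over a 10-entry vocabulary.
teacher_logits = torch.randn(4, 10)                       # frozen teacher output
student_logits = torch.randn(4, 10, requires_grad=True)   # student output
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()   # gradients flow into the student only
```

In real training pipelines this soft-target loss is usually combined with the regular next-token loss on the training data.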

Parameter count is typically included in a model's name as an abbreviated size, for example 7B for a model using 7 billion weights (parameters). The number of learnable weights has the biggest impact on model size, quality and inference speed: fewer parameters make a model smaller and faster during inference, but at reduced quality.
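As a rough back-of-the-envelope check, the weights alone already dominate memory requirements; the 7B figure and the 2 bytes per weight (fp16/bf16, covered in the next section) are just illustrative assumptions.

```python
# Rough VRAM estimate from parameter count alone (weights only; ignores
# activations, KV cache and runtime overhead).
params = 7_000_000_000      # a "7B" model
bytes_per_weight = 2        # fp16/bf16, see the precision section below

weight_bytes = params * bytes_per_weight
print(f"~{weight_bytes / 1024**3:.1f} GiB just for the weights")
# -> ~13.0 GiB; halving the parameter count roughly halves this figure
```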

Lower-precision weights

Model weights are essentially just matrices (arrays of arrays) of floating point numbers. To store the numeric values, there are three major options in machine learning:

  • fp32 or "full precision"; a simple 4-byte (32-bit) float variable

  • fp16 and bf16 both use 2 bytes (16 bits) per weight. fp16 is a simple half-precision float, whereas bf16 (also called "brainfloat 16") retains the exponent size of fp32 and only reduces the mantissa precision, leading to easier training.

  • fp8 uses only 1 byte (8 bits) of storage per weight

Lower precision means lower memory requirements and faster computation, but also a greater loss of detail.
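To see the range/precision tradeoff concretely, the following sketch compares the numeric properties of these types, assuming PyTorch is installed (fp8 dtypes exist only in newer PyTorch releases, so they are left out here).

```python
# Compare the numeric properties of common training/inference dtypes.
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # bf16 keeps roughly fp32's maximum representable value (same exponent
    # size) but has a much coarser resolution; fp16 is finer-grained but
    # overflows far earlier.
    print(f"{str(dtype):15} bits={info.bits:2}  max={info.max:.2e}  eps={info.eps:.2e}")
```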


During training, higher precision typically yields better results, so most models are trained in fp16 or bf16 to strike a balance between hardware requirements, cost and model quality.

For inference, however, these models are often reduced to 8 bits per weight, as the impact on output quality is usually minor while memory usage and computation time drop significantly.

New hardware and training optimizations now allow training some models directly at 8 bits per weight, but this is less common (for now).

Quantization

Reducing the size of models for inference (once training is complete) is primarily done through quantization, which essentially means mapping the floating point weight values onto a range of integer values, scaling them in the process. In practice, this is usually done in "blocks" by grouping weight values together, for example by layer or row.
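A minimal sketch of the idea using NumPy: symmetric 4-bit block quantization where each block stores a single scale. Block size and rounding details are simplified compared to real formats, and the function names are made up for illustration.

```python
# Toy block-wise 4-bit quantization ("type 0" style: one scale per block).
import numpy as np

def quantize_blocks(weights: np.ndarray, block_size: int = 32):
    blocks = weights.reshape(-1, block_size)
    # One scale per block, chosen so the largest magnitude maps to +/-7.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                       # avoid division by zero
    q = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_blocks(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # "Decompress" back into floats for the matrix computations.
    return (q.astype(np.float32) * scales).reshape(-1)

weights = np.random.randn(64).astype(np.float32)
q, scales = quantize_blocks(weights)
restored = dequantize_blocks(q, scales)
print("max rounding error:", np.abs(weights - restored).max())
```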


You may think of this as a form of "compression", as a quantized model's layers will need to be dequantized ("decompressed") back into floating point numbers for the matrix computations. The main benefit is trading a tiny bit of compute overhead and some accuracy for much less memory space (and by extension memory bandwidth). But not all quantizations are equal:

  • Q8 has the unique ability to utilize hardware support for INT8 computations, allowing it to run inference in its quantized state using much faster integer arithmetic and only dequantize results where necessary. While it is the least memory-efficient quantization level, it is by far the fastest, especially on modern hardware.

  • Q4 - Q1 rarely have hardware support for integer operations and thus need to be converted back into floating point values for inference. These types incur a more noticeable computational overhead, though it is usually still outweighed by the speed gains from moving much less memory around.

You can identify quantized models by the letter "Q" followed by a number between 8 and 1, like mymodel_Q4, in the filename. The number after the Q stands for the bits used to store each weight; higher numbers mean more memory and slower speeds due to memory bandwidth limitations, but better quality. Q8 is the exception: hardware support for integer arithmetic outweighs the memory bandwidth bottleneck despite its size - it requires the most memory, but is also the fastest and highest quality.


The above quantization is also known as "type 0", often expressed as _0 in model names, like mymodel_Q4_0. There is a second variant which stores slightly more information per block (a zero point/minimum alongside the scale), trading slightly more computational overhead and memory for higher accuracy. This version is called "type 1", with filenames like mymodel_Q2_1. For most inference applications, you will typically prefer type 0.
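For comparison with the earlier sketch, here is a minimal "type 1" style variant that stores a per-block minimum (zero point) alongside the scale; again this is illustrative, not the exact on-disk layout of any real format.

```python
# Toy "type 1" style 4-bit quantization: a scale plus a minimum per block.
import numpy as np

def quantize_blocks_t1(weights: np.ndarray, block_size: int = 32):
    blocks = weights.reshape(-1, block_size)
    mins = blocks.min(axis=1, keepdims=True)
    scales = (blocks.max(axis=1, keepdims=True) - mins) / 15.0   # 16 levels: 0..15
    scales[scales == 0] = 1.0
    q = np.clip(np.round((blocks - mins) / scales), 0, 15).astype(np.uint8)
    return q, scales, mins

def dequantize_blocks_t1(q, scales, mins):
    # The extra per-block data (mins) buys back some accuracy at a small cost.
    return (q.astype(np.float32) * scales + mins).reshape(-1)

weights = np.random.randn(64).astype(np.float32)
q, scales, mins = quantize_blocks_t1(weights)
print("max rounding error:", np.abs(weights - dequantize_blocks_t1(q, scales, mins)).max())
```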


Note that the speed comparisons and tradeoffs above only hold for basic type 0/1 quantization; k-quant and importance-aware quantization can change performance characteristics considerably.

K-Quant Quantization

Improving on basic quantization, k-quant quantization introduces a more flexible block layout, using different quantization strategies and bit widths per block, and further optimizes rounding and normalization.


This flexibility means models are often offered in different configurations, using the letter K to indicate k-quant quantization with one of _L/_M/_S as a suffix.

For example, a 4bit quantized model may be available as mymodel_Q4_K_L (largest but best quality), mymodel_Q4_K_M (medium size but still decent quality) or mymodel_Q4_K_S (smallest size but worst quality).


Note that since k-quant can use varying bits per weight for each block, the Q4 really means "on average 4 bits per weight across all blocks", but is often still referred to as "4bit quantized".
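As a quick illustration of the "average bits per weight" idea, consider a handful of blocks quantized at different bit widths (the numbers below are made up, not taken from a real k-quant layout).

```python
# Mixed-precision blocks: the advertised "Q4" is just the weighted average.
block_sizes = [256, 256, 256, 256]     # weights per block (illustrative)
bits_per_block = [6, 4, 4, 3]          # bit width chosen per block (illustrative)

total_bits = sum(n * b for n, b in zip(block_sizes, bits_per_block))
avg_bits = total_bits / sum(block_sizes)
print(f"average bits per weight: {avg_bits:.2f}")   # -> 4.25
```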


K-quant quantized model sizes hover around those of basic quantized ones (_K_L slightly larger, _K_M roughly equal, _K_S slightly smaller), but provide better quality at the cost of more computational overhead (to undo the more complex quantization) during inference. Most applications should prefer _K_M for inference, as the quality gains of _K_L are often minimal in comparison, and the quality drop of _K_S can be quite steep for a modest reduction in size.

Note: k-quant quality is not always this predictable; check the model author's notes to be sure!

Importance-aware quantization

Some model weights have more impact on output quality than others, a fact that importance-aware quantization exploits to further improve on k-quant quantization. It uses importance matrices to determine which weights matter most for output quality, then assigns those weights higher bit widths while reducing the bits of less important weights. This often results in better quality at similar (or even slightly smaller) model sizes. Since the dequantization computations are almost identical, importance-aware quantization does not slow down inference compared to k-quant.
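A highly simplified sketch of the idea: use an importance score per weight group to decide which groups get more bits. The scores and thresholds below are made up; real implementations derive the importance matrix from calibration data.

```python
# Toy importance-aware bit assignment: important groups get more bits.
import numpy as np

rng = np.random.default_rng(0)
importance = rng.random(8)          # one (made-up) score per weight group

# Give the most important half of the groups 5 bits and the rest 3 bits,
# so the average works out to a 4-bit budget.
threshold = np.quantile(importance, 0.5)
bits = np.where(importance >= threshold, 5, 3)

print("importance:   ", np.round(importance, 2))
print("assigned bits:", bits)
print("average bits per weight group:", bits.mean())
```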


Filenames will contain IQ followed by the average number of bits per weight when quantized with this technique, sometimes followed by the _K suffix, and a size indicator from XS (largest) to XXS (smallest). This means that the sample files mymodel_IQ4_XS and mymodel_IQ4_K_XS refer to the same quantization method (though including the _K is less common).


Importance-aware quantization will usually outperform normal k-quant variants significantly, with IQ4_XS often being noticeably better quality than a Q4_K_M alternative.
