AI models have improved rapidly in recent years, and many different file formats have been developed and popularized along the way. The number of different file extensions can be daunting at first, but they can generally be sorted into three main groups, each with easily understandable tradeoffs against the other contenders in its group.
Raw model weights
The most common files containing AI model weights use raw formats, meaning unoptimized or minimally modified tensor data. This type of format can accommodate any kind of AI model, from large language models to image generators.
.pth (PyTorch weights) and .ckpt (PyTorch checkpoint):
Raw saved model data from PyTorch. These files are typically much slower to load than the other formats and can contain arbitrary Python code, posing a potential security risk. The lightweight .pth format only stores weights, while .ckpt includes more context such as optimizer states and training progress, making it preferred for image generation models.
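As a minimal sketch (with a placeholder model and file names), this is roughly how the two variants are written and read back with PyTorch; the weights_only flag available in recent PyTorch releases is the safer way to load files you do not fully trust.

```python
import torch
import torch.nn as nn

# A tiny placeholder model; any nn.Module works the same way.
model = nn.Linear(16, 4)

# .pth style: store only the weights (the state_dict).
torch.save(model.state_dict(), "model.pth")

# .ckpt style: store weights plus training context such as the optimizer state.
optimizer = torch.optim.Adam(model.parameters())
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": 10}, "model.ckpt")

# Loading: weights_only=True refuses to unpickle arbitrary Python objects,
# which mitigates the code-execution risk mentioned above.
state = torch.load("model.pth", weights_only=True)
model.load_state_dict(state)
```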
.pkl (Python pickle) and .bin (arbitrary binary):
Similar to the previous group, but not standardized in any way. They can contain arbitrary code and therefore pose a security risk. The .pkl format is Python's native serialization, whereas .bin is generic and could contain just about anything.
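The security risk is worth spelling out: unpickling executes whatever reconstruction instructions the file contains. The sketch below demonstrates the mechanism with a harmless command.

```python
import pickle

# Pickle files can embed instructions that run on load: any object can define
# __reduce__ to tell pickle "reconstruct me by calling this function".
class Malicious:
    def __reduce__(self):
        import os
        return (os.system, ("echo this code runs on load",))

payload = pickle.dumps(Malicious())

# The victim only calls pickle.loads / pickle.load on the file...
pickle.loads(payload)  # ...and the embedded command executes immediately.
```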
.safetensors (stripped tensor data):
A stripped-down format that stores only tensor data (model weights), without any executable code or extraneous metadata. It is only compatible with the most common frameworks like PyTorch or TensorFlow, but it is safe from malicious code, smaller, and much faster to load and run.
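A minimal sketch of writing and reading such a file with the safetensors library and PyTorch tensors (tensor and file names are placeholders):

```python
import torch
from safetensors.torch import save_file, load_file

# Only plain tensors go in: no Python objects, so nothing can execute on load.
tensors = {
    "embedding.weight": torch.randn(1000, 64),
    "classifier.weight": torch.randn(10, 64),
}
save_file(tensors, "model.safetensors")

# Loading is fast because the file is just raw tensor bytes plus a small header.
loaded = load_file("model.safetensors")
print(loaded["embedding.weight"].shape)  # torch.Size([1000, 64])
```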
.onnx (Open Neural Network Exchange):
Also an unoptimized storage format, but one that aims for compatibility instead. While it is not as fast or as small as the other options, it is safe to use and highly compatible with most frameworks, making it a great choice for transferring models between different training and deployment frameworks before converting to a format better suited to the specific target environment.
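As a rough sketch of that workflow, a trained PyTorch model can be exported to ONNX with a single call (the model and file names here are placeholders):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the network you actually trained.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model.eval()

# Export traces the model, so it needs an example input of the right shape.
example_input = torch.randn(1, 16)
torch.onnx.export(model, example_input, "model.onnx",
                  input_names=["input"], output_names=["output"])

# The .onnx file can now be opened by other frameworks or runtimes
# (e.g. onnxruntime) and converted to a deployment-specific format.
```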
The file formats in this category are typically used either to share proofs of concept where optimization is not yet worthwhile, or to enable further modifications. If you want to fine-tune or retrain a model, you need one of these formats.
GPU-optimized LLM formats
Graphics Processing Units (GPUs, or "graphics cards") are great at parallel computations like the matrix multiplications needed for inference with Large Language Models (LLMs). These LLMs are usually trained with 32-bit floating point numbers (FP32), but GPUs can compute on lower-precision integers much faster. Because of this, most optimized model formats map the floating-point value range to lower-precision integers, a process known as model "quantization". Quantizing a model is a tradeoff: smaller integers like 8-bit (INT8) or 4-bit (INT4) need less memory and allow faster execution, but they also lose more output quality.
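To make the tradeoff concrete, here is a small worked sketch of symmetric INT8 quantization; real formats use more elaborate schemes, but the scale, round, and dequantize steps are the same.

```python
import numpy as np

# Original FP32 weights.
weights = np.array([0.12, -0.83, 0.40, 1.05, -0.27], dtype=np.float32)

# Symmetric INT8 quantization: map the largest magnitude to 127.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)

# Dequantizing recovers an approximation; the difference is the quality loss.
dequantized = q.astype(np.float32) * scale
print(q)                      # e.g. [ 15 -100   48  127  -33]
print(dequantized - weights)  # small rounding errors on every weight
```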
There are multiple file formats optimized for LLM inference on GPUs:
.gptq (Generative Pre-trained Transformer Quantization):
One of the first widely adopted quantization formats, relying on one-shot 4-bit quantization of models using a GPT-based architecture. It allowed large LLMs with more than 100B parameters to run on consumer-grade graphics cards (although only on high-end gaming hardware). It comes with a significant loss in quality because the quantization indiscriminately reduces the precision of all weights, no matter how important each individual one is. It is quickly losing popularity now that more advanced GPU-optimized formats have been adopted.
.awq (Activation-aware Weight Quantization):
Improving on the naive quantization of GPTQ, the AWQ format takes a more fine-grained approach. It identifies the most important weights and skips them during quantization, leaving them at full precision. This tradeoff results in a marginally larger file size but significantly improved output quality. Additionally, it optimizes the matrix multiplication implementation to make better use of the GPU's resources, resulting in faster inference compared to GPTQ.
.exl2 (ExLlamaV2 quantization):
This format differs a lot from the previous ones. It is also roughly based on GPTQ, but it sorts weights into small groups (typically 16 values per group), then tries quantizing each group to 2-8 bit integers while measuring the quality loss against a test data set. This results in a dynamically selected quality loss per group of values, striking a balance between quality and size. Files using this format typically state their quantization as the average bits per weight, leading to confusing names like "Q4.5" (where some readers wonder how it can use "half a bit" for storage).
The GPU-optimized formats are built to run on both commercial and consumer-grade (high-end gaming) hardware, and they allow very large models to strike a reasonable balance between quality and operational cost.
Consumer-hardware optimized LLM formats
Since optimizations and size reductions have made running LLMs on consumer hardware feasible, some formats have leaned heavily into shrinking model size as much as possible with minimal quality loss, to enable local inference on even cheaper hardware. They also aim to make better use of existing hardware by offloading parts of a running model to the CPU and system memory if it does not fit entirely into graphics memory. While this comes with a steep performance penalty, it enables many low- to mid-cost systems to run local inference at all.
.ggml (Georgi Gerganov Machine Learning):
The first popular format to support shared CPU/GPU or even CPU-only inference, making it suitable for low-end hardware like consumer laptops or office computers. It natively supports 4/8/16-bit quantization to provide a better range of quality-to-size tradeoffs for end users. It has largely been replaced by its successor, GGUF.
.gguf (GGML Universal Format):
Improving on the GGML format's approach, GGUF also offers CPU offloading but takes consumer hardware optimization a step further by supporting multiple types of quantization, from simple conversion to lower-bit integers or floats to more advanced K-block quantization and even some k-means clustering support. This leads to a wide array of possible model variants and quality-to-size tradeoffs, allowing users to pick the file that best fits their own hardware. Even with all these variations, the basic rule remains: a smaller file means faster execution but lower quality, so focus on that when picking a model. The GGUF format is by far the most popular for running LLMs on consumer-grade hardware.
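A minimal sketch of running a GGUF file locally with the llama-cpp-python bindings; the file name is a placeholder, and n_gpu_layers controls how many layers are offloaded to graphics memory (0 keeps everything on the CPU).

```python
from llama_cpp import Llama

# Placeholder file name; "Q4_K_M" in the name indicates the quantization variant.
llm = Llama(
    model_path="some-model.Q4_K_M.gguf",
    n_ctx=4096,        # context window size
    n_gpu_layers=20,   # layers offloaded to the GPU; set to 0 for CPU-only inference
)

output = llm("Explain what a GGUF file is in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```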