Extracting video covers, thumbnails and previews with ffmpeg

Providing video files to viewers on websites or other UIs usually involves more than just the video itself: users expect thumbnails in indexes or search results, preview snippets for a quick content overview, timelines when seeking with the progress bar, and covers or storyboards to prevent layout shifts on page load. All of these can be created with ffmpeg, but finding the right commands can be difficult.

Cover image

A cover image (aka "poster") is what is displayed when the video itself is not loaded (yet). Mainly used as a placeholder for the real video element, it should have the same dimensions (width, height) as the actual video, to prevent layout shifting.

Finding a relevant frame to use as a cover image can be difficult because of intro sequences and frequent scene changes at the beginning of videos. To mitigate the problem on a best-effort basis, the command below skips the first 10s of the input video (as they are seldom relevant), then uses the thumbnail video filter to find a relevant frame based on its content compared to adjacent frames, making it better suited than simple scene change detection alone (like select='gt(scene,0.4)').
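
For comparison, a purely scene-change based extraction might look like this (a sketch; the 0.4 threshold is arbitrary and usually needs tuning per video, and no frame is selected if no cut exceeds it):

ffmpeg -ss 00:00:10 -i video.mp4 -vf "select='gt(scene,0.4)'" -frames:v 1 -an -update 1 -y cover.jpg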

The simplest approach to extract a cover image with ffmpeg is:

ffmpeg -ss 00:00:10 -i video.mp4 -frames:v 1 -an -vf "thumbnail,setsar=1" -update 1 -y cover.jpg

Note that the -ss flag comes before the -i flag, enabling fast input seeking for better performance. The -an flag disables audio decoding, and -update 1 silences a warning about writing to an output filename without a sequence pattern. -frames:v 1 stops output after a single frame (it does not change the framerate), and setsar=1 ensures square pixels, even for video streams with non-square sample aspect ratios.
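
To verify that cover and video dimensions actually match (and no layout shift will occur), the source dimensions can be read with ffprobe:

ffprobe -v error -select_streams v:0 -show_entries stream=width,height -of csv=s=x:p=0 video.mp4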

Thumbnails

Covers and thumbnails are almost identical, with two key differences: thumbnails are often much smaller than the video dimensions, and they are expected to have the same width/height across differing video inputs. These requirements can be met by extracting a frame (just like the cover above), then scaling it down while maintaining aspect ratio. The scaled image may be smaller than the thumbnail dimensions, so all remaining space is filled with a padding color. This keeps the thumbnail size constant without stretching the extracted video frame:

#!/bin/bash

input="video.mp4"
output="snapshot.jpg"
offset="00:00:10"
width="640"
height="360"
padding_color="ffffff"

ffmpeg -ss "$offset" -i "$input" -frames:v 1 -an \
 -vf "thumbnail,scale=$width:$height:force_original_aspect_ratio=decrease,pad=$width:$height:-1:-1:$padding_color,setsar=1" \
 -update 1 -y "$output"

The command was turned into a bash script this time, to make the adjustable options more obvious.

The only technical difference to the ffmpeg command extracting the video cover image lies in the video filters (-vf): after a representative frame is found using the thumbnail filter, it is scaled down to the desired thumbnail dimensions with the scale filter. The force_original_aspect_ratio=decrease option ensures the image is not stretched to the new dimensions, but scaled down while maintaining aspect ratio. This may leave the scaled-down image smaller than the thumbnail dimensions, so the pad filter fills any leftover space with a solid color (from a hex color code, ffffff for white in this example).
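
As a worked example, assume a hypothetical 1920x800 source frame and the 640x360 thumbnail from the script above:

scale=640:360:force_original_aspect_ratio=decrease   -> about 640x266 (aspect ratio kept, exact rounding is up to ffmpeg)
pad=640:360:-1:-1:ffffff                             -> 640x360, frame centered with roughly 47px white bars at top and bottom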

Storyboard

A storyboard is a different kind of cover, where a number of frames from the video are picked and arranged in a grid to use as a poster image for the video. While this is less ideal for small dimensions, it gives the viewer an immediate overview of the video contents before playing it.

To create a storyboard, you first need to know the number of images to produce (images per row * number of rows in the grid) and the interval at which to extract frames from the source video. To simplify the math and avoid dealing with aspect ratios altogether, pick an equal number of rows and columns, e.g. 4x4 (5x5, 6x6, ...), so images can be resized by simply dividing width/height by 4 (5, 6, ...).

#!/bin/bash

input="video.mp4"
output="storyboard.jpg"
grid="10"

duration=$(ffprobe -v error -select_streams v:0 -show_entries stream=duration -of csv=p=0 "$input")
frame_interval=$(echo "scale=2; $duration / ($grid * $grid)" | bc)
ffmpeg -i "$input" -vf "fps=1/$frame_interval,scale=iw/${grid}:ih/${grid},tile=${grid}x${grid}" \
      -frames:v 1 -fps_mode vfr -y "$output"

This time more logic is offloaded to bash: first the video duration is computed using ffprobe, then used to calculate the interval at which to pick frames from the video for the storyboard image (duration divided by the total number of images in the grid).

To extract frames at an even interval with ffmpeg, we simply set the fps to one frame per interval, leaving only the desired frames in the stream. The frames are then scaled down to fit neatly into the grid (by dividing the input width/height iw/ih by the number of images per row/column) and arranged in a tile layout in the output image. Using -fps_mode vfr is important to ensure consistent timestamps when dealing with variable framerate input video. The grid variable defines both the number of images per row and the number of rows (total images = grid*grid); restricting the grid to square layouts preserves the aspect ratio of the scaled-down images, avoiding the stretching and padding issues discussed for thumbnails above.
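
To make these numbers concrete, assume a hypothetical 600-second 1280x720 input with grid=10:

frame_interval = 600 / (10 * 10) = 6    # one frame every 6 seconds
fps=1/6                                 # keeps roughly 100 frames in total
scale=iw/10:ih/10                       # each frame becomes 128x72
tile=10x10                              # final storyboard image: 1280x720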

If unequal grid dimensions like 10x3 are desired, you need to either adjust the scaling to allow stretching, add padding, or leave empty spots in the grid.
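
A padded 10x3 variant could look like this (a sketch: the 320x180 tile size is an assumption, and frame_interval must be recomputed as duration divided by 30):

ffmpeg -i "$input" -vf "fps=1/$frame_interval,scale=320:180:force_original_aspect_ratio=decrease,pad=320:180:-1:-1:000000,tile=10x3" \
      -frames:v 1 -fps_mode vfr -y "$output"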

Note that using the thumbnail filter for storyboard frames is a tricky choice: while it does help in picking better frames, it takes far more resources and shifts the interval unpredictably, so the output may contain fewer images than expected, potentially leading to a mostly empty grid in the storyboard. You could make it work with some more complex logic, for example by first computing the interval, then running the cover extraction from above at each interval timestamp (adjusting the -ss flag) and saving the frames as images to disk, then rescaling and combining them with ffmpeg in a final step.
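
A minimal sketch of that approach, reusing the variables from the storyboard script above (the frame numbering scheme and the 4x4 grid are assumptions):

#!/bin/bash

input="video.mp4"
output="storyboard.jpg"
grid="4"

temp_dir=$(mktemp -d)
trap 'rm -r "$temp_dir"' EXIT

duration=$(ffprobe -v error -select_streams v:0 -show_entries stream=duration -of csv=p=0 "$input")
frame_interval=$(echo "scale=2; $duration / ($grid * $grid)" | bc)

for ((i = 0; i < grid * grid; i++)); do
  offset=$(echo "$i * $frame_interval" | bc)
  # let the thumbnail filter pick a representative frame near each offset
  ffmpeg -ss "$offset" -i "$input" -frames:v 1 -an \
    -vf "thumbnail,scale=iw/${grid}:ih/${grid},setsar=1" \
    -update 1 -y "$temp_dir/frame_$(printf '%03d' "$i").jpg"
done

# arrange the saved frames into the final grid
ffmpeg -i "$temp_dir/frame_%03d.jpg" -vf "tile=${grid}x${grid}" -frames:v 1 -y "$output"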

Be aware that ffprobe simply reads the video duration from the source video metadata. If the video's metadata is damaged, missing or in a non-standard format, ffprobe may return bad values for the duration. Decoding the video with ffmpeg instead will almost always return good results, but is also considerably slower and more costly in CPU/GPU resources. Whether you want to support corrupted videos and pay for it in higher system load varies by use case. If you need the script to always work, simply replace the duration computation in the script above with this one:

duration=$(ffmpeg -i "$input" -f null - 2>&1 | grep -o 'time=[0-9:.]*' | tail -n 1 | cut -d '=' -f 2 | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }')

Timeline

A timeline refers to a collection of frames taken at fixed intervals (for example every 5 seconds), assembled in a line from left to right. Most commonly, a timeline is used to show preview images when users hover over the progress bar of a video. Merging all the frames into a single timeline image has significant advantages in web scenarios, where fetching every frame as a separate file would incur overhead from networking protocols and CPU load from network encryption and image decoding. A single image incurs this penalty only once, and is usually small enough in size to be well worth the tradeoff.

Creating a timeline needs an interval at which to pick frames from the video and the desired height of each frame (used to resize the frame while maintaining aspect ratio). We can then compute the total number of frames as duration/interval, which we need for the tile filter:

#!/bin/bash

input="video.mp4"
output="timeline.jpg"
interval=5
frame_height=120

duration=$(ffprobe -v error -select_streams v:0 -show_entries stream=duration \
             -of csv=p=0 "$input")
num_frames=$(echo "$duration / $interval" | bc)
ffmpeg -i "$input" \
  -vf "select='not(mod(t,$interval))',scale=-2:$frame_height,tile=${num_frames}x1"\
  -frames:v 1 "$output"

In order to select only frames at the specified interval, we use the select filter with the expression not(mod(t,$interval)), which only picks frames whose timestamps are exactly divisible by the interval (i.e. have a remainder of 0). Each selected frame is then scaled to the desired height (automatically computing the correct width to maintain aspect ratio), and finally the frames are assembled with the tile filter into a grid with just one row (a single line).
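
With concrete numbers, a hypothetical 300-second 16:9 video and the script's defaults produce:

num_frames = 300 / 5 = 60
scale=-2:120   -> each frame becomes about 214x120 (-2 rounds the width to an even value)
tile=60x1      -> a single strip of roughly 12840x120 pixels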

As with the storyboard above, the duration computation may need some more consideration, and the thumbnail filter is not a good choice as it may produce unreliable frame counts.

Animated preview videos

A preview is a video made up of short excerpts from a source video, often used in place of (or in combination with) thumbnails: hovering over a thumbnail automatically plays the preview video, giving the viewer a brief overview of the video contents without leaving the current view (like search results). Preview videos are typically expected to behave like thumbnails in terms of resizing, aspect ratio and optional padding, except as a muted video made of snippets evenly spaced throughout the source video.

#!/bin/bash

input="video.mp4"
output="preview.mp4"
num_snippets=3
snippet_duration=5
output_width=200
output_height=200
padding_color="000000"

temp_dir=$(mktemp -d)
trap "rm -r $temp_dir" EXIT

duration=$(ffprobe -v error -select_streams v:0 -show_entries stream=duration -of csv=p=0 "$input")
interval=$(echo "($duration - $snippet_duration) / ($num_snippets - 1)" | bc -l)

current_time=0
for ((index=0; index<num_snippets; index++)); do
 ffmpeg -ss "$current_time" -i "$input" -t "$snippet_duration" \
   -vf "scale=${output_width}:${output_height}:force_original_aspect_ratio=decrease,pad=${output_width}:${output_height}:-1:-1:${padding_color}" \
   -an -y "$temp_dir/snippet_$index.mp4"
 current_time=$(echo "$current_time + $interval" | bc)
done

for f in "$temp_dir"/*.mp4; do echo "file '$f'"; done > "$temp_dir/files.txt"
ffmpeg -f concat -safe 0 -i "$temp_dir/files.txt" -c copy -y "$output"

There is a lot going on in this script, but it can be broken down into more reasonable steps:

  1. We create a temporary directory and a trap to automatically remove it when the script exits. The directory is needed to store the extracted snippets temporarily, before we can concatenate them into the final output video.
  2. A dynamic interval for snippet extraction is computed from the video duration, snippet count and snippet length.
  3. The script loops over the video, extracting a snippet of the desired length at each previously calculated interval timestamp and saving it to the temporary directory.
  4. The names of the files in the temporary directory are written to a text file in a loop, so ffmpeg can read them in order (see the example after this list).
  5. We run ffmpeg to concatenate the extracted snippets into a single output video.
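
For reference, the files.txt produced in step 4 is just a plain list in ffmpeg's concat demuxer format (the actual paths depend on the mktemp directory):

file '/tmp/tmp.a1B2c3d4/snippet_0.mp4'
file '/tmp/tmp.a1B2c3d4/snippet_1.mp4'
file '/tmp/tmp.a1B2c3d4/snippet_2.mp4'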

The most complex step of the process is the third, as it combines the scale and pad filters from the thumbnail section with the new snippet extraction logic and the -an flag to keep audio out of the extracted snippets.

The script above works, but is intentionally kept simplistic to showcase only how to solve the problem at hand. There is a lot to consider here, like edge cases of a video duration too short for the selected snippet count/duration, the potentially unreliable offset calculation, or adjustments to the extraction logic (e.g. if you wanted unlimited snippets at static intervals). Treat this script as a starting point to customize, not a production-ready solution.
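
As one example of such an edge case, a minimal guard against too-short inputs might look like this (a sketch; the variable names match the preview script above):

# abort if the video is shorter than the total footage we want to extract
min_duration=$(echo "$num_snippets * $snippet_duration" | bc)
if (( $(echo "$duration < $min_duration" | bc) )); then
  echo "video is too short for $num_snippets snippets of ${snippet_duration}s" >&2
  exit 1
fi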
