Checking disk health in Linux

Hard drives are among the most important building blocks of any computer system, and drive problems often lead to time-consuming backup restoration or even permanent loss of data. Luckily, you can check a drive's health and failure indicators before it stops responding.

Understanding SMART

Modern drives ship with built-in tools that collect and log signals which can help identify drive issues long before the disk actually fails. This is called Self-Monitoring, Analysis and Reporting Technology, or SMART for short. On Linux, you can interact with SMART using the smartctl tool. On Debian-based systems, it is available in the smartmontools package:

sudo apt install smartmontools

Note that you need root privileges to run smartctl.

Finding the correct disk

If you are already familiar with the disks on your system, it may be enough to let smartctl scan for hard drives:

sudo smartctl --scan

Sample output may look like this:

/dev/sda -d scsi # /dev/sda, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device

This is not exactly a lot of information. If you are unsure which disk name you are looking for, you can use lsblk to see which partitions each disk contains and where they are mounted:

lsblk

The output shows what each disk is used for, making it easier to identify the one you are looking for:

NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda                      8:0    0   1.8T  0 disk
└─datastore            254:3    0   1.8T  0 crypt /data
sr0                     11:0    1  1024M  0 rom
nvme0n1                259:0    0 238.5G  0 disk
├─nvme0n1p1            259:1    0   512M  0 part  /boot/efi
├─nvme0n1p2            259:2    0   488M  0 part  /boot
└─nvme0n1p3            259:3    0 237.5G  0 part
  └─nvme0n1p3_crypt    254:0    0 237.5G  0 crypt
    ├─lab--vg-root     254:1    0 236.5G  0 lvm   /
    └─lab--vg-swap_1   254:2    0   980M  0 lvm   [SWAP]
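
If you only need the whole disks, for example to hand them to smartctl one at a time, lsblk can print a stripped-down list that is easy to filter. The following is a minimal sketch; the helper name disk_paths is made up for this example, and it assumes the two-column NAME/TYPE output produced by lsblk -dn -o NAME,TYPE.

```shell
#!/bin/sh
# Read `lsblk -dn -o NAME,TYPE` output on stdin and print a /dev path
# for every whole disk, skipping other entries such as optical drives.
# disk_paths is a hypothetical helper name used for illustration only.
disk_paths() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}
```

You would call it as lsblk -dn -o NAME,TYPE | disk_paths. With the disks from the sample above, the sr0 optical drive is dropped and only sda and nvme0n1 remain.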

Getting information about a drive

First of all, let's make sure the drive supports SMART at all. Most modern drives do, but there are exceptions, and older drives may lack it entirely. To get general information, you can use smartctl -i:

sudo smartctl -i /dev/sda

The output contains general information such as the model and serial number, along with technical details like sector size and capacity:

=== START OF INFORMATION SECTION ===
Model Family:    Samsung based SSDs
Device Model:    Samsung SSD 870 QVO 2TB
Serial Number:   S5RPNF0TC00009A
LU WWN Device Id: 5 002538 f42c0a583
Firmware Version: SVQ02B6Q
User Capacity:   2,000,398,934,016 bytes [2.00 TB]
Sector Size:     512 bytes logical/physical
Rotation Rate:   Solid State Device
Form Factor:     2.5 inches
TRIM Command:    Available, deterministic, zeroed
Device is:       In smartctl database 7.3/5319
ATA Version is:  ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:   Thu Apr 11 19:01:44 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

While some of these details may be of interest to you, the only part that matters here is the last two lines, which state that SMART is available and enabled for the drive.
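
If you want to check those two lines from a script instead of by eye, a small filter is enough. This is only a sketch: smart_ready is a hypothetical helper name, and it assumes the ATA-style "SMART support is:" lines shown above, which other device types may not print.

```shell
#!/bin/sh
# Read `smartctl -i` output on stdin and succeed only if SMART is reported
# as both available and enabled. Assumes the two "SMART support is:" lines
# from the sample output; smart_ready is a made-up name for this example.
smart_ready() {
    awk '
        /^SMART support is:/ && /Available/ { available = 1 }
        /^SMART support is:/ && /Enabled/   { enabled = 1 }
        END { exit !(available && enabled) }
    '
}
```

A possible use is sudo smartctl -i /dev/sda | smart_ready && echo "SMART is ready", which stays silent when either line is missing or reports a different state.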

To get more detailed information about which SMART capabilities our drive supports, we can use smartctl -c:

sudo smartctl -c /dev/sda

The output may look like this:

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (   0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 160) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

The left side names each capability; the right side shows its current value and which features the disk in question supports.
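
The recommended polling times are worth noting, since they tell you roughly how long to wait before reading self-test results. As a rough sketch, the helper below pulls the short-test value out of smartctl -c output; the name short_test_minutes is invented for this example, and it assumes the two-line layout shown above.

```shell
#!/bin/sh
# Read `smartctl -c` output on stdin and print the recommended polling
# time (in minutes) for the short self-test. Assumes the value sits on
# the line right after "Short self-test routine", as in the sample.
# short_test_minutes is a hypothetical helper name.
short_test_minutes() {
    awk '
        /^Short self-test routine/ { want = 1; next }
        want && /recommended polling time/ {
            gsub(/[^0-9]/, "")   # strip everything except the digits
            print
            exit
        }
    '
}
```

For example, sudo smartctl -c /dev/sda | short_test_minutes prints just the number, which a script could pass to sleep after starting a test.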

Testing drive health

While SMART collects and logs potential issues passively, you can also run a self-test manually to verify drive health. First, start a short self-test:

sudo smartctl -t short /dev/sda

The output will look similar to this:

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Apr 12 13:10:11 2024 CEST
Use smartctl -X to abort test.

The short test is typically enough to find issues, but you can run a long test instead (-t long) for a more thorough check. The test now runs in the background; once it completes, you can view the results with the -l flag:

sudo smartctl -l selftest /dev/sda

The results of all recently run tests will be listed in the output:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description   Status                 Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline      Completed without error      00%     6576        -
# 2 Short offline      Completed without error      00%     6572        -

In this sample, the tests completed successfully and no errors were found.
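
To make the same judgement in a script, you can look at the newest entry in the log. The following is a minimal sketch under the assumption that the newest test is always listed as "# 1" and that a passing test reads "Completed without error", as in the sample; last_selftest_ok is a hypothetical helper name.

```shell
#!/bin/sh
# Read `smartctl -l selftest` output on stdin and succeed only if the
# newest log entry ("# 1") reads "Completed without error".
# Fails when the log is empty or the entry has any other status.
# last_selftest_ok is a made-up name for this example.
last_selftest_ok() {
    awk '
        /^# *1 / {
            found = 1
            ok = ($0 ~ /Completed without error/)
            exit
        }
        END { exit !(found && ok) }
    '
}
```

It could be used as sudo smartctl -l selftest /dev/sda | last_selftest_ok || echo "check /dev/sda", so that a message only appears when the latest test did not pass or no test has been logged.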

If errors occurred, you can check which ones were logged as well:

sudo smartctl -l error /dev/sda

Since we have no errors, the output reports no problems:

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged

SMART can help predict drive failure and allows system administrators to replace unreliable storage hardware before data loss occurs. Ensuring SMART is enabled on all drives, and regularly running tests and checking their results, is key to keeping storage reliable while reducing the risk of surprise outages. A monitoring system could use a task scheduler like cron to check and report drive health in the background, or a system operator could routinely check the output of sudo smartctl -a /dev/sda for problems or pre-fail indicators.
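
For unattended checks such as a cron job, the exit status of smartctl is easier to act on than its text output, because smartctl(8) documents it as a bitmask. The sketch below decodes that bitmask. Note that explain_smartctl_status is a hypothetical helper name and the messages are paraphrased from the man page, so treat it as a starting point rather than a finished monitoring script.

```shell
#!/bin/sh
# Print one line per bit set in a smartctl exit status and return
# non-zero if any bit is set. Bit meanings follow the "EXIT STATUS"
# section of smartctl(8); the wording here is paraphrased.
# explain_smartctl_status is a made-up name for this example.
explain_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    [ $((status & 1)) -ne 0 ]   && echo "command line did not parse"
    [ $((status & 2)) -ne 0 ]   && echo "device open failed"
    [ $((status & 4)) -ne 0 ]   && echo "a SMART command failed"
    [ $((status & 8)) -ne 0 ]   && echo "SMART status: DISK FAILING"
    [ $((status & 16)) -ne 0 ]  && echo "prefail attributes at or below threshold"
    [ $((status & 32)) -ne 0 ]  && echo "attributes were at or below threshold in the past"
    [ $((status & 64)) -ne 0 ]  && echo "error log contains errors"
    [ $((status & 128)) -ne 0 ] && echo "self-test log contains errors"
    return 1
}
```

A scheduled job might run sudo smartctl -a /dev/sda >/dev/null followed by explain_smartctl_status $? and send the printed lines to whoever is responsible for the machine.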