Hard drives are one of the most important building blocks of any computer system, and problems with them often lead to time-consuming backup restoration or even permanent loss of data. Luckily, you can check a drive's health and failure indicators before it stops responding.
Understanding SMART
Modern drives ship with built-in tools to collect and log signals that can help identify drive issues long before the disk actually fails. This is called Self-Monitoring, Analysis and Reporting Technology, or SMART for short. On Linux, you can interact with SMART using the smartctl tool. On Debian-based systems, it is available in the smartmontools package:
sudo apt install smartmontools
Note that you need root privileges to run smartctl.
Finding the correct disk
If you are already familiar with the disks on your system, it may be enough to let smartctl scan for hard drives:
sudo smartctl --scan
Sample output may look like this:
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
This is not exactly a lot of information. If you are unsure which disk name you are looking for, you can use lsblk to see which partitions each disk contains and where they are mounted:
lsblk
The output gives more context about what each disk is used for, making it easier to identify the disk you are looking for.
NAME                    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda                       8:0    0   1.8T  0 disk
└─datastore             254:3    0   1.8T  0 crypt /data
sr0                      11:0    1  1024M  0 rom
nvme0n1                 259:0    0 238.5G  0 disk
├─nvme0n1p1             259:1    0   512M  0 part  /boot/efi
├─nvme0n1p2             259:2    0   488M  0 part  /boot
└─nvme0n1p3             259:3    0 237.5G  0 part
  └─nvme0n1p3_crypt     254:0    0 237.5G  0 crypt
    ├─lab--vg-root      254:1    0 236.5G  0 lvm   /
    └─lab--vg-swap_1    254:2    0   980M  0 lvm   [SWAP]
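If you only want the whole disks rather than every partition and mapping, lsblk can be asked to leave out the nested entries. The following is a minimal sketch and not part of the original walkthrough: the helper name disks_only is made up for illustration, and it assumes lsblk is called with -d (no nested devices), -n (no header line) and -o NAME,TYPE (only these two columns).

```shell
# Read "NAME TYPE" pairs on stdin and print the device path of every entry
# whose type is "disk". Other types (rom, part, crypt, lvm, ...) are skipped.
disks_only() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}

# Intended usage:
#   lsblk -d -n -o NAME,TYPE | disks_only
```

On the sample system above this would print /dev/sda and /dev/nvme0n1 and skip the optical drive sr0.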
Getting information about a drive
First of all, let's make sure the drive supports SMART at all. Most modern drives do, but there are exceptions (and older drives, of course). To get general information, you can use smartctl -i:
sudo smartctl -i /dev/sda
The output contains general information such as the model and serial number, as well as technical details like sector size and capacity:
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 870 QVO 2TB
Serial Number: S5RPNF0TC00009A
LU WWN Device Id: 5 002538 f42c0a583
Firmware Version: SVQ02B6Q
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database 7.3/5319
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Apr 11 19:01:44 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
While some parts may be of interest to you, the only thing we are looking for here is the last two lines, which state that SMART is available and enabled for the drive.
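When checking more than one drive, reading those two lines by eye gets tedious. A minimal sketch of how the check could be scripted follows; the function name smart_ready is hypothetical, and it assumes the "SMART support is:" lines appear as in the ATA sample above (other device types, such as NVMe, may word their information section differently).

```shell
# Succeed (exit status 0) only if the `smartctl -i` output on stdin reports
# SMART as both available and enabled; fail otherwise.
smart_ready() {
    awk '
        /^SMART support is: +Available/ { available = 1 }
        /^SMART support is: +Enabled/   { enabled = 1 }
        END { exit !(available && enabled) }
    '
}

# Intended usage (needs root):
#   sudo smartctl -i /dev/sda | smart_ready && echo "SMART is ready"
```

If SMART is available but reported as disabled, it can usually be turned on with sudo smartctl -s on /dev/sda.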
To get more detailed information about which SMART capabilities our drive supports, we can use smartctl -c:
sudo smartctl -c /dev/sda
The output may look like this:
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 160) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
The left side lists each capability; the right side shows which parts of it the disk in question supports.
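One value in this output that is handy for scripting is the recommended polling time, which tells you roughly how long a self-test is expected to take. The sketch below pulls the short self-test value out of the smartctl -c output; the function name short_test_minutes is made up for illustration, and the parsing assumes the two-line layout shown in the sample above.

```shell
# Print the recommended polling time in minutes for the short self-test,
# given `smartctl -c` output on stdin.
short_test_minutes() {
    awk '
        prev ~ /^Short self-test routine/ && /recommended polling time/ {
            gsub(/[^0-9]/, "")   # strip everything except the digits
            print
            exit
        }
        { prev = $0 }
    '
}

# Intended usage (needs root):
#   sudo smartctl -c /dev/sda | short_test_minutes
```

A script could multiply the result by 60 and pass it to sleep before reading the test results.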
Testing drive health
While SMART collects and logs potential issues passively, you can also run a self-test manually to verify drive health. First, start a self-test:
sudo smartctl -t short /dev/sda
The output will look similar to this:
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Apr 12 13:10:11 2024 CEST
Use smartctl -X to abort test.
The short test will typically be enough to find issues, but you can also run a long test instead (-t long) for a more detailed drive test. The test now runs in the background. Once it completes, you can view the results with the -l flag:
sudo smartctl -l selftest /dev/sda
The results of all recently run tests will be listed in the output:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 6576 -
# 2 Short offline Completed without error 00% 6572 -
In this sample, the tests completed successfully and no errors were found.
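For unattended checks, the same log can be filtered so that only entries needing attention remain. This is a rough sketch under the assumption that log entries start with "#" and a number, as in the sample above; the function name selftest_problems is hypothetical. Note that it also lists tests that were aborted or are still running, not only failed ones.

```shell
# Print every self-test log entry from `smartctl -l selftest` output (stdin)
# whose status is anything other than "Completed without error".
selftest_problems() {
    awk '/^# *[0-9]+ / && !/Completed without error/ { print }'
}

# Intended usage (needs root):
#   sudo smartctl -l selftest /dev/sda | selftest_problems
```

Empty output means every logged test completed without error.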
If errors occurred, you can check which errors were logged as well:
sudo smartctl -l error /dev/sda
Since we have no errors, the output reports no problems:
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged
SMART can help predict drive failure and allows system administrators to replace unreliable storage hardware before data loss occurs. Ensuring SMART is enabled on all drives, and running and checking tests regularly, is key to keeping storage reliable while reducing the risk of surprise outages. A monitoring system could use a task scheduler like cron to check and report drive health in the background, or a system operator could routinely check the output of sudo smartctl -a /dev/sda for problems or pre-fail indicators.
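A scheduled check does not have to parse the full report. According to the smartctl man page, the exit status is a bitmask in which individual bits signal specific findings. The sketch below decodes the bits that matter most for health monitoring. The function name explain_smartctl_status, the message wording, and the script path in the crontab comment are all made up for illustration, so verify the bit meanings against the man page of your installed version.

```shell
# Translate a smartctl exit status (passed as $1) into warnings.
# Bit values follow the "EXIT STATUS" section of the smartctl man page.
explain_smartctl_status() {
    rc=$1
    [ $(( rc & 8 )) -ne 0 ]   && echo "SMART status check returned DISK FAILING"
    [ $(( rc & 16 )) -ne 0 ]  && echo "prefail attributes at or below threshold"
    [ $(( rc & 64 )) -ne 0 ]  && echo "error log contains records of errors"
    [ $(( rc & 128 )) -ne 0 ] && echo "self-test log contains records of errors"
    return 0
}

# Intended usage (needs root); -q silent suppresses the report itself:
#   smartctl -q silent -a /dev/sda; explain_smartctl_status $?
#
# Hypothetical root crontab entry running such a script daily at 03:00:
#   0 3 * * * /usr/local/sbin/check-smart.sh
```

No output means none of the decoded warning bits were set. Cron typically mails any output of a job to its owner, which would turn this into a simple alert.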