NVMe tuning on Linux servers: a complete optimization guide

Last update: 17/12/2025
Author: Isaac
  • The actual performance of NVMe on Linux depends as much on the hardware as on the kernel, the file system, and the database.
  • Adjustments such as disabling APST, choosing the appropriate scheduler, fine-tuning NUMA, and using periodic TRIM improve latency and stability.
  • InnoDB requires specific tuning (buffer pool, logs, flush method) to leverage NVMe on MySQL/MariaDB loads.
  • Well-designed indexes, optimized queries, and continuous monitoring are key for NVMe tuning to have a real impact.

NVMe settings on Linux servers

If you work with Linux servers and have made the switch to NVMe drives, you've probably noticed that, although they are much faster than HDDs or even SATA SSDs, you don't always see the spectacular figures promised in the spec sheets in production. It's not that the hardware has let you down: it's usually the operating system, the kernel configuration, and the services that are holding you back.

In this article we're going to see how to perform serious NVMe tuning on Linux servers. We'll consider both pure performance (latency, IOPS, and bandwidth) and drive durability. Furthermore, we'll integrate it with the database layer (MySQL/MariaDB, especially InnoDB) and with best practices for SSD and NVMe usage to ensure your platform performs optimally. The goal is for you to move seamlessly from lab tests with fio to real-world production workloads that truly fly.

Prerequisites and tools for optimizing NVMe on Linux

Before starting to change parameters, it is essential to have root access to the server and a good backup plan. Many of the settings we'll be discussing affect the kernel, partitioning, or even the device format, so any mistake could cause a service to crash or a volume to become unrecoverable.

You also need certain utilities for managing NVMe, monitoring I/O, and benchmarking. On Debian or Ubuntu you can install them with:

Installation on Debian/Ubuntu: sudo apt update
sudo apt install nvme-cli fio util-linux iotop sysstat numactl -y

On RHEL-based systems (Rocky, Alma, CentOS, etc.) they can be installed with:

Installation in RHEL and derivatives: sudo dnf update -y
sudo dnf install nvme-cli fio util-linux iotop sysstat numactl -y

With these tools you can identify your NVMe devices, check their health, update firmware on Linux, and measure latency and throughput objectively and repeatably, something critical before touching anything so you can compare the "before" and the "after".

Discover, identify, and measure your NVMe devices

NVMe performance monitoring on Linux

The first step is to list which NVMe drives you actually have and what their specifications are. You can see this very clearly with nvme-cli, without having to struggle with cryptic names in /dev.

List NVMe devices: nvme list

If you want to go deeper, the following command shows the NVMe controller capabilities (command queues, maximum transfer size, etc.):

NVMe controller details: nvme id-ctrl -H /dev/nvme0

And to see the information on the “namespace” (LBA formats, sector sizes and relative performance ranking):

NVMe namespace details: nvme id-ns -H /dev/nvme0n1

Before you start fine-tuning the system, it's a good idea to save a performance baseline with fio. For example, you can measure the latency of random 4K reads with direct I/O:

Base benchmark (fio randread 4K): fio --name=randread --filename=/dev/nvme0n1 --rw=randread --bs=4k \
--iodepth=64 --numjobs=1 --ioengine=io_uring --direct=1 --time_based=1 --runtime=20

And the sequential read throughput at 128K:

Base benchmark (fio seqread 128K): fio --name=seqread --filename=/dev/nvme0n1 --rw=read --bs=128k \
--iodepth=64 --numjobs=1 --ioengine=io_uring --direct=1 --time_based=1 --runtime=20

To understand how the system as a whole behaves, it is helpful to supplement this with iostat and smart-log:

iostat / smart-log checks: iostat -x 2 10
nvme smart-log /dev/nvme0

This way you'll know whether your bottleneck is in queue service times, temperature, media errors, or simply in the upper layers (filesystem, database, etc.).

NVMe Power Management (APST) for low latencies

NVMe drives implement APST (Autonomous Power State Transition), a mechanism for saving energy by entering deeper low-power states. That's fine for laptops, but on a server where millisecond or even microsecond latency tails matter, those SSD "wake-ups" can introduce unpleasant latency spikes.

If you prioritize consistent latency over energy savings, you can disable APST by adjusting the nvme_core module parameter (see the kernel documentation). On systems with GRUB, it is enough to add:

Disable APST (GRUB): sudo sed -i 's/GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="nvme_core.default_ps_max_latency_us=0 /' /etc/default/grub
sudo update-grub

After restarting, the kernel will stop sending the drive into deep power-saving states and, in general, the p99 and p99.9 latency percentiles should improve, provided that the server cooling is adequate for the temperature increase.
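
After the reboot you can verify that the parameter actually took effect. A quick check (the sysfs path is standard for the nvme_core module, and feature ID 0x0c is the APST feature defined by the NVMe specification):

Verify APST state: cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
sudo nvme get-feature /dev/nvme0 -f 0x0c -H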

I/O scheduler suitable for NVMe and block layer settings

The I/O scheduler is the layer that decides in what order read and write requests are dispatched to the device. With mechanical drives, it made a lot of sense to use complex schedulers that optimized the physical movement of the heads, but with NVMe that's unnecessary and can even be a hindrance.

In most modern distributions, NVMe drives already use the "none" scheduler (or mq-deadline in certain cases). Even so, it's worth checking:

Verify I/O scheduler: cat /sys/block/nvme0n1/queue/scheduler

If you see something different and your workload is primarily low-latency random I/O, you can force the scheduler to "none":

Force scheduler 'none': echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

For highly mixed workloads with intensive write operations, "mq-deadline" can provide better balance between throughput and latency:

Force scheduler 'mq-deadline': echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler

In addition to the scheduler, the block layer offers two other important controls: read-ahead and maximum request size. Read-ahead (read_ahead_kb) reads more data in advance than has been requested, which is useful for long sequential accesses but counterproductive for random ones.

Typical read_ahead_kb values: echo 128 | sudo tee /sys/block/nvme0n1/queue/read_ahead_kb # random workloads (databases)
echo 4096 | sudo tee /sys/block/nvme0n1/queue/read_ahead_kb # large sequential (backups, video, etc.)

Regarding the maximum request size (max_sectors_kb), it is advisable to align it with the optimal I/O size of the drive and not exceed the hardware maximum:


Consult optimal I/O sizes: cat /sys/block/nvme0n1/queue/optimal_io_size
cat /sys/block/nvme0n1/queue/max_hw_sectors_kb

Once you know this, you can set a reasonable value, for example 1024 KB:

Set max_sectors_kb: sudo sh -c 'echo 1024 > /sys/block/nvme0n1/queue/max_sectors_kb'

Fine-tune CPU and NUMA affinity for NVMe

In modern servers with multiple CPUs or sockets (NUMA), memory and device access latency is not uniform. An NVMe drive is physically close to a specific NUMA node, and if the threads using it reside on another node, each I/O crosses the inter-socket bus and pays the resulting latency toll.
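
A quick way to see which node the drive hangs from is to query sysfs (a minimal check; it assumes the controller is nvme0, and the exact path can vary slightly between kernel versions):

Check the NVMe's NUMA node: cat /sys/class/nvme/nvme0/device/numa_node
numactl --hardware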

To avoid that, make sure that both the NVMe interrupts and the processes that use the drive are attached to the same NUMA node. First, find out which IRQs the NVMe drive uses:

Locate NVMe IRQs: grep -i nvme /proc/interrupts

Then you can assign, for example, all IRQs from nvme0 to cores 0-3:

Assign IRQ affinity to cores: for i in $(grep -i nvme0 /proc/interrupts | awk -F: '{print $1}'); do
echo 0-3 | sudo tee /proc/irq/$i/smp_affinity_list
done

When you launch your main database or service, even in virtualization with KVM, use numactl to pin CPU and memory to the same node:

Run service with numactl: numactl --cpunodebind=0 --membind=0 your-binary --your-options

Another small block-layer adjustment is rq_affinity, which indicates how I/O completions are processed. A value of 2 forces the completion to be handled on the same core that issued the request, improving cache locality.

Activate rq_affinity=2: echo 2 | sudo tee /sys/block/nvme0n1/queue/rq_affinity

Make NVMe tuning changes persistent with udev

All changes made by touching files in /sys are lost upon rebooting. To avoid having to run scripts manually each time, the cleanest approach is to use udev rules that apply the tuning as soon as the kernel detects the device.

For example, to set the scheduler to none, rq_affinity to 2, and read_ahead_kb to 128 KB on all NVMe drives, you could create:

Create a udev rule for tuning: sudo nano /etc/udev/rules.d/60-nvme-tuning.rules

With the content:

Example of a udev rule: ACTION=="add|change", KERNEL=="nvme*n*", \
ATTR{queue/rq_affinity}="2", \
ATTR{queue/scheduler}="none", \
ATTR{queue/read_ahead_kb}="128"

Then you reload udev and trigger the rules:

Apply udev rules: sudo udevadm control --reload
sudo udevadm trigger

If you also want to ensure that all mounted filesystems benefit from options like noatime, or that temporary directories live in tmpfs, combine these rules with a good /etc/fstab configuration, which also reduces writes and helps the longevity of SSDs and NVMe drives.

Partition alignment, logical sector and TRIM on NVMe

For an NVMe drive to perform as it should in the long term, kernel parameters alone are not enough: the logical geometry of the partitions, the use of TRIM, and even technologies like persistent memory storage make a difference, especially in database and virtualization workloads where fragmentation and overwriting are constant.

The first thing to do is check that your partitions are aligned to 1 MiB, which usually fits well with the internal geometry of most modern SSDs. With parted you can create a clean layout:

Aligned partitioning (parted): sudo parted -s /dev/nvme0n1 mklabel gpt
sudo parted -s /dev/nvme0n1 mkpart primary 1MiB 100%
sudo parted -s /dev/nvme0n1 align-check optimal 1

In addition, many NVMe drives offer several LBA formats (LBAF) with different sector sizes, typically 512B and 4K. You can view the options and the RP (Relative Performance) metric with:

View LBA formats: nvme id-ns -H /dev/nvme0n1 | grep -E 'LBA Format|Relative'

If you decide to change it, be aware that this is a destructive operation that erases the entire contents of the drive. An example of choosing the format with index 1 would be:

Change LBA format (destructive operation): sudo nvme format /dev/nvme0n1 --lbaf=1

As for TRIM, the idea is to inform the drive which blocks no longer contain valid data so that it can reuse them without penalty. First, check that the device supports it with lsblk:

Check TRIM support: lsblk --discard

If you see a non-zero DISC-GRAN value, there is support. Most recent distributions enable a weekly fstrim timer, which is the recommended option, better than using the discard mount option in /etc/fstab, which adds overhead to every deletion:

Enable periodic fstrim: sudo systemctl enable --now fstrim.timer
systemctl status fstrim.timer
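
If you want to force a manual pass right away (for example, after deleting a large amount of data), fstrim can also be run on demand on all mounted filesystems that support discard:

Run fstrim manually: sudo fstrim -av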

Choosing and installing file systems on NVMe

The filesystem layer is the final filter between your application and the NVMe. XFS and ext4 work very well on Linux servers, and in most cases it is advisable to use their default options with slight adjustments aimed at reducing unnecessary writes.

A very effective option with almost no downsides is noatime, which prevents updating the last access time every time a file is read. This reduces disk writes, which improves both drive longevity and performance. An example in /etc/fstab with ext4 would be:

Example fstab (ext4 noatime): /dev/nvme0n1p1 / ext4 noatime,errors=remount-ro 0 1

With XFS, the philosophy is the same: use the file system defaults and add noatime if you want to further extend the SSD's lifespan. With both ext4 and XFS, periodic TRIM via fstrim.timer is preferable to the discard mount option, except for very specific needs.

If you work with SD cards or removable media, keep in mind that many use FAT32 or exFAT, which have neither journaling nor TRIM. In these cases, tuning focuses more on choosing higher-quality, higher-capacity cards and on minimizing writes by moving /var/log, /tmp, or other I/O-heavy directories to tmpfs when it makes sense.

General SSD/NVMe optimization: noatime, tmpfs, swap, and logging

Besides NVMe-specific tuning, there's a set of classic SSD tweaks that remain very effective. The first is to check which directories are under constant read and write pressure, using iotop:


Real-time I/O monitoring: iotop -oPa

Let it run for a while and you'll see which processes and paths dominate the I/O. From there, you can move some highly volatile directories to tmpfs in RAM, provided you have plenty of memory. Typical examples in /etc/fstab:

Example tmpfs in fstab: tmpfs /tmp tmpfs defaults,noatime,mode=1777 0 0
tmpfs /var/tmp tmpfs defaults,noatime,mode=1777 0 0

In many modern Linux distributions, /tmp is already tmpfs, but you can still move /var/tmp or other noisy directories into RAM. Another option is /var/log, if you don't mind losing logs between reboots or if you send the logs to a remote syslog.

Example /var/log in tmpfs: tmpfs /var/log tmpfs defaults,noatime,mode=0755 0 0

In parallel, review the systemd-journald configuration in /etc/systemd/journald.conf to limit the maximum log size (SystemMaxUse, RuntimeMaxUse, etc.) and the level of detail (MaxLevelStore), avoiding unnecessary write storms.
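
As an illustrative sketch (the limits are assumptions to adapt to your own retention needs), a journald.conf fragment could look like this; apply it with sudo systemctl restart systemd-journald:

Example journald.conf limits: [Journal]
SystemMaxUse=200M
RuntimeMaxUse=50M
MaxLevelStore=notice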

Another key point for any SSD or NVMe is swap usage and the swappiness policy. Reducing this value tells the kernel to favor keeping pages in RAM and cuts down swap writes to disk:

Swappiness adjustment: vm.swappiness=1
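
To make the change persistent across reboots, the usual approach is to drop it into a sysctl fragment (the file name below is just an example) and reload:

Persist vm.swappiness: echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-swappiness.conf
sudo sysctl --system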

If you have plenty of memory and are aware of the risk, you could even set it to 0, although in practice it's usually more balanced to combine it with zram or zswap, which compress and manage swap in memory before it reaches the physical disk, greatly reducing writes to the NVMe.

Diagnosing real problems vs synthetic benchmarks (fio, dd, file copies)

It is not uncommon for an administrator to find that fio gives excellent numbers, yet a simple file copy or a database SELECT INTO seems ridiculously slow. It is crucial to understand that real-world loads do not resemble a perfect benchmark and that many more factors are involved.

Commands like dd with bs=1M usually show speeds limited by the filesystem layer, the kernel page cache, dd itself, and the way it writes. Similarly, a massive operation in MSSQL, MySQL, or MariaDB on ext4 or XFS measures not only the NVMe, but also journaling, fsync, log flushes, locking, CPU, the kernel scheduler, and much more.
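
A quick way to see that effect for yourself is to run the same dd write with and without bypassing the page cache; the target path is just an example and the two figures will usually differ dramatically:

dd with and without page cache: dd if=/dev/zero of=/mnt/data/test.img bs=1M count=4096 status=progress
dd if=/dev/zero of=/mnt/data/test.img bs=1M count=4096 oflag=direct conv=fdatasync status=progress

The first run mostly reports the speed of writing into RAM; the second is much closer to what the filesystem and the NVMe can actually sustain.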

If, on a server with EPYC CPUs, large amounts of RAM, and high-end NVMe drives, you see that backups don't exceed 1-2 GB/s while fio reports values close to the specification, the bottleneck is most likely in the software layer (filesystem, database, journaling configuration, log size, etc.) or in how parallelism is being used (number of effective threads, concurrency limits in the database, etc.).

The practical approach is to combine low-level tools (fio, iostat, smart-log) with application-level measurements and adjust in layers: first the kernel and NVMe, then the filesystem, and finally the database and the queries themselves.

The role of hardware and operating system in database performance

When we talk about tuning databases on NVMe, it's crucial to remember that the default configuration is rarely sufficient for production. The performance pyramid starts with the hardware: CPU, amount of RAM and, above all, the type of storage.

At the base of that pyramid is storage quality. Mechanical HDDs no longer make sense for serious transactional databases; SATA SSDs are the minimum acceptable; and NVMe drives connected via PCIe are the gold standard thanks to their low latency and very high IOPS.

Above that is the operating system: how it manages memory, the page cache, disk I/O, and process scheduling directly affects the database. Settings like swappiness, scheduler selection, the use of O_DIRECT in InnoDB, NUMA affinity, and mounting directories in tmpfs can make more of a difference than a simple internal parameter change in MySQL.

Only then does it make sense to fight with the database server configuration (InnoDB, global buffers, logs, etc.) and, at the top, with the design of schemas and indexes and with the quality of SQL queries.

MySQL/MariaDB on NVMe: fine-tuning InnoDB to exploit the hardware

InnoDB is the modern default storage engine for a reason: it offers ACID transactions, row-level locking, good resistance to corruption, and the ability to handle high concurrency. But for it to truly shine, its configuration needs to be aligned with the underlying hardware, especially with fast NVMe drives.

The star parameter is innodb_buffer_pool_size, which defines how much RAM InnoDB dedicates to keeping hot data and indexes in memory. As a general rule, on a dedicated database server, between 70% and 80% of the available RAM is reserved for the buffer pool, adjusting according to the other services running on the machine.

When the working set mostly fits in the buffer pool, reads are resolved in RAM and the NVMe is mainly used for sequential log writes and controlled flushing. Otherwise, each cache miss involves a trip to the disk, and even on NVMe the response time will be orders of magnitude slower than a memory access.

The other major tuning block is the InnoDB redo logs (innodb_log_file_size and innodb_log_buffer_size). A properly sized log allows more changes to be grouped into long sequential writes, reducing the pressure of random I/O on the data files:

  • Large log file: Higher write performance but longer recovery times after a crash, as more events need to be replayed.
  • Small log file: Faster recovery but potential bottleneck in intensive writing.

To take full advantage of NVMe, it's usually worth having logs somewhat larger than usual, because the drive can comfortably handle the sequential writes and the recovery time usually remains acceptable.
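
As a reference only, not a recipe, a my.cnf fragment along these lines combines both ideas; the figures are illustrative assumptions for a dedicated server with 64 GB of RAM, and note that MySQL 8.0.30+ replaces innodb_log_file_size with innodb_redo_log_capacity:

Example my.cnf (buffer pool and redo logs): [mysqld]
innodb_buffer_pool_size = 48G
innodb_log_file_size = 2G
innodb_log_buffer_size = 64M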

Concurrency, threads, and I/O in InnoDB over NVMe

In environments with many CPUs and fast NVMe drives, it's tempting to tweak parameters such as innodb_thread_concurrency, innodb_read_io_threads and innodb_write_io_threads. However, in current versions of MySQL/MariaDB, it is common practice to leave innodb_thread_concurrency at 0 so that the engine manages internal concurrency itself.


What matters is to make sure the server doesn't run short of threads for read and write operations, and that the lower layer (kernel and NVMe) is prepared to handle deep I/O queues when the load actually demands it. With NVMe there's usually no problem, but it's advisable to use tools like Performance Schema or the sys schema to check for anomalous I/O waits.

How InnoDB interacts with the file system is also critical. Setting innodb_flush_method=O_DIRECT on Linux avoids double buffering with the file system cache and writes directly to the device, which is especially recommended when there is a RAID controller with protected cache or an NVMe drive with a good internal cache.
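
A minimal sketch of that part of the configuration, with values that are only assumptions to adapt to your CPU count and workload:

Example my.cnf (flush method and I/O threads): [mysqld]
innodb_flush_method = O_DIRECT
innodb_read_io_threads = 8
innodb_write_io_threads = 8
innodb_thread_concurrency = 0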

Connection management, session buffers, and global caches

Fast NVMe isn't very useful if the database server is weighed down by thousands of idle connections, gigantic session buffers, or an outdated Query Cache configuration. Therefore, in addition to tuning InnoDB, the general parameters need to be put in order.

The variable max_connections defines the maximum number of simultaneous connections allowed. Blindly increasing it because "there's plenty of RAM" is a mistake: each connection drags along buffers and internal structures that accumulate in memory. The best practice is to monitor Max_used_connections and adjust max_connections slightly above the actual peak, leaving headroom but without going overboard.

wait_timeout specifies how long an idle connection is kept alive before being closed. Default values measured in hours don't make much sense for web applications with connection pools. Reducing it to 60 or even 30 seconds helps clear abandoned sessions and prevents filling memory with connections stuck in "Sleep".

The thread_cache_size parameter controls how many threads are cached for reuse by new connections. If Threads_created keeps growing quickly relative to Connections, the cache is too small. Adjusting it reduces the cost of thread creation and improves responsiveness during traffic bursts.
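
To size these parameters from data rather than guesswork, you can compare the relevant counters from the mysql client (standard status variables, shown here as a quick check):

Check connection and thread counters: mysql -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_created';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Connections';"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';"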

Regarding the Query Cache: it is obsolete in modern MySQL (removed in 8.0) and disabled by default in MariaDB, because it tends to cause more locking and invalidation problems than benefits on systems with high write rates. It offers nothing special on NVMe, and in most cases it's best to leave it disabled.

Working buffers per session: be careful with memory

Parameters like sort_buffer_size, join_buffer_size and read_buffer_size are allocated per thread when needed, which means that huge values multiplied by hundreds of connections can eat up RAM at full speed.

These buffers are used for specific operations (sorts without indexes, joins without indexes, sequential reads), and should only grow for specific cases of heavy, controlled queries. Globally, it's better to keep conservative values and, if necessary, raise them case by case in a maintenance session or batch process, rather than setting huge limits for all clients.
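
For a one-off maintenance task, you can raise a buffer only for the current session instead of globally; a minimal example (the 256 MB figure is just an assumption for a heavy ad hoc sort):

Raise a buffer per session (inside the mysql client): SET SESSION sort_buffer_size = 256 * 1024 * 1024;
-- run the heavy query in this same session; the value reverts when the session ends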

Schema, index, and query design: the layer where the most profit is made

No matter how much NVMe, kernel tuning, and InnoDB work you put in, if your tables lack appropriate indexes and your queries perform massive full scans, the hardware won't be able to save you. The essential tool for seeing what's going on is EXPLAIN.

When analyzing an execution plan, pay attention to:

  • type: If you see ALL in large tables, there is a full scan and you need to add indexes.
  • key and possible_keys: which indices could be used and which one is actually used.
  • rows: Estimated number of rows to examine; the lower, the better.

Indexing correctly is key: index the columns used in WHERE clauses and in JOIN conditions, and rewrite correlated subqueries as JOINs whenever possible. A good index design reduces the I/O required per query and thus makes better use of the NVMe's low latency.
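
As a toy illustration (the table and column names are invented for the example), adding a composite index on the filtered columns is often what turns a type ALL full scan into an indexed ref lookup:

Example index for a WHERE pattern (inside the mysql client): EXPLAIN SELECT id, total FROM orders WHERE customer_id = 42 AND status = 'paid';
ALTER TABLE orders ADD INDEX idx_customer_status (customer_id, status);
EXPLAIN SELECT id, total FROM orders WHERE customer_id = 42 AND status = 'paid';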

At the schema level, use data types that are as small as possible. Normalizing to a reasonable level and defining simple primary keys (INT/BIGINT AUTO_INCREMENT) in InnoDB helps make tables more compact, fit better in the buffer pool, and take advantage of the cache and the underlying hardware.

Continuous monitoring and iterative tuning cycle

All this tuning is pointless if you don't measure before and after each change. For the database, the Slow Query Log, Performance Schema, and the sys schema let you locate slow queries, tables with many full scans, underused indexes, or files that concentrate too much I/O.
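
Enabling the Slow Query Log is usually the first step; a minimal my.cnf sketch (the threshold and path are assumptions to tune):

Enable the Slow Query Log: [mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 1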

External tools such as Prometheus/Grafana, Percona PMM, or other monitoring systems make it easier to see the big picture: average and p95/p99 query latency, QPS, CPU usage, I/O queues, NVMe temperature and health, etc. With that information, you can apply a sensible iterative process:

  • Define a performance baseline (hardware + kernel + database).
  • Identify the dominant bottleneck at that time.
  • Apply a single change (e.g., adjust buffer pool, change scheduler, or add an index).
  • Measure again and decide whether the change stays, is reversed, or is modified.

Optimizing NVMe and databases on Linux is not a magic trick or an isolated parameter change, but a continuous process of measuring, understanding, and adjusting in layers. By combining good hardware (especially quality NVMe), a well-tuned kernel (APST, scheduler, NUMA, TRIM), intelligently configured file systems (noatime, tmpfs where appropriate), and a well-parameterized MySQL/MariaDB with solid schemas and indexes, it is perfectly possible to get your Linux servers to truly unleash the potential of their NVMe drives under real-world workloads, beyond synthetic laboratory benchmarks.
