- Cpusets are a cgroup subsystem that restricts which CPUs processes can run on and which memory nodes they can allocate memory from.
- The configuration is done through the cgroup file system, creating directory hierarchies and adjusting files such as cpuset.cpus and cpuset.mems.
- Additional controllers such as cpu, memory, or blkio allow limiting and accounting for CPU, RAM, and I/O, integrating cpusets into a global resource management strategy.
- In cgroup v2, controllers such as cpu and cpuset are combined into a single tree, allowing precise CPU quotas through files such as cpu.max.
If you work with Linux servers loaded with services, you've probably wondered how to distribute CPU and memory fairly between processes without one overpowering the others. Priorities with nice and classic limits help, but when everyone wants all the resources, the system becomes a jungle.
This is where two key parts of the kernel come into play: cgroups and the cpuset subsystem. Thanks to them you can decide precisely which processes use which CPUs and which memory nodes, monitor their consumption, limit their CPU or disk bandwidth, and even group them to create "soft partitions" within the same server.
What are cgroups and why do they matter?
Control groups (cgroups) are a Linux kernel mechanism for grouping processes and applying common resource policies to them: CPU, memory, disk I/O, devices, network, etc. They do not replace traditional permissions; they complement them from the perspective of how much each group consumes, not just what it has access to.
Compared with traditional tools such as nice, ionice or limits.conf, cgroups let you define group hierarchies in which each set of processes receives a predefined portion of resources, regardless of how many processes are inside each group.
In modern kernels, cgroups are exposed through a virtual file system (cgroup v1 or cgroup v2), normally mounted at /sys/fs/cgroup. Each subdirectory represents a group, and the files within it let you view and change its configuration.
cpusets: the subsystem for binding processes to CPUs and memory nodes
Within cgroups, cpuset is the controller (subsystem) responsible for restricting on which CPUs and which memory nodes a set of tasks can run and allocate memory. This is especially useful on large machines with many cores and a NUMA topology, but it also makes sense on more modest servers when you want to isolate workloads.
Each CPU set defines two fundamental sets:
- cpuset.cpus: the list of CPUs on which the group's tasks can run.
- cpuset.mems: the list of memory nodes on which those tasks can allocate memory.
When a process calls sched_setaffinity(), mbind() or set_mempolicy(), the kernel filters those requests so that they can only use CPUs and nodes included in its current cpuset. Furthermore, the scheduler will never schedule the task on a CPU outside its allowed mask, and the page allocator will not reserve memory on nodes outside its mems.
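You can see the masks the kernel will enforce for any task without touching the cgroup tree at all: /proc/&lt;pid&gt;/status exposes them as Cpus_allowed_list and (on NUMA-enabled kernels) Mems_allowed_list. A minimal check for the current shell:

```shell
# Show the CPU (and, if present, memory-node) masks currently
# enforced for this shell; these reflect its cpuset plus any
# sched_setaffinity() calls.
grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/self/status
# On an unrestricted 4-CPU, single-node machine this typically shows:
#   Cpus_allowed_list:   0-3
#   Mems_allowed_list:   0
```

After moving the shell into grupo-A as described below, the same command would report only the CPUs and nodes of that cpuset.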
Why CPUsets are so useful in large systems
On servers with many CPUs and memory distributed across multiple nodes, process and memory placement has a huge impact on performance. Accessing "distant" memory on a NUMA machine can be much slower than accessing memory local to the node where the CPU is running.
With cpusets you can create subsets of the machine (soft partitions) and assign entire jobs to them: for example, one group of cores and memory nodes for a database, another for the web front end, and another for compute-intensive workloads. These groups can be adjusted dynamically depending on system load, without interfering with jobs already running in other partitions.
It is a very common strategy in:
- Web servers that run multiple instances of the same service.
- Mixed machines with web, database and other daemons sharing hardware.
- NUMA and HPC clusters running scientific applications sensitive to memory latency.
Practical example: taming a greedy Apache with cgroups and cpuset
Imagine a server with two cores and an Apache server that launches child processes dynamically. Even though all processes have the same nice value, in practice the CPU available to the rest of the services will depend on the number of active Apache processes at any given time.
With normal priorities, things like this happen:
- Apache with 9 processes and another service with 1: the second one receives 10% of the CPU.
- Apache with 99 processes and the other with 1: the second one drops to 1% of the CPU.
The priority doesn't change, but the number of processes does, and that destroys fairness. With cgroups and cpuset you can create two groups, one for Apache and one for everything else, and tell the kernel that each group gets, say, 50% of the CPU, regardless of how many processes run inside each.
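The arithmetic behind those percentages is simply 100 divided by the total number of equally-prioritized CPU hogs; a tiny sketch using the numbers from the example above:

```shell
# With equal priorities, each busy process gets roughly
# 100 / (total competing processes) percent of the CPU.
share() { echo $(( 100 / ($1 + $2) )); }
share 9 1    # 9 Apache workers + 1 other service -> prints 10
share 99 1   # 99 Apache workers + 1 other service -> prints 1
```

The other service's share shrinks from 10% to 1% purely because Apache forked more children, which is exactly what per-group CPU quotas prevent.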
Mounting the cpuset file system (cgroup v1)
In many current distributions (Fedora, modern RHEL), cgroups are mounted by default and systemd automatically groups services. On older systems (for example, Ubuntu 12.04/14.04 with a classic configuration), you can manually mount the cpuset subsystem like this:
mount -t tmpfs -o size=5M cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/cpuset
mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
With this you have created a small tmpfs to house the hierarchies, a cpuset directory that acts as the mount point and, finally, mounted the cgroup file system limited to the cpuset controller.
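Before mounting anything, you can check what the running kernel actually offers: /proc/cgroups lists the compiled-in v1 controllers and /proc/filesystems shows whether the cgroup and cgroup2 filesystem types are available. A guarded sketch:

```shell
# List the v1 controllers the kernel knows about (columns:
# subsys_name, hierarchy, num_cgroups, enabled), if the file exists:
[ -f /proc/cgroups ] && cat /proc/cgroups
# Filesystem types the kernel can mount; look for cgroup / cgroup2:
grep cgroup /proc/filesystems
```

If cpuset does not appear in /proc/cgroups, the mount commands above will fail with "no such device".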
If you inspect /sys/fs/cgroup/cpuset you will see files like:
- tasks and cgroup.procs: lists of task IDs and thread-group IDs belonging to the group.
- cpuset.cpus: CPUs allowed for this group.
- cpuset.mems: allowed memory nodes.
- Several additional flags: cpuset.cpu_exclusive, cpuset.mem_exclusive, cpuset.memory_migrate, etc.
Creating subgroups and allocating CPUs/memory
Each subdirectory you create under /sys/fs/cgroup/cpuset is a new child cpuset. Removing one (once it is empty of tasks) is done with rmdir. For example, to divide the system into two groups, one per core:
cd /sys/fs/cgroup/cpuset
mkdir grupo-A grupo-B
echo 0 > grupo-A/cpuset.cpus
echo 1 > grupo-B/cpuset.cpus
echo 0 > grupo-A/cpuset.mems
echo 0 > grupo-B/cpuset.mems
You have created two cpusets, grupo-A and grupo-B. The first uses only CPU 0 and the second only CPU 1. Both share the machine's single memory node, node 0.
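The same steps can be collected into one defensive script (a sketch: it assumes the v1 cpuset hierarchy is at /sys/fs/cgroup/cpuset and needs root, and bails out quietly when either is missing):

```shell
#!/bin/sh
# Create two single-CPU cpusets, grupo-A and grupo-B (sketch; needs
# root and a mounted v1 cpuset hierarchy, otherwise it does nothing).
CS=/sys/fs/cgroup/cpuset
if [ "$(id -u)" -ne 0 ] || [ ! -f "$CS/cpuset.cpus" ]; then
    echo "cpuset v1 hierarchy not available (or not root); nothing done"
    exit 0
fi
mkdir -p "$CS/grupo-A" "$CS/grupo-B"
echo 0 > "$CS/grupo-A/cpuset.cpus"   # group A runs only on CPU 0
echo 1 > "$CS/grupo-B/cpuset.cpus"   # group B runs only on CPU 1
echo 0 > "$CS/grupo-A/cpuset.mems"   # both allocate from memory node 0
echo 0 > "$CS/grupo-B/cpuset.mems"
```

Note that cpuset.mems must be set before any task is added; an empty mems file makes the write to tasks fail.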
Associating processes with a CPU set
To place tasks in a cpuset, simply write their PIDs into that group's tasks file. A process can only belong to one group within a given hierarchy; when you move it, it disappears from the parent's list.
Suppose you open two new shells, with PIDs 3435 and 3492. They will initially be in the root cpuset. You can send them to the subgroups like this:
echo 3435 > grupo-A/tasks
echo 3492 > grupo-B/tasks
From that moment on, everything launched from those shells will inherit their cpuset. If from shell 3435 you run four instances of consume_cpu (a binary that just burns CPU in a loop), you'll see their PIDs inside grupo-A/tasks and core 0 at full load while core 1 remains almost idle.
If you later want to "rescue" one of those processes and give it a whole core, you can move its PID to the sibling cpuset:
echo 3621 > /sys/fs/cgroup/cpuset/grupo-B/tasks
Observing with top, you'll see both CPUs start working: the moved process takes up most of CPU 1, while the other three share CPU 0.
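If you prefer something scriptable to top for watching this, the per-CPU counters live in /proc/stat; comparing two snapshots taken a second apart shows which cores are busy:

```shell
# Per-CPU jiffies counters (fields: user nice system idle iowait
# irq softirq ...). The delta between two snapshots gives per-CPU
# utilization; a busy core shows its user/system columns growing.
grep '^cpu[0-9]' /proc/stat
sleep 1
grep '^cpu[0-9]' /proc/stat
```

Tools like mpstat -P ALL (covered later) do this delta computation for you.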
Other available cgroup controllers
cpuset is just one piece of the puzzle. cgroups include more controllers that can be combined for very fine-grained resource management:
- cpu: controls the proportion of CPU time allocated to each group. For example, it allows one group 80% of the CPU, another 15%, and another 5%.
- cpuacct: does not limit anything; it only accounts for the CPU time consumed by the group's tasks and their descendants.
- blkio: regulates I/O bandwidth on block devices, with proportional shares or hard limits.
- memory: imposes RAM and swap limits and reports memory usage per group.
- devices: allows or denies access to specific devices (for example, restricting a block device to a container).
- freezer: pauses or resumes all the tasks of a group.
- net_cls and net_prio: label network traffic or adjust priorities per interface to integrate with tc.
- ns: groups processes into different namespaces, useful for lightweight virtualization in combination with namespaces.
- perf_event: allows monitoring cgroup tasks with the perf tool.
Internal details of cpusets in the kernel
At the kernel level, each task keeps a pointer to the cgroup structure it belongs to. Cpusets define CPU and memory-node masks that are intersected with the affinity masks and NUMA policies requested by the task.
Some key points of the implementation:
- The root cpuset contains all of the system's CPUs and memory nodes.
- Each child cpuset must be a subset of its parent's resources.
- A cpuset can be marked as exclusive for CPU (cpuset.cpu_exclusive) or memory (cpuset.mem_exclusive): in that case, its resources cannot overlap with those of its siblings (they may still overlap with ancestors and descendants).
- No new system calls are added: everything is handled via the cgroup file system and virtual files such as cpuset.cpus, tasks, etc.
The kernel hooks cpusets into several points that are not performance-critical:
- Init: initializes the root cpuset at startup.
- fork/exit: to inherit and release membership in cpusets.
- sched_setaffinity(): masks the requested affinity with the mask allowed by the cpuset.
- Memory allocator: restricts page allocations to the allowed nodes.
- Memory reclaim and task migration: cpuset restrictions are respected as far as possible.
Important flags and files of cpuset
Each cpuset has, in addition to cpuset.cpus and cpuset.mems, a series of configuration files that control advanced behaviors:
- cpuset.memory_migrate: if set to 1, when a task changes cpuset or mems is modified, its existing pages are migrated to the new nodes, preserving their relative placement as much as possible.
- cpuset.mem_hardwall and cpuset.mem_exclusive: when enabled, they harden the memory barrier and restrict even certain shared kernel allocations.
- cpuset.memory_pressure and cpuset.memory_pressure_enabled: expose a measure of memory pressure (direct reclaims per second) as a moving average, useful for orchestrators and batch schedulers.
- cpuset.memory_spread_page and cpuset.memory_spread_slab: if enabled, the kernel spreads file cache pages and certain slabs round-robin across the allowed nodes, instead of always favoring the local node.
- cpuset.sched_load_balance: controls whether the scheduler attempts to balance load between the CPUs of the cpuset.
- cpuset.sched_relax_domain_level: adjusts the scope (socket, node, whole system) of certain task migration operations within scheduling domains.
Additionally, in the root cpuset you will find cpuset.effective_cpus and cpuset.effective_mems, which reflect the truly usable resources after CPU/memory hotplug events. In the special cpuset_v2_mode, these files may differ from cpuset.cpus and cpuset.mems to keep behavior closer to that of cgroup v2.
Interaction with the scheduler: sched_load_balance and sched_relax_domain_level
The Linux scheduler divides the system into scheduling domains to minimize the cost of load balancing. Balancing across many cores is expensive, so it is usually done in groups (per socket, per node, etc.).
The cpuset.sched_load_balance flag indicates whether the CPUs in that cpuset should be kept in the same domain so that the scheduler can move tasks freely within it. Disabling it in the root cpuset and enabling it only in some child cpusets can prevent unnecessary load balancing on CPUs reserved for real-time or highly specific workloads.
The cpuset.sched_relax_domain_level file adjusts how far the scheduler will search when:
- A task is awakened and an attempt is made to place it in a nearby free core.
- One CPU is left without work and takes over tasks from more heavily loaded CPUs.
Typical values (architecture-dependent) range from 0 (do not search) to 5 (search across the whole system on NUMA machines), with intermediate levels for HT siblings, cores in the same socket, nodes, etc. It is a fine-grained tool that should only be used if you are very clear about the impact on latencies and caches.
How to use cpusets in practice: typical flow
The workflow to confine a "job" or service within a specific cpuset is usually:
- Make sure the cgroup/cpuset file system is mounted (v1 or v2).
- Create the cpuset with mkdir in the corresponding hierarchy.
- Set cpuset.cpus and cpuset.mems before adding tasks.
- Optionally enable flags such as memory_migrate or cpu_exclusive.
- Start a "parent" process for the workload and move its PID into the cpuset by writing it to tasks or cgroup.procs.
- Launch (or let it launch) child processes from that parent; they will inherit its cpuset membership.
If you use user-space tools such as cgroup-bin / libcgroup (on Debian/Ubuntu) or their equivalents in other distros, you can handle this more conveniently with commands like cgcreate, cgexec and cgclassify, or even with configuration files such as /etc/cgconfig.conf and /etc/cgrules.conf to assign groups based on user or command.
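With those tools installed, the manual workflow collapses into a few commands. A guarded sketch (it assumes the libcgroup utilities are present and you are root; the workload binary here, stress, is purely illustrative):

```shell
#!/bin/sh
# Create a cpuset group, pin it to CPU 1 / memory node 0, and run a
# command inside it (sketch; requires root and the libcgroup tools).
command -v cgcreate >/dev/null 2>&1 || {
    echo "libcgroup tools not installed; nothing done"
    exit 0
}
cgcreate -g cpuset:/grupo-B          # create the group in the cpuset hierarchy
cgset -r cpuset.cpus=1 grupo-B       # allow only CPU 1
cgset -r cpuset.mems=0 grupo-B       # allow only memory node 0
cgexec -g cpuset:grupo-B stress --cpu 1   # any command; stress is illustrative
```

cgclassify can likewise move already-running PIDs into the group, which is what the echo-into-tasks steps did by hand.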
cgroups v1 vs cgroups v2: CPU and cpuset drivers in RHEL 8 and similar systems
On modern systems such as RHEL 8, the kernel supports cgroups v1 and v2 simultaneously. By default, RHEL 8 mounts v1 at boot, but you can force unified mode (v2) with kernel parameters:
- cgroup_no_v1=all: disables all v1 controllers at startup.
- systemd.unified_cgroup_hierarchy=1: tells systemd to use cgroup v2 as the primary hierarchy.
After the change and a reboot, you can check with mount or findmnt that no classic cgroup v1 mounts remain (except for systemd internals) and mount a v2 tree by hand, for example at /cgroups-v2:
mount -t cgroup2 none /cgroups-v2
In that root directory (the root control group) you will see generic files starting with cgroup.* and files belonging to the controllers active there, such as cpuset.cpus.effective or cpu.max.
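On a system already running with cgroup v2 mounted at the usual /sys/fs/cgroup, you can inspect those generic files directly; a guarded sketch:

```shell
# cgroup.controllers lists the controllers available in the v2 root;
# cgroup.subtree_control shows which ones are delegated to children.
V2=/sys/fs/cgroup
if [ -f "$V2/cgroup.controllers" ]; then
    cat "$V2/cgroup.controllers"
    cat "$V2/cgroup.subtree_control"
else
    echo "cgroup v2 not mounted at $V2"
fi
```

A controller must appear in cgroup.controllers before +cpu or +cpuset can be written to cgroup.subtree_control.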
Limit CPU with cgroup v2: CPU driver and cpuset together
In v2 you work with a single tree, and controllers are enabled per subtree using the cgroup.subtree_control file. A typical workflow for limiting the CPU usage of a couple of processes would be:
- Activate the cpu and cpuset controllers for the direct children of the root by writing something like +cpu +cpuset into /cgroups-v2/cgroup.subtree_control.
- Create a subgroup, for example /cgroups-v2/Example/, with mkdir.
- Verify that files like cpu.max and cpuset.cpus have appeared in it.
- Adjust cpuset.cpus and cpuset.mems so that the processes will compete on the same CPUs/nodes (the cpu controller only throttles when at least two processes compete for the same CPU).
- Set cpu.max with a quota and a period, for example:
echo "200000 1000000" > /cgroups-v2/Example/cpu.max
In this case, all the group's processes combined will only be able to consume 0.2 seconds of CPU time per wall-clock second. The rest of the time they will be throttled until the next period.
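The quota arithmetic is worth checking explicitly: cpu.max holds "&lt;quota&gt; &lt;period&gt;" in microseconds, and quota/period is the fraction of one CPU the whole group may use:

```shell
# cpu.max = "<quota> <period>" in microseconds.
quota=200000; period=1000000
awk -v q="$quota" -v p="$period" 'BEGIN {
    printf "group share: %d%% of one CPU\n", q * 100 / p
    printf "per process (2 busy procs): %d%%\n", q * 100 / p / 2
}'
# Prints:
#   group share: 20% of one CPU
#   per process (2 busy procs): 10%
```

Writing "max 1000000" instead of a numeric quota removes the limit again.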
Then you only have to add the PIDs of the desired applications to the group by writing them into /cgroups-v2/Example/cgroup.procs. If, for example, there are two CPU-intensive processes (PIDs 5439 and 5473) in that group, each will end up at around 10% CPU, because they are sharing the 20% quota you have set.
Other tools for understanding CPU, affinity, and hardware in Linux
To work with cpusets, it is very useful to have a good understanding of the CPU topology and CPU state of your system. Linux offers many commands, pseudo-files, and tools such as CPU-X that provide detailed information:
- lscpu: displays, in a very readable way, the architecture, number of logical and physical CPUs, sockets, hyper-threading, caches, virtualization support, etc.
- /proc/cpuinfo: exposes detailed per-logical-CPU information: model, family, stepping, microcode, feature flags (SSE, AVX, VT-x, AMD-V, NX, etc.).
- /sys/devices/system/cpu/: a rich directory structure with one subdirectory per CPU (cpu0, cpu1, …) and others for cpufreq, cpuidle, microcode, topology, etc.
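A quick way to cross-check lscpu against sysfs is to count the per-CPU directories and read one CPU's topology files (standard sysfs paths, guarded in case a file is absent on a given kernel):

```shell
# Number of logical CPUs the kernel knows about:
ls -d /sys/devices/system/cpu/cpu[0-9]* | wc -l
# Which physical core and package CPU 0 belongs to, if exposed:
T=/sys/devices/system/cpu/cpu0/topology
[ -f "$T/core_id" ] && cat "$T/core_id"
[ -f "$T/physical_package_id" ] && cat "$T/physical_package_id"
```

Two logical CPUs sharing the same core_id and physical_package_id are hyper-threading siblings, which matters when picking cpuset.cpus values.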
Within /sys/devices/system/cpu/cpu0/cpufreq/ you can see, for example:
- cpuinfo_cur_freq: current frequency.
- scaling_max_freq and scaling_min_freq: limits within which cpufreq can scale.
- scaling_governor: active policy (performance, powersave, ondemand, etc.).
- scaling_available_governors and scaling_driver: available modes and the driver in use (for example, intel_pstate).
In /sys/devices/system/cpu/cpu0/cpuidle/ you will see idle states (state0, state1, …) with their latencies and power figures, handled by the cpuidle subsystem. This, along with cpufreq and the scheduler, decides when to switch off cores or lower frequencies to save energy based on load.
On mobile devices, Energy-Aware Scheduling (EAS) is starting to be used; it unifies decisions from cpuidle, cpufreq and the scheduler more intelligently, to avoid inconsistencies such as waking up powered-off cores when other active cores are available.
Measure and monitor CPU usage by process and by group
When you're working with cpusets and CPU controllers, you need tools to see whether what you've configured actually has an effect, for example when a VM like VirtualBox uses too much CPU. Some very common options are:
- top: dynamic system view with overall, per-CPU, and per-process CPU usage. The %CPU field lets you see which tasks are overloading the system.
- mpstat (from the sysstat package): per-CPU and aggregate statistics, useful for viewing load distribution and idle times (idle, iowait, steal, etc.).
- ps combined with sort: lists the processes consuming the most CPU, for example:
ps -eo pcpu,pid,user,args | sort -k1 -nr | head -10
Furthermore, accounting files such as cpuacct.usage and cpuacct.usage_percpu (in v1) or the built-in counters in cgroup v2 let you know how much CPU a given group has consumed since it was created, which is ideal for internal billing or comparing the impact of different applications.
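Reading those counters is just a file read; a guarded sketch that works against either hierarchy at the root group:

```shell
# cgroup v2: cpu.stat carries usage_usec (microseconds) for the group
# and its descendants. cgroup v1: cpuacct.usage is total nanoseconds.
G=/sys/fs/cgroup
if [ -f "$G/cpu.stat" ]; then
    grep usage_usec "$G/cpu.stat"          # v2 accounting
elif [ -f "$G/cpuacct/cpuacct.usage" ]; then
    cat "$G/cpuacct/cpuacct.usage"         # v1 accounting
else
    echo "no cgroup CPU accounting found at $G"
fi
```

Sampling the same file twice and subtracting gives the group's CPU consumption over the interval, which is how most monitoring agents compute per-group usage.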
Ultimately, cpusets and cgroups give you the ability to model your machine as a set of resource "islands", assigning tasks and adjusting on the fly how CPU, memory, and I/O are distributed. A thorough understanding of the files under /sys/fs/cgroup and /sys/devices/system/cpu, along with the monitoring tools, lets you go from "just seeing how the server is doing" to having very fine control over what runs where, how much it consumes, and how to improve its behavior when load gets tight.
