Understanding Memory Pressure Without Paging


Not every long-form post needs to be part of a larger arc. Sometimes the useful shape is a single focused essay that starts with one systems symptom and follows it only as far as the explanation remains sharp.[1]

Pressure is a coordination problem before it is a metrics problem

Memory pressure[2] is not only about how many bytes remain free. It is also about which subsystem is forced to react first, how expensive reclamation becomes, and whether the rest of the machine still makes forward progress while that happens. Once you start reading PSI alongside allocator behavior, the machine stops looking like it has “a memory number” and starts looking like it has a queueing problem with different failure surfaces.

The shortest useful checklist usually looks like this:

  1. Decide whether the symptom is latency, throughput, or outright failure.
  2. Separate reclaim work from application work.
  3. Check whether the slow path is local allocation pressure or writeback coordination.

That checklist is intentionally narrow because a small model you actually reuse is worth more than a “complete” taxonomy you forget after one incident.

The interesting signals are usually indirect

The first symptom is often not an obvious “out of memory” event.[3][4] It is tail latency growth, reclaim activity showing up in profiles, allocator stalls, or throughput dropping because work that looked CPU-bound is now waiting on memory housekeeping.

Three signals usually arrive before the dramatic failure mode:

  • allocator retries showing up in kernel counters
  • direct reclaim or writeback time surfacing in profiles
  • request latency widening long before the median moves

The machine is often telling you that it is spending time negotiating memory, not that it has “run out” of memory.

If the incident writeup only says “memory usage was high,” it usually missed the operationally useful part of the story.

#include <stdint.h>

/* Counters worth capturing together during an investigation. */
struct pressure_snapshot {
  uint64_t reclaim_scans;   /* pages scanned looking for reclaim candidates */
  uint64_t alloc_stalls;    /* tasks that entered a slow allocation path */
  uint64_t writeback_pages; /* dirty pages pushed to storage */
};

That tiny snapshot is not enough to diagnose everything, but it is enough to stop the conversation from collapsing into one free-memory graph and a lot of hand-waving.

| Signal | What it usually means | Why it matters |
| --- | --- | --- |
| alloc_stalls | Tasks are entering slower allocation paths | User-visible latency often moves here first. |
| reclaim_scans | The kernel is burning time searching for reclaimable pages | This hints at pressure even when the box is not yet failing hard. |
| writeback_pages | Dirty memory is forcing coordination with storage | Memory pressure can become an I/O scheduling problem very quickly. |

Why the table is still a simplification

None of these counters should be treated as a universal truth in isolation. The point is to preserve a useful mental model during investigation, not to pretend that one row of telemetry can summarize the full state of the VM subsystem.

Niche mechanics still deserve legible treatment

Even narrow technical artifacts should still feel at home in the prose system. For example, if you were tracking reclaim behavior by zone, a tiny table like this should remain readable rather than collapsing into generic documentation styling.

| Zone | Scan pressure | Reclaim outcome |
| --- | --- | --- |
| DMA32 | high | noise, usually not the primary bottleneck |
| Normal | sustained | often where the useful story actually lives |
| Movable | bursty | can distort the picture if fragmentation is involved |

And if you need one compact inline reminder, it can stay compact: watch PSI, vmstat, and your service latency together rather than in separate mental buckets.

Keep the question narrow enough to finish

The point of a standalone post is not to cover the whole operating system. It is to give one durable mental model you can reuse the next time a machine becomes slower under load before it becomes obviously broken.

A practical way to stop the investigation from sprawling

Start with one host, one workload, and one visibly degraded path. If you expand to the whole fleet too early, you usually replace diagnosis with folklore.

Use the smallest framing that still explains the behavior:

  • What is stalling?
  • What work is the kernel doing on behalf of that stall?
  • What metric actually moves first?

Once those answers are clear, a short post like this has done its job. The rest can live in a follow-up note, a runbook, or a deeper series entry rather than bloating one page beyond usefulness.[5]

Footnotes

  1. Linux pressure stall information is a good example of an indirect signal that becomes more useful once you stop treating memory pressure as a single free-memory number.

  2. Coordination matters because the slow path is often visible before outright failure; the kernel docs on pressure reporting are a useful framing reference. See the Linux PSI documentation.

  3. Tail latency usually moves first because reclaim and writeback interference show up before average throughput fully collapses.

  4. Allocator stall counters are often easier to trend than narrative descriptions from application logs.

  5. This is also why footnotes are useful here: they let the main text stay short while still giving you room to link to a primary source, qualify a claim, or tuck away a side observation without breaking the argument’s pace.