Understanding Memory Pressure Without Paging


Not every long-form post needs to be part of a larger arc. Sometimes the useful shape is a single focused essay that starts with one systems symptom and follows it only as far as the explanation remains sharp.[1]

Pressure is a coordination problem before it is a metrics problem

Memory pressure[2] is not only about how many bytes remain free. It is also about which subsystem is forced to react first, how expensive reclamation becomes, and whether the rest of the machine still makes forward progress while that happens. Once you start reading PSI alongside allocator behavior, the machine stops looking like it has “a memory number” and starts looking like it has a queueing problem with different failure surfaces.

The shortest useful checklist usually looks like this:

  1. Decide whether the symptom is latency, throughput, or outright failure.
  2. Separate reclaim work from application work.
  3. Check whether the slow path is local allocation pressure or writeback coordination.

That checklist is intentionally narrow because a small model you actually reuse is worth more than a “complete” taxonomy you forget after one incident.

The interesting signals are usually indirect

The first symptom is often not an obvious “out of memory” event.[3][4] It is tail latency growth, reclaim activity showing up in profiles, allocator stalls, or throughput dropping because work that looked CPU-bound is now waiting on memory housekeeping.

Three signals usually arrive before the dramatic failure mode:

  • allocator retries showing up in kernel counters
  • direct reclaim or writeback time surfacing in profiles
  • request latency widening long before the median moves

The machine is often telling you that it is spending time negotiating memory, not that it has “run out” of memory.

If the incident writeup only says “memory usage was high,” it usually missed the operationally useful part of the story.

#include <stdint.h>

/* Counters worth capturing together during an investigation. */
struct pressure_snapshot {
  uint64_t reclaim_scans;   /* pages scanned looking for reclaim candidates */
  uint64_t alloc_stalls;    /* tasks that entered a slow allocation path */
  uint64_t writeback_pages; /* dirty pages pushed to storage */
};

That tiny snapshot is not enough to diagnose everything, but it is enough to stop the conversation from collapsing into one free-memory graph and a lot of hand-waving.

| Signal | What it usually means | Why it matters |
| --- | --- | --- |
| alloc_stalls | Tasks are entering slower allocation paths | User-visible latency often moves here first. |
| reclaim_scans | The kernel is burning time searching for reclaimable pages | This hints at pressure even when the box is not yet failing hard. |
| writeback_pages | Dirty memory is forcing coordination with storage | Memory pressure can become an I/O scheduling problem very quickly. |

Why the table is still a simplification

None of these counters should be treated as a universal truth in isolation. The point is to preserve a useful mental model during investigation, not to pretend that one row of telemetry can summarize the full state of the VM subsystem.

Niche mechanics still deserve legible treatment

Even narrow technical artifacts should still feel at home in the prose system. For example, if you were tracking reclaim behavior by zone, a tiny table like this should remain readable rather than collapsing into generic documentation styling.

| Zone | Scan pressure | Reclaim outcome |
| --- | --- | --- |
| DMA32 | high | noise, usually not the primary bottleneck |
| Normal | sustained | often where the useful story actually lives |
| Movable | bursty | can distort the picture if fragmentation is involved |

And if you need one compact inline reminder, it can stay compact: watch PSI, vmstat, and your service latency together rather than in separate mental buckets.

Keep the question narrow enough to finish

The point of a standalone post is not to cover the whole operating system. It is to give one durable mental model you can reuse the next time a machine becomes slower under load before it becomes obviously broken.

A practical way to stop the investigation from sprawling

Start with one host, one workload, and one visibly degraded path. If you expand to the whole fleet too early, you usually replace diagnosis with folklore.

Use the smallest framing that still explains the behavior:

  • What is stalling?
  • What work is the kernel doing on behalf of that stall?
  • What metric actually moves first?

Once those answers are clear, a short post like this has done its job. The rest can live in a follow-up note, a runbook, or a deeper series entry rather than bloating one page beyond usefulness.[5]

Footnotes

  1. Linux pressure stall information is a good example of an indirect signal that becomes more useful once you stop treating memory pressure as a single free-memory number.

  2. Coordination matters because the slow path is often visible before outright failure; the kernel docs on pressure reporting are a useful framing reference. See the Linux PSI documentation.

  3. Tail latency usually moves first because reclaim and writeback interference show up before average throughput fully collapses.

  4. Allocator stall counters are often easier to trend than narrative descriptions from application logs.

  5. This is also why footnotes are useful here: they let the main text stay short while still giving you room to link to a primary source, qualify a claim, or tuck away a side observation without breaking the argument’s pace.