Understanding caching in Postgres - An in-depth guide


Caching can be considered an important aspect in tuning database system performance.

While this post is mainly focused on postgres, it can be easily compared and understood with other database systems.


What is a cache and why do we need one

Different computer components operate at different speeds. We humans are extremely poor at understanding numbers at the scale that computers do.

Looking at the table below (taken from here) we can have an idea.

The numbers are approximated at human scale.

Access type Actual time Approximated time
1 CPU cycle0.3 ns1 s
Level 1 cache access0.9 ns3 s
Level 2 cache access2.8 ns9 s
Level 3 cache access12.9 ns43 s
Main memory access120 ns6 min
Solid-state disk I/O50-150 μs2-6 days
Rotational disk I/O1-10 ms1-12 months
Internet: SF to NYC40 ms4 years
Internet: SF to UK81 ms8 years
Internet: SF to Australia183 ms19 years

In a database system, we are mainly concerned about disk I/O.

Magnetic disks are poor for random I/O when compared to their newer counterparts the SSDs.

Most OLTP workloads are random I/O, hence the fetch from the disk can be very slow.

To overcome this, postgres caches data in RAM which can greatly improve performance. Even in the case of SSDs,RAM is much faster.

This general idea of a cache is common to almost all database systems.

Understanding terminologies

Before we move forward, it is necessary to understand certain terminologies.

I suggest to start reading with postgres physical storage.

Once you are done with that, inter db is an another one which goes in a little more depth. In particular, the section about heap tuples.

The official documentation for this is also available, but it is a little hard to get your head around.

Regardless of the content, postgres has a storage abstraction called a page(8KB in size). The image below gives a rough idea.

Postgres page

This abstraction is what we will be dealing with in the rest of this post.

What is cached?

Postgres caches the following.

We will explore how table data and indexes are cached in detail further in this post.

Memory areas

Postgres has several configuration parameters and understanding what they mean is really important.

For caching, the most important configuration is the shared_buffers.

Internally in the postgres source code, this is known as the NBuffers, and this where all of the shared data sits in the memory.

The shared_buffers is simply an array of 8KB blocks.Each page has metadata within itself to distinguish itself as mentioned above. Before postgres checks out the data from the disk, it first does a lookup for the pages in the shared_buffers, if there is a hit then it returns the data from there itself and thereby avoiding a disk I/O.

The LRU/Clock sweep cache algorithm

The mechanisms involved in putting data into a cache and evicting from them is controlled by a clock sweep algorithm.

It is built to handle OLTP workloads, so that almost all of the traffic are dealt with in memory.

Let’s talk about each action in detail.

Buffer allocation

Postgres is a process based system, i.e each connection has a native OS process of its own which is spawned from the postgres root process(previously called postmaster).

When a process requests for a page in the LRU cache (this is done whenever that page is accessed via a typical SQL query), it requests for a buffer allocation.

If the block is already in cache, it gets pinned and then returned. The process of pinning is a way to increase the usage count discussed below. A page is said to be unpinned when the usage count is zero.

Only if there are no buffers/slots free for a page, then it goes for buffer eviction.

Buffer eviction

Deciding which pages should be evicted from memory and written to disk is a classic computer science problem.

A plain LRU(Least Recently Used) algorithm does not work well in reality since it has no memory of the previous run.

Postgres keeps track of page usage count, so if a page usage count is zero, it is evicted from memory and written to disk. It is also written to disk when the page is dirty (see below).

Regardless of the nitty-gritty details, the cache algorithm by itself requires almost no tweaking and is much smarter than what people would usually think.

Dirty pages and cache invalidation

We were talking about select queries till now, what happens to DML queries?

Simple, they get written to the same pages. If present in memory, then they are written to it or else they are fetched from disk and then written to it.

This is where the notion of dirty pages come in, i.e a page has been modified and has not been written to disk.

Here is some more homework/study to be done, before we proceed, in particular about WAL and checkpoints.

WAL is a redo log that basically keeps track of whatever that is happening to the system. This is done by logging all changes separately to a WAL Log. Checkpointer is a process which writes the so called dirty pages to disk periodically and controlled by a time setting. It does so, because when the database crashes it does not need to replay everything from scratch.

This is the most common way of pages getting evicted from memory, the LRU eviction almost never happens in a typical scenario.

Understanding caches from explain analyze

Explain is a wonderful way to understand what is happening under the hoods. It can even tell how much data blocks came from disk and how much came from shared_buffers i.e memory.

A query plan below gives an example,

performance_test=# explain (analyze,buffers) select * from users order by userid limit 10;
  Limit  (cost=0.42..1.93 rows=10 width=219) (actual time=32.099..81.529 rows=10 loops=1)
    Buffers: shared read=13
    ->  Index Scan using users_userid_idx on users  (cost=0.42..150979.46 rows=1000000 width=219) (actual time=32.096..81.513 rows=10 loops=1)
          Buffers: shared read=13
  Planning time: 0.153 ms
  Execution time: 81.575 ms
 (6 rows)

Shared read, means it comes from the disk and it was not cached. If the query is run again, and if the cache configuration is correct (we will discuss about it below), it will show up as shared hit.

performance_test=# explain (analyze,buffers) select * from users order by userid limit 10;
  Limit  (cost=0.42..1.93 rows=10 width=219) (actual time=0.030..0.052 rows=10 loops=1)
    Buffers: shared hit=13
    ->  Index Scan using users_userid_idx on users  (cost=0.42..150979.46 rows=1000000 width=219) (actual time=0.028..0.044 rows=10 loops=1)
          Buffers: shared hit=13
  Planning time: 0.117 ms
  Execution time: 0.085 ms
 (6 rows)

It is very convenient this way to learn about how much is cached from a query perspective rather than fiddling with the internals of the OS/Postgres.

The case for sequential scans

A sequential scan i.e when there is no index and postgres has to fetch all the data from disk are a problem area for a cache like this.

Since a single seq scan can wipe all of the data from a cache, it is handled differently.

Instead of using a normal LRU/clock sweep algorithm, it uses a series of buffers of total 256 K.B in size. The below plan shows how it is handled.

performance_test=# explain (analyze,buffers) select count(*) from users;
  Aggregate  (cost=48214.95..48214.96 rows=1 width=0) (actual time=3874.445..3874.445 rows=1 loops=1)
    Buffers: shared read=35715
    ->  Seq Scan on users  (cost=0.00..45714.96 rows=999996 width=0) (actual time=6.024..3526.606 rows=1000000 loops=1)
          Buffers: shared read=35715
  Planning time: 0.114 ms
  Execution time: 3874.509 ms

Executing the above query again.

performance_test=# explain (analyze,buffers) select count(*) from users;
  Aggregate  (cost=48214.95..48214.96 rows=1 width=0) (actual time=426.385..426.385 rows=1 loops=1)
    Buffers: shared hit=32 read=35683
    ->  Seq Scan on users  (cost=0.00..45714.96 rows=999996 width=0) (actual time=0.036..285.363 rows=1000000 loops=1)
          Buffers: shared hit=32 read=35683
  Planning time: 0.048 ms
  Execution time: 426.431 ms

We can see that exactly 32 blocks have moved into memory i.e 32 * 8 = 256 KB. This is explained in the src/backend/storage/buffer/README

Memory flow and OS caching

Postgres as a cross platform database, relies heavily on the operating system for its caching.

The shared_buffers is actually duplicating what the OS does.A typical picture of how the data flows through postgres is given below.

Postgres Cache flow

This is confusing at first, since caching is managed by both the OS and postgres as well, but there are reasons to this.

Talking about operating system cache requires another post of its own, but there are many resources on the net which can leveraged.

Keep in mind that the OS caches data for the same reason we saw above, i.e why do we need a cache ?

We can classify the I/O as two types, i.e reads and writes. To put it even more simpler, data flows from disk to memory for reads and flows from memory to disk for writes.


For reads, when you consider the flow diagram above, the data flows from disk to OS cache and then to shared_buffers. We have already discussed how the pages will get pinned on to the shared_buffers until they get dirty/unpinned.

Sometimes, both the OS cache and shared_buffers can hold the same pages. This may lead for space wastage, but remember the OS cache is using a simple LRU and not a database optimized clock sweep. Once the pages take a hit on shared_buffers, the reads never reach the OS cache, and if there are any duplicates, they get removed easily.

In reality, there are not much pages which gets stacked on both the memory areas.

This is one of the reasons why it is advised to size the shared_buffers carefully. Using hard and fast rules such as giving it the lion’s share of the memory or giving it too little is going to hurt performance.

We will discuss more on optimization below.


Writes flow from memory to disk. This is where the concept of dirty pages come in.

Once a page is marked dirty, it gets flushed to the OS cache which then writes to disk. This is where the OS has more freedom to schedule I/O based on the incoming traffic.

As said above, if the OS cache size is less, then it cannot re-order the writes and optimize I/O. This is particularly important for a write heavy workload. So the OS cache size is important as well.

Initial configuration

As with many database systems, there is no silver bullet configuration which will just work. PostgreSQL ships with a basic configuration tuned for wide compatibility rather than performance.

It is the responsibility of the database administrator/developer to tune the configuration according to the application/workload. However, the folks at postgres have a good documentation of where to start

Once the default/startup configuration is set. We can do load/performance testing to see how it is holding up.

Keep in mind that the initial configuration is also tuned for availability rather than performance, it is better to always experiment and arrive at a config that is more suitable for the workload under consideration.

Optimize as you go

If you cannot measure something, you cannot optimize it

With postgres, there are two ways you can measure.

Operating system

While there is no general consensus on which platform postgres works best, I am assuming that you are using something in the linux family of operating systems. But the idea is kind of similar.

To start with, there is a tool called Io top which can measure disk I/O. Similar to top, this can come in handy when measuring disk I/O. Just run the command iotop to measure writes/reads.

IO top in action

This can give useful insights into how postgres is behaving under load i.e how much is hitting the disk and how much is from RAM which can be arrived based on the load being generated.

Directly from postgres

It is always better to monitor something directly from postgres,rather than going through the OS route.

Typically we would do OS level monitoring if we believe that there is something wrong with postgres itself, but this is rarely the case.

With postgres, there are several tools at our disposal for measuring performance with respect to memory.

    (total_time / 1000 / 60)::decimal as total_minutes,
    (total_time/calls)::decimal as average_time,
  FROM pg_stat_statements
  LIMIT 100;

Gives lot of details on how much time queries took and their average.

The disadvantage with this approach is it takes some amount of performance, so it is not generally recommended in production systems.

    SELECT c.relname,count(*) AS buffers
    FROM pg_class c INNER JOIN pg_buffercache b
    ON b.relfilenode=c.relfilenode INNER JOIN pg_database d
    ON (b.reldatabase=d.oid AND 
    GROUP BY c.relname ORDER BY 2 DESC LIMIT 100;

There are several more, but I have listed the most popular and easy to use ones for understanding postgres cache and also in general. Armed with these tools, there are no more excuses for a slow database because of memory problems


Tagged Under


Search this website...

Keeping up with blogs...

I blog occasionally and mostly write about Software engineering and the likes. If you are interested in keeping up with new blog posts, you should follow me on twitter where I usually tweet when I publish them. You can also use the RSS feed , or even subscribe via email below.

Feedio Subscribe

Share & Like

If you like this post, then you can either share/discuss/vote up on the below sites.

Thoughts ...

Please feel free to share your comments below for discussion

Blog comments powered by Disqus