[postgis-users] Large Databases

Markus Schaber schabi at logix-tt.com
Thu Jun 29 06:45:45 PDT 2006


Hi, Robert,

Please keep the discussion on the list, so others can learn from it.

Robert Burgholzer wrote:

> Thanks for the reply. I did in fact see that in the logs and changed it,
> but did not report it to the list. What you say about a "regular vacuum"
> being effective makes more sense to me now, I guess, although I thought
> that you suggested that only the last page would be cleaned up in a
> "regular vacuum" - I assume that this means that each table will have
> one page vacuumed, but the sum of them cannot exceed the fsm setting.

No, not exactly. I'll try to give a brief but complete description of
what happens:

A "non-full" vacuum cleans all pages of the table, where cleaning means
removing dead tuples and thereby creating or enlarging free space in
those pages.

Those pages at the very end of the underlying file that become
completely empty during this process are returned to the OS by
truncating the file.

All other pages that contain at least some free space are registered in
the free space map (as long as the FSM has free entries).

When allocating new rows (during insert or update), PostgreSQL has the
following strategy to find free space:

- On update, check whether there's enough free space in the current page
  (The current page is the one where the old row version is.)

- Look into the free space map for a page containing enough free space

- Allocate a new page from the OS by enlarging the underlying file (or
creating a new segment file if the current one has reached the 1 GB
segment size limit).

So pages that contain free space, but did not fit into the free space
map, won't be reused by insert operations, and only occasionally by
updates (at least until the next vacuum comes around and can register
them in the FSM).
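
In the 8.x-era configuration, the size of the free space map is bounded
by two postgresql.conf settings. The values below are only illustrative,
not recommendations; they need to be tuned to your workload:

```
# postgresql.conf (PostgreSQL 8.x; illustrative values)
max_fsm_pages     = 200000   # max. number of disk pages tracked in the FSM
max_fsm_relations = 1000     # max. number of tables/indexes tracked in the FSM
```

Changing them requires a server restart, and the tail of the VACUUM
VERBOSE output tells you whether the current settings are large enough.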

Basically summarized:

If, on average, you "touch" more pages between two vacuums than your
free space map can cope with, your database will continue growing.
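
To make that summary concrete, here is a toy simulation (my own sketch,
not PostgreSQL internals): pages hold a few rows each, free space is
only discoverable through a bounded FSM that is rebuilt at each
"vacuum", and new row versions go to an FSM page if one is known, else
the file is extended:

```python
CAP = 4  # rows per page in this toy model

def simulate(churn, fsm_capacity, cycles=20):
    """Return the number of pages after `cycles` rounds of churn."""
    pages = [CAP, CAP]   # rows used per page; start with two full pages
    fsm = []             # page indices with known free space
    for _ in range(cycles):
        # old row versions die, scattered across the table
        to_kill, i = churn, 0
        while to_kill:
            if pages[i % len(pages)] > 0:
                pages[i % len(pages)] -= 1
                to_kill -= 1
            i += 1
        # new row versions: use an FSM page if known, else extend the file
        for _ in range(churn):
            if fsm:
                pages[fsm[0]] += 1
                if pages[fsm[0]] == CAP:
                    fsm.pop(0)
            else:
                pages.append(1)
        # "vacuum": re-register pages with free space, bounded by FSM size
        fsm = [idx for idx, used in enumerate(pages) if used < CAP]
        fsm = fsm[:fsm_capacity]
    return len(pages)

tiny_fsm = simulate(churn=6, fsm_capacity=1)
big_fsm = simulate(churn=6, fsm_capacity=100)
print(tiny_fsm, big_fsm)  # with the tiny FSM the "file" keeps growing
```

Despite identical churn, only the run whose FSM can track all touched
pages reaches a steady file size; the other one allocates new pages
every cycle, which is exactly the bloat described above.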

Vacuum full goes further: it actively moves live rows towards the
beginning of the file. This aggressively groups (most of) the free
space at the end of the file, allowing it to be truncated.
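
For concreteness, the two variants side by side (the table name is
hypothetical):

```sql
VACUUM rivers;        -- reclaims space into the FSM, runs alongside queries
VACUUM FULL rivers;   -- compacts the file, but takes an exclusive table lock
```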

If you want more details, look at the PostgreSQL documentation, or ask
here and I'll try to give you more pointers.

> I
> will have to change my weekly "vacuum full" cron job to a "vacuum", and
> try to observe the differences. 

Plain vacuum should run much more often than vacuum full, if possible
via autovacuum (which is integrated into the server from PostgreSQL 8.1
onwards).

It is less expensive than vacuum full, as it does not take an exclusive
lock on the table, and its I/O impact can be tuned down using the
vacuum_cost_delay setting.
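
If autovacuum is not available to you, a frequent plain vacuum can be
scheduled externally with the standard vacuumdb utility; the path and
schedule in this crontab sketch are just assumptions:

```
# crontab fragment (illustrative): plain vacuum of all databases, nightly
0 3 * * * /usr/bin/vacuumdb --all --analyze --quiet
```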

> I was surprised to see the generally null response to the concept of
> performance tuning for high-volume data, unless of course everyone is
> much more knowledgeable about basic postgres tuning than myself (not too
> hard to achieve), or the average database is simply smaller than the
> unusually large ones that I deal with.

I'm afraid this is not the best list for this subject; the PostgreSQL
performance list <pgsql-performance at postgresql.org> is much better
suited to such discussions.

> My thought, however, is that the
> average postgresql database is NOT used for the purposes of performing
> high-volume calculations, but instead more mundane information storage
> and retrieval.

I think that assumption is wrong; PostgreSQL is used in a wide variety
of scenarios, including terabyte-sized databases.

However, like most complex RDBMSes, it needs a certain amount of
understanding and configuration by the administrators to perform well.
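
As a starting point, these are the postgresql.conf knobs most often
raised from their conservative defaults for large databases (8.1-era
names and units; the values are purely illustrative):

```
shared_buffers       = 50000    # in 8 kB pages here (~400 MB)
work_mem             = 32768    # kB, per sort/hash operation
maintenance_work_mem = 262144   # kB, speeds up VACUUM and CREATE INDEX
effective_cache_size = 100000   # 8 kB pages assumed cached by the OS
```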



HTH,
Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org


