[postgis-users] Fun with multikey GiST

Fri Jun 21 12:14:54 PDT 2002

Yesterday I posted a question about being able to create an index that looks 
at a geometry object and a string. After some poking around, I was able to 
come up with a solution that uses multikey GiST, but not directly with 
strings.

I need to query the database for all objects with a certain name within a 
certain geography. The names (street names, in my case) are not unique and 
are typically repeated in many different locations.

My first attempt used a straightforward AND clause to check for overlap with 
the region and also to check for the desired name. The standard geometry 
index works well in this case, but the number of rows returned by the index 
was too large (e.g., tens of thousands of rows). Each row would be checked 
sequentially for name equality.

After reading about multikey GiST, I thought that an index over both the 
geometry object and the name would do the trick. Further investigation showed 
that there don't appear to be any GiST operators for string data. I briefly 
(very briefly) considered writing some, but that looked like a lot of work 
for uncertain gain.

There are, however, GiST operators for integers. Since I only needed to check 
for equality between strings (not ordering) I decided to write a simple hash 
function in PLPGSQL and build an index over the geometry column and the hash 
of the name.

My next discovery was that a multikey index can't have function calls in it 
(a single key index can). This complicated matters a little, since I had 
hoped to quickly set up an index and try it out. I then tried to create a 
view on my original table which supplied the hash, but the indexer complained 
that it wanted a real table to work with.

Undeterred, I added a column to my table and entered a simple UPDATE 
statement to set up the hash values. I went out for about an hour. When I 
came back, it was still going. My test table had over two million rows and it 
looked like it was going to take a *long* time for the update to finish since 
the backend process wasn't consuming very much CPU.

The "solution" was to go back to the view I created and use it to create a 
new concrete table with the hash value in each row. This took only a few 
minutes to complete and I was much closer to trying out the index.

I entered the CREATE INDEX command with the two columns (geometry and integer 
hash) and watched the backend process absorb the CPU for a while. It was late 
and I was tired, so I went to bed wondering if it would complete before 
morning.

After I got up and rubbed the sleep out of my eyes, I checked on the index 
again. It was done, and no errors were visible. The backend process showed 
about 1 hour of CPU time on it and I think most of that went into creating 
the multikey index. For reasons unrelated to this experiment, the table that 
I successfully indexed was larger than what I started out with. It has about 
4.25 million rows. I'm running postgres on a 500MHz Athlon box (1 cpu) with 
256M of memory and a decent SCSI disk.

Creating a single key index on just the geometry takes several minutes, but 
really isn't that bad at all. I'm guessing that my multikey index took an 
order of magnitude longer to build.

After all this work, I was half-expecting the query planner to avoid my 
multikey index. But it did the right thing and used the index when I needed 
it to.

Query performance seems OK. Not blistering, but maybe 4x better than using 
just the geometry index and then a sequential scan. My figures are very loose 
here since I tested my single key index against a table about half the size 
of the one used for the multi key index.

Specifically, I'd get results back from the single key index and scan in 
about 20 seconds of wall time. This was on a table of about 2 million 
records. The GiST index narrowed the results down to 20,000 to 25,000 rows 
and it was sequential from that point on.

Using the multikey index on my 4.25 million row table, I get results in about 
10 seconds of wall time. This table is roughly twice as big as the other one, 
and the query runs about twice as fast, so that's where my figure of 4x came 
from--not hard science by any stretch.

I was hoping for another order of magnitude speedup with the multikey index, 
but maybe that was too ambitious. I suspect that I'll have to do something 
more clever than just hashing on street names to get that level of 
performance.

Paul Ramsey was wondering if it's worth the effort to build build support for 
general SQL types into the GiST code. I think that's definitely a worthwhile 
pursuit. It would have saved me a lot of trouble in massaging my VARCHAR data 
so that I could use the GIST_INT4_OPS.  One take-away point is that GiST is a 
really good general purpose tool, but it's not a silver bullet for 
performance.

If anyone has more specific questions about my little experiment here, please 
feel free to ask.

Andy Turk
andy at streetlight.com