[postgis-users] Fun with multikey GiST
Andy Turk
andy at streetlight.com
Fri Jun 21 12:14:54 PDT 2002
Yesterday I posted a question about being able to create an index that looks
at a geometry object and a string. After some poking around, I was able to
come up with a solution that uses multikey GiST, but not directly with
strings.
I need to query the database for all objects with a certain name within a
certain geography. The names (street names, in my case) are not unique and
are typically repeated in many different locations.
My first attempt used a straightforward AND clause to check for overlap with
the region and also to check for the desired name. The standard geometry
index works well in this case, but the number of rows returned by the index
was too large (e.g., tens of thousands of rows). Each row would be checked
sequentially for name equality.
After reading about multikey GiST, I thought that an index over both the
geometry object and the name would do the trick. Further investigation showed
that there don't appear to be any GiST operators for string data. I briefly
(very briefly) considered writing some, but that looked like a lot of work
for uncertain gain.
There are, however, GiST operators for integers. Since I only needed to check
for equality between strings (not ordering) I decided to write a simple hash
function in PLPGSQL and build an index over the geometry column and the hash
of the name.
My next discovery was that a multikey index can't have function calls in it
(a single key index can). This complicated matters a little, since I had
hoped to quickly set up an index and try it out. I then tried to create a
view on my original table which supplied the hash, but the indexer complained
that it wanted a real table to work with.
Undeterred, I added a column to my table and entered a simple UPDATE
statement to set up the hash values. I went out for about an hour. When I
came back, it was still going. My test table had over two million rows and it
looked like it was going to take a *long* time for the update to finish since
the backend process wasn't consuming very much CPU.
The "solution" was to go back to the view I created and use it to create a
new concrete table with the hash value in each row. This took only a few
minutes to complete and I was much closer to trying out the index.
I entered the CREATE INDEX command with the two columns (geometry and integer
hash) and watched the backend process absorb the CPU for a while. It was late
and I was tired, so I went to bed wondering if it would complete before
morning.
After I got up and rubbed the sleep out of my eyes, I checked on the index
again. It was done, and no errors were visible. The backend process showed
about 1 hour of CPU time on it and I think most of that went into creating
the multikey index. For reasons unrelated to this experiment, the table that
I successfully indexed was larger than what I started out with. It has about
4.25 million rows. I'm running postgres on a 500MHz Athlon box (1 cpu) with
256M of memory and a decent SCSI disk.
Creating a single key index on just the geometry takes several minutes, but
really isn't that bad at all. I'm guessing that my multikey index took an
order of magnitude longer to build.
After all this work, I was half-expecting the query planner to avoid my
multikey index. But it did the right thing and used the index when I needed
it to.
Query performance seems OK. Not blistering, but maybe 4x better than using
just the geometry index and then a sequential scan. My figures are very loose
here since I tested my single key index against a table about half the size
of the one used for the multi key index.
Specifically, I'd get results back from the single key index and scan in
about 20 seconds of wall time. This was on a table of about 2 million
records. The GiST index narrowed the results down to 20,000 to 25,000 rows
and it was sequential from that point on.
Using the multikey index on my 4.25 million row table, I get results in about
10 seconds of wall time. This table is roughly twice as big as the other one,
and the query runs about twice as fast, so that's where my figure of 4x came
from--not hard science by any stretch.
I was hoping for another order of magnitude speedup with the multikey index,
but maybe that was too ambitious. I suspect that I'll have to do something
more clever than just hashing on street names to get that level of
performance.
Paul Ramsey was wondering if it's worth the effort to build build support for
general SQL types into the GiST code. I think that's definitely a worthwhile
pursuit. It would have saved me a lot of trouble in massaging my VARCHAR data
so that I could use the GIST_INT4_OPS. One take-away point is that GiST is a
really good general purpose tool, but it's not a silver bullet for
performance.
If anyone has more specific questions about my little experiment here, please
feel free to ask.
Andy Turk
andy at streetlight.com
More information about the postgis-users
mailing list