[Gdal-dev] INDEX/JOIN Performance

Thu Mar 20 17:50:16 EST 2003

Paul / Daniel,

I thought I should do a little performance testing on the indexing for joins.
The results show that the join does add alot of overhead to very simple
queries but that overall throughput is still pretty respectable.

Files:
primary.dbf: 300000 records each with one integer key field.
secondary.dbf: 1000000 records each with on integer key field.

warmerda at gdal2200[506]% time ogrinfo -ro . -sql 'select * from primary' > & wrk3
2.423u 0.177s 0:02.85 90.8%     0+0k 0+0io 1336pf+0w
warmerda at gdal2200[510]% time ogrinfo -ro . -sql 'select * from primary left join secondary on primary.skey = 
secondary.key' > & wrk2
13.662u 7.660s 0:21.42 99.5%    0+0k 0+0io 1337pf+0w
warmerda at gdal2200[511]% ls -l
total 56848
-rw-r--r--    1 warmerda users     3600065 Mar 20 17:37 primary.dbf
-rw-r--r--    1 warmerda users        1146 Mar 20 17:37 ptest.py
-rw-r--r--    1 warmerda users    12000065 Mar 20 17:27 secondary.dbf
-rw-r--r--    1 warmerda users         222 Mar 20 17:30 secondary.idm
-rw-r--r--    1 warmerda users    17066496 Mar 20 17:41 secondary.ind

In short, the query without doing the join took 2.4s wall time.
The query with the join took 22s wall time.  This was to process all
300000 records in the primary file.  Joining 3000000 records against
another table in 22s isn't bad.

The primary keys were essentially random so the the page in memory from
the .ind file was essentially never the right one. I don't know it for
a fact, but for this situation I believe the reading of the .ind file pages
on each request was likely the dominant factor with the join case.

Performance without the index was of course terrible.  One or two records
produced per second since it had to search through all of secondary.dbf for
each primary record processed.

I ran a second test with the same sized files, but the primary keys were
in the same order as the secondary, so there was a large degree of
"cache hits" as far as the index file and the secondary file were
concerned.  This took 11.4s wall time.  Substantially better than the
random access case, but not yet coming close to the unjoined case.  In
an idealized world the best we could hope for in the joined case would
be double the time of the unjoined case (assuming the indexing was in all
senses free and worked perfectly).  As it turns out, even in this case
about 50% of the CPU time is index lookup work.

Best regards,

-- 
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam at pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | Geospatial Programmer for Rent