[postgis-users] speeding up "not in" query

Sun Jul 25 21:24:24 PDT 2010

Thank you all for your suggestions. 

My original query not only took 4.6 hours to run, but also returned the
wrong answer. The "not exists" version below returned the correct answer
(as far as I can tell) in 2 minutes. I thought these were essentially
the same query but apparently they are not. I have not yet tried the
left join version.

SELECT *
FROM parcels_esri
WHERE NOT EXISTS
(SELECT survey.parcelnum
FROM survey WHERE parcels_esri.parcelnumb = survey.parcelnum)

"Seq Scan on parcels_esri  (cost=0.00..3320785.37 rows=193444
width=4537) (actual time=17.793..4154.274 rows=84489 loops=1)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Index Scan using parcelnum_idx on survey  (cost=0.00..8.35
rows=1 width=26) (actual time=0.009..0.009 rows=1 loops=386887)"
"          Index Cond: ($0 = parcelnum)"
"Total runtime: 4200.203 ms"

Running EXPLAIN ANALYZE on my original query takes more time than I am
willing to wait. This seems like a very problematic formulation:

SELECT *
FROM parcels_esri
WHERE parcels_esri.parcelnumb NOT IN
(SELECT survey.parcelnum
FROM survey)

Thanks again for your help. I now know the huge improvements that can
come from such variations.

-david