[postgis-users] OT Understanding slow queries
Charles Galpin
cgalpin at lhsw.com
Thu Aug 25 11:13:30 PDT 2011
I haven't noticed count(*) causing additional query time, but I was just using it to rule out overhead of pulling all those rows back.
All of your feedback has been most helpful. Yes this query is contrived but I *thought* it was representative of a worst case scenario that might be similar to future data sets and it's likely not. Due to your prompting I did figure out why this particular problem i was seeing was slow and it was really just user stupidity.
The real use case is using geoserver to visualize this data. In the worst case scenario someone zooms out nice and far and effectively gets all the links in that join. Now geoserver seems to be able to get all the links with something like "get * from links" and generate the tiles at this zoom level fairly quickly so I figured all the other overhead being equal, the join (sans the count(*) ) would be the worst case and although it is, I think you are right that it simply has to join across all those ids and there is no way to improve that if selecting all.
I'll show a more reasonable case of what usually happens, but I'll explain the actual problem. I though adding a distinct on (link_id) would speed up the join since the source_link table has many rows for each link_id but it turns out this was what was making it slow (and I didn't realize that's what the real query was doing). I also though for some reason an index would make distinct run quickly since the index is effectively all unique values right?
So here is what is typically going on for the query and after changing the "select distinct on (link_id) l.*" to "select l.*" it performs reasonably with an additional level of filtering with a bounding box.
explain analyse SELECT "link_id",encode(ST_AsBinary(ST_Force_2D("the_geom")),'base64') as "the_geom"
FROM (select l.* from links l, source_link ld where l.link_id = ld.link_id) as "vtable"
WHERE "the_geom" && ST_GeomFromText('POLYGON ((-74.79526213727112 40.11142841966784, -74.79526213727112 40.21270521911115, -74.66273955428638 40.21270521911115, -74.66273955428638 40.11142841966784, -74.79526213727112 40.11142841966784))', 4326)
/*
'Hash Join (cost=3722.92..74955.80 rows=58813 width=378) (actual time=412.729..5946.610 rows=44469 loops=1)'
' Hash Cond: (ld.link_id = l.link_id)'
' -> Seq Scan on source_link ld (cost=0.00..58282.29 rows=3179029 width=10) (actual time=0.026..2823.685 rows=3179029 loops=1)'
' -> Hash (cost=3706.60..3706.60 rows=1306 width=378) (actual time=7.805..7.805 rows=1285 loops=1)'
' -> Bitmap Heap Scan on links l (cost=74.42..3706.60 rows=1306 width=378) (actual time=0.944..6.093 rows=1285 loops=1)'
' Recheck Cond: (the_geom && '0103000020E61000000100000005000000E6D42993E5B252C0BF285549430E4440E6D42993E5B252C0ADE2B3EC391B44403DDB29536AAA52C0ADE2B3EC391B44403DDB29536AAA52C0BF285549430E4440E6D42993E5B252C0BF285549430E4440'::geometry)'
' -> Bitmap Index Scan on links_geom_idx (cost=0.00..74.09 rows=1306 width=0) (actual time=0.883..0.883 rows=1285 loops=1)'
' Index Cond: (the_geom && '0103000020E61000000100000005000000E6D42993E5B252C0BF285549430E4440E6D42993E5B252C0ADE2B3EC391B44403DDB29536AAA52C0ADE2B3EC391B44403DDB29536AAA52C0BF285549430E4440E6D42993E5B252C0BF285549430E4440'::geometry)'
'Total runtime: 5983.473 ms'
So what prompted my initial concern is solved. This slowness caught me at a time where in the back of my mind i am contemplating our next data challenge which is why I looked into it. If I run into problems with our other data set I'll be sure to cover all my options and give realistic queries if I ask for help.
Thanks again for all the advice.
charles
----- Original Message -----
From: "Ben Madin" <lists at remoteinformation.com.au>
To: "PostGIS Users Discussion" <postgis-users at postgis.refractions.net>
Sent: Thursday, August 25, 2011 1:21:00 PM GMT -05:00 US/Canada Eastern
Subject: Re: [postgis-users] OT Understanding slow queries
Steve,
does this just apply to count(*), or is count(id) just as bad? I was originally a MySQL user and count(*) could be very efficient there.
cheers
Ben
On 26/08/2011, at 12:01 AM, Stephen Woodbridge wrote:
> The issue here is the count(*) which forces a full table scan in postgresql as fork mentioned. You need to look at a real query, unless you are really doing a count(*).
>
> -Steve
>
> On 8/25/2011 11:49 AM, Ben Madin wrote:
>> I'm no expert at this, but my understanding (which is limited) was
>> that you are asking for the whole table, so indexing doesn't really
>> get used (my understanding is that indexing helps to find the page
>> for a subset of data more quickly than scanning through the whole
>> lot).
>>
>> Also, you might be able to get some speed up by using a different
>> join type (outer join and not null where clause)?
>>
>> cheers
>>
>> Ben
>>
>>
>> On 25/08/2011, at 9:41 PM, Charles Galpin wrote:
>>
>>> If this is too off topic, please let me know and I'll sign up on a
>>> postgres list to get help. But this is related to my use of postgis
>>> and If anyone knows this stuff, it's you guys.
>>>
>>> I have an example query that I expect to be much faster, but my
>>> main concern is we are about to do some visualization of historical
>>> congestion data which will require queries across much larger data
>>> sets - like 150 million records a day. We are about to test using
>>> partitions but the number per table will still be much larger than
>>> what I am dealing with now.
>>>
>>> So here is a query I would think would be much faster than 43
>>> seconds for two tables, one with about 97k rows, and the other 3.2
>>> million.
>>>
>>> explain select count(l.*) from links l, source_link ld where
>>> l.link_id = ld.link_id; /* 'Aggregate (cost=174731.72..174731.73
>>> rows=1 width=32)' ' -> Hash Join (cost=13024.27..166784.14
>>> rows=3179029 width=32)' ' Hash Cond: (ld.link_id =
>>> l.link_id)' ' -> Seq Scan on source_link ld
>>> (cost=0.00..58282.29 rows=3179029 width=10)' ' -> Hash
>>> (cost=10963.12..10963.12 rows=96812 width=42)' ' ->
>>> Seq Scan on links l (cost=0.00..10963.12 rows=96812 width=42)' */
>>>
>>> Each table has an index on link_id, defined like this
>>>
>>> CREATE INDEX links_link_id_idx ON links USING btree (link_id);
>>>
>>> CREATE INDEX source_link_link_id_idx ON source_link USING btree
>>> (link_id);
>>>
>>> Shouldn't this index prevent these sequential scans, or am I
>>> misreading this? Should this really take 43 seconds?
>>>
>>> thanks for any advice, charles
>>>
>>> _______________________________________________ postgis-users
>>> mailing list postgis-users at postgis.refractions.net
>>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>>
>> _______________________________________________ postgis-users mailing
>> list postgis-users at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
_______________________________________________
postgis-users mailing list
postgis-users at postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-users
More information about the postgis-users
mailing list