[postgis-users] Question on optimizing slow geospatial query

Wed Feb 11 01:55:56 PST 2015

Hey,
for the following I make the hypothesis that
 1. for you trip A->B should be separated from trip B->A.
 2. od1.taz.geom are non overlaping polygon (ie your zone are non
overlaping)
 3. whatever startloc, endloc, there exist a zone overlapping it.

Now you querry look like this

--compute the origin_destination matrix
DROP TABLE IF EXISTS  origin_destination_matrix ;
CREATE TABLE origin_destination_matrix  AS
SELECT t.trip_id, zone1 AS starting_zone, zone2 as ending_zone
FROM od1.trip_v1 as t
  INNER JOIN od1.taz AS zone1 ON ( ST_Intersects(zone1.geom ,t.startloc) =
TRUE )
  INNER JOIN od1.taz AS zone2 ON ( ST_Intersects(zone2.geom ,t.endloc) =
TRUE ) ;

--aggregate to get the number of trip from X to Y
SELECT count() as number_of_trip_going_from, starting_zone, ending_zone
FROM origin_destination_matrix
GROUP BY starting_zone, ending_zone ;

I would be supprised that you find something faster.

Please note that if some of my hypothesis are false, you can still manage :
 if you drop 1. : the aggregate becomes
SELECT count() as number_of_trip_going_from, l as starting , g as ending
FROM origin_destination_matrix  , LEAST(starting_zone,ending_zone) as
l, GREATEST(starting_zone,ending_zone)
as g
GROUP BY l, g ;

if you drop 2.  : you need a way to choose which zone is the correct one
when startloc or endloc is in several zone (example : the smallest area
zone)

the origin destination matrix computation becomes :
SELECT DISTINCT ON ( t.trip_id) t.trip_id, zone1 AS starting_zone, zone2 as
ending_zone
FROM od1.trip_v1 as t
  INNER JOIN od1.taz AS zone1 ON ( ST_Intersects(zone1.geom ,t.startloc) =
TRUE )
  INNER JOIN od1.taz AS zone2 ON ( ST_Intersects(zone2.geom ,t.endloc) =
TRUE )
ORDER BY ST_Area(zone1.geom) ASC, ST_Area(zone2.geom) ASC;

If you drop 3. , it is easy to deal with.
The dirty way is to add a polygon in "taz" that overlap all of you
startloc/endloc, you can call it 'UNKNOWN', then use the "drop 2.
hypothesis".

The clean way is to look in origin destination matrix, and get missing trip_id
(they are missing because they where no zone for startloc and/or endloc).
Then do something for theim (I don"t know what you want to do with it)

getting the missing trip_id  can be done with an EXCEPT for instance :

WITH missing_in_origin_destination AS ( --compute the missing trip, because
startloc and/or endloc have no corresponding zone
SELECT t.trip_id
FROM od1.trip_v1 as t
EXCEPT ALL
SELECT odm.trip_id
FROM origin_destination_matrix  AS odm
)
--your querry to do something with trip that are not in origin_destination
matrix

Cheers,
Rémi-C

2015-02-11 9:37 GMT+01:00 Trang Nguyen <Trang.Nguyen at inrix.com>:

>  Hi Birgit,
>
>
>
> Thanks very much.
>
> I’m not familiar with the “with” clause but will look into that. There are
> also indices on startts and z.uuid ( time range condition is ok, since
> these are timestamps and not days).
>
>
>
> I am also finding that a cause of slowness was due to the OR condition:
> ST_intersects(t.startloc, z.geom)
>
> or ST_intersects(t.endloc, z.geom)
>
>
>
> Splitting up the query  into two unions, one using a where on ST_intersects(t.startloc,
> z.geom) and the second on ST_intersects(t.endloc, z.geom), but your
> suggestion would get me closer to what I wanted in the final output.
>
>
>
> Best,
>
> Trang
>
>
>
> *From:* postgis-users-bounces at lists.osgeo.org [mailto:
> postgis-users-bounces at lists.osgeo.org] *On Behalf Of *Birgit Laggner
> *Sent:* Wednesday, February 11, 2015 12:20 AM
> *To:* postgis-users at lists.osgeo.org
> *Subject:* Re: [postgis-users] Question on optimizing slow geospatial
> query
>
>
>
> Hi Trang,
>
> I think, it could help to create btree indices on "startts" and "uuid",
> too, since you are using them in your where clause as a filter (a probably
> unnecessary question regarding your date filter: I would expect the result
> of "t.startts > '2015-01-16' and t.startts < '2015-01-17'" to be null
> because there are no days between the 16th and the 17th of january - but
> perhaps it was only an example...). And in general, my suggestion would be
> to reduce the use of st_intersects to the necessary mimimum.
> You could use the with clause for the filtration of your input data and
> afterwards double join the two tables first on the startloc and second on
> the endloc for the assignment of the origin and the destination zone. Then
> group by origin and destination zones while counting your trips and you
> should have your end result.
>
> Here is how I would imagine the query:
>
> with
> t as (select trip_id, startloc, endloc from od1.trip_v1 where startts
> between 'minimum start date' and 'maximum start date'),
> z as (select zone from od1.taz where uuid in ('kansas_303', 'kansas_601',
> 'kansas_603', etc))
>
> select  count(t.trip_id) as number_of_trips, orig.zone as orig_zone,
> dest.zone as dest_zone from t left join z as orig on
> st_intersects(t.startloc, z.geom) left join z as dest on
> st_intersects(t.endloc, z.geom) group by orig.zone, dest.zone;
>
>
>
> However, I am not sure about how the gist indices work together with the
> subselects of the with clause...
>
> Hope this helps,
>
> Birgit.
>
>
>  Am 11.02.2015 um 08:19 schrieb Trang Nguyen:
>
> Hi,
>
>
>
> I am a newbie to Postgres/PostGIS and have a long running query that I
> would like to optimize.
>
> There are two tables (trip and zone) that I am joining in the query, one
> which has “startloc” and “endloc” columns with type Geometry(Point) and
> other which contains a Geometry(MultiPolygon). There are GIST indexes on
> all above columns:
>
>
>
> CREATE TABLE od1.trip_v1
>
> (
>
>   pkey bigint NOT NULL,
>
>   trip_id character varying,
>
>   startts timestamp without time zone,
>
>   endts timestamp without time zone,
>
>   startloc geometry(Point),
>
>   endloc geometry(Point),
>
>   …
>
> )
>
>
>
> CREATE TABLE od1.taz
>
> (
>
>   uuid character varying NOT NULL,
>
>   zone character varying NOT NULL,
>
>   createdts timestamp without time zone NOT NULL,
>
>   updatedts timestamp without time zone NOT NULL,
>
>   geom geometry(MultiPolygon) NOT NULL,
>
>   CONSTRAINT taz_pkey PRIMARY KEY (uuid)
>
> )
>
>
> I’m interested in building a matrix that, for a given set of input zones,
> returns trips along with their start and end zones. Output looks like:
>
>
>
> 10 trips that start at Zone A, ends at Zone B
>
> 2 trips that start at Zone A, ends at Zone C
>
> 9 trips that start at Zone A, ends at other
>
> 13 trips that start at Zone C, ends at Zone D
>
>
>
> Since I am dealing with a large dataset (> 24 million records and
> growing), I was planning on writing a query that returns the trips grouped
> by each zone along with match condition (start, end or both) and doing
> aggregation on the client layer. I’m not sure whether this is the best
> approach but I expect that otherwise, I would end up having to write a very
> complex query to handle that type of aggregation.
>
>
>
> Even so, the current query is very slow with very high cost:
>
>
>
> SELECT t.trip_id,
>
> case
>
>                 when ST_intersects(t.startloc, z.geom) and
> ST_intersects(t.endloc, z.geom) then 'orig-dest'
>
>                 when ST_intersects(t.startloc, z.geom) then 'orig'
>
>                 when ST_intersects(t.endloc, z.geom) then 'dest'
>
>                 else 'none'
>
> end  as match_cond,
>
> z.zone from od1.trip_v1               t, od1.taz z
>
> where t.startts > '2015-01-16' and t.startts < '2015-01-17'
>
> and z.uuid in ('kansas_303', 'kansas_601', 'kansas_603', etc)
>
> and ST_intersects(t.startloc, z.geom)
>
> or ST_intersects(t.endloc, z.geom)
>
> group by z.zone, t.trip_id, match_cond;
>
>
>
> Explain plan:
>
> "Group  (cost=231446695055.73..245971533247.59 rows=14240037443
> width=3498)"
>
> "  ->  Sort  (cost=231446695055.73..231482295149.34 rows=14240037443
> width=3498)"
>
> "        Sort Key: z.zone, t.trip_id, (CASE WHEN ((t.startloc && z.geom)
> AND _st_intersects(t.startloc, z.geom) AND (t.endloc && z.geom) AND
> _st_intersects(t.endloc, z.geom)) THEN 'orig-dest'::text WHEN ((t.startloc
> && z.geom) AND _st_intersects(t.startloc, (...)"
>
> "        ->  Nested Loop  (cost=91.70..14401634128.24 rows=14240037443
> width=3498)"
>
> "              ->  Seq Scan on taz z  (cost=0.00..739.19 rows=4619
> width=3406)"
>
> "              ->  Bitmap Heap Scan on trip_v1 t  (cost=91.70..4151.26
> rows=453 width=107)"
>
> "                    Recheck Cond: ((startloc && z.geom) OR (endloc &&
> z.geom))"
>
> "                    Filter: (((startts > '2015-01-16 00:00:00'::timestamp
> without time zone) AND (startts < '2015-01-17 00:00:00'::timestamp without
> time zone) AND ((z.uuid)::text = ANY
> ('{kansas_303,kansas_601,kansas_603,kansas_604,kansas_10,kansas_11,kan
> (...)"
>
> "                    ->  BitmapOr  (cost=91.70..91.70 rows=2706 width=0)"
>
> "                          ->  Bitmap Index Scan on idx_trip_v1_startloc
> (cost=0.00..45.74 rows=1353 width=0)"
>
> "                                Index Cond: (startloc && z.geom)"
>
> "                          ->  Bitmap Index Scan on idx_trip_v1_endloc
> (cost=0.00..45.74 rows=1353 width=0)"
>
> "                                Index Cond: (endloc && z.geom)"
>
>
>
> Some help or suggestions on how to speed up the query would be much
> appreciated.
>
> Also, I currently don’t have a specific map projection defined on the geom
> columns so they are using the default of 0 in postgis. The points in both
> trip and zone geometries are lon/lat. Is this an issue for the ST_intersect?
>
>
>
> Thanks,
> Trang
>
>
>
>
>
>
>  _______________________________________________
>
> postgis-users mailing list
>
> postgis-users at lists.osgeo.org
>
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20150211/4f59fb50/attachment.html>