[postgis-users] Optimizer choosing slower strategy (Bitmap Heap vs Index)?

Giuseppe Broccolo g.broccolo.7 at gmail.com
Thu Jul 27 00:33:03 PDT 2017


Hi Dan,

2017-07-26 19:15 GMT+02:00 Dan Lyke <danlyke at flutterby.com>:

> On Wed, 26 Jul 2017 12:31:07 +0200
> Giuseppe Broccolo <g.broccolo.7 at gmail.com> wrote:
> > On the old dataset, the index is scanned through the bitmap: there are
> > two phases; in the first the index is used to build a bitmap, then the
> > query uses the bitmap itself to find the matching entries. This
> > strategy is generally chosen if the index scan would involve several
> > accesses to the same blocks of the index.
> [snip]
>
> Huh. So a possibility is that somehow the data ended up in disk
> proximity in the first database, and is widely spread out in the
> second? This assumption is based on the query mostly being about:
>
>   geometry &&
>   ST_SetSRID(ST_MakeBox2D(ST_MakePoint(...),ST_MakePoint(...)), 4326)
>
> (Basically, we're looking for a set of line segments within a bounding
> region).
>

It's not related to how the data is spread out (or not) on disk, but to how
the geometries are spatially distributed. PostgreSQL is smart enough to keep
data contiguous; more specifically in this case, if the data has been freshly
imported into the DB, the data/index blocks are filled sequentially.
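
As a quick sanity check on the physical ordering (a minimal sketch; 'line' is
the table from the plans below), pg_stats exposes a per-column correlation
between the physical row order and the column's value order, where values
close to 1 or -1 mean the rows are laid out contiguously:

    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'line';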

What I was wondering about is whether there is a reason why the scan goes
through the bitmap on one dataset and not on the other, i.e. whether one of
them forces the same blocks to be inspected repeatedly, which is what makes
the bitmap worthwhile.

Moreover, if there is enough RAM to hold the data in memory, the block
distribution on disk should not be a problem (anyway, since we are talking
about this, could you provide the dataset size / RAM ratio?).
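
For the record, a minimal way to get the numerator of that ratio (assuming
the table is called 'line', as in the plans below):

    SELECT pg_size_pretty(pg_total_relation_size('line'))       AS table_plus_indexes,
           pg_size_pretty(pg_database_size(current_database())) AS whole_database;

Compare those figures with the server's RAM (and with shared_buffers /
effective_cache_size).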


> The datasets are just two subsequent quarters of a combination of HERE
> and OpenStreetMap data. This particular table is just HERE road
> segments. The import process in both cases should be the same.
>
> One weirdness that I'm trying to figure out how to quantify: This
> particular table in the old quarter's data is 33GB, the new quarter's
> data is 70GB[1], and I'm trying to figure out why. The geometry as
> ST_AsText is 14G in the old quarter, new one is 16G, so not a huge
> change there. Similar difference in number of rows (< 20%, not
> enough to account for >2x data size)...
>
> [1] SELECT relname, relpages, reltuples::numeric,
>            pg_size_pretty(pg_table_size(oid))
>     FROM pg_class WHERE oid='...'::regclass;
>

Could this be related to more metadata being present in the new dataset (e.g.
more attributes per record)? I actually don't know the HERE datasets at all.

If I understood correctly, the new dataset includes 16GB of geospatial data
versus the 14GB of the old one? Well, ~14% more data shouldn't change the
planner's strategy... but a larger amount of metadata could spread the
geospatial data over a significantly higher number of blocks on disk, hence
the change in strategy.
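
As a hedged sketch, this is one way to see where the extra gigabytes live
(heap vs. TOAST vs. indexes); 'line' is assumed to be the table name:

    SELECT pg_size_pretty(pg_relation_size('line'))                          AS heap,
           pg_size_pretty(pg_table_size('line') - pg_relation_size('line')) AS toast_etc,
           pg_size_pretty(pg_indexes_size('line'))                          AS indexes,
           pg_size_pretty(pg_total_relation_size('line'))                   AS total;

If toast_etc dominates, the growth is in large attribute values rather than
in the geometries themselves.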


> Here's EXPLAIN ANALYZE, old one :
>
>  Sort  (cost=6802.93..6806.30 rows=1346 width=128) (actual
> time=132.710..133.295 rows=477 loops=1)
>    Sort Key: (st_distance_sphere('...'::geometry, st_geometryn(way, 1))),
> osm_id
>    Sort Method: quicksort  Memory: 62kB
>    ->  Bitmap Heap Scan on line  (cost=50.85..6732.98 rows=1346 width=128)
> (actual time=0.976..131.077 rows=477 loops=1)
>          Recheck Cond: (way && '...'::geometry)
>          Heap Blocks: exact=261
>          ->  Bitmap Index Scan on line_index  (cost=0.00..50.52 rows=1346
> width=0) (actual time=0.418..0.418 rows=477 loops=1)
>                Index Cond: (way && '...'::geometry)
>  Planning time: 15.924 ms
>  Execution time: 133.925 ms
> (10 rows)
>
>
> New one:
>
>  Sort  (cost=2595.88..2597.16 rows=509 width=127) (actual
> time=220.955..221.449 rows=505 loops=1)
>    Sort Key: (st_distance_sphere('...'::geometry, st_geometryn(way, 1))),
> osm_id
>    Sort Method: quicksort  Memory: 64kB
>    ->  Index Scan using line_index on line  (cost=0.55..2573.00 rows=509
> width=127) (actual time=3.407..219.551 rows=505 loops=1)
>          Index Cond: (way && '...'::geometry)
>  Planning time: 0.469 ms
>  Execution time: 222.014 ms
> (7 rows)
>

Well, the number of matched rows looks to be almost the same in both cases
(505 vs. 477), confirming that the geospatial content of the two versions of
the dataset has only small changes/new entries. This could be confirmed by
comparing the index sizes on the two datasets.
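
For instance (run on both databases and compare; 'line_index' is the index
from your plans):

    SELECT pg_size_pretty(pg_relation_size('line_index')) AS index_size;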

Most of the time in the new dataset (3.407..219.551 ms) is spent in the index
scan, a larger amount of time compared with building the bitmap and scanning
the heap through it (0.418..131.077 ms) in the old one. This is because the
index scan strategy (unlike the index-only scan one) needs to inspect the
table blocks too, to check row visibility, and the estimated cost (2573)
suggests that a lot of blocks are inspected in the new dataset during the
index scan.
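
To quantify that rather than guess, EXPLAIN's BUFFERS option reports how many
blocks each plan node actually touches. A minimal sketch (the bounding box
and reference point are made-up coordinates; substitute your own):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT osm_id, way
    FROM line
    WHERE way && ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-122.5, 37.7),
                                         ST_MakePoint(-122.3, 37.9)), 4326)
    ORDER BY ST_Distance_Sphere(ST_SetSRID(ST_MakePoint(-122.4, 37.8), 4326),
                                ST_GeometryN(way, 1)), osm_id;

Comparing the "Buffers: shared hit=... read=..." lines between the two
databases tells you directly how many more blocks the new one visits.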

Anyway, the index scan is probably the proper plan for the planner to choose
(you can check this by turning index scans off and seeing whether the
execution time gets higher when the bitmap index/heap scan is used for the
query on the new dataset): check why the new dataset is so large (70GB vs.
33GB), as this could be the real reason for the longer query execution time.
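
Concretely, a sketch of that check (enable_indexscan is a standard planner
setting; bitmap scans stay enabled, so the planner falls back to them):

    SET enable_indexscan = off;
    EXPLAIN ANALYZE
    SELECT osm_id, way
    FROM line
    WHERE way && ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-122.5, 37.7),
                                         ST_MakePoint(-122.3, 37.9)), 4326);
    RESET enable_indexscan;

If the forced bitmap plan turns out slower, the planner's choice of the plain
index scan was right.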

Hope this helps again :)
Giuseppe.