<div dir="ltr"><div class="gmail_default" style="font-family:monospace,monospace">Hey,<br></div><div class="gmail_default" style="font-family:monospace,monospace">This is an interesting problem for sure.<br></div><div class="gmail_default" style="font-family:monospace,monospace">AS always in the "slow query" topic,<br></div><div class="gmail_default" style="font-family:monospace,monospace">improvement can come from a number of places<br></div><div class="gmail_default" style="font-family:monospace,monospace"> - hardware : are your tables on SSD, does your index fit in RAM<br></div><div class="gmail_default" style="font-family:monospace,monospace"> - postgres tuning : you have customised postgresql.conf to fit your hardware<br></div><div class="gmail_default" style="font-family:monospace,monospace"> - query writing : your query is optimally written (looks fine here)<br></div><div class="gmail_default" style="font-family:monospace,monospace"> - data model (detailed after)<br><br></div><div class="gmail_default" style="font-family:monospace,monospace">IF you want fast results, you need a good indexing, and this is possible if you exploit the sparsness of your data.<br></div><div class="gmail_default" style="font-family:monospace,monospace">So where is the sparsness?<br></div><div class="gmail_default" style="font-family:monospace,monospace"><br><img src="https://github.com/Remi-C/neuron_in_postgis/raw/master/neuron_in_postgis_conceptual_propositions.png" style="margin-right: 0px;" width="390" height="1462"><br><br></div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace"> * Scenario A.1: your usual slices affects a majority of trees<br>  - you don't really need an index on X,Y, but mainly on Z. <br></div><div class="gmail_default" style="font-family:monospace,monospace">    The idea is to separate your index into index(X,Y) + index(Z)<br></div><div class="gmail_default" style="font-family:monospace,monospace">     Try adding an index on GIST(numrange(ST_Zmin,ST_ZMax)<wbr>) (you could add it anyway)<br></div><div class="gmail_default" style="font-family:monospace,monospace">     and an index on GIST(geom) (aka only 2D).<br></div><div class="gmail_default" style="font-family:monospace,monospace"> * Scenario A.2 : your usual slices affect few trees<br></div><div class="gmail_default" style="font-family:monospace,monospace">  - you could easily perform a 2 part query, where the first part use an index on tree bounding box, and the second perform the query on edge. This require to create and maintain a tree_bounding_box table with appropriate indexes, which is quite easy with triggers Ex (not optimised):<br></div><div class="gmail_default" style="font-family:monospace,monospace">   WITH filtering_by_tree AS (<br></div><div class="gmail_default" style="font-family:monospace,monospace">     SELECT distinct tree_id <br></div><div class="gmail_default" style="font-family:monospace,monospace">     FROM my_sync_table_of_tree_<wbr>bounding_boxes as tree<br></div><div class="gmail_default" style="font-family:monospace,monospace">     WHERE tree.bbox &&& your_slice<br></div><div class="gmail_default" style="font-family:monospace,monospace">   ) SELECT edge_id<br></div><div class="gmail_default" style="font-family:monospace,monospace">     FROM treenode_edge  AS te<br></div><div class="gmail_default" style="font-family:monospace,monospace">     WHERE te.geom &&& your_slice <br>      AND EXISTS (SELECT 1 FROM filtering_by_tree As ft WHERE ft.tree_id = te.tree_id)<br></div><div class="gmail_default" style="font-family:monospace,monospace">  - you could use pgpointcloud to similar effect by construction a patch per tree.<br><br></div><div class="gmail_default" style="font-family:monospace,monospace">Now something is very ambiguous in your data, and it could have major consequences :<br> * Are the edges between nodes that represent neuron __only logical__ (they are straight lines, always the same Zmax-Zmin, and represent a conceptual graph), or are they polylines that represent the __actual__ neurone path?<br><br></div><div class="gmail_default" style="font-family:monospace,monospace"> * Scenario B.1.1 : edges are only logical and have a constant Zmax-Zmin, i.e. nodes have quantified Z :<br></div><div class="gmail_default" style="font-family:monospace,monospace">  - you don't need edges at all for indexing, you should work only on nodes (i.e. index nodes IN Z + (X,Y), find nodes, then find relevant edges). The great advantage in this case is that you can use the full power of pgpointcloud to create meaningfull patches (of variables size for instance, see <a href="http://www.sciencedirect.com/science/article/pii/S092427161630123X" target="_blank">this</a>, 3.1, Adapt patch grouping rules ), which means scalling in the range of billions of points possible.<br><br></div><div class="gmail_default" style="font-family:monospace,monospace">  - you could try to think of using the dual of your current graph (edges become nodes, nodes become edges).<br><br></div><div class="gmail_default" style="font-family:monospace,monospace"> * Scenario B.1.2 : edges are only logical, but nodes don't have a constant Zmax-Zmin. : you could insert (virtual) nodes so you are again in the same case as scenario B.1.1<br></div><div class="gmail_default" style="font-family:monospace,monospace"> <br></div><div class="gmail_default" style="font-family:monospace,monospace"> * Scenario B.2 : edges are actual neurone path. <br></div><div class="gmail_default" style="font-family:monospace,monospace">  - you could construct intermediary objects (groups of neurone) of variable size (those within cubes), so as to have again a 2 steps query : first look for groups of neuron, then for neurons inside<br><br></div><div class="gmail_default" style="font-family:monospace,monospace"> * Scenario C : your data don't have obvious sparsity. <br></div><div class="gmail_default" style="font-family:monospace,monospace">  - use the same strategy you currently use, but scale by partitionning your data into numerous tables.<br></div><div class="gmail_default" style="font-family:monospace,monospace">   You can see the partitionning entry of the <a href="https://www.postgresql.org/docs/current/static/ddl-partitioning.html">postgres manual.</a> <br></div><div class="gmail_default" style="font-family:monospace,monospace">   This would be much easier to create and maintain if you could partition on only Z dimension.<br></div><div class="gmail_default" style="font-family:monospace,monospace">   Note that you will need to duplicate (and then de-duplicate) some edges.<br><br><br></div><div class="gmail_default" style="font-family:monospace,monospace">It's hard to recommend anything without further information on your data/your requirement/your ressources.<br><br></div><div class="gmail_default" style="font-family:monospace,monospace">Cheers,<br></div><div class="gmail_default" style="font-family:monospace,monospace">Rémi-C<br></div><div class="gmail_default" style="font-family:monospace,monospace"><br><br></div><div class="gmail_default" style="font-family:monospace,monospace"> <br><br><br></div><div class="gmail_default" style="font-family:monospace,monospace"><br><br></div><div class="gmail_extra"><br><div class="gmail_quote">2017-01-17 22:43 GMT+01:00 Tom Kazimiers <span dir="ltr"><<a href="mailto:tom@voodoo-arts.net" target="_blank">tom@voodoo-arts.net</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>

<br>

I am using Postgres 9.6 with PostGIS 2.3 and want to improve the performance of a bounding box intersection query on a large set of lines. I asked the same questions two month ago on <a href="http://gis.stackexchange.com" rel="noreferrer" target="_blank">gis.stackexchange.com</a>, but didn't get any reply. Therefore, please excuse me for cross-posting, but I figured this list would be a good place to ask. This is the original post:<br>

<br>

 <a href="https://gis.stackexchange.com/questions/215762" rel="noreferrer" target="_blank">https://gis.stackexchange.<wbr>com/questions/215762</a><br>

<br>

Problem: We store about 6 million lines/edges in a 3D space and use an axis<br>

aligned bounding box to find all intersecting (and included) lines. This works<br>

well, but we want to improve the query time for larger point sets.<br>

<br>

Context: Tree-like 3D structures represent neurons, each node has a parent node<br>

or it is the root. At the moment we deal with about 15 million nodes, grouped<br>

into 150000 trees (many > 10000 nodes). I want to improve existing performance<br>

bottlenecks with bigger result sets plus I plan to scale this setup to 10-100x<br>

the nodes. Typical query bounding boxes are rather slices than boxes, i.e. they expand any range in XY, but only a short distance in Z, but it could in principal any dimension that is "flat".<br>

<br>

Setup: The table storing edges looks like this:<br>

<br>

 =>\d+ treenode_edge<br>

                   Tabelle »public.treenode_edge«<br>

         Spalte   |          Typ          | Attribute  ------------+-----------------<wbr>------+-----------<br>

  id         | bigint                | not null<br>

  project_id | integer               | not null<br>

  edge       | geometry(LineStringZ) |  Indexe:<br>

          "treenode_edge_pkey" PRIMARY KEY, btree (id)<br>

          "treenode_edge_gix" gist (edge gist_geometry_ops_nd)<br>

          "treenode_edge_project_id_inde<wbr>x" btree (project_id)<br>

<br>

Note that there is a 3D index for the edge column in place (treenode_edge_gix).<br>

<br>

Current timing: To request 2000 nodes with an typical bounding box size I use<br>

the following query:<br>

<br>

 SELECT <a href="http://te.id" rel="noreferrer" target="_blank">te.id</a><br>

 FROM treenode_edge te<br>

                                                                 -- left bottom z2, right top z1    WHERE te.edge &&& 'LINESTRINGZ( -537284.0  699984.0 84770.0,<br>

                                                                  1456444.0 -128432.0 84735.0)'<br>

                                   -- left top halfz, right top halfz,<br>

                                   -- right bottom halfz, left bottom halfz,<br>

                                   -- left top halfz; halfz (distance)<br>

 AND ST_3DDWithin(te.edge, ST_MakePolygon(ST_GeomFromText<wbr>(<br>

          'LINESTRING( -537284.0 -128432.0 84752.5, 1456444.0 -128432.0 84752.5,<br>

                                   1456444.0  699984.0 84752.5, -537284.0  699984.0 84752.5,<br>

                                   -537284.0 -128432.0 84752.5)')), 17.5)<br>

 AND te.project_id = 1<br>

 LIMIT 2000;<br>

<br>

This takes about 900ms, statistics are up-to-date. This is not so bad, but I look for strategies to make this much faster still. The &&& operator filters already most of the existing edges by bounding box test (using index) so that ST_3DWithin only needs to check the edges that are most likely part of the result.<br>

<br>

The query plan looks like this:<br>

<br>

 Limit  (cost=48.26..4311.24 rows=70 width=8) (actual time=856.261..864.208 rows=2000 loops=1)<br>

         Buffers: shared hit=20470<br>

         ->  Bitmap Heap Scan on treenode_edge te  (cost=48.26..4311.24 rows=70 width=8) (actual time=856.257..863.974 rows=2000 loops=1)<br>

                   Recheck Cond: (edge &&& '01020000800200000000000000886<wbr>520C100000000A05C2541000000002<wbr>0B2F440000000003C3936410000000<wbr>0005BFFC000000000F0AFF440'::ge<wbr>ometry)<br>

                   Filter: ((edge && '01030000800100000005000000000<wbr>00000AB6520C100000000185CFFC00<wbr>0000000F0AFF44000000000AB6520C<wbr>100000000C35C254100000000F0AFF<wbr>440000000804D39364100000000C35<wbr>C25410000000020B2F440000000804<wbr>D39364100000000185CFFC00000000<wbr>020B2F44000000000AB6520C100000<wbr>000185CFFC000000000F0AFF440'::<wbr>geometry) AND (project_id = 1) AND ('0103000080010000000500000000<wbr>000000886520C100000000005BFFC0<wbr>0000000008B1F440000000003C3936<wbr>4100000000005BFFC00000000008B1<wbr>F440000000003C39364100000000A0<wbr>5C25410000000008B1F44000000000<wbr>886520C100000000A05C2541000000<wbr>0008B1F44000000000886520C10000<wbr>0000005BFFC00000000008B1F440':<wbr>:geometry && st_expand(edge, '17.5'::double precision)) AND _st_3ddwithin(edge, '01030000800100000005000000000<wbr>00000886520C100000000005BFFC00<wbr>000000008B1F440000000003C39364<wbr>100000000005BFFC00000000008B1F<wbr>440000000003C39364100000000A05<wbr>C25410000000008B1F440000000008<wbr>86520C100000000A05C25410000000<wbr>008B1F44000000000886520C100000<wbr>000005BFFC00000000008B1F440'::<wbr>geometry, '17.5'::double precision))<br>

                   Heap Blocks: exact=1816<br>

                   Buffers: shared hit=20470<br>

                   ->  Bitmap Index Scan on treenode_edge_gix  (cost=0.00..48.25 rows=1044 width=0) (actual time=855.795..855.795 rows=2856 loops=1)<br>

                                 Index Cond: (edge &&& '01020000800200000000000000886<wbr>520C100000000A05C2541000000002<wbr>0B2F440000000003C3936410000000<wbr>0005BFFC000000000F0AFF440'::ge<wbr>ometry)<br>

                                 Buffers: shared hit=18654<br>

  Planning time: 3.467 ms<br>

  Execution time: 864.614 ms<br>

<br>

This shows clearly that the big bounding box test on all existing edges is the<br>

problem, though it already makes use of the GiST index. Are there maybe other indices available that can discard many far-away bounding boxes without looking at them, i.e. that useses a hierarchy like an octree? So far I didn't find any, but it looked like the SP-GiST interface might be useful if one were to create a new one.<br>

<br>

Alternatives: There is one approach that I currently implement to improve this,<br>

but I'd like to hear if someone has alternative, possibly better or other<br>

suggestions?<br>

<br>

This is what I tried: create a new table that would partition the space<br>

into many cubes and store for each all intersecting edges. New, updated or<br>

deleted nodes would then lead to an update of the relevant cubes, something<br>

that would probably have to run asynchronously (didn't implement the update yet). Queries then pre-filtered the space by only looking at relevant cubes instead of using the &&& operator with the bounding box of each edge. For a range of cubse sizes this improved query time quite a bit, the above query would run in about a third of thet time above. However, I realize cell sizes will behave differently for different dataset, but I didn't try to implement a dynamic cube size update, yet.  Thinking about it, it makes me wonder if it wouldn't be best if I would implement a octree---table based or as an extension.<br>

<br>

Now I also saw the pointcloud extension and its indexing looked very<br>

interesting (octree), but I couldn't really see how it could work in my use case. But I would appreciate ideas if you see a way to make it work.<br>

<br>

Best,<br>

Tom<br>

______________________________<wbr>_________________<br>

postgis-users mailing list<br>

<a href="mailto:postgis-users@lists.osgeo.org" target="_blank">postgis-users@lists.osgeo.org</a><br>

<a href="https://lists.osgeo.org/mailman/listinfo/postgis-users" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailma<wbr>n/listinfo/postgis-users</a></blockquote></div><br></div></div>