[pdal] [EXTERNAL] Re: Oracle PDAL queries not scaling
Oscar Martinez Rubi
o.rubi at esciencecenter.nl
Wed Aug 12 03:40:40 PDT 2015
Hi,
It is not that significant. For example, the 23090M data set just
finished loading.
Total time was 3816 seconds:
- Loading the 1400 files (16 simultaneous processes) took 3660 seconds
- Creating the primary key index (obj_id/blk_id) took 2 seconds
- Inserting into USER_SDO_GEOM_METADATA took less than 1 second
- Creating the spatial index took 35 seconds
- Analyzing the table took 4 seconds
- GATHER_TABLE_STATS took 112 seconds
So, it is reasonable.
O.
On 12-08-15 12:03, Peter van Oosterom wrote:
> Hi Oscar,
>
> Great that PDAL/Oracle now scales well!
>
> It might be that for large datasets the analyze+stats step becomes an
> expensive part of the load time.
> To avoid this step, an alternative would be to add a hint telling the
> query optimizer to use the index.
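>
> For example, something along these lines might do it (just a sketch,
> not tested; the alias l and the index name AHN_BLCK_SIDX are taken
> from Oscar's query and loading script below):
>
> SELECT /*+ INDEX(l AHN_BLCK_SIDX) */ l.obj_id, l.blk_id, l.points
> FROM   AHN_BLCK l, QUERY_POLYGONS g
> WHERE  SDO_FILTER(l.blk_extent, g.geom) = 'TRUE' AND g.id = 1;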
>
> However, perhaps the analyze+stats at block level are not too
> expensive (as they would be at point level), in which case the time
> needed might be negligible, and then it is better to keep them (as in
> your script now, and without the hint).
>
> It would be good to check how much time these steps take, and then we
> know what to do (especially for loading the complete AHN with 640
> billion points).
>
> Kind regards, Peter.
>
> Oscar Martinez Rubi wrote on 12-8-2015 at 11:53:
>> Hi,
>>
>> When you say that statistics are auto-gathered... by whom? PDAL or Oracle?
>> - In the PDAL OCI writer I do not see anything that does the analyze and
>> gather_stats. I would maybe put it just after creating the indexes,
>> and make it optional like the index creation, because you only want to
>> do that after loading the last file.
>> - In Oracle itself I have not disabled anything (as far as I
>> remember) and the statistics are indeed computed automatically after a
>> while. The problem is how long "this while" is; to be sure, and as you
>> recommend, it is maybe just better to do it yourself once you know
>> the loading is over.
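>>
>> One simple way to check when (or whether) Oracle has gathered them is
>> to look at LAST_ANALYZED (just a small sketch for the block table):
>>
>> SELECT table_name, last_analyzed
>> FROM   user_tables
>> WHERE  table_name = 'AHN_BLCK';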
>>
>> So, bottom line: I load all the files (with PDAL) with indexing
>> deactivated. Then, the steps I run after the last PDAL load are:
>>
>> - Create the primary key on obj_id/blk_id. This is the one you suggested
>> - Insert into USER_SDO_GEOM_METADATA the whole extent of all
>> the loaded data (from the different files and different PDAL runs)
>> - Create the spatial index
>> - Analyze and gather stats
>>
>> The SQL commands for that are:
>>
>> ALTER TABLE AHN_BLCK ADD CONSTRAINT AHN_BLCK_PK
>>   PRIMARY KEY (OBJ_ID, BLK_ID) USING INDEX TABLESPACE INDX;
>>
>> INSERT INTO USER_SDO_GEOM_METADATA VALUES ('AHN_BLCK', 'BLK_EXTENT',
>>   SDO_DIM_ARRAY(SDO_DIM_ELEMENT('X', 60000.0, 100000.0, 0.0001),
>>                 SDO_DIM_ELEMENT('Y', 425000.0, 475000.0, 0.0001)), 28992);
>>
>> CREATE INDEX AHN_BLCK_SIDX ON AHN_BLCK (BLK_EXTENT)
>>   INDEXTYPE IS MDSYS.SPATIAL_INDEX
>>   PARAMETERS ('TABLESPACE=INDX WORK_TABLESPACE=PCWORK
>>     LAYER_GTYPE=POLYGON SDO_INDX_DIMS=2 SDO_RTR_PCTFREE=0')
>>   PARALLEL 16;
>>
>> ANALYZE TABLE AHN_BLCK COMPUTE SYSTEM STATISTICS FOR TABLE;
>>
>> BEGIN
>>   DBMS_STATS.GATHER_TABLE_STATS('PDAL23090M', 'AHN_BLCK', NULL, NULL,
>>     FALSE, 'FOR ALL COLUMNS SIZE AUTO', 8, 'ALL');
>> END;
>>
>> After doing all these steps the query times are as expected, so I am
>> going to assume these are the proper steps. The times now scale and
>> are quite nice (right now I am still busy with the 23 billion point
>> dataset, but I guess it is safe to assume those will also be fine ;) )!
>>
>> Time [s]   pdal20M   pdal210M   pdal2201M   pdal23090M
>> --------   -------   --------   ---------   ----------
>> 01_0       0.63      0.37       0.41        -
>> 01_1       0.18      0.2        0.22        -
>> 02_0       1.18      1.51       1.54        -
>> 02_1       0.96      0.96       0.93        -
>> 03_0       0.24      0.25       0.25        -
>> 03_1       0.18      0.18       0.18        -
>> 04_0       1.49      1.53       1.44        -
>> 04_1       1.05      1.07       1.06        -
>> 05_0       0.66      0.78       0.76        -
>> 05_1       0.49      0.51       0.49        -
>> 06_0       1.42      1.61       1.64        -
>> 06_1       1.17      1.2        1.21        -
>> 07_0       1.6       2.12       2.15        -
>> 07_1       1.62      1.35       1.43        -
>>
>> Thanks for your help and suggestions!
>>
>> Regards,
>>
>> O.
>>
>>
>> On 12-08-15 00:30, Smith, Michael ERDC-RDE-CRREL-NH wrote:
>>> The statistics should actually auto-gather (unless you've disabled
>>> that part). It's done as part of the DBMS auto tasks and should
>>> run when the tables are stale, although this is more for a
>>> production-type operation. If you are doing testing, you absolutely
>>> should make sure your statistics are up to date, otherwise you will
>>> get bad access plans.
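>>>
>>> A quick way to check that the auto task is enabled and whether the
>>> block table is considered stale (just a sketch; it needs access to
>>> the DBA views, and AHN_BLCK is the block table from this thread):
>>>
>>> SELECT client_name, status
>>> FROM   dba_autotask_client
>>> WHERE  client_name = 'auto optimizer stats collection';
>>>
>>> SELECT table_name, last_analyzed, stale_stats
>>> FROM   user_tab_statistics
>>> WHERE  table_name = 'AHN_BLCK';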
>>>
>>> It's also recommended to set a unique index on your block table on
>>> Obj_ID/Blk_ID (although this is primarily used to individually
>>> select point clouds). The more info you can give the optimizer, the
>>> better your query will perform.
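>>>
>>> A minimal sketch of such an index (the name AHN_BLCK_UK is just an
>>> example; a primary key constraint on the same two columns achieves
>>> the same thing):
>>>
>>> CREATE UNIQUE INDEX AHN_BLCK_UK ON AHN_BLCK (OBJ_ID, BLK_ID)
>>>   TABLESPACE INDX;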
>>>
>>> You can (and should) run explain plans on your data access queries
>>> and see what estimates the optimizer returns. If they don't match
>>> what you expect, then you're probably feeding bad information to the
>>> optimizer.
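>>>
>>> For example, for the block selection query used in this thread (a
>>> sketch; DBMS_XPLAN.DISPLAY formats the plan from the plan table):
>>>
>>> EXPLAIN PLAN FOR
>>>   SELECT l.obj_id, l.blk_id, l.blk_extent, l.points
>>>   FROM   AHN_BLCK l, QUERY_POLYGONS g
>>>   WHERE  SDO_FILTER(l.blk_extent, g.geom) = 'TRUE' AND g.id = 1;
>>>
>>> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);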
>>>
>>> Mike
>>>
>>> ----
>>> Michael Smith
>>> US Army Corps
>>> Remote Sensing GIS/Center
>>> michael.smith at usace.army.mil
>>>
>>>
>>> From: <pdal-bounces at lists.osgeo.org> on behalf of Oscar Martinez
>>> Rubi <o.rubi at esciencecenter.nl>
>>> Date: Tuesday, August 11, 2015 at 11:04 AM
>>> To: Peter van Oosterom <P.J.M.vanOosterom at tudelft.nl>,
>>> Albert Godfrind <albert.godfrind at oracle.com>,
>>> "pdal at lists.osgeo.org" <pdal at lists.osgeo.org>
>>> Cc: Theo Tijssen <T.P.M.Tijssen at tudelft.nl>,
>>> "mike.horhammer at oracle.com" <mike.horhammer at oracle.com>
>>> Subject: [EXTERNAL] Re: [pdal] Oracle PDAL queries not scaling
>>> Resent-From: Michael Smith <michael.smith at usace.army.mil>
>>>
>>> Hi,
>>>
>>> What I did now was to force statistics gathering:
>>>
>>> ANALYZE TABLE AHN_BLCK compute system statistics for table;
>>>
>>> BEGIN
>>>   dbms_stats.gather_table_stats('PDAL20M', 'AHN_BLCK', NULL, NULL,
>>>     FALSE, 'FOR ALL COLUMNS SIZE AUTO', 8, 'ALL');
>>> END;
>>>
>>> And after doing this the queries are scalable, so in this way I
>>> do not need to wait for the DB to learn...
>>>
>>> Regards,
>>>
>>> O.
>>>
>>>
>>> On 11-08-15 16:51, Peter van Oosterom wrote:
>>>> Hi Oscar,
>>>>
>>>> It feels like when you fetch the data, this is based on a query
>>>> execution plan that does a full table scan to get the blocks.
>>>> Even if there is an index, the database may not use it.
>>>> However, the database may notice that the actual query
>>>> execution was disappointing (by collecting statistics), and
>>>> after repeating the same tests the database changed its
>>>> behaviour and does scale well.
>>>>
>>>> [others: this was not in Oscar's email, but after repeating
>>>> the test the data was fetched in about 0.02 seconds for all
>>>> sizes 20M, 210M and 2201M. So the fetching for the small
>>>> dataset also becomes significantly faster: 0.02 vs. 0.5
>>>> seconds.]
>>>>
>>>> It would be good to see the actual query execution plan or force
>>>> the database to use the index (with a hint). My hands-on
>>>> practical Oracle syntax knowledge is too limited to give exact
>>>> hints on how to do this, but perhaps others can help here.
>>>>
>>>> Kind regards, Peter.
>>>>
>>>>
>>>> On 11-8-2015 16:26, Oscar Martinez Rubi wrote:
>>>>> Hi,
>>>>>
>>>>> I have investigated this issue a bit more.
>>>>>
>>>>> I wanted to see what data the OCI reader actually reads,
>>>>> so I executed the pre-selection query outside of PDAL, in
>>>>> Python. I ran the attached script (freshly after loading the data).
>>>>>
>>>>> The script runs the exact same query as PDAL does, twice, and
>>>>> prints for each run the number of returned blocks, the time
>>>>> spent in the query, and the time spent fetching the results.
>>>>>
>>>>> The exact query is:
>>>>>
>>>>> SELECT
>>>>> l."OBJ_ID", l."BLK_ID", l."BLK_EXTENT", l."BLK_DOMAIN",
>>>>> l."PCBLK_MIN_RES", l."PCBLK_MAX_RES", l."NUM_POINTS",
>>>>> l."NUM_UNSORTED_POINTS", l."PT_SORT_DIM", l."POINTS", b.pc
>>>>> FROM
>>>>> AHN_BLCK l, AHN_BASE b, QUERY_POLYGONS g
>>>>> WHERE
>>>>> l.obj_id = b.id
>>>>> AND
>>>>> SDO_FILTER(l.blk_extent,g.geom) = 'TRUE' AND g.id = 1;
>>>>>
>>>>> The results I get are:
>>>>>
>>>>>                #blocks  query time [s]     fetch time [s]
>>>>> 20M   run 1        12   0.113833904266     0.571593046188
>>>>> 20M   run 2        12   0.000102996826172  0.500910997391
>>>>> 210M  run 1        13   0.0586049556732    5.09832000732
>>>>> 210M  run 2        13   0.000245094299316  5.05038785934
>>>>> 2201M run 1        13   0.070690870285     52.4960811138
>>>>> 2201M run 2        13   0.000225067138672  53.1006689072
>>>>>
>>>>> So, even though the query times and the number of returned
>>>>> blocks are similar, the fetch times are not. We can see the
>>>>> scaling issue there. Somehow the fetching is much more
>>>>> expensive (10x) when there are 10x more points.
>>>>>
>>>>> I also noticed that after doing queries for a while the times
>>>>> get much better and scale, even when I do new queries with
>>>>> other polygons. So the first queries suffer from scaling issues
>>>>> but later it gets better.
>>>>>
>>>>> Any idea why?
>>>>>
>>>>> Regards,
>>>>>
>>>>> O.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 10-08-15 18:40, Albert Godfrind wrote:
>>>>>> As with many tools that access an Oracle database, we lack the
>>>>>> ability to see what actually happens in the database at a
>>>>>> detailed level, i.e. which actual queries are sent, and how
>>>>>> the database executes them in terms of CPU use, logical and
>>>>>> physical I/Os, network throughput and latency.
>>>>>>
>>>>>> So I think it is important to add some debugging / tracing
>>>>>> facility to let me see what happens:
>>>>>>
>>>>>> 1) An option to make PDAL (actually the OCI driver here) log
>>>>>> each SQL statement it executes, together with the elapsed
>>>>>> time and the number of rows (blocks) fetched. Obviously we
>>>>>> have that statement in the input XML file, but a trace would
>>>>>> put everything in a single log and include proper measurements.
>>>>>>
>>>>>> 2) More importantly: an option to make the OCI driver enable
>>>>>> SQL tracing at the database side. This is simple to do by
>>>>>> just issuing an “ALTER SESSION …” statement before running
>>>>>> the queries. The resulting trace will show all details about
>>>>>> execution times as well as resource consumption (CPU and IO)
>>>>>> and wait times. That could be added as an option in the XML
>>>>>> file. Or maybe extend the XML file with the option to specify
>>>>>> a SQL statement to be executed before each query (we could
>>>>>> then use that to manually add the ALTER SESSION statement).
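>>>>>>
>>>>>> As a sketch, something like this around the block queries would
>>>>>> enable extended SQL tracing (the trace file identifier is just an
>>>>>> example; level 12 includes wait events and bind values):
>>>>>>
>>>>>> ALTER SESSION SET TRACEFILE_IDENTIFIER = 'pdal_oci';
>>>>>> ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';
>>>>>> -- ... run the block selection queries here ...
>>>>>> ALTER SESSION SET EVENTS '10046 trace name context off';
>>>>>>
>>>>>> The resulting trace file lands in the database trace directory and
>>>>>> can be summarised with tkprof.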
>>>>>>
>>>>>> The resulting trace can help isolate the bottleneck as one of:
>>>>>>
>>>>>> 1) the I/Os in the database, to fetch the blocks from disk
>>>>>> (mostly I/O)
>>>>>> 2) the network time to pass the blocks to the PDAL client
>>>>>> (network throughput and latency)
>>>>>> 3) the time to process the blocks in the PDAL client (mostly CPU)
>>>>>>
>>>>>> Albert
>>>>>>
>>>>>>> On 5-Aug-2015, at 12:26, Oscar Martinez Rubi
>>>>>>> <o.rubi at esciencecenter.nl <mailto:o.rubi at esciencecenter.nl>>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I did a test to see how well Oracle with PDAL scales with
>>>>>>> bigger data sets. I had 3 self-contained datasets with 20M,
>>>>>>> 210M and 2201M points. I loaded them into different Oracle
>>>>>>> DBs with PDAL and laz-perf, and for each of them I ran 7
>>>>>>> queries (via a PDAL pipeline that preselects blocks, applies
>>>>>>> a crop and then writes to a LAS file).
>>>>>>>
>>>>>>> The results are in the attached file.
>>>>>>>
>>>>>>> Regarding the loading, for the 20M I only used one core (it
>>>>>>> is only one file) while for the others I used 16 cores, i.e.
>>>>>>> 16 simultaneous PDAL instances loading data into Oracle. I
>>>>>>> opened an issue on GitHub because I noticed that in some of
>>>>>>> the runs the size that I got was too large, and I do not know
>>>>>>> what caused that. The attached numbers are from runs where
>>>>>>> everything seemed to work and the sizes were as expected.
>>>>>>>
>>>>>>> This message, though, is about the queries. Each query is
>>>>>>> run twice in each DB. As you can see in the results file,
>>>>>>> for 10x more points in the data set the queries are 10x
>>>>>>> slower, at least for the first run (with the 2201M the
>>>>>>> second run is much faster but this does not happen with the
>>>>>>> 210M).
>>>>>>>
>>>>>>> Find also attached one of the XML files that I used for the
>>>>>>> queries (the example is for query1). Note that the geometry is
>>>>>>> previously inserted in Oracle so I can use it to pre-filter
>>>>>>> blocks with the query option in the OCI reader.
>>>>>>>
>>>>>>> First I thought that maybe the query option in the OCI reader
>>>>>>> in the XML was ignored and that all the blocks of the
>>>>>>> dataset were being processed by PDAL (that would explain
>>>>>>> queries being 10x slower with 10x more points), but I ran a
>>>>>>> PDAL pipeline for query1 with verbose output and I saw that
>>>>>>> the crop filter "only" processed 120000 points, which makes
>>>>>>> sense given that the region of query 1 only has 74818 points.
>>>>>>> Or maybe the crop filter still processes all the block extents
>>>>>>> but only opens and decompresses the points of the overlapping
>>>>>>> ones?
>>>>>>>
>>>>>>> Any idea what is happening?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> O.
>>>>>>> <results.txt><query1.xml>
>>>>>>
>>>>>> --
>>>>>> Albert Godfrind | Geospatial technologies
>>>>>> Tel: +33 4 93 00 80 67 | Mobile: +33 6 09 97 27 23 | Skype: albert-godfrind
>>>>>> Oracle Server Technologies
>>>>>> 400 Av. Roumanille, BP 309 | 06906 Sophia Antipolis cedex | France
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Peter van Oosterom    P.J.M.vanOosterom at tudelft.nl
>>>> Section GIS technology (room 00-west-520), Department OTB
>>>> Faculty of Architecture and the Built Environment, TU Delft
>>>> tel (+31) 15 2786950   Julianalaan 134, 2628 BL Delft, NL
>>>> fax (+31) 15 2784422   P.O. Box 5043, 2600 GA Delft, NL
>>>> http://geomatics.tudelft.nl   MSc Geomatics
>>>> http://www.msc-gima.nl        MSc GIMA (Geo-Info Management&Appl)
>>>> http://www.gdmc.nl
>>>>
>>>
>>
>
>
> --
> Peter van Oosterom    P.J.M.vanOosterom at tudelft.nl
> Section GIS technology Department OTB
> Faculty of Architecture and the Built Environment, TU Delft
> tel (+31) 15 2786950 Julianalaan 134, 2628 BL Delft, NL
> fax (+31) 15 2784422 P.O. Box 5030, 2600 GA Delft, NL
> http://geomatics.tudelft.nl MSc Geomatics
> http://www.msc-gima.nl MSc GIMA (Geo-Info Management&Appl)
> http://www.gdmc.nl
>