[pdal] [EXTERNAL] Re: Oracle PDAL queries not scaling

Albert Godfrind albert.godfrind at oracle.com
Wed Aug 12 05:38:29 PDT 2015


By default, stats gathering happens on a statistical sample of the data, and it typically does not take very long.

It is done automatically by the database, which runs a daily job to collect stats for those tables whose cardinality has changed significantly. By default the job runs at 22:00, but your DBA may have changed that or even disabled it.
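
If you want to verify this, here is a quick sketch (it assumes you can query the DBA views on a reasonably recent Oracle release) that shows whether the automatic optimizer stats task is enabled and when its maintenance windows open:

SELECT client_name, status
  FROM dba_autotask_client
 WHERE client_name = 'auto optimizer stats collection';

SELECT window_name, next_start_date, duration
  FROM dba_scheduler_windows;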

I still wonder why the lack of stats impacts that query. We do not keep statistics on spatial distribution before 12c, and we use a fixed selectivity that makes the spatial index appealing to the optimizer in pretty much all cases, especially for such simple filter queries. Stats start playing a role when queries join multiple tables and combine multiple selection criteria.

But then, having up-to-date statistics is a better solution than forcing a query plan through hints.

Albert

--
Albert Godfrind
+33 6 09 97 27 23
Sent from my iPhone

> On 12 août 2015, at 12:03, Peter van Oosterom <P.J.M.vanOosterom at tudelft.nl> wrote:
> 
> Hi Oscar,
> 
> Great that PDAL/Oracle now scales well!
> 
> It might be that for large datasets the analyze+stats step becomes an expensive part of the load time.
> To avoid this step, an alternative would be to add a hint telling the query optimizer to use the index.
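> 
> As a rough sketch (untested here; it assumes the spatial index is the AHN_BLCK_SIDX created in Oscar's script below), such a hint could look like:
> 
> SELECT /*+ INDEX(l AHN_BLCK_SIDX) */ l.*   -- index name taken from Oscar's script below
> FROM AHN_BLCK l, QUERY_POLYGONS g
> WHERE SDO_FILTER(l.blk_extent, g.geom) = 'TRUE' AND g.id = 1;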
> 
> However, perhaps the analyze+stats at block level are not too expensive (as they would
> be at point level), and in that case the time needed might be negligible; then it is
> better to keep them (as in your script now, without a hint).
> 
> It would be good to check how much time these steps take; then we know what to do
> (especially for loading the complete AHN with 640 billion points).
> 
> Kind regards, Peter.
> 
> Oscar Martinez Rubi schreef op 12-8-2015 om 11:53:
>> Hi,
>> 
>> When you say that statistics are auto-gathered... by whom? PDAL or Oracle?
>>  - In the PDAL OCI writer I do not see anything that does the analyze and gather_stats. I would maybe put it just after creating the indexes, and make it optional, like the index creation, because you only want to do that after loading the last file.
>>  - In Oracle itself I have not disabled anything (as far as I remember) and the stats are indeed computed automatically after a while; the problem is knowing how long "a while" is. To be sure, and as you recommend, it is probably better to do it yourself once you know the loading is over.
>> 
>> So, bottom line: I load all the files (with PDAL) with indexing deactivated. Then, the steps I run after the last PDAL load are:
>> 
>>  - Create the primary key on obj_id/blk_id (the one you suggested)
>>  - Insert into USER_SDO_GEOM_METADATA the whole extent of all the loaded data (from different files and different PDAL runs)
>>  - Create the spatial index
>>  - Analyze and gather stats
>> 
>> The SQL commands for that are:
>> 
>> ALTER TABLE AHN_BLCK ADD CONSTRAINT AHN_BLCK_PK PRIMARY KEY (OBJ_ID, BLK_ID) USING INDEX TABLESPACE INDX;
>> 
>> INSERT INTO USER_SDO_GEOM_METADATA VALUES ('AHN_BLCK','BLK_EXTENT',
>>                SDO_DIM_ARRAY(SDO_DIM_ELEMENT('X',60000.0,100000.0,0.0001),
>>               SDO_DIM_ELEMENT('Y',425000.0,475000.0,0.0001)),28992);
>> 
>> CREATE INDEX AHN_BLCK_SIDX ON AHN_BLCK (BLK_EXTENT) INDEXTYPE IS MDSYS.SPATIAL_INDEX PARAMETERS ('TABLESPACE=INDX WORK_TABLESPACE=PCWORK LAYER_GTYPE=POLYGON SDO_INDX_DIMS=2 SDO_RTR_PCTFREE=0') PARALLEL 16 ;
>> 
>> ANALYZE TABLE AHN_BLCK  COMPUTE SYSTEM STATISTICS FOR TABLE;
>> 
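>> -- (positional arguments of DBMS_STATS.GATHER_TABLE_STATS below: owner, table, partition,
>> --  estimate_percent, block_sample, method_opt, degree, granularity)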
>> BEGIN
>>     DBMS_STATS.GATHER_TABLE_STATS('PDAL23090M','AHN_BLCK',NULL,NULL,FALSE,'FOR ALL COLUMNS SIZE AUTO',8,'ALL');
>> END;
>> 
>> After doing all these steps the query times are as expected, so I am going to assume those are the proper steps. The times now scale well and are quite nice (right now I am still busy with the 23 billion point dataset, but I guess it is safe to assume those will also be fine ;) )!
>> 
>> Time[s]      pdal20M    pdal210M    pdal2201M  pdal23090M
>> ---------  ---------  ----------  -----------  ------------
>> 01_0            0.63        0.37         0.41  -
>> 01_1            0.18        0.2          0.22  -
>> 02_0            1.18        1.51         1.54  -
>> 02_1            0.96        0.96         0.93  -
>> 03_0            0.24        0.25         0.25  -
>> 03_1            0.18        0.18         0.18  -
>> 04_0            1.49        1.53         1.44  -
>> 04_1            1.05        1.07         1.06  -
>> 05_0            0.66        0.78         0.76  -
>> 05_1            0.49        0.51         0.49  -
>> 06_0            1.42        1.61         1.64  -
>> 06_1            1.17        1.2          1.21  -
>> 07_0            1.6         2.12         2.15  -
>> 07_1            1.62        1.35         1.43  -
>> 
>> Thanks for your help and suggestions!
>> 
>> Regards,
>> 
>> O.
>> 
>> 
>>> On 12-08-15 00:30, Smith, Michael ERDC-RDE-CRREL-NH wrote:
>>> The statistics should actually auto-gather (unless you've disabled that part). It's done as part of the DBMS auto tasks and should run when the tables are stale, although this is more for a production-type operation. If you are doing testing, you absolutely should make sure your statistics are up to date, otherwise you will get bad access plans.
>>> 
>>> It's also recommended to set a unique index on your block table on Obj_ID/Blk_ID (although this is primarily used to select individual point clouds). The more info you can give the optimizer, the better your query will perform.
>>> 
>>> You can (and should) run explain plans on your data access queries and see what estimates the optimizer returns. If they don't match what you expect, then you're probably feeding bad information to the optimizer.
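>>> 
>>> A minimal sketch of that (the table and polygon names are taken from Oscar's query further down in this thread):
>>> 
>>> EXPLAIN PLAN FOR
>>>   SELECT l.*                              -- AHN_BLCK / QUERY_POLYGONS as in Oscar's query below
>>>   FROM AHN_BLCK l, QUERY_POLYGONS g
>>>   WHERE SDO_FILTER(l.blk_extent, g.geom) = 'TRUE' AND g.id = 1;
>>> 
>>> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);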
>>> 
>>> Mike
>>> 
>>> ----
>>> Michael Smith
>>> US Army Corps
>>> Remote Sensing GIS/Center
>>> michael.smith at usace.army.mil
>>> 
>>> 
>>> From: <pdal-bounces at lists.osgeo.org> on behalf of Oscar Martinez Rubi <o.rubi at esciencecenter.nl>
>>> Date: Tuesday, August 11, 2015 at 11:04 AM
>>> To: Peter van Oosterom <P.J.M.vanOosterom at tudelft.nl>, Albert Godfrind <albert.godfrind at oracle.com>, "pdal at lists.osgeo.org" <pdal at lists.osgeo.org>
>>> Cc: Theo Tijssen <T.P.M.Tijssen at tudelft.nl>, "mike.horhammer at oracle.com" <mike.horhammer at oracle.com>
>>> Subject: [EXTERNAL] Re: [pdal] Oracle PDAL queries not scaling
>>> Resent-From: Michael Smith <michael.smith at usace.army.mil>
>>> 
>>> Hi,
>>> 
>>> What I did now is to force statistics gathering, so:
>>> 
>>> ANALYZE TABLE AHN_BLCK compute system statistics for table;
>>> BEGIN
>>>     dbms_stats.gather_table_stats('PDAL20M','AHN_BLCK',NULL,NULL,FALSE,'FOR ALL COLUMNS SIZE AUTO',8,'ALL');
>>> END;
>>> 
>>> And after doing this the queries scale well, so this way I do not need to wait for the DB to learn...
>>> 
>>> Regards,
>>> 
>>> O.
>>> 
>>> 
>>>> On 11-08-15 16:51, Peter van Oosterom wrote:
>>>> Hi Oscar,
>>>> 
>>>> It feels like when you fetch the data, this is based on a query execution plan that does a full table scan to get the blocks. Even if there is an index, the database may not use it. However, the database may notice that the actual query execution was disappointing (by collecting statistics), so that after repeating the same tests the database changed its behaviour and does scale well.
>>>> 
>>>> [others: this was not in Oscar's email, but after repeating the test the data was fetched in about 0.02 seconds for all sizes 20M, 210M and 2201M. So the fetching also becomes significantly faster for the small dataset: from 0.5 to 0.02 seconds.]
>>>> 
>>>> It would be good to see the actual query execution plan, or to force the database to use the index (with a hint). My hands-on practical Oracle syntax knowledge is too limited to give exact hints on how to do this, but perhaps others can help here.
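>>>> 
>>>> A possible way to see the actual plan (a sketch, not verified on this setup): run the query in a session and then, in the same session, execute
>>>> 
>>>> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR);   -- shows the plan of the last statement executed in this session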
>>>> 
>>>> Kind regards, Peter.
>>>> 
>>>> 
>>>>> On 11-8-2015 16:26, Oscar Martinez Rubi wrote:
>>>>> Hi,
>>>>> 
>>>>> I have investigated a bit more this issue.
>>>>> 
>>>>> I wanted to see what data the OCI reader actually reads, so I executed the pre-selection query outside PDAL, in Python: I ran the attached script (freshly after loading the data).
>>>>> 
>>>>> The script runs the exact same query as PDAL does, twice, and prints for each run the number of returned blocks, the time spent in the query and the time spent fetching the results.
>>>>> 
>>>>> The exact query is:
>>>>> 
>>>>> SELECT 
>>>>>     l."OBJ_ID", l."BLK_ID", l."BLK_EXTENT", l."BLK_DOMAIN", l."PCBLK_MIN_RES", l."PCBLK_MAX_RES", l."NUM_POINTS", l."NUM_UNSORTED_POINTS", l."PT_SORT_DIM",  l."POINTS", b.pc
>>>>> FROM 
>>>>>     AHN_BLCK l, AHN_BASE b, QUERY_POLYGONS g
>>>>> WHERE
>>>>>     l.obj_id = b.id
>>>>>     AND
>>>>>     SDO_FILTER(l.blk_extent,g.geom) = 'TRUE' AND g.id = 1;
>>>>> 
>>>>> The results I get are:
>>>>> 
>>>>> 20M run 1:                                            
>>>>>     #blocks: 12
>>>>>     query time[s]: 0.113833904266
>>>>>     fetch time[s]: 0.571593046188
>>>>> 20M run 2:                                            
>>>>>     #blocks: 12
>>>>>     query time[s]: 0.000102996826172
>>>>>     fetch time[s]: 0.500910997391
>>>>> 210M run 1:  
>>>>>     #blocks: 13
>>>>>     query time[s]: 0.0586049556732
>>>>>     fetch time[s]: 5.09832000732
>>>>> 210M run 2:  
>>>>>     #blocks: 13
>>>>>     query time[s]: 0.000245094299316
>>>>>     fetch time[s]: 5.05038785934
>>>>> 2201M run 1:  
>>>>>     #blocks: 13
>>>>>     query time[s]: 0.070690870285
>>>>>     fetch time[s]: 52.4960811138
>>>>> 2201M run 2:  
>>>>>     #blocks: 13
>>>>>     query time[s]: 0.000225067138672
>>>>>     fetch time[s]: 53.1006689072
>>>>> 
>>>>> So, even though the query times and the number of returned blocks are similar, the fetch times are not. We can see the scaling issue there: somehow the fetching is 10x more expensive when there are 10x more points.
>>>>> 
>>>>> I also noticed that after running queries for a while the times get much better and scale well, even when I do new queries with other polygons. So the first queries suffer from scaling issues, but later it gets better.
>>>>> 
>>>>> Any idea why?
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> O.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 10-08-15 18:40, Albert Godfrind wrote:
>>>>>> As with many tools that access an Oracle database, we lack the ability to see what actually happens in the database at a detailed level, i.e. which actual queries are sent, and how the database executes them in terms of CPU use, logical and physical I/Os, network throughput and latency.
>>>>>> 
>>>>>> So I think it is important to add some debugging / tracing facility to let me see what happens:
>>>>>> 
>>>>>> 1) An option to make PDAL (actually the OCI driver here) log each SQL statement it executes, together with the elapsed time and the number of rows (blocks) fetched. Obviously we have that statement in the input XML file, but a trace would put everything in a single log and include proper measurements.
>>>>>> 
>>>>>> 2) More important: an option to make the OCI driver enable SQL tracing on the database side. This is simple to do by just issuing an “ALTER SESSION …” statement before running the queries. The resulting trace will show all details about execution times as well as resource consumption (CPU and IO) and wait times. That could be added as an option in the XML file. Or maybe extend the XML file with an option to specify an SQL statement to be executed before each query (we could then use that to manually add the ALTER SESSION statement).
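>>>>>> 
>>>>>> For example (one common form of the tracing statements; a sketch, the exact level depends on how much detail is wanted, and the tracefile_identifier is just an arbitrary tag):
>>>>>> 
>>>>>> ALTER SESSION SET tracefile_identifier = 'pdal_oci';                    -- 'pdal_oci' is an arbitrary tag for the trace file name
>>>>>> ALTER SESSION SET events '10046 trace name context forever, level 12';  -- level 12 = waits + bind values
>>>>>> -- ... run the queries ...
>>>>>> ALTER SESSION SET events '10046 trace name context off';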
>>>>>> 
>>>>>> The resulting trace can help isolate the bottleneck as one of:
>>>>>> 
>>>>>> 1) the I/Os in the database, to fetch the blocks from disk (mostly I/O)
>>>>>> 2) the network time to pass the blocks to the PDAL client (network throughput and latency)
>>>>>> 3) the time to process the blocks in the PDAL client (mostly CPU)
>>>>>> 
>>>>>> Albert
>>>>>> 
>>>>>>> On 5-Aug-2015, at 12:26, Oscar Martinez Rubi <o.rubi at esciencecenter.nl> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I did a test to see how well Oracle with PDAL scales with bigger datasets. I had 3 self-contained datasets with 20M, 210M and 2201M points. I loaded them into different Oracle DBs with PDAL and laz-perf, and for each of them I ran 7 queries (via a PDAL pipeline that preselects blocks, applies a crop and then writes to a LAS file).
>>>>>>> 
>>>>>>> The results are in the attached file.
>>>>>>> 
>>>>>>> Regarding the loading, for the 20M dataset I only used one core (it is only one file), while for the others I used 16 cores, i.e. 16 simultaneous PDAL instances loading data into Oracle. I opened an issue on GitHub because I noticed that in some of the runs the size I got was too large, and I do not know what caused that. The attached numbers are from runs where everything seemed to work and the sizes were as expected.
>>>>>>> 
>>>>>>> This message, though, is about the queries. Each query is run twice in each DB. As you can see in the results file, with 10x more points in the dataset the queries are 10x slower, at least for the first run (with the 2201M dataset the second run is much faster, but this does not happen with the 210M).
>>>>>>> 
>>>>>>> Also attached is one of the XML files that I used for the queries (the example is for query 1). Note that the geometry is inserted into Oracle beforehand, so I can use it to pre-filter blocks with the query option in the OCI reader.
>>>>>>> 
>>>>>>> First I thought that maybe the query option in the OCI reader in the XML was being ignored and that all the blocks of the dataset were being processed by PDAL (that would explain 10x more points leading to 10x slower queries), but I ran a PDAL pipeline for query 1 with verbose output and saw that the crop filter "only" processed 120000 points, which makes sense given that the region of query 1 only has 74818 points. Or does the crop still process all the block extents but only open and decompress the points of the overlapping ones?
>>>>>>> 
>>>>>>> Any idea what is happening?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> 
>>>>>>> O.
>>>>>>> <results.txt><query1.xml>
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> Albert Godfrind | Geospatial technologies | Tel: +33 4 93 00 80 67 | Mobile: +33 6 09 97 27 23 | Skype: albert-godfrind
>>>>>> Oracle Server Technologies
>>>>>> 400 Av. Roumanille, BP 309  | 06906 Sophia Antipolis cedex | France
>>>>>> Everything you ever wanted to know about Oracle Spatial
>>>> 
>>>> 
>>>> -- 
>>>> Peter van Oosterom          P.J.M.vanOosterom at tudelft.nl
>>>> Section GIS technology      (room 00-west-520) Department OTB
>>>> Faculty of Architecture and the Built Environment, TU Delft
>>>> tel (+31) 15 2786950        Julianalaan 134, 2628 BL Delft, NL
>>>> fax (+31) 15 2784422        P.O. Box 5043, 2600 GA Delft, NL
>>>> http://geomatics.tudelft.nl MSc Geomatics
>>>> http://www.msc-gima.nl      MSc GIMA (Geo-Info Management&Appl)
>>>> http://www.gdmc.nl 
>>>> 
> 
> 
> -- 
> Peter van Oosterom          P.J.M.vanOosterom at tudelft.nl
> Section GIS technology      Department OTB
> Faculty of Architecture and the Built Environment, TU Delft
> tel (+31) 15 2786950        Julianalaan 134, 2628 BL Delft, NL
> fax (+31) 15 2784422        P.O. Box 5030, 2600 GA Delft, NL
> http://geomatics.tudelft.nl MSc Geomatics
> http://www.msc-gima.nl      MSc GIMA (Geo-Info Management&Appl)
> http://www.gdmc.nl 
> 