[pdal] [EXTERNAL] Re: Oracle PDAL queries not scaling
Oscar Martinez Rubi
o.rubi at esciencecenter.nl
Wed Aug 12 03:40:40 PDT 2015
Hi,
It is not that significant. For example, the 23090M data set just
finished loading.
Total time was 3816 seconds:
- Loading the 1400 files (16 simultaneous processes) took 3660 seconds
- Creating the primary key index (obj_id/blk_id) took 2 seconds
- Inserting into USER_SDO_GEOM_METADATA took less than 1 second
- Creating the spatial index took 35 seconds
- Analyzing the table took 4 seconds
- GATHER_TABLE_STATS took 112 seconds
So, it is reasonable.
O.
On 12-08-15 12:03, Peter van Oosterom wrote:
> Hi Oscar,
>
> Great that PDAL/Oracle now scales well!
>
> It might be that for large datasets the analyze+stats step becomes an
> expensive part of the load time.
> To avoid this step, an alternative would be to add a hint telling the
> query optimizer to use the index.
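>
> For example, something along these lines might do it (just a sketch,
> not tested; the alias l and the index name AHN_BLCK_SIDX are taken
> from Oscar's query and loading script below):
>
> SELECT /*+ INDEX(l AHN_BLCK_SIDX) */ l.obj_id, l.blk_id, l.points
> FROM   AHN_BLCK l, QUERY_POLYGONS g
> WHERE  SDO_FILTER(l.blk_extent, g.geom) = 'TRUE' AND g.id = 1;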
>
> However, perhaps the analyze+stats at block level are not too
> expensive (as they would be at point level), in which case the time
> needed might be negligible, and then it is better to keep them (as in
> your script now, and without the hint).
>
> It would be good to check how much time these steps take, and then we
> know what to do (especially for loading the complete AHN with 640
> billion points).
>
> Kind regards, Peter.
>
> Oscar Martinez Rubi wrote on 12-8-2015 at 11:53:
>> Hi,
>>
>> When you say that statistics are auto-gathered... by whom? PDAL or Oracle?
>> - In the PDAL OCI writer I do not see anything that does the analyze and
>> gather_stats. I would maybe put it just after creating the indexes,
>> and make it optional like the index creation, because you only want to
>> do that after loading the last file.
>> - In Oracle itself I have not disabled anything (as far as I
>> remember) and the statistics are indeed computed automatically after a
>> while. The problem is how long "this while" is; to be sure, and as you
>> recommend, it is maybe just better to do it yourself once you know
>> the loading is over.
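>>
>> One simple way to check when (or whether) Oracle has gathered them is
>> to look at LAST_ANALYZED (just a small sketch for the block table):
>>
>> SELECT table_name, last_analyzed
>> FROM   user_tables
>> WHERE  table_name = 'AHN_BLCK';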
>>
>> So, bottom line: I load all the files (with PDAL) with indexing
>> deactivated. Then, the steps I run after the last PDAL load are:
>>
>> - Create the primary key on obj_id/blk_id. This is the one you suggested
>> - Insert into USER_SDO_GEOM_METADATA the whole extent of all
>> the loaded data (from the different files and different PDAL runs)
>> - Create the spatial index
>> - Analyze and gather stats
>>
>> The SQL commands for that are:
>>
>> ALTER TABLE AHN_BLCK ADD CONSTRAINT AHN_BLCK_PK
>>   PRIMARY KEY (OBJ_ID, BLK_ID) USING INDEX TABLESPACE INDX;
>>
>> INSERT INTO USER_SDO_GEOM_METADATA VALUES ('AHN_BLCK', 'BLK_EXTENT',
>>   SDO_DIM_ARRAY(SDO_DIM_ELEMENT('X', 60000.0, 100000.0, 0.0001),
>>                 SDO_DIM_ELEMENT('Y', 425000.0, 475000.0, 0.0001)), 28992);
>>
>> CREATE INDEX AHN_BLCK_SIDX ON AHN_BLCK (BLK_EXTENT)
>>   INDEXTYPE IS MDSYS.SPATIAL_INDEX
>>   PARAMETERS ('TABLESPACE=INDX WORK_TABLESPACE=PCWORK
>>     LAYER_GTYPE=POLYGON SDO_INDX_DIMS=2 SDO_RTR_PCTFREE=0')
>>   PARALLEL 16;
>>
>> ANALYZE TABLE AHN_BLCK COMPUTE SYSTEM STATISTICS FOR TABLE;
>>
>> BEGIN
>>   DBMS_STATS.GATHER_TABLE_STATS('PDAL23090M', 'AHN_BLCK', NULL, NULL,
>>     FALSE, 'FOR ALL COLUMNS SIZE AUTO', 8, 'ALL');
>> END;
>>
>> After doing all these steps the query times are as expected, so I am
>> going to assume these are the proper steps. The times now scale and
>> are quite nice (right now I am still busy with the 23 billion point
>> dataset, but I guess it is safe to assume those will also be fine ;) )!
>>
>> Time [s]   pdal20M   pdal210M   pdal2201M   pdal23090M
>> --------   -------   --------   ---------   ----------
>> 01_0       0.63      0.37       0.41        -
>> 01_1       0.18      0.2        0.22        -
>> 02_0       1.18      1.51       1.54        -
>> 02_1       0.96      0.96       0.93        -
>> 03_0       0.24      0.25       0.25        -
>> 03_1       0.18      0.18       0.18        -
>> 04_0       1.49      1.53       1.44        -
>> 04_1       1.05      1.07       1.06        -
>> 05_0       0.66      0.78       0.76        -
>> 05_1       0.49      0.51       0.49        -
>> 06_0       1.42      1.61       1.64        -
>> 06_1       1.17      1.2        1.21        -
>> 07_0       1.6       2.12       2.15        -
>> 07_1       1.62      1.35       1.43        -
>>
>> Thanks for your help and suggestions!
>>
>> Regards,
>>
>> O.
>>
>>
>> On 12-08-15 00:30, Smith, Michael ERDC-RDE-CRREL-NH wrote:
>>> The statistics should actually auto-gather (unless you've disabled
>>> that part). It's done as part of the DBMS auto tasks and should
>>> run when the tables are stale, although this is more for a
>>> production-type operation. If you are doing testing, you absolutely
>>> should make sure your statistics are up to date, otherwise you will
>>> get bad access plans.
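>>>
>>> A quick way to check that the auto task is enabled and whether the
>>> block table is considered stale (just a sketch; it needs access to
>>> the DBA views, and AHN_BLCK is the block table from this thread):
>>>
>>> SELECT client_name, status
>>> FROM   dba_autotask_client
>>> WHERE  client_name = 'auto optimizer stats collection';
>>>
>>> SELECT table_name, last_analyzed, stale_stats
>>> FROM   user_tab_statistics
>>> WHERE  table_name = 'AHN_BLCK';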
>>>
>>> It's also recommended to set a unique index on your block table on
>>> Obj_ID/Blk_ID (although this is primarily used to individually
>>> select point clouds). The more info you can give the optimizer, the
>>> better your query will perform.
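>>>
>>> A minimal sketch of such an index (the name AHN_BLCK_UK is just an
>>> example; a primary key constraint on the same two columns achieves
>>> the same thing):
>>>
>>> CREATE UNIQUE INDEX AHN_BLCK_UK ON AHN_BLCK (OBJ_ID, BLK_ID)
>>>   TABLESPACE INDX;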
>>>
>>> You can (and should) run explain plans on your data access queries
>>> and see what estimates the optimizer returns. If they don't match
>>> what you expect, then you're probably feeding bad information to the
>>> optimizer.
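>>>
>>> For example, for the block selection query used in this thread (a
>>> sketch; DBMS_XPLAN.DISPLAY formats the plan from the plan table):
>>>
>>> EXPLAIN PLAN FOR
>>>   SELECT l.obj_id, l.blk_id, l.blk_extent, l.points
>>>   FROM   AHN_BLCK l, QUERY_POLYGONS g
>>>   WHERE  SDO_FILTER(l.blk_extent, g.geom) = 'TRUE' AND g.id = 1;
>>>
>>> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);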
>>>
>>> Mike
>>>
>>> ----
>>> Michael Smith
>>> US Army Corps
>>> Remote Sensing GIS/Center
>>> michael.smith at usace.army.mil
>>>
>>>
>>> From: <pdal-bounces at lists.osgeo.org> on behalf of Oscar Martinez
>>> Rubi <o.rubi at esciencecenter.nl>
>>> Date: Tuesday, August 11, 2015 at 11:04 AM
>>> To: Peter van Oosterom <P.J.M.vanOosterom at tudelft.nl>,
>>> Albert Godfrind <albert.godfrind at oracle.com>,
>>> "pdal at lists.osgeo.org" <pdal at lists.osgeo.org>
>>> Cc: Theo Tijssen <T.P.M.Tijssen at tudelft.nl>,
>>> "mike.horhammer at oracle.com" <mike.horhammer at oracle.com>
>>> Subject: [EXTERNAL] Re: [pdal] Oracle PDAL queries not scaling
>>> Resent-From: Michael Smith <michael.smith at usace.army.mil>
>>>
>>> Hi,
>>>
>>> What I did now was to force statistics gathering:
>>>
>>> ANALYZE TABLE AHN_BLCK compute system statistics for table;
>>>
>>> BEGIN
>>>   dbms_stats.gather_table_stats('PDAL20M', 'AHN_BLCK', NULL, NULL,
>>>     FALSE, 'FOR ALL COLUMNS SIZE AUTO', 8, 'ALL');
>>> END;
>>>
>>> And after doing this the queries are scalable, so in this way I
>>> do not need to wait for the DB to learn...
>>>
>>> Regards,
>>>
>>> O.
>>>
>>>
>>> On 11-08-15 16:51, Peter van Oosterom wrote:
>>>> Hi Oscar,
>>>>
>>>> It feels like when you fetch the data, this is based on a query
>>>> execution plan that does a full table scan to get the blocks.
>>>> Even if there is an index, the database may not use it.
>>>> However, the database may notice that the actual query
>>>> execution was disappointing (by collecting statistics), and
>>>> after repeating the same tests the database changed its
>>>> behaviour and does scale well.
>>>>
>>>> [others: this was not in Oscar's email, but after repeating
>>>> the test the data was fetched in about 0.02 seconds for all
>>>> sizes 20M, 210M and 2201M. So the fetching for the small
>>>> dataset also becomes significantly faster: 0.02 vs. 0.5
>>>> seconds.]
>>>>
>>>> It would be good to see the actual query execution plan or force
>>>> the database to use the index (with a hint). My hands-on
>>>> practical Oracle syntax knowledge is too limited to give exact
>>>> hints on how to do this, but perhaps others can help here.
>>>>
>>>> Kind regards, Peter.
>>>>
>>>>
>>>> On 11-8-2015 16:26, Oscar Martinez Rubi wrote:
>>>>> Hi,
>>>>>
>>>>> I have investigated this issue a bit more.
>>>>>
>>>>> I wanted to see what data the OCI reader actually reads,
>>>>> so I executed the pre-selection query outside of PDAL, in
>>>>> Python. I ran the attached script (freshly after loading the data).
>>>>>
>>>>> The script runs the exact same query as PDAL does, twice, and
>>>>> prints for each run the number of returned blocks, the time
>>>>> spent in the query, and the time spent fetching the results.
>>>>>
>>>>> The exact query is:
>>>>>
>>>>> SELECT
>>>>> l."OBJ_ID", l."BLK_ID", l."BLK_EXTENT", l."BLK_DOMAIN",
>>>>> l."PCBLK_MIN_RES", l."PCBLK_MAX_RES", l."NUM_POINTS",
>>>>> l."NUM_UNSORTED_POINTS", l."PT_SORT_DIM", l."POINTS", b.pc
>>>>> FROM
>>>>> AHN_BLCK l, AHN_BASE b, QUERY_POLYGONS g
>>>>> WHERE
>>>>> l.obj_id = b.id
>>>>> AND
>>>>> SDO_FILTER(l.blk_extent,g.geom) = 'TRUE' AND g.id = 1;
>>>>>
>>>>> The results I get are:
>>>>>
>>>>>                #blocks  query time [s]     fetch time [s]
>>>>> 20M   run 1        12   0.113833904266     0.571593046188
>>>>> 20M   run 2        12   0.000102996826172  0.500910997391
>>>>> 210M  run 1        13   0.0586049556732    5.09832000732
>>>>> 210M  run 2        13   0.000245094299316  5.05038785934
>>>>> 2201M run 1        13   0.070690870285     52.4960811138
>>>>> 2201M run 2        13   0.000225067138672  53.1006689072
>>>>>
>>>>> So, even though the query times and the number of returned
>>>>> blocks are similar, the fetch times are not. We can see the
>>>>> scaling issue there. Somehow the fetching is much more
>>>>> expensive (10x) when there are 10x more points.
>>>>>
>>>>> I also noticed that after doing queries for a while the times
>>>>> get much better and scale, even when I do new queries with
>>>>> other polygons. So the first queries suffer from scaling issues
>>>>> but later it gets better.
>>>>>
>>>>> Any idea why?
>>>>>
>>>>> Regards,
>>>>>
>>>>> O.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 10-08-15 18:40, Albert Godfrind wrote:
>>>>>> As with many tools that access an Oracle database, we lack the
>>>>>> ability to see what actually happens in the database at a
>>>>>> detailed level, i.e. which actual queries are sent, and how
>>>>>> the database executes them in terms of CPU use, logical and
>>>>>> physical I/Os, network throughput and latency.
>>>>>>
>>>>>> So I think it is important to add some debugging / tracing
>>>>>> facility to let me see what happens:
>>>>>>
>>>>>> 1) An option to make PDAL (actually the OCI driver here) log
>>>>>> each SQL statement it executes, together with the elapsed
>>>>>> time and the number of rows (blocks) fetched. Obviously we
>>>>>> have that statement in the input XML file, but a trace would
>>>>>> put everything in a single log and include proper measurements.
>>>>>>
>>>>>> 2) More importantly: an option to make the OCI driver enable
>>>>>> SQL tracing at the database side. This is simple to do by
>>>>>> just issuing an “ALTER SESSION …” statement before running
>>>>>> the queries. The resulting trace will show all details about
>>>>>> execution times as well as resource consumption (CPU and IO)
>>>>>> and wait times. That could be added as an option in the XML
>>>>>> file. Or maybe extend the XML file with the option to specify
>>>>>> a SQL statement to be executed before each query (we could
>>>>>> then use that to manually add the ALTER SESSION statement).
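>>>>>>
>>>>>> As a sketch, something like this around the block queries would
>>>>>> enable extended SQL tracing (the trace file identifier is just an
>>>>>> example; level 12 includes wait events and bind values):
>>>>>>
>>>>>> ALTER SESSION SET TRACEFILE_IDENTIFIER = 'pdal_oci';
>>>>>> ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';
>>>>>> -- ... run the block selection queries here ...
>>>>>> ALTER SESSION SET EVENTS '10046 trace name context off';
>>>>>>
>>>>>> The resulting trace file lands in the database trace directory and
>>>>>> can be summarised with tkprof.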
>>>>>>
>>>>>> The resulting trace can help isolate the bottleneck as one of:
>>>>>>
>>>>>> 1) the I/Os in the database, to fetch the blocks from disk
>>>>>> (mostly I/O)
>>>>>> 2) the network time to pass the blocks to the PDAL client
>>>>>> (network throughput and latency)
>>>>>> 3) the time to process the blocks in the PDAL client (mostly CPU)
>>>>>>
>>>>>> Albert
>>>>>>
>>>>>>> On 5-Aug-2015, at 12:26, Oscar Martinez Rubi
>>>>>>> <o.rubi at esciencecenter.nl <mailto:o.rubi at esciencecenter.nl>>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I did a test to see how well Oracle with PDAL scales with
>>>>>>> bigger data sets. I had 3 self-contained datasets with 20M,
>>>>>>> 210M and 2201M points. I loaded them into different Oracle
>>>>>>> DBs with PDAL and laz-perf, and for each of them I ran 7
>>>>>>> queries (via a PDAL pipeline that preselects blocks, applies
>>>>>>> a crop and then writes to a LAS file).
>>>>>>>
>>>>>>> The results are in the attached file.
>>>>>>>
>>>>>>> Regarding the loading, for the 20M I only used one core (it
>>>>>>> is only one file) while for the others I used 16 cores, i.e.
>>>>>>> 16 simultaneous PDAL instances loading data into Oracle. I
>>>>>>> opened an issue on GitHub because I noticed that in some of
>>>>>>> the runs the size that I got was too large, and I do not know
>>>>>>> what caused that. The attached numbers are from runs where
>>>>>>> everything seemed to work and the sizes were as expected.
>>>>>>>
>>>>>>> This message, though, is about the queries. Each query is
>>>>>>> run twice in each DB. As you can see in the results file,
>>>>>>> for 10x more points in the data set the queries are 10x
>>>>>>> slower, at least for the first run (with the 2201M the
>>>>>>> second run is much faster but this does not happen with the
>>>>>>> 210M).
>>>>>>>
>>>>>>> Find also attached one of the XML files that I used for the
>>>>>>> queries (the example is for query1). Note that the geometry is
>>>>>>> previously inserted in Oracle so I can use it to pre-filter
>>>>>>> blocks with the query option in the OCI reader.
>>>>>>>
>>>>>>> First I thought that maybe the query option in the OCI reader
>>>>>>> in the XML was ignored and that all the blocks of the
>>>>>>> dataset were being processed by PDAL (that would explain
>>>>>>> queries being 10x slower with 10x more points), but I ran a
>>>>>>> PDAL pipeline for query1 with verbose output and I saw that
>>>>>>> the crop filter "only" processed 120000 points, which makes
>>>>>>> sense given that the region of query 1 only has 74818 points.
>>>>>>> Or maybe the crop filter still processes all the block extents
>>>>>>> but only opens and decompresses the points of the overlapping
>>>>>>> ones?
>>>>>>>
>>>>>>> Any idea what is happening?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> O.
>>>>>>> <results.txt><query1.xml>
>>>>>>
>>>>>> --
>>>>>> Albert Godfrind | Geospatial technologies
>>>>>> Tel: +33 4 93 00 80 67 | Mobile: +33 6 09 97 27 23 | Skype: albert-godfrind
>>>>>> Oracle Server Technologies
>>>>>> 400 Av. Roumanille, BP 309 | 06906 Sophia Antipolis cedex | France
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Peter van Oosterom    P.J.M.vanOosterom at tudelft.nl
>>>> Section GIS technology (room 00-west-520), Department OTB
>>>> Faculty of Architecture and the Built Environment, TU Delft
>>>> tel (+31) 15 2786950   Julianalaan 134, 2628 BL Delft, NL
>>>> fax (+31) 15 2784422   P.O. Box 5043, 2600 GA Delft, NL
>>>> http://geomatics.tudelft.nl   MSc Geomatics
>>>> http://www.msc-gima.nl        MSc GIMA (Geo-Info Management&Appl)
>>>> http://www.gdmc.nl
>>>>
>>>
>>
>
>
> --
> Peter van Oosterom    P.J.M.vanOosterom at tudelft.nl
> Section GIS technology Department OTB
> Faculty of Architecture and the Built Environment, TU Delft
> tel (+31) 15 2786950 Julianalaan 134, 2628 BL Delft, NL
> fax (+31) 15 2784422 P.O. Box 5030, 2600 GA Delft, NL
> http://geomatics.tudelft.nl MSc Geomatics
> http://www.msc-gima.nl MSc GIMA (Geo-Info Management&Appl)
> http://www.gdmc.nl
>