[pdal] Oracle PDAL queries not scaling

Oscar Martinez Rubi o.rubi at esciencecenter.nl
Tue Aug 11 08:04:10 PDT 2015


Hi,

What I did now was to force statistics gathering:

ANALYZE TABLE AHN_BLCK COMPUTE STATISTICS FOR TABLE;

BEGIN
  dbms_stats.gather_table_stats('PDAL20M', 'AHN_BLCK', NULL, NULL, FALSE,
                                'FOR ALL COLUMNS SIZE AUTO', 8, 'ALL');
END;
/

After doing this the queries scale well, so I do not need to wait for 
the DB to learn on its own...
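
To check that the statistics are actually in place, something along these 
lines can be used from the same kind of Python setup as the test script 
(a minimal, untested sketch; cx_Oracle is assumed and the connection 
string is only a placeholder):

import cx_Oracle

# Placeholder connection string; replace with the real user/password/DSN.
con = cx_Oracle.connect('pdal20m/password@localhost/orcl')
cur = con.cursor()

# Same statistics gathering as above, run as an anonymous PL/SQL block.
cur.execute("""
    BEGIN
      dbms_stats.gather_table_stats('PDAL20M', 'AHN_BLCK', NULL, NULL, FALSE,
                                    'FOR ALL COLUMNS SIZE AUTO', 8, 'ALL');
    END;""")

# Verify that the optimizer now has fresh statistics for the block table.
cur.execute("""
    SELECT table_name, num_rows, last_analyzed
    FROM user_tab_statistics
    WHERE table_name = 'AHN_BLCK'""")
print(cur.fetchall())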

Regards,

O.


On 11-08-15 16:51, Peter van Oosterom wrote:
> Hi Oscar,
>
> It feels like the data fetch is based on a query execution plan that 
> does a full table scan to get the blocks. Even if there is an index, 
> the database may not use it. However, the database may notice that the 
> actual query execution was disappointing (by collecting statistics), so 
> that after repeating the same tests the database changed its behaviour 
> and now scales well.
>
> [others: this was not in the email of Oscar, but after repeating the test 
> the data was fetched in about 0.02 seconds for all sizes 20M, 210M and 
> 2201M. So the fetching in the case of the small dataset also becomes 
> significantly faster: from 0.5 to 0.02 seconds.]
>
> It would be good to see the actual query execution plan or to force the 
> database to use the index (with a hint). My hands-on practical Oracle 
> syntax knowledge is too low to give exact hints on how to do this, but 
> perhaps others can help here.
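>
> [For illustration only, and to be treated with caution given my limited 
> Oracle knowledge: something along these lines might show the plan and 
> force the spatial index. The index name AHN_BLCK_SIDX and the connection 
> string are placeholders, and the column list is abbreviated; e.g. from a 
> Python/cx_Oracle session like the one used for the tests:]
>
> import cx_Oracle
>
> # Placeholder credentials and index name; adjust to the real ones.
> con = cx_Oracle.connect('pdal20m/password@localhost/orcl')
> cur = con.cursor()
>
> # Block pre-selection query with an index hint on the (assumed) spatial
> # index of AHN_BLCK.BLK_EXTENT (column list abbreviated).
> query = """SELECT /*+ INDEX(l AHN_BLCK_SIDX) */ l.obj_id, l.blk_id, l.points, b.pc
>            FROM ahn_blck l, ahn_base b, query_polygons g
>            WHERE l.obj_id = b.id
>              AND SDO_FILTER(l.blk_extent, g.geom) = 'TRUE' AND g.id = 1"""
>
> # Show the execution plan the optimizer chooses for the hinted query.
> cur.execute("EXPLAIN PLAN FOR " + query)
> cur.execute("SELECT * FROM TABLE(dbms_xplan.display())")
> for (line,) in cur:
>     print(line)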
>
> Kind regards, Peter.
>
>
> On 11-8-2015 16:26, Oscar Martinez Rubi wrote:
>> Hi,
>>
>> I have investigated a bit more this issue.
>>
>> I wanted to see what data the OCI reader actually reads, so I executed 
>> the pre-selection query outside of PDAL, in Python. I ran the attached 
>> script (freshly, right after loading the data).
>>
>> The script runs the exact same query as PDAL does, twice, and for each 
>> run prints the number of returned blocks, the time spent in the query 
>> and the time spent fetching the results.
>>
>> The exact query is:
>>
>> SELECT
>>     l."OBJ_ID", l."BLK_ID", l."BLK_EXTENT", l."BLK_DOMAIN",
>>     l."PCBLK_MIN_RES", l."PCBLK_MAX_RES", l."NUM_POINTS",
>>     l."NUM_UNSORTED_POINTS", l."PT_SORT_DIM", l."POINTS", b.pc
>> FROM
>>     AHN_BLCK l, AHN_BASE b, QUERY_POLYGONS g
>> WHERE
>>     l.obj_id = b.id
>>     AND
>>     SDO_FILTER(l.blk_extent,g.geom) = 'TRUE' AND g.id = 1;
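>>
>> (The attached script is the authoritative version; roughly, what it 
>> does is the following, shown here as a minimal sketch with cx_Oracle 
>> assumed, a placeholder connection string and an abbreviated column list:)
>>
>> import time
>> import cx_Oracle
>>
>> # Placeholder connection string.
>> con = cx_Oracle.connect('pdal20m/password@localhost/orcl')
>> cur = con.cursor()
>>
>> # Abbreviated version of the SELECT shown above.
>> query = """SELECT l.obj_id, l.blk_id, l.num_points, l.points, b.pc
>>            FROM ahn_blck l, ahn_base b, query_polygons g
>>            WHERE l.obj_id = b.id
>>              AND SDO_FILTER(l.blk_extent, g.geom) = 'TRUE' AND g.id = 1"""
>>
>> for run in (1, 2):
>>     t0 = time.time()
>>     cur.execute(query)        # time spent in the query
>>     t1 = time.time()
>>     blocks = cur.fetchall()   # time spent fetching the result blocks
>>     t2 = time.time()
>>     print('run %d  #blocks: %d  query time[s]: %g  fetch time[s]: %g'
>>           % (run, len(blocks), t1 - t0, t2 - t1))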
>>
>> The results I get are:
>>
>> dataset   run   #blocks   query time [s]   fetch time [s]
>> 20M         1        12           0.1138           0.5716
>> 20M         2        12           0.0001           0.5009
>> 210M        1        13           0.0586           5.0983
>> 210M        2        13           0.0002           5.0504
>> 2201M       1        13           0.0707          52.4961
>> 2201M       2        13           0.0002          53.1007
>>
>> So, even though the query times and the number of returned blocks are 
>> similar, the fetch times are not: the scaling issue is visible there. 
>> The fetching is roughly 10x more expensive when the dataset has 10x 
>> more points.
>>
>> I also noticed that after running queries for a while the times get 
>> much better and scale well, even when I run new queries with other 
>> polygons. So the first queries suffer from the scaling issue, but 
>> later it gets better.
>>
>> Any idea why?
>>
>> Regards,
>>
>> O.
>>
>>
>>
>>
>> On 10-08-15 18:40, Albert Godfrind wrote:
>>> As with many tools that access an Oracle database, we lack the ability 
>>> to see what actually happens in the database at a detailed level, 
>>> i.e. which actual queries are sent, and how the database executes 
>>> them in terms of CPU use, logical and physical I/Os, network 
>>> throughput and latency.
>>>
>>> So I think it is important to add some debugging / tracing facility 
>>> to let me see what happens:
>>>
>>> 1) An option to make PDAL (actually the OCI driver here) log each 
>>> SQL statement it executes, together with the elapsed time and the 
>>> number of rows (blocks) fetched. Obviously we have that statement in 
>>> the input XML file, but a trace would put everything in a single log 
>>> and include proper measurements.
>>>
>>> 2) More important: an option to make the OCI driver enable SQL 
>>> tracing at the database side. This is simple to do by just issuing 
>>> an “ALTER SESSION …” statement before running the queries. The 
>>> resulting trace will show all details about execution times as well 
>>> as resource consumption (CPU and IO) and wait times. That could be 
>>> added as an option in the XML file. Or maybe extend the XML file 
>>> with the option to specify a SQL statement to be performed before 
>>> each query (we could then use that to manually add the ALTER SESSION 
>>> statement).
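>>>
>>> (For instance, a pre-query statement along the following lines would 
>>> do it; this is only a sketch, with a placeholder connection string, 
>>> and the trace file identifier and trace level are just examples:)
>>>
>>> import cx_Oracle
>>>
>>> # Placeholder connection; in PDAL this would be issued by the OCI reader
>>> # right after connecting, before the block query runs.
>>> con = cx_Oracle.connect('pdal20m/password@localhost/orcl')
>>> cur = con.cursor()
>>>
>>> # Tag the trace file so it is easy to find, then enable extended SQL
>>> # trace (event 10046, level 12 = SQL trace plus wait events and binds).
>>> cur.execute("ALTER SESSION SET tracefile_identifier = 'pdal_query'")
>>> cur.execute("ALTER SESSION SET events "
>>>             "'10046 trace name context forever, level 12'")
>>>
>>> # ... run the block query here; the resulting trace file in the server's
>>> # diagnostic trace directory can then be summarised with tkprof.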
>>>
>>> The resulting trace can help isolate the bottleneck as one of:
>>>
>>> 1) the I/Os in the database, to fetch the blocks from disk (mostly I/O)
>>> 2) the network time to pass the blocks to the PDAL client (network 
>>> throughput and latency)
>>> 3) the time to process the blocks in the PDAL client (mostly CPU)
>>>
>>> Albert
>>>
>>>> On 5-Aug-2015, at 12:26, Oscar Martinez Rubi 
>>>> <o.rubi at esciencecenter.nl <mailto:o.rubi at esciencecenter.nl>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I did a test to see how well Oracle with PDAL scales with bigger 
>>>> datasets. I had 3 self-contained datasets with 20M, 210M and 2201M 
>>>> points. I loaded them into different Oracle DBs with PDAL and 
>>>> laz-perf, and for each of them I ran 7 queries (via a PDAL pipeline 
>>>> that preselects blocks, applies a crop and then writes to a LAS 
>>>> file).
>>>>
>>>> The results are in the attached file.
>>>>
>>>> Regarding the loading, for the 20M I only used one core (it is only 
>>>> one file) while for the others I used 16 cores, i.e. 16 simultaneous 
>>>> PDAL instances loading data into Oracle. I opened an issue on GitHub 
>>>> because I noticed that in some of the runs the size I got was too 
>>>> large, and I do not know what caused that. The attached numbers are 
>>>> from runs where everything seemed to work and the sizes were as 
>>>> expected.
>>>>
>>>> This message, though, is about the queries. Each query is run twice 
>>>> in each DB. As you can see in the results file, for 10x more points 
>>>> in the data set the queries are 10x slower, at least for the first 
>>>> run (with the 2201M the second run is much faster but this does not 
>>>> happen with the 210M).
>>>>
>>>> Also attached is one of the XML files that I used for the queries 
>>>> (the example is for query1). Note that the query geometry is 
>>>> previously inserted in Oracle so that I can use it to pre-filter 
>>>> blocks with the query option of the OCI reader.
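>>>>
>>>> (The query geometry is inserted beforehand with something along 
>>>> these lines; a sketch only, with a placeholder connection string, 
>>>> placeholder polygon coordinates and SRID 28992 assumed:)
>>>>
>>>> import cx_Oracle
>>>>
>>>> # Placeholder connection string and polygon; the real query regions differ.
>>>> con = cx_Oracle.connect('pdal20m/password@localhost/orcl')
>>>> cur = con.cursor()
>>>>
>>>> wkt = ('POLYGON ((85000 446000, 85500 446000, 85500 446500, '
>>>>        '85000 446500, 85000 446000))')
>>>> cur.execute("""
>>>>     INSERT INTO query_polygons (id, geom)
>>>>     VALUES (:id, SDO_GEOMETRY(:wkt, :srid))""",
>>>>     {'id': 1, 'wkt': wkt, 'srid': 28992})
>>>> con.commit()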
>>>>
>>>> First I thought that maybe the query option of the OCI reader in the 
>>>> XML was ignored and that all the blocks of the dataset were being 
>>>> processed by PDAL (that would explain 10x more points = 10x slower 
>>>> queries), but I ran a PDAL pipeline for query1 with verbose output 
>>>> and saw that the crop filter "only" processed 120000 points, which 
>>>> makes sense taking into account that the region of query 1 only has 
>>>> 74818 points. Or maybe the crop still processes all the block 
>>>> extents but only opens and decompresses the points of the 
>>>> overlapping ones?
>>>>
>>>> Any idea what is happening?
>>>>
>>>> Regards,
>>>>
>>>> O.
>>>> <results.txt><query1.xml>
>>>
>>> --
>>> ORACLE <http://www.oracle.com>
>>> Albert Godfrind | Geospatial technologies | Tel: +33 4 93 00 80 67 | 
>>> Mobile: +33 6 09 97 27 23 | Skype: albert-godfrind
>>> Oracle Server Technologies
>>> 400 Av. Roumanille, BP 309 | 06906 Sophia Antipolis cedex | France
>>> Everything you ever wanted to know about Oracle Spatial 
>>> <http://www.apress.com/9781590598993>
>>>
>>>
>>>
>>>
>>
>
>
> -- 
> Peter van Oosterom          P.J.M.vanOosterom at tudelft.nl
> Section GIS technology      (room 00-west-520) Department OTB
> Faculty of Architecture and the Built Environment, TU Delft
> tel (+31) 15 2786950        Julianalaan 134, 2628 BL Delft, NL
> fax (+31) 15 2784422        P.O. Box 5043, 2600 GA Delft, NL
> http://geomatics.tudelft.nl  MSc Geomatics
> http://www.msc-gima.nl       MSc GIMA (Geo-Info Management&Appl)
> http://www.gdmc.nl  
>
