[pdal] Oracle PDAL queries not scaling
Oscar Martinez Rubi
o.rubi at esciencecenter.nl
Tue Aug 11 08:04:10 PDT 2015
Hi,
What I did now is to force statistics gathering, so:
ANALYZE TABLE AHN_BLCK compute system statistics for table;
BEGIN
dbms_stats.gather_table_stats('PDAL20M','AHN_BLCK',NULL,NULL,FALSE,'FOR
ALL COLUMNS SIZE AUTO',8,'ALL');
END;
And after doing this the queries are scalable, so in this way I do not
need to wait for the DB to learn...
Regards,
O.
On 11-08-15 16:51, Peter van Oosterom wrote:
> Hi Oscar,
>
> It feels like when you fetch the data, this is based on a query
> execution plan that does a full table scan to get the blocks. Even if
> there is an index, the database may not use this. However, the
> database may notice that the actual query execution was disappointing
> (collecting statistics), and that after repeating the same tests, the
> database behaviour changed its behaviour and does scale well.
>
> [others: this was not in email of Oscar, but after repeating the test
> the data was fetched in about 0.02 seconds for all sizes 20M, 210M and
> 2201M. So, also the fetching in case of small dataset becomes
> significantly faster form 0.5 vs. 0.02 seconds.]
>
> Would be good to see the actual query execution plain or force the
> database to use the index (with an hint). My hand-on practical Oracle
> syntax knowledge is too low to give exact hits how to do this, but
> perhaps others can help here.
>
> Kind regards, Peter.
>
>
> On 11-8-2015 16:26, Oscar Martinez Rubi wrote:
>> Hi,
>>
>> I have investigated a bit more this issue.
>>
>> I wanted to see what data does the OCI reader actually reads, so I
>> executed the pre-selection query out of PDAL, in python. So I run the
>> attached script (freshly after loading the data).
>>
>> The script runs the same exact query as done in PDAL twice and prints
>> for each run the number of returned blocks, time spent in the query
>> and time spent to fetch the results.
>>
>> the exact query is:
>>
>> SELECT
>> l."OBJ_ID", l."BLK_ID", l."BLK_EXTENT", l."BLK_DOMAIN",
>> l."PCBLK_MIN_RES", l."PCBLK_MAX_RES", l."NUM_POINTS",
>> l."NUM_UNSORTED_POINTS", l."PT_SORT_DIM", l."POINTS", b.pc
>> FROM
>> AHN_BLCK l, AHN_BASE b, QUERY_POLYGONS g
>> WHERE
>> l.obj_id = b.id
>> AND
>> SDO_FILTER(l.blk_extent,g.geom) = 'TRUE' AND g.id = 1;
>>
>> The results I get are:
>>
>> 20M run 1:
>> #blocks: 12
>> query time[s]: 0.113833904266
>> fetch time[s]: *0.571593046188*
>> 20M run 2:
>> #blocks: 12
>> query time[s]: 0.000102996826172
>> fetch time[s]: *0.500910997391*
>> 210M run 1:
>> #blocks: 13
>> query time[s]: 0.0586049556732
>> fetch time[s]: *5.09832000732*
>> 210M run 2:
>> #blocks: 13
>> query time[s]: 0.000245094299316
>> fetch time[s]: *5.05038785934*
>> 2201M run 1:
>> #blocks: 13
>> query time[s]: 0.070690870285
>> fetch time[s]: *52.4960811138*
>> 2201M run 2:
>> #blocks: 13
>> query time[s]: 0.000225067138672
>> fetch time[s]: *53.1006689072*
>>
>> So, even though the query times and the number of returned blocks are
>> similar the fetch times are the not. We can see the scaling issue
>> there. Somehow the fetching is much more expensive (10x) when points
>> are 10x.
>>
>> I also noticed that after a while doing queries the times get much
>> better and scalable even when I do new queries with other polygons.
>> So, the first queries suffer of scaling issues but later it gets better.
>>
>> Any idea why?
>>
>> Regards,
>>
>> O.
>>
>>
>>
>>
>> On 10-08-15 18:40, Albert Godfrind wrote:
>>> Like many tools that access an Oracle database, we lack the ability
>>> to see what actually happens in the database at a detailed level,
>>> i.e. which actual queries are sent, and how the database executes
>>> them in terms of CPU use, logical and physical I/Os, network
>>> throughput and latency.
>>>
>>> So I think it is important to add some debugging / tracing facility
>>> to let me see what happens:
>>>
>>> 1) An option to make PDAL (actually the OCI driver here) log each
>>> SQL statement it executes, together with the elapsed time and the
>>> number of rows (blocks) fetched. Obviously we have that statement in
>>> the input XML file, but a trace would put everything in a single log
>>> and include proper measurements.
>>>
>>> 2) More important: an option to make the OCI driver enable SQL
>>> tracing at the database side. This is simple to do by just issuing
>>> an “ALTER SESSION …” statement before running the queries. The
>>> resulting trace will show all details about execution times as well
>>> as resource consumption (CPU and IO) and wait times. That could be
>>> added as an option in the XML file. Or maybe extend the XML file
>>> with the option to specify a SQ statement to be performed before
>>> each query (we could then use that to manually add the ALTER SESSION
>>> statement.
>>>
>>> The resulting trace can help isolate the bottleneck as one of:
>>>
>>> 1) the I/Os in the database, to fetch the blocks from disk (mostly I/O)
>>> 2) the network time to pass the blocks to the PDAL client (network
>>> throughput and latency)
>>> 3) the time to process the blocks in the PDAL client (mostly CPU)
>>>
>>> Albert
>>>
>>>> On 5-Aug-2015, at 12:26, Oscar Martinez Rubi
>>>> <o.rubi at esciencecenter.nl <mailto:o.rubi at esciencecenter.nl>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I did a test to see how good Oracle with PDAL scale with bigger
>>>> data sets. I had 3 datasets that are self-contained with 20M, 210M
>>>> and 2201M points. I loaded them in different Oracle DBs with PDAL
>>>> and laz-perf. And, for each of them I ran 7 queries (via a pdal
>>>> pipeline that preselects blocks, applies a crop and then write to a
>>>> LAS file)
>>>>
>>>> The results are in the attached file.
>>>>
>>>> Regarding the loading, for the 20M I only used one core (it is only
>>>> one file) while for the others I used 16 cores, i.e. 16 simult.
>>>> PDAL instances loading data to Oracle. I opened an issue in GitHub
>>>> because I noticed that in some of the runs the size that I got was
>>>> too large, and I do not know what caused that. The attached numbers
>>>> are when everything seemed to work and the sizes were as expected.
>>>>
>>>> This message, though, is about the queries. Each query is run twice
>>>> in each DB. As you can see in the results file, for 10x more points
>>>> in the data set the queries are 10x slower, at least for the first
>>>> run (with the 2201M the second run is much faster but this does not
>>>> happen with the 210M).
>>>>
>>>> Find also attached one of the XML that i used for the queries
>>>> (example is for query1). Note that the geometry is previously
>>>> inserted in oracle so I can use to pre-filter blocks with the query
>>>> option in oci reader
>>>>
>>>> First I though that maybe the query option in the oci reader in the
>>>> XML was ignored and that all the blocks of the dataset were being
>>>> processed by PDAL (that would explain 10x more points 10x slower
>>>> queries) but I ran a pdal pipeline for query1 with verbose and I
>>>> saw that the crop filter "only" processed 120000 points which makes
>>>> sense taking into account that region of query 1 only has 74818
>>>> points. Or maybe the crop still process all the blocks extents but
>>>> only opens and decompress the points of the overlapping ones?
>>>>
>>>> Any idea what is happening?
>>>>
>>>> Regards,
>>>>
>>>> O.
>>>> <results.txt><query1.xml>
>>>
>>> --
>>> ORACLE <http://www.oracle.com>
>>> Albert Godfrind | Geospatial technologies | Tel: +33 4 93 00 80 67|
>>> Mobile: +33 6 09 97 27 23| Skype:albert-godfrind <skype:albert-godfrind>
>>> OracleServer Technologies
>>> 400 Av. Roumanille,BP 309 |06906 Sophia Antipolis cedex|France
>>> Everything you ever wanted to know about Oracle Spatial
>>> <http://www.apress.com/9781590598993>
>>>
>>> <http://www.locationintelligence.net/dc/>
>>>
>>>
>>>
>>
>
>
> --
> Peter van OosteromP.J.M.vanOosterom at tudelft.nl
> Section GIS technology (room 00-west-520) Department OTB
> Faculty of Architecture and the Built Environment, TU Delft
> tel (+31) 15 2786950 Julianalaan 134, 2628 BL Delft, NL
> fax (+31) 15 2784422 P.O. Box 5043, 2600 GA Delft, NL
> http://geomatics.tudelft.nl MSc Geomatics
> http://www.msc-gima.nl MSc GIMA (Geo-Info Management&Appl)
> http://www.gdmc.nl
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/pdal/attachments/20150811/ef6b8d9c/attachment-0001.html>
More information about the pdal
mailing list