[postgis-users] A PostGIS-Raster data proposal

Marshall, Steve smarshall at wsi.com
Tue Oct 31 13:02:16 PST 2006


Paul,

I'll try not to get hung up on performance.  I agree that manipulation
functionality will be the reason to do raster-in-DB, and performance
will almost assuredly be degraded to some degree.  However, in my
humble opinion, maintaining acceptable performance is the
make-or-break characteristic that will determine whether or not
raster-in-DB really gets used.

Regarding the chunking of data, this is basically already happening
under the hood for both the LO and TOAST implementations.  If I
understand the math correctly, TOAST data is broken into chunks such
that each chunk AND its associated tuple overhead fit into 1/4 of a
page, i.e. 4 data tuples per disk page.
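
To make the math concrete, here is a back-of-the-envelope sketch; the
54-byte overhead is my own rough guess, not the exact figure from the
PostgreSQL headers:

    /* Rough TOAST chunk-size arithmetic; overhead is an assumption. */
    #include <stdio.h>

    int main(void)
    {
        int page_size       = 8192; /* default BLCKSZ */
        int tuples_per_page = 4;    /* chunk tuples stored per page */
        int tuple_overhead  = 54;   /* guess: tuple header, item
                                     * pointer, chunk_id/chunk_seq
                                     * columns, varlena header */

        printf("approx. chunk payload: %d bytes\n",
               page_size / tuples_per_page - tuple_overhead);
        /* With these assumptions this prints 1994, matching the chunk
         * size mentioned later in this thread. */
        return 0;
    }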

The logic to maintain the chunking is indeed a bit of a pain, but
fortunately the PostgreSQL folks have written most of it for us.  The
notable exception is a way to do seeking writes to TOAST data.
However, I'm hopeful that I can use the LO implementation as guidance
for implementing that without too much trouble.
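
For reference, the client-side LO API already supports exactly this
seek-then-write pattern.  A minimal sketch of what a TOAST equivalent
would need to match (connection setup, error reporting, and names are
placeholders):

    #include <stdio.h>            /* SEEK_SET */
    #include <libpq-fe.h>
    #include <libpq/libpq-fs.h>   /* INV_READ, INV_WRITE */

    /* Overwrite len bytes at offset within an existing large object. */
    static int overwrite_chunk(PGconn *conn, Oid lobj, int offset,
                               char *buf, int len)
    {
        int fd, written;

        /* LO operations must run inside a transaction. */
        PQclear(PQexec(conn, "BEGIN"));

        fd = lo_open(conn, lobj, INV_READ | INV_WRITE);
        if (fd < 0)
            return -1;

        /* Seek to the target offset, then write in place: the
         * operation that has no TOAST counterpart yet. */
        if (lo_lseek(conn, fd, offset, SEEK_SET) < 0)
            return -1;
        written = lo_write(conn, fd, buf, len);

        lo_close(conn, fd);
        PQclear(PQexec(conn, "COMMIT"));
        return written;
    }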

Steve
-----Original Message-----
From: postgis-users-bounces at postgis.refractions.net
[mailto:postgis-users-bounces at postgis.refractions.net] On Behalf Of Paul
Ramsey
Sent: Tuesday, October 31, 2006 3:50 PM
To: postgis-users at postgis.refractions.net
Subject: Re: [postgis-users] A PostGIS-Raster data proposal

Remember not to get too hung up on performance... after all, we knew
going in that a database would degrade performance.  Presumably there
are other characteristics of the database that we *like* that are
worth taking a performance hit for.

Also consider the Oracle approach in your thinking, which is to shred
everything into < pagesize chunks.  We used this approach in our only
raster-in-db implementation (the CHIP type), which, I can tell you
from experience, can be a real PITA to manage.

P

Quoting Stephen Marshall <smarshall at wsi.com>:

> I've extended my testing of server-side access to include testing 
> large objects (LO's) as well as TOAST.
> 
> I found that LO's have similar performance characteristics to TOAST.
> On average, it takes tens of microseconds to access a 1000-byte chunk
> of data from a much larger LO.  For both TOAST and LO's, the time per
> access hovered close to 40 microseconds, compared to around 4
> microseconds for a similar access from a local file using fseek and
> fread.
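> 
> (For reference, the flat-file side of that timing loop looks roughly
> like the sketch below; the path and chunk count are placeholders.)
> 
>     #include <stdio.h>
>     #include <sys/time.h>
> 
>     int main(void)
>     {
>         char buf[1000];
>         struct timeval t0, t1;
>         long i, n = 20000;   /* ~20 MB in 1000-byte chunks */
>         FILE *fp = fopen("/tmp/test.dat", "rb");  /* placeholder */
> 
>         if (fp == NULL)
>             return 1;
>         gettimeofday(&t0, NULL);
>         for (i = 0; i < n; i++)
>         {
>             fseek(fp, i * 1000L, SEEK_SET);
>             fread(buf, 1, sizeof(buf), fp);
>         }
>         gettimeofday(&t1, NULL);
> 
>         printf("%.2f usec/read\n",
>                ((t1.tv_sec - t0.tv_sec) * 1e6 +
>                 (t1.tv_usec - t0.tv_usec)) / n);
>         fclose(fp);
>         return 0;
>     }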
> 
> The LO and flatfile tests differed from the TOAST tests in that LO's
> and flatfiles must be opened before access.  The LO and flatfiles
> were opened once prior to all the reads, and the opens were not
> included in the timing tests.  Opening a file generally takes around
> 100 microseconds and opening an LO takes several hundred
> microseconds, so the overhead is large relative to the time to read a
> small chunk of data.  It will be important to avoid designs that
> require opening an LO or file more than once.
> 
> Determining the size of the data also has performance implications
> that I have not yet tested thoroughly, but which may be important.
> My tests require knowing the total data size, which may not
> necessarily be needed for raster-in-DB (for example, the data formats
> themselves may carry sufficient size information).  However, finding
> the size of a TOAST value or LO can be relatively expensive.
> 
> I am looking into ways to find the size of the TOAST object, but my
> current method seems to require reading all the data into memory.
> Since this subverts the whole point of seeking within the TOASTed
> data, I'm not happy with it.  I'm looking into how to just scrape the
> length from the bytea itself, which I believe should be stored in the
> first 4 bytes of the bytea (a variable-length field).  If anyone
> knows a different or better way to find the length of a bytea
> variable, please let me know.
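> 
> For concreteness, the server-side direction I'm exploring is sketched
> below (the function name is invented).  If I'm reading the source
> correctly, toast_raw_datum_size() pulls the size from the TOAST
> pointer without fetching any chunks, and the built-in
> octet_length(bytea) appears to take the same route.
> 
>     #include "postgres.h"
>     #include "fmgr.h"
>     #include "access/tuptoaster.h"   /* toast_raw_datum_size() */
> 
>     #ifdef PG_MODULE_MAGIC
>     PG_MODULE_MAGIC;                 /* required from 8.2 on */
>     #endif
> 
>     PG_FUNCTION_INFO_V1(bytea_raw_length);
> 
>     /* Return a bytea's payload length without detoasting it. */
>     Datum
>     bytea_raw_length(PG_FUNCTION_ARGS)
>     {
>         /* Note: PG_GETARG_DATUM does not detoast the argument. */
>         Size rawsize = toast_raw_datum_size(PG_GETARG_DATUM(0));
> 
>         PG_RETURN_INT32((int32) (rawsize - VARHDRSZ));
>     }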
> 
> Finding the size of an LO seems to require seeking to the end of the
> LO and using lo_tell to return the seek position.  This takes more
> time than a single 1000-byte read (~65 microseconds).  Does anyone
> know a faster way to find the size of an LO?
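> 
> For concreteness, the pattern I'm using is sketched below; since
> lo_lseek returns the new position, no separate lo_tell call is
> needed, but each seek is still a server round trip:
> 
>     #include <stdio.h>       /* SEEK_END, SEEK_SET */
>     #include <libpq-fe.h>
> 
>     /* Size of an already-opened LO; fd comes from lo_open() earlier
>      * in the same transaction. */
>     static int lo_size(PGconn *conn, int fd)
>     {
>         int size = lo_lseek(conn, fd, 0, SEEK_END);
> 
>         lo_lseek(conn, fd, 0, SEEK_SET);  /* rewind for later reads */
>         return size;
>     }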
> 
> Finally, these tests (still) do not include the time it takes to
> transfer the data back to the database client, so all the accesses
> are inherently accesses to the local system only.  I'll be expanding
> the tests to include remote access soon.
> 
> Steve Marshall
> 
> Marshall, Steve wrote:
> 
> >Per Frank Warmerdam's suggestion, I've done a test of access
> >performance using internal PostgreSQL TOAST functions vs. normal
> >file seeking.
> >
> >The test involved seeking in a toasted bytea column containing
> >approximately 20 MB of binary data.  The TOAST column was set to
> >EXTERNAL storage (i.e. in a separate TOAST table, but not
> >compressed).  The test involved seeking through the data
> >sequentially in chunks of 1000 bytes, and measuring the time to
> >retrieve each chunk.  The code to do this was encapsulated in a
> >PostgreSQL server-side function and invoked through SQL.  I
> >restarted the PostgreSQL server before the test to avoid any caching
> >of data in shared memory, which could artificially speed up the data
> >access.
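> >
> >The heart of the server-side function is just a slice fetch; a
> >simplified sketch (names invented) is below.  The _SLICE macro
> >detoasts only the chunks covering the requested range, which is why
> >EXTERNAL (uncompressed) storage matters.
> >
> >    #include "postgres.h"
> >    #include "fmgr.h"
> >
> >    #ifdef PG_MODULE_MAGIC
> >    PG_MODULE_MAGIC;        /* required from 8.2 on */
> >    #endif
> >
> >    PG_FUNCTION_INFO_V1(bytea_read_chunk);
> >
> >    /* Return bytes [offset, offset + len) of a bytea argument,
> >     * fetching only the TOAST chunks that cover that range. */
> >    Datum
> >    bytea_read_chunk(PG_FUNCTION_ARGS)
> >    {
> >        int32  offset = PG_GETARG_INT32(1);
> >        int32  len    = PG_GETARG_INT32(2);
> >        bytea *slice  = PG_GETARG_BYTEA_P_SLICE(0, offset, len);
> >
> >        PG_RETURN_BYTEA_P(slice);
> >    }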
> >
> >As a comparison, I also wrote a program that would do the equivalent 
> >data access from a file.  The file contained the same data as the 
> >bytea column, and the access was replaced with fseek and fread calls.
> >
> >The results of the test were that TOAST seeking was about 10 times
> >more expensive than seeking in a local file.  Each local file access
> >averaged single-digit microseconds, while TOAST seeks averaged tens
> >of microseconds.  The worst-case file seek was in milliseconds,
> >while the worst-case TOAST seek was in tens of milliseconds.  The
> >absolute values for TOAST seeks don't seem too bad to me, but it is
> >a bit worrying that they are an order of magnitude worse than local
> >file I/O.
> >
> >I did play around with some parameters in the DB test.  Changing the
> >chunk size did not make a big difference, but I got a small boost by
> >setting it to the TOAST chunk size (1994 bytes).  I did not vary the
> >test to seek around randomly instead of sequentially.  This might
> >give a boost to the DB implementation due to caching; I'm not sure
> >what this would do to file I/O.
> >
> >I also have not explored the performance of repeated access to the 
> >same data segments.  Here PostgreSQL data caching might help DB 
> >access relative to file I/O.
> >
> >There are still more things to do here, but I thought I'd share some
> >early results.  I'm happy to provide the code and SQL definitions
> >for the test, if anyone else is interested.
> >
> >Steve Marshall
> >
> >
> 
_______________________________________________
postgis-users mailing list
postgis-users at postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-users



