[postgis-devel] code reorganization in pgsql2shp.c

Mon Nov 17 14:41:58 PST 2003

strk wrote:
> I don't think making a row buffer will speed up things very much.
> If memory usage was a matter before, I think this is still the easiest
> way to keep it low. If we want to buffer multiple rows we should also
> compute their size to decide when to stop based on a threshold ....

dblasby wrote:
> I think it's pretty easy - just do a "FETCH n" instead of "FETCH 1".
> Just default n to something like 1000 but provide a command line switch 
> to decrease it if needed.  The only time it'll be a problem is if 1000 
> (or whatever they've chosen) is bigger than memory.

pramsey wrote:
> Is fetching one record at a time the best policy? Maybe a few hundred or
> thousand at a time would be faster?

I was asked for an opinion and I gave it.
I may have used the wrong terms, so I'll try to be clearer.

The "best" policy depends on many factors. One of them is
development time. Although FETCHing n rows instead of 1 sounds like a
quick modification, it requires further thought about memory usage.
Dave has done that thinking excellently, and proposes a command line
switch. That in turn raises the questions of how to name the switch
and where to document it...
Still, whether FETCHing n rows instead of 1 is the "best" policy
is not a question I would answer with a decisive "YES"!
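
As a rough sketch of what the FETCH n change involves on the client
side, the fragment below parses a batch size and builds the FETCH
statement. The function names, the cursor name and the fallback
behaviour are illustrative assumptions, not code from pgsql2shp.c; the
1000-row default is the figure Dave suggested, and the name of the
command line switch is deliberately left open.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEFAULT_FETCH_ROWS 1000L /* default suggested in this thread */

/* Parse a user-supplied batch size (hypothetical helper).
 * Falls back to the default on missing, malformed or non-positive input. */
long parse_fetch_rows(const char *arg)
{
    char *end = NULL;
    long n;

    if (arg == NULL || *arg == '\0')
        return DEFAULT_FETCH_ROWS;
    n = strtol(arg, &end, 10);
    if (*end != '\0' || n < 1)
        return DEFAULT_FETCH_ROWS;
    return n;
}

/* Write "FETCH <n> FROM <cursor>" into buf (hypothetical helper).
 * Returns 0 on success, -1 if the statement did not fit. */
int build_fetch_sql(char *buf, size_t size, long n, const char *cursor)
{
    int len;

    assert(buf != NULL && cursor != NULL);
    len = snprintf(buf, size, "FETCH %ld FROM %s", n, cursor);
    return (len < 0 || (size_t)len >= size) ? -1 : 0;
}
```

The resulting string would be handed to PQexec() in place of the
one-row FETCH, and the loop over the result would then run over
PQntuples() rows instead of a single one.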

I did not say that FETCHing 1 IS the best; I said it is the easiest way
to keep memory usage low, where "low" is admittedly a generic term.
We certainly cannot split a feature into its sub-objects without
increasing complexity, so processing ONE record at a time guarantees
minimal memory usage for the least developer time.
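
The size threshold I mentioned in my earlier message could be tracked
roughly as follows. This is only a sketch of the bookkeeping: the
function name and the byte limit are hypothetical, and with a cursor
the row sizes are only known after the rows have been fetched (for
instance from PQgetlength()), so in practice the count could only
guide the size of the next batch.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many leading rows fit under max_bytes (hypothetical helper).
 * The first row is always accepted, even if it alone exceeds the limit,
 * so that a single huge feature can still be processed. */
size_t rows_in_batch(const size_t *row_bytes, size_t nrows, size_t max_bytes)
{
    size_t total = 0;
    size_t i;

    assert(row_bytes != NULL || nrows == 0);
    for (i = 0; i < nrows; i++) {
        /* Stop before the row that would push the total over the limit. */
        if (i > 0 && (total >= max_bytes || row_bytes[i] > max_bytes - total))
            break;
        total += row_bytes[i];
    }
    return i;
}
```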

Reduced memory usage was the first request made about the dumper, since
a database is usually used for large amounts of data.

When it comes to speed, I noted that there are more important
modifications to make before worrying about the number of records
returned from a cursor:

strk wrote:
> I don't really know about low-level handling of cursors by postgresql,
> but I suppose the parse, rewrite, plan and optimize stages are skipped.

dblasby wrote:
strk wrote:
> 
> > If we care about speed, it might be worth using WKB or internal binary
> > representation instead of parsing WKT. Internal binary representation
> > will forget about endianness, so it will be safe only when both data
> > and dumper run on the same machine (or endian-compatible machines).
> 
> I think pgsql2shp already uses WKB.

Nope. Dumper parses WKT. Loader writes WKT.
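
On the endianness point: unlike the internal binary representation,
WKB records its byte order in the first byte (0 = big-endian,
1 = little-endian), so a reader can swap bytes when needed. A minimal
sketch of reading a point that way, assuming IEEE 754 doubles and
handling only a plain 2D Point (type 1); the function names are
hypothetical and none of this is taken from pgsql2shp.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return 1 if this host stores the low-order byte first. */
int host_is_little_endian(void)
{
    uint16_t x = 1;
    return *(const unsigned char *)&x == 1;
}

/* Read a 4-byte unsigned integer in the given byte order. */
uint32_t wkb_read_uint32(const unsigned char *p, int little)
{
    if (little)
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return (uint32_t)p[3] | ((uint32_t)p[2] << 8) |
           ((uint32_t)p[1] << 16) | ((uint32_t)p[0] << 24);
}

/* Read an 8-byte double, swapping bytes if the WKB order differs
 * from the host order. */
double wkb_read_double(const unsigned char *p, int little)
{
    unsigned char tmp[8];
    double v;
    int i;

    if (little == host_is_little_endian())
        memcpy(tmp, p, 8);
    else
        for (i = 0; i < 8; i++)
            tmp[i] = p[7 - i];
    memcpy(&v, tmp, 8);
    return v;
}

/* Parse a 2D WKB Point: 1 byte order flag, 4 byte type, two doubles.
 * Returns 0 on success, -1 on malformed or unsupported input. */
int wkb_parse_point(const unsigned char *wkb, size_t len, double *x, double *y)
{
    int little;

    assert(x != NULL && y != NULL);
    if (wkb == NULL || len < 21 || wkb[0] > 1)
        return -1;
    little = wkb[0];
    if (wkb_read_uint32(wkb + 1, little) != 1) /* 1 = Point */
        return -1;
    *x = wkb_read_double(wkb + 5, little);
    *y = wkb_read_double(wkb + 13, little);
    return 0;
}
```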

> 
> We should make both pgsql2shp and shp2pgsql use WKB so there isn't any
> numeric drift.  I'm currently looking at making the JDBC driver
> WKB-aware so we don't accidentally move points by a wee bit.
> 

I agree.
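
The drift is easy to reproduce outside the dumper. The sketch below
sends a double through decimal text with a limited number of
significant digits, as a WKT writer might, and through raw bytes, as a
binary format does. The function names and the digit counts are
illustrative assumptions; I have not checked which precision the
current WKT code actually prints.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Round-trip a double through decimal text with the given number of
 * significant digits. */
double roundtrip_text(double v, int digits)
{
    char buf[64];

    assert(digits > 0);
    snprintf(buf, sizeof buf, "%.*g", digits, v);
    return strtod(buf, NULL);
}

/* Round-trip a double through its raw bytes, which is lossless. */
double roundtrip_binary(double v)
{
    unsigned char bytes[sizeof(double)];
    double out;

    memcpy(bytes, &v, sizeof v);
    memcpy(&out, bytes, sizeof out);
    return out;
}
```

With v = 0.1 + 0.2, roundtrip_text(v, 15) comes back as a different
double, while roundtrip_binary(v) returns v unchanged. Printing 17
significant digits is enough for text to round-trip an IEEE 754 double
too, at the cost of longer output and the parsing time.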

--strk(shattered);

PS: fetching N instead of 1 can surely be done in less than the time
    required to write this mail. It is left to the reader as an exercise.