[postgis-devel] about serialization in PostGIS 3.0

Wed Oct 3 06:27:20 PDT 2018

Hello

I have followed some of the discussions about serialization in PostGIS
3, but I have probably missed some.

One optional serialization as mentioned in the wiki is some sort of
specific compression like twkb. That is an interesting topic of
course:-)

I have been thinking some about it, and have an idea.

In twkb we compress by storing delta values between vertex points and
packing those values as variable-length integers, à la protobuf.

We lose two things. First, we must have a fixed precision, and the more
precision, the larger the result. So if someone really needs 6 decimals
in a meter-based projection, there is not much compression at all. That
is because we steal 1 bit per byte to tell whether another byte follows
in the value.
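
To make the precision cost concrete, here is a minimal one-dimensional
sketch of delta + zigzag + varint packing in Python. It is not the
actual twkb encoder; the function names (zigzag, varint, encode_deltas)
are made up for illustration, and the first delta is simply taken from
zero.

```python
def zigzag(n: int) -> int:
    """Map a signed int to unsigned so small magnitudes stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def varint(u: int) -> bytes:
    """Protobuf-style varint: 7 payload bits per byte, high bit = 'more'."""
    out = bytearray()
    while True:
        low7 = u & 0x7F
        u >>= 7
        if u:
            out.append(low7 | 0x80)  # continuation bit set
        else:
            out.append(low7)
            return bytes(out)


def encode_deltas(values, decimals: int) -> bytes:
    """Quantize to a fixed number of decimals, then pack the deltas."""
    scale = 10 ** decimals
    out = bytearray()
    prev = 0  # assumption: the first delta is relative to zero
    for v in values:
        q = round(v * scale)
        out += varint(zigzag(q - prev))
        prev = q
    return bytes(out)


xs = [100.0, 110.0, 125.0]
print(len(encode_deltas(xs, 0)))  # 4 bytes at 1 m precision
print(len(encode_deltas(xs, 6)))  # 12 bytes at 6 decimals
```

The same three values grow from 4 to 12 bytes when going from 0 to 6
decimals, since every delta then needs four 7-bit groups.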

Second, we lose alignment. We cannot find the fifth point in a point
array with simple pointer arithmetic.

My idea is about solving the second part, and I don't know how it will
affect the overall compression.

The idea is that a geometry should have a fixed number of bytes per
coordinate in a whole geometry (or only in a point array). It can also
be fixed per dimension, if that gives enough gain.

We calculate the bounding box and decide, from its width and height
and the requested precision, how many bytes we need to describe every
point inside the bounding box.

Then we choose the lower left corner of the bounding box as the origin
(no negative values are needed, which simplifies encoding and decoding
slightly).
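
The steps above could be sketched roughly like this in Python. This is
only an illustration under my own assumptions: one shared byte width
for both dimensions, little-endian unsigned integers, 2D points only,
and hypothetical names (bytes_needed, encode_fixed). Nothing here is an
existing PostGIS format.

```python
import math


def bytes_needed(extent: float, precision: float) -> int:
    """Bytes required to count 'precision'-sized steps across 'extent'."""
    steps = math.ceil(extent / precision)
    return max(1, (steps.bit_length() + 7) // 8)


def encode_fixed(points, precision: float):
    """Return (origin, width, body) for a list of (x, y) tuples."""
    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    max_x = max(x for x, _ in points)
    max_y = max(y for _, y in points)

    # One width for the whole point array, driven by the larger side.
    width = bytes_needed(max(max_x - min_x, max_y - min_y), precision)

    body = bytearray()
    for x, y in points:
        # Offsets from the lower left corner are never negative.
        body += round((x - min_x) / precision).to_bytes(width, "little")
        body += round((y - min_y) / precision).to_bytes(width, "little")
    return (min_x, min_y), width, bytes(body)


pts = [(1000.0, 2000.0), (1100.0, 2050.0), (1255.0, 2200.0)]
origin, width, body = encode_fixed(pts, 1.0)
print(origin, width, len(body))  # (1000.0, 2000.0) 1 6
```

With a precision of 0.01 the same points need 2 bytes per coordinate,
so the body grows to 12 bytes.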

I am not sure the compression will be any worse than variable integer
since we get back the stolen bit per byte.

So if a precision of 1 meter is requested in a meter-based projection,
all geometries with a bounding box of at most 255 meters in size will
get 1-byte coordinates. In twkb, by comparison, we get a 1-byte
coordinate as long as it is no more than 63 meters in a given dimension
from the previous point. That is because we lose 1 bit to the variable
integer and 1 bit to the sign. But that is comparing apples to oranges,
since in twkb the distance is from the previous point, while here it is
from our local origin.
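
The 255 versus 63 figures follow from simple bit counting, which this
small Python sketch spells out. The function names are made up, and the
signed varint range assumes zigzag-style sign handling as in protobuf.

```python
def fixed_max(nbytes: int) -> int:
    """Largest unsigned offset from the origin that fits in nbytes."""
    return 2 ** (8 * nbytes) - 1


def varint_signed_range(nbytes: int):
    """Signed delta range of an nbytes varint: 7 bits per byte, 1 for sign."""
    magnitude_bits = 7 * nbytes - 1
    return -(2 ** magnitude_bits), 2 ** magnitude_bits - 1


for n in (1, 2):
    print(n, fixed_max(n), varint_signed_range(n))
# 1 255 (-64, 63)
# 2 65535 (-8192, 8191)
```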

We also get all coordinates aligned. So, by reading the byte size in
the header and the coordinates of the lower left corner (also from the
header, or from the stored bounding box), we can pick any vertex point
instantly.
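
As a sketch of what "pick any vertex instantly" could look like, here
is a Python lookup that jumps straight to the n:th point. The layout is
an assumption for illustration only: interleaved x/y, little-endian
unsigned offsets of a fixed width, and a hypothetical function name
vertex_at.

```python
def vertex_at(origin, width, precision, body, index, dims=2):
    """Decode point 'index' without touching the points before it."""
    stride = dims * width
    offset = index * stride
    if index < 0 or offset + stride > len(body):
        raise IndexError("vertex index out of range")

    coords = []
    for d in range(dims):
        start = offset + d * width
        q = int.from_bytes(body[start:start + width], "little")
        coords.append(origin[d] + q * precision)
    return tuple(coords)


# Three points with 1-byte coordinates, origin at (1000, 2000), 1 m steps.
body = bytes([0, 0, 100, 50, 255, 200])
print(vertex_at((1000.0, 2000.0), 1, 1.0, body, 2))  # (1255.0, 2200.0)
```

With a varint stream the same lookup would have to decode every earlier
delta first, since neither the byte offset nor the absolute value of a
point is known without them.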

I think this would perform quite well for anything but points on many
real life data sets.

For points it will be just extra overhead.

For encoding and decoding, twkb performs well in terms of speed. I guess
something like this would be about the same speed, if not faster.

But for the cases when the user wants many decimals per meter the
compression will still be bad. 

My experience is that compression like this (twkb) is very much faster
than generic compression. I have done some testing, creating twkb
datasets in the database and comparing that to writing wkb to a file
and compressing it with an external tool. Compressing a 1 GB file
takes quite a long time with ordinary zip tools.

Any thoughts?

Nicklas Avén