[gdal-dev] Design for sub-second accuracy in OGR ?

Sun Apr 5 12:25:53 PDT 2015

Hi,

While revisiting http://trac.osgeo.org/gdal/ticket/2680, which is about the 
lack of precision of the current datetime structure, I've imagined different 
ways to modify the OGRField structure, but failed to pick one that would be 
the obvious solution, so opinions are welcome.

The issue is how to add (at least) millisecond accuracy to the datetime 
structure, as a few formats support it explicitly or implicitly: MapInfo, 
GPX, Atom (GeoRSS driver), GeoPackage, SQLite, PostgreSQL, CSV, GeoJSON, ODS, 
XLSX, KML (potentially GML too)...

Below are a few potential solutions:

---------------------------------------
Solution 1) : Millisecond accuracy, second becomes a float

This is the solution I've prototyped.

typedef union {
[...]
    struct {
        GInt16  Year;
        GByte   Month;
        GByte   Day;
        GByte   Hour;
        GByte   Minute;
        GByte   TZFlag; 
        GByte   Precision; /* value in OGRDateTimePrecision */
        float   Second; /* from 00.000 to 60.999 (millisecond accuracy) */
    } Date;
} OGRField;

So sub-second precision is represented with a single-precision floating-point 
number, storing both integral and decimal parts. (We could theoretically have 
hundredth-of-millisecond accuracy, 10^-5 s, since 6099999 fits in the 23 bits 
of the mantissa.)

Another addition is the Precision field that indicates which parts of the 
datetime structure are significant.

/** Enumeration that defines the precision of a DateTime.
  * @since GDAL 2.0
  */
typedef enum
{
    ODTP_Undefined,     /**< Undefined */
    ODTP_Guess,         /**< Only valid when setting through SetField(i,year, 
month...) where OGR will guess */
    ODTP_Y,             /**< Year is significant */
    ODTP_YM,            /**< Year and month are significant*/
    ODTP_YMD,           /**< Year, month and day are significant */
    ODTP_YMDH,          /**< Year, month, day (if OFTDateTime) and hour are 
significant */
    ODTP_YMDHM,         /**< Year, month, day (if OFTDateTime), hour and 
minute are significant */
    ODTP_YMDHMS,        /**< Year, month, day (if OFTDateTime), hour, minute 
and integral second are significant */
    ODTP_YMDHMSm,       /**< Year, month, day (if OFTDateTime), hour, minute 
and second with milliseconds are significant */
} OGRDateTimePrecision;

I think this is important since "2015/04/05 17:12:34" and "2015/04/05 
17:12:34.000" do not really mean the same thing and it might be good to be 
able to preserve the original accuracy when converting between formats.

A drawback of this solution is that the size of the OGRField structure 
increases from 8 bytes to 12 on 32-bit builds (and remains 16 bytes on 64-bit 
builds). This is probably not that important since in most cases not that 
many OGRField structures are instantiated at one time (typically, you iterate 
over features one at a time).
This could be more of a problem for use cases that involve the MEM driver, as 
it keeps all features in memory.

Another drawback is that the change of the structure might not be directly 
noticed by application developers: the Second field name is preserved, but a 
new Precision field is added, so there's a risk that Precision is left 
uninitialized if the field is set through OGRFeature::SetField(int iFieldIndex, 
OGRField* psRawField). That could lead to unexpected formatting (but hopefully 
not crashes, with defensive programming). However, I'd think it is unlikely 
that many applications manipulate OGRField directly, instead of using the 
getters and setters of OGRFeature.

---------------------------------------
Solution 2) : Millisecond accuracy, second and milliseconds in distinct fields

typedef union {
[...]
    struct {
        GInt16  Year;
        GByte   Month;
        GByte   Day;
        GByte   Hour;
        GByte   Minute;
        GByte   TZFlag;
        GByte   Precision; /* value in OGRDateTimePrecision */
        GByte   Second; /* from 0 to 60 */
        GUInt16 Millisecond; /* from 0 to 999 */
    } Date;
} OGRField;

Same size of structure as in 1)

---------------------------------------
Solution 3) : Millisecond accuracy, pack all fields

Conceptually, this would use bit fields to avoid wasting unused bits :

typedef union {
[...]
  struct {
    GInt16        Year;
    GUIntBig     Month:4;
    GUIntBig     Day:5;
    GUIntBig     Hour:5;
    GUIntBig     Minute:6;
    GUIntBig     Second:6;
    GUIntBig     Millisecond:10; /* 0-999 */
    GUIntBig     TZFlag:8;
    GUIntBig     Precision:4;
 } Date;
} OGRField;

This was proposed in the above-mentioned ticket. And as there were enough 
remaining bits, I've also added the Precision field (as in all other 
solutions).

The advantage is that sizeof(mydate) remains 8 bytes on 32-bit builds.

But the C standard only defines bit-fields of type int/unsigned int, so this 
is not portable. In addition, the way bits are packed is not defined by the 
standard, so different compilers could come up with different packing. A 
workaround is to do the bit manipulation through macros:

typedef union {
[...]
  struct {
    GUIntBig     opaque;
  } Date;
} OGRField;

#define GET_BITS(x,y_bits,shift)        (int)(((x).Date.opaque >> (shift)) & \
                                              ((1 << (y_bits))-1))

#define GET_YEAR(x)              (short)GET_BITS(x,16,64-16)
#define GET_MONTH(x)             GET_BITS(x,4,64-16-4)
#define GET_DAY(x)               GET_BITS(x,5,64-16-4-5)
#define GET_HOUR(x)              GET_BITS(x,5,64-16-4-5-5)
#define GET_MINUTE(x)            GET_BITS(x,6,64-16-4-5-5-6)
#define GET_SECOND(x)            GET_BITS(x,6,64-16-4-5-5-6-6)
#define GET_MILLISECOND(x)       GET_BITS(x,10,64-16-4-5-5-6-6-10)
#define GET_TZFLAG(x)            GET_BITS(x,8,64-16-4-5-5-6-6-10-8)
#define GET_PRECISION(x)         GET_BITS(x,4,64-16-4-5-5-6-6-10-8-4)

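/* The value is masked so that it cannot spill into neighbouring fields */
#define SET_BITS(x,y,y_bits,shift)  (x).Date.opaque = (((x).Date.opaque & \
    ~((GUIntBig)((1 << (y_bits))-1) << (shift))) | \
    (((GUIntBig)(y) & ((1 << (y_bits))-1)) << (shift)))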

#define SET_YEAR(x,val)            SET_BITS(x,val,16,64-16)
#define SET_MONTH(x,val)           SET_BITS(x,val,4,64-16-4)
#define SET_DAY(x,val)             SET_BITS(x,val,5,64-16-4-5)
#define SET_HOUR(x,val)            SET_BITS(x,val,5,64-16-4-5-5)
#define SET_MINUTE(x,val)          SET_BITS(x,val,6,64-16-4-5-5-6)
#define SET_SECOND(x,val)          SET_BITS(x,val,6,64-16-4-5-5-6-6)
#define SET_MILLISECOND(x,val)     SET_BITS(x,val,10,64-16-4-5-5-6-6-10)
#define SET_TZFLAG(x,val)          SET_BITS(x,val,8,64-16-4-5-5-6-6-10-8)
#define SET_PRECISION(x,val)       SET_BITS(x,val,4,64-16-4-5-5-6-6-10-8-4)

Main advantage: the size of OGRField remains unchanged (so 8 bytes on 32-bit 
builds).

Drawback: manipulation of datetime members is less natural, but there are not 
that many places in the GDAL code base where the OGRField.Date members are 
used, so it is not much of a problem.

---------------------------------------
Solution 4) : Microsecond accuracy with one field

Solution 1) used a float for second and sub-second, but a float has only 23 bits 
of mantissa, which is enough to represent second with millisecond accuracy, 
but not for microsecond (you need 26 bits for that). So use a 32-bit integer 
instead of a 32-bit floating point.

typedef union {
[...]
    struct {
        GInt16  Year;
        GByte   Month;
        GByte   Day;
        GByte   Hour;
        GByte   Minute;
        GByte   TZFlag; 
        GByte   Precision; /* value in OGRDateTimePrecision */
        GUInt32 Microseconds; /* 00000000 to 59999999 */
    } Date;
} OGRField;

Same as solution 1: sizeof(OGRField) becomes 12 bytes on 32-bit builds (and 
remains 16 bytes on 64-bit builds).

We would need to add an extra value in OGRDateTimePrecision to denote 
microsecond accuracy.

It is not really clear that we need microsecond accuracy... Most formats that 
support sub-second accuracy use the ISO 8601 representation (e.g. 
YYYY-MM-DDTHH:MM:SS.sssssZ), which doesn't define the maximum number of 
decimals beyond the second. According to 
http://www.postgresql.org/docs/9.1/static/datatype-datetime.html, 
PostgreSQL supports microsecond accuracy.

---------------------------------------
Solution 5) : Microsecond with 3 fields

A variant where we split second into 3 integer parts:

typedef union {
[...]
    struct {
        GInt16  Year;
        GByte   Month;
        GByte   Day;
        GByte   Hour;
        GByte   Minute;
        GByte   TZFlag;
        GByte   Precision; /* value in OGRDateTimePrecision */
        GByte   Second; /* 0 to 59 */
        GUInt16 Millisecond; /* 0 to 999 */
        GUInt16 Microsecond; /* 0 to 999 */
    } Date;
} OGRField;

Drawback: due to alignment, sizeof(OGRField) becomes 16 bytes on 32-bit builds 
(and remains 16 bytes on 64-bit builds).

---------------------------------------
Solution 6) : Nanosecond accuracy and beyond !

Now that we are using 16 bytes, why not have nanosecond accuracy?

typedef union {
[...]
    struct {
        GInt16  Year;
        GByte   Month;
        GByte   Day;
        GByte   Hour;
        GByte   Minute;
        GByte   TZFlag; 
        GByte   Precision; /* value in OGRDateTimePrecision */
        double  Second; /* 0.000000000 to 60.999999999 */
    } Date;
} OGRField;

Actually we even have picosecond accuracy! (since for picoseconds, we need 46 
bits and a double has 52 bits of mantissa). And if we use a 64-bit integer 
instead of a double, we can have femtosecond accuracy ;-)

Any preference ?

Even

-- 
Spatialys - Geospatial professional services
http://www.spatialys.com