[gdal-dev] Design for sub-second accuracy in OGR ?

Even Rouault even.rouault at spatialys.com
Mon Apr 6 15:15:49 PDT 2015


On Monday, April 6, 2015 at 23:32:40, Dmitriy Baryshnikov wrote:
> Why not read all date/time data from records as accurately as possible?

That's what I intended and prototyped. Drivers analyze the date/time value and 
set it with an evolution of the existing SetField() method for date/times, 
which takes an additional OGRDateTimePrecision ePrecision parameter. They can 
leave it as _Guess if they don't know the precision (typically when the value 
comes from a binary format) or set it explicitly when it comes from a text 
format.

> For example for OFTDate we get the date via GetFieldAsDateTime
> <http://www.gdal.org/classOGRFeature.html#a6c5d2444407b07e07b79863c42ee7a49>
> and the time is zero.
> It's strange to analyse the data structure while reading the records, as we
> already have the field definition.

At the OGR model level yes, but there's no provision in the formats themselves 
to store the level of precision of the date/time.

> For old datasets we can use type DateTime + SubType ODTP_YMDHMSm, and for new
> datasets let the user choose the subtype. Certainly some formats already
> support this new type + subtype (e.g. Postgres/PostGIS, etc.).

I'm not sure how your suggestion would work (I'm not sure I've understood it 
correctly), at least on the reading side. When reading, you can only know 
whether a field is a Date, Time or DateTime (and some formats might not even 
have that level of distinction) by examining the layer/table metadata. To know 
whether a datetime has second or millisecond accuracy, you need to fetch 
records (which could be costly if you have many records with null values for 
that field), so this is not an operation we generally want to do, so that 
GetLayerDefn() stays fast whatever the format.

That said, quite a few of the formats I mentioned (GPX, Atom (GeoRSS driver), 
CSV in AUTODETECT_TYPES=YES mode, GeoJSON, ODS, XLSX, LIBKML) work with a 
preliminary ingestion/analysis phase over the whole dataset, so we could 
probably figure out the maximum accuracy of all date/time records of a given 
field. But that would involve much more rework of those drivers than I've 
currently prototyped... And it wouldn't solve the problem for Postgres, 
MapInfo, SQLite and GeoPackage.
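
For such drivers, the pre-scan could simply keep, per field, the maximum 
precision seen so far. A rough sketch only (GetValuePrecision(), nRecordCount 
and papszRawValues are hypothetical names for whatever the driver uses during 
its analysis pass):

    /* During the preliminary analysis pass, widen the per-field precision to
       the most precise value encountered. This relies on the enum values being
       ordered from least to most precise. */
    OGRDateTimePrecision ePrec = ODTP_Undefined;
    for( int i = 0; i < nRecordCount; i++ )
    {
        /* GetValuePrecision() is hypothetical: however the driver classifies
           one raw date/time string. */
        const OGRDateTimePrecision eValPrec =
            GetValuePrecision( papszRawValues[i] );
        if( eValPrec > ePrec )
            ePrec = eValPrec;
    }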

> 
> Postgres data type mapping:
> date -> OFTDateTime + ODTP_YMD
> time-> OFTDateTime + ODTP_HMS
> timestamp -> OFTDateTime + ODTP_YMDHMSm

The Postgres case is probably not the best one to illustrate that accuracy 
concept, since internally it stores timestamps as 8-byte integers, so 
"2015/04/05 17:12:34" and "2015/04/05 17:12:34.000" are stored identically. 
Consequently, on reading you have to trust the "Guess" mode (which uses this 
simple heuristic: if the milliseconds are not 0, then you have millisecond 
accuracy, otherwise second accuracy). All the other formats (except MapInfo) 
store them as strings.
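
For illustration, the read-side "Guess" boils down to something like this 
(just a sketch; the helper name is made up for this email, not part of the 
prototype):

    /* Hypothetical helper: if the sub-second part is non-zero, assume
       millisecond accuracy, otherwise plain second accuracy. */
    static OGRDateTimePrecision OGRGuessDateTimePrecision( float fSecond )
    {
        const int nMilliseconds = ((int)(fSecond * 1000.0f + 0.5f)) % 1000;
        return (nMilliseconds != 0) ? ODTP_YMDHMSm : ODTP_YMDHMS;
    }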

Oh well, if it sounds too weird/confusing to have this precision information 
at the record level, maybe we can drop it and always output to the millisecond 
for text formats (although there's a potential risk that this would cause 
issues for parsers that don't expect a decimal second). I had imagined this 
precision information more as an implementation detail than as something we 
would really want to advertise and that applications would have to care about 
(it is an optional parameter in the modified getters/setters of OGRFeature I 
prototyped):

    int                 GetFieldAsDateTime( int i,
                                     int *pnYear, int *pnMonth, int *pnDay,
                                     int *pnHour, int *pnMinute, float *pfSecond,
                                     int *pnTZFlag,
                                     OGRDateTimePrecision *pePrecision = NULL );

    void                SetField( int i, int nYear, int nMonth, int nDay,
                                  int nHour = 0, int nMinute = 0, float fSecond = 0.f,
                                  int nTZFlag = 0,
                                  OGRDateTimePrecision ePrecision = ODTP_Guess );
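
To make the intent concrete, a minimal usage sketch against those prototyped 
signatures (poFeature and iField are just placeholders for a feature and a 
field index at hand):

    /* Text format with an explicit fractional part: the driver knows the
       precision and sets it explicitly (TZFlag 100 = UTC). */
    poFeature->SetField( iField, 2015, 4, 5, 17, 12, 34.567f, 100, ODTP_YMDHMSm );

    /* Binary format where the precision cannot be known: leave the default
       ODTP_Guess and let OGR apply its heuristic. */
    poFeature->SetField( iField, 2015, 4, 5, 17, 12, 34.0f );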

Hmm, I'm just thinking we could also implement the Guess logic on output, that 
is, output with milliseconds if the milliseconds are not 0, and with integral 
seconds otherwise. That could probably be a good compromise. The use cases 
where we really want to write ".000" are not that obvious after all.
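
On the writing side that would look roughly like this (a sketch of the 
formatting decision only; the helper name is invented for the example, and 
snprintf needs <stdio.h> or cpl_port.h):

    /* Emit ".xxx" only when the sub-second part is non-zero. */
    static void OGRFormatSecond( float fSecond, char* pszBuf, size_t nBufSize )
    {
        const int nMilliseconds = ((int)(fSecond * 1000.0f + 0.5f)) % 1000;
        if( nMilliseconds != 0 )
            snprintf( pszBuf, nBufSize, "%06.3f", fSecond );          /* e.g. "34.567" */
        else
            snprintf( pszBuf, nBufSize, "%02d", (int)(fSecond + 0.5f) ); /* e.g. "34" */
    }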

Even

> 
> Best regards,
>      Dmitry
> 
> On 07.04.2015 00:14, Even Rouault wrote:
> > On Monday, April 6, 2015 at 23:11:21, Dmitriy Baryshnikov wrote:
> >> Hi Even,
> >> 
> >> It seems to me that this duplicates RFC 50: OGR field subtypes.
> >> For example we have the master field type DateTime and subtype Year.
> >> So the internal structure for the date/time representation may be adapted
> >> to such a technique.
> > 
> > The subtype is defined at field definition level. In all formats we
> > currently handle we only know the date/time precision when reading
> > values (and they might have different precision between records), so
> > after having created the layer and field definitions.
> > 
> >> Best regards,
> >> 
> >>       Dmitry
> >> 
> >> On 06.04.2015 15:02, Even Rouault wrote:
> >>> On Monday, April 6, 2015 at 13:48:47, Even Rouault wrote:
> >>>> On Monday, April 6, 2015 at 11:32:33, Dmitriy Baryshnikov wrote:
> >>>>> The first solution looks reasonable. But the precision values lack cases
> >>>>> where only the time is significant:
> >>>>> 
> >>>>> ODTP_HMSm
> >>>>> ODTP_HMS
> >>>>> ODTP_HM
> >>>>> ODTP_H
> >>>> 
> >>>> As I didn't want to multiply the values in the enumeration, my intent
> >>>> was to reuse the ODTP_YMDxxxx values for OFTTime only.
> >>> 
> >>> I meant "for OFTTime too"
> >>> 
> >>>> This is what I intended to convey with the precision in parentheses in
> >>>> the comment of ODTP_YMDH: "Year, month, day (if OFTDateTime) and hour"
> >>>> 
> >>>> Or perhaps the enumeration should capture the most precise part of
> >>>> the (date)time structure?
> >>>> ODTP_Year
> >>>> ODTP_Month
> >>>> ODTP_Day
> >>>> ODTP_Hour
> >>>> ODTP_Minute
> >>>> ODTP_Second
> >>>> ODTP_Millisecond
> >>>> 
> >>>>> etc.
> >>>>> 
> >>>>> Best regards,
> >>>>> 
> >>>>>        Dmitry
> >>>>> 
> >>>>> On 05.04.2015 22:25, Even Rouault wrote:
> >>>>>> Hi,
> >>>>>> 
> >>>>>> In an effort to revisit http://trac.osgeo.org/gdal/ticket/2680,
> >>>>>> which is about the lack of precision of the current datetime structure,
> >>>>>> I've imagined different solutions for how to modify the OGRField
> >>>>>> structure, and failed to pick one that would be the obvious
> >>>>>> choice, so opinions are welcome.
> >>>>>> 
> >>>>>> The issue is how to add (at least) microsecond accuracy to the
> >>>>>> datetime structure, as a few formats support it explicitly or
> >>>>>> implicitly: MapInfo, GPX, Atom (GeoRSS driver), GeoPackage,
> >>>>>> SQLite, PostgreSQL, CSV, GeoJSON, ODS, XLSX, KML (potentially GML
> >>>>>> too)...
> >>>>>> 
> >>>>>> Below a few potential solutions :
> >>>>>> 
> >>>>>> ---------------------------------------
> >>>>>> Solution 1) : Millisecond accuracy, second becomes a float
> >>>>>> 
> >>>>>> This is the solution I've prototyped.
> >>>>>> 
> >>>>>> typedef union {
> >>>>>> [...]
> >>>>>> 
> >>>>>>        struct {
> >>>>>>        
> >>>>>>            GInt16  Year;
> >>>>>>            GByte   Month;
> >>>>>>            GByte   Day;
> >>>>>>            GByte   Hour;
> >>>>>>            GByte   Minute;
> >>>>>>            GByte   TZFlag;
> >>>>>>            GByte   Precision; /* value in OGRDateTimePrecision */
> >>>>>>            float   Second; /* from 00.000 to 60.999 (millisecond
> >>>>>>            accuracy) */
> >>>>>>        
> >>>>>>        } Date;
> >>>>>> 
> >>>>>> } OGRField
> >>>>>> 
> >>>>>> So sub-second precision is represented with a single-precision
> >>>>>> floating point number, storing both the integral and decimal parts. (We
> >>>>>> could theoretically have hundredth-of-a-millisecond accuracy, 10^-5 s,
> >>>>>> since 6099999 fits in the 23 bits of the mantissa.)
> >>>>>> 
> >>>>>> Another addition is the Precision field that indicates which parts
> >>>>>> of the datetime structure are significant.
> >>>>>> 
> >>>>>> /** Enumeration that defines the precision of a DateTime.
> >>>>>> 
> >>>>>>      * @since GDAL 2.0
> >>>>>>      */
> >>>>>> 
> >>>>>> typedef enum
> >>>>>> {
> >>>>>> 
> >>>>>>        ODTP_Undefined,     /**< Undefined */
> >>>>>>        ODTP_Guess,         /**< Only valid when setting through
> >>>>>>                                  SetField(i, year, month, ...), where OGR will guess */
> >>>>>>        ODTP_Y,             /**< Year is significant */
> >>>>>>        ODTP_YM,            /**< Year and month are significant */
> >>>>>>        ODTP_YMD,           /**< Year, month and day are significant */
> >>>>>>        ODTP_YMDH,          /**< Year, month, day (if OFTDateTime) and hour
> >>>>>>                                  are significant */
> >>>>>>        ODTP_YMDHM,         /**< Year, month, day (if OFTDateTime), hour and
> >>>>>>                                  minute are significant */
> >>>>>>        ODTP_YMDHMS,        /**< Year, month, day (if OFTDateTime), hour,
> >>>>>>                                  minute and integral second are significant */
> >>>>>>        ODTP_YMDHMSm,       /**< Year, month, day (if OFTDateTime), hour,
> >>>>>>                                  minute and second with milliseconds are
> >>>>>>                                  significant */
> >>>>>> } OGRDateTimePrecision;
> >>>>>> 
> >>>>>> I think this is important since "2015/04/05 17:12:34" and
> >>>>>> "2015/04/05 17:12:34.000" do not really mean the same thing and it
> >>>>>> might be good to be able to preserve the original accuracy when
> >>>>>> converting between formats.
> >>>>>> 
> >>>>>> A drawback of this solution is that the size of the OGRField
> >>>>>> structure increases from 8 bytes to 12 on 32-bit builds (and remains
> >>>>>> 16 bytes on 64-bit builds). This is probably not that important since
> >>>>>> in most cases not that many OGRField structures are instantiated at
> >>>>>> one time (typically, you iterate over features one at a time).
> >>>>>> This could be more of a problem for use cases that involve the MEM
> >>>>>> driver, as it keeps all features in memory.
> >>>>>> 
> >>>>>> Another drawback is that the change of the structure might not be
> >>>>>> directly noticed by application developers, as the Second field name
> >>>>>> is preserved but a new Precision field is added, so there's a risk
> >>>>>> that Precision is left uninitialized if the field is set through
> >>>>>> OGRFeature::SetField(int iFieldIndex, OGRField* psRawField). That
> >>>>>> could lead to unexpected formatting (but hopefully not crashes, with
> >>>>>> defensive programming). However, I'd think it is unlikely that many
> >>>>>> applications manipulate OGRField directly instead of using
> >>>>>> the getters and setters of OGRFeature.
> >>>>>> 
> >>>>>> ---------------------------------------
> >>>>>> Solution 2) : Millisecond accuracy, second and milliseconds in
> >>>>>> distinct fields
> >>>>>> 
> >>>>>> typedef union {
> >>>>>> [...]
> >>>>>> 
> >>>>>>        struct {
> >>>>>>        
> >>>>>>            GInt16  Year;
> >>>>>>            GByte   Month;
> >>>>>>            GByte   Day;
> >>>>>>            GByte   Hour;
> >>>>>>            GByte   Minute;
> >>>>>>            GByte   TZFlag;
> >>>>>>            GByte   Precision; /* value in OGRDateTimePrecision */
> >>>>>>            GByte   Second;      /* from 0 to 60 */
> >>>>>>            GUInt16 Millisecond; /* from 0 to 999 */
> >>>>>>        } Date;
> >>>>>> 
> >>>>>> } OGRField
> >>>>>> 
> >>>>>> Same size of structure as in 1)
> >>>>>> 
> >>>>>> ---------------------------------------
> >>>>>> Solution 3) : Millisecond accuracy, pack all fields
> >>>>>> 
> >>>>>> Conceptually, this would use bit fields to avoid wasting unused bits
> >>>>>> :
> >>>>>> 
> >>>>>> typedef union {
> >>>>>> [...]
> >>>>>> 
> >>>>>>      struct {
> >>>>>>      
> >>>>>>        GInt16        Year;
> >>>>>>        GUIntBig     Month:4;
> >>>>>>        GUIntBig     Day:5;
> >>>>>>        GUIntBig     Hour:5;
> >>>>>>        GUIntBig     Minute:6;
> >>>>>>        GUIntBig     Second:6;
> >>>>>>        GUIntBig     Millisecond:10; /* 0-999 */
> >>>>>>        GUIntBig     TZFlag:8;
> >>>>>>        GUIntBig     Precision:4;
> >>>>>>     
> >>>>>>     } Date;
> >>>>>> 
> >>>>>> } OGRField;
> >>>>>> 
> >>>>>> This was proposed in the above-mentioned ticket. And as there were
> >>>>>> enough remaining bits, I've also added the Precision field (here and
> >>>>>> in all the other solutions).
> >>>>>> 
> >>>>>> The advantage is that sizeof(mydate) remains 8 bytes on 32 bits
> >>>>>> builds.
> >>>>>> 
> >>>>>> But the C standard only defines bitfields of int/unsigned int, so
> >>>>>> this is not portable; moreover, the way bits are packed is not
> >>>>>> defined by the standard, so different compilers could come up
> >>>>>> with different packings. A workaround is to do the bit manipulation
> >>>>>> through macros:
> >>>>>> 
> >>>>>> typedef union {
> >>>>>> [...]
> >>>>>> 
> >>>>>>      struct {
> >>>>>> 	
> >>>>>> 	GUIntBig	opaque;
> >>>>>> 	
> >>>>>>      } Date;
> >>>>>> 
> >>>>>> } OGRField;
> >>>>>> 
> >>>>>> #define GET_BITS(x,y_bits,shift)   (int)(((x).Date.opaque >> (shift)) & ((1 << (y_bits))-1))
> >>>>>> 
> >>>>>> #define GET_YEAR(x)              (short)GET_BITS(x,16,64-16)
> >>>>>> #define GET_MONTH(x)             GET_BITS(x,4,64-16-4)
> >>>>>> #define GET_DAY(x)               GET_BITS(x,5,64-16-4-5)
> >>>>>> #define GET_HOUR(x)              GET_BITS(x,5,64-16-4-5-5)
> >>>>>> #define GET_MINUTE(x)            GET_BITS(x,6,64-16-4-5-5-6)
> >>>>>> #define GET_SECOND(x)            GET_BITS(x,6,64-16-4-5-5-6-6)
> >>>>>> #define GET_MILLISECOND(x)       GET_BITS(x,10,64-16-4-5-5-6-6-10)
> >>>>>> #define GET_TZFLAG(x)            GET_BITS(x,8,64-16-4-5-5-6-6-10-8)
> >>>>>> #define GET_PRECISION(x)         GET_BITS(x,4,64-16-4-5-5-6-6-10-8-4)
> >>>>>> 
> >>>>>> #define SET_BITS(x,y,y_bits,shift)  (x).Date.opaque = ((x).Date.opaque & (~( (GUIntBig)((1 << (y_bits))-1) << (shift) )) | ((GUIntBig)(y) << (shift)))
> >>>>>> 
> >>>>>> #define SET_YEAR(x,val)          SET_BITS(x,val,16,64-16)
> >>>>>> #define SET_MONTH(x,val)         SET_BITS(x,val,4,64-16-4)
> >>>>>> #define SET_DAY(x,val)           SET_BITS(x,val,5,64-16-4-5)
> >>>>>> #define SET_HOUR(x,val)          SET_BITS(x,val,5,64-16-4-5-5)
> >>>>>> #define SET_MINUTE(x,val)        SET_BITS(x,val,6,64-16-4-5-5-6)
> >>>>>> #define SET_SECOND(x,val)        SET_BITS(x,val,6,64-16-4-5-5-6-6)
> >>>>>> #define SET_MILLISECOND(x,val)   SET_BITS(x,val,10,64-16-4-5-5-6-6-10)
> >>>>>> #define SET_TZFLAG(x,val)        SET_BITS(x,val,8,64-16-4-5-5-6-6-10-8)
> >>>>>> #define SET_PRECISION(x,val)     SET_BITS(x,val,4,64-16-4-5-5-6-6-10-8-4)
> >>>>>> 
> >>>>>> Main advantage: the size of OGRField remains unchanged (so 8 bytes
> >>>>>> on 32-bit builds).
> >>>>>> 
> >>>>>> Drawback: manipulation of the datetime members is less natural, but
> >>>>>> there are not that many places in the GDAL code base where the
> >>>>>> OGRField.Date members are used, so it is not much of a problem.
> >>>>>> 
> >>>>>> ---------------------------------------
> >>>>>> Solution 4) : Microsecond accuracy with one field
> >>>>>> 
> >>>>>> Solution 1) used a float for the second and sub-second part, but a
> >>>>>> float has only 23 bits of mantissa, which is enough to represent
> >>>>>> seconds with millisecond accuracy, but not with microsecond accuracy
> >>>>>> (you need 26 bits for that). So use a 32-bit integer instead of a
> >>>>>> 32-bit floating point value.
> >>>>>> 
> >>>>>> typedef union {
> >>>>>> [...]
> >>>>>> 
> >>>>>>        struct {
> >>>>>>        
> >>>>>>            GInt16  Year;
> >>>>>>            GByte   Month;
> >>>>>>            GByte   Day;
> >>>>>>            GByte   Hour;
> >>>>>>            GByte   Minute;
> >>>>>>            GByte   TZFlag;
> >>>>>>            GByte   Precision; /* value in OGRDateTimePrecision */
> >>>>>>            GUInt32 Microseconds; /* 00000000 to 59999999 */
> >>>>>>        
> >>>>>>        } Date;
> >>>>>> 
> >>>>>> } OGRField
> >>>>>> 
> >>>>>> Same as solution 1: sizeof(OGRField) becomes 12 bytes on 32-bit
> >>>>>> builds (and remains 16 bytes on 64-bit builds)
> >>>>>> 
> >>>>>> We would need to add an extra value in OGRDateTimePrecision to mean
> >>>>>> the microsecond accuracy.
> >>>>>> 
> >>>>>> It's not really clear we need microsecond accuracy... Most formats
> >>>>>> that support sub-second accuracy use an ISO 8601 representation (e.g.
> >>>>>> YYYY-MM-DDTHH:MM:SS.sssssZ) that doesn't define a maximum number of
> >>>>>> decimals beyond the second. From
> >>>>>> http://www.postgresql.org/docs/9.1/static/datatype-datetime.html,
> >>>>>> PostgreSQL supports microsecond accuracy.
> >>>>>> 
> >>>>>> ---------------------------------------
> >>>>>> Solution 5) : Microsecond with 3 fields
> >>>>>> 
> >>>>>> A variant where we split second into 3 integer parts:
> >>>>>> 
> >>>>>> typedef union {
> >>>>>> [...]
> >>>>>> 
> >>>>>>        struct {
> >>>>>>        
> >>>>>>            GInt16  Year;
> >>>>>>            GByte   Month;
> >>>>>>            GByte   Day;
> >>>>>>            GByte   Hour;
> >>>>>>            GByte   Minute;
> >>>>>>            GByte   TZFlag;
> >>>>>>            GByte   Precision; /* value in OGRDateTimePrecision */
> >>>>>>            GByte   Second;      /* 0 to 59 */
> >>>>>>            GUInt16 Millisecond; /* 0 to 999 */
> >>>>>>            GUInt16 Microsecond; /* 0 to 999 */
> >>>>>>        
> >>>>>>        } Date;
> >>>>>> 
> >>>>>> } OGRField
> >>>>>> 
> >>>>>> Drawback: due to alignment, sizeof(OGRField) becomes 16 bytes on
> >>>>>> 32-bit builds (and remains 16 bytes on 64-bit builds)
> >>>>>> 
> >>>>>> ---------------------------------------
> >>>>>> Solution 6) : Nanosecond accuracy and beyond !
> >>>>>> 
> >>>>>> Now that we are using 16 bytes, why not have nanosecond accuracy?
> >>>>>> 
> >>>>>> typedef union {
> >>>>>> [...]
> >>>>>> 
> >>>>>>        struct {
> >>>>>>        
> >>>>>>            GInt16  Year;
> >>>>>>            GByte   Month;
> >>>>>>            GByte   Day;
> >>>>>>            GByte   Hour;
> >>>>>>            GByte   Minute;
> >>>>>>            GByte   TZFlag;
> >>>>>>            GByte   Precision; /* value in OGRDateTimePrecision */
> >>>>>>            double  Second; /* 0.000000000 to 60.999999999 */
> >>>>>>        } Date;
> >>>>>> 
> >>>>>> } OGRField
> >>>>>> 
> >>>>>> Actually we even have picosecond accuracy! (since for picoseconds,
> >>>>>> we need 46 bits and a double has 52 bits of mantissa). And if we
> >>>>>> use a 64-bit integer instead of a double, we can have femtosecond
> >>>>>> accuracy ;-)
> >>>>>> 
> >>>>>> Any preference ?
> >>>>>> 
> >>>>>> Even
> >>>>> 

-- 
Spatialys - Geospatial professional services
http://www.spatialys.com

