[gdal-dev] Design for sub-second accuracy in OGR ?

Mon Apr 6 15:44:41 PDT 2015

Ok, this explanation looks reasonable.

Best regards,
     Dmitry

07.04.2015 01:15, Even Rouault wrote:
> On Monday 06 April 2015 23:32:40, Dmitriy Baryshnikov wrote:
>> Why not read all date/time data from records as accurately as possible?
> That's what I intended and prototyped. Drivers analyze the date/time value and
> set it with an evolution of the existing SetField() method for date/times,
> which takes an additional OGRDateTimePrecision ePrecision parameter. They can
> leave it at ODTP_Guess if they don't know the precision (typically when the
> value comes from a binary format) or set it explicitly when it comes from a
> text format.
>
>> For example for OFTDate we get the date by GetFieldAsDateTime
>> <http://www.gdal.org/classOGRFeature.html#a6c5d2444407b07e07b79863c42ee7a49>
>> and the time is zero.
>> It's strange to analyse the data structure while reading the records as we
>> already have the field definition.
> At the OGR model level yes, but there's no provision in the formats themselves
> to store the level of precision of the date/time.
>
>> For old datasets we can use type DateTime + SubType ODTP_YMDHMSm, and for
>> new datasets let the user choose the subtype. Certainly some formats
>> already support this new type + subtype (e.g. Postgres/PostGIS, etc.).
> I'm not sure how your suggestion would work (I'm not sure I've understood it
> correctly), at least on the reading side. When reading, you can only know if a
> field is a Date, Time or DateTime (and some formats might not even have that
> level of distinction) by examining the layer/table metadata. To know if a
> datetime has second or millisecond accuracy, you need to fetch records (which
> might be costly if you have many records with null values for that field),
> so this is an operation we generally don't want to do, so as to have a
> GetLayerDefn() that works fast.
>
> Although quite a few of the formats I mentioned (GPX, Atom (GeoRSS
> driver), CSV in AUTODETECT_TYPES=YES mode, GeoJSON, ODS, XLSX, LIBKML)
> work by a preliminary ingestion/analysis phase of the whole dataset, so we
> could probably figure out the maximum accuracy of all date/time records of a
> given field. But that would involve much more rework of those drivers than
> I've currently prototyped... And that wouldn't solve the problem for Postgres,
> MapInfo, SQLite and GeoPackage.
>
>> Postgres data type mapping:
>> date -> OFTDateTime + ODTP_YMD
>> time-> OFTDateTime + ODTP_HMS
>> timestamp -> OFTDateTime + ODTP_YMDHMSm
> The Postgres case is probably not the best one to illustrate that accuracy
> concept here since internally it stores timestamps as 8-byte integers, so
> "2015/04/05 17:12:34" and "2015/04/05 17:12:34.000" are stored the same.
> Consequently, on reading you have to trust the "Guess" mode (which uses this
> simple heuristic: if the milliseconds are not 0, then you have millisecond
> accuracy, otherwise second).
> All other formats (except MapInfo) store them as a string.
>
> Oh well, if it sounds too weird/confusing to have this precision information
> at the record level, maybe we can drop it and always output to the millisecond
> for text formats (although there's a potential risk that this would cause
> issues for parsers that don't expect a decimal second). Although I had
> imagined this precision information more as an implementation detail than
> something we would really want to advertise and that applications would have
> to care about (it is an optional parameter in the modified getters/setters of
> OGRFeature I prototyped):
>
>      int   GetFieldAsDateTime( int i,
>                                int *pnYear, int *pnMonth, int *pnDay,
>                                int *pnHour, int *pnMinute, float *pfSecond,
>                                int *pnTZFlag,
>                                OGRDateTimePrecision* pePrecision = NULL );
>
>      void  SetField( int i, int nYear, int nMonth, int nDay,
>                      int nHour=0, int nMinute=0, float fSecond=0.f,
>                      int nTZFlag = 0,
>                      OGRDateTimePrecision ePrecision = ODTP_Guess );
>
> Hum, I'm just thinking we could also just implement the Guess logic in output,
> that is output with milliseconds if the milliseconds are not 0, and output
> with integral seconds otherwise. Could probably be a good compromise. The use
> cases where we really want to write ".000" are not that obvious after all.
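A minimal sketch of that output-side Guess logic, in C. The helper name format_second_guess is hypothetical and not part of OGR; it only assumes the seconds value is a non-negative float as in the prototyped SetField() signature.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not an OGR function: writes the seconds part as
   "SS" when the millisecond part is zero and as "SS.mmm" otherwise.
   Returns the number of characters written (as snprintf does). */
static int format_second_guess(float fSecond, char *buf, size_t bufsize)
{
    int nMillis;

    assert(fSecond >= 0.0f && fSecond < 61.0f);

    /* Round to whole milliseconds to absorb float representation error. */
    nMillis = (int)((double)fSecond * 1000.0 + 0.5);

    if (nMillis % 1000 == 0)
        return snprintf(buf, bufsize, "%02d", nMillis / 1000);
    return snprintf(buf, bufsize, "%02d.%03d", nMillis / 1000, nMillis % 1000);
}
```

With this, 34.0f is written as "34" and 34.5f as "34.500", so a ".000" suffix is never emitted.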
>
> Even
>
>> Best regards,
>>       Dmitry
>>
>> 07.04.2015 00:14, Even Rouault wrote:
>>> On Monday 06 April 2015 23:11:21, Dmitriy Baryshnikov wrote:
>>>> Hi Even,
>>>>
>>>> It seems to me that this duplicates RFC 50: OGR field subtypes.
>>>> For example we have the master field type DateTime and Subtype - Year.
>>>> So the internal structure for date/time representation may be adapted to
>>>> such a technique.
>>> The subtype is defined at field definition level. In all formats we
>>> currently handle we only know the date/time precision when reading
>>> values (and they might have different precision between records), so
>>> after having created the layer and field definitions.
>>>
>>>> Best regards,
>>>>
>>>>        Dmitry
>>>>
>>>> 06.04.2015 15:02, Even Rouault wrote:
>>>>> On Monday 06 April 2015 13:48:47, Even Rouault wrote:
>>>>>> On Monday 06 April 2015 11:32:33, Dmitriy Baryshnikov wrote:
>>>>>>> The first solution looks reasonable. But the precision field lacks
>>>>>>> values for the case where only the time is significant:
>>>>>>>
>>>>>>> ODTP_HMSm
>>>>>>> ODTP_HMS
>>>>>>> ODTP_HM
>>>>>>> ODTP_H
>>>>>> As I didn't want to multiply the values in the enumeration, my intent
>>>>>> was to reuse the ODTP_YMDxxxx values for OFTTime only.
>>>>> I meant "for OFTTime too"
>>>>>
>>>>>> This was what I meant
>>>>>> by the note in parentheses in the comment of
>>>>>> ODTP_YMDH "Year, month, day (if OFTDateTime) and hour"
>>>>>>
>>>>>> Or perhaps, the enumeration should capture the most precise part of
>>>>>> the (date)time structure  ?
>>>>>> ODTP_Year
>>>>>> ODTP_Month
>>>>>> ODTP_Day
>>>>>> ODTP_Hour
>>>>>> ODTP_Minute
>>>>>> ODTP_Second
>>>>>> ODTP_Millisecond
>>>>>>
>>>>>>> etc.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>>         Dmitry
>>>>>>>
>>>>>>> 05.04.2015 22:25, Even Rouault wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> In an effort to revisit http://trac.osgeo.org/gdal/ticket/2680,
>>>>>>>> which is about the lack of precision of the current datetime
>>>>>>>> structure, I've imagined several ways to modify the OGRField
>>>>>>>> structure, and failed to pick one that would be the obvious
>>>>>>>> solution, so opinions are welcome.
>>>>>>>>
>>>>>>>> The issue is how to add (at least) millisecond accuracy to the
>>>>>>>> datetime structure, as a few formats support it explicitly or
>>>>>>>> implicitly: MapInfo, GPX, Atom (GeoRSS driver), GeoPackage,
>>>>>>>> SQLite, PostgreSQL, CSV, GeoJSON, ODS, XLSX, KML (potentially GML
>>>>>>>> too)...
>>>>>>>>
>>>>>>>> Below a few potential solutions :
>>>>>>>>
>>>>>>>> ---------------------------------------
>>>>>>>> Solution 1) : Millisecond accuracy, second becomes a float
>>>>>>>>
>>>>>>>> This is the solution I've prototyped.
>>>>>>>>
>>>>>>>> typedef union {
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>         struct {
>>>>>>>>         
>>>>>>>>             GInt16  Year;
>>>>>>>>             GByte   Month;
>>>>>>>>             GByte   Day;
>>>>>>>>             GByte   Hour;
>>>>>>>>             GByte   Minute;
>>>>>>>>             GByte   TZFlag;
>>>>>>>>             GByte   Precision; /* value in OGRDateTimePrecision */
>>>>>>>>             float   Second; /* from 00.000 to 60.999 (millisecond accuracy) */
>>>>>>>>         
>>>>>>>>         } Date;
>>>>>>>>
>>>>>>>> } OGRField
>>>>>>>>
>>>>>>>> So sub-second precision is represented with a single precision
>>>>>>>> floating point number, storing both integral and decimal parts. (We
>>>>>>>> could theoretically have hundredth-of-a-millisecond accuracy,
>>>>>>>> 10^-5 s, since 6099999 fits in the 23 bits of the mantissa.)
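The mantissa argument can be checked empirically. The sketch below is only an illustration (count_roundtrip_failures is a made-up name): it stores every value i/scale in a float and counts how many do not come back as i after rounding.

```c
#include <assert.h>

/* Illustrative helper, not OGR code: for every integer i in [0, max_units],
   store i/scale seconds in a float, convert back to integer units with
   rounding, and count the values that do not survive the round trip. */
static long count_roundtrip_failures(long scale, long max_units)
{
    long i, failures = 0;

    assert(scale > 0 && max_units >= 0);

    for (i = 0; i <= max_units; i++)
    {
        float fSecond = (float)((double)i / (double)scale);
        long back = (long)((double)fSecond * (double)scale + 0.5);
        if (back != i)
            failures++;
    }
    return failures;
}
```

count_roundtrip_failures(1000, 60999) reports no failure for milliseconds, whereas count_roundtrip_failures(1000000, 60999999) reports failures for microseconds, because neighbouring floats near 60 are about 3.8e-6 apart.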
>>>>>>>>
>>>>>>>> Another addition is the Precision field that indicates which parts
>>>>>>>> of the datetime structure are significant.
>>>>>>>>
>>>>>>>> /** Enumeration that defines the precision of a DateTime.
>>>>>>>>   * @since GDAL 2.0
>>>>>>>>   */
>>>>>>>> typedef enum
>>>>>>>> {
>>>>>>>>     ODTP_Undefined,  /**< Undefined */
>>>>>>>>     ODTP_Guess,      /**< Only valid when setting through
>>>>>>>>                           SetField(i,year,month...) where OGR will
>>>>>>>>                           guess */
>>>>>>>>     ODTP_Y,          /**< Year is significant */
>>>>>>>>     ODTP_YM,         /**< Year and month are significant */
>>>>>>>>     ODTP_YMD,        /**< Year, month and day are significant */
>>>>>>>>     ODTP_YMDH,       /**< Year, month, day (if OFTDateTime) and hour
>>>>>>>>                           are significant */
>>>>>>>>     ODTP_YMDHM,      /**< Year, month, day (if OFTDateTime), hour and
>>>>>>>>                           minute are significant */
>>>>>>>>     ODTP_YMDHMS,     /**< Year, month, day (if OFTDateTime), hour,
>>>>>>>>                           minute and integral second are significant */
>>>>>>>>     ODTP_YMDHMSm,    /**< Year, month, day (if OFTDateTime), hour,
>>>>>>>>                           minute and second with milliseconds are
>>>>>>>>                           significant */
>>>>>>>> } OGRDateTimePrecision;
>>>>>>>>
>>>>>>>> I think this is important since "2015/04/05 17:12:34" and
>>>>>>>> "2015/04/05 17:12:34.000" do not really mean the same thing and it
>>>>>>>> might be good to be able to preserve the original accuracy when
>>>>>>>> converting between formats.
>>>>>>>>
>>>>>>>> A drawback of this solution is that the size of the OGRField
>>>>>>>> structure increases from 8 bytes to 12 on 32-bit builds (and remains
>>>>>>>> 16 bytes on 64-bit builds). This is probably not that important since
>>>>>>>> in most cases not that many OGRField structures are instantiated at
>>>>>>>> one time (typically, you iterate over features one at a time).
>>>>>>>> This could be more of a problem for use cases that involve the MEM
>>>>>>>> driver, as it keeps all features in memory.
>>>>>>>>
>>>>>>>> Another drawback is that the change of the structure might not be
>>>>>>>> directly noticed by application developers as the Second field name
>>>>>>>> is preserved, but a new Precision field is added, so there's a risk
>>>>>>>> that Precision is left uninitialized if the field is set through
>>>>>>>> OGRFeature::SetField(int iFieldIndex, OGRField* psRawField). That
>>>>>>>> could lead to unexpected formatting (but hopefully not crashes with
>>>>>>>> defensive programming). However I'd think it is unlikely that many
>>>>>>>> applications manipulate OGRField directly, instead of using
>>>>>>>> the getters and setters of OGRFeature.
>>>>>>>>
>>>>>>>> ---------------------------------------
>>>>>>>> Solution 2) : Millisecond accuracy, second and milliseconds in
>>>>>>>> distinct fields
>>>>>>>>
>>>>>>>> typedef union {
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>         struct {
>>>>>>>>         
>>>>>>>>             GInt16  Year;
>>>>>>>>             GByte   Month;
>>>>>>>>             GByte   Day;
>>>>>>>>             GByte   Hour;
>>>>>>>>             GByte   Minute;
>>>>>>>>             GByte   TZFlag;
>>>>>>>>             GByte   Precision; /* value in OGRDateTimePrecision */
>>>>>>>>             GByte   Second; /* from 0 to 60 */
>>>>>>>>             GUInt16 Millisecond; /* from 0 to 999 */
>>>>>>>>         
>>>>>>>>         } Date;
>>>>>>>>
>>>>>>>> } OGRField
>>>>>>>>
>>>>>>>> Same size of structure as in 1)
>>>>>>>>
>>>>>>>> ---------------------------------------
>>>>>>>> Solution 3) : Millisecond accuracy, pack all fields
>>>>>>>>
>>>>>>>> Conceptually, this would use bit fields to avoid wasting unused bits:
>>>>>>>>
>>>>>>>> typedef union {
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>       struct {
>>>>>>>>       
>>>>>>>>         GInt16        Year;
>>>>>>>>         GUIntBig     Month:4;
>>>>>>>>         GUIntBig     Day:5;
>>>>>>>>         GUIntBig     Hour:5;
>>>>>>>>         GUIntBig     Minute:6;
>>>>>>>>         GUIntBig     Second:6;
>>>>>>>>         GUIntBig     Millisecond:10; /* 0-999 */
>>>>>>>>         GUIntBig     TZFlag:8;
>>>>>>>>         GUIntBig     Precision:4;
>>>>>>>>      
>>>>>>>>      } Date;
>>>>>>>>
>>>>>>>> } OGRField;
>>>>>>>>
>>>>>>>> This was proposed in the above mentioned ticket. And as there were
>>>>>>>> enough remaining bits, I've also added the Precision field (as I
>>>>>>>> did in all the other solutions).
>>>>>>>>
>>>>>>>> The advantage is that sizeof(mydate) remains 8 bytes on 32-bit
>>>>>>>> builds.
>>>>>>>>
>>>>>>>> But the C standard only defines bitfields of int/unsigned int, so
>>>>>>>> this is not portable, plus the fact that the way bits are packed is
>>>>>>>> not defined by the standard, so different compilers could come up
>>>>>>>> with different packing. A workaround is to do the bit manipulation
>>>>>>>> through macros :
>>>>>>>>
>>>>>>>> typedef union {
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>       struct {
>>>>>>>> 	
>>>>>>>> 	GUIntBig	opaque;
>>>>>>>> 	
>>>>>>>>       } Date;
>>>>>>>>
>>>>>>>> } OGRField;
>>>>>>>>
>>>>>>>> #define GET_BITS(x,y_bits,shift) \
>>>>>>>>     (int)(((x).Date.opaque >> (shift)) & ((1 << (y_bits))-1))
>>>>>>>>
>>>>>>>> #define GET_YEAR(x)              (short)GET_BITS(x,16,64-16)
>>>>>>>> #define GET_MONTH(x)             GET_BITS(x,4,64-16-4)
>>>>>>>> #define GET_DAY(x)               GET_BITS(x,5,64-16-4-5)
>>>>>>>> #define GET_HOUR(x)              GET_BITS(x,5,64-16-4-5-5)
>>>>>>>> #define GET_MINUTE(x)            GET_BITS(x,6,64-16-4-5-5-6)
>>>>>>>> #define GET_SECOND(x)            GET_BITS(x,6,64-16-4-5-5-6-6)
>>>>>>>> #define GET_MILLISECOND(x)       GET_BITS(x,10,64-16-4-5-5-6-6-10)
>>>>>>>> #define GET_TZFLAG(x)            GET_BITS(x,8,64-16-4-5-5-6-6-10-8)
>>>>>>>> #define GET_PRECISION(x)         GET_BITS(x,4,64-16-4-5-5-6-6-10-8-4)
>>>>>>>>
>>>>>>>> #define SET_BITS(x,y,y_bits,shift) \
>>>>>>>>     (x).Date.opaque = ((x).Date.opaque & \
>>>>>>>>         (~( (GUIntBig)((1 << (y_bits))-1) << (shift) )) | \
>>>>>>>>         ((GUIntBig)(y) << (shift)))
>>>>>>>>
>>>>>>>> #define SET_YEAR(x,val)            SET_BITS(x,val,16,64-16)
>>>>>>>> #define SET_MONTH(x,val)           SET_BITS(x,val,4,64-16-4)
>>>>>>>> #define SET_DAY(x,val)             SET_BITS(x,val,5,64-16-4-5)
>>>>>>>> #define SET_HOUR(x,val)            SET_BITS(x,val,5,64-16-4-5-5)
>>>>>>>> #define SET_MINUTE(x,val)          SET_BITS(x,val,6,64-16-4-5-5-6)
>>>>>>>> #define SET_SECOND(x,val)          SET_BITS(x,val,6,64-16-4-5-5-6-6)
>>>>>>>> #define SET_MILLISECOND(x,val)     SET_BITS(x,val,10,64-16-4-5-5-6-6-10)
>>>>>>>> #define SET_TZFLAG(x,val)          SET_BITS(x,val,8,64-16-4-5-5-6-6-10-8)
>>>>>>>> #define SET_PRECISION(x,val)       SET_BITS(x,val,4,64-16-4-5-5-6-6-10-8-4)
>>>>>>>>
>>>>>>>> Main advantage: the size of OGRField remains unchanged (so 8 bytes
>>>>>>>> on 32-bit builds).
>>>>>>>>
>>>>>>>> Drawback: manipulation of datetime members is less natural, but
>>>>>>>> there are not that many places in the GDAL code base where the
>>>>>>>> OGRField.Date members are used, so it is not that much of a problem.
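For illustration, the same packing can be sketched with plain functions on a 64-bit integer instead of macros. This is not the proposed OGR code: get_bits, set_bits and the SHIFT_* names are made up, and the layout simply follows the shifts used in the macros above. Negative years would need an extra cast to a signed 16-bit type, which is left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Bit offsets, most significant field first:
   Year:16 Month:4 Day:5 Hour:5 Minute:6 Second:6 Millisecond:10
   TZFlag:8 Precision:4, for a total of 64 bits. */
enum {
    SHIFT_YEAR = 48, SHIFT_MONTH = 44, SHIFT_DAY = 39, SHIFT_HOUR = 34,
    SHIFT_MINUTE = 28, SHIFT_SECOND = 22, SHIFT_MILLISECOND = 12,
    SHIFT_TZFLAG = 4, SHIFT_PRECISION = 0
};

/* Extract the nbits-wide field that starts at bit 'shift'. */
static unsigned get_bits(uint64_t packed, int nbits, int shift)
{
    assert(nbits > 0 && nbits <= 16 && shift >= 0 && shift + nbits <= 64);
    return (unsigned)((packed >> shift) & ((UINT64_C(1) << nbits) - 1));
}

/* Return 'packed' with the field at 'shift' replaced by 'value'.
   The value is masked so that it cannot spill into neighbouring fields. */
static uint64_t set_bits(uint64_t packed, unsigned value, int nbits, int shift)
{
    uint64_t mask;

    assert(nbits > 0 && nbits <= 16 && shift >= 0 && shift + nbits <= 64);
    mask = (UINT64_C(1) << nbits) - 1;
    return (packed & ~(mask << shift)) | (((uint64_t)value & mask) << shift);
}
```

Doing the mask arithmetic in 64 bits from the start and masking the written value is a deliberate difference from the macros, chosen so that an out-of-range value cannot overwrite another field.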
>>>>>>>>
>>>>>>>> ---------------------------------------
>>>>>>>> Solution 4) : Microsecond accuracy with one field
>>>>>>>>
>>>>>>>> Solution 1) used a float for second and sub-second, but a float has
>>>>>>>> only 23 bits of mantissa, which is enough to represent second with
>>>>>>>> millisecond accuracy, but not for microsecond (you need 26 bits for
>>>>>>>> that). So use a 32-bit integer instead of a 32-bit floating point.
>>>>>>>>
>>>>>>>> typedef union {
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>         struct {
>>>>>>>>         
>>>>>>>>             GInt16  Year;
>>>>>>>>             GByte   Month;
>>>>>>>>             GByte   Day;
>>>>>>>>             GByte   Hour;
>>>>>>>>             GByte   Minute;
>>>>>>>>             GByte   TZFlag;
>>>>>>>>             GByte   Precision; /* value in OGRDateTimePrecision */
>>>>>>>>             GUInt32 Microseconds; /* 00000000 to 59999999 */
>>>>>>>>         
>>>>>>>>         } Date;
>>>>>>>>
>>>>>>>> } OGRField
>>>>>>>>
>>>>>>>> Same as solution 1: sizeof(OGRField) becomes 12 bytes on 32-bit
>>>>>>>> builds (and remains 16 bytes on 64-bit builds)
>>>>>>>>
>>>>>>>> We would need to add an extra value in OGRDateTimePrecision to mean
>>>>>>>> the microsecond accuracy.
>>>>>>>>
>>>>>>>> It is not really clear that we need microsecond accuracy... Most
>>>>>>>> formats that support subsecond accuracy use the ISO 8601
>>>>>>>> representation (e.g. YYYY-MM-DDTHH:MM:SS.sssssZ), which doesn't
>>>>>>>> define the maximum number of decimals beyond the second. According to
>>>>>>>> http://www.postgresql.org/docs/9.1/static/datatype-datetime.html,
>>>>>>>> PostgreSQL supports microsecond accuracy.
>>>>>>>>
>>>>>>>> ---------------------------------------
>>>>>>>> Solution 5) : Microsecond with 3 fields
>>>>>>>>
>>>>>>>> A variant where we split second into 3 integer parts:
>>>>>>>>
>>>>>>>> typedef union {
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>         struct {
>>>>>>>>         
>>>>>>>>             GInt16  Year;
>>>>>>>>             GByte   Month;
>>>>>>>>             GByte   Day;
>>>>>>>>             GByte   Hour;
>>>>>>>>             GByte   Minute;
>>>>>>>>             GByte   TZFlag;
>>>>>>>>             GByte   Precision; /* value in OGRDateTimePrecision */
>>>>>>>>             GByte   Second; /* 0 to 59 */
>>>>>>>>             GUInt16 Millisecond; /* 0 to 999 */
>>>>>>>>             GUInt16 Microsecond; /* 0 to 999 */
>>>>>>>>         
>>>>>>>>         } Date;
>>>>>>>>
>>>>>>>> } OGRField
>>>>>>>>
>>>>>>>> Drawback: due to alignment, sizeof(OGRField) becomes 16 bytes on
>>>>>>>> 32-bit builds (and remains 16 bytes on 64-bit builds)
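The size figures quoted for solutions 1 and 5 can be reproduced with stand-in types. This is only a sketch: the fixed-width stdint types replace GInt16/GByte/GUInt16, the union keeps just one list-like member (a count plus a pointer) next to the Date struct rather than the full OGRField, and the exact numbers assume a common ABI where float is 4-byte aligned and uint16_t is 2-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the Date struct of Solution 1. */
typedef struct {
    int16_t Year;
    uint8_t Month, Day, Hour, Minute, TZFlag, Precision;
    float   Second;
} DateSol1;

/* Stand-in for the Date struct of Solution 5. */
typedef struct {
    int16_t  Year;
    uint8_t  Month, Day, Hour, Minute, TZFlag, Precision;
    uint8_t  Second;
    uint16_t Millisecond;
    uint16_t Microsecond;
} DateSol5;

/* Simplified stand-ins for the OGRField union. The pointer member is what
   forces 4-byte alignment on 32-bit builds and 8-byte alignment on 64-bit
   builds. */
typedef union { struct { int nCount; int *paList; } List; DateSol1 Date; } FieldSol1;
typedef union { struct { int nCount; int *paList; } List; DateSol5 Date; } FieldSol5;

/* Print the sizes for the platform the code is compiled on. */
static void print_field_sizes(void)
{
    /* A union is at least as large as its largest member. */
    assert(sizeof(FieldSol1) >= sizeof(DateSol1));
    assert(sizeof(FieldSol5) >= sizeof(DateSol5));
    printf("Solution 1: Date=%zu union=%zu\n", sizeof(DateSol1), sizeof(FieldSol1));
    printf("Solution 5: Date=%zu union=%zu\n", sizeof(DateSol5), sizeof(FieldSol5));
}
```

Under those assumptions the Date struct takes 12 bytes in solution 1 and 14 bytes in solution 5; padding the union to the pointer alignment then gives 12 and 16 bytes on 32-bit builds, and 16 bytes for both on 64-bit builds.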
>>>>>>>>
>>>>>>>> ---------------------------------------
>>>>>>>> Solution 6) : Nanosecond accuracy and beyond !
>>>>>>>>
>>>>>>>> Now that we are using 16 bytes, why not have nanosecond accuracy ?
>>>>>>>>
>>>>>>>> typedef union {
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>         struct {
>>>>>>>>         
>>>>>>>>             GInt16  Year;
>>>>>>>>             GByte   Month;
>>>>>>>>             GByte   Day;
>>>>>>>>             GByte   Hour;
>>>>>>>>             GByte   Minute;
>>>>>>>>             GByte   TZFlag;
>>>>>>>>             GByte   Precision; /* value in OGRDateTimePrecision */
>>>>>>>>             double  Second; /* 0.000000000 to 60.999999999 */
>>>>>>>>         } Date;
>>>>>>>>
>>>>>>>> } OGRField
>>>>>>>>
>>>>>>>> Actually we even have picosecond accuracy! (since for picoseconds,
>>>>>>>> we need 46 bits and a double has 52 bits of mantissa). And if we
>>>>>>>> use a 64-bit integer instead of a double, we can have femtosecond
>>>>>>>> accuracy ;-)
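The bit counts quoted in this thread (23, 26, 46) can be recomputed with a few lines of C. The function names are illustrative only, and the largest value is taken as 60.999... seconds expressed in the unit of interest.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to store max_value as an unsigned integer. */
static int bits_needed(uint64_t max_value)
{
    int nbits = 0;
    while (max_value != 0)
    {
        nbits++;
        max_value >>= 1;
    }
    return nbits;
}

/* Non-zero if every integer up to max_value fits in mantissa_bits bits
   (23 explicitly stored bits for a float, 52 for a double). */
static int fits_in_mantissa(uint64_t max_value, int mantissa_bits)
{
    assert(mantissa_bits > 0);
    return bits_needed(max_value) <= mantissa_bits;
}
```

For example, bits_needed(60999999) is 26 (microseconds, too wide for a float), bits_needed(UINT64_C(60999999999999)) is 46 (picoseconds, within a double), and femtoseconds need 56 bits, which is more than a double's mantissa but still fits in a 64-bit integer.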
>>>>>>>>
>>>>>>>> Any preference ?
>>>>>>>>
>>>>>>>> Even
>>>>>>> _______________________________________________
>>>>>>> gdal-dev mailing list
>>>>>>> gdal-dev at lists.osgeo.org
>>>>>>> http://lists.osgeo.org/mailman/listinfo/gdal-dev