[PROJ] RFC4: last chance to comment before motion

Greg Troxel gdt at lexort.com
Mon Jan 6 06:08:31 PST 2020


Even Rouault <even.rouault at spatialys.com> writes:

> Regarding how files are managed on the CDN, the idea is that a given grid 
> identified by a filename is only updated if it contains errors (in its data
> or metadata). Which is different from releasing an improved version of a 
> model, for example the successive generations of the USA, Australia or New 
> Zealand geoid models that have each their own filename.

I am really uncomfortable with a single name mapping to variable data.
This seems very much like replacing the tarball foo-1.2.tar.gz with
different contents under the same name.

What is driving the need not to change the name in some way?  It would
seem that some sort of micro-version could be implemented for changes.

It would also seem that the CDN download service should be essentially
providing a different form of access to datumgrid files that one could
download, and that the download service should have the same version
number scheme as the regular files.  Maybe I don't get something, but it
seems best to have a single authoritative data source, and then two
methods to get it.


> The local cache of grid chunks stores the value of a few HTTP headers (file 
> size, Last-Modified, ETag) in a table, as well as the timestamp of the last 
> time when it has checked them. When the current timestamp is > TTL value + 
> last_checked_timestamp (where TTL value defaults to one day), the cache then 
> queries again one chunk from the CDN to check if the value of those HTTP 
> headers has changed or not. If they have, then it discards all cached chunks 
> of that file, so they are retrieved again from the CDN.

That seems sensible for variable data.  From a repeatability point of
view, the notion of variable data for a name seems uncomfortable, as I
don't see any way to record what was done and to be able to do it again.
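The revalidation logic described in the quoted paragraph could be sketched roughly as follows.  This is purely illustrative: the struct and function names are mine, not PROJ's actual internals, and the TTL default of one day comes from the quoted description.

```c
#include <stdbool.h>
#include <time.h>

/* Illustrative stand-in for the row PROJ keeps per cached file:
 * a few HTTP header values plus the time they were last checked. */
typedef struct {
    const char *etag;          /* cached ETag header value */
    const char *last_modified; /* cached Last-Modified header value */
    long long   file_size;     /* cached file size */
    time_t      last_checked;  /* when the headers were last verified */
} cached_grid_props;

/* True when the TTL has expired, i.e. the CDN should be queried again
 * to compare the stored HTTP headers against the current ones. */
bool needs_revalidation(const cached_grid_props *p, time_t now, time_t ttl)
{
    return now > p->last_checked + ttl;
}
```

With the default one-day TTL (86400 seconds), a file checked at time 1000 would be revalidated on any access after time 87400; if the headers then differ, all cached chunks of that file are discarded.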

> There's no public API that exposes that logic for people who would want to
> do whole file download.

From a packaging point of view, it seems that one should be able to
package whatever grid files there are, if one wants, and then also offer
users a choice of faulting them in dynamically, if they want.  I guess I
am really at a loss to understand why the multiple mechanisms should
cause a departure from a single versioned namespace.

Once there is a versioned namespace (where
proj-datumgrid-northamerica-1.3.tar.gz becomes
proj-datumgrid-northamerica-1.3.1.tar.gz when revised), then the
question arises of how users change their notion of what they want to
fetch.  I would want there to be a versioned directory that lists all of
the names, so that one updates to a version of that directory, and thus
all of the data is fixed.  We more or less have that versioning now with
the included grids, and the versioned extra grid files.  This seems
like a property that should not be given up lightly.
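Concretely, such a versioned directory might be a manifest along these lines.  This is purely a sketch of the idea: neither the schema nor the second filename is anything PROJ has adopted, and the hashes are elided.

```
{
  "manifest_version": "1.3.1",
  "files": [
    { "name": "proj-datumgrid-northamerica-1.3.1.tar.gz", "sha256": "..." },
    { "name": "proj-datumgrid-oceania-1.2.tar.gz",        "sha256": "..." }
  ]
}
```

Pinning one manifest version then pins every grid file, which is exactly the repeatability property argued for above.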

There's also the issue of formats changing.  Right now grids are
released more or less with proj, and there is a notion that proj works
with those grids.  But proj released this week might not work with grids
released in three years.  That's fine; compatibility across that span is
not a reasonable expectation.

> Would the following function be useful ?
>
> /** Download a file in the PROJ user-writable directory.
>  *
>  * The file will only be downloaded if it does not exist yet in the
>  * user-writable directory, or if it is determined that a more recent
>  * version exists. To determine if a more recent version exists, PROJ will
>  * use the "downloaded_files" table of its grid cache database.
>  * Consequently files manually placed in the user-writable
>  * directory without using this function would be considered as
> * non-existing/obsolete and would be unconditionally downloaded again.
>  *
>  * This function can only be used if networking is enabled, and either
> * the default curl network API or a custom one has been installed.
>  *
>  * @param ignore_ttl_setting If set to FALSE, PROJ will only check the
>  *                           recentness of an already downloaded file, if
>  *                           the delay between the last time it has been
>  *                           verified and the current time exceeds the TTL
>  *                           setting. This can save network accesses.
> *                           If set to TRUE, PROJ will unconditionally
>  *                           check from the server the recentness of the file.
>  * @return TRUE if the download was successful (or not needed)
>  */
>
> int proj_download_file(
>   PJ_CONTEXT* ctx,
>   const char* url_or_filename,
>   int ignore_ttl_setting,
>   int (*progress_cbk)(PJ_CONTEXT*, double pct, void* user_data),
>   void* user_data);
>
> That said, I can anticipate issues on Windows in the situation where a PROJ 
> pipeline would have a grid opened from the PROJ user-writable directory,
> and someone would call proj_download_file() on that file, and it would be
> determined that we have to update it. We would not be able to replace
> the file, because it would be already opened. So we would probably need some 
> logic to use a different local filename for the most recent version, store in 
> the database the most recent local filename, and the deprecated one(s), and do 
> cleanup when we can actually delete files... (reading a bit on the subject,
> it appears the FILE_SHARE_DELETE flag of the Win32 API OpenFile() wouldn't 
> even solve the issue, as it allows deleting an opened file, but not creating 
> a file with the same name while the old version is still opened, contrary to 
> POSIX unlink())

It might be, but I'm more concerned about the versioning issues.  If a
given name becomes immutable, following software distribution practice,
then there is no need for any of the TTL machinery.
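For what it's worth, a caller of the proposed function would presumably supply a progress callback like the one below.  This is a sketch against the proposed signature only: PJ_CONTEXT is forward-declared here just so the fragment stands alone (the real type comes from proj.h), and the assumptions that pct is a fraction in [0,1] and that a zero return cancels the download are mine, not stated in the proposal.

```c
#include <stdio.h>

/* Forward declaration standing in for the real type from proj.h. */
typedef struct PJ_CONTEXT PJ_CONTEXT;

/* Progress callback matching the proposed signature.  Assumes pct is
 * a fraction in [0,1]; returns non-zero to continue the download. */
int report_progress(PJ_CONTEXT *ctx, double pct, void *user_data)
{
    (void)ctx;
    const char *label = (const char *)user_data;
    printf("%s: %3.0f%%\n", label, pct * 100.0);
    return 1; /* keep downloading */
}
```

The callback would then be passed as the fourth argument of the proposed proj_download_file(), with the label string as user_data.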

Also, the directory of available files could include a hash for each
file, allowing validation of downloaded/stored copies.

