[GRASS-dev] Re: what if: a new GRASS directory layout?
Ivan Shmakov
ivan at theory.asu.ru
Wed Apr 9 13:53:09 EDT 2008
>>>>> Glynn Clements <glynn at gclements.plus.com> writes:
>>> In-process references could be maintained by making a copy (or hard
>>> link) to the inventory, so that the GC treats it as "live". You
>>> would need some kind of clean-up mechanism to handle any copies
>>> which are left behind if a module crashes.
>> However, having GC to process all the inventories won't be efficient
>> (unless these are stored in a database's table with appropriate
>> indices.) So, I had in mind keeping a references file along with
>> each object file.
> Ah; if you're talking about back-references, one thing to bear in
> mind is permissions: you can use maps from mapsets for which you only
> have read permission, and not write permission.
Agreed.
> [This issue has already arisen with respect to reclass maps and the
> reclassed_to file. That was the first GRASS bug I ever fixed.]
> That also means that garbage collection would need to scan the entire
> location, not just individual mapsets. Actually, re-projection can
> span locations, so you would potentially need to scan the entire
> database.
OTOH, I can hardly recall any software that handles access to a
repository which is read-only for some of its instances but
allows deletions by others. All the software that handles this
well either requires a dedicated server to manage the whole
``database'', or relies on replication.
It seems that the reasonable behaviour would be to make a
back-reference if possible, and issue a warning if not.
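A minimal Python sketch of that "record if possible, warn if not" behaviour follows. The per-object `.refs` file name and its one-referrer-per-line format are assumptions made for illustration; nothing like this exists in GRASS as far as this message is concerned.

```python
import warnings


def add_back_reference(object_path, referrer):
    """Try to record `referrer` in a hypothetical `<object>.refs` file.

    Returns True if the back-reference was written, False if the
    file could not be written (e.g. the mapset is read-only for us).
    """
    refs_path = object_path + ".refs"  # assumed naming convention
    try:
        with open(refs_path, "a", encoding="utf-8") as refs:
            refs.write(referrer + "\n")
        return True
    except OSError as exc:  # covers PermissionError, read-only filesystems
        warnings.warn(f"cannot record back-reference in {refs_path}: {exc}")
        return False
```

The caller carries on either way; a False result only means that a GC run by the owner of the mapset will not know about this reference.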
... Or, since I've already mentioned replication, there are a
couple more solutions possible for mapsets intended to be
accessed read-only by many:
* make a ``hard link'' for each of the objects in a separate
mapset, writable by the reading party;
* never remove an object.
The first solution actually mimics the ``clone'' feature of
modern DVCS (say, $ git clone of a local repository produces a
copy in which most of the files are shared by means of
``hard links''.) Obviously, mirroring the mapset effectively
solves all the problems with permissions, etc., while the design
of the objects/ directory and the use of hard links ensure
efficient storage. Two points to pay special attention to are:
* all the inventories may be copied or hard-linked at the time of
mirroring, effectively turning a read-only mapset into its
space-efficient copy, but then there should be a way to keep
this copy in sync with the source mapset;
* no such mirroring is currently possible, precisely because
some files may be updated in place; thus, I believe the ``in
place'' issue has to be resolved irrespective of whether the
proposed scheme is accepted as a whole or not.
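To make the mirroring idea concrete, here is a rough Python sketch that hard-links every file of one objects/ tree into another. The function name and the fallback to a plain copy (e.g. when the two trees are on different filesystems) are assumptions of this example, not part of the proposal.

```python
import os
import shutil


def mirror_objects(src_objects, dst_objects):
    """Mirror src_objects into dst_objects, sharing storage via hard links.

    Returns a (linked, copied) pair of counters.
    """
    linked = copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_objects):
        rel = os.path.relpath(dirpath, src_objects)
        target_dir = os.path.normpath(os.path.join(dst_objects, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if os.path.exists(dst):
                continue  # already mirrored on an earlier run
            try:
                os.link(src, dst)  # same inode, no extra disk space
                linked += 1
            except OSError:
                shutil.copy2(src, dst)  # assumed fallback: ordinary copy
                copied += 1
    return linked, copied
```

Running it again later picks up objects added to the source since the last run, which is one crude way to keep the copy in sync. It is only safe if objects are never modified in place, since a hard-linked file changed on one side changes on the other as well.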
The second solution doesn't rely on hard links and thus may be
appropriate for the systems lacking support for them. It may be
noted that the disk space occupied by the unreferenced objects
could be reclaimed if it could be ensured that no party is
active at the time of GC. E.g., GC may be scheduled to run
as part of the OS start-up sequence.
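Such an offline GC pass could look roughly like the Python sketch below. It assumes, purely for illustration, a flat objects/ directory and inventories that are plain text files listing one object name per line; neither format is fixed anywhere in this thread.

```python
import os


def collect_garbage(objects_dir, inventory_paths):
    """Remove objects not named in any inventory; return the removed names.

    Must only be run while no other party is using the mapset.
    """
    live = set()
    for inventory in inventory_paths:
        with open(inventory, encoding="utf-8") as f:
            # assumed format: one object name per line
            live.update(line.strip() for line in f if line.strip())

    removed = []
    for name in sorted(os.listdir(objects_dir)):
        path = os.path.join(objects_dir, name)
        if os.path.isfile(path) and name not in live:
            os.remove(path)
            removed.append(name)
    return removed
```

Note that the list of inventories has to cover every mapset (and possibly every location) that may refer to these objects, as pointed out above.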
Furthermore, this solution may be appropriate for various other
means of sharing files in a read-only manner, e.g., via HTTP.
>>> [BTW, it has been pointed out that this can reduce the maximum
>>> number of maps per mapset, as the limit on an inode's hard link
>>> count limits the maximum number of subdirectories, while there is
>>> usually no fixed limit on the number of files. E.g. on Linux'
>>> ext2fs, the maximum hard link count is 65535, so you can't have
>>> more than 65533 subdirectories.]
>> While the inventory scheme is free from hitting this limit.
> OTOH, if you don't use subdirectories, you will have many more files
> in a single directory. This can be a major performance issue on some
> filesystems.
This isn't really a problem, at least for the objects/
directory: it just has to be ensured that the distribution of
the names is sufficiently even, and then an option may be added
so that the names are split, like:
split-at:               split-at: 4              split-at: 2, 4
objects/SD6Isoi2orPOu   objects/SD6I/soi2orPOu   objects/SD/6I/soi2orPOu
objects/IyfgXZdP3JYuu   objects/Iyfg/XZdP3JYuu   objects/Iy/fg/XZdP3JYuu
objects/xlRKohTgQKmJj   objects/xlRK/ohTgQKmJj   objects/xl/RK/ohTgQKmJj
objects/oBgUH2otF7Urb   objects/oBgU/H2otF7Urb   objects/oB/gU/H2otF7Urb
objects/CeX9zEZkdR9g5   objects/CeX9/zEZkdR9g5   objects/Ce/X9/zEZkdR9g5
objects/gnUNviMqfnTOx   objects/gnUN/viMqfnTOx   objects/gn/UN/viMqfnTOx
The source of the evenly distributed names may be a good RNG,
or a kind of checksum (e.g., SHA1) over the file contents.
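As a sketch of how the split-at option and a checksum-based name might fit together (the function names are made up for this example, and split_at is taken to be a sequence of character offsets, matching the table above):

```python
import hashlib


def split_name(name, split_at=()):
    """Build an objects/ path, cutting `name` at the given offsets."""
    parts = []
    start = 0
    for pos in split_at:
        parts.append(name[start:pos])
        start = pos
    parts.append(name[start:])
    return "/".join(["objects"] + parts)


def content_name(data):
    """Derive an evenly distributed name from the file contents (SHA1)."""
    return hashlib.sha1(data).hexdigest()


print(split_name("SD6Isoi2orPOu"))          # objects/SD6Isoi2orPOu
print(split_name("SD6Isoi2orPOu", (4,)))    # objects/SD6I/soi2orPOu
print(split_name("SD6Isoi2orPOu", (2, 4)))  # objects/SD/6I/soi2orPOu
```

With content-derived names, split_name(content_name(data), (2, 4)) gives a two-level layout similar to what Git uses for its loose objects, and identical contents map to the same object file.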