[GRASS-dev] Re: what if: a new GRASS directory layout?

Ivan Shmakov ivan at theory.asu.ru
Wed Apr 9 13:53:09 EDT 2008


>>>>> Glynn Clements <glynn at gclements.plus.com> writes:

 >>> In-process references could be maintained by making a copy (or hard
 >>> link) to the inventory, so that the GC treats it as "live". You
 >>> would need some kind of clean-up mechanism to handle any copies
 >>> which are left behind if a module crashes.

 >> However, having GC to process all the inventories won't be efficient
 >> (unless these are stored in a database's table with appropriate
 >> indices.)  So, I had in mind keeping a references file along with
 >> each object file.

 > Ah; if you're talking about back-references, one thing to bear in
 > mind is permissions: you can use maps from mapsets for which you only
 > have read permission, and not write permission.

	Agreed.

 > [This issue has already arisen with respect to reclass maps and the
 > reclassed_to file. That was the first GRASS bug I ever fixed.]

 > That also means that garbage collection would need to scan the entire
 > location, not just individual mapsets. Actually, re-projection can
 > span locations, so you would potentially need to scan the entire
 > database.

	OTOH, I can hardly recall a piece of software that handles
	access to a repository which is read-only to some of its
	instances, but allows deletions by some others.  All the
	software that handles it well either requires a dedicated server
	to manage the whole ``database'', or relies on replication.

	It seems that the reasonable behaviour would be to make a
	back-reference if possible, and issue a warning if not.
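
	A minimal Python sketch of that ``try, else warn'' behaviour.
	The function name and the ``.refs'' file kept next to each
	object are assumptions made for illustration only; no such
	layout has been agreed upon.

```python
import warnings

def add_back_reference(object_path, referrer):
    """Append `referrer` to the references file kept next to an object.

    Returns True if the back-reference was recorded, and False (after
    issuing a warning) if the file could not be written, e.g. because
    the mapset is read-only for the caller.
    """
    refs_path = object_path + ".refs"   # hypothetical naming convention
    try:
        with open(refs_path, "a") as refs:
            refs.write(referrer + "\n")
        return True
    except OSError as exc:
        warnings.warn(f"cannot record back-reference in {refs_path}: {exc}")
        return False
```

	A caller can go on using the map when False is returned; the
	object just isn't protected from removal by its owner.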

	... Or, since I've already mentioned replication, there are a
	couple more solutions possible for the mapsets intended to be
	accessed read-only by many:

	* make a ``hard link'' for each of the objects in a separate
	  mapset, writable by the reading party;

	* never remove an object.

	The first solution actually mimics the ``clone'' feature of
	modern DVCS (say, $ git clone produces a copy of the specified
	Git repository, where most of the files are shared by means of
	``hard links''.)  Obviously, mirroring the mapset effectively
	solves all the problems with permissions, etc., while the design
	of the objects/ directory and the use of hard links ensure
	efficient storage.  Two points to pay special attention to are:

	* all the inventories may be copied or hardlinked at the time of
	  mirroring, effectively turning a read-only mapset into its
	  space-efficient copy, but then there should be a way to keep
	  this copy in sync with the source mapset;

	* no such mirroring is currently possible, precisely because
	  some files may be updated in place; thus, I believe the ``in
	  place'' issue has to be resolved irrespective of whether the
	  proposed scheme is accepted as a whole or not.
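
	The mirroring step could look roughly like the Python sketch
	below.  It assumes that both mapsets reside on the same
	filesystem, that the OS permits the reader to link to the
	source files, and that objects are never updated in place.
	The function name and the objects/ location within a mapset
	are illustrative.

```python
import os

def mirror_objects(src_mapset, dst_mapset):
    """Hard-link every file under src_mapset/objects into dst_mapset/objects.

    Returns the number of links created.  Files already present in the
    destination are left alone, so the function may be re-run to pick
    up objects added to the source since the last mirroring.
    """
    src_objects = os.path.join(src_mapset, "objects")
    linked = 0
    for root, _dirs, files in os.walk(src_objects):
        rel = os.path.relpath(root, src_objects)
        target_dir = os.path.normpath(os.path.join(dst_mapset, "objects", rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            target = os.path.join(target_dir, name)
            if not os.path.exists(target):
                os.link(os.path.join(root, name), target)
                linked += 1
    return linked
```

	Re-running it only adds links for new objects; it doesn't
	address the inventories, which still have to be kept in sync
	by some other means.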

	The second solution doesn't rely on hard links and thus may be
	appropriate for the systems lacking support for them.  It may be
	noted that the disk space occupied by the unreferenced objects
	could be reclaimed if it could be ensured that no party is
	active at the time of GC.  E. g., GC may be scheduled to be run
	as part of the OS start-up sequence.
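
	Such an off-line GC pass may be sketched in Python as below.
	It assumes a flat objects/ directory and inventories that are
	plain text files with one object name per line; both are
	assumptions of the sketch, not a settled format.  It must only
	be run while no other party accesses the mapsets.

```python
import os

def sweep(objects_dir, inventory_paths):
    """Remove the objects not named in any of the given inventories.

    Returns the sorted list of the names removed.
    """
    live = set()
    for path in inventory_paths:
        with open(path) as inventory:
            live.update(line.strip() for line in inventory if line.strip())
    removed = []
    for name in sorted(os.listdir(objects_dir)):
        if name not in live:
            os.remove(os.path.join(objects_dir, name))
            removed.append(name)
    return removed
```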

	Furthermore, this solution may be appropriate for various other
	means of sharing files in a read-only manner.  E. g., via HTTP.

 >>> [BTW, it has been pointed out that this can reduce the maximum
 >>> number of maps per mapset, as the limit on an inode's hard link
 >>> count limits the maximum number of subdirectories, while there is
 >>> usually no fixed limit on the number of files. E.g. on Linux'
 >>> ext2fs, the maximum hard link count is 65535, so you can't have
 >>> more than 65533 subdirectories.]

 >> While the inventory scheme is free from hitting this limit.

 > OTOH, if you don't use subdirectories, you will have many more files
 > in a single directory. This can be a major performance issue on some
 > filesystems.

	This isn't really a problem, at least for the objects/ -- it
	just has to be ensured that the distribution of the names is
	sufficiently even, and then an option may be added so that the
	names are split, like:

split-at:		split-at: 4		split-at: 2, 4
objects/SD6Isoi2orPOu	objects/SD6I/soi2orPOu	objects/SD/6I/soi2orPOu
objects/IyfgXZdP3JYuu	objects/Iyfg/XZdP3JYuu	objects/Iy/fg/XZdP3JYuu
objects/xlRKohTgQKmJj	objects/xlRK/ohTgQKmJj	objects/xl/RK/ohTgQKmJj
objects/oBgUH2otF7Urb	objects/oBgU/H2otF7Urb	objects/oB/gU/H2otF7Urb
objects/CeX9zEZkdR9g5	objects/CeX9/zEZkdR9g5	objects/Ce/X9/zEZkdR9g5
objects/gnUNviMqfnTOx	objects/gnUN/viMqfnTOx	objects/gn/UN/viMqfnTOx

	The source for the evenly-distributed names may be a good RNG,
	or a kind of checksum (e. g., SHA1) over the file contents.
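
	To illustrate the naming and the split-at option, a short
	Python sketch.  The function names are made up, and it uses
	the hexadecimal SHA1 digest as the name, whereas the names in
	the table above are in some other, more compact encoding.

```python
import hashlib

def split_name(name, split_at=()):
    """Insert a '/' into `name` at each of the offsets in `split_at`."""
    parts = []
    prev = 0
    for pos in sorted(split_at):
        parts.append(name[prev:pos])
        prev = pos
    parts.append(name[prev:])
    return "/".join(parts)

def object_path(data, split_at=()):
    """Return the objects/ path for a file with the given contents (bytes)."""
    name = hashlib.sha1(data).hexdigest()
    return "objects/" + split_name(name, split_at)
```

	With these, split_name("SD6Isoi2orPOu", (2, 4)) gives
	"SD/6I/soi2orPOu", as in the last column of the table.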


