[GRASS-dev] Re: GRASS startup screen patches && Gforge

Mon Jan 22 16:32:47 EST 2007

Maris Nartiss wrote:

> > > Uhumm... Currently I use mapsets that also include dots ".", as in
> > > Maa-123.450, and this has worked so far, and I'd wish for it to remain
> > > the same.
> >
> > Dots shouldn't pose any problems.
> 
> You are both right. I forgot about dots. The only exception: they should
> be banned from the first position (".foo"), as a file or directory whose
> name starts with "." is hidden on *nix systems.

Well, they're hidden by "ls", and the * wildcard in bash etc won't
match them (nor will Tcl's "glob" command unless you use
"-types hidden"), but the rest of the OS doesn't care.

Having said that, a few places in GRASS might use popen("ls ..."), and
it's quite possible that third-party scripts might use $LOCATION/*. 

> > We shouldn't impose restrictions because some operating systems might
> > not allow the filename. In particular, Windows has some strange
> > restrictions as to what isn't permitted, e.g. you can't create a file
> > or directory whose name (minus any extensions) matches the name of a
> > device (e.g. "con" or "aux").
> 
> This is debatable.
> 1) Some parts of GRASS are written optimistically, i.e. with no checks on
> the return codes of mkdir calls etc. I added some checks to the startup
> screen, but it may not be the only place with this bad coding practice.
> Those are bugs and need to be fixed.
> 2) Maybe we could add some OS-specific name restrictions in the GUI? User
> errors must be caught as early as possible to provide a good user
> experience. *

OS restrictions should be handled by checking whether system calls
(mkdir() etc) actually fail, rather than trying to predict cases which
will fail and assuming that everything else will succeed.

We *need* to handle system call failures, and once we've done that, we
no longer need to try to predict them (which is an impossible task;
e.g. the list of Windows "device" names can be extended by drivers, so
we can never have an exhaustive list).
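As a rough sketch of what "checking whether the call fails" means (the
helper name try_make_dir is made up for illustration and isn't GRASS
code; it assumes the POSIX two-argument mkdir(), so a Windows build
would need its own variant):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper, not part of GRASS: try to create a directory and
 * report the OS's own reason if that fails.
 * Returns 0 on success, -1 on failure (errno keeps mkdir()'s value). */
int try_make_dir(const char *path)
{
    if (mkdir(path, 0777) != 0) {
        int saved = errno;  /* fprintf() may modify errno */

        fprintf(stderr, "Cannot create directory <%s>: %s\n",
                path, strerror(saved));
        errno = saved;
        return -1;
    }

    return 0;
}
```

The caller doesn't need to know *why* a name is unacceptable to the OS;
it only needs to notice the failure and pass the OS's explanation on to
the user.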

> It was already noted that purely numeric map names may be problematic,
> as in some cases they may be treated as numbers rather than map
> names. In mapcalc it's possible to enclose such map names in "". Is
> there any other place where numbers may be misinterpreted?

It's practically impossible to provide a definite "no" answer to this
type of question, as no-one has memorised the entire GRASS source
code.

The best that we can do is to decide that any such cases are bugs
which need to be fixed.

> > > > 3) Currently GRASS can NOT handle any letters other than Latin in
> > > > multi-byte locales. Single-byte locales must die (except those with
> > > > plain Latin letters only), as they are nothing but a source of
> > > > problems (I have files with Latvian/Russian/German names in one
> > > > folder - UTF is a must). Until GRASS starts to support multi-byte
> > > > locales, it's a sane restriction. Until that happens, I can only say:
> > > > "Хюстон, у нас проблема" (I hope I spelled it right ;)
> > >
> > > I agree UTF support is a must! I think that UTF support should be
> > > something we want to have ASAP. I don't know how much work it would be
> > > to support UTF-8, but I don't believe that it would be very much.
> 
> It may not be as easy as it sounds - I read some (a bit outdated?) doc
> [1] and if I understood this part [2] correctly, we may need to check
> ~26'000 variables and their usage to make GRASS fully UTF-8 aware. At a
> minimum, it requires checking ~1000 lines containing strlen().

No. strlen() returns the number of *bytes* in a string, not the number
of characters. On Unix, filenames are strings of bytes, not characters,
so code which uses strlen() on filenames is already doing the right
thing.
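To illustrate the difference, here is a small sketch (utf8_chars is a
hypothetical helper, not something in GRASS, and it assumes its input
is valid UTF-8). For the four-byte string "\xC4\x81" "bc" (the Latvian
letter a-with-macron followed by "bc"), strlen() gives 4 while
utf8_chars() gives 3:

```c
#include <string.h>

/* Hypothetical helper: count UTF-8 code points by skipping continuation
 * bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_chars(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;

    return n;
}
```

Anything which allocates a buffer or copies a filename needs the byte
count, which is what strlen() provides. A character count would only
matter for something like column alignment in displayed output.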

On Windows, every OS function which deals with strings (e.g. 
filenames) has two distinct versions: the A version (which uses bytes
interpreted according to the system codepage) and the W version (which
uses 16-bit characters in Unicode). Using the W versions is out of the
question, and the A versions require that you use the system codepage,
which is almost guaranteed *not* to be UTF-8 (there are codepages for
UTF-7 and UTF-8, but they aren't the default for any locale, and I've
never encountered a live system which used them).

GRASS should be encoding-neutral. It shouldn't care whether a filename
is in UTF-8 or any other encoding. In particular, insisting upon the
use of UTF-8 will be a significant impediment to adoption in locales
which don't use the Latin alphabet, as other encodings are
sufficiently well entrenched that not supporting the encoding amounts
to not supporting the locale at all. E.g. saying "supports Japanese
via UTF-8" is almost the same as saying "doesn't support Japanese". It
also rules out the use of anything except ASCII (which is a subset of
both UTF-8 and all common codepages) on Windows.

> > For the core functionality, it should just be an issue of extending
> > G_legal_filename() to allow all bytes in the range 128-255 in map
> > names.
> 
> This is beyond my understanding. Sorry, no help from me. I can only
> test your code, as I'm on a UTF-8 locale.

GRASS uses the following function to determine whether map, mapset etc
names are "legal":

int G_legal_filename (char *s)
{
    if (*s == '.' || *s == 0) {
        fprintf(stderr, _("Illegal filename.  Cannot be '.' or 'NULL'\n"));
        return -1;
    }

    for (; *s; s++)
        if (*s == '/' || *s == '"' || *s == '\'' || *s <= ' ' ||
            *s == '@' || *s == ',' || *s == '=' || *s == '*' || *s > 0176) {
            fprintf(stderr, _("Illegal filename. Character <%c> not allowed.\n"), *s);
            return -1;
        }

    return 1;
}

IOW, we prohibit all control characters, spaces, certain ASCII
"punctuation" characters which are significant to GRASS, and all
non-ASCII (8-bit) characters. Eliminating the last restriction would
allow UTF-8 to be used in map (etc) names, as UTF-8 encodes all
non-ASCII characters using only bytes in the range 128-255.
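A minimal sketch of what the relaxed check might look like follows. The
name legal_name_8bit is made up for illustration and this is not a
tested patch; the error messages are omitted for brevity. It reads the
bytes as unsigned char because, where plain char is signed, bytes in
the range 128-255 are negative and would still be rejected by the
"<= ' '" test. The only byte which "> 0176" excluded below 128 is DEL
(0177), so that is tested explicitly:

```c
/* Hypothetical variant of the check above, not GRASS code: same rules,
 * except that bytes 128-255 are accepted.
 * Returns 1 if the name is legal, -1 if it is not. */
int legal_name_8bit(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;

    /* empty names and names starting with '.' remain illegal */
    if (*p == '.' || *p == 0)
        return -1;

    for (; *p; p++)
        if (*p == '/' || *p == '"' || *p == '\'' || *p <= ' ' ||
            *p == '@' || *p == ',' || *p == '=' || *p == '*' || *p == 0177)
            return -1;

    return 1;
}
```

With this, a name such as "Maa-123.450" is still accepted and ".foo" or
"a@b" is still rejected, but a name containing UTF-8 (or any other
8-bit encoding's) non-ASCII bytes passes. Note that the check doesn't
know or care which encoding those bytes are in.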

> > Specific subsystems may have problems, e.g. curses cannot handle
> > multi-byte encodings.
> 
> Ncurses since version 5.3 can handle UTF-8.
> "The normal ncurses libraries support 8-bit characters. The ncurses
> library can also be configured (--enable-widec) to support
> wide-characters (for instance Unicode and the UTF-8 encoding). The
> corresponding wide-character ncursesw libraries are source-compatible
> with the normal applications. That is, applications must be compiled
> and linked against the ncursesw library." [3]

And everything which uses it must use wchar_t rather than bytes. As
this would conflict with other curses implementations, it would have
to be made conditional, which would be messy. I'd rather just
encourage the stuff which uses curses (i.e. vask) to be replaced
sooner rather than later.

-- 
Glynn Clements <glynn at gclements.plus.com>