[GRASS-dev] r.in.xyz: really big ints

Glynn Clements glynn at gclements.plus.com
Fri Aug 29 15:53:55 EDT 2008


Hamish wrote:

> a user has just successfully imported a 92GB LIDAR data file with r.in.xyz
> (2.4 billion data points; 4.5hrs). This has exposed a cosmetic bug, the
> number of points processed is reported to the user as -1871174186.
> 
> The raster output is fine AFAIK, but the broken status message ain't a
> good look.
> 
> wc -l reports 2,424,200,605 points which is bigger than (IIUC) the 32bit
> limit for c90 int of 2,147,483,648. I do not know if the hardware/OS/build
> was 32 or 64 bit. The "line" variable is defined simply as "int".
> 
> Apparently `wc` on their system can count higher than 2^31, can we?

"unsigned int" goes up to 2^32-1 (assuming a 32-bit int).

> Is it as simple as replacing printf %d with %u ?? (seems to work)

Yes.
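
A minimal sketch of why %u recovers the count, assuming a 32-bit
two's-complement int: reading the bit pattern of the reported
-1871174186 as unsigned gives 2423793110. The helper name as_unsigned
is made up for illustration and is not part of r.in.xyz.

```c
#include <stdint.h>

/* Hypothetical helper: view a 32-bit signed value as unsigned.
 * Conversion to an unsigned type is defined modulo 2^32, so a
 * negative input n maps to n + 4294967296. */
uint32_t as_unsigned(int32_t n)
{
    return (uint32_t)n;
}
```

as_unsigned(-1871174186) yields 2423793110, which is what printf()
with %u would show. Incrementing a signed int past INT_MAX is formally
undefined in C, so declaring the counter as unsigned is the safer fix.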

> In that case should the variable be defined as "unsigned int" for
> correctness? (%u seems to work correctly with plain signed int in a
> little test program I wrote)


The nature of two's-complement representation means that whether a
variable is declared as "int" or "unsigned int" doesn't actually have
that much effect upon most arithmetic operations. It mainly affects
division, comparisons[1] and right shifts[2].

[1] If you compare a signed value to an unsigned value, the signed
value will be converted to unsigned. Comparing an unsigned value
against zero with <0 or >=0 will always be false or true respectively.

[2] Right-shifting a signed value will (on most compilers) shift in
copies of the topmost bit, while right-shifting an unsigned value will
shift in zeroes.
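
Footnotes [1] and [2] can be sketched as below. The function names are
hypothetical, and the comments assume a 32-bit int. Right-shifting a
negative signed value is implementation-defined in C, so the signed
helper only shows what most compilers do.

```c
/* The signed operand is converted to unsigned before comparing,
 * so a negative s becomes a very large value. */
int signed_less_than_unsigned(int s, unsigned int u)
{
    return s < u;
}

/* Right shift of an unsigned value always shifts in zeroes. */
unsigned int shift_right_unsigned(unsigned int v, int bits)
{
    return v >> bits;
}

/* Right shift of a negative signed value is implementation-defined;
 * most compilers shift in copies of the sign bit. */
int shift_right_signed(int v, int bits)
{
    return v >> bits;
}
```

signed_less_than_unsigned(-1, 1u) returns 0, because -1 is converted
to UINT_MAX before the comparison. shift_right_unsigned(0x80000000u, 31)
returns 1, whereas gcc gives -4 for shift_right_signed(-8, 1).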


> Then we wait for the first 160GB dataset...
> 
> I could rewrite it to store the number of lines as a double and printf
> %.0f, but hope for a cleaner solution.

You can use "long", which will typically be 64 bits on a 64-bit system
(but not Windows, where "long" is 32 bits even on the 64-bit versions,
to maintain binary compatibility).

Every version of gcc in widespread use supports "long long", as does
C99. This will be 64 bits even on 32-bit systems. OTOH, we might need
to support platforms which don't support this.
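
As a rough sketch of the "long long" route, assuming a compiler that
supports it (C99 or gcc): the count is held in an unsigned long long
and printed with %llu. format_count is a made-up name, not an existing
GRASS function.

```c
#include <stdio.h>

/* Hypothetical helper: format a line count held in a 64-bit type.
 * Assumes the compiler supports "long long" and snprintf(). */
int format_count(char *buf, size_t size, unsigned long long count)
{
    return snprintf(buf, size, "%llu", count);
}
```

With a 32-byte buffer, format_count(buf, sizeof(buf), 2424200605ULL)
writes "2424200605" even on a 32-bit build.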

If LFS is enabled, off_t may be 64 bits (and if it isn't, you're
likely to have trouble importing a 92GiB file; Linux simply won't let
you open a file >2GiB unless you use the LFS features). This saves you
the trouble of having to perform explicit checks for "long long", and
conditionalising the variable definition. You still need to
conditionalise the printf() format, though, e.g.:

	const char *fmt = sizeof(off_t) > sizeof(long) ? "%lld" : "%ld";
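
One way to wrap that test so the argument type always matches the
format is sketched below. format_offset is a hypothetical name, and
the "%lld" branch assumes "long long" is available whenever off_t is
wider than long.

```c
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: print an off_t count with a matching format.
 * The sizeof test is a compile-time constant, so only one branch
 * is ever taken on a given platform. */
int format_offset(char *buf, size_t size, off_t n)
{
    if (sizeof(off_t) > sizeof(long))
        return snprintf(buf, size, "%lld", (long long)n);
    return snprintf(buf, size, "%ld", (long)n);
}
```

The explicit casts keep the argument in step with the format string,
which a bare off_t passed through a run-time "fmt" does not guarantee.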

Using double is certainly the easiest solution. That can represent
integers up to 2^53 exactly, after which .... well it doesn't really
matter; 2^53 is just short of 10^16.
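
A small sketch of the double approach and of where it stops being
exact, assuming IEEE 754 doubles. Both function names are hypothetical;
"unsigned long long" is used only to hold the test values.

```c
#include <stdio.h>

/* Hypothetical helper: print a count that is kept in a double. */
int format_double_count(char *buf, size_t size, double count)
{
    return snprintf(buf, size, "%.0f", count);
}

/* Returns 1 if n survives a round trip through double unchanged. */
int exactly_representable(unsigned long long n)
{
    return (unsigned long long)(double)n == n;
}
```

exactly_representable() returns 1 for 2424200605 and for 2^53
(9007199254740992), but 0 for 2^53+1, which rounds to a neighbouring
value.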

> side idea:
> Would it be possible to add a flag to g.version to report some build
> info? Like: 32/64 bits, endianness, build date, svn checkout date (if
> applicable), `uname -a` of build machine, LFS, nls, and in general 
> ./configure feature report stuff, ...

If you can figure out a command to print the information, you can call
it from the Makefile with $(shell ...) and add a -D flag, e.g.:

	UNAME := $(shell uname -a)

	CFLAGS += -DUNAME='"$(UNAME)"'
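
On the C side, the macro could be consumed roughly as follows. This
assumes the build passes UNAME as a quoted string literal;
build_uname is a made-up accessor, not an existing GRASS function.

```c
/* Assumes the build passes UNAME as a quoted string literal,
 * e.g. -DUNAME='"Linux ..."'; falls back if it is not defined. */
#ifndef UNAME
#define UNAME "unknown"
#endif

/* Hypothetical accessor for the build-machine description. */
const char *build_uname(void)
{
    return UNAME;
}
```

The fallback keeps the module compiling when the -D flag is absent.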

OTOH, if you want a lot of information, it would be better to make the
configure script store it in config.h.

Or you could just add:

	system("\"$GISBASE/etc/grocat\" < \"$GISBASE/include/Make/Platform.make\"");
	system("\"$GISBASE/etc/grocat\" < \"$GISBASE/include/grass/config.h\"");

-- 
Glynn Clements <glynn at gclements.plus.com>

