[Gdal-dev] Problems translating BSB files

Thu Jan 19 16:47:52 EST 2006

Folks -

I now have GDAL correctly (I think) reading the NOAA-distributed BSB files.  I implemented the code as described below, using the scanline table for finding the start of each line of data and ran into two other issues:

1. I needed to handle the correct byte ordering of the four-byte scanline pointers.

2. There is something funny in the data but it doesn't appear to be the funny stuff described in the (original) GDAL code.  In each of the NOAA files the scanline data has a pattern of extra bytes at the end of each scan line.  In one sample (with an IFM/palette size of 4) the first lines are:

01 90 E2 00 00 02 00
02 90 E2 00 00 03 00
03 90 E2 00 00 04 00

The original GDAL code starts parsing at line 1 (byte 01 at the beginning), reads the byte sequence 90 E2 00 (which means 12,545 pixels of color 2), then 00 as the "end of row" null byte.  It then reads the byte 02, which looks like it's supposed to be the start of line 2.  But the next byte is 00, which is wrong, and there is some code to implement a "special hack to skip over extra zeros in some files".  The 00 is skipped over and the next 02 (at the beginning of the second line of hex bytes above) is read and treated as the REAL start of line 2.
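
The arithmetic behind that reading of 90 E2 00 can be sketched in C.  This is only an illustration, not the GDAL code: bsb_decode_run is a made-up helper name, and the bit layout (high bit as a continuation flag, the next IFM bits as the color, the remaining bits as the run length minus one) is an assumption, chosen because it reproduces the 12,545 pixels of color 2 quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one run from a BSB scanline (hypothetical helper, not GDAL's).
 * depth is the IFM value, i.e. bits per pixel (1..7).
 * Returns the number of bytes consumed, or 0 if the buffer ends mid-run.
 * The caller must check for the 0x00 end-of-row byte before calling. */
static size_t bsb_decode_run(const unsigned char *buf, size_t len, int depth,
                             int *color, unsigned long *count)
{
    size_t i = 0;
    unsigned char b;

    assert(depth >= 1 && depth <= 7);
    if (len == 0)
        return 0;

    b = buf[i++];
    /* Bit 7 is the continuation flag; the next `depth` bits are the color. */
    *color = (b & 0x7F) >> (7 - depth);
    /* The remaining low bits are the top bits of the run length. */
    *count = b & ((1u << (7 - depth)) - 1u);

    /* Each continuation byte contributes seven more bits. */
    while (b & 0x80) {
        if (i >= len)
            return 0;
        b = buf[i++];
        *count = (*count << 7) | (unsigned long)(b & 0x7F);
    }

    *count += 1; /* assumed: the stored value is the run length minus one */
    return i;
}
```

Called with the three bytes 90 E2 00 and a depth of 4, this consumes 3 bytes and yields color 2 with a run of 12,545 pixels.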

However, if you use the presumably more reliable scanline index table at the end of the file, you'll see that the lines break as I have broken them in the text above.  The "02 00" at the end of the first line of text is not the start of line 2 with an extra 00 following it, but unexplained extra data at the end of line 1.  Line 1 as parsed above correctly produces 12,545 pixels as required, so it's not missing image data.  The extra data appears to consistently follow the same pattern: the byte that starts the next line, followed by 00.  In fact (I just looked a little further) when the line numbering goes above 127 and takes more than one byte, the "extra" data continues to follow the pattern.  It appears as if each line is ending with the line number of the next line (and the data is repeated because the next line begins with its own line number).  However, this does not hold up throughout the file, as when I inspect the scan lines at the very end they do NOT show any extra bytes and behave just as expected.  The extra data seems to be innocuous as long as you (a) stop parsing when you're expected to stop and when you have the pixels you need and (b) use the scanline index to find the start of each line.

I need to do a little more testing and code cleanup before sending some draft bsb_read.c changes along.

	- Ed

Ed McNierney
President and Chief Mapmaker
TopoZone.com / Maps a la carte, Inc.
73 Princeton Street, Suite 305
North Chelmsford, MA  01863
ed at topozone.com
(978) 251-4242 

-----Original Message-----
From: gdal-dev-bounces at lists.maptools.org [mailto:gdal-dev-bounces at lists.maptools.org] On Behalf Of Ed McNierney
Sent: Thursday, January 19, 2006 3:06 PM
To: Frank Warmerdam; Eric Dönges
Cc: gdal-dev at lists.maptools.org
Subject: RE: [Gdal-dev] Problems translating BSB files

Frank & Eric -

Well I'm old enough to edit a binary file or two, too.....

Thanks very much for the help.  I located a very helpful bit of code on SourceForge called libbsb for reading and writing BSB files, written by Stuart Cunningham <stuart_hc at users.sourceforge.net>, and I read patent 5,727,090 as referenced in the header comments in the GDAL file bsb_read.c.  With a little inspection of my sample data, I now know a lot more about BSB files than I did yesterday!  Here follow my observations.

1. I also noticed the code in bsb_read.c which is fooled by a 0x1A 0x1A 0x00 sequence, and which needs a fix along the lines of what Frank mentioned.  In the patent description this sequence is described as "The header is followed by three binary values. The first is 1AH, which the DOS TYPE command will treat as an end-of-file marker. A zero is used to separate file segments or image offsets. The value of the image format is the start of the binary graphic data."  This is a little vaguely worded - I can't quite be certain whether the "value of the image format" byte is supposed to be counted in the "three binary values" or not.  All 5 of the BSB files I have downloaded from the NOAA chart distribution site contain the sequence 0x1A 0x1A 0x00 at the end of the header (after the last CR/LF) and before the "value of the image format" byte.  The bsb_read.c code makes reference to an example file "optech/World.kap" that has the sequence 0x1A 0x0D 0x0A 0x1A 0x00 after the CR/LF at the end of the header.  This does not seem consistent with the patent description of the format.  However, in all those (six) cases a corrected method for finding 0x1A 0x00 will correctly locate the beginning of the binary image data.  It is possible, however, that the safest implementation is "look for the first 0x00 after finding the first 0x1A", which would handle both the current NOAA samples and the "optech/World.kap" file.

2. The next byte is the "value of the image format" byte and it appears (from the examples) that this is supposed to be the ASCII character "3" (0x33) for image format "3", as Frank surmised below.  From a reading of the patent language, it might be intending to say the same thing there, too.  Since this is redundant information, we might want to ignore it or tolerate mismatches (see below).

3. The libbsb utility bsb2tif will correctly (apparently) convert the current NOAA BSB files to TIF images - at least, they're reasonable-looking TIFF images of NOAA charts with no obvious visual problems.  This utility reports a warning that the bit depth from the IFM tag does not match the data read from the file.  This is because the libbsb code rather overconfidently simply reads and discards two bytes where it's expecting to find 0x1A 0x00 and then loads the third byte as the image format value - it doesn't even bother to inspect those two bytes.  As a result, it reads 0x1A 0x1A and then reads 0x00 and compares it to the image format (my samples use formats 3 and 4) and reports the mismatch.

4. This mismatch does not bother the libbsb code because it uses the scanline index table to locate each line of image data, while the GDAL code iterates through the image data to find each line.  The libbsb code is more robust since a bad scanline can get GDAL irrecoverably lost, while the libbsb code should (theoretically) only mess up the remainder of each line.  The scanline index table is also described in the patent and is quite simple.  All offsets are four-byte values.  The very last four bytes of the entire .KAP file are the offset of the start of the scanline index table.  Each entry in that table is four bytes; the first four bytes are the offset in the file (the entire file, including header data) of the start of the data for scanline 1, the next four bytes are the offset for the start of scanline 2, etc.

5. Therefore, the libbsb code, after complaining about the 0x00 byte not matching, jumps to the scanline index table, which tells it to move to the start of the image data, jumping over the 0x33 byte that I believe is the image format byte it was looking for.
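
The index lookup described in items 4 and 5 can be sketched as below.  This is a rough illustration working on a whole .KAP file held in memory, not the libbsb or GDAL code.  The names read_be32 and kap_line_offset are invented for the example, and the most-significant-byte-first order of the four-byte offsets is an assumption that should be checked against real files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a four-byte offset, most significant byte first.
 * ASSUMPTION: big-endian order; verify against sample files. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Return the file offset of scanline `line` (1-based), or 0 on error.
 * `file` holds the entire .KAP file (header included), `size` its length.
 * Offset 0 can serve as the error value because the text header sits there. */
static uint32_t kap_line_offset(const unsigned char *file, size_t size,
                                uint32_t line)
{
    uint32_t table;
    size_t entry;

    assert(file != NULL);
    if (size < 4 || line < 1)
        return 0;

    /* The last four bytes of the file point at the start of the index. */
    table = read_be32(file + size - 4);
    if (table > size - 4)
        return 0;

    /* Entries lie between the table start and the trailing pointer. */
    if (line > (size - 4 - table) / 4)
        return 0;

    /* Entry 1 is at `table`, entry 2 four bytes later, and so on. */
    entry = (size_t)table + 4 * (size_t)(line - 1);
    return read_be32(file + entry);
}
```

A reader built this way seeks to the returned offset for each scanline instead of trusting whatever bytes follow the previous line.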

The run-length encoding seems to be as expected by GDAL, and I manually calculated a few scan lines and found the results to match what bsb2tif created in the TIF output file.

I have not yet had a chance to commit these observations to code.  Frank, I didn't encounter anything matching your problem of a "still corrupt image" (you seem to have gotten to the image-reading spot correctly), so there may be yet more problems.  GDAL includes some fudge code for dealing with defective scanlines, and I'm still not sure whether there are errors in the understanding of the spec or just bogus BSB files out there.  I may attempt to implement the scanline index mechanism to see if that helps GDAL get through the file.

Thanks again to both of you for the assistance, and I'll report back with more news.

	- Ed

Ed McNierney
President and Chief Mapmaker
TopoZone.com / Maps a la carte, Inc.
73 Princeton Street, Suite 305
North Chelmsford, MA  01863
ed at topozone.com
(978) 251-4242   

-----Original Message-----
From: gdal-dev-bounces at lists.maptools.org [mailto:gdal-dev-bounces at lists.maptools.org] On Behalf Of Frank Warmerdam
Sent: Thursday, January 19, 2006 8:16 AM
To: Eric Dönges
Cc: gdal-dev at lists.maptools.org
Subject: Re: [Gdal-dev] Problems translating BSB files

On 1/19/06, Eric Dönges <eric.doenges at gmx.net> wrote:
> Ed, I think the problem is the following code in bsb_read.c (please 
> note that this is from a fairly old version of GDAL, since I have 
> extensively rewritten BSB support - unfortunately, I cannot share this 
> code with the world at large because the necessary information to do 
> the rewrite was obtained under NDA from MapTech - so this might not be 
> exactly like this in recent GDAL code):
>
>      {
>          int    nSkipped = 0;
>
>          while( nSkipped < 100
>                && (BSBGetc( fp, bNO1 ) != 0x1A || BSBGetc( fp, bNO1 ) != 0x00) )
>              nSkipped++;
>
>          if( nSkipped == 100 )
>          {
>              BSBClose( psInfo );
>              CPLError( CE_Failure, CPLE_AppDefined,
>                        "Failed to find compressed data segment of BSB file." );
>              return NULL;
>          }
>      }
>
>
> The file in question (83116_1.KAP) looks like this in the hexdump:
>
> 00000f50  32 34 33 31 0d 0a 1a 1a  00 33 01 a0 9e 04 00 02  |2431.....3......|
>
> Note the two 0x1a directly following each other. So what happens is 
> that in the while loop above, a BSBGetc is executed, which fetches 
> 0x1a, which means the second BSBGetc in the || clause is executed, 
> which also fetches a 0x1a. Since this is not a zero, the test is true 
> and nSkipped is incremented. In the next run through the loop, BSBGetc 
> gets a zero, and then we never find the 0x1a 0x00 sequence.
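
The failure described above comes from consuming two bytes per loop iteration, so a 0x1A 0x00 pair that straddles two iterations is never seen.  A more tolerant scan, along the lines suggested elsewhere in this thread (find the first 0x1A, then the first 0x00 after it), might look like the sketch below.  It works on an in-memory buffer rather than through BSBGetc, and kap_find_image_start is a hypothetical name, not a GDAL function.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the byte that follows the header terminator
 * (the "value of the image format" byte), or 0 if no terminator with
 * a following byte is found. */
static size_t kap_find_image_start(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL);

    /* Skip the text header up to the first 0x1A. */
    while (i < len && buf[i] != 0x1A)
        i++;

    /* From there, advance one byte at a time to the first 0x00. */
    while (i < len && buf[i] != 0x00)
        i++;

    /* Fail if there was no 0x00, or nothing after it. */
    if (i + 1 >= len)
        return 0;

    return i + 1;
}
```

On the sixteen bytes in the hexdump above this returns index 9, the 0x33 byte.  On a header ending in 0x1A 0x0D 0x0A 0x1A 0x00 it returns the index just past the 0x00.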

Eric,

I thought of that, and modified a local copy of bsb_read.c to identify the 0x1a 0x00 properly.  Next I discovered that the next byte, which should be the number of bits (often 0x04 or 0x03) was crazy (0x33).
I took a bit of a jump-of-intuition and guessed that 0x33 (ASCII '3') should have been binary 0x03 and tried to operate on that basis.
But this still produced a corrupt image, even though it did get a bit further.

It was at this point that I gave up, under the assumption that there were significant things I was missing.

Good work identifying the 0x1A 0x00 issue though!  You are a binary file dumping, reverse engineering fellow after my own heart.

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam at pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | Geospatial Programmer for Rent

_______________________________________________
Gdal-dev mailing list
Gdal-dev at lists.maptools.org
http://lists.maptools.org/mailman/listinfo/gdal-dev
