[postgis-users] Extracting variable information from netcdf, imported as raster to a table
Manaswini Ganjam
manu.ganjam at gmail.com
Thu Nov 16 20:31:47 PST 2023
Regina,
Thank you for the suggestions, the code looks clean and the computation was
much better, but I still have the same issue, the values are unrealistic,
this happened to me before as well (when I tried to run st_value for one
raster band, with some x and y). Find the enclosed screenshot.
[image: image.png]
I have test run the code for one prcp file and the prcp values are
negative, and the no data value for this dataset is -32768 (most of the
values are -32767 or -32750) and I have not preprocessed this data. Do we
need to manually unpack this data? If that is the case, how do I access the
scale_factor and add_offset attributes packed with netcdf?
This information is from the climate data readme file:The data has been
packed into short integers (2 bytes instead of 4 byte reals) to save space.
You must unpack that data to get the correct floating point representation
of the data. Each netCDF variable that has been packed has an add_offset
and scale_factor attribute associated with it. Some software automatically
unpacks the data when it is read.
The formula to unpack the data is:
unpacked value = add_offset + ( (packed value) * scale_factor )
Thank you,
Manaswini
On Thu, 16 Nov 2023 at 22:02, Regina Obe <lr at pcorp.us> wrote:
> One more optimization. I think you can get rid of the FOREACH band_number
> too and reduce all that to this query
>
>
>
> So all this goes
>
>
>
> band_numbers := ARRAY(SELECT generate_series(1, 366));
>
> -- Loop through all bands within the file
> FOREACH band_number IN ARRAY band_numbers LOOP
>
> :
>
> LOOP
>
>
>
> Use ST_NumBands instead of relying on your rasters all having 366 bands
> https://postgis.net/docs/RT_ST_NumBands.html
>
> So replace all the above with this:
>
>
>
> INSERT INTO extracted_data_bbox (year, year_day, lat, lon,
> prcp, tmax, tmin, observation_time)
> SELECT year_from_filename, n.band_number, ST_Y(sc.geom),
> ST_X(sc.geom), CASE WHEN variable_name = 'prcp' THEN sc.val ELSE NULL END,
> CASE WHEN variable_name = 'tmax' THEN sc.val ELSE NULL
> END,
> CASE WHEN variable_name = 'tmin' THEN sc.val ELSE NULL
> END, observation_time
>
> FROM generate_series(1,
> ST_NumBands(raster_record.clipped_raster) ) AS n(band_number),
> ST_PixelAsCentroids(raster_record.clipped_raster, n.band_number, true) AS
> sc;
>
>
>
> That will solve the issue of you going over the band count of your raster,
> which might be an issue you are running into
>
>
>
>
>
> *From:* Regina Obe <lr at pcorp.us>
> *Sent:* Thursday, November 16, 2023 9:46 PM
> *To:* 'Manaswini Ganjam' <manu.ganjam at gmail.com>
> *Cc:* 'PostGIS Users Discussion' <postgis-users at lists.osgeo.org>
> *Subject:* RE: [postgis-users] Extracting variable information from
> netcdf, imported as raster to a table
>
>
>
> I don’t know much about netCDF but I would assume GDAL would handle the
> packing unraveling behind the scenes.
>
>
>
> Some things I notice wrong
>
>
>
> https://postgis.net/docs/RT_ST_PixelAsCentroids.html is a set returning
> function, so you can’t just stuff it in a record variable. It has to go in
> a table.
>
> So I’d get rid of this line
>
>
>
> SELECT * FROM ST_PixelAsCentroids(raster_record.clipped_raster,
> band_number, true) INTO pixel_data;
>
> I’d also scrap this, because ultimately you want to work on all centroids
> not the first one that falls out of the tree
>
> EXECUTE format('SELECT ST_Value($1, $2, $3, true)',
> raster_record.clipped_raster, band_number, centroid_point) INTO
> variable_value USING raster_record.clipped_raster, band_number,
> centroid_point;
>
>
>
> I’d change this
>
>
> -- Insert the extracted data into the extracted_data_bbox table
> INSERT INTO extracted_data_bbox (year, year_day, lat, lon,
> prcp, tmax, tmin, observation_time)
> VALUES (year_from_filename, band_number, ST_Y(centroid_point),
> ST_X(centroid_point), CASE WHEN variable_name = 'prcp' THEN variable_value
> ELSE NULL END,
> CASE WHEN variable_name = 'tmax' THEN variable_value
> ELSE NULL END,
> CASE WHEN variable_name = 'tmin' THEN variable_value
> ELSE NULL END, observation_time);
>
>
>
> To this:
>
>
>
>
>
>
> * INSERT INTO extracted_data_bbox (year, year_day, lat, lon,
> prcp, tmax, tmin, observation_time) SELECT year_from_filename,
> band_number, ST_Y(sc.geom), ST_X(sc.geom), CASE WHEN variable_name = 'prcp'
> THEN sc.val ELSE NULL END, CASE WHEN variable_name =
> 'tmax' THEN sc.val ELSE NULL END, CASE WHEN
> variable_name = 'tmin' THEN sc.val ELSE NULL END, observation_time*
>
> * FROM ST_PixelAsCentroids(raster_record.clipped_raster,
> band_number, true) AS sc;*
>
>
>
>
>
>
>
>
>
> *From:* Manaswini Ganjam <manu.ganjam at gmail.com>
> *Sent:* Thursday, November 16, 2023 1:19 PM
> *To:* Regina Obe <lr at pcorp.us>
> *Cc:* PostGIS Users Discussion <postgis-users at lists.osgeo.org>
> *Subject:* Re: [postgis-users] Extracting variable information from
> netcdf, imported as raster to a table
>
>
>
> Thank you, Regina,
>
>
>
> I apologize, I have shared the older version of my code, I have tried
> ST_pixelascentroid and ST_value and the issue was I messed up the
> parameters of centroids and that caused the errors now I corrected that as
> well. I think the code below is a good representation of my approach, I
> would like to iterate for every raster in a table, for every band and for
> every lat and lon and extract st value, and insert it to a table,
>
>
>
> psql:/home/manaswini/MEGA/bbgc_uw_rpackage/rproject/raster2pgsql/extract_table_queryyyy.sql:66:
> NOTICE: Invalid band index (must use 1-based). Returning empty set for all
> the coordinates and bands. What could be the issue.
>
>
>
> Is this because my data is packed? the readme file of the data has this
> information: Details about the data packing:
>
> The data has been packed into short integers (2 bytes instead of 4 byte reals) to save space. You must unpack that data to get the correct floating point representation of the data. Each netCDF variable that has been packed has an add_offset and scale_factor attribute associated with it. Some software automatically unpacks the data when it is read.
>
>
>
> The formula to unpack the data is:
>
>
>
> unpacked value = add_offset + ( (packed value) * scale_factor )
>
>
>
> For more information see here:
>
> https://www.unidata.ucar.edu/software/netcdf/workshops/2010/bestpractices/Packing.html
>
>
>
> There's also another attribute called "missing_value". In this case all the -32768 values you see are missing. Only the grid points outside the downscaling domain is given the missing data value.
>
>
>
> The packing saves a lot of space, that is why the data is packed.
>
>
>
>
>
> -- Create the precipitation_temperature_data table
> DROP TABLE IF EXISTS extracted_data_bbox;
> CREATE TABLE extracted_data_bbox (
> id SERIAL PRIMARY KEY,
> year integer,
> year_day integer,
> lat double precision,
> lon double precision,
> prcp double precision,
> tmax double precision,
> tmin double precision,
> observation_time timestamp
> );
>
> -- Loop through all records in gfdl_03_bbox
> DO $$
> DECLARE
> raster_record RECORD;
> year_from_filename integer;
> observation_time timestamp;
> centroid_point geometry;
> variable_name text;
> variable_value double precision;
> band_number integer;
> band_numbers integer[];
> pixel_data RECORD;
> BEGIN
> FOR raster_record IN (SELECT * FROM gfdl_03_bbox) LOOP
> SELECT regexp_replace(raster_record.filename, '.*_(\d{4})[^0-9]+',
> '\1')::integer INTO year_from_filename;
>
> -- Determine the variable name based on the filename
> IF raster_record.filename ~ 'prcp' THEN
> variable_name := 'prcp';
> ELSIF raster_record.filename ~ 'tmax' THEN
> variable_name := 'tmax';
> ELSIF raster_record.filename ~ 'tmin' THEN
> variable_name := 'tmin';
> ELSE
> RAISE EXCEPTION 'Unknown variable in filename: %',
> raster_record.filename;
> END IF;
>
> band_numbers := ARRAY(SELECT generate_series(1, 366));
>
> -- Loop through all bands within the file
> FOREACH band_number IN ARRAY band_numbers LOOP
> -- Print the band number to the PostgreSQL log
> RAISE NOTICE 'Band Number: %', band_number;
>
> observation_time := MAKE_DATE(year_from_filename, 1, 1) +
> (band_number - 1) * interval '1 day';
>
> SELECT * FROM
> ST_PixelAsCentroids(raster_record.clipped_raster, band_number, true) INTO
> pixel_data;
>
> -- Extract the centroid point from pixel_data
> centroid_point := pixel_data.geom;
>
> -- Extract the variable value based on the variable name
> EXECUTE format('SELECT ST_Value($1, $2, $3, true)',
> raster_record.clipped_raster, band_number, centroid_point) INTO
> variable_value USING raster_record.clipped_raster, band_number,
> centroid_point;
>
> -- Insert the extracted data into the extracted_data_bbox table
> INSERT INTO extracted_data_bbox (year, year_day, lat, lon,
> prcp, tmax, tmin, observation_time)
> VALUES (year_from_filename, band_number, ST_Y(centroid_point),
> ST_X(centroid_point), CASE WHEN variable_name = 'prcp' THEN variable_value
> ELSE NULL END,
> CASE WHEN variable_name = 'tmax' THEN variable_value
> ELSE NULL END,
> CASE WHEN variable_name = 'tmin' THEN variable_value
> ELSE NULL END, observation_time);
> END LOOP;
> END LOOP;
> END;
> $$;
>
> I have tried the following query to check values for a raster band : the
> output is again null
> psql:/home/manaswini/MEGA/bbgc_uw_rpackage/rproject/raster2pgsql/test_query_fff.sql:22:
> NOTICE: Band: 20, X: 1, Y: 135, Value: <NULL>
>
> :
>
> DO $$
> DECLARE
> band_number integer := 20; -- Replace with the desired band number
> x_coord integer;
> y_coord integer;
> pixel_value double precision;
> srid integer := 4326; -- Replace with the correct SRID for your data
> BEGIN
> -- Loop through all x and y coordinates in the raster band
> FOR x_coord IN (SELECT generate_series(1,
> ST_Width(rast.clipped_raster)) FROM gfdl_03_bbox AS rast WHERE
> rast.filename = 'prcp_03_2034.nc') LOOP
> FOR y_coord IN (SELECT generate_series(1,
> ST_Height(rast.clipped_raster)) FROM gfdl_03_bbox AS rast WHERE
> rast.filename = 'prcp_03_2034.nc') LOOP
> -- Get the pixel value at the current x and y coordinates
> SELECT ST_Value(ST_SetSRID(rast.clipped_raster, srid),
> band_number, ST_SetSRID(ST_Point(x_coord, y_coord), srid)) INTO pixel_value
> FROM gfdl_03_bbox AS rast
> WHERE rast.filename = 'prcp_03_2034.nc';
>
> -- Print or use the pixel value as needed
> RAISE NOTICE 'Band: %, X: %, Y: %, Value: %', band_number,
> x_coord, y_coord, pixel_value;
> END LOOP;
> END LOOP;
> END;
>
> Thank you,
>
> Manaswini
>
>
>
>
>
> On Wed, 15 Nov 2023 at 20:42, Regina Obe <lr at pcorp.us> wrote:
>
> I didn’t realize that netCDF also has a vector component.
>
>
>
> https://gdal.org/drivers/vector/netcdf.html#vector-netcdf
>
>
>
> To read the vector component, you’d use the ogr_fdw extension to read that
> rather than postgis_raster extension.
>
>
>
> Details here - https://github.com/pramsey/pgsql-ogr-fdw
>
>
>
> So perhaps that is what your python is doing reading the vector
> component. I don’t know if netcdf mixes those in one data set or not since
> I have no experience using netcdf.
>
>
>
> I recall you said you are using the OSGeo Live 16 distribution. I just
> checked the OSGeoLive 16 does include ogr_fdw
>
>
>
> So do
>
>
>
> CREATE EXTENSION ogr_fdw;
>
>
>
>
>
> The list of supported formats you can see with this query:
>
>
>
> SELECT name FROM unnest(ogr_fdw_drivers()) AS f(name) ORDER BY name;
>
>
>
> Which for osgeolive 16, I see netCDF listed
>
>
>
>
>
>
>
>
>
>
>
> *From:* Regina Obe <lr at pcorp.us>
> *Sent:* Wednesday, November 15, 2023 6:19 PM
> *To:* 'PostGIS Users Discussion' <postgis-users at lists.osgeo.org>
> *Cc:* 'Manaswini Ganjam' <manu.ganjam at gmail.com>
> *Subject:* RE: [postgis-users] Extracting variable information from
> netcdf, imported as raster to a table
>
>
>
> Just confirming some stuff, since I’m not completely following:
>
>
>
> Raster_record.rast column is of type raster correct? IF so ST_X and ST_Y
> won’t work since those are for geometry types.
>
>
>
> Also ST_Value(raster_record.rast, band_number), won’t work either since
> that expects as input a geometry or x,y on the raster you want the value
> you.
>
> I would think you would have gotten an error with that, which makes me
> feel I’m missing something critical.
>
>
>
> If you want to extract all the pixels in a raster, you’d do something like
> https://postgis.net/docs/RT_ST_PixelAsPoints.html
>
>
>
> SELECT pp.x, pp.y, pp.val, ST_X(pp.geom) AS lon, ST_Y(pp.geom) AS lat
>
> FROM raster_record,
>
> ST_PixelAsPoints(raster_record.rast, 1) AS pp
>
>
>
>
>
>
>
>
> *From:* postgis-users <postgis-users-bounces at lists.osgeo.org> *On Behalf
> Of *Manaswini Ganjam via postgis-users
> *Sent:* Wednesday, November 15, 2023 2:01 PM
> *To:* postgis-users at lists.osgeo.org
> *Cc:* Manaswini Ganjam <manu.ganjam at gmail.com>
> *Subject:* [postgis-users] Extracting variable information from netcdf,
> imported as raster to a table
>
>
>
> Hi,
>
> I have been trying to download s3 cloud stored gridded climate data and
> generate tables with variables, lat, lon and timestamp (year, yearday). To
> achieve this I used raster2pgsql and imported multiple netcdf files into a
> database table.
>
>
>
> Question: How to achieve the extraction of variables using postgis? I
> tried using ST_value, ST_pixelaspoints but I was getting errors, mainly due
> to the format in which netcdfs are stored in the database (the error says
> can't load some characters like 00E30100082...), I even tried changing
> the datatype to float but still did not work. I mean it is probably not
> simple like selecting a variable from the netcdf. I have enclosed my sql
> query below:
>
>
>
> -- Iterate through all raster files in the table
> FOR raster_record IN (SELECT * FROM gfdl_03_prcp) LOOP
> -- Determine the year from the raster file name, assuming the
> format is 'prcp_03_1950.nc'
> SELECT
> regexp_replace(raster_record.filename, '.*_(\d{4})\.nc',
> '\1')::integer
> INTO
> year;
>
> -- Calculate the start date of the year
> year_start := (year || '-01-01')::date;
>
> -- Determine if the year is a leap year
> is_leap_year := EXTRACT(ISODOW FROM (year_start + interval '1
> year')) = 7;
>
> -- Set the number of bands for the year (365 for non-leap years,
> 366 for leap years)
> FOR band_number IN 1..(CASE WHEN is_leap_year THEN 366 ELSE 365
> END) LOOP
> -- Calculate the observation_time using the year and band
> number
> observation_time := year_start + (band_number - 1) * interval
> '1 day';
>
> -- Extract X (lon) and Y (lat) coordinates from the raster
> SELECT
> ST_X(raster_record.rast) AS lon,
> ST_Y(raster_record.rast) AS lat
> INTO
> lon,
> lat;
>
> -- Insert the lat, lon, prcp, and observation_time into the
> extracted_values table
> INSERT INTO extracted_values (lat, lon, prcp, observation_time)
> VALUES
> (lat, lon, ST_Value(raster_record.rast, band_number),
> observation_time);
>
> -- Increment the counter
> counter := counter + 1;
>
> -- Commit the transaction periodically in batches
> IF counter % batch_size = 0 THEN
> COMMIT;
> END IF;
> END LOOP;
> END LOOP;
>
>
>
>
> The metadata for the two files is as follows:
>
>
>
> File from database:
>
> {'NC_GLOBAL#Conventions': 'CF-1.5',
>
> 'NC_GLOBAL#GDAL': 'GDAL 3.6.4, released 2023/04/17',
>
> 'NC_GLOBAL#history': 'Wed Nov 15 13:32:13 2023: GDAL CreateCopy( not_clipped_prcp.nc, ... )'}
>
> File before loading into the database:
>
> {'lat#units': 'degrees_north',
>
> 'lon#units': 'degrees_east',
>
> 'NC_GLOBAL#title': 'Daily statistically downscaled CMIP5 data for the United States and southern Canada east of the Rocky Mountains, version 1.0, realization 1, 0.1x0.1 degree spatial resolution.',
>
> 'NETCDF_DIM_EXTRA': '{time}',
>
> 'NETCDF_DIM_time_DEF': '{366,4}',
>
> 'NETCDF_DIM_time_VALUES': '{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14....362,363,364,365}',
>
> 'prcp#add_offset': '819.17499',
>
> 'prcp#long_name': 'daily precipitation accumulation',
>
> 'prcp#missing_value': '-32768',
>
> 'prcp#scale_factor': '0.025',
>
> 'prcp#units': 'mm',
>
> 'prcp#_FillValue': '-32768',
>
> 'time#units': 'days since 1952-1-1 0:0:0.0'}
>
>
>
> In case this information is useful: Previously I used python to extract variable information and generate a csv or table using this variable information, and the code is enclosed below. In the code I extracted variable values using lon = dataset.variables['lon'][:] and iterated for loops to write them all in csv.
>
> Python code:
>
> import netCDF4 as nc
>
> # Step 1: Read the NetCDF file
> filename = "/home/manaswini/prcp_03_1950.nc"
> dataset = nc.Dataset(filename)
> dataset.set_auto_mask(False)
> dataset.set_auto_scale(True)
>
>
>
> lon = dataset.variables['lon'][:]
> lat = dataset.variables['lat'][:]
> time = dataset.variables['time'][:]
> prcp = dataset.variables['prcp'][:]
>
>
>
> import numpy as np
> import csv
>
> # csv_buffer
> csv_buffer = open('output.csv', 'w', newline='')
> csv_writer = csv.writer(csv_buffer)
>
> # Iterate through grid points and write to CSV buffer
> for i in enumerate(lon):
> for j in enumerate(lat):
> for k in enumerate(time):
> csv_writer.writerow([lat[j], lon[i], prcp[i][j][k], year[k],
> yearday[k]])
>
>
> # Close the CSV buffer
> csv_buffer.close()
>
>
>
> Thank you,
>
> Manaswini Ganjam
>
>
>
>
> --
>
> Manaswini Ganjam
>
--
Manaswini Ganjam
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20231116/79e824b1/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 232418 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20231116/79e824b1/attachment.png>
More information about the postgis-users
mailing list