[postgis-users] Extracting variable information from netcdf, imported as raster to a table

Regina Obe lr at pcorp.us
Thu Nov 16 18:46:29 PST 2023


I don’t know much about netCDF but I would assume GDAL would handle the packing unraveling behind the scenes.

 

Some things I notice wrong

 

https://postgis.net/docs/RT_ST_PixelAsCentroids.html  is a set returning function, so you can’t just stuff it in a record variable.  It has to go in a table.

So I’d get rid of this line

 

          SELECT * FROM ST_PixelAsCentroids(raster_record.clipped_raster, band_number, true) INTO pixel_data;



I’d also scrap this, because ultimately you want to work on all centroids not the first one that falls out of the tree

EXECUTE format('SELECT ST_Value($1, $2, $3, true)', raster_record.clipped_raster, band_number, centroid_point) INTO variable_value USING raster_record.clipped_raster, band_number, centroid_point;



 

I’d change this


            -- Insert the extracted data into the extracted_data_bbox table
            INSERT INTO extracted_data_bbox (year, year_day, lat, lon, prcp, tmax, tmin, observation_time)
            VALUES (year_from_filename, band_number, ST_Y(centroid_point), ST_X(centroid_point), CASE WHEN variable_name = 'prcp' THEN variable_value ELSE NULL END,
                    CASE WHEN variable_name = 'tmax' THEN variable_value ELSE NULL END,
                    CASE WHEN variable_name = 'tmin' THEN variable_value ELSE NULL END, observation_time);

 

To this:

 

            INSERT INTO extracted_data_bbox (year, year_day, lat, lon, prcp, tmax, tmin, observation_time)
            SELECT year_from_filename, band_number, ST_Y(sc.geom), ST_X(sc.geom), CASE WHEN variable_name = 'prcp' THEN sc.val ELSE NULL END,
                    CASE WHEN variable_name = 'tmax' THEN sc.val ELSE NULL END,
                    CASE WHEN variable_name = 'tmin' THEN sc.val ELSE NULL END, observation_time

           FROM  ST_PixelAsCentroids(raster_record.clipped_raster, band_number, true) AS sc;

 

 

 

 

From: Manaswini Ganjam <manu.ganjam at gmail.com> 
Sent: Thursday, November 16, 2023 1:19 PM
To: Regina Obe <lr at pcorp.us>
Cc: PostGIS Users Discussion <postgis-users at lists.osgeo.org>
Subject: Re: [postgis-users] Extracting variable information from netcdf, imported as raster to a table

 

Thank you, Regina,

 

I apologize, I have shared the older version of my code, I have tried ST_pixelascentroid and ST_value and the issue was I messed up the parameters of centroids and that caused the errors now I corrected that as well. I think the code below is a good representation of my approach, I would like to iterate for every raster in a table, for every band and for every lat and lon and extract st value, and insert it to a table, 

 

psql:/home/manaswini/MEGA/bbgc_uw_rpackage/rproject/raster2pgsql/extract_table_queryyyy.sql:66: NOTICE:  Invalid band index (must use 1-based). Returning empty set for all the coordinates and bands. What could be the issue. 

 

Is this because my data is packed? the readme file of the data has this information: Details about the data packing:

The data has been packed into short integers (2 bytes instead of 4 byte reals) to save space. You must unpack that data to get the correct floating point representation of the data. Each netCDF variable that has been packed has an add_offset and scale_factor attribute associated with it. Some software automatically unpacks the data when it is read.
 
The formula to unpack the data is:
 
unpacked value = add_offset + ( (packed value) * scale_factor )
 
For more information see here:
https://www.unidata.ucar.edu/software/netcdf/workshops/2010/bestpractices/Packing.html
 
There's also another attribute called "missing_value". In this case all the -32768 values you see are missing. Only the grid points outside the downscaling domain is given the missing data value.
 
The packing saves a lot of space, that is why the data is packed.

 

 

-- Create the precipitation_temperature_data table
DROP TABLE IF EXISTS extracted_data_bbox;
CREATE TABLE extracted_data_bbox (
    id SERIAL PRIMARY KEY,
    year integer,
    year_day integer,
    lat double precision,
    lon double precision,
    prcp double precision,
    tmax double precision,
    tmin double precision,
    observation_time timestamp
);

-- Loop through all records in gfdl_03_bbox
DO $$
DECLARE
    raster_record RECORD;
    year_from_filename integer;
    observation_time timestamp;
    centroid_point geometry;
    variable_name text;
    variable_value double precision;
    band_number integer;
    band_numbers integer[];
    pixel_data RECORD;
BEGIN
    FOR raster_record IN (SELECT * FROM gfdl_03_bbox) LOOP
        SELECT regexp_replace(raster_record.filename, '.*_(\d{4})[^0-9]+', '\1')::integer INTO year_from_filename;

        -- Determine the variable name based on the filename
        IF raster_record.filename ~ 'prcp' THEN
            variable_name := 'prcp';
        ELSIF raster_record.filename ~ 'tmax' THEN
            variable_name := 'tmax';
        ELSIF raster_record.filename ~ 'tmin' THEN
            variable_name := 'tmin';
        ELSE
            RAISE EXCEPTION 'Unknown variable in filename: %', raster_record.filename;
        END IF;

        band_numbers := ARRAY(SELECT generate_series(1, 366));

        -- Loop through all bands within the file
        FOREACH band_number IN ARRAY band_numbers LOOP
            -- Print the band number to the PostgreSQL log
            RAISE NOTICE 'Band Number: %', band_number;

            observation_time := MAKE_DATE(year_from_filename, 1, 1) + (band_number - 1) * interval '1 day';

            SELECT * FROM ST_PixelAsCentroids(raster_record.clipped_raster, band_number, true) INTO pixel_data;

            -- Extract the centroid point from pixel_data
            centroid_point := pixel_data.geom;

            -- Extract the variable value based on the variable name
            EXECUTE format('SELECT ST_Value($1, $2, $3, true)', raster_record.clipped_raster, band_number, centroid_point) INTO variable_value USING raster_record.clipped_raster, band_number, centroid_point;

            -- Insert the extracted data into the extracted_data_bbox table
            INSERT INTO extracted_data_bbox (year, year_day, lat, lon, prcp, tmax, tmin, observation_time)
            VALUES (year_from_filename, band_number, ST_Y(centroid_point), ST_X(centroid_point), CASE WHEN variable_name = 'prcp' THEN variable_value ELSE NULL END,
                    CASE WHEN variable_name = 'tmax' THEN variable_value ELSE NULL END,
                    CASE WHEN variable_name = 'tmin' THEN variable_value ELSE NULL END, observation_time);
        END LOOP;
    END LOOP;
END;
$$;

I have tried the following query to check values for a raster band : the output is again null psql:/home/manaswini/MEGA/bbgc_uw_rpackage/rproject/raster2pgsql/test_query_fff.sql:22: NOTICE:  Band: 20, X: 1, Y: 135, Value: <NULL>

:

DO $$
DECLARE
    band_number integer := 20; -- Replace with the desired band number
    x_coord integer;
    y_coord integer;
    pixel_value double precision;
    srid integer := 4326; -- Replace with the correct SRID for your data
BEGIN
    -- Loop through all x and y coordinates in the raster band
    FOR x_coord IN (SELECT generate_series(1, ST_Width(rast.clipped_raster)) FROM gfdl_03_bbox AS rast WHERE rast.filename = 'prcp_03_2034.nc <http://prcp_03_2034.nc> ') LOOP
        FOR y_coord IN (SELECT generate_series(1, ST_Height(rast.clipped_raster)) FROM gfdl_03_bbox AS rast WHERE rast.filename = 'prcp_03_2034.nc <http://prcp_03_2034.nc> ') LOOP
            -- Get the pixel value at the current x and y coordinates
            SELECT ST_Value(ST_SetSRID(rast.clipped_raster, srid), band_number, ST_SetSRID(ST_Point(x_coord, y_coord), srid)) INTO pixel_value
            FROM gfdl_03_bbox AS rast
            WHERE rast.filename = 'prcp_03_2034.nc <http://prcp_03_2034.nc> ';

            -- Print or use the pixel value as needed
            RAISE NOTICE 'Band: %, X: %, Y: %, Value: %', band_number, x_coord, y_coord, pixel_value;
        END LOOP;
    END LOOP;
END;

Thank you,

Manaswini

 

 

On Wed, 15 Nov 2023 at 20:42, Regina Obe <lr at pcorp.us <mailto:lr at pcorp.us> > wrote:

I didn’t realize that netCDF also has a vector component.

 

https://gdal.org/drivers/vector/netcdf.html#vector-netcdf

 

To read the vector component, you’d use the ogr_fdw extension to read that rather than postgis_raster extension.

 

Details here -  https://github.com/pramsey/pgsql-ogr-fdw

 

So perhaps that is what your python is doing reading the vector component.  I don’t know if netcdf mixes those in one data set or not since I have no experience using netcdf.

 

I recall you said you are using the OSGeo Live 16 distribution.  I just checked the OSGeoLive 16 does include ogr_fdw 

 

So do 

 

CREATE EXTENSION ogr_fdw;

 

 

The list of supported formats you can see with this query:

 

SELECT name FROM unnest(ogr_fdw_drivers()) AS f(name) ORDER BY name;

 

Which for osgeolive 16, I see netCDF listed

 

 

 

 

 

From: Regina Obe <lr at pcorp.us <mailto:lr at pcorp.us> > 
Sent: Wednesday, November 15, 2023 6:19 PM
To: 'PostGIS Users Discussion' <postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org> >
Cc: 'Manaswini Ganjam' <manu.ganjam at gmail.com <mailto:manu.ganjam at gmail.com> >
Subject: RE: [postgis-users] Extracting variable information from netcdf, imported as raster to a table

 

Just confirming some stuff, since I’m not completely following:

 

Raster_record.rast column is of type raster correct?  IF so ST_X and ST_Y won’t work since those are for geometry types.

 

Also ST_Value(raster_record.rast, band_number), won’t work either since that expects as input a geometry or x,y on the raster you want the value you.

I would think you would have gotten an error with that, which makes me feel I’m missing something critical.

 

If you want to extract all the pixels in a raster, you’d do something like https://postgis.net/docs/RT_ST_PixelAsPoints.html

 

SELECT pp.x, pp.y, pp.val, ST_X(pp.geom) AS lon, ST_Y(pp.geom) AS lat

FROM raster_record, 

ST_PixelAsPoints(raster_record.rast, 1) AS pp

 


           

 

From: postgis-users <postgis-users-bounces at lists.osgeo.org <mailto:postgis-users-bounces at lists.osgeo.org> > On Behalf Of Manaswini Ganjam via postgis-users
Sent: Wednesday, November 15, 2023 2:01 PM
To: postgis-users at lists.osgeo.org <mailto:postgis-users at lists.osgeo.org> 
Cc: Manaswini Ganjam <manu.ganjam at gmail.com <mailto:manu.ganjam at gmail.com> >
Subject: [postgis-users] Extracting variable information from netcdf, imported as raster to a table

 

Hi, 

I have been trying to download s3 cloud stored gridded climate data and generate tables with variables, lat, lon and timestamp (year, yearday). To achieve this I used raster2pgsql and imported multiple netcdf files into a database table. 

 

Question: How to achieve the extraction of variables using postgis? I tried using ST_value, ST_pixelaspoints but I was getting errors, mainly due to the format in which netcdfs are stored in the database (the error says can't load some characters like 00E30100082...), I even tried changing the datatype to float but still did not work. I mean it is probably not simple like selecting a variable from the netcdf. I have enclosed my sql query below: 

 

  -- Iterate through all raster files in the table
    FOR raster_record IN (SELECT * FROM gfdl_03_prcp) LOOP
        -- Determine the year from the raster file name, assuming the format is 'prcp_03_1950.nc <http://prcp_03_1950.nc> '
        SELECT
            regexp_replace(raster_record.filename, '.*_(\d{4})\.nc', '\1')::integer
        INTO
            year;
        
        -- Calculate the start date of the year
        year_start := (year || '-01-01')::date;
        
        -- Determine if the year is a leap year
        is_leap_year := EXTRACT(ISODOW FROM (year_start + interval '1 year')) = 7;
        
        -- Set the number of bands for the year (365 for non-leap years, 366 for leap years)
        FOR band_number IN 1..(CASE WHEN is_leap_year THEN 366 ELSE 365 END) LOOP
            -- Calculate the observation_time using the year and band number
            observation_time := year_start + (band_number - 1) * interval '1 day';
            
            -- Extract X (lon) and Y (lat) coordinates from the raster
            SELECT
                ST_X(raster_record.rast) AS lon,
                ST_Y(raster_record.rast) AS lat
            INTO
                lon,
                lat;
            
            -- Insert the lat, lon, prcp, and observation_time into the extracted_values table
            INSERT INTO extracted_values (lat, lon, prcp, observation_time)
            VALUES
                (lat, lon, ST_Value(raster_record.rast, band_number), observation_time);
            
            -- Increment the counter
            counter := counter + 1;
            
            -- Commit the transaction periodically in batches
            IF counter % batch_size = 0 THEN
                COMMIT;
            END IF;
        END LOOP;
    END LOOP;
    

 

The metadata for the two files is as follows: 

 

File from database:

{'NC_GLOBAL#Conventions': 'CF-1.5',
 'NC_GLOBAL#GDAL': 'GDAL 3.6.4, released 2023/04/17',
 'NC_GLOBAL#history': 'Wed Nov 15 13:32:13 2023: GDAL CreateCopy( not_clipped_prcp.nc <http://not_clipped_prcp.nc> , ... )'}
File before loading into the database:
{'lat#units': 'degrees_north',
 'lon#units': 'degrees_east',
 'NC_GLOBAL#title': 'Daily statistically downscaled CMIP5 data for the United States and southern Canada east of the Rocky Mountains, version 1.0, realization 1, 0.1x0.1 degree spatial resolution.',
 'NETCDF_DIM_EXTRA': '{time}',
 'NETCDF_DIM_time_DEF': '{366,4}',
 'NETCDF_DIM_time_VALUES': '{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14....362,363,364,365}',
 'prcp#add_offset': '819.17499',
 'prcp#long_name': 'daily precipitation accumulation',
 'prcp#missing_value': '-32768',
 'prcp#scale_factor': '0.025',
 'prcp#units': 'mm',
 'prcp#_FillValue': '-32768',
 'time#units': 'days since 1952-1-1 0:0:0.0'}
 
In case this information is useful: Previously I used python to extract variable information and generate a csv or table using this variable information, and the code is enclosed below. In the code I extracted variable values using lon = dataset.variables['lon'][:] and iterated for loops to write them all in csv. 

Python code:

import netCDF4 as nc

# Step 1: Read the NetCDF file
filename = "/home/manaswini/prcp_03_1950.nc <http://prcp_03_1950.nc> "
dataset = nc.Dataset(filename)
dataset.set_auto_mask(False)
dataset.set_auto_scale(True)

 

lon = dataset.variables['lon'][:]
lat = dataset.variables['lat'][:]
time = dataset.variables['time'][:]
prcp = dataset.variables['prcp'][:]

 

import numpy as np
import csv

# csv_buffer
csv_buffer = open('output.csv', 'w', newline='')
csv_writer = csv.writer(csv_buffer)

# Iterate through grid points and write to CSV buffer
for i in enumerate(lon):
    for j in enumerate(lat):
        for k in enumerate(time):
         csv_writer.writerow([lat[j], lon[i], prcp[i][j][k], year[k], yearday[k]])


# Close the CSV buffer
csv_buffer.close()

 

Thank you, 

Manaswini Ganjam




 

-- 

Manaswini Ganjam

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20231116/870dbe56/attachment.htm>


More information about the postgis-users mailing list