[postgis-users] joining "nearest" data

Obe, Regina robe.dnd at cityofboston.gov
Thu Feb 15 05:35:53 PST 2007


Kirk sorry - I just realized in my suggestions I didn't take into
consideration the join needed between weather and solar so I shudder to
think how many records that query is returning.  I don't think you can
join by obs_id because they could be in 2 different stations so you'll
probably need to join by the date field and make sure your date fields
are indexed.  
 
You should always run an explain on your query first to see what its
doing?  Do you have an index on your site point geometry?  Also make
sure to have one on the date and year fields.  Also as a note - not sure
if it applies for PostGreSQL but in my experience with databases - you
should never use a text field as a key as you have with obs_id.  You
should instead use a varchar or char.
 
My gut feeling tells me adding a geometry to the weather and solar
tables would not help in this case because you have so many more records
in those than you probably have in sites.  Given what you said, I'm
wondering if creating a cache table of the closest site with solar data
and weather data for a given year would be a better approach
 
So your table would be called say site_year and for each unique
station_id, obs_year would give you the station_id of the closest
weather site and closest solar site.
 
station_id, obs_year, ref_weather_station_id, ref_solar_station_id,
thepoint_meter, longname, lat, lon
 
Step 1 - fill in all the entries that have data for the same station and
leave nulls where there is no data for that year
INSERT INTO site_year(station_id, obs_year, ref_weather_station_id,
ref_solar_station_id)
SELECT CASE(WHEN sol.station_id IS NULL THEN w.station_id ELSE
sol.station_id) As station_id, (CASE WHEN sol.year IS NULL THEN w.year
ELSE sol.year) As obs_year, w.station_id As ref_weather_station_id,
sol.station_id As ref_sol_station_id
FROM (SELECT DISTINCT solar.station_id, solar.year FROM solar) sol FULL
JOIN (SELECT DISTINCT weather.station_id, weather.year FROM weather) w
ON sol.station_id = w.station_id AND sol.year= w.year
 
Step 2 - UPDATE site_year SET thepoint_meter = sites.thepoint_year,
longname = sites.longname, lat = sites.lat, lon = site.lon FROM sites
WHERE sites.station_id = site_year.station_id
Step 3 - CREATE INDEX idx_site_year_thepoint_meter ON site_year USING
gist(thepoint_meter);
            CREATE INDEX idx_site_year_obs_year ON site_year USING
btree(obs_year);
            ALTER TABLE site_year ADD CONSTRAINT pk_site_year PRIMARY
KEY(station_id, obs_year);
 
Step 4 - Now fill in the ref_solar_station_id of those that have no
solar data for that year - here I'm trying to use the point meter index
by assuming that you have no site that qualifies as closest whose
bounding box is more than 30 Km from its closest site bounding box.  I
know this one is pretty hairy so I probably screwed up since I couldn't
test it out.  
            UPDATE site_year SET ref_solar_station_id =
sol2.ref_solar_station_id
            FROM  (SELECT solmin.station_id, solmin.obs_year,
sol.ref_solar_station_id FROM
                                (SELECT s1.station_id,
s1.thepoint_meter, min(distance(s1.thepoint_meter, s2.thepoint_meter))
As mindist, s1.obs_year  
                                FROM site_year  s1 INNER JOIN site_year
s2 On s1.obs_year = s2.obs_year
                            WHERE s2.ref_solar_station_id IS NOT NULL
AND s1.ref_solar_station_id IS NULL
                                    AND expand(s2.thepoint_meter, 30000)
&& s1.thepoint_meter 
                                        GROUP BY s1.station_id,
s1.obs_year, s1.thepoint_meter) solmin
                            INNER JOIN site_year  sol ON  sol.obs_year =
solmin.obs_year AND (expand(solmin.thepoint_meter, 30000) &&
sol.thepoint_meter  AND distance(solmin.thepoint_meter,
sol.thepoint_meter) = solmin.mindist) ) sol2
            WHERE site_year.station_id = sol2.station_id AND
site_year.obs_year = sol2.obs_year
 
Step 5 - Do the same as Step 4 except for ref_weather_station_id
 
Step 6 - 
            CREATE INDEX idx_site_year_ref_solar_station_id ON site_year
USING btree(ref_solar_station_id);
        (you'll probably want another index on ref_weather_station_id)
 
Step 7 -  and your final query - hopefully I didn't screw it up too
badly
 
SELECT CASE WHEN w.station_id = site_year.station_id THEN w.obs_id ELSE
s.obs_id END As obs_id,  site_year.station_id, site_year.longname,
site_year.lat, 
site_year.lon, site_year.thepoint_meter,
       w.date, w.year, w.month, w.day, w.doy, w.precip, w.tmin, w.tmax, 
w.tmean, w.snowfall, w.snowdepth,
      s.date, s.year, s.month, s.day, s.doy, s.rad, s.dss, s.par
       FROM site_year INNER JOIN solar s ON
site_year.ref_solar_station_id = s.station_id  AND site_year.obs_year =
s.year
                INNER JOIN weather w ON site_year.ref_weather_station_id
= w.station_id AND site_year.obs_year = w.year
                        AND s.date = w.date

        
                                
                                

 
 
 
________________________________

From: postgis-users-bounces at postgis.refractions.net on behalf of Kirk
Wythers
Sent: Wed 2/14/2007 4:43 PM
To: PostGIS Users Discussion
Subject: Re: [postgis-users] joining "nearest" data




On Feb 14, 2007, at 3:04 PM, Obe, Regina wrote:

> Yes sorry for the typo.  Oh I kind of see what you are doing now 
> with the observe id after rereading what you had wrote.  So the 
> query may be a little trickier  than I had proposed because I 
> realize now that solars are observation points.
>
> Do you always take these observations from the same sites -- e.g. 
> for some days would you have solar data for a particular site and 
> then for a different day no solar data for that site - but solar 
> data for a different site or the sites are always consistent?  
> Sorry if that's  a stupid question - I'm not too familiar with how 
> this kind of data is collected.

No - not stupid at all. These data are from NOAA. The sites are 
always the same. For example site A, B, C. But they are historic data 
going back to the turn of the century. So that means that some sites 
may only have a certain period of observations (say 1940 - 1970, then 
nothing), some sites have only weather data, some sites have only 
solar data, and some sites have both.

Consider these three sites: A B C

A has weather and solar from 1900 - 2007 (but they are in different 
tables weather and solar)
B has weather from 1951 - 2007 (in the weather table)
C has solar from 1951 - 2007 (in the solar table... but it is pretty 
close location to B)

The goal would be to join the two tables with site A showing weather 
and solar data, Site B showing the site B weather and the solar from 
site C since it is the closest available solar data.





> If the sites you get solar data change from day to day, then you 
> might have to do a compound key join in the mindist query with date 
> and station_id.

No... one site should resolve as being the closest to another site... 
oh... I see the problem, what if one site has the closest available 
solar for some number of years, say up to 1980, then another site 
appears in the record with solar dat from 1990 - present that is 
actually closer than the first site.

This is getting complicated!

>
> Regarding your question about views.  Yes they are almost always 
> slower than caching the data in a table.  I usually try the view 
> out first though since it provides more flexibility and see how 
> slow it is and then resort to a static table if it is too slow.  
> Sometimes the speed difference is not that noticeable especially if 
> you have a clustered index on the station_id field.  Sometimes I 
> think the view can actually be faster because in this case your 
> station id table is probably much smaller than your weather solar 
> data so the number of geometries you'd be going thru would be 
> smaller and the planner would probably come out with a better 
> plan.  If the geometries are repeated in your solar/weather table, 
> then it may perceive those geometries as being different when they 
> are the same so may actually be doing more processing work when 
> doing distance checks.

I asked because original suggestion (with the corrected typos) is 
still running (several hours now) and I even limited it to one 
station and one year where there was both weather and solar data

WHERE sites.station_id = 212248 AND w.year = 1948;

Would this be easier if I created new solar and weather tables that 
included the geometry colum in addition to the station_id columns. 
Something like:

CREATE TABLE solar_full
(
   obs_id text NOT NULL,
   station_id integer,
   longname character varying,
   lat double precision,
   lon double precision,
   thepoint_meter geometry,
   date date,
   "year" smallint,
   "month" smallint,
   "day" smallint,
   doy smallint,
   rad real,
   dss smallint,
   par real,
CONSTRAINT solar_full_pkey PRIMARY KEY (obs_id)
)
WITHOUT OIDS;




>
> ________________________________
>
> From: postgis-users-bounces at postgis.refractions.net on behalf of 
> Kirk Wythers
> Sent: Wed 2/14/2007 3:12 PM
> To: PostGIS Users Discussion
> Subject: Re: [postgis-users] joining "nearest" data
>
>
>
> On Feb 14, 2007, at 6:20 AM, Obe, Regina wrote:
>
>
>      
>       SELECT w.obs_id,  sites.station_id, sites.longname, sites.lat, 
> sites.lon, sites.thepoint_meter,
>       w.date, w.year, w.month, w.day, w.doy, w.precip, w.tmin, w.tmax,

> w.tmean, w.snowfall, w.snowdepth,
>       s.date, s.year, s.month, s.day, s.doy, s.rad, s.dss, s.par
>       FROM sites INNER JOIN solar s ON sites.station_id = s.station_id

> LEFT OUTER JOIN weather w ON w.station_id = sites.station_id
>       UNION
>      
>       SELECT w.obs_id,  sites.station_id, sites.longname, sites.lat, 
> sites.lon, sites.thepoint_meter,
>       w.date, w.year, w.month, w.day, w.doy, w.precip, w.tmin, w.tmax,

> w.tmean, w.snowfall, w.snowdepth,
>       s.date, s.year, s.month, s.day, s.doy, s.rad, s.dss, s.par
>       FROM sites INNER JOIN
>       (SELECT s2.station_id, min(distance(s2.thepoint_meter, 
> sol.thepoint_meter))  As mindist
>       FROM station s2 LEFT JOIN solar nosols ON s2.station_id = 
> nosols.station_id CROSS JOIN vwsolar sol
>
>
> Regina, did you mean FROM sites s2 LEFT JOIN here? There is no 
> "station" table, nor have we aliased one
>
>
>
>      
>      
>      
>       WHERE nosols.station_id IS NULL
>       GROUP BY s2.station_id) solmin
>           ON sites.station_id = solmin.station_id INNER JOIN vwsolar s

> ON distance(sites.thepoint_meter, s.thepoint_meter) = 
> solmin.mindistance LEFT JOIN weather w ON w.station_id = 
> sites.station_id)
>
>
>
>
>
> -----------------------------------------
> The substance of this message, including any attachments, may be
> confidential, legally privileged and/or exempt from disclosure
> pursuant to Massachusetts law. It is intended
> solely for the addressee. If you received this in error, please
> contact the sender and delete the material from any computer.
> <winmail.dat>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users

_______________________________________________
postgis-users mailing list
postgis-users at postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-users


-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 19767 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20070215/590872b8/attachment.bin>


More information about the postgis-users mailing list