[postgis-users] joining "nearest" data

Kirk Wythers kwythers at umn.edu
Thu Feb 15 16:17:28 PST 2007


On Feb 15, 2007, at 7:35 AM, Obe, Regina wrote:

> Kirk sorry - I just realized in my suggestions I didn't take into
> consideration the join needed between weather and solar so I  
> shudder to
> think how many records that query is returning.  I don't think you can
> join by obs_id because they could be in 2 different stations so you'll
> probably need to join by the date field and make sure your date fields
> are indexed.

obs_id is a concatenation between station_id and date, therefore  
there should be a unique obs_id for row in both the  weather table  
and the solar table. However, they probably are not indexed. I will  
look into that tonight and index them if they are not.


>
> You should always run an explain on your query first to see what its
> doing?

Thanks, I will start doing that.


> Do you have an index on your site point geometry?

yes

> Also make
> sure to have one on the date and year fields.

I'll do that too.

> Also as a note - not sure
> if it applies for PostGreSQL but in my experience with databases - you
> should never use a text field as a key as you have with obs_id.  You
> should instead use a varchar or char.

I am trying to remember how I combined the station_id and date to get  
obs_id. Unfortunately I am drawing a blank right now. My guess is  
that text was what postgresql defaulted to. For example the obs_id  
field looks like this:

21224801-01-1946

I will try changing it to varchar

>
> My gut feeling tells me adding a geometry to the weather and solar
> tables would not help in this case because you have so many more  
> records
> in those than you probably have in sites.

That is true. There are less that 500 sites, and more that 7 million  
rows in weather, and a quarter million rows in solar

> Given what you said, I'm
> wondering if creating a cache table of the closest site with solar  
> data
> and weather data for a given year would be a better approach

This sounds promising. Cache tables? I have some reading to do.

>
> So your table would be called say site_year and for each unique
> station_id, obs_year would give you the station_id of the closest
> weather site and closest solar site.

I think this sounds like the path I started down when I created  
obs_id (by combining station_id and date)

>
> station_id, obs_year, ref_weather_station_id, ref_solar_station_id,
> thepoint_meter, longname, lat, lon

I will give this all a try tonight.

>
> Step 1 - fill in all the entries that have data for the same  
> station and
> leave nulls where there is no data for that year
> INSERT INTO site_year(station_id, obs_year, ref_weather_station_id,
> ref_solar_station_id)
> SELECT CASE(WHEN sol.station_id IS NULL THEN w.station_id ELSE
> sol.station_id) As station_id, (CASE WHEN sol.year IS NULL THEN w.year
> ELSE sol.year) As obs_year, w.station_id As ref_weather_station_id,
> sol.station_id As ref_sol_station_id
> FROM (SELECT DISTINCT solar.station_id, solar.year FROM solar) sol  
> FULL
> JOIN (SELECT DISTINCT weather.station_id, weather.year FROM weather) w
> ON sol.station_id = w.station_id AND sol.year= w.year
>
> Step 2 - UPDATE site_year SET thepoint_meter = sites.thepoint_year,
> longname = sites.longname, lat = sites.lat, lon = site.lon FROM sites
> WHERE sites.station_id = site_year.station_id
> Step 3 - CREATE INDEX idx_site_year_thepoint_meter ON site_year USING
> gist(thepoint_meter);
>             CREATE INDEX idx_site_year_obs_year ON site_year USING
> btree(obs_year);
>             ALTER TABLE site_year ADD CONSTRAINT pk_site_year PRIMARY
> KEY(station_id, obs_year);
>
> Step 4 - Now fill in the ref_solar_station_id of those that have no
> solar data for that year - here I'm trying to use the point meter  
> index
> by assuming that you have no site that qualifies as closest whose
> bounding box is more than 30 Km from its closest site bounding box.  I
> know this one is pretty hairy so I probably screwed up since I  
> couldn't
> test it out.
>             UPDATE site_year SET ref_solar_station_id =
> sol2.ref_solar_station_id
>             FROM  (SELECT solmin.station_id, solmin.obs_year,
> sol.ref_solar_station_id FROM
>                                 (SELECT s1.station_id,
> s1.thepoint_meter, min(distance(s1.thepoint_meter, s2.thepoint_meter))
> As mindist, s1.obs_year
>                                 FROM site_year  s1 INNER JOIN  
> site_year
> s2 On s1.obs_year = s2.obs_year
>                             WHERE s2.ref_solar_station_id IS NOT NULL
> AND s1.ref_solar_station_id IS NULL
>                                     AND expand(s2.thepoint_meter,  
> 30000)
> && s1.thepoint_meter
>                                         GROUP BY s1.station_id,
> s1.obs_year, s1.thepoint_meter) solmin
>                             INNER JOIN site_year  sol ON   
> sol.obs_year =
> solmin.obs_year AND (expand(solmin.thepoint_meter, 30000) &&
> sol.thepoint_meter  AND distance(solmin.thepoint_meter,
> sol.thepoint_meter) = solmin.mindist) ) sol2
>             WHERE site_year.station_id = sol2.station_id AND
> site_year.obs_year = sol2.obs_year
>
> Step 5 - Do the same as Step 4 except for ref_weather_station_id
>
> Step 6 -
>             CREATE INDEX idx_site_year_ref_solar_station_id ON  
> site_year
> USING btree(ref_solar_station_id);
>         (you'll probably want another index on ref_weather_station_id)
>
> Step 7 -  and your final query - hopefully I didn't screw it up too
> badly
>
> SELECT CASE WHEN w.station_id = site_year.station_id THEN w.obs_id  
> ELSE
> s.obs_id END As obs_id,  site_year.station_id, site_year.longname,
> site_year.lat,
> site_year.lon, site_year.thepoint_meter,
>        w.date, w.year, w.month, w.day, w.doy, w.precip, w.tmin,  
> w.tmax,
> w.tmean, w.snowfall, w.snowdepth,
>       s.date, s.year, s.month, s.day, s.doy, s.rad, s.dss, s.par
>        FROM site_year INNER JOIN solar s ON
> site_year.ref_solar_station_id = s.station_id  AND  
> site_year.obs_year =
> s.year
>                 INNER JOIN weather w ON  
> site_year.ref_weather_station_id
> = w.station_id AND site_year.obs_year = w.year
>                         AND s.date = w.date
>
>
>
>
>
>
>
>
> ________________________________
>
> From: postgis-users-bounces at postgis.refractions.net on behalf of Kirk
> Wythers
> Sent: Wed 2/14/2007 4:43 PM
> To: PostGIS Users Discussion
> Subject: Re: [postgis-users] joining "nearest" data
>
>
>
>
> On Feb 14, 2007, at 3:04 PM, Obe, Regina wrote:
>
>> Yes sorry for the typo.  Oh I kind of see what you are doing now
>> with the observe id after rereading what you had wrote.  So the
>> query may be a little trickier  than I had proposed because I
>> realize now that solars are observation points.
>>
>> Do you always take these observations from the same sites -- e.g.
>> for some days would you have solar data for a particular site and
>> then for a different day no solar data for that site - but solar
>> data for a different site or the sites are always consistent?
>> Sorry if that's  a stupid question - I'm not too familiar with how
>> this kind of data is collected.
>
> No - not stupid at all. These data are from NOAA. The sites are
> always the same. For example site A, B, C. But they are historic data
> going back to the turn of the century. So that means that some sites
> may only have a certain period of observations (say 1940 - 1970, then
> nothing), some sites have only weather data, some sites have only
> solar data, and some sites have both.
>
> Consider these three sites: A B C
>
> A has weather and solar from 1900 - 2007 (but they are in different
> tables weather and solar)
> B has weather from 1951 - 2007 (in the weather table)
> C has solar from 1951 - 2007 (in the solar table... but it is pretty
> close location to B)
>
> The goal would be to join the two tables with site A showing weather
> and solar data, Site B showing the site B weather and the solar from
> site C since it is the closest available solar data.
>
>
>
>
>
>> If the sites you get solar data change from day to day, then you
>> might have to do a compound key join in the mindist query with date
>> and station_id.
>
> No... one site should resolve as being the closest to another site...
> oh... I see the problem, what if one site has the closest available
> solar for some number of years, say up to 1980, then another site
> appears in the record with solar dat from 1990 - present that is
> actually closer than the first site.
>
> This is getting complicated!
>
>>
>> Regarding your question about views.  Yes they are almost always
>> slower than caching the data in a table.  I usually try the view
>> out first though since it provides more flexibility and see how
>> slow it is and then resort to a static table if it is too slow.
>> Sometimes the speed difference is not that noticeable especially if
>> you have a clustered index on the station_id field.  Sometimes I
>> think the view can actually be faster because in this case your
>> station id table is probably much smaller than your weather solar
>> data so the number of geometries you'd be going thru would be
>> smaller and the planner would probably come out with a better
>> plan.  If the geometries are repeated in your solar/weather table,
>> then it may perceive those geometries as being different when they
>> are the same so may actually be doing more processing work when
>> doing distance checks.
>
> I asked because original suggestion (with the corrected typos) is
> still running (several hours now) and I even limited it to one
> station and one year where there was both weather and solar data
>
> WHERE sites.station_id = 212248 AND w.year = 1948;
>
> Would this be easier if I created new solar and weather tables that
> included the geometry colum in addition to the station_id columns.
> Something like:
>
> CREATE TABLE solar_full
> (
>    obs_id text NOT NULL,
>    station_id integer,
>    longname character varying,
>    lat double precision,
>    lon double precision,
>    thepoint_meter geometry,
>    date date,
>    "year" smallint,
>    "month" smallint,
>    "day" smallint,
>    doy smallint,
>    rad real,
>    dss smallint,
>    par real,
> CONSTRAINT solar_full_pkey PRIMARY KEY (obs_id)
> )
> WITHOUT OIDS;
>
>
>
>
>>
>> ________________________________
>>
>> From: postgis-users-bounces at postgis.refractions.net on behalf of
>> Kirk Wythers
>> Sent: Wed 2/14/2007 3:12 PM
>> To: PostGIS Users Discussion
>> Subject: Re: [postgis-users] joining "nearest" data
>>
>>
>>
>> On Feb 14, 2007, at 6:20 AM, Obe, Regina wrote:
>>
>>
>>
>>       SELECT w.obs_id,  sites.station_id, sites.longname, sites.lat,
>> sites.lon, sites.thepoint_meter,
>>       w.date, w.year, w.month, w.day, w.doy, w.precip, w.tmin,  
>> w.tmax,
>
>> w.tmean, w.snowfall, w.snowdepth,
>>       s.date, s.year, s.month, s.day, s.doy, s.rad, s.dss, s.par
>>       FROM sites INNER JOIN solar s ON sites.station_id =  
>> s.station_id
>
>> LEFT OUTER JOIN weather w ON w.station_id = sites.station_id
>>       UNION
>>
>>       SELECT w.obs_id,  sites.station_id, sites.longname, sites.lat,
>> sites.lon, sites.thepoint_meter,
>>       w.date, w.year, w.month, w.day, w.doy, w.precip, w.tmin,  
>> w.tmax,
>
>> w.tmean, w.snowfall, w.snowdepth,
>>       s.date, s.year, s.month, s.day, s.doy, s.rad, s.dss, s.par
>>       FROM sites INNER JOIN
>>       (SELECT s2.station_id, min(distance(s2.thepoint_meter,
>> sol.thepoint_meter))  As mindist
>>       FROM station s2 LEFT JOIN solar nosols ON s2.station_id =
>> nosols.station_id CROSS JOIN vwsolar sol
>>
>>
>> Regina, did you mean FROM sites s2 LEFT JOIN here? There is no
>> "station" table, nor have we aliased one
>>
>>
>>
>>
>>
>>
>>       WHERE nosols.station_id IS NULL
>>       GROUP BY s2.station_id) solmin
>>           ON sites.station_id = solmin.station_id INNER JOIN  
>> vwsolar s
>
>> ON distance(sites.thepoint_meter, s.thepoint_meter) =
>> solmin.mindistance LEFT JOIN weather w ON w.station_id =
>> sites.station_id)
>>
>>
>>
>>
>>
>> -----------------------------------------
>> The substance of this message, including any attachments, may be
>> confidential, legally privileged and/or exempt from disclosure
>> pursuant to Massachusetts law. It is intended
>> solely for the addressee. If you received this in error, please
>> contact the sender and delete the material from any computer.
>> <winmail.dat>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
>
> <winmail.dat>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users




More information about the postgis-users mailing list