Geocoding with PAGC Primer (long)
Sampson, David
dsampson at NRCan.gc.ca
Fri Sep 22 13:28:47 EDT 2006
Geocoding with PAGC
Backgrounder
* The moment we started to leak word out about PAGC we quickly
realized that data issues swamped software issues.
* Like everything in this area, software and data issues go
hand-in-hand
* There is a lot of art in geocoding.
* It turns out that a 90% success rate is sort of the industry
standard in terms of geocoding addresses.
* Commercial companies boast about doing better than this, and they
typically do it by falling back on FSA (Forward Sortation Area)
centroids, which can be wildly off.
Priority: Augmented RNF
* PAGC tools work with augmented data that is currently locked up
under copyright. The priority is an effort to produce augmented data
free of copyright restrictions.
* Key to the kingdom
* Data issues are huge.
* With FSAs and cities/towns
* Once we get enough information into the StatsCan RNF, it could
be the basis for developing an open source standardizer.
Labour intensive aspects:
1. Determining what road segments to keep in an RNF that contains
only the FSA boundary roads for multi-FSA cities, and creating the
finished layers for those cities.
2. Determining what other linear features to use that are not
roads, e.g. railroads, rivers, etc.
3. Augmenting the Atlas of Canada populated place data to include
FSAs.
Populated place names with FSA
Urban:
1. Populated places like Ottawa have multiple FSAs per populated
place.
2. In urban areas with multiple FSAs, relying only on the RNF is
nearly sufficient for creating FSA polygons.
3. The other piece of information that is needed is the names of
the roads that typically act as FSA boundaries in these areas.
4. Railways and bike paths are also used as FSA boundaries in a few
instances.
5. This information is provided by Canada Post in their document
entitled "Canada.pdf".
6. We can get most of the boundaries by removing all but the FSA
boundary street segments from the RNF via ogr2ogr.
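The ogr2ogr step in the last urban item can be sketched roughly as below.
This is only an illustration under stated assumptions: the attribute name
NAME, the file names, and the street names are hypothetical placeholders
(the real boundary street names would have to be read off the Canada Post
maps). The script only builds the command line, using ogr2ogr's standard
-where option, and does not run it.

```python
import shlex


def build_boundary_filter(boundary_streets, name_field="NAME"):
    """Build a WHERE clause that keeps only the named streets.

    name_field is a hypothetical attribute name, not the real RNF schema.
    """
    for street in boundary_streets:
        # Names containing a quote are rejected rather than guessing at
        # the escaping rules of the SQL dialect.
        if "'" in street:
            raise ValueError(f"unsupported character in street name: {street}")
    quoted = ", ".join(f"'{street}'" for street in boundary_streets)
    return f"{name_field} IN ({quoted})"


def build_ogr2ogr_command(src, dst, where):
    """Return the argument list; ogr2ogr takes destination, then source."""
    return ["ogr2ogr", "-where", where, dst, src]


# Hypothetical inputs for illustration only.
streets = ["BANK", "CARLING"]
where = build_boundary_filter(streets)
cmd = build_ogr2ogr_command("rnf_ottawa.shp", "fsa_boundaries.shp", where)
print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to
inspect the filter before touching any data.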
Rural
1. Rural locations may have multiple populated areas per FSA.
2. What will work pretty well is using the populated places point
file from the Atlas of Canada and then adding a field to this file that
gives the FSA of each populated place.
3. Do this for populated places that have a population category
value of 1 (places in population category 2 and above are likely to have
multiple postal codes).
4. Once this is done, the Census CSD (Census Subdivision), or maybe
even the Census CD (Census Division), polygon layer can be used to
determine the rural FSA a CSD or CD falls into, followed by a dissolve
to merge the polygons that have the same assigned FSA.
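The rural procedure above can be sketched in plain Python. This is a toy
illustration, not the actual workflow: the CSD ids, coordinates, and FSA
values are made-up sample data, and the "dissolve" is represented only by
grouping CSD ids that share an FSA (a GIS dissolve would also merge the
geometries).

```python
from collections import defaultdict


def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_fsa(csds, places):
    """Map each CSD id to the FSA of the first populated place inside it."""
    assigned = {}
    for csd_id, ring in csds.items():
        for px, py, fsa in places:
            if point_in_polygon(px, py, ring):
                assigned[csd_id] = fsa
                break
    return assigned


def group_by_fsa(assigned):
    """Collect CSD ids per FSA: the groups a dissolve would merge."""
    groups = defaultdict(list)
    for csd_id, fsa in sorted(assigned.items()):
        groups[fsa].append(csd_id)
    return dict(groups)


# Toy data: three unit-square CSDs and three populated places with FSAs.
csds = {
    "csd_a": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "csd_b": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "csd_c": [(2, 0), (3, 0), (3, 1), (2, 1)],
}
places = [(0.5, 0.5, "K0A"), (1.5, 0.5, "K0A"), (2.5, 0.5, "K0B")]

assigned = assign_fsa(csds, places)
groups = group_by_fsa(assigned)
print(groups)
```

Taking the first populated place found in a CSD is a simplification; a CSD
containing places with different FSAs would need a rule for choosing one.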
Milestone Issues
* We aren't in a position to create the FSA data until StatsCan
releases the 2006 RNF and CSD layers, which should be in the next few
weeks (as of September 19, 2006).
* In the meantime, we have enough data (for Ottawa) to figure out
if I'm right on how to proceed.
* This has the side benefit of creating a fully augmented RNF for
Ottawa.
FAQs
1. Lists without postal codes or city info require manual
intervention: As for the POSTAL issue, some sort of local identifier is
needed. A CITY should suffice, and I will talk to Walter about the
possibility of requiring either one of the two.
2. Completely blowing away this postal check would be unwise,
however, since you know someone is going to try to geocode things based
on the RNF for the entire country. In that instance, only a POSTAL field
would solve the problem, which is why it is really important to create
an augmented RNF for Canada.
3. You should not have to touch either the rules or the
gazetteer files. PAGC was built with StatsCan (and US TIGER/Line) in
mind.
4. Don't get too nervous about the errors in the road index build
error file either.
5. Creating smaller RNF road files that don't have multi-part
lines: In terms of the road segment problem, I think a safer approach is
to use the polygon to attach a flag variable to the attribute table of
the road layer, and then use ogr2ogr with an option like -where "flag=1"
to select only those road segments that are in (or are on the border of)
the Ottawa polygon. This approach is extremely unlikely to "damage" a
road segment since ogr will extract the full segment intact. It does
mean that the Ottawa polygon will have a few "whiskers" (road segments
that partially lie outside the Ottawa polygon). Using the whisker
analogy, it appears that Open Jump snipped an "ingrown hair," and that
causes problems. I've been working on attaching the needed flag to the
RNF attribute table over the past few days (working on it for slightly
less than an hour a day), and should have something tonight or tomorrow.
6. PAGC chokes on line 960: It turns out the error is caused by the
961st road segment (which is given number 960 in the error file, since
counting starts at 0 in C).
7. Polylines are topologically OK but choke PAGC: Polylines
shouldn't have bounding boxes, ring direction, and so on. This road
segment has these extra attributes (which means it is actually being
written as a polygon or multi-part polyline, rather than a single
polyline), hence PAGC's complaining about it having too many parts.
8. What about FSA centroids from Geoconnections: The resulting FSA
centroid file is more of a dog's breakfast than one would think it would
be. In the case of BC, data is unavailable for two FSAs (V1K in Merritt
and V8B in Squamish), at least one FSA is put in the wrong community
(the one I know about is V1H which shows up in Prince George, but should
be in Vernon, several hundred km away), and the province field has the
wrong data in three cases. In addition, the rural FSA centroids are
problematic since there is a single centroid for areas that are enormous
(multiple hundreds if not thousands of square kilometres in size).
Finally, I get a few centroids that wind up in the Georgia Strait. This
can happen since the centroid of a polygon may fall outside the polygon
itself. All in all the FSA centroid file is useful, but it has a lot of
kinks.
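The last FAQ item notes that the centroid of a polygon may fall outside
the polygon itself. A minimal sketch of that effect, using a made-up
U-shaped polygon (not real FSA geometry), the standard shoelace centroid
formula, and a ray-casting containment test:

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        twice_area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    # The signed area is twice_area / 2, so the divisor 6 * A is 3 * twice_area.
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical U-shaped area: a 3x1 base with two 1x2 arms and a notch
# between them (think of a bay cutting into a coastal polygon).
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]

cx, cy = polygon_centroid(u_shape)
inside = point_in_polygon(cx, cy, u_shape)
print(cx, cy, inside)
```

Here the centroid lands in the notch between the two arms, so the
containment test reports it as outside the polygon. A check like this
could be used to flag suspect centroids before relying on them.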
Supplemental Products
1. FSA Centroids
2.
Derived Data Products:
1. intersection and cross roads
2. fully augmented RNF for Ottawa
3. an address standardizer database
4. a street intersection layer
Future Software Development
1. Address standardization: developed for direct marketing mail
campaigns to clean up dirty addresses.
2. R scripts have already been written to extract road intersections
from an RNF and attach attributes (like the names of the streets that
form the intersection), which form the input data for intersection
geocoding.
3. Walter already has code in PAGC for doing point matching.
4. Recode PAGC from a command prompt program to a library, and build
a command line program to call the library. This allows for inclusion
into other GIS software packages.
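The intersection extraction mentioned in item 2 was done with R scripts
that are not shown here. The following is a separate Python sketch of the
general idea, not a port of those scripts. It assumes each road segment is
a (street name, start node, end node) record and that segments are already
split wherever streets meet; the street names and coordinates are made up.

```python
from collections import defaultdict


def find_intersections(segments):
    """Return {node: sorted street names} for nodes where 2+ streets meet."""
    streets_at_node = defaultdict(set)
    for name, start, end in segments:
        # Record the street name at both endpoints of the segment.
        streets_at_node[start].add(name)
        streets_at_node[end].add(name)
    return {
        node: sorted(names)
        for node, names in streets_at_node.items()
        if len(names) >= 2
    }


# Toy network: two streets crossing at (0, 1).
segments = [
    ("BANK ST", (0, 0), (0, 1)),
    ("BANK ST", (0, 1), (0, 2)),
    ("SOMERSET ST", (-1, 1), (0, 1)),
    ("SOMERSET ST", (0, 1), (1, 1)),
]

result = find_intersections(segments)
print(result)
```

Nodes touched by only one street name (segment joins along the same
street, or dead ends) are dropped, so only true crossings of differently
named streets are reported.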
Resource Links:
Canada Post FSA Maps
http://www.canadapost.ca/common/tools/pg/fsamaps/pdf/Canada.pdf
Wikipedia Canadian FSA Lists:
http://en.wikipedia.org/wiki/List_of_A_Postal_Codes_of_Canada
Free or no licence issues
http://www.postalcodelookup.ca/
Unknown Licence issues
http://www.postescanada.ca/cpc2/addrm/hh/current/indexp/tpALL-e.asp
Other geocoders: (web)
http://geocoder.ca/
http://www.batchgeocode.com/
http://geoservices.cgdi.ca/postalcode/sample.html