[postgis-devel] PAGC Address Standardizer some thoughts on how to organize

Sat Jul 5 06:52:48 PDT 2014

On 7/5/2014 2:41 AM, Paragon Corporation wrote:
>
>
>
>
>>>>
>>>> 1) Create folder in extensions of our repo and move the
>>>> address_standardizer
>>>> extension files there.
>>>> I'd still like it to be able to be built separately if people wish
>>>> (similar
>>>> to how we have liblwgeom I think) and my only reservation with
>>>> breaking out
>>>> like this is that it makes it less compact.
>
>
>> I would defer to strk on this, but can we use a symlink to get the file
>> to appear in two places under svn?
>
> Let's stay away from symlinks - they don't behave well under Windows.  I
> think for now we'll
> just leave the extension files where they are and decide what to do later.
> Perhaps just copy the control files as part of our extension make scripts.

I'm ok with this. The reason I suggested symlinks (knowing that Windows
does not handle them) is that I thought I read that SVN handles them
internally: when you check out on Windows it makes a copy of the file
and records that it is a symlink, so if you change the file and commit,
it is smart enough to apply the changes to the original file and not
the copy. I have not tried this yet, so I may not have understood it
correctly.

> I was the one who started the extension folder primarily because when I was
> formulating the extension install
> process, it was easier not to affect the rest of the code base while I was
> fleshing things out.
>
> That said the only issue is the documentation auto comments generation that
> I would like address_standardizer to
> have similar to the other extensions we have.  As well as the versioning
> plumbing that all PostGIS extensions share.

I'm ok with doing whatever is right. There are a lot of details about how
PostGIS works in the development and release processes that I'm not yet
familiar with.

> Let's see how things go keeping things where they are unless someone has a
> better idea.

Agreed.

>>
>>>
>>> 4) Build separate extensions for the custom gaz/lex/rules currently
>>> present
>>> and add more. Right now to run the packaged dictionaries you need to
>>> run the
>>> lex,gaz,rules.sql files which is cumbersome from a newbie stand-point.
>>> This one I'm actually thinking just rolling the current one in the base
>>> extension and then having extensions for custom ones. Since at least US
>>> people will just use the base one or if they are using tiger geocoder the
>>> tiger geocoder one already packaged with tiger geocoder extension.
>>
>>     this is where things get muddy ... Like so many software projects, a
>> broad generalized architecture ends up covering a
>> common use case, and the rest is then in the way or collects dust as
>> focus narrows. It *is* great to have a generalized address parsing
>> engine.. but how this lib got here is,
>> it's been difficult to modernize and put sufficient time into a small
>> niche utility - Steve told me so..
>> A "pragmatic" move would be to tightly configure the lex/gaz/etc to the
>> TIGER Geocoder
>> and ship it.. but, not using the capacity of the lib. On the other hand,
>> if the generalized,
>> multinational promise is pursued, who is going to build it out? Where
>> are the OSM people ?
>> I am interested sure but this is dense going.. Steve and Regina but are
>> there enough hands ?
>> no clear answers here...
>
>> OK, I have thoughts on this, along the following lines which have to do
>> with the longer term. I think we should have multiple packages for the
>> set of gaz/lex/rules, that can be used for different data sets.
> Agree
>
>>   I would keep data you have for the
> Tiger Geocoder with that application and keep the generic files that
> came with the address standardizer as a more generic set that other
> people can use to make custom changes to.
>
> So question then is do we just include a sample as part of the extension, or
> we create an example
> extension that demonstrates the concept of the files.
>
> I'm thinking just packaging it along with the main will be easier, but maybe
> call the tables
>
> sample_lex, sample_gaz, sample_rules
>
>
> Or something so people know these get overwritten if they upgrade and they
> should build their own.
> It will also make writing the examples in doco easier if we have sample
> tables people can reference to see how it works.

I don't like the idea of including them as part of the address
standardizer extension for the following reasons:

1. Calling them sample_lex, sample_gaz, sample_rules just loads junk
that cannot easily be removed because it is bundled with the extension.

2. We might have multiple lex, gaz, rules for different countries/data
sets, like Tiger, Canada, UK, France, Germany, etc., or for other use cases.

Would it be ok if we created an address_standardizer_sample_data
extension that loaded sample_lex, sample_gaz, sample_rules? This would
make sense for documentation and testing. And we could extend that to
create additional packages in the future if we decide to do that.

I would bundle the Tiger Geocoder files with the Tiger Geocoder extension.

The rationale here is that files that are part of an application should 
get loaded with that application, but if I'm loading just the address 
standardizer then I should choose which data files I want to load if any 
because I am likely building my own application and may be loading my 
own versions of the data files.
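
As a rough illustration of this split, a separate sample-data package
could be as small as a control file plus one install script. The Python
sketch below only generates the text of those two files. The version
number, the helper names, and the table column layouts are assumptions
made for illustration, not the project's actual files.

```python
# Sketch only: builds the text of the two files a PostgreSQL extension
# needs (a .control file and a versioned install script).

EXT_NAME = "address_standardizer_sample_data"  # name proposed above
EXT_VERSION = "0.1"                            # hypothetical version

# Assumed column layouts; adjust to match the real lex/gaz/rules tables.
WORD_COLUMNS = ("id serial PRIMARY KEY, seq integer, word text, "
                "stdword text, token integer")
TABLES = {
    "sample_lex": WORD_COLUMNS,
    "sample_gaz": WORD_COLUMNS,
    "sample_rules": "id serial PRIMARY KEY, rule text",
}


def control_file(version, requires):
    """Return the body of the extension's .control file."""
    lines = [
        f"comment = 'Sample lex, gaz and rules tables for {requires}'",
        f"default_version = '{version}'",
        "relocatable = true",
        f"requires = '{requires}'",
    ]
    return "\n".join(lines) + "\n"


def install_script(name, tables):
    """Return the body of the install script that creates the tables."""
    # Guard line so the script is not sourced directly from psql.
    out = [f'\\echo Use "CREATE EXTENSION {name}" to load this file. \\quit']
    for table, columns in tables.items():
        out.append(f"CREATE TABLE {table} ({columns});")
    return "\n".join(out) + "\n"


def build_extension_files(name=EXT_NAME, version=EXT_VERSION, tables=TABLES):
    """Map each file name to its generated content."""
    return {
        f"{name}.control": control_file(version, "address_standardizer"),
        f"{name}--{version}.sql": install_script(name, tables),
    }


if __name__ == "__main__":
    for filename, content in build_extension_files().items():
        print(f"--- {filename}")
        print(content)
```

Because the tables would belong to their own extension, dropping that
extension would remove them without touching address_standardizer
itself, which is the point of reason 1 above.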

>>> My personal take is -- change the TIGER Geocoder for 2.2+ and break
>>> compatibility ..
>>> Whatever is convenient.. very much unlike the overall PostGIS project,
>>> there are
>>> few if any  'production systems' depending on the details, and damn them
>>> anyway if they whine
>>
>
>> I don't have a strong opinion on this one, but my take would be to leave
>> the current setup as is. We have talked about a total rewrite of the
>> Tiger Geocoder to make it more generic and to follow more of the ideas
>> that I have put into my geocoder. This would be the place to make
>> breaking changes. My current geocoder uses 95+% of its code and I can
>> load Tiger, Navteq, or Canada data into it. Longer term I would like to
>> extend this to be able to load Navteq or TeleAtlas or other data for
>> Western Europe, but we need to make some changes to address standardizer
>> and parser to handle accents and parse input in non-English countries.
>> This would give us a Geocoder capability that would be on par with
>> Oracle Spatial's Geocoder.
>
> I like that idea better - start with a clean slate then I won't be tempted
> to salvage anything that shouldn't be salvaged.

I originally looked at the Tiger Geocoder but it is too tied to the legacy
Tiger structure. After dealing with the data loading to a more abstract
table structure that could be used with address standardizer it only
took a week to implement the core geocoding engine. So a clean slate
rewrite could greatly simplify the code and make it much easier to
support and enhance. I'll try to start a white paper that discusses how I
did this and that might be a good starting point for discussing the
design and rewrite for a future release.

Lots of good ideas!

Thanks,
   -Steve

> Thanks,
> Regina
>
>
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-devel
>