[postgis-tickets] [PostGIS] #1075: Too much info in address breaks geocoder

PostGIS trac at osgeo.org
Tue Jan 29 11:49:00 PST 2013


#1075: Too much info in address breaks geocoder
----------------------------+-----------------------------------------------
 Reporter:  mikepease       |       Owner:  robe         
     Type:  defect          |      Status:  new          
 Priority:  medium          |   Milestone:  PostGIS 2.1.0
Component:  tiger geocoder  |     Version:  trunk        
 Keywords:                  |  
----------------------------+-----------------------------------------------

Comment(by woodbri):

 I was building on ming64 and it was crashing there, but it might be that
 my build environment is not clean. Let me know if it works or not on
 ming64.

 Regarding abbreviations, or more generally how things get standardized:
 if you look at the lex and gaz tables, they have these columns:
 {{{
   id serial NOT NULL,
   seq integer,
   word character varying,    -- word to find in input text
   stdword character varying, -- word to standardize it to
   token integer,             -- token classification for the word
 }}}
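
 For example, adding an abbreviation mapping might look like this (a
 sketch only -- the columns are from above, but the stdword convention
 and token code are assumptions; token 2 is TYPE in the classification
 list below):

 {{{
 -- sketch: map STREET to the (assumed) standard form ST and
 -- classify it as a street type (token 2 = TYPE, see below)
 INSERT INTO lex (seq, word, stdword, token)
 VALUES (1, 'STREET', 'ST', 2);
 }}}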

 Input symbols are classified as follows (see "pagc_api.h"; note that
 these are a mix of input and output symbols):

 {{{
 #define NUMBER 0
 #define WORD 1
 #define TYPE 2

 #define ROAD 6
 #define STOPWORD 7

 #define DASH 9
 #define CITY 10
 #define PROV 11
 #define NATION 12
 #define AMPERS 13

 #define ORD 15

 #define SINGLE 18
 #define BUILDH 19
 #define MILE 20
 #define DOUBLE 21
 #define DIRECT 22
 #define MIXED 23
 #define BUILDT 24
 #define FRACT 25
 #define PCT 26
 #define PCH 27
 #define QUINT 28
 #define QUAD 29
 }}}
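
 As a concrete (hypothetical) illustration of the next point, lex rows
 like these would give the single letter A two classifications, WORD (1)
 and SINGLE (18):

 {{{
 -- hypothetical rows: A is classified both as a plain WORD (1)
 -- and as a SINGLE letter (18), so both tokens are emitted for it
 -- (seq values here are arbitrary)
 INSERT INTO lex (seq, word, stdword, token) VALUES
   (1, 'A', 'A', 1),
   (2, 'A', 'A', 18);
 }}}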

 It is possible for a word to be classified as multiple tokens, which is
 fine. When a word has multiple tokens, you get all the possible
 combinations. So if you got something like:

 {{{
 23 A Street  -->  0,[1, 18], 2
 }}}

 and this would get evaluated as two sequences of tokens:

 {{{
 0, 1, 2
 0, 18, 2
 }}}

 The evaluation code uses the rules tables to transform input sequences
 into output sequences and, based on probabilities assigned to the rules,
 scores them to find the most likely sequence. This is done in gamma.c.
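
 Conceptually the lookup behaves like this simplified sketch (the real
 rule encoding and scoring in gamma.c are more involved; the table layout
 and probabilities here are made up):

 {{{
 -- simplified sketch: score candidate input sequences against rules
 CREATE TEMP TABLE rules (
   input_seq  int[],            -- token sequence to match
   output_seq int[],            -- standardized output sequence
   prob       double precision  -- probability assigned to the rule
 );
 INSERT INTO rules VALUES
   (ARRAY[0,1,2],  ARRAY[0,1,2], 0.75),
   (ARRAY[0,18,2], ARRAY[0,1,2], 0.60);
 -- pick the most likely interpretation of the two candidates
 SELECT output_seq, prob
 FROM rules
 WHERE input_seq IN (ARRAY[0,1,2], ARRAY[0,18,2])
 ORDER BY prob DESC
 LIMIT 1;
 }}}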

 Anyway, you can add to the lexicon and gazetteer and to the rules as
 needed. While the tables I provided are a good starting point for Tiger,
 they are not perfect, and I generally find it worthwhile to use two
 tables:

 tiger data table -> standardized table

 Then I can look at which records did not standardize to determine
 whether new records in the lex or gaz tables are needed or new rules are
 needed. All my queries are then done off the standardized table, except
 when I need address ranges or geometry, which I get by joining to the
 tiger table using the gid that is common to both.
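
 In SQL terms that workflow looks roughly like this (table and column
 names are hypothetical):

 {{{
 -- find source records that failed to standardize, i.e. candidates
 -- for new lex/gaz entries or new rules
 SELECT t.gid, t.fullname
 FROM tiger_roads t
 LEFT JOIN roads_std s USING (gid)
 WHERE s.gid IS NULL;

 -- normal queries run off the standardized table; join back to the
 -- tiger table by gid only when ranges or geometry are needed
 SELECT s.name, s.suftype, t.fromhn, t.tohn, t.the_geom
 FROM roads_std s
 JOIN tiger_roads t USING (gid)
 WHERE s.name = 'MAIN' AND s.suftype = 'ST';
 }}}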

-- 
Ticket URL: <http://trac.osgeo.org/postgis/ticket/1075#comment:5>
PostGIS <http://trac.osgeo.org/postgis/>
The PostGIS Trac is used for bug, enhancement & task tracking, a user and developer wiki, and a view into the subversion code repository of the PostGIS project.

