[postgis-tickets] [PostGIS] #1075: Too much info in address breaks geocoder
PostGIS
trac at osgeo.org
Tue Jan 29 11:49:00 PST 2013
#1075: Too much info in address breaks geocoder
----------------------------+-----------------------------------------------
Reporter: mikepease | Owner: robe
Type: defect | Status: new
Priority: medium | Milestone: PostGIS 2.1.0
Component: tiger geocoder | Version: trunk
Keywords: |
----------------------------+-----------------------------------------------
Comment(by woodbri):
I was building on mingw64 and it was crashing there, but it might be that
my build environment is not clean. Let me know whether it works on
mingw64.
Regarding abbreviations, or more generally how things get standardized: if
you look at the lex and gaz tables, they have these columns:
{{{
id serial NOT NULL,
seq integer,
word character varying, -- word to find in input text
stdword character varying, -- word to standardize it to
token integer, -- token classification for the word
}}}
Input symbols are classified as follows (see pagc_api.h; these are a mix
of input and output symbols):
{{{
#define NUMBER 0
#define WORD 1
#define TYPE 2
#define ROAD 6
#define STOPWORD 7
#define DASH 9
#define CITY 10
#define PROV 11
#define NATION 12
#define AMPERS 13
#define ORD 15
#define SINGLE 18
#define BUILDH 19
#define MILE 20
#define DOUBLE 21
#define DIRECT 22
#define MIXED 23
#define BUILDT 24
#define FRACT 25
#define PCT 26
#define PCH 27
#define QUINT 28
#define QUAD 29
}}}
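As a sketch of how an abbreviation gets standardized (the table name here is illustrative; check your install for the actual lex table name), you would add a row mapping the input word to its standard form and classify it with one of the tokens above:
{{{
-- hypothetical example: map the abbreviation 'AV' to 'AVENUE',
-- classified as a street TYPE (token 2 from pagc_api.h)
INSERT INTO lex (seq, word, stdword, token)
VALUES (1, 'AV', 'AVENUE', 2);
}}}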
It is possible (and fine) for a word to be classified as multiple tokens.
When words have multiple tokens, you get all the possible combinations of
them. So if you got something like:
{{{
23 A Street --> 0,[1, 18], 2
}}}
then this would get evaluated as two sequences of tokens:
{{{
0, 1, 2
0, 18, 2
}}}
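As a sketch, that cross-product of per-word token alternatives can be reproduced in plain SQL (the VALUES lists are just the token alternatives from the example above):
{{{
-- each word contributes its possible tokens; the implicit cross join
-- enumerates every candidate token sequence
SELECT w1.t, w2.t, w3.t
FROM (VALUES (0)) AS w1(t),        -- "23"     -> NUMBER
     (VALUES (1), (18)) AS w2(t),  -- "A"      -> WORD or SINGLE
     (VALUES (2)) AS w3(t);        -- "Street" -> TYPE
-- yields the two sequences (0,1,2) and (0,18,2)
}}}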
The evaluation code uses the rules tables to transform input sequences
into output sequences and, based on probabilities assigned to the rules,
scores them to pick the most likely sequence. This is done in gamma.c.
Anyway, you can add to the lexicon, gazetteer, and rules as needed. While
the tables I provided are a good starting point for Tiger, they are not
perfect, and I generally find it worthwhile to use two tables:
tiger data table -> standardized table
Then I can look at which records did not standardize, to determine whether
new records in the lex or gaz tables, or new rules, are needed. All my
queries are then run off the standardized table, except when I need
address ranges or geometry, which I get by joining to the tiger table on
the gid that is common to both.
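As a sketch of that two-table workflow (all table and column names except gid are hypothetical placeholders for whatever your tiger and standardized tables actually use):
{{{
-- find records that failed to standardize, to mine for
-- missing lex/gaz entries or missing rules
SELECT t.*
FROM tiger_data t
LEFT JOIN standardized s ON s.gid = t.gid
WHERE s.gid IS NULL;

-- normal queries hit the standardized table; join back on gid
-- only when address ranges or geometry are needed
SELECT s.*, t.fromhn, t.tohn, t.the_geom
FROM standardized s
JOIN tiger_data t ON t.gid = s.gid;
}}}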
--
Ticket URL: <http://trac.osgeo.org/postgis/ticket/1075#comment:5>
PostGIS <http://trac.osgeo.org/postgis/>