[postgis-devel] shp2pgsql transactions

Mon Oct 19 09:56:13 PDT 2009

Stephen Frost wrote:
> Michael,
> 
> * Michael Orlitzky (michael at orlitzky.com) wrote:
>> I am attempting to import the TIGER/Line road data, and have noticed  
>> that the line identifiers (tlid) are duplicated across county  
>> boundaries. 
> 
> I'm getting ready to load TIGER 2009 myself actually.
> 
>> The result is that some roads and their associated  
>> geometries are present in the database multiple times.
> 
> Is everything identical between them?  I presume so.

Everything except the county FIPS id (countyfp). The same issue probably 
occurs across state lines too, but I haven't spotted any amongst the 
county ones, which are numerous.

>> I imagine this  
>> will cause problems in the future for e.g. k-shortest path, and so would  
>> like to eliminate the duplicates. I see two options:
>>
>> 1  Find and eliminate the duplicates in the DB. Would be terribly slow
>>    with enough data.
>>
>> 2  Prevent the duplicates from being inserted with a unique index. Also
>>    slow, but better than the first option.
>>
>> Of these, the second seems more desirable. But, to do so, I would need  
>> to insert the rows one at a time outside of a transaction.
> 
> There's no such thing as 'outside of a transaction'.  What you would
> actually be doing is performing a transaction with *every* insert, which
> would be *horribly* slow.

Each INSERT is slower, yes, but my completely wild guess is that overall 
it's faster. With a unique index, it checks each INSERT for collisions. 
However, at any given point there are fewer records in the table, since 
the duplicates never make it into the table in the first place. The 
index itself also expedites the lookup.

>> Right now,  
>> I'm simply filtering the shp2pgsql output with sed. This works, but is  
>> slower than necessary.
> 
> If you've pulled out the begin/ends using sed, it's going to go *alot*
> slower.
> 
>> Would there be interest in a feature request or patch to make the  
>> transactions optional?
> 
> Probably not. :)
> 
> My recommendation would be to load the data into separate tables, then
> use 'select distinct on (tlid) * from (select * from table1 union all
> select * from table2 ...)'.  This will force a sort/unique based on tlid
> into the new table.

You can add the indices to each table, but you can't make Postgres use 
them for the big SELECT.

> Loading the data into a table which doesn't have any indexes on it will
> go alot faster too..

Of course, but you make up for it when you're doing the SELECT DISTINCT. 
Both ways are slow, but when you never insert the duplicates in the 
first place, you keep the number of rows down. On the other hand, 
Postgres can probably optimize the distinct query a bit. But then the 
amount of data you can load is CPU/RAM bound...

> Unfortunately, I don't see a way to convince PG to use an existing index
> with a distinct-on type query.  Ideally, we could create indexes on each
> of the smaller tables and PG would do an in-order index traversal of
> each table into a merge-join type node.

Right. That's why I think unique inserts might be faster. I should 
probably just test both ways.