[postgis-users] Slow delete of duplicate rows

Stephen Woodbridge woodbri at swoodbridge.com
Mon Mar 5 11:01:45 PST 2012


Thanks Paul.

One more thought on this.

Don't forget to add any constraints and indexes back onto the new table 
AFTER it is populated, as this will be much faster than loading into an 
already-indexed table. You might also need to run 
select populate_geometry_columns(); if those entries need updating 
after you do the rename.
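For example, a minimal sketch of rebuilding the key and an index after the bulk load (the constraint and column names here are taken from the schema quoted below; the last step only matters if the table has geometry columns):

```sql
-- Recreate the primary key and any lookup indexes only AFTER the new
-- table is fully populated; one bulk index build is much faster than
-- maintaining the index row-by-row during the load.
ALTER TABLE portal.new_metadata
    ADD CONSTRAINT metadatakey PRIMARY KEY (idx);
CREATE INDEX new_metadata_unique_id_idx
    ON portal.new_metadata ("data_object.unique_id");

-- Refresh the geometry_columns metadata after the rename.
SELECT populate_geometry_columns();
```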

Doing the rename in a transaction is atomic as far as other processes 
and queries are concerned, so this method works well in production 
environments when you need to reload a table and don't want the old 
table to be offline while you are doing the updates or reload.
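As a cheap sanity check before doing the swap, the new table's row count can be compared with the number of distinct keys in the original (table and column names assumed from the schema quoted below):

```sql
-- Both counts should match: one surviving row per unique_id.
SELECT
    (SELECT count(*) FROM portal.new_metadata) AS new_rows,
    (SELECT count(DISTINCT "data_object.unique_id")
       FROM portal.metadata) AS distinct_keys;
```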

-Steve

On 3/5/2012 1:40 PM, Paul Ramsey wrote:
> This will be the fastest approach since it involves a new table
> creation instead of re-writing an existing table. Good call Stephen.
>
> P.
>
> On Mon, Mar 5, 2012 at 8:17 AM, Stephen Woodbridge
> <woodbri at swoodbridge.com>  wrote:
>> On 3/5/2012 11:12 AM, DrYSG wrote:
>>>
>>> I have a large table (20M records) of mostly short text fields. There are
>>> duplicates that I am trying to remove.  I have a bigserial column that is
>>> unique, but duplicates in another field.
>>>
>>> I have an 8 core, 24GB memory computer with RAID disks.
>>>
>>> This request has been running for 70 hours (is it safe to kill it?).
>>
>>
>> Yes, it is safe.
>>
>>
>>> How can I make this run faster? This is a one time processing task, but it
>>> is taking a long time.
>>
>>
>> CREATE TABLE portal.new_metadata AS
>> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
>>
>> Something of this ilk should be faster because it only needs to do a sort
>> on "data_object.unique_id" and then an insert. After you have verified the
>> results you can do:
>>
>> BEGIN;
>> ALTER TABLE portal.metadata RENAME TO metadata_old;
>> ALTER TABLE portal.new_metadata RENAME TO metadata;
>> COMMIT;
>>
>> Other people might have other ideas.
>> -Steve
>>
>>
>>> DELETE FROM portal.metadata
>>> WHERE idx NOT IN
>>> (
>>>      SELECT MIN(idx)
>>>      FROM portal.metadata
>>>      GROUP BY "data_object.unique_id"
>>> );
>>>
>>> CREATE TABLE
>>>      metadata
>>>      (
>>>          data_object.unique_id CHARACTER(64) NOT NULL,
>>>          size_bytes BIGINT,
>>>          object_date TIMESTAMP(6) WITHOUT TIME ZONE,
>>>          object_type CHARACTER VARYING(25),
>>>          classification CHARACTER VARYING(7),
>>>          object_source CHARACTER VARYING(50),
>>>          object_managed_date TIMESTAMP(6) WITHOUT TIME ZONE,
>>>          clevel INTEGER,
>>>          fsctlh CHARACTER VARYING(50),
>>>          oname CHARACTER VARYING(40),
>>>          description CHARACTER VARYING(80),
>>>          horizontal_datum CHARACTER VARYING(20),
>>>          do_location CHARACTER VARYING(200),
>>>          elevation_ft INTEGER,
>>>          location USER-DEFINED,
>>>          idx BIGINT DEFAULT nextval('portal.metadata_idx_seq'::regclass)
>>> NOT
>>> NULL,
>>>          bbox CHARACTER VARYING(160),
>>>          CONSTRAINT MetaDataKey PRIMARY KEY (idx)
>>>      )
>>>
>>> --
>>> View this message in context:
>>> http://postgis.17.n6.nabble.com/Slow-delete-of-duplicate-rows-tp4548251p4548251.html
>>> Sent from the PostGIS - User mailing list archive at Nabble.com.
>>> _______________________________________________
>>> postgis-users mailing list
>>> postgis-users at postgis.refractions.net
>>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>>
>>
