[postgis-users] Slow delete of duplicate rows

Nicolas Ribot nicolas.ribot at gmail.com
Mon Mar 5 09:15:16 PST 2012


Hi,

Try a correlated "EXISTS" construct instead. I found it is an order of
magnitude faster than "NOT IN".

It should be something like this: delete every row for which a
duplicate with a smaller idx exists (test the query before using it,
I'm not sure of it :)

DELETE FROM portal.metadata m
WHERE EXISTS
(
    SELECT 1
    FROM portal.metadata m2
    WHERE m2."data_object.unique_id" = m."data_object.unique_id"
      AND m2.idx < m.idx
);
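As a sanity check of the pattern (not from this thread): a minimal sketch
using Python's sqlite3 with a toy schema, where "uid" stands in for
"data_object.unique_id". It shows that the correlated delete keeps
MIN(idx) per key, the same result the original NOT IN query was after:

```python
import sqlite3

# Toy stand-in for portal.metadata: idx is the unique serial-style key,
# uid is the column that contains duplicates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (idx INTEGER PRIMARY KEY, uid TEXT)")
conn.executemany(
    "INSERT INTO metadata (idx, uid) VALUES (?, ?)",
    [(1, "a"), (2, "a"), (3, "b"), (4, "b"), (5, "b"), (6, "c")],
)

# Correlated delete: drop every row for which a duplicate with a smaller
# idx exists, i.e. keep MIN(idx) per uid.
conn.execute("""
    DELETE FROM metadata
    WHERE EXISTS (
        SELECT 1 FROM metadata m2
        WHERE m2.uid = metadata.uid AND m2.idx < metadata.idx
    )
""")

rows = conn.execute("SELECT idx, uid FROM metadata ORDER BY idx").fetchall()
print(rows)  # [(1, 'a'), (3, 'b'), (6, 'c')]
```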

Nicolas

On 5 March 2012 17:17, Stephen Woodbridge <woodbri at swoodbridge.com> wrote:
> On 3/5/2012 11:12 AM, DrYSG wrote:
>>
>> I have a large table (20M records) of mostly short text fields. There are
>> duplicates that I am trying to remove. I have a bigserial index that is
>> unique, but there are duplicates in another field.
>>
>> I have an 8 core, 24GB memory computer with RAID disks.
>>
>> This request has been running for 70 hours (is it safe to kill it?).
>
>
> Yes, it is safe.
>
>
>> How can I make this run faster? This is a one time processing task, but it
>> is taking a long time.
>
>
> CREATE TABLE portal.new_metadata AS
> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
>
> Or something of this ilk should be faster, because it only needs to do a
> sort on "data_object.unique_id" and then an insert. After you have
> verified the results you can do:
>
> BEGIN;
> ALTER TABLE portal.metadata RENAME TO metadata_old;
> ALTER TABLE portal.new_metadata RENAME TO metadata;
> COMMIT;
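The recreate-then-swap recipe above can be sketched end to end; a minimal
Python/sqlite3 sketch with hypothetical names (sqlite has no DISTINCT ON,
so a GROUP BY/MIN(idx) join picks one row per key instead), where "uid"
again stands in for "data_object.unique_id":

```python
import sqlite3

# Toy stand-in for portal.metadata with one extra payload column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (idx INTEGER PRIMARY KEY, uid TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "a", "y"), (3, "b", "z")],
)

# Build a deduplicated copy in one pass (keep the smallest idx per uid):
# a sequential table rewrite instead of millions of per-row lookups.
conn.execute("""
    CREATE TABLE new_metadata AS
    SELECT m.* FROM metadata m
    JOIN (SELECT uid, MIN(idx) AS idx FROM metadata GROUP BY uid) keep
        USING (uid, idx)
""")

# After verifying the copy, swap the tables.
conn.executescript("""
    ALTER TABLE metadata RENAME TO metadata_old;
    ALTER TABLE new_metadata RENAME TO metadata;
""")

rows = conn.execute(
    "SELECT idx, uid, payload FROM metadata ORDER BY idx"
).fetchall()
print(rows)  # [(1, 'a', 'x'), (3, 'b', 'z')]
```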
>
> Other people might have other ideas.
> -Steve
>
>
>> DELETE FROM portal.metadata
>> WHERE idx NOT IN
>> (
>>     SELECT MIN(idx)
>>     FROM portal.metadata
>>     GROUP BY "data_object.unique_id"
>> );
>>
>> CREATE TABLE
>>     metadata
>>     (
>>         data_object.unique_id CHARACTER(64) NOT NULL,
>>         size_bytes BIGINT,
>>         object_date TIMESTAMP(6) WITHOUT TIME ZONE,
>>         object_type CHARACTER VARYING(25),
>>         classification CHARACTER VARYING(7),
>>         object_source CHARACTER VARYING(50),
>>         object_managed_date TIMESTAMP(6) WITHOUT TIME ZONE,
>>         clevel INTEGER,
>>         fsctlh CHARACTER VARYING(50),
>>         oname CHARACTER VARYING(40),
>>         description CHARACTER VARYING(80),
>>         horizontal_datum CHARACTER VARYING(20),
>>         do_location CHARACTER VARYING(200),
>>         elevation_ft INTEGER,
>>         location USER-DEFINED,
>>         idx BIGINT DEFAULT nextval('portal.metadata_idx_seq'::regclass) NOT NULL,
>>         bbox CHARACTER VARYING(160),
>>         CONSTRAINT MetaDataKey PRIMARY KEY (idx)
>>     )
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
>


