[postgis-users] Slow delete of duplicate rows
Nicolas Ribot
nicolas.ribot at gmail.com
Mon Mar 5 09:15:16 PST 2012
Hi,
Try the "NOT EXISTS" construct. I found it is order of magnitude
faster than "NOT IN".
It should be something like (test the query before using it, i'm not
sure of it :)
DELETE FROM portal.metadata
WHERE NOT EXISTS
(
SELECT MIN(idx)
FROM portal.metadata
GROUP BY "data_object.unique_id"
);
Nicolas
On 5 March 2012 17:17, Stephen Woodbridge <woodbri at swoodbridge.com> wrote:
> On 3/5/2012 11:12 AM, DrYSG wrote:
>>
>> I have a large table (20M records) but mostly short text fields. There are
>> duplicates that I am trying to remove. I have a bigseriel index that I
>> unique, but duplicates in another field.
>>
>> I have an 8 core, 24GB memory computer with RAID disks.
>>
>> This request has been running for 70 hours (is it safe to kill it?).
>
>
> Yes, it is safe.
>
>
>> How can I make this run faster? This is a one time processing task, but it
>> is taking a long time.
>
>
> CREATE TABLE portal.new_metatdata AS
> select distinct on (data_object.unique_id) * FROM portal.metadata;
>
> Or something of this ilk should be faster because it only need to do a sort
> on data_object.unique_id and then an insert. After you have verified the
> results you can do:
>
> BEGIN;
> ALTER TABLE portal.metatdata rename TO portal.new_metatdata_old;
> ALTER TABLE portal.new_metatdata rename TO portal.metatdata_old;
> COMMIT;
>
> Other people might have other ideas.
> -Steve
>
>
>> DELETE FROM portal.metadata
>> WHERE idx NOT IN
>> (
>> SELECT MIN(idx)
>> FROM portal.metadata
>> GROUP BY "data_object.unique_id"
>> );
>>
>> CREATE TABLE
>> metadata
>> (
>> data_object.unique_id CHARACTER(64) NOT NULL,
>> size_bytes BIGINT,
>> object_date TIMESTAMP(6) WITHOUT TIME ZONE,
>> object_type CHARACTER VARYING(25),
>> classification CHARACTER VARYING(7),
>> object_source CHARACTER VARYING(50),
>> object_managed_date TIMESTAMP(6) WITHOUT TIME ZONE,
>> clevel INTEGER,
>> fsctlh CHARACTER VARYING(50),
>> oname CHARACTER VARYING(40),
>> description CHARACTER VARYING(80),
>> horizontal_datum CHARACTER VARYING(20),
>> do_location CHARACTER VARYING(200),
>> elevation_ft INTEGER,
>> location USER-DEFINED,
>> idx BIGINT DEFAULT nextval('portal.metadata_idx_seq'::regclass)
>> NOT
>> NULL,
>> bbox CHARACTER VARYING(160),
>> CONSTRAINT MetaDataKey PRIMARY KEY (idx)
>> )
>>
>> --
>> View this message in context:
>> http://postgis.17.n6.nabble.com/Slow-delete-of-duplicate-rows-tp4548251p4548251.html
>> Sent from the PostGIS - User mailing list archive at Nabble.com.
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
More information about the postgis-users
mailing list