How to remove duplicate lines from a flat file using SSIS?

Question

How to remove duplicate lines from a flat file using SSIS?

Let me first say that the ability to take 17 million records from a flat file by clicking on the database on the remote box and having it for 7 minutes is awesome. SSIS is really fantastic. But now that I have this data, how do I remove duplicates?

Even better, I want to take a flat file, remove duplicates from the flat file and put them back into another flat file.

I think about:

Data Flow Task

File source (with connection to the corresponding file)
Container for container
A script container containing some logic to indicate if another line exists

Work hard and everyone on this site is incredibly knowledgeable.

Update: I found this link can help answer this question

+6

sql-server duplicate-removal duplicates ssis business-intelligence

Ryankeeter Sep 29 '08 at 21:27

source share

9 answers

Craig warren · Answer 1 · 2009-03-06T14:02:36+0000

Use the Sort component.

Just select the fields into which you want to sort the loaded lines, and in the lower left corner you will see a checkbox for deleting duplicates. In this field, any rows whose duplicates are based only on sorting criteria are deleted; therefore, in the example below, the rows will be considered duplicate if we are only sorted by the first field:

1 | sample A |
1 | sample B |

Timothy Lee Russell · Answer 2 · 2008-09-29T21:48:15+0000

SSIS , , Select Distinct Rank , , , .

- , SQL, . , , script , 17 , ... .

Devin · Answer 3 · 2009-06-10T22:05:06+0000

→ ( , ) →

Hector Sosa Jr · Answer 4 · 2008-09-29T22:10:09+0000

, , . , . , , .

, , , , .

, , :

SET NOCOUNT ON

DECLARE @email varchar(100)

SET @email = ''

SET @emailid = (SELECT min(email) from StagingTable WITH (NOLOCK) WHERE email > @email)

WHILE @emailid IS NOT NULL
BEGIN

    -- Do INSERT statement based on the email
    INSERT StagingTable2 (Email)
    FROM StagingTable WITH (NOLOCK) 
    WHERE email = @email

    SET @emailid = (SELECT min(email) from StagingTable WITH (NOLOCK) WHERE email > @email)

END

, CURSOR, . , , , . SELECT , INSERT. .

AJ. · Answer 5 · 2008-09-30T12:32:25+0000

, unix, sort:

sort -u inputfile > outputfile

, Windows , :

( , , ).

, , , ignore_dup_key. .

CREATE UNIQUE INDEX idx1 ON TABLE (col1, col2, ...) WITH IGNORE_DUP_KEY

Christian Loris · Answer 6 · 2008-12-07T05:27:38+0000

- , . . " " . .

Mohit · Answer 7 · 2011-10-11T09:42:15+0000

. SSIS DFS ( ), .

Registered User · Answer 8 · 2009-10-08T17:23:32+0000

, . - , , , SSIS. , - , , . :

.
, , , ..
. NOT EXISTS NOT IN. . MERGE .
. , , . CTE ROW_NUMBER(), , :

.

WITH    
    sample_records 
    (       email_address
        ,   entry_date
        ,   row_identifier
    )
    AS
    (
            SELECT      'tester@test.com'
                    ,   '2009-10-08 10:00:00'
                    ,   1
        UNION ALL

            SELECT      'tester@test.com'
                    ,   '2009-10-08 10:00:01'
                    ,   2

        UNION ALL

            SELECT      'tester@test.com'
                    ,   '2009-10-08 10:00:02'
                    ,   3

        UNION ALL

            SELECT      'the_other_test@test.com'
                    ,   '2009-10-08 10:00:00'
                    ,   4

        UNION ALL

            SELECT      'the_other_test@test.com'
                    ,   '2009-10-08 10:00:00'
                    ,   5
    )
,   filter_records 
    (       email_address
        ,   entry_date
        ,   row_identifier
        ,   sequential_order
        ,   reverse_order
    )
    AS
    (
        SELECT  email_address
            ,   entry_date
            ,   row_identifier
            ,   'sequential_order'  = ROW_NUMBER() OVER (
                                        PARTITION BY    email_address 
                                        ORDER BY        row_identifier ASC)
            ,   'reverse_order'     = ROW_NUMBER() OVER (
                                        PARTITION BY    email_address
                                        ORDER BY        row_identifier DESC)
        FROM    sample_records
    )
    SELECT      email_address
            ,   entry_date
            ,   row_identifier
    FROM        filter_records
    WHERE       reverse_order = 1
    ORDER BY    email_address;

, , . , , MERGE, INSERT .

ScotSQLBob · Answer 9 · 2010-07-29T12:46:06+0000

Found this page the link text , which is worth paying attention to, although it may take too much time with 17 million entries

How to remove duplicate lines from a flat file using SSIS?

More articles: