How to remove duplicate lines from a flat file using SSIS?

Let me first say that being able to take 17 million records from a flat file, push them to a database on a remote box and have it take only 7 minutes is amazing. SSIS truly is fantastic. But now that I have that data up there, how do I remove the duplicates?

Better yet, I want to take the flat file, remove the duplicates from it and put the de-duplicated rows back into another flat file.

I was thinking about:

Data Flow Task

  • A file source (with a connection to the file)
  • A For Loop container
  • A Script component containing some logic to tell whether another row exists

Thanks, and everyone on this site is incredibly knowledgeable.

Update: I have found this link, which may help answer this question.

+6
9 answers

Use the Sort component.

Simply choose the fields you want to sort the loaded rows by, and in the bottom-left corner you will see a checkbox for removing rows with duplicate sort values. That option removes rows that are duplicates based on the sort criteria only; so, in the example below, the two rows would be considered duplicates if we sorted on the first field only:

1 | sample A |
1 | sample B |
+22

I would suggest using SSIS to copy the records to a temporary table, then create a task that uses Select Distinct or Rank, depending on your situation, to select the duplicates, funnel them out to a flat file and delete them from the temporary table. The last step would be to copy the records from the temporary table into the destination table.

Determining a duplicate is something SQL is good at, but a flat file is not well suited for it. In the approach you proposed, the script component would load a row and then have to compare it against 17 million records, then load the next row and repeat... the performance would not be great.
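As a rough sketch of that staging-table approach (the table and column names dbo.StagingTable, dbo.DestinationTable, Email and EntryDate are assumptions for illustration, and ROW_NUMBER() is used instead of RANK() so that tied rows do not both survive), the de-duplication step itself is a single statement:

-- Variant 1: the whole row defines a duplicate.
INSERT INTO dbo.DestinationTable (Email, EntryDate)
SELECT DISTINCT Email, EntryDate
FROM   dbo.StagingTable;

-- Variant 2: a key column defines a duplicate; keep one ranked row per key.
INSERT INTO dbo.DestinationTable (Email, EntryDate)
SELECT  Email
    ,   EntryDate
FROM (
        SELECT  Email
            ,   EntryDate
            ,   ROW_NUMBER() OVER (PARTITION BY Email
                                   ORDER BY     EntryDate DESC) AS rn
        FROM    dbo.StagingTable
     ) AS ranked
WHERE   rn = 1;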

+6

Flat File Source → Aggregate (group by the columns that define a duplicate) → Flat File Destination

+4

The strategy will usually depend on how many columns the staging table has. The more columns, the more complex the solution. The article you linked to has some very good advice.

The only thing I would add to what everybody else has said so far is that columns with date and datetime values will give some of the solutions presented here fits.

One solution that I came up with is this:

SET NOCOUNT ON

DECLARE @email varchar(100)

SET @email = ''

-- Seed with the lowest email value in the staging table
SET @email = (SELECT MIN(email) FROM StagingTable WITH (NOLOCK) WHERE email > @email)

WHILE @email IS NOT NULL
BEGIN

    -- Insert a single, de-duplicated row for this email value
    INSERT StagingTable2 (Email)
    SELECT TOP (1) email
    FROM StagingTable WITH (NOLOCK)
    WHERE email = @email

    -- Advance to the next distinct email value
    SET @email = (SELECT MIN(email) FROM StagingTable WITH (NOLOCK) WHERE email > @email)

END

This is a LOT faster for de-duping than a CURSOR and will not peg the server's CPU. To use it, separate each column that comes in from the text file into its own variable, use a separate SELECT before and inside the loop, and then include those variables in the INSERT statement. This has worked really well for me.
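A minimal sketch of that multi-column extension, assuming made-up email and entry_date columns in StagingTable and an Email/EntryDate pair in StagingTable2 (the column names and the MAX() tie-break rule are illustrative assumptions, not part of the answer above):

SET NOCOUNT ON

DECLARE @email varchar(100), @entry_date datetime

-- Seed with the lowest email value in the staging table
SET @email = (SELECT MIN(email) FROM StagingTable WITH (NOLOCK) WHERE email > '')

WHILE @email IS NOT NULL
BEGIN
    -- Pull the other column(s) for this email into their own variables
    SET @entry_date = (SELECT MAX(entry_date)
                       FROM   StagingTable WITH (NOLOCK)
                       WHERE  email = @email)

    -- One INSERT per distinct email, carrying the extra columns along
    INSERT StagingTable2 (Email, EntryDate)
    VALUES (@email, @entry_date)

    -- Advance to the next distinct email value
    SET @email = (SELECT MIN(email)
                  FROM   StagingTable WITH (NOLOCK)
                  WHERE  email > @email)
END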

+2

To do this on the flat file itself, on unix you can just use sort:

sort -u inputfile > outputfile

Unfortunately, the Windows sort command does not have a unique option, so you would need a third-party sort utility

(I have not tried any of them, so no guarantees, unfortunately).

On the other hand, to remove duplicates as the records are loaded into the database, you could create a unique index on the key of the destination table with ignore_dup_key. That will make the records unique very efficiently at load time.

CREATE UNIQUE INDEX idx1 ON TableName (col1, col2, ...) WITH (IGNORE_DUP_KEY = ON)
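For what it's worth, here is a small self-contained sketch of how IGNORE_DUP_KEY behaves; the table and column names are invented for the example. Duplicate inserts are discarded with the "Duplicate key was ignored." warning instead of failing the load:

-- Example table with a unique index that silently drops duplicate keys
CREATE TABLE dbo.CleanEmails (email varchar(100) NOT NULL);

CREATE UNIQUE INDEX ux_CleanEmails_email
    ON dbo.CleanEmails (email)
    WITH (IGNORE_DUP_KEY = ON);

-- Duplicates in the staging data raise only the "Duplicate key was ignored."
-- warning; each email ends up in dbo.CleanEmails exactly once.
INSERT INTO dbo.CleanEmails (email)
SELECT email
FROM   dbo.StagingTable;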
+2

A somewhat dirty solution is to set up your target table with a composite key that spans all columns. That guarantees uniqueness. Then, on the data destination, configure the task to ignore errors. All duplicate inserts will simply fall into oblivion.

+2

We can use a lookup for this. SSIS provides two DFS (Data Flow transformations) for it: Fuzzy Grouping and Fuzzy Lookup.

+2

We do something broadly similar with our flat file loads. Rather than trying to de-duplicate row by row inside SSIS, we land everything in a staging table and clean it up in T-SQL, where set-based operations make this kind of work much easier than an SSIS script. The general approach:

  • Load the file into a staging table as-is.
  • Decide what defines a duplicate: the whole row, a business key such as the email address, etc.
  • Filter out rows that already exist in the destination with NOT EXISTS or NOT IN, or handle inserts and updates together with MERGE (see the sketch after the example below).
  • Pick which copy of each duplicate to keep, for example the first or the most recent one. A CTE with ROW_NUMBER(), partitioned by the duplicate-defining key, makes that straightforward.

The example below keeps the most recent row for each email address:

WITH    
    sample_records 
    (       email_address
        ,   entry_date
        ,   row_identifier
    )
    AS
    (
            SELECT      'tester@test.com'
                    ,   '2009-10-08 10:00:00'
                    ,   1
        UNION ALL

            SELECT      'tester@test.com'
                    ,   '2009-10-08 10:00:01'
                    ,   2

        UNION ALL

            SELECT      'tester@test.com'
                    ,   '2009-10-08 10:00:02'
                    ,   3

        UNION ALL

            SELECT      'the_other_test@test.com'
                    ,   '2009-10-08 10:00:00'
                    ,   4

        UNION ALL

            SELECT      'the_other_test@test.com'
                    ,   '2009-10-08 10:00:00'
                    ,   5
    )
,   filter_records 
    (       email_address
        ,   entry_date
        ,   row_identifier
        ,   sequential_order
        ,   reverse_order
    )
    AS
    (
        SELECT  email_address
            ,   entry_date
            ,   row_identifier
            ,   'sequential_order'  = ROW_NUMBER() OVER (
                                        PARTITION BY    email_address 
                                        ORDER BY        row_identifier ASC)
            ,   'reverse_order'     = ROW_NUMBER() OVER (
                                        PARTITION BY    email_address
                                        ORDER BY        row_identifier DESC)
        FROM    sample_records
    )
    SELECT      email_address
            ,   entry_date
            ,   row_identifier
    FROM        filter_records
    WHERE       reverse_order = 1
    ORDER BY    email_address;

The sequential_order and reverse_order columns let you keep either the first or the last record for each email address. Depending on your requirements, pushing the surviving rows into the final table is then just a MERGE or INSERT statement.
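A rough sketch of the NOT EXISTS step mentioned in the list above, assuming a StagingTable holding the raw file rows and a dbo.Destination table keyed on email_address (both names are assumptions for the example, not part of the answer):

-- Keep the most recent row per email from staging, then insert only
-- the emails that are not already present in the destination table.
WITH winners AS
(
    SELECT  email_address
        ,   entry_date
        ,   ROW_NUMBER() OVER (PARTITION BY email_address
                               ORDER BY     row_identifier DESC) AS rn
    FROM    StagingTable
)
INSERT INTO dbo.Destination (email_address, entry_date)
SELECT  w.email_address
    ,   w.entry_date
FROM    winners AS w
WHERE   w.rn = 1
    AND NOT EXISTS (SELECT 1
                    FROM   dbo.Destination AS d
                    WHERE  d.email_address = w.email_address);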

+1

Found this page (link text), which may be worth a look, although it could take too long with 17 million records.

+1
