Data Change Detection

So the story is this:

- I have many files (rather large, about 25 GB) in a specific format that need to be imported into a data warehouse

- these files are continuously updated, sometimes with new data, sometimes with the same data

- I'm trying to find an algorithm that can tell whether a particular line in a file has changed, to minimize the time spent updating the database

- right now I delete all the data in the database every time and then reimport it, but this will no longer work, because I need a timestamp for when each item changed

- the files contain strings and numbers (names, orders, prices, etc.)

The only solutions I could think of:

- calculate a hash for each row in the database and compare it with the hash of the corresponding row from the file; if they differ, update the database

- keep 2 copies of the files, previous and current, diff them (which is probably faster than updating the db), and update the db based on the differences
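
To make the second option concrete, here is a minimal sketch of a single-pass comparison of the previous and current copies. It assumes things not stated above: both files are sorted by an integer ID in the first comma-separated column, they have no header line, and rows that disappear should be reported as deletes. The names `parse` and `diff_sorted` and the sample rows are made up for illustration.

```python
import io

def parse(line):
    """Split 'ID,rest' into (integer ID, rest of the line)."""
    key, _, rest = line.rstrip("\n").partition(",")
    return int(key), rest

def diff_sorted(previous, current):
    """Yield (action, id, data) by walking two ID-sorted line iterators once.

    Assumption: both inputs are sorted by ID, so memory use stays constant
    no matter how large the files are.
    """
    prev_iter, curr_iter = iter(previous), iter(current)
    prev, curr = next(prev_iter, None), next(curr_iter, None)
    while prev is not None or curr is not None:
        prev_rec = parse(prev) if prev is not None else None
        curr_rec = parse(curr) if curr is not None else None
        if curr_rec is None or (prev_rec is not None and prev_rec[0] < curr_rec[0]):
            # ID exists only in the previous copy
            yield ("delete", prev_rec[0], prev_rec[1])
            prev = next(prev_iter, None)
        elif prev_rec is None or curr_rec[0] < prev_rec[0]:
            # ID exists only in the current copy
            yield ("insert", curr_rec[0], curr_rec[1])
            curr = next(curr_iter, None)
        else:
            # same ID in both copies: report it only if the data differs
            if prev_rec[1] != curr_rec[1]:
                yield ("update", curr_rec[0], curr_rec[1])
            prev, curr = next(prev_iter, None), next(curr_iter, None)

# Made-up sample data standing in for the two copies of a file.
previous = io.StringIO("1,Jim,20\n2,Tim,30\n3,Kim,40\n")
current = io.StringIO("1,Jim,20\n2,Tim,35\n3,Kim,40\n4,Zim,30\n")
changes = list(diff_sorted(previous, current))
print(changes)
```

Only the rows yielded here would need to touch the database, and the time of the run could serve as the change timestamp. If the files are not already sorted by ID, they would need an external sort first, which is a cost this sketch ignores.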

Since the amount of data ranges from very large to huge, I'm more or less out of options at the moment. Eventually I will get rid of the files and the data will be sent directly to the database, but the problem will still remain.

Any advice would be appreciated.

+5
4 answers

The problem as I understand it.

Let's say your file contains

ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40

As you said, lines can be added or updated, so the file becomes

ID,Name,Age
1,Jim,20    -- to be discarded
2,Tim,35    -- to be updated
3,Kim,40    -- to be discarded
4,Zim,30    -- to be inserted

, / 2 sql- 1 , sql.

  • .
  • [ - - ] .

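- Store the hash of each record [Name, Age] against its ID in an in-memory map, where the ID is the key and the hash is the value [for scalability, consider hazelcast].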

Batch Framework . [ , ], ID . .

If the ID is present in the in-memory map:
--- compare the hashes
--- if they are the same, discard the line
--- if they differ, create an update sql and store the new hash value
If the ID is not present in the in-memory map:
--- create an insert sql and store the hash value
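
The steps above can be sketched in Python as follows. This is only an illustration under assumptions: a plain `dict` stands in for the in-memory map, SHA-256 is an arbitrary choice of hash, the file is CSV with the ID in the first column and a header line, and the function names are made up. It collects the changed records rather than building real SQL.

```python
import csv
import hashlib
import io

def row_hash(fields):
    """Hash the non-key columns of one record."""
    joined = "\x1f".join(fields)  # unit separator keeps column boundaries unambiguous
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def detect_changes(lines, known_hashes):
    """Split records into inserts and updates; unchanged records are discarded.

    known_hashes maps ID -> hash from earlier runs and is updated in place.
    """
    inserts, updates = [], []
    reader = csv.reader(lines)
    next(reader)  # skip the header line (ID,Name,Age)
    for record in reader:
        if not record:
            continue  # ignore blank lines
        record_id, fields = record[0], record[1:]
        digest = row_hash(fields)
        previous = known_hashes.get(record_id)
        if previous == digest:
            continue  # ID present, same hash: discard
        if previous is None:
            inserts.append(record)  # ID not present: would become an insert sql
        else:
            updates.append(record)  # ID present, different hash: an update sql
        known_hashes[record_id] = digest
    return inserts, updates

old_file = io.StringIO("ID,Name,Age\n1,Jim,20\n2,Tim,30\n3,Kim,40\n")
new_file = io.StringIO("ID,Name,Age\n1,Jim,20\n2,Tim,35\n3,Kim,40\n4,Zim,30\n")

hashes = {}
first_inserts, first_updates = detect_changes(old_file, hashes)  # initial load
inserts, updates = detect_changes(new_file, hashes)
print(inserts)  # [['4', 'Zim', '30']]
print(updates)  # [['2', 'Tim', '35']]
```

Each record in `inserts` or `updates` would be written to the database together with the current time, which gives the change timestamp the question asks for. For 25 GB of input the map holds one hash per ID, so whether a single-process `dict` is enough depends on how many distinct IDs there are.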

For this you could look at spring-batch and hazelcast.

http://www.hazelcast.com/

http://static.springframework.org/spring-batch/

, .

+3

, , -?

- .

, , - , , . , , .

+1

, , , O (n), n ~ 25 .

, .

25GB , .

1.
, ? , , , , ( ).

2. ,
, , . , # 1, , . , , ( - , ).

, / /. - , ( , , YMMV).

()

  • # 1 # 2 , ,
  • , 25- , , /, - ( , ), / ( / ).
  • diff , diff , . ( diff, -H --minimal , .. , , iIR O (n log n), , , O (n), , )
+1

, , ? WriteFile, . .

-, : , , ? ( 2- , ).

0
