First tip: in your case there is no need for check_file to be a list; change it to a set() and it should perform much better (explanation at the end).
If you need to split the files into chunks, perhaps a procedure like this is enough:
    def split_to_chunks(wholeFileList):
        # threshold is assumed to be defined elsewhere
        # (a similarity cutoff between 0 and 100)
        s = ssdeep()
        calculated_chunks = []
        for someFileId in wholeFileList:
            # try to attach the file to an existing chunk, comparing
            # against the chunk's first ("basic") element
            for chunk in calculated_chunks:
                if s.compare(chunk[0], someFileId) > threshold:
                    chunk.append(someFileId)
                    break
            else:
                # no existing chunk was similar enough: start a new one
                calculated_chunks.append([someFileId])
        return calculated_chunks
After that you can filter the result:

    groups = filter(lambda x: len(x) > 1, result)
    remains = filter(lambda x: len(x) == 1, result)
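For completeness, a minimal usage sketch, assuming the same ssdeep wrapper as above is importable; the hash strings are purely illustrative stand-ins for real ssdeep hashes, and threshold is a hypothetical cutoff:

    threshold = 70  # hypothetical similarity cutoff (0..100)

    # illustrative stand-ins for real ssdeep hash strings
    file_hashes = [
        "3:AXGBicFlgVNhBGcL6wCrFQEv:AXGHsNhxLsr2C",
        "3:AXGBicFlIHBGcL6wCrFQEv:AXGH6xLsr2C",
        "3:tj1AXeZLEGSDOY:tj1AXeZLWY",
    ]

    result = split_to_chunks(file_hashes)
    print(result)  # e.g. [[hash_a, hash_b], [hash_c]] if the first two are similar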
Note: this algorithm assumes that the first element of each chunk is the "basic" one. A good result strongly depends on the behavior of ssdeep (a strange question comes to mind: how transitive is ssdeep?). If the similarity were transitive, then A being similar to B and B being similar to C would imply A is similar to C, but there is no such guarantee.
In the worst case, when no pair s.compare(fileId1, fileId2) satisfies the threshold condition, the complexity is O(n^2), so in your case about 1.3 million * 1.3 million comparisons.
There is no easy way to optimize this case. Imagine a situation where s.compare(file1, file2) is always close to 0; then (as I understand it) even if you know that s.compare(A, B) is very low and s.compare(B, C) is very low, you still can't say anything about s.compare(A, C), so you really need n * n operations.
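To make the quadratic blow-up concrete, here is a small counting sketch with a stub compare function that never reaches the threshold (all names are illustrative; the stub stands in for the real ssdeep comparison): every file ends up in its own chunk, and the loop performs n * (n - 1) / 2 comparisons.

    calls = 0

    def stub_compare(a, b):
        # stand-in for s.compare() that is never "similar enough"
        global calls
        calls += 1
        return 0

    def split_worst_case(items, threshold=70):
        chunks = []
        for item in items:
            for chunk in chunks:
                if stub_compare(chunk[0], item) > threshold:
                    chunk.append(item)
                    break
            else:
                chunks.append([item])
        return chunks

    split_worst_case(list(range(1000)))
    print(calls)  # 499500 == 1000 * 999 / 2, i.e. O(n^2)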
Another note: I suspect you use too many heavy operations on lists, for example:
    set_lst = set_lst.difference(check_file)
This command creates a brand-new set(), and every element from set_lst and check_file has to be touched at least once, so a single call already costs on the order of len(set_lst) + len(check_file) operations; and because check_file is a list, every other lookup in it elsewhere in your code degenerates to a linear scan that Python has no way to optimize.

Basically: as these structures grow (towards 1.3 million items), your computer has to perform far more calculations. If you use check_file = set() instead of [] (a list), a difference() call stays at about len(set_lst) + len(check_file) hash operations, and, more importantly, membership tests on check_file become cheap.
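A minimal before/after sketch of that change (set_lst and tup1 are hypothetical stand-ins for your data):

    set_lst = {("some_file_id", 42), ("other_file_id", 7)}
    tup1 = ("some_file_id", 42)

    # before: check_file as a list
    check_file = [tup1]
    remaining = set_lst.difference(check_file)

    # after: check_file as a set; append() becomes add(), and both
    # "in" membership tests and difference() turn into hash lookups
    check_file = set()
    check_file.add(tup1)
    remaining = set_lst.difference(check_file)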
The same applies to checking whether an item is in a Python list (array):
    if tup1 in check_file:
Because check_file is a list, if tup1 is not in the list your processor has to compare tup1 with every element, so the complexity is len(check_file). If you change check_file to a set, the membership test becomes a hash lookup, which is O(1) on average (even a tree-based set would only need about log2(len(check_file)) comparisons). To make the difference obvious, say len(check_file) = 1 million. How many operations do you need?
set: a hash lookup, roughly constant on average (log2(1,000,000) ~ 20 even for a tree-based set)
list: len(check_file) = 1 million comparisons
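A quick micro-benchmark sketch of that gap (sizes and timings are illustrative and machine-dependent):

    import timeit

    n = 1_000_000
    as_list = list(range(n))
    as_set = set(as_list)
    missing = -1  # worst case for the list: it has to scan all n elements

    t_list = timeit.timeit(lambda: missing in as_list, number=10)
    t_set = timeit.timeit(lambda: missing in as_set, number=10_000)
    print(f"list: {t_list / 10:.6f} s per lookup")
    print(f"set:  {t_set / 10_000:.9f} s per lookup")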