Problem grouping similar files with Python multiprocessing and ssdeep (pyssdeep)

I am trying to process more than 1.3 million files using fuzzy hashing (ssdeep) via http://code.google.com/p/pyssdeep

What it does is this: it generates the hashes (1.3 million generated in 3-6 minutes) and then compares each one against the others to get similarity results. The comparison itself is very fast, but a single process will not get through all of it, so we incorporated the Python multiprocessing module to get everything done.

The result: 1.3 million text files processed in 30 minutes, using 18 cores (quad Xeon processors, 24 CPUs in total).

Here's how each process works:

  • Generate the ssdeep sums.
  • Split the list of sums into chunks of 5,000 (see the chunking sketch after this list).
  • Compare each chunk 1 vs 5,000 across 18 processes: 18 sums are compared per iteration.
  • Group results based on the similarity score (the default threshold is 75).
  • Files that have already been checked are removed from the next iteration.
  • Start with the next file that scored below 75% as the base of the next group.
  • Repeat until all groups are done.
  • Files that did not match anything (no similar files) are added to a remaining list.
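For reference, the chunking step works roughly like the sketch below. This is only a minimal illustration standing in for my real helper (groupSafeLimit in the code further down), assuming the sums are kept in a flat list and the chunk size defaults to 5,000:

    def chunk_sums(sum_list, chunk_size=5000):
        # Minimal sketch only: split the flat list of ssdeep sums into
        # fixed-size chunks; the real groupSafeLimit does the equivalent job.
        return [sum_list[i:i + chunk_size]
                for i in range(0, len(sum_list), chunk_size)]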

When all the processes have finished, the remaining files are combined and compared against each other recursively until no further matches remain.

The problem is that the file list is chopped into smaller lists (5,000 files each). A file may match something in the first 5,000-file chunk but never get compared against files in the other chunks, so the groups come out incomplete.

If I run it without chunking, the loop takes a very long time to complete: more than 18 hours and still not done; I don't know how much longer it would take.

Any advice would be appreciated.

Modules used: multiprocessing.Pool, pyssdeep

from multiprocessing import Pool
from ssdeep import ssdeep

# MODE_INTENSIVE, MODE_THOROUGH, MODE_FAST, TOTAL_PROCESSES, groupWithBlockID,
# groupSafeLimit, poolInitializer and matchingProcess are defined elsewhere in the script.

def ssdpComparer(lst, threshold):
    s = ssdeep()
    check_file = []
    result_data = []
    lst1 = lst
    set_lst = set(lst)
    print '>>>START'
    for tup1 in lst1:
        if tup1 in check_file:
            continue
        for tup2 in set_lst:
            score = s.compare(tup1[0], tup2[0])
            if score >= threshold:
                result_data.append((score, tup1[2], tup2[2]))  # Score, GroupID, FileID
                check_file.append(tup2)
        set_lst = set_lst.difference(check_file)
    print """####### DONE #######"""
    remain_lst = set(lst).difference(check_file)
    return (result_data, remain_lst)


def parallelProcessing(tochunk_list, total_processes, threshold, source_path, mode, REMAINING_LEN=0):
    result = []
    remainining = []
    pooled_lst = []
    pair = []
    chunks_toprocess = []
    print 'Total Files:', len(tochunk_list)
    if mode == MODE_INTENSIVE:
        chunks_toprocess = groupWithBlockID(tochunk_list)  # blockID chunks
    elif mode == MODE_THOROUGH:
        chunks_toprocess = groupSafeLimit(tochunk_list, TOTAL_PROCESSES)  # chunks by processes
    elif mode == MODE_FAST:
        chunks_toprocess = groupSafeLimit(tochunk_list)  # 5000 chunks
    print 'No. of files group to process: %d' % (len(chunks_toprocess))
    pool_obj = Pool(processes=total_processes, initializer=poolInitializer,
                    initargs=[None, threshold, source_path, mode])
    pooled_lst = pool_obj.map(matchingProcess, chunks_toprocess)  # chunks_toprocess
    tmp_rs, tmp_rm = getResultAndRemainingLists(pooled_lst)
    result += tmp_rs
    remainining += tmp_rm
    print 'RESULT LEN: %s, REMAINING LEN: %s, PRL: %s' % (len(result), len(remainining), REMAINING_LEN)
    tmp_r_len = len(remainining)
    if tmp_r_len != REMAINING_LEN and len(result) > 0:
        result += parallelProcessing(remainining, total_processes, threshold, source_path, mode, tmp_r_len)
    else:
        result += [('', '', rf[2]) for rf in remainining]
    return result


def getResultAndRemainingLists(pooled_lst):
    g_result = []
    g_remaining = []
    for tup_result in pooled_lst:
        tmp_result, tmp_remaining = tup_result
        g_result += tmp_result
        if tmp_remaining:
            g_remaining += tmp_remaining
    return (g_result, g_remaining)
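For completeness, the flat result list (tuples of score, GroupID, FileID, as noted in the comment above) can be folded into groups afterwards. This is only a hypothetical post-processing helper, not part of the script above:

    from collections import defaultdict

    def groupByGroupID(result):
        # Hypothetical helper: collect (score, group_id, file_id) tuples into
        # {group_id: [file_id, ...]}; unmatched leftovers land under the '' key.
        groups = defaultdict(list)
        for score, group_id, file_id in result:
            groups[group_id].append(file_id)
        return groups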
1 answer

First tip: in your case there is no need for check_file to be a list => change it to a set() - that alone should help (explanation at the end).
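A sketch of that first tip applied to the ssdpComparer from the question (same logic, only check_file switched from a list to a set):

    from ssdeep import ssdeep

    def ssdpComparer(lst, threshold):
        s = ssdeep()
        check_file = set()                 # was a list
        result_data = []
        set_lst = set(lst)
        for tup1 in lst:
            if tup1 in check_file:         # O(1) on average for a set
                continue
            for tup2 in set_lst:
                score = s.compare(tup1[0], tup2[0])
                if score >= threshold:
                    result_data.append((score, tup1[2], tup2[2]))  # Score, GroupID, FileID
                    check_file.add(tup2)   # add() instead of append()
            set_lst = set_lst.difference(check_file)
        remain_lst = set(lst).difference(check_file)
        return (result_data, remain_lst)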

If you really need to work in chunks, perhaps a procedure like this would be enough:

    def split_to_chunks(wholeFileList):
        s = ssdeep()
        calculated_chunks = []
        for someFileId in wholeFileList:
            for chunk in calculated_chunks:
                if s.compare(chunk[0], someFileId) > threshold:  # threshold taken from the enclosing scope
                    chunk.append(someFileId)
                    break
            else:
                # important: this 'else' belongs to the 'for' loop,
                # so if there was no 'break', someFileId becomes the base of a new chunk
                calculated_chunks.append([someFileId])
        return calculated_chunks

After that you can filter the result:

    groups = filter(lambda x: len(x) > 1, result)
    remains = filter(lambda x: len(x) == 1, result)

NOTE: this algorithm assumes that the first element of each chunk is a kind of "base" for it. How good the result is depends strongly on the behaviour of ssdeep (which raises a slightly odd question: how transitive is ssdeep's similarity?). If the similarity were transitive, then whenever A is similar to B and B is similar to C, A should be similar to C as well - but nothing guarantees that here.

In the worst case, when no pair's s.compare(fileId1, fileId2) score meets the threshold, the complexity is n^2, so in your case 1.3 million * 1.3 million comparisons.

There is no easy way to optimize this case. Imagine a situation where s.compare(file1, file2) is always close to 0; then (as I understand it) even knowing that s.compare(A, B) is very low and s.compare(B, C) is very low, you still can't say anything about s.compare(A, C), so you need n * n operations.
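To put a rough number on that worst case (plain arithmetic, nothing library-specific):

    # scale of the worst case for 1.3 million files
    n = 1300000
    print 'full n*n comparisons:   %e' % (n * n)            # ~1.69e+12
    print 'unique unordered pairs: %e' % (n * (n - 1) / 2)  # ~8.45e+11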

Another note: I suppose you are using too many heavy structures and too many lists, for example:

 set_lst = set_lst.difference(check_file) 

This instruction creates a new set(), and every element from set_lst and check_file has to be touched at least once. Because check_file is a list, there is no way for the difference operation to be optimized, so its complexity ends up around len(check_file) * log(len(set_lst)).

Basically: as these structures grow (towards 1.3 million elements), your machine has to perform far more computations. If you use check_file = set() instead of [] (a list), the complexity of this step should be about len(set_lst) + len(check_file).

The same applies to checking whether an item is in a Python list (array):

 if tup1 in check_file: 

Because check_file is a list, when tup1 is not in the list your processor has to compare tup1 against every element, so the complexity is len(check_file). If you change check_file to a set, the complexity drops to around log2(len(check_file)). To make the difference more obvious, say len(check_file) = 1 million. How many comparisons do you need?

set: log2(1 mln) = log2(1,000,000) ~ 20

list: len(check_file) = 1 mln
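A quick way to see this difference on your own machine (an illustrative micro-benchmark, not part of your script):

    import timeit

    setup = "data = range(1000000); as_list = list(data); as_set = set(data)"
    # membership test repeated 100 times against each container
    print timeit.timeit("999999 in as_list", setup=setup, number=100)  # scans the whole list
    print timeit.timeit("999999 in as_set", setup=setup, number=100)   # single hash lookup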


Source: https://habr.com/ru/post/1413192/

