An effective way to find the difference between two very large arrays?

I have a problem with efficiency and algorithms when it comes to determining the difference between two very large arrays. I hope someone with a good understanding of the algorithms can point me in the right direction how to solve this, since my current implementations take a lot of time.

Problem:

I have two very large arrays. One contains a list of emails with invalid domain names, and the other contains a mixed list, which I need to check against the first array.

accounts_with_failed_email_domains = [279,000 records in here] unchecked_account_domains = [149,000 records in here] 

What I need to do is look at the list of unchecked_account_domains and then compare each entry to see if there is a match in accounts_with_failed_email_domains . I need to insert all matches between lists in a separate array, which will be processed later.

How can I effectively write something that can be quickly verified through these accounts. Here is what I have tried so far.

 unchecked_account_domains = [really big array] unchecked_account_domains = unchecked_account_domains.sort accounts_with_failed_email_domains = [another huge array].sort unchecked_account_domains.keep_if do |email| accounts_with_failed_email_domains.any? { |failed_email| email == failed_email } end # Count to see how many accounts are left puts unchecked_account_domains.count 

This implementation above works forever. Here is the second attempt, which still turned out to be no better.

 unchecked_account_domains = [really big array] unchecked_account_domains = unchecked_account_domains.sort accounts_with_failed_email_domains = [another huge array].sort unchecked_account_domains.each do |email| accounts_with_failed_email_domains.bsearch do |failed_email| final_check << email if email == failed_email end end # Count to see how many accounts are left puts final_check.count 

bsearch seemed promising, but I'm sure I am not using it correctly. Also, I tried to consider this question comparing large lists , but this is in python and I cannot find the Ruby equivalent for set . Does anyone have any ideas on how to solve this?

+6
source share
4 answers

It seems you can use Array#- :

 result = unchecked_account_domains - accounts_with_failed_email_domains 
+6
source

I didnโ€™t have a new solution here because the good answers have already been accepted. However, I wanted to see if there is a performance difference between the two code-based solutions.

This answer is the benchmark for identifying performance differences when using Array#- and two uses of Set#include? . First test Set#include? always performs the specified conversion, and the second converts once and saves the set for subsequent searches.

Here is the code that runs each test 50 times:

 require 'set' require 'benchmark' string = 'asdfghjkl' Times = 50 a = 279_000.times.map {|n| "#{n}#{string}" } b = 149_000.times.map {|n| "#{n*2}#{string}" } puts RUBY_DESCRIPTION puts "============================================================" puts "Running tests for trimming strings" Benchmark.bm(20) do |x| x.report("Array#-:") { Times.times {|n| a - b } } x.report("Set#include? #1:") do Times.times do |n| d = [] c = Set.new(b) a.each {|email| d << email if c.include?(email) } end end x.report("Set#include? #2:") do c = Set.new(b) Times.times do |n| d = [] a.each {|email| d << email if c.include?(email) } end end end 

Here are the results:

 ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin14] ============================================================ Running tests for trimming strings user system total real Array#-: 12.350000 0.250000 12.600000 ( 13.001546) Set#include? #1: 16.090000 0.330000 16.420000 ( 17.196469) Set#include? #2: 8.250000 0.100000 8.350000 ( 8.726609) 

Obviously, for an easy comparison of differences, use the Array#- approach. However, if you need to do this type of thing several times, preconverting a set makes a huge difference and works better than Array#- . The cost of converting an array to a set is quite high (comparatively), but as soon as you have a Set, it compares the differences much faster.

+6
source

A Set would be useful here if you know that the array contains unique elements (or you are not worried about losing duplicates, which I donโ€™t think you are), so just take your large array and do:

 require 'set' unchecked_account_domains = [really big array] accounts_with_failed_email_domains = Set.new([another huge array]) final_check = [] unchecked_account_domains.each do |email| final_check << email if accounts_with_failed_email_domain.include?(email) # .include? on a set is in O(1) look up time end 
+1
source

Converting an array of failed emails into a set (I think the Ruby .to_set , read about it in the Ruby docs). Then check each of the unverified emails for typing using .include? .

The reason it works forever is the mileage of all or most of the list for each check. The set class must have a hash list, making queries much faster.

0
source

All Articles