I have the following scenario:
I need to work out the unique list of identifiers across a very large set.
For example, I have 6000 arrays of identifiers (follower lists), each of which can contain anywhere from 1 to 25,000 entries (one person's list of followers).
I want to get the unique list of identifiers across all of these arrays (the unique followers of followers). Once that is done, I need to subtract another list of identifiers (another person's follower list) from the result and get a final count.
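To make the goal concrete, here is a minimal sketch of the two steps (union, then subtraction) on a tiny data set, using Ruby's standard `Set`. The sample data and the names `follower_lists`, `excluded_ids`, `unique_ids` and `remaining` are made up for illustration and are not taken from the real code.

```ruby
require 'set'

# Hypothetical sample data: each inner array is one person's follower ids.
follower_lists = [
  [101, 102, 103],
  [103, 104],
  [102, 105, 106]
]

# Hypothetical ids to exclude (another person's follower list).
excluded_ids = [104, 106, 999]

# Step 1: collect the unique ids across all lists.
# Set#merge adds to the existing set in place instead of building a new one.
unique_ids = Set.new
follower_lists.each { |list| unique_ids.merge(list) }

# Step 2: subtract the other list. Set#- accepts any enumerable.
remaining = unique_ids - excluded_ids
final_count = remaining.size

puts "final count is #{final_count}"
```

At this size either an Array or a Set works; the question is what happens when the unique set grows to tens of millions of entries.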
The final set of unique identifiers grows to around 60,000,000 records. In Ruby, when adding the arrays to one big array, it starts to get very slow at around two million entries. When adding to a Set, each addition takes 0.1 seconds at first, then grows to 4 seconds at 2 million entries (nowhere near where I need to go).
I wrote a test program in Java and it does all of this in less than a minute.
Maybe I am doing this inefficiently in Ruby, or there is another way. Since my main code is proprietary, I wrote a simple test program to simulate the problem:
big_array = []
loop_counter = 0
start_time = Time.now
# final target size of the big array
while big_array.length < 60000000
  loop_counter+=1
  # target size of one persons follower list
  random_size_of_followers = rand(5000)
  follower_list = []
  follower_counter = 0
  while follower_counter < random_size_of_followers
    follower_counter+=1
    # make ids very large so we get good spread and only some amt of dupes
    follower_id = rand(240000000) + 100000
    follower_list << follower_id
  end
  # combine the big list with this list
  big_array = big_array | follower_list
  end_time = Time.now
  # every 100 iterations check where we are and how long each loop and combine takes.
  if loop_counter % 100 == 0
    elapsed_time = end_time - start_time
    average_time = elapsed_time.to_f/loop_counter.to_f
    puts "average time for loop is #{average_time}, total size of big_array is #{big_array.length}"
    start_time = Time.now
  end
end
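A Set-based variant of the same test might look roughly like the sketch below. It is only an illustration: `build_unique_ids` and its parameters are hypothetical names, and the target size is scaled far down so that it finishes quickly. The difference from the array version is that `Set#merge` mutates the existing set, while `big_array | follower_list` builds a new array on every iteration.

```ruby
require 'set'

# Hypothetical helper: keeps merging random follower lists into one Set
# until it holds at least target_size unique ids.
def build_unique_ids(target_size, max_list_size: 5000, id_range: 240_000_000)
  unique_ids = Set.new
  while unique_ids.size < target_size
    # One person's follower list: a random number of random large ids.
    follower_list = Array.new(rand(max_list_size)) { rand(id_range) + 100_000 }
    # In-place union; no new collection is allocated for the result.
    unique_ids.merge(follower_list)
  end
  unique_ids
end

started = Time.now
ids = build_unique_ids(50_000) # scaled-down target, not the real 60,000,000
puts "collected #{ids.size} unique ids in #{Time.now - started} seconds"
```

Raising the target shows how the time per merge changes as the set grows.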
Any suggestions? Is it time to switch to JRuby and move things like this to Java?
performance ruby jruby
Joelio