Manipulating a large array is very slow in Ruby

I have the following problem:

I need to build a unique list of identifiers from a very large data set.

For example, I have 6,000 arrays of identifiers (follower lists), each containing anywhere from 1 to 25,000 ids.

I want to get a unique list of identifiers across all of these arrays (the unique followers of my followers). Once that is done, I need to subtract another list (another person's follower list) from those identifiers and get the final count.

The final set of unique identifiers grows to about 60,000,000 records. In Ruby, merging the arrays into one big array becomes very slow at around two million entries. Adding to a Set starts at about 0.1 seconds per batch, but climbs to 4 seconds by two million entries, which is no good at all.
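In outline, what I'm trying to do is this (toy data and illustrative names, not my real code):

```ruby
require 'set'

# Hypothetical input: several follower-id lists, plus a list of ids to exclude.
follower_lists = [[1, 2, 3], [2, 3, 4], [4, 5]]
excluded_ids   = [3, 5]

# Union all lists into one Set, then subtract the exclusion list.
unique_followers = Set.new
follower_lists.each { |list| unique_followers.merge(list) }
final = unique_followers - excluded_ids

puts final.size # the final count of unique followers
```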

I wrote a test program in Java, and it does all of this in under a minute.

Perhaps I'm doing this inefficiently in Ruby, or there is another way. Since my main code is proprietary, I wrote a small test program to reproduce the problem:

big_array = []
loop_counter = 0
start_time = Time.now

# final target size of the big array
while big_array.length < 60000000
  loop_counter += 1

  # target size of one person's follower list
  random_size_of_followers = rand(5000)

  follower_list = []
  follower_counter = 0
  while follower_counter < random_size_of_followers
    follower_counter += 1
    # make ids very large so we get a good spread and only some amount of dupes
    follower_id = rand(240000000) + 100000
    follower_list << follower_id
  end

  # combine the big list with this list
  big_array = big_array | follower_list
  end_time = Time.now

  # every 100 iterations check where we are and how long each loop and combine takes
  if loop_counter % 100 == 0
    elapsed_time = end_time - start_time
    average_time = elapsed_time.to_f / loop_counter.to_f
    puts "average time for loop is #{average_time}, total size of big_array is #{big_array.length}"
    start_time = Time.now
  end
end

Any suggestions? Or is it time to switch to JRuby and move this to something like Java?

+8
performance ruby jruby
2 answers

The method you are using is terribly inefficient, so it is no surprise it is slow. When you are tracking unique items, an Array requires far more processing than the equivalent Hash.

Here is a simple refactoring that speeds this up about 100×:

all_followers = {}
loop_counter = 0
start_time = Time.now

while all_followers.length < 60000000
  # target size of one person's follower list
  follower_list = []
  rand(5000).times do
    follower_id = rand(240000000) + 100000
    follower_list << follower_id
    all_followers[follower_id] = true
  end
  end_time = Time.now

  # every 100 iterations check where we are and how long each loop takes
  loop_counter += 1
  if loop_counter % 100 == 0
    elapsed_time = end_time - start_time
    average_time = elapsed_time.to_f / loop_counter.to_f
    puts "average time for loop is #{average_time}, total size of all_followers is #{all_followers.length}"
    start_time = Time.now
  end
end

The good thing about a Hash is that keys cannot be duplicated: inserting the same id again simply overwrites the existing entry. If you need the full list of followers at any point, use all_followers.keys to get the ids.
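For example, with toy ids:

```ruby
# Duplicates collapse automatically: reassigning an existing key overwrites it.
all_followers = {}
[101, 202, 101, 303].each { |id| all_followers[id] = true }

# The keys are the deduplicated id list (insertion order is preserved).
puts all_followers.keys.inspect
```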

Hashes use more memory than their Array equivalents, but that is the price you pay for performance. I also suspect one of the big memory consumers here is the many individual follower lists that are generated and apparently never used again, so perhaps you can skip that step entirely.

The main point is that the Array | operator is not very efficient, especially when operating on very large arrays: it has to scan and deduplicate the entire accumulated array on every merge.
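A quick sketch of the gap (sizes here are illustrative, much smaller than your real data):

```ruby
require 'benchmark'

ids = Array.new(200_000) { rand(1_000_000) }
via_array = []
via_hash  = {}

Benchmark.bm(6) do |x|
  # Array#| rebuilds and re-deduplicates the whole accumulated array each call.
  x.report("array:") do
    ids.each_slice(10_000) { |chunk| via_array = via_array | chunk }
  end

  # A Hash insert is amortized O(1) per id, so it stays fast as the set grows.
  x.report("hash:") do
    ids.each { |id| via_hash[id] = true }
  end
end
```

Both approaches end up with the same unique ids; the difference is only in how much work each merge step does.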

+5

Here is an example of tracking unique objects using an Array, a Hash, and a Set:

require 'benchmark'
require 'set'
require 'random_token'

n = 10000
Benchmark.bm(7) do |x|
  x.report("array:") do
    created_tokens = []
    while created_tokens.size < n
      token = RandomToken.gen(10)
      if created_tokens.include?(token)
        next
      else
        created_tokens << token
      end
    end
    results = created_tokens
  end

  x.report("hash:") do
    created_tokens_hash = {}
    while created_tokens_hash.size < n
      token = RandomToken.gen(10)
      created_tokens_hash[token] = true
    end
    results = created_tokens_hash.keys
  end

  x.report("set:") do
    created_tokens_set = Set.new
    while created_tokens_set.size < n
      token = RandomToken.gen(10)
      created_tokens_set << token
    end
    results = created_tokens_set.to_a
  end
end

and the benchmark results:

             user     system      total        real
array:   8.860000   0.050000   8.910000 (  9.112402)
hash:    2.030000   0.010000   2.040000 (  2.062945)
set:     2.000000   0.000000   2.000000 (  2.037125)

Reference: "ruby 處理 unique 物件" ("Handling unique objects in Ruby")

+1
