Working with a large data object between Ruby processes

I have a Ruby hash that comes to about 10 megabytes when written to a file with Marshal.dump. After gzip compression it is about 500 kilobytes.

Iterating over and modifying this hash is very fast in Ruby (fractions of a millisecond). Even copying it is fast.

The problem is that I need to share the data in this hash between Ruby on Rails processes. To do this with the Rails cache (file_store or memcached) I have to Marshal.dump it first, but that incurs a 1000 millisecond delay to serialize the hash and a 400 millisecond delay to deserialize it.

Ideally, I would like to be able to save and load this hash from every process in less than 100 milliseconds.

One idea is to spin up a separate Ruby process to hold this hash and expose an API that other processes can use to modify or query the data inside it, but I want to avoid that unless I'm sure there is no other way to access this object quickly.

Is there a way to share this hash between processes more directly, without serializing and deserializing it?

Here is the code that I use to generate a hash similar to the one I'm working with:

    @a = []
    0.upto(500) do |r|
      @a[r] = []
      0.upto(10_000) do |c|
        if rand(10) == 0
          @a[r][c] = 1 # 10% chance of being 1
        else
          @a[r][c] = 0
        end
      end
    end

    @c = Marshal.dump(@a) # 1000 milliseconds
    Marshal.load(@c)      # 400 milliseconds

Update:

Since my initial question did not receive many answers, I take it there is no solution as simple as I had hoped.

I am currently considering two options:

  • Create a Sinatra application to store this hash, with an API to modify/access it.
  • Create a C application that does the same as #1, but much faster.

The scope of my problem has grown, so the hash may end up larger than in my original example, which means #2 may be necessary. But I have no idea where to start when it comes to writing a C application that exposes the appropriate API.

A good walkthrough of how best to implement #1 or #2 would make for a great answer.
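For what it's worth, here is a minimal sketch of what option #1 might look like: a tiny Sinatra app holding the data row by row in memory. The routes, parameter names and JSON encoding are illustrative assumptions, not a finished design.

    # Rough, untested sketch of option #1 (names and routes are made up for illustration)
    require 'sinatra'
    require 'json'

    # one in-memory row of 10,000 flags per key, owned by this single process
    STORE = Hash.new { |h, k| h[k] = Array.new(10_000, 0) }

    get '/rows/:id' do          # read a whole row
      STORE[params[:id]].to_json
    end

    put '/rows/:id' do          # replace a whole row
      STORE[params[:id]] = JSON.parse(request.body.read)
      'ok'
    end

    put '/rows/:id/:col' do     # update a single cell, e.g. PUT /rows/3/17?value=1
      STORE[params[:id]][params[:col].to_i] = params[:value].to_i
      'ok'
    end

Each request only moves the rows or cells it touches rather than the whole 10 MB structure, although every call still pays an HTTP and JSON round trip.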

Update 2

I ended up implementing this as a standalone application written in Ruby 1.9 that exposes a DRb interface for communicating with the application instances. I use the Daemons gem to spawn the DRb instances when the web server starts. On startup the DRb application loads the necessary data from the database, and it then communicates with the clients to return results and stay up to date. It is working quite well now. Thanks for the help!
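Roughly, the shape of it is something like the following (a simplified stand-in rather than the actual code; the class name, URI and loading step are illustrative):

    # Simplified stand-in for the daemonized DRb server (illustrative names only)
    require 'drb'
    require 'daemons'

    class InterestStore
      include DRbUndumped     # hand out a reference, never ship the whole data set

      def initialize
        @data = {}            # the real application loads this from the database
      end

      def [](key)
        @data[key]
      end

      def []=(key, value)
        @data[key] = value
      end
    end

    Daemons.run_proc('interest_store') do
      DRb.start_service('druby://localhost:9999', InterestStore.new)
      DRb.thread.join
    end

The Rails processes then attach with DRbObject.new_with_uri('druby://localhost:9999') and call methods on the store as if it were a local object.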

performance c ruby ruby-on-rails serialization
6 answers

The Sinatra application would work, but the {de}serialization and HTTP handling may hurt performance compared to a DRb service.

Here is an example based on the one in your related question. I use a hash instead of an array so that you can use user IDs as the keys. That way there is no need to keep both the interests table and the user-ID table on the server. Note that the interests table is “transposed” compared to your example, which is what you want anyway, so it can be updated in one call.

    # server.rb
    require 'drb'

    class InterestServer < Hash
      include DRbUndumped # don't send the data over!

      def closest(cur_user_id)
        cur_interests = fetch(cur_user_id)
        selected_interests = cur_interests.each_index.select{|i| cur_interests[i]}
        scores = map do |user_id, interests|
          nb_match = selected_interests.count{|i| interests[i] }
          [nb_match, user_id]
        end
        scores.sort!
      end
    end

    DRb.start_service nil, InterestServer.new
    puts DRb.uri

    DRb.thread.join

    # client.rb
    uri = ARGV.shift

    require 'drb'
    DRb.start_service
    interest_server = DRbObject.new nil, uri

    USERS_COUNT = 10_000
    INTERESTS_COUNT = 500

    # Mock users
    users = Array.new(USERS_COUNT) { {:id => rand(100000)+100000} }

    # Initial send-over of user interests
    users.each do |user|
      interest_server[user[:id]] = Array.new(INTERESTS_COUNT) { rand(10) == 0 }
    end

    # query at will
    puts interest_server.closest(users.first[:id]).inspect

    # update, say there's a new user:
    new_user = {:id => 42}
    users << new_user
    # This guy is interested in everything!
    interest_server[new_user[:id]] = Array.new(INTERESTS_COUNT) { true }

    puts interest_server.closest(users.first[:id])[-2,2].inspect
    # Will output our first user and this new user, which both match perfectly

To try it out, start the server in a terminal and pass the URI it prints as an argument to the client:

    $ ruby server.rb
    druby://mal.lan:51630

    $ ruby client.rb druby://mal.lan:51630
    [[0, 100035], ...]
    [[45, 42], [45, 178902]]

This may be too obvious, but if you sacrifice a little access speed to the members of your hash, a traditional database will give you much more constant-time access to the values. You could start there and then add caching to see if you can get enough speed out of it. It will be a little simpler than using Sinatra or another tool.
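For example, a rough sketch of the idea with SQLite through the sqlite3 gem (the table and column names here are only for illustration):

    # Illustrative sketch: one (user_id, interest_id) cell per row, indexed by the primary key
    require 'sqlite3'

    db = SQLite3::Database.new('interests.db')
    db.execute <<-SQL
      CREATE TABLE IF NOT EXISTS interests (
        user_id     INTEGER,
        interest_id INTEGER,
        value       INTEGER,
        PRIMARY KEY (user_id, interest_id)
      )
    SQL

    # reads and writes touch single cells instead of the whole 10 MB structure
    db.execute('INSERT OR REPLACE INTO interests VALUES (?, ?, ?)', [42, 17, 1])
    value = db.get_first_value(
      'SELECT value FROM interests WHERE user_id = ? AND interest_id = ?', [42, 17]
    )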


Be careful with memcached: it has a limit on object size (1 MB by default).

Try using MongoDB as the storage instead. It is pretty fast and you can map just about any data structure into it.
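For instance, a rough sketch with the mongo gem, where the database, collection and field names are just placeholders:

    # Illustrative sketch only: one document per row of 10,000 flags
    require 'mongo'

    client = Mongo::Client.new(['127.0.0.1:27017'], database: 'interest_demo')
    rows   = client[:rows]

    # store a whole row as one document
    rows.insert_one(_id: 42, flags: Array.new(10_000) { rand(10) == 0 ? 1 : 0 })

    # read it back, or flip a single flag in place on the server
    doc = rows.find(_id: 42).first
    rows.update_one({ _id: 42 }, { '$set' => { 'flags.17' => 1 } })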


If it's reasonable to wrap your monster hash in a method call, you could simply expose it over DRb: run a small daemon that starts a DRb server with the hash as the front object, and other processes can query it via what amounts to RPC.

Beyond that, is there a different approach to your problem? Without knowing what you're trying to do it's hard to say for sure, but maybe a trie or a Bloom filter would do? Or even a nicely packed bitfield would probably save you a fair bit of space.
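To make the bitfield idea concrete, here is a small illustrative sketch (my addition, not part of the original suggestion) that packs each user's 10,000 yes/no interests into one arbitrary-precision Ruby Integer and compares two users with a bitwise AND:

    # each bit i answers "is this user interested in interest i?"
    alice = 0
    bob   = 0
    10_000.times do |i|
      alice |= (1 << i) if rand(10) == 0
      bob   |= (1 << i) if rand(10) == 0
    end

    # count the interests the two users share
    shared = (alice & bob).to_s(2).count('1')
    puts shared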


Have you considered increasing the maximum size of a memcache object?

On version 1.4.2 or newer:

    memcached -I 11m    # giving yourself an extra MB of space

or, on earlier versions, change the value of POWER_BLOCK in slabs.c and recompile.


How about storing the data in memcached as individual values instead of storing the whole hash in memcached? Using your code above:

    @a = []
    0.upto(500) do |r|
      @a[r] = []
      0.upto(10_000) do |c|
        key = "#{r}:#{c}"
        if rand(10) == 0
          Cache.set(key, 1) # 10% chance of being 1
        else
          Cache.set(key, 0)
        end
      end
    end

It will be fast, you won't have to worry about serialization, and all your systems will have access to it. As for how you access the data (which I asked about in a comment on the main post), you will need to be a bit creative, but it should be easy to do.
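For the read side, a hedged sketch of the mirror-image lookups (assuming the same Cache client as above; whether a multi-get such as get_multi is available depends on the client library):

    value = Cache.get('3:1500')                  # one cell

    row_keys = (0..10_000).map { |c| "3:#{c}" }
    row = Cache.get_multi(*row_keys)             # a whole row, if the client supports multi-get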

