How to optimize ActiveRecord find_in_batches query?

I am using Rails 4.0.0 and Ruby 2.0.0. My Post model (as in blog posts) is associated with a user through a combination of the user's user_name, first_name and last_name. I would like to migrate the data so that posts are associated with users through a foreign key, the user ID.

I have about 11 million posts in the posts table.

I am running the code below as a rake task on a Linux server to transfer the data. However, the task keeps getting "Killed", presumably because the rake task, and specifically the code below, consumes too much memory.

I found that decreasing batch_size to 20 and increasing sleep(10) to sleep(60) allows the task to work longer, updating more records in general, without being killed, but taking significantly longer.

How can I optimize this code for speed and memory usage?

    Post.where(user_id: nil).find_in_batches(batch_size: 1000) do |posts|
      puts "*** Updating batch beginning with post #{posts.first.id}..."
      sleep(10) # Hopefully, saving some memory usage.
      posts.each do |post|
        begin
          user = User.find_by(user_name: post.user_name, first_name: post.first_name, last_name: post.last_name)
          post.update(user_id: user.id)
        rescue NoMethodError => error
          # user could be nil, so user.id will raise a NoMethodError
          puts "No user found."
        end
      end
      puts "*** Finished batch."
    end
+6
5 answers

Do all the work in the database, which is WAY faster than moving data back and forth.

This can be done using ActiveRecord. Of course, PLEASE test this before running it against important data.

    Post
      .where(user_id: nil)
      .joins("inner join users on posts.user_name = users.user_name")
      .update_all("posts.user_id = users.id")

In addition, if posts has an index on user_id and users has an index on user_name, that will help speed up this particular query.
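
The question matches users on user_name, first_name and last_name, while the join above uses user_name only. Here is a minimal sketch of the same idea with all three columns in the join condition. It assumes a database that accepts an UPDATE with a JOIN (MySQL does; others may need different SQL), and the constant and method names are made up for illustration. Only `joins` and `update_all` are ActiveRecord API. As above, try it on a copy of the data first.

```ruby
# Hypothetical names; only `joins` and `update_all` come from ActiveRecord.
USER_JOIN_SQL = <<-SQL.gsub(/\s+/, " ").strip
  INNER JOIN users
    ON posts.user_name  = users.user_name
   AND posts.first_name = users.first_name
   AND posts.last_name  = users.last_name
SQL

SET_USER_ID_SQL = "posts.user_id = users.id"

# Takes any relation-like object that responds to `joins` and `update_all`,
# e.g. Post.where(user_id: nil), and returns whatever update_all returns
# (for ActiveRecord, the number of rows affected).
def backfill_user_ids(scope)
  scope.joins(USER_JOIN_SQL).update_all(SET_USER_ID_SQL)
end

# In the rake task (assumes the Post model from the question):
#   backfill_user_ids(Post.where(user_id: nil))
```

With this join, a composite index on users (user_name, first_name, last_name) is probably the more useful one.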

+9

Check out the #uncached method on AR models. Basically, AR caches a lot of query data to optimize queries while #find_in_batches runs, but this gets in the way of large processing scripts.

    Post.uncached do
      # perform all your heavy query magic here
    end

Ultimately, if that doesn't work, consider using the mysql2 gem directly to avoid the AR overhead, provided you do not depend on any callbacks or business logic in the update.

+2

If the join is possible, I would go with z5h's approach. Otherwise, you can add an index on the users table (possibly in a separate migration), and skip validations, callbacks, etc. when updating each post:

    add_index :users, [:user_name, :first_name, :last_name] # Speed up search queries

    Post.where(user_id: nil).find_each do |post|
      if user = User.find_by(user_name: post.user_name, first_name: post.first_name, last_name: post.last_name)
        post.update_columns(user_id: user.id) # ...to skip validations and callbacks.
      end
    end

Note that find_each is equivalent to find_in_batches plus iterating over each post, so it is not necessarily faster (see the Rails Guides on the Active Record Query Interface).
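
To make that relationship concrete, here is a plain-Ruby sketch of the two methods over an in-memory array. It is not the Rails implementation. It only mimics the documented behaviour: records are walked in primary-key order, batch_size rows at a time, resuming after the last id seen. `Row`, `each_batch` and `each_row` are hypothetical names, and the sketch assumes positive integer ids.

```ruby
Row = Struct.new(:id, :user_name)

# Mimics find_in_batches: yields arrays of at most batch_size rows,
# ordered by id, fetching each batch as "id > last_id LIMIT batch_size".
def each_batch(rows, batch_size = 1000)
  last_id = 0
  loop do
    batch = rows.select { |r| r.id > last_id }
                .sort_by(&:id)
                .first(batch_size)
    break if batch.empty?
    yield batch
    last_id = batch.last.id
  end
end

# Mimics find_each: the same batches, yielded one row at a time.
def each_row(rows, batch_size = 1000)
  each_batch(rows, batch_size) do |batch|
    batch.each { |row| yield row }
  end
end

rows = [5, 1, 4, 2, 3].map { |i| Row.new(i, "user#{i}") }

batches = []
each_batch(rows, 2) { |batch| batches << batch.map(&:id) }
# batches is now [[1, 2], [3, 4], [5]]

ids = []
each_row(rows, 2) { |row| ids << row.id }
# ids is now [1, 2, 3, 4, 5]
```

Either way only one batch is held at a time, so choosing between the two is a matter of convenience rather than memory.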

Good luck

+2

By combining the other answers, I was able to join the tables and update several columns in batches of 1000 rows, with improved speed and without my process being killed by the server.

This combines the approaches that seemed to me to work best, while keeping the code within the ActiveRecord API as much as possible.

    Post.uncached do
      # find_in_batches (not find_each) is what yields an array of posts per batch
      Post.where(user_id: nil, organization_id: nil).find_in_batches(batch_size: 1000) do |posts|
        puts "** Updating batch beginning with post #{posts.first.id}..."

        # Update 1000 records at once
        posts.map!(&:id) # posts is an array, not a relation
        Post.where(id: posts).
          joins("INNER JOIN users ON (posts.user_name = users.user_name)").
          joins("INNER JOIN organizations ON (organizations.id = users.organization_id)").
          update_all("posts.user_id = users.id, posts.organization_id = organizations.id")

        puts "** Finished batch."
      end
    end
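
A variation on this idea, offered as an untested sketch and not taken from any of the answers: since each batch is only used for its ids, the loop could walk fixed-size id ranges and never instantiate Post objects at all. `id_ranges` is a hypothetical helper. The commented usage assumes integer primary keys, a database that accepts the joined update_all used in this answer, and that gaps in the id sequence are acceptable (some ranges will simply update fewer rows).

```ruby
# Splits min_id..max_id into consecutive inclusive ranges of at most `size` ids.
# Returns an empty array when the table is empty (min/max are nil).
def id_ranges(min_id, max_id, size = 1000)
  return [] if min_id.nil? || max_id.nil?
  (min_id..max_id).step(size).map do |start|
    start..[start + size - 1, max_id].min
  end
end

# Usage in the task (assumes the Post model and the first join from this answer):
#   scope = Post.where(user_id: nil)
#   id_ranges(scope.minimum(:id), scope.maximum(:id)).each do |range|
#     scope.where(id: range)
#          .joins("INNER JOIN users ON (posts.user_name = users.user_name)")
#          .update_all("posts.user_id = users.id")
#   end
```

Each iteration issues one UPDATE and loads no rows into Ruby, so memory use should stay flat regardless of table size.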
0

Add a new boolean attribute, updated, to posts and use it to track which posts have already been processed:

    Post.where(updated: false).find_in_batches(batch_size: 1000) do |posts|
      ActiveRecord::Base.transaction do
        puts "*** Updating batch beginning with post #{posts.first.id}..."
        posts.each do |post|
          user = User.find_by(user_name: post.user_name, first_name: post.first_name, last_name: post.last_name)
          if user
            post.update_columns(user_id: user.id, updated: true)
          else
            post.update_columns(updated: true)
          end
        end
        puts "*** Finished batch."
      end
    end
0
