Ruby: how to split a file into several files of a certain size

I want to split a txt file into several files, each containing no more than 5 MB. I know there are tools for this, but I need it for a project and want to do it in Ruby. I would also prefer to do this with File.open in a block context, if possible, but so far I have failed :o(

    #!/usr/bin/env ruby
    require 'pp'

    MAX_BYTES = 5_000_000
    file_num = 0
    bytes = 0

    File.open("test.txt", 'r') do |data_in|
      File.open("#{file_num}.txt", 'w') do |data_out|
        data_in.each_line do |line|
          data_out.puts line
          bytes += line.length
          if bytes > MAX_BYTES
            bytes = 0
            file_num += 1
            # next file
          end
        end
      end
    end

This works, but I do not think it is elegant. I am also still wondering whether this can be done with File.open in a block context only.

    #!/usr/bin/env ruby
    require 'pp'

    MAX_BYTES = 5_000_000
    file_num = 0
    bytes = 0

    File.open("test.txt", 'r') do |data_in|
      data_out = File.open("#{file_num}.txt", 'w')
      data_in.each_line do |line|
        data_out = File.open("#{file_num}.txt", 'w') unless data_out.respond_to? :write
        data_out.puts line
        bytes += line.length
        if bytes > MAX_BYTES
          bytes = 0
          file_num += 1
          data_out.close
        end
      end
      data_out.close if data_out.respond_to? :close
    end

Greetings

Martin

+7
4 answers

[Updated] I wrote a short version without any auxiliary variables and put everything in a method:

    def chunker f_in, out_pref, chunksize = 1_073_741_824
      File.open(f_in,"r") do |fh_in|
        until fh_in.eof?
          # The chunk number is derived from the current cursor position.
          File.open("#{out_pref}_#{"%05d"%(fh_in.pos/chunksize)}.txt","w") do |fh_out|
            fh_out << fh_in.read(chunksize)
          end
        end
      end
    end

    # Usage: chunker "inputfile.txt", "output_prefix"[, desired_chunk_size]

Instead of looping over lines, you can use .read(length) and loop only on the EOF check and the file cursor.

This guarantees that the chunk files are never larger than the desired size.

On the other hand, it does not care about line breaks ( \n ), so a line can be cut in two at a chunk boundary!
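A minimal sketch of that behaviour, assuming a toy chunk size of 8 bytes and a StringIO standing in for the input file (both are illustration choices, not part of the answer):

```ruby
require 'stringio'

# Toy input: two lines, 23 bytes in total.
io = StringIO.new("first line\nsecond line\n")

chunks = []
# read(8) returns up to 8 bytes and ignores where the lines end.
chunks << io.read(8) until io.eof?

p chunks
```

Both lines end up spread over two chunks, which is exactly the drawback described above.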

The numbers for the chunk files are generated by integer division of the current file cursor position by chunksize, formatted with "%05d", which gives 5-digit numbers with leading zeros ( 00001 ).

This is only possible because .read(chunksize) is used: every chunk except the last has exactly chunksize bytes. In the second example below, this trick cannot be used!
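The numbering can be sketched in isolation as follows. The helper name chunk_name, the prefix "out" and the 10-byte chunk size are hypothetical and chosen only for illustration; a StringIO replaces the real file.

```ruby
require 'stringio'

# Hypothetical helper: derives the chunk file name from the cursor position.
def chunk_name(prefix, pos, chunksize)
  format("%s_%05d.txt", prefix, pos / chunksize)
end

io = StringIO.new("a" * 25)   # 25 bytes of toy data
chunksize = 10
names = []

until io.eof?
  # The name is computed before reading, so the first chunk gets number 0.
  names << chunk_name("out", io.pos, chunksize)
  io.read(chunksize)          # moves the cursor forward by up to chunksize bytes
end

p names
```

Because the name is taken before the read, numbering starts at 00000 in this sketch.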

Update: Line break detection

If you really need complete lines with \n , use this modified piece of code:

    def chunker f_in, out_pref, chunksize = 1_073_741_824
      outfilenum = 1
      File.open(f_in,"r") do |fh_in|
        until fh_in.eof?
          File.open("#{out_pref}_#{outfilenum}.txt","w") do |fh_out|
            line = ""
            while fh_out.size <= (chunksize-line.length) && !fh_in.eof?
              line = fh_in.readline
              fh_out << line
            end
          end
          outfilenum += 1
        end
      end
    end

I had to introduce the helper variable line because I want the chunk file size to stay below the chunksize limit. Without this extra check you will also get files above the limit, because the while condition is only evaluated again in the next iteration, after the line has already been written. Note that the check uses the length of the previously read line as an estimate for the next one, so a much longer following line can still push a chunk past the limit. (Working with .ungetc or other complex calculations would make the code less readable and no shorter than this example.)

Unfortunately, a second EOF check is needed, because the last chunk will usually be smaller than the limit.

Two auxiliary variables are also needed: line, as described above, and outfilenum, because the resulting file sizes usually do not match chunksize exactly, so the cursor position can no longer be used for numbering.
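If the limit has to hold strictly, one option is to look at the size of the line that is about to be written instead of the previous one. The following is only a sketch under that assumption: split_by_lines is a hypothetical name, not code from this answer, and a single line longer than chunksize still ends up in an oversized file of its own.

```ruby
# Hypothetical variant: starts a new chunk before a line that would not fit.
# Returns the list of chunk file paths it created.
def split_by_lines(f_in, out_pref, chunksize)
  paths = []
  out = nil
  written = 0

  File.foreach(f_in) do |line|
    # Open a new chunk if none is open yet or the line would exceed the limit.
    if out.nil? || written + line.bytesize > chunksize
      out&.close
      paths << "#{out_pref}_#{paths.size + 1}.txt"
      out = File.open(paths.last, "w")
      written = 0
    end
    out << line
    written += line.bytesize   # bytesize counts bytes, length counts characters
  end

  paths
ensure
  out&.close                   # close the last chunk, also on errors
end
```

The byte counter is kept in a local variable here, so the output file does not need to be queried for its size on every line.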

+13

For files of any size, split will be faster than Ruby code built from scratch, even taking the startup cost of a separate executable into account. It is also code that you do not need to write, debug, or maintain:

 system("split -C 1M -d test.txt ''") 

The options are:

  • -C 1M: put whole lines totalling at most 1M into each chunk
  • -d: use decimal suffixes in the output file names
  • test.txt: name of the input file
  • '': use an empty prefix for the output file names

If you are not on Windows, this is the way to go.
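The call above can also be made without going through a shell. The sketch below assumes GNU split is installed and on the PATH; the helper names split_command and run_split are hypothetical.

```ruby
# Hypothetical helper: builds the argument list for the split call shown above.
def split_command(input, max_size: "1M", prefix: "")
  ["split", "-C", max_size, "-d", input, prefix]
end

# Hypothetical helper: runs split and raises if it did not succeed.
def run_split(input, **opts)
  # With separate arguments, system does not invoke a shell, so no quoting is needed.
  # It returns true on success, false on a non-zero exit status,
  # and nil if the command could not be executed at all.
  system(*split_command(input, **opts)) or raise "split failed for #{input}"
end

# Usage (not executed here): run_split("test.txt", max_size: "5M")
```

Checking the return value of system makes a missing or failing split visible instead of silently producing no output files.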

+11

This code really works; it is simple and collects the lines in an array before writing, which makes it faster:

    #!/usr/bin/env ruby
    data = Array.new()
    MAX_BYTES = 3500
    MAX_LINES = 32
    lineNum = 0
    file_num = 0
    bytes = 0

    filename = 'W:/IN/tangoZ.txt_100.TXT'
    r = File.exist?(filename)
    puts 'File exists =' + r.to_s + ' ' + filename
    file=File.open(filename,"r")
    line_count = file.readlines.size
    file_size = File.size(filename).to_f / 1024000
    puts 'Total lines=' + line_count.to_s + ' size=' + file_size.to_s + ' Mb'
    puts ' '

    file = File.open(filename,"r")
    #puts '1 File open read ' + filename
    file.each{|line|
      bytes += line.length
      lineNum += 1
      data << line
      if bytes > MAX_BYTES then
      # if lineNum > MAX_LINES then
        bytes = 0
        file_num += 1
        #puts '_2 File open write ' + file_num.to_s + ' lines ' + lineNum.to_s
        File.open("#{file_num}.txt", 'w') {|f| f.write data.join}
        data.clear
        lineNum = 0
      end
    }

    ## write leftovers
    file_num += 1
    #puts '__3 File open write FINAL' + file_num.to_s + ' lines ' + lineNum.to_s
    File.open("#{file_num}.txt", 'w') {|f| f.write data.join}

+1

Instead of opening your output file inside the input file block, open the file and assign it to a variable. When you hit the file size limit, close the file and open a new one.
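A rough sketch of that idea, reusing the file naming and the 5 MB limit from the question. The method name split_file and the out_dir parameter are hypothetical additions; as in the question, a chunk can exceed the limit by up to one line, and an empty trailing file is possible if the limit is reached exactly on the last line.

```ruby
MAX_BYTES = 5_000_000

# Hypothetical helper: keeps the output file in a variable and swaps it
# for a new one whenever the byte limit is exceeded.
# Returns the number of output files that were opened.
def split_file(path, max_bytes = MAX_BYTES, out_dir = ".")
  file_num = 0
  bytes = 0
  data_out = File.open(File.join(out_dir, "#{file_num}.txt"), "w")

  File.foreach(path) do |line|
    data_out.write(line)
    bytes += line.bytesize      # bytesize counts bytes, length counts characters
    next if bytes <= max_bytes

    # Limit reached: close the current file and open the next one.
    data_out.close
    file_num += 1
    bytes = 0
    data_out = File.open(File.join(out_dir, "#{file_num}.txt"), "w")
  end

  file_num + 1
ensure
  data_out&.close               # make sure the last file is closed, also on errors
end
```

The ensure clause takes over the cleanup that the block form of File.open would otherwise do automatically.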

0