[Updated] I wrote a short version without any auxiliary variables and put everything in the method:

    def chunker f_in, out_pref, chunksize = 1_073_741_824
      File.open(f_in, "r") do |fh_in|
        until fh_in.eof?
          # File number = cursor position / chunksize, zero-padded to 5 digits
          File.open("#{out_pref}_#{"%05d" % (fh_in.pos / chunksize)}.txt", "w") do |fh_out|
            fh_out << fh_in.read(chunksize)
          end
        end
      end
    end

    # The third argument (desired chunk size in bytes) is optional
    chunker "inputfile.txt", "output_prefix"

Instead of looping over the file line by line, you can use .read(length) and loop only on the EOF check and the file cursor. This ensures that the chunk files will never be larger than the desired size. On the other hand, it never cares about line breaks (\n)!
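A small sketch of that behaviour, using an in-memory StringIO as a stand-in for the input file (the sample text and the chunk size of 10 are made up for illustration):

```ruby
require "stringio"

# In-memory stand-in for the input file (sample data, not from the post above)
input = StringIO.new("line one\nline two\nline three\n")

chunks = []
# Same loop shape as chunker: read fixed-size pieces until EOF
chunks << input.read(10) until input.eof?

# Print each chunk with its size in bytes
chunks.each { |c| puts "#{c.bytesize} bytes: #{c.inspect}" }
```

No chunk is longer than 10 bytes, but the cuts fall in the middle of lines.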
The numbers for the chunk files are generated by integer division of the current file cursor position by chunksize, formatted with "%05d", which yields zero-padded 5-digit numbers (00001). Because the position is taken before each read, numbering starts at 00000. This is only possible because .read(chunksize) advances the cursor by exactly chunksize bytes for every chunk except the last. In the second example below, it cannot be used!
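To see how the file names come out of the cursor position, here is a minimal sketch with a tiny chunksize of 10 and a hypothetical "out" prefix (both chosen only for illustration):

```ruby
chunksize = 10 # small illustrative value instead of the 1 GiB default

# Cursor positions as seen just before each read: 0, 10, 20
positions = [0, 10, 20]

# Integer division gives the chunk index, "%05d" pads it to 5 digits
names = positions.map { |pos| "out_#{"%05d" % (pos / chunksize)}.txt" }

puts names
```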
Update: Line break detection
If you really need complete lines ending in \n, use this modified piece of code:

    def chunker f_in, out_pref, chunksize = 1_073_741_824
      outfilenum = 1
      File.open(f_in, "r") do |fh_in|
        until fh_in.eof?
          File.open("#{out_pref}_#{outfilenum}.txt", "w") do |fh_out|
            line = ""
            # Keep writing while another line of the previous line's length
            # would still fit, and stop at the end of the input
            while fh_out.size <= (chunksize - line.length) && !fh_in.eof?
              line = fh_in.readline
              fh_out << line
            end
          end
          outfilenum += 1
        end
      end
    end

I had to introduce the helper variable line, because I want the actual file size to always stay below the chunksize limit! Without this look-ahead check, you will also get file sizes above the limit. The while condition can only test the limit for the next iteration after the current line has already been written, so it uses the length of that line as an estimate for the next one; a line much longer than its predecessor can therefore still push a file slightly over the limit. (Working with .ungetc or other complex calculations would make the code less readable and no shorter than this example.)
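For comparison, here is a sketch of a different approach (not the code above): read each line first and decide where it goes before writing it. The helper name group_lines is hypothetical, and it collects chunks in memory instead of writing files, purely to keep the example small:

```ruby
require "stringio"

# Hypothetical helper: groups whole lines into chunks of at most
# chunksize bytes. A single line longer than chunksize still gets
# a chunk of its own, since it cannot be split.
def group_lines(io, chunksize)
  chunks = [+""]
  io.each_line do |line|
    # Start a new chunk if this line would push the current one past the limit
    if !chunks.last.empty? && chunks.last.bytesize + line.bytesize > chunksize
      chunks << +""
    end
    chunks.last << line
  end
  chunks
end

result = group_lines(StringIO.new("aa\nbbbb\ncc\n"), 8)
result.each { |c| puts "#{c.bytesize} bytes: #{c.inspect}" }
```

This variant uses bytesize rather than length, so multi-byte characters are counted by the space they actually take in the file.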
Unfortunately, a second EOF check is needed, because the last pass through the blocks will usually produce a smaller fragment: the input ends before the size limit is reached.
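The reason the inner loop cannot simply rely on the size condition is that readline raises EOFError once the input is exhausted. A minimal sketch, again with StringIO standing in for the file:

```ruby
require "stringio"

io = StringIO.new("only line\n")
io.readline            # consumes the single line
reached_eof = io.eof?  # nothing is left to read

begin
  io.readline          # a further readline at EOF raises
  raised = false
rescue EOFError
  raised = true
end

puts "eof? = #{reached_eof}, EOFError raised = #{raised}"
```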
Two auxiliary variables are also needed: line, described above, and outfilenum, because the resulting file sizes generally do not match chunksize exactly, so the file number can no longer be derived from the cursor position.
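A short sketch of why the position-based numbering breaks down here. The start positions are hypothetical values for line-aligned chunks that each come out shorter than a chunksize of 10:

```ruby
chunksize = 10

# Hypothetical cursor positions at the start of three line-aligned chunks
start_positions = [0, 8, 15]

# Position-based numbering, as in the first version
derived = start_positions.map { |pos| pos / chunksize }

# Counter-based numbering, as with outfilenum
counted = (1..start_positions.size).to_a

puts "derived: #{derived.inspect}"
puts "counted: #{counted.inspect}"
```

The first two chunks get the same derived number, so the second file would overwrite the first; the counter stays unique.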