How to generate 1 million random integers and write them to a file?

I am trying to run some tests on my external sorting algorithms, and I figured I should generate a huge number of random integers and put them in a file.

Here is how I do it:

    import tempfile, random

    nf = tempfile.NamedTemporaryFile(delete=False)
    i = 0
    while i < 1000:
        j = 0
        buf = ''
        while j < 1000:
            buf += str(random.randint(0, 1000))
            j += 1
        nf.write(buf)
        i += 1

I thought I needed to speed up the generation by reducing the number of file I/O operations, so I use buf to accumulate as many numbers as possible and then write buf to the file in a single call.

Question:

I still have the feeling that the generation and writing process is slow.

Am I misunderstanding something?

EDIT:

In C++, we can simply write an int or a float to a file with <<, without converting them to a string.

So, can we do the same in Python? I mean write an integer to a file without converting it to str.

+4
6 answers

Operating systems are already optimized for this kind of file I/O, so you can write the numbers to the file directly and get very good speed:

    import tempfile, random

    with tempfile.NamedTemporaryFile(delete=False) as nf:
        for _ in xrange(1000000):
            # xrange() is more efficient than range(), in Python 2
            nf.write(str(random.randint(0, 1000)))

In practice, the numbers are only written to disk when the file's buffer fills up. The code in the question and the code above take the same time on my machine, so I would advise you to use my simpler code and rely on the operating system's built-in optimizations.
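If you want to experiment with the buffering yourself, note that in Python 2 the built-in open() accepts a buffer size as its third argument. A minimal sketch (the file name random_numbers.txt is just a placeholder):

    import random

    # Same loop as above, but with an explicit 1 MiB write buffer;
    # in Python 2, open()'s third argument is the buffer size in bytes.
    with open('random_numbers.txt', 'w', 1 << 20) as f:
        for _ in xrange(1000000):
            f.write(str(random.randint(0, 1000)))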

If the result fits in memory (which is the case for 1 million numbers), you can really save some I/O operations by building the final string and then writing it in one go:

    with tempfile.NamedTemporaryFile(delete=False) as nf:
        nf.write(''.join(str(random.randint(0, 1000)) for _ in xrange(1000000)))

This second approach is 30% faster on my computer (2.6 s instead of 3.8 s), probably thanks to the single call to write() (instead of a million write() calls, and probably far fewer actual disk writes).
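If you want to reproduce the comparison, a minimal timing harness along these lines should do (the exact numbers will of course depend on your machine and disk):

    import random, tempfile, time

    # Time the single-write variant; to compare, swap in the loop
    # version between the two time.time() calls.
    start = time.time()
    with tempfile.NamedTemporaryFile(delete=False) as nf:
        nf.write(''.join(str(random.randint(0, 1000)) for _ in xrange(1000000)))
    print 'one big write: %.1f s' % (time.time() - start)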

The "large number of entries" approach of your question falls in the middle (3.1 s). However, this can be improved: it is clearer and more pythonic to write like this:

    import tempfile, random

    with tempfile.NamedTemporaryFile(delete=False) as nf:
        for _ in xrange(1000):
            nf.write(''.join(str(random.randint(0, 1000)) for _ in xrange(1000)))

This solution is equivalent to, but faster than, the code in the original question (2.6 s on my machine instead of 3.8 s).

So, the first, simple approach above may well be fast enough for you. If it is not, and if the whole file fits in memory, the second approach is both very fast and simple. Otherwise, your original idea (fewer writes, in bigger chunks) is good: it is about as fast as the single-write approach, and is still quite simple when written as above.

+7

Do not use string concatenation in a loop. Use str.join instead.

CPython implementation detail: if s and t are both strings, some Python implementations, such as CPython, can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run time much less likely. This optimization is both version and implementation dependent. For performance-sensitive code, it is preferable to use the str.join() method, which assures consistent linear concatenation performance across versions and implementations.

Your code will look like this:

 buf = ''.join(str(random.randint(0, 1000)) for j in range(1000)) 

Note that since you did not specify a separator, the output will look like this:

 3847018274193258124003837134.... 

Change the '' to ',' if you want the numbers to be separated by commas, for example.
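For an external-sort test you probably want one number per line. A minimal sketch, assuming newline-separated output (the file name test.in is just an example):

    import random

    # '\n'.join() puts a newline between numbers but not after the
    # last one, so a trailing newline is added explicitly.
    buf = '\n'.join(str(random.randint(0, 1000)) for j in xrange(1000))
    with open('test.in', 'w') as f:
        f.write(buf + '\n')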

I also don't think you need to do the buffering yourself, since writes to a file should already be buffered.

+3

If you just need to generate some random numbers and you are on Linux, try this shell command:

    for i in {1..1000000}; do echo $((RANDOM % 1000)); done > test.in

OK, I tested the code below; it takes about 5 seconds to finish:

    import tempfile, random

    nf = tempfile.NamedTemporaryFile(delete=False)
    for i in xrange(0, 1000000):
        nf.write(str(random.randint(0, 1000)))
+2

I'm not sure about Python, but += is usually an expensive operation for strings, since it copies the string into new memory.

Using some kind of string builder, or an array that you join at the end, is probably much faster.
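In Python, that pattern is usually "append the pieces to a list, then join once at the end". A minimal sketch:

    import random

    # Appending to a list is amortized O(1); the single join at the
    # end does one linear-time concatenation instead of 1000 copies.
    parts = []
    for _ in xrange(1000):
        parts.append(str(random.randint(0, 1000)))
    buf = ''.join(parts)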

+1

Like this:

    import random
    import struct

    with open('binary.dat', 'wb') as output:
        for i in xrange(1000000):
            u = random.randint(0, 999999)  # number
            b = struct.pack('i', u)        # bytes
            output.write(b)

This will create 4 million bytes of data: 1 million 4-byte values.

You can read about struct and its various packing options here: http://docs.python.org/library/struct.html
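To read the values back for the sorting tests, a minimal sketch, assuming the same binary.dat file and the 4-byte native-endian ints written above:

    import struct

    with open('binary.dat', 'rb') as f:
        data = f.read()
    # Unpack one 4-byte int at each offset without slicing the string.
    numbers = [struct.unpack_from('i', data, offset)[0]
               for offset in xrange(0, len(data), 4)]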

+1

Doing anything a million times is going to be relatively slow. In addition, depending on how random your numbers need to be, you may want to invest in a more robust random number generator; a personal favorite is the Mersenne Twister: http://en.wikipedia.org/wiki/Mersenne_twister (note that Python's random module already uses it internally).

-2
