Fastest way to generate a large random string with lower latin letters

Question

Fastest way to generate a large random string with lower latin letters

I am trying to solve this problem from judge Timus Online. To solve this problem, you need to create a sequence of 1,000,000 lowercase Latin letters and write it to stdin in 1 second.

It is easy to solve this problem using C ++ or Java. I have a python solution:

import os from random import randint s = ''.join(chr(97 + randint(0, 25)) for i in range(1000000)) os.write(1, bytes(s, 'utf8'))

1.7s required:

 $ time python3.3 1219.py > /dev/null real 0m1.756s user 0m1.744s sys 0m0.008s

And I got a "time limit exceeded." So the question is: "How to do it faster?"

UPD1 : Using randint(97, 122) reduces the time by 16 ms. Now it's 1.740s

UPD2: The solution from @Martijn Pieters takes 0.979s, but it also does not pass the test.

UPD3 Martijn Pieters offered very good solutions, but it is still slow:

 from sys import stdin from random import choice from string import ascii_lowercase s = ''.join([choice(ascii_lowercase) for _ in range(1000000)]) stdout.write(s)

Takes 0.924 s

 from sys import stdout from random import choice from string import ascii_lowercase for _ in range(1000000): stdout.write(choice(ascii_lowercase))

accepts 1.173s

 from sys import stdout from random import choice from string import ascii_lowercase bal = [c.encode('ascii') for c in ascii_lowercase] out = stdout.buffer for _ in range(1000000): out.write(choice(bal))

Does 1.155s

 from sys import stdout from random import choice from string import ascii_lowercase bal = [c.encode('ascii') for c in ascii_lowercase] stdout.buffer.write(b''.join([choice(bal) for _ in range(1000000)]))

Does 0.901s

UPD4

Some guy just solved a problem on Timus. Hope he shares his decision :)

UPD5 Thanks to Ashwini Chaudhary for sharing her Python 2.x with us:

 from random import choice from string import ascii_lowercase lis=list(ascii_lowercase) print ''.join(choice(lis) for _ in xrange(1000000))

It takes 0.527s on my computer, and it passes tests on Timus. But the problem with Python3.x still remains.

UPD6 Thanks Markku K. this code:

 import os from random import random from string import ascii_lowercase bal = [c.encode('ascii') for c in ascii_lowercase] os.write(1, b''.join([bal[int(random() * 26)] for _ in range(1000000)]))

Takes 0.445s but failed a test

+10

performance python python-3.x random stdin

ilalex Apr 30 '13 at 21:08

source share

7 answers

Use string.ascii_lowercase instead of chr to generate lowercase characters:

 from sys import stdin from random import choice from string import ascii_lowercase s = ''.join([choice(ascii_lowercase) for _ in range(1000000)]) stdout.write(s)

In addition, writing to stdout directly seems faster, coding itself in python is no faster than all this is handled in C.

I also use list comprehension; str.join() is required to scan through the input sequence twice, once to determine the length of the output, once to actually copy the input elements to the output string. List comprehension then produces a slower generator-to-list code.

Just using choice(ascii_lowercase) over your method of generating each character from an integer is twice as fast:

 >>> timeit.timeit('f()', 'from __main__ import yours as f', number=3) 11.299837955011753 >>> timeit.timeit('f()', 'from __main__ import mine as f', number=3) 5.330044150992762

You can try to avoid overhead ''.join() by writing single characters directly to stdout :

 from sys import stdout from random import choice from string import ascii_lowercase for _ in range(1000000): stdout.write(choice(ascii_lowercase))

Next, try writing raw bytes:

 from sys import stdout from random import choice from string import ascii_lowercase bal = [c.encode('ascii') for c in ascii_lowercase] out = stdout.buffer for _ in range(1000000): out.write(choice(bal))

but in my tests this did not improve compared to ''.join() .

Then we move on to encoding ASCII characters in bytes once, and then using bytes.join() :

 from sys import stdout from random import choice from string import ascii_lowercase bal = [c.encode('ascii') for c in ascii_lowercase] stdout.buffer.write(b''.join([choice(bal) for _ in range(1000000)]))

bal is a list of ASCII lowercase characters encoded in bytes, from which we arbitrarily select 1 million elements, append them to a large byte string, and then write that one goes to the stdout binary buffer.

Joining bytes as "slow" as version string:

 >>> timeit.timeit('f()', 'from __main__ import bytes as f', number=3) 5.41390264898655

but we encode 26 characters, not 1 million, so that the recording scene is faster.

+4

Martijn Pieters Apr 30 '13 at 21:20

source share

My solution that has just been made (python 2.7, Runtime: 0.984):

 from random import choice from string import ascii_lowercase lis = list(ascii_lowercase) print ''.join(choice(lis) for _ in xrange(1000000))

Accessing list items is faster than strings.

 In [13]: from random import choice In [14]: from string import ascii_lowercase In [15]: lis = list(ascii_lowercase) In [16]: %timeit ''.join(choice(lis) for _ in xrange(10**5)) 1 loops, best of 3: 128 ms per loop In [17]: %timeit ''.join(choice(ascii_lowercase) for _ in xrange(10**5)) 1 loops, best of 3: 134 ms per loop

And you do not need stdout or stdin here, as most online judges will check something like this on your script:

 $python script.py <in.txt >out.txt

Thus, you can use print instead of stdout and raw_input() instead of stdin , although for huge inputs stdin.readline is faster than raw_input() .

Update 1 :

Using @Markku tip runtime has been reduced to 0.64 in py2.7:

 from random import random from string import ascii_lowercase lis = list(ascii_lowercase) print "".join( [lis[int(random() * 26)] for _ in xrange(1000000)] )

+2

Ashwini Chaudhary Apr 30 '13 at 22:03

source share

I get a huge speed improvement by changing from randint (0.25) to int (random () * 25) in your original solution. On my machine, the time ranged from 2 seconds to about 0.6 seconds. If you look at the random.py code, you will see that randint is full of checks that you don't need or need.

update:. You need int (random () * 26). Thanks Ashwini

+2

Markku K. Apr 30 '13 at 22:41

source share

Try turning part of it into C ++ or another compiled language. It is almost guaranteed to make it faster. Unfortunately, Python is not too fast, especially when it comes to such things. Try C ++, C or Pascal .

EDIT: Also see Python Performance Tips

+1

refi64 Apr 30 '13 at 21:11

source share

Generate and write in pieces larger than 2.

Maybe use a string or an array of 26 lowercase letters and randomly select then instead of generating characters.

0

Lee Meador Apr 30 '13 at

source share

Use random.choices ?

In Python 3.6:

 import random
 import string

 % timeit '' .join (random.choices (string.ascii_lowercase, k = 10 ** 6))
 1 loop, best of 3: 235 ms per loop

0

ostrokach Jun 01 '17 at 15:54 on

source share

jfs · Accepted Answer · 2013-04-30 23:49

Here's the Python 3 code, which generates 1,000,000 "random" lowercase letters in 0.28 seconds (see also the 0.11 second solution at the end, the @Ashwini Chaudhary code from the question takes 0.55 seconds on my machine, @ Markku K. code is 0.53 ):

 #!/usr/bin/env python3 import os import sys def write_random_lowercase(n): min_lc = ord(b'a') len_lc = 26 ba = bytearray(os.urandom(n)) for i, b in enumerate(ba): ba[i] = min_lc + b % len_lc # convert 0..255 to 97..122 sys.stdout.buffer.write(ba) write_random_lowercase(1000000)

% len_lc crosses out the distribution (see the end on how to fix it), although it still satisfies the conditions (ascii, lower case, frequencies 1, 2, 3 letter sequences):

 $ python3 generate-random.py | python3 check-seq.py

where check-seq.py :

 #!/usr/bin/env python3 import sys from collections import Counter from string import ascii_lowercase def main(): limits = [40000, 2000, 100] s = sys.stdin.buffer.readline() # a single line assert 1000000 <= len(s) <= 1000002 # check length +/- newline s.decode('ascii','strict') # check ascii assert set(s) == set(ascii_lowercase.encode('ascii')) # check lowercase for n, lim in enumerate(limits, start=1): freq = Counter(tuple(s[i:i+n]) for i in range(len(s))) assert max(freq.values()) <= lim, freq main()

Note: on acm.timus.ru generate-random.py is given "Exceeding the output limit".

To improve performance, you can use bytes.translate() method ( 0.11 seconds):

 #!/usr/bin/env python3 import os import sys # make translation table from 0..255 to 97..122 tbl = bytes.maketrans(bytearray(range(256)), bytearray([ord(b'a') + b % 26 for b in range(256)])) # generate random bytes and translate them to lowercase ascii sys.stdout.buffer.write(os.urandom(1000000).translate(tbl))

How to fix `% len_lc` skew

256 (the number of bytes) is not evenly divided by 26 (the number of Latin letters is lower), so the formula min_lc + b % len_lc makes some values less rare than others, for example:

 #!/usr/bin/env python3 """Find out skew: x = 97 + y % 26 where y is uniform from [0, 256) range.""" from collections import Counter, defaultdict def find_skew(random_bytes): char2freq = Counter(chr(ord(b'a') + b % 26) for b in random_bytes) freq2char = defaultdict(set) for char, freq in char2freq.items(): freq2char[freq].add(char) return {f: ''.join(sorted(c)) for f, c in freq2char.items()} print(find_skew(range(256))) # -> {9: 'wxyz', 10: 'abcdefghijklmnopqrstuv'}

Here, the input range(256) evenly distributed (each byte occurs exactly once), but the 'wxyz' letters on the output are less common than the other 9 vs. 10 occurrences. To fix this, you can discard unaligned bytes:

 print(find_skew(range(256 - (256 % 26)))) # -> {9: 'abcdefghijklmnopqrstuvwxyz'}

Here, the input of uniformly distributed bytes in the range [0, 234) output evenly distributed over the strings of letters ascii.

bytes.translate() takes a second argument to specify the bytes to delete:

 #!/usr/bin/env python3 import os import sys nbytes = 256 nletters = 26 naligned = nbytes - (nbytes % nletters) tbl = bytes.maketrans(bytearray(range(naligned)), bytearray([ord(b'a') + b % nletters for b in range(naligned)])) bytes2delete = bytearray(range(naligned, nbytes)) R = lambda n: os.urandom(n).translate(tbl, bytes2delete) def write_random_ascii_lowercase_letters(write, n): """*write* *n* random ascii lowercase letters.""" while n > 0: # R(n) expected to drop `(nbytes - nletters) / nbytes` bytes # to compensate, increase the initial size n -= write(memoryview(R(n * nbytes // naligned + 1))[:n]) write = sys.stdout.buffer.write write_random_ascii_lowercase_letters(write, 1000000)

If a random generator ( os.urandom here) creates long sequences of bytes that are outside the aligned range ( >=234 ), then the while can execute many times.

Time productivity can be improved by another order if instead of random.getrandbits(8*n).to_bytes(n, 'big') "rel =" NOFOLLOW "> os.urandom(n) . The first uses Mersenne Twister as the main generator , which may be faster than os.urandom() , which uses sources provided by the operating system, which is more secure if you use a random string for secrets.

Fastest way to generate a large random string with lower latin letters

How to fix % len_lc skew

More articles:

How to fix `% len_lc` skew