All but one of these ideas require O(N) memory for the index, but if you use an array.array or numpy.ndarray instead of a list, that's roughly N*4 bytes, which is significantly smaller than the whole file. (I'll just use a plain list below; if you need help converting to a more compact type, I can show that too.)
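For example, here's a minimal sketch of what a more compact index might look like, assuming the line count fits in the array's integer type (the variable names and the placeholder linecount are mine, not from the snippets below):

    import array
    import random

    linecount = 100_000_000                       # placeholder: however many lines the file has
    indices = array.array('i', range(linecount))  # 4 bytes per entry, enough for ~2 billion lines
    random.shuffle(indices)                       # works on any mutable sequence, including array.array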
Using a temporary database and an index list:
    import contextlib
    import dbm
    import os
    import random

    with contextlib.closing(dbm.open('temp.db', 'n')) as db:
        with open(path) as f:
            for i, line in enumerate(f):
                db[str(i)] = line
            linecount = i + 1
        shuffled = list(range(linecount))
        random.shuffle(shuffled)               # shuffles in place; shuffle() returns None
        with open(path + '.shuffled', 'wb') as f:
            for i in shuffled:
                f.write(db[str(i)])            # dbm hands the values back as bytes
    os.remove('temp.db')
That's 2N single-line disk operations and 2N single-dbm-key operations; each dbm operation is roughly O(log N) disk operations, so the whole thing is on the order of 2N log N single-disk-operation equivalents, for a total complexity of O(N log N).
If you use a relational database like sqlite3 instead of dbm, you don't even need an index list, because you can just do this:
SELECT * FROM Lines ORDER BY RANDOM()
This has the same time complexity as above, and the space complexity is O(1) instead of O(N) in theory. In practice, you need a DBMS that can feed you one row at a time from a 100M-row set without storing all 100M rows anywhere on either side.
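As a rough sketch of how the sqlite3 variant might be wired up (the table and column names, the temporary file name, and the streaming-insert loop are my own assumptions, not anything specified above):

    import contextlib
    import os
    import sqlite3

    with contextlib.closing(sqlite3.connect('temp.sqlite')) as conn:
        conn.execute('CREATE TABLE Lines (line TEXT)')
        with open(path) as f:
            conn.executemany('INSERT INTO Lines (line) VALUES (?)',
                             ((line,) for line in f))
        conn.commit()
        with open(path + '.shuffled', 'w') as out:
            # the cursor hands back one row at a time; whether the DBMS has to
            # materialize the whole ordering somewhere is up to it
            for (line,) in conn.execute('SELECT line FROM Lines ORDER BY RANDOM()'):
                out.write(line)
    os.remove('temp.sqlite')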
A different option that avoids a temporary database is O(N**2) in theory, but in practice it may be faster if you happen to have enough memory for the line cache to help:
    import linecache
    import random

    with open(path) as f:
        linecount = sum(1 for _ in f)
    shuffled = list(range(1, linecount + 1))   # linecache numbers lines from 1
    random.shuffle(shuffled)
    with open(path + '.shuffled', 'w') as f:
        for i in shuffled:
            f.write(linecache.getline(path, i))
Finally, at the cost of doubling the size of the index, we can avoid temporary disk storage entirely. In practice, though, this can be much slower, because you're doing a lot more random-access reads, which drives aren't nearly as good at.
    import random

    with open(path, 'rb') as f:            # binary mode, so tell()/seek() use plain byte offsets
        # byte offset at which each line starts: 0 for the first line,
        # then f.tell() after each line has been read
        linestarts = [0]
        for line in f:
            linestarts.append(f.tell())
        # (start, stop) byte range for every line
        lineranges = list(zip(linestarts, linestarts[1:]))
        random.shuffle(lineranges)
        with open(path + '.shuffled', 'wb') as f1:
            for start, stop in lineranges:
                f.seek(start)
                f1.write(f.read(stop - start))