Reading a very large single line text file

I have a 30 megabyte .txt file with one line of data (30 million digits).
Unfortunately, every method I tried (mmap.read(), readline(), allocating 1 GB of RAM, for loops) takes 45 minutes to read the file completely. Every method I found on the Internet seems to rely on each line being small, so that memory consumption only grows to the size of the largest line in the file. Here is the code I used:

    import time
    import mmap

    # f is a log file opened elsewhere in the script
    start = time.clock()
    z = open('Number.txt', 'r+')
    m = mmap.mmap(z.fileno(), 0)
    global a
    a = int(m.read())    # reads the whole file and converts the digits to an int
    z.close()
    end = time.clock()
    secs = end - start
    print("Number read in", "%s" % secs, "seconds.", file=f)
    print("Number read in", "%s" % secs, "seconds.")
    f.flush()
    del end, start, secs, z, m

Other than splitting the number across multiple lines, which I would prefer not to do, is there a cleaner method that won't take most of an hour?

By the way, I do not have to use text files.

I have: Windows 8.1 64-bit, 16 GB RAM, Python 3.5.1

+6
3 answers

I used the gmpy2 module to convert a string to a number.

    import time
    import gmpy2

    start = time.clock()
    z = open('Number.txt', 'r+')
    data = z.read()
    global a
    a = gmpy2.mpz(data)    # GMP's string-to-integer conversion is far faster than int()
    end = time.clock()
    secs = end - start
    print("Number read in", "%s" % secs, "seconds.", file=f)
    print("Number read in", "%s" % secs, "seconds.")
    f.flush()
    del end, secs, start, z, data

It finished in about 3 seconds: much slower than reading raw binary, but at least it gave me an integer value directly from the text file.

Thanks to everyone for the invaluable answers; I will mark this as accepted as soon as possible.

+1

Reading the file is fast (<1s):

    with open('number.txt') as f:
        data = f.read()

Converting the 30-million-digit string to an integer is what is slow:

    z = int(data)  # still waiting...

If you instead store the number as raw big- or little-endian binary data, then int.from_bytes(data, 'big') is much faster.

If I did my math correctly (note: _ means "the result of the last expression" in the Python interactive interpreter):

    >>> import math
    >>> math.log(10) / math.log(2)   # Number of bits to represent a base-10 digit.
    3.3219280948873626
    >>> 30000000 * _                 # Number of bits to represent a 30M-digit number.
    99657842.84662087
    >>> _ / 8                        # Number of bytes to represent a 30M-digit number.
    12457230.35582761                # Only ~12 MB, so the file will be smaller :^)
    >>> import os
    >>> data = os.urandom(12457231)  # Generate some random bytes.
    >>> z = int.from_bytes(data, 'big')  # Convert to integer (<1s).
    >>> z.bit_length()
    99657848
    >>> math.log10(z)                # Number of base-10 digits in the number.
    30000001.50818886
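To turn that observation into practice, here is a minimal round-trip sketch (an illustration, not the answer's own code; Number.bin is a hypothetical cache-file name): the slow decimal parse is paid exactly once, and every later load rebuilds the integer from raw bytes almost instantly.

    import time

    # One-time conversion: parse the decimal text (slow, but paid only once),
    # then cache the number as raw big-endian bytes. On Python 3.11+ the
    # int() parse may also need sys.set_int_max_str_digits(0) first.
    with open('Number.txt') as f:
        n = int(f.read().strip())
    with open('Number.bin', 'wb') as f:
        f.write(n.to_bytes((n.bit_length() + 7) // 8, 'big'))

    # Every later run: rebuild the int from the raw bytes, which is fast.
    start = time.perf_counter()
    with open('Number.bin', 'rb') as f:
        n = int.from_bytes(f.read(), 'big')
    print("Loaded in", time.perf_counter() - start, "seconds")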

EDIT: FYI, my math was wrong at first, but I have fixed it. Thanks for the 10 upvotes without noticing :^)

+11

Reading a 30 MB text file should not take much time; modern hard drives can do it in less than a second (not counting access time).

Using standard Python file IO should work fine in this case:

    with open('my_file', 'r') as handle:
        content = handle.read()

Running this on my laptop takes much less than a second.

However, converting those 30 MB to an integer is your bottleneck: Python cannot hold such a number in a fixed-width machine integer, and its arbitrary-precision string-to-int conversion takes time quadratic in the number of digits.
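To see the quadratic behavior for yourself, here is a small illustrative timing loop (not from the answer; absolute timings depend on the machine):

    import sys
    import time

    # Python 3.11+ caps str-to-int conversion length, so lift the cap there
    # (older versions simply lack the attribute).
    if hasattr(sys, 'set_int_max_str_digits'):
        sys.set_int_max_str_digits(0)

    # Parse time roughly quadruples whenever the digit count doubles,
    # which is the signature of a quadratic-time conversion.
    for ndigits in (250000, 500000, 1000000):
        s = '9' * ndigits
        start = time.perf_counter()
        int(s)
        print(ndigits, "digits:", round(time.perf_counter() - start, 2), "seconds")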

You can try the decimal module, although it is mainly intended for floating-point arithmetic.
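For completeness, a hypothetical sketch of that route (not from the answer) showing why decimal is awkward here: construction from a digit string is exact, but arithmetic rounds to the context precision.

    from decimal import Decimal, getcontext

    # Construction is exact no matter how many digits the string has...
    d = Decimal('12345678901234567890123456789012345678901234567890')

    # ...but arithmetic rounds to the context precision (default 28 digits),
    # so exact big-number work means raising the precision to cover every digit.
    getcontext().prec = 60
    print(d + 1)    # exact only because prec now covers all 50 digits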

Beyond that, there are of course multiple-precision arithmetic libraries that can be faster (and since you will probably want to work with the number afterwards, it would be wise to use such a library anyway); the accepted answer's gmpy2 approach is exactly that.

+3
source
