I have some problems trying to split large files (say, about 10 GB). The basic idea is simply to read the lines and group every, say, 40,000 lines into a single file. But there are two ways to "read" a file.
1) The first is to read the WHOLE file at once and turn it into a list. But that requires loading the WHOLE file into memory, which is painful for a file this large. (I think I have asked such questions before.) In Python, the approaches I have tried for reading the WHOLE file include:
input1 = f.readlines()

input1 = commands.getoutput('zcat ' + file).splitlines(True)

input1 = subprocess.Popen(["cat", file], stdout=subprocess.PIPE, bufsize=1)
Well, then I can simply group 40,000 lines into a single file: input1[40000:80000], input1[80000:120000], and so on. The advantage of using a list is that we can easily point to specific lines.
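For illustration, here is a minimal sketch of that whole-file approach (the file name and the out_%d.txt output names are made-up examples):

fileName = 'input.txt'                  # example name, not a real path

with open(fileName) as f:
    lines = f.readlines()               # the whole file ends up in memory
for i in range(0, len(lines), 40000):
    chunk = lines[i:i + 40000]          # one 40,000-line slice
    with open('out_%d.txt' % (i // 40000), 'w') as out:
        out.writelines(chunk)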
2) The second method is to read line by line, processing each line as it is read, so the lines are never all stored in memory. Examples include:
f = gzip.open(file)
for line in f:
    blablabla...
or
for line in fileinput.FileInput(fileName):
I am sure that what gzip.open returns is not a list but a file object. And it seems we can only process it line by line; so how can I do this "split" job? How can I point to specific lines of a file object?
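For completeness, this is roughly the line-by-line version I am imagining (a sketch only, with made-up names; the gzip file is read in the default binary mode, so the output files are opened with 'wb' to match):

import gzip

fileName = 'input.gz'                   # example name, not a real path
out = None
with gzip.open(fileName) as f:          # f is a file object, not a list
    for i, line in enumerate(f):
        if i % 40000 == 0:              # start a new output file every 40,000 lines
            if out is not None:
                out.close()
            out = open('out_%d.txt' % (i // 40000), 'wb')
        out.write(line)
if out is not None:
    out.close()

And for "pointing" at a specific line range without building a list, itertools.islice(f, start, stop) yields just the lines in [start, stop) while still reading lazily.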
Thanks.