Open / read command in Tcl 8.5 for large files

Sorry if the title does not fit my question; I'm still not sure how to phrase it.

In any case, I have used Tcl/Tk on Windows (wish) for a while and had no problems with the script I wrote until recently. The script is supposed to split a large text file into smaller files that can be imported into Excel (I'm talking about huge files, possibly 25M lines, which is about 2.55 GB).

My current script looks something like this:

    set data [open "file.txt" r]
    set data1 [open "File Part1.txt" w]
    set data2 [open "File Part2.txt" w]
    set data3 [open "File Part3.txt" w]
    set data4 [open "File Part4.txt" w]
    set data5 [open "File Part5.txt" w]

    set count 0

    while {[gets $data line] != -1} {
        if {$count > 4000000} {
            puts $data5 $line
        } elseif {$count > 3000000} {
            puts $data4 $line
        } elseif {$count > 2000000} {
            puts $data3 $line
        } elseif {$count > 1000000} {
            puts $data2 $line
        } else {
            puts $data1 $line
        }
        incr count
    }

    close $data
    close $data1
    close $data2
    close $data3
    close $data4
    close $data5

I change the numbers in the if conditions to get the right number of lines per file, and add or remove elseif branches where necessary.

The problem is that for the last file I received, I only get about half of the data (1.22 GB instead of 2.55 GB), and I was wondering whether there is a limit on how much Tcl can read. I tried to look it up, but I did not find anything (or at least nothing I could understand well; I am still quite an amateur in Tcl ^^;). Can anybody help me?

EDIT (update): I found a program to open large text files and was able to get a preview of the contents of the file directly. There are actually 16,756,263 lines. I changed the script to:

    set data [open "file.txt" r]
    set data1 [open "File Part1.txt" w]

    set count 0

    while {[gets $data line] != -1} {
        incr count
    }

    puts $data1 $count

    close $data
    close $data1

to find out where the script stops reading, and it stopped here: [screenshot]

There, a character that the text editor does not recognize appears in the middle of the line, displayed as a small square. I tried using fconfigure as evil otto suggested, but I'm afraid I don't quite understand how channelId, name or value work to get past this character. Um ... help?

reEDIT: I was able to figure out how fconfigure works! Thanks, evil otto! I'm not sure how I can “choose” your answer, as it is a comment rather than a proper answer ...

2 answers

Is it possible that file.txt contains binary data? On Windows, Tcl treats ^Z (Ctrl-Z, \x1a, the default -eofchar when reading) as the end of the file and stops reading there. You can disable this with fconfigure:

 fconfigure $data -eofchar {} 

See the docs for more details.
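To make the effect concrete, here is a minimal sketch that writes a small file with an embedded Ctrl-Z and counts how many lines gets returns with and without an end-of-file character. The file name eofchar_demo.txt and the helper countLines are made up for this illustration; the "\x1a" setting is applied explicitly to imitate the Windows default, so the script should behave the same on other platforms.

```tcl
# Hypothetical demo file name; any writable path works.
set path [file join [pwd] eofchar_demo.txt]

# Write three lines, with a Ctrl-Z (\x1a) byte embedded in the second one.
set out [open $path w]
fconfigure $out -translation binary
puts -nonewline $out "line 1\nline \x1a 2\nline 3\n"
close $out

# Illustrative helper: count the lines gets returns for a given -eofchar value.
proc countLines {path eofchar} {
    set in [open $path r]
    fconfigure $in -eofchar $eofchar
    set n 0
    while {[gets $in line] != -1} {
        incr n
    }
    close $in
    return $n
}

# "\x1a" imitates the Windows default for reading; {} disables the check.
set withCtrlZ [countLines $path "\x1a"]
set disabled  [countLines $path {}]

puts "eofchar ^Z: $withCtrlZ lines, eofchar disabled: $disabled lines"

file delete $path
```

With the end-of-file character active, reading should stop at the Ctrl-Z byte, so the first count comes out smaller than the second. That matches the symptom in the question: the loop ends early without any error.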


I ran your script on a Mac (which is Unix-based) and noticed the following:

  • incr count should be at the beginning of the loop - a minor point.
  • More importantly, file.txt contains 25M lines, but you split it unevenly: the first four parts get about 1M lines each, and the rest goes into File Part5.txt. If you want to split the file evenly, the break points should be 20M, 15M, 10M and 5M.
  • Other than that, I did not notice any data loss. I do not have a Windows machine to try it on.
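
One way to avoid hand-editing break points is to start a new output file every N lines. The following is only a sketch under assumptions: the proc name splitByLines, the output naming pattern and the tiny demo file are invented for illustration and are not part of the original script. It also disables -eofchar on the input, as suggested in the other answer.

```tcl
# Split $path into pieces of at most $linesPerPart lines each.
# Output files are named "<prefix><n>.txt" (an assumed naming pattern).
# Returns the list of file names that were created.
proc splitByLines {path prefix linesPerPart} {
    set in [open $path r]
    fconfigure $in -eofchar {}    ;# do not stop at an embedded Ctrl-Z

    set part 0
    set count 0
    set out ""
    set names {}

    while {[gets $in line] != -1} {
        # Every $linesPerPart lines, close the current part and open the next.
        if {$count % $linesPerPart == 0} {
            if {$out ne ""} { close $out }
            incr part
            set name "$prefix$part.txt"
            set out [open $name w]
            lappend names $name
        }
        puts $out $line
        incr count
    }

    if {$out ne ""} { close $out }
    close $in
    return $names
}

# Small self-check on a hypothetical 10-line demo file, 4 lines per part.
set demo split_demo.txt
set f [open $demo w]
for {set i 1} {$i <= 10} {incr i} {
    puts $f "row $i"
}
close $f

set created [splitByLines $demo "split_demo_part" 4]
puts $created
```

For the file in the question, a call such as `splitByLines "file.txt" "File Part" 5000000` would give the even 5M-line split described above; the line count per part is the only number that needs changing.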
