Sort large text data

I have a large file (100 million lines of tab-delimited values, about 1.5 GB). What is the fastest way to sort it based on one of the fields?

I have tried Hive. I would like to know whether this can be done faster with Python.

+8
Tags: python, sorting, bigdata
4 answers

Have you considered using the *nix sort program? In raw performance, it is likely to be faster than most Python scripts.

Use -t $'\t' to specify that the fields are tab-separated, -k n to specify the field to sort on (where n is the field number), and -o outputfile if you want to write the result to a new file. Example:

 sort -t $'\t' -k 4 -o sorted.txt input.txt 

sorts input.txt on its 4th field and writes the result to sorted.txt.
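
If the goal is to drive this from Python anyway, one option is simply to invoke sort via subprocess rather than re-implementing the sort in Python. A minimal sketch, assuming a Unix-like sort is available on the PATH (the filenames are illustrative):

import subprocess

# Shell out to the system sort; passing a literal tab character as the -t
# argument avoids the shell-specific $'\t' quoting.
subprocess.run(
    ['sort', '-t', '\t', '-k', '4', '-o', 'sorted.txt', 'input.txt'],
    check=True,
)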

+16

You want to build an in-memory index for the file:

  • create empty list
  • open file
  • read it line by line (using f.readline()) and append to the list a tuple consisting of the value you want to sort on (extracted with line.split('\t')[n].strip(), where n is the field index) and the line's offset in the file (which you can get by calling f.tell() before calling f.readline())
  • close file
  • sort list

Then, to print the sorted file, reopen the file and, for each item in your list, use f.seek(offset) to move the file pointer to the beginning of the line, f.readline() to read the line, and print the line.

Optimization: you can also store the length of each line in the list, so that you can use f.read(length) at the printing stage.

Sample code (optimized for readability, not speed):

def build_index(filename, sort_col):
    # Build a list of (key, offset, length) tuples, sorted by key.
    index = []
    f = open(filename)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        length = len(line)
        col = line.split('\t')[sort_col].strip()
        index.append((col, offset, length))
    f.close()
    index.sort()
    return index


def print_sorted(filename, col_sort):
    index = build_index(filename, col_sort)
    f = open(filename)
    for col, offset, length in index:
        # Jump to the start of the line and read exactly that line back.
        f.seek(offset)
        print(f.read(length).rstrip('\n'))
    f.close()


if __name__ == '__main__':
    filename = 'somefile.txt'
    sort_col = 2
    print_sorted(filename, sort_col)
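
Note that the index itself (one tuple per line, 100 million of them) must still fit in memory; only the line contents stay on disk.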
+7

Split the input into files small enough to be sorted in memory. Sort each file in memory. Then merge the resulting files.

Merge by reading a portion of each of the files to be merged. Reading the same amount from each file leaves enough memory for the combined result. As each merged block is produced, write it out, repeatedly appending blocks of merged data to the output file.

This minimizes file I/O and seeking back and forth across the file on disk.
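
A minimal sketch of this split/sort/merge approach, assuming tab-delimited lines and a sort on column sort_col (the chunk size and temporary filenames are illustrative, not taken from the answer above):

import heapq
import os

def sort_large_file(input_path, output_path, sort_col, lines_per_chunk=1_000_000):
    key = lambda line: line.split('\t')[sort_col]
    chunk_paths = []

    # Split: read a chunk that fits in memory, sort it, write it out.
    with open(input_path) as f:
        while True:
            chunk = [line for _, line in zip(range(lines_per_chunk), f)]
            if not chunk:
                break
            chunk.sort(key=key)
            path = 'chunk_%d.tmp' % len(chunk_paths)
            with open(path, 'w') as out:
                out.writelines(chunk)
            chunk_paths.append(path)

    # Merge: heapq.merge streams the sorted chunk files, keeping only a
    # handful of lines in memory at any time.
    files = [open(p) for p in chunk_paths]
    with open(output_path, 'w') as out:
        out.writelines(heapq.merge(*files, key=key))
    for fh in files:
        fh.close()
    for p in chunk_paths:
        os.remove(p)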

+3

I would load the file into a good relational database, index it on the field you are interested in, and then read the rows in that order.
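
A rough illustration of this idea using SQLite from the Python standard library (the table and column names are made up for the example, and it assumes exactly four tab-separated columns per row; the answer above does not specify a particular database):

import csv
import sqlite3

conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE rows (c0 TEXT, c1 TEXT, c2 TEXT, c3 TEXT)')

# Bulk-load the tab-delimited file, then index the column we sort on.
with open('input.txt', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    conn.executemany('INSERT INTO rows VALUES (?, ?, ?, ?)', reader)
conn.execute('CREATE INDEX idx_c3 ON rows (c3)')
conn.commit()

# Read the rows back in sorted order.
for row in conn.execute('SELECT * FROM rows ORDER BY c3'):
    print('\t'.join(row))
conn.close()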

+1
