Python - finding elements in hundreds of large, gzip files

Unfortunately, I work with an extremely large corpus that spans hundreds of .gz files (about 24 GB compressed). Python is really my native language (hah), but I was wondering whether I have run into a problem that will require learning a "faster" language?

Each .gz file contains one plain-text document; each is about 56 MB gzipped and about 210 MB unpacked.

Each line holds an n-gram (bigrams, trigrams, quadrigrams, etc.) with its frequency count on the right. I basically need to create a file that stores, for each 4-gram, the frequencies of all of its substrings along with its own (i.e. 4 unigram frequencies, 3 bigram frequencies and 2 trigram frequencies, plus the 4-gram's own count, for a total of 10 data points). Each n-gram order has its own directory (for example, all the bigrams sit in their own set of 33 .gz files).
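For concreteness, here is a small sketch of how those sub-grams could be enumerated for one 4-gram; this reflects my reading of the layout described above (the 4-gram's own count supplies the 10th data point):

```python
def subgrams(tokens):
    # tokens: the 4 words of a quadrigram. Yields every contiguous
    # sub-n-gram: 4 unigrams, 3 bigrams and 2 trigrams (9 in total).
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

print(list(subgrams(["a", "b", "c", "d"])))
# ['a', 'b', 'c', 'd', 'a b', 'b c', 'c d', 'a b c', 'b c d']
```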

I know the simple brute-force solution, and I know which module to import to work with gzipped files in Python, but I was wondering whether there is anything that won't take me a week of CPU time? Any advice on speeding this up, even slightly, would be greatly appreciated!
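The module in question is the standard library's `gzip`, which lets you stream a compressed file line by line without unpacking it to disk. A minimal sketch (the tab delimiter and function name are assumptions; adjust the split to the corpus's actual format):

```python
import gzip

def iter_counts(path):
    # Stream a gzipped n-gram file, yielding (ngram, count) pairs.
    # Assumes "ngram<TAB>count" lines; rsplit guards against
    # n-grams that might themselves contain the delimiter.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            ngram, count = line.rstrip("\n").rsplit("\t", 1)
            yield ngram, int(count)
```

Opening in text mode (`"rt"`) decodes on the fly, so memory use stays at one line at a time regardless of file size.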

1 answer

It would help to see a sample of a few lines and the expected output. But from what I understand, here are some ideas.

Of course, you don't want to re-process all the files every time you handle one file or, even worse, one 4-gram. Ideally, you would go through each file exactly once. So my first suggestion is to maintain an intermediate list of frequencies (those sets of 10 data points) that at first reflects only a single file. Then, when you process the second file, you update the frequencies of the items you encounter (and presumably add new items). You continue this way, incrementing the frequencies as you find more matching n-grams. At the end, write everything out.

More specifically, on each iteration I would read a new input file into memory as a map from string to number, where the string is an n-gram with its words separated by spaces and the number is its frequency. I would then process the intermediate file from the previous iteration, which contains the expected output with incomplete values, e.g. "abcd: 10 20 30 40 5 4 3 2 1 1" (I'm guessing at the output format you're after here). For each line, I would look up all of its sub-grams in the map, update the counts, and write the updated line to a new output file. That file is then used in the next iteration, until all the input files have been processed.
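The per-iteration update described above might look like this sketch; the slot ordering and function name are my own assumptions, not something given in the question:

```python
def update_row(quadgram, partial, counts):
    # quadgram: the 4-gram as a space-separated string.
    # partial: its 10 running counts, in the fixed order
    #   [4-gram, unigrams 1-4, bigrams 1-3, trigrams 1-2]
    #   (any consistent convention works).
    # counts: the current input file loaded as {ngram: freq}.
    tokens = quadgram.split()
    grams = [quadgram]
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    # Add whatever this file contributes; n-grams absent from
    # this file contribute nothing in this round.
    return [p + counts.get(g, 0) for p, g in zip(partial, grams)]
```

Because each file is loaded once and each intermediate line touches only its own 10 lookups, the total work is one pass per file rather than one pass per 4-gram.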
