It would help to see an example of a few input lines and the expected result. But from what I understand, here are some ideas.
Of course, you don't want to go through all the files again for every file you process or, even worse, for every single 4-gram. Ideally, you would go through each file only once. So my first suggestion is to maintain an intermediate list of frequencies (those sets of 10 data points), which at first takes only one file into account. Then, when you process the second file, you update the frequencies of all the items you encounter (and presumably add new items). You continue this way, increasing the frequencies as you find more matching n-grams. At the end, write everything out.
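To make the idea concrete, here is a minimal in-memory sketch. The record shape is my assumption, since I have not seen your data: each file is taken to yield pairs of an item and a list of 10 partial counts, and `merge_file` is a hypothetical helper name.

```python
def merge_file(table, records):
    """Fold one file's records into the running table in a single pass.

    `table` maps an item (e.g. a 4-gram) to its list of running counts.
    `records` is an iterable of (item, partial_counts) pairs; this shape
    is an assumption made for illustration.
    """
    for item, values in records:
        if item not in table:
            # First time we see this item: start from zeros.
            table[item] = [0] * len(values)
        running = table[item]
        for i, v in enumerate(values):
            running[i] += v
    return table


# Two hypothetical "files", each visited exactly once.
file1 = [("a b c d", [10, 20, 30, 40, 0, 0, 0, 0, 0, 0])]
file2 = [
    ("a b c d", [0, 0, 0, 0, 5, 4, 3, 0, 0, 0]),
    ("b c d e", [0, 20, 30, 40, 0, 0, 0, 0, 0, 0]),
]

table = {}
for records in (file1, file2):
    merge_file(table, records)
```

This only works if the whole table fits in memory. If it does not, the table can live in a file that is rewritten on each pass.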
More specifically, at each iteration I would read the new input file into memory as a map from string to number, where the string is, say, an n-gram with its tokens separated by spaces, and the number is its frequency. Then I would go through the intermediate file from the previous iteration, which contains the expected output (with incomplete values), for example "abcd: 10 20 30 40 5 4 3 2 1 1" (I am guessing at the kind of output you are looking for here). For each line, I would look up all of its subgrams in the map, update the counts, and write the updated line to a new output file. That file is then used in the next iteration, and so on until all the input files have been processed.
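A rough sketch of that loop, under assumptions I am making up for the example: input files hold one `n-gram<TAB>frequency` pair per line, the intermediate file holds lines like `a b c d: 0 0 0 0 0 0 0 0 0 0` (the 4-gram written with spaces between its tokens, one slot per contiguous subgram, seeded with zeros), and the slots are ordered shortest subgram first, left to right. All function names are hypothetical.

```python
import os


def subgrams(tokens):
    """All contiguous subgrams of `tokens`, shortest first, left to right.

    For 4 tokens this gives 4 + 3 + 2 + 1 = 10 subgrams.
    """
    n = len(tokens)
    return [
        " ".join(tokens[i:i + k])
        for k in range(1, n + 1)
        for i in range(n - k + 1)
    ]


def load_counts(path):
    """Read one input file into a dict mapping n-gram -> frequency."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            ngram, freq = line.rsplit("\t", 1)
            counts[ngram] = int(freq)
    return counts


def update_intermediate(prev_path, next_path, counts):
    """Stream the previous intermediate file and write an updated copy."""
    with open(prev_path, encoding="utf-8") as src, \
            open(next_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue
            key, _, rest = line.partition(": ")
            values = [int(v) for v in rest.split()]
            # Add the frequency of every subgram found in this input file.
            for slot, sub in enumerate(subgrams(key.split())):
                if sub in counts:
                    values[slot] += counts[sub]
            dst.write(f"{key}: {' '.join(map(str, values))}\n")


def process_all(intermediate_path, input_paths):
    """One pass over the intermediate file per input file."""
    for path in input_paths:
        counts = load_counts(path)  # only this file's map is held in memory
        tmp_path = intermediate_path + ".tmp"
        update_intermediate(intermediate_path, tmp_path, counts)
        os.replace(tmp_path, intermediate_path)
```

Only one input file's map is in memory at a time, and the intermediate file is read and written sequentially, so memory use is bounded by the largest single input file rather than by the full result.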