I have two files:
regions.txt: The first column is the name of the chromosome, the second and third are the start and end positions.
1 100 200 1 400 600 2 600 700
coverage.txt: the first column is the name of the chromosome, the second and third are the start and end positions, and the last column is the score.
1 100 101 5 1 101 102 7 1 103 105 8 2 600 601 10 2 601 602 15
This file is very large, it is about 15 GB with about 300 million lines.
Basically I want to get the average of all the points in the coverage.txt file that are in each region in region.txt.
In other words, start with the first line in region.txt if there is a line in the coverage area. txt, which has the same chromosome, start-coverage is = start-region, and end coverage is <= end-region, then save your score in a new array. After completing the search in all coverages.txt, print the region of the chromosome, the beginning, the end and the average value of all the points found.
Expected Result:
1 100 200 14.6 which is (5+7+8)/3 1 400 600 0 no match at coverages.txt 2 600 700 12.5 which is (10+15)/2
I built the following MATLAB script, which takes a lot of time, since I have to iterate over span.txt many times. I do not know how to make a quick awk-like script.
My matlab script
fc = fopen('coverage.txt', 'r'); ft = fopen('regions.txt', 'r'); fw = fopen('out.txt', 'w'); while feof(ft) == 0 linet = fgetl(ft); scant = textscan(linet, '%d%d%d'); tchr = scant{1}; tx = scant{2}; ty = scant{3}; coverages = []; frewind(fc); while feof(fc) == 0 linec = fgetl(fc); scanc = textscan(linec, '%d%d%d%d'); cchr = scanc{1}; cx = scanc{2}; cy = scanc{3}; cov = scanc{4}; if (cchr == tchr) && (cx >= tx) && (cy <= ty) coverages = cat(2, coverages, cov); end end covmed = median(coverages); fprintf(fw, '%d\t%d\t%d\t%d\n', tchr, tx, ty, covmed); end
Any suggestions for using alternatives using AWK, Perl or, ..., etc. I will also be happy if someone teaches me how to get rid of all loops in my matlab script.
thanks