Get regions from a file that is part of regions in another file (no loops)

I have two files:

regions.txt: The first column is the name of the chromosome, the second and third are the start and end positions.

1 100 200 1 400 600 2 600 700 

coverage.txt: the first column is the name of the chromosome, the second and third are the start and end positions, and the last column is the score.

 1 100 101 5 1 101 102 7 1 103 105 8 2 600 601 10 2 601 602 15 

This file is very large, it is about 15 GB with about 300 million lines.

Basically I want to get the average of all the points in the coverage.txt file that are in each region in region.txt.

In other words, start with the first line in region.txt if there is a line in the coverage area. txt, which has the same chromosome, start-coverage is = start-region, and end coverage is <= end-region, then save your score in a new array. After completing the search in all coverages.txt, print the region of the chromosome, the beginning, the end and the average value of all the points found.

Expected Result:

 1 100 200 14.6 which is (5+7+8)/3 1 400 600 0 no match at coverages.txt 2 600 700 12.5 which is (10+15)/2 

I built the following MATLAB script, which takes a lot of time, since I have to iterate over span.txt many times. I do not know how to make a quick awk-like script.

My matlab script

 fc = fopen('coverage.txt', 'r'); ft = fopen('regions.txt', 'r'); fw = fopen('out.txt', 'w'); while feof(ft) == 0 linet = fgetl(ft); scant = textscan(linet, '%d%d%d'); tchr = scant{1}; tx = scant{2}; ty = scant{3}; coverages = []; frewind(fc); while feof(fc) == 0 linec = fgetl(fc); scanc = textscan(linec, '%d%d%d%d'); cchr = scanc{1}; cx = scanc{2}; cy = scanc{3}; cov = scanc{4}; if (cchr == tchr) && (cx >= tx) && (cy <= ty) coverages = cat(2, coverages, cov); end end covmed = median(coverages); fprintf(fw, '%d\t%d\t%d\t%d\n', tchr, tx, ty, covmed); end 

Any suggestions for using alternatives using AWK, Perl or, ..., etc. I will also be happy if someone teaches me how to get rid of all loops in my matlab script.

thanks

+6
source share
4 answers

Here is the Perl solution. I use hashes (aka dictionaries) to access different ranges through the chromosome, thereby reducing the number of iterations of the loop.

This is potentially effective since I am not doing a full loop on regions.txt on every line of input. Efficiency can perhaps be increased even further by using multithreading.

 #!/usr/bin/perl my ($rangefile) = @ARGV; open my $rFH, '<', $rangefile or die "Can't open $rangefile"; # construct the ranges. The chromosome is used as range key. my %ranges; while (<$rFH>) { chomp; my @field = split /\s+/; push @{$ranges{$field[0]}}, [@field[1,2], 0, 0]; } close $rFH; # iterate over all the input while (my $line = <STDIN>) { chomp $line; my ($chrom, $lower, $upper, $value) = split /\s+/, $line; # only loop over ranges with matching chromosome foreach my $range (@{$ranges{$chrom}}) { if ($$range[0] <= $lower and $upper <= $$range[1]) { $$range[2]++; $$range[3] += $value; last; # break out of foreach early because ranges don't overlap } } } # create the report foreach my $chrom (sort {$a <=> $b} keys %ranges) { foreach my $range (@{$ranges{$chrom}}) { my $value = $$range[2] ? $$range[3]/$$range[2] : 0; printf "%d %d %d %.1f\n", $chrom, @$range[0,1], $value; } } 

Call example:

 $ perl script.pl regions.txt <coverage.txt >output.txt 

Example output:

 1 100 200 6.7 1 400 600 0.0 2 600 700 12.5 

(because (5 + 7 + 8) / 3 = 6.66 ...)

+4
source

I used to upload files to R and compute them, but given that one of them is so huge, this will become a problem. Here are a few thoughts that could help you solve it.

  • Consider splitting coverage.txt into chromosomes. This would make the calculations less demanding.

  • Instead of coverage.txt over coverage.txt , you first read regions.txt in its entirety (I assume it is much smaller). For each region you keep an account and a number.

  • The coverage.txt process line by line. For each line, you determine the chromosome and the region to which this particular site belongs. This will require some work, but if regions.txt not too large, it may be more efficient. Add a score to the region score and add a number per unit.

An alternative, most efficient method requires that both files are sorted first by the chromosome, then by position.

  • Take the line from regions.txt . Record the chromosome and position. If there is a line in the previous loop, go to 3; otherwise go to 2.

  • Take the line from coverage.txt .

  • Check if it is in the current area.

    • yes: add a rating to the region, increase the number. Move to 2.
    • no: divide the account by number, write the current area for withdrawal, go to 1.

This last method requires some fine-tuning, but it will be most effective - it should go through each file only once and does not require storing almost anything in memory.

+1
source

Here is one way: join and awk . Run as:

 join regions.txt coverage.txt | awk -f script.awk - regions.txt 

The content of script.awk :

 FNR==NR && $4>=$2 && $5<=$3 { sum[$1 FS $2 FS $3]+=$6 cnt[$1 FS $2 FS $3]++ next } { if ($1 FS $2 FS $3 in sum) { printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3] } else if (NF == 3) { print $0 " 0" } } 

Results:

 1 100 200 6.7 1 400 600 0 2 600 700 12.5 

Alternatively, here is a single line:

 join regions.txt coverage.txt | awk 'FNR==NR && $4>=$2 && $5<=$3 { sum[$1 FS $2 FS $3]+=$6; cnt[$1 FS $2 FS $3]++; next } { if ($1 FS $2 FS $3 in sum) printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]; else if (NF == 3) print $0 " 0" }' - regions.txt 
+1
source

Here is a simple MATLAB way to map your coverage to regions:

 % extract the regions extents bins = regions(:,2:3)'; bins = bins(:); % extract the coverage - only the start is needed covs = coverage(:,2); % use histc to place the coverage start into proper regions % this line counts how many coverages there are in a region % and assigns them proper region ids. [h, i]= histc(covs(:), bins(:)); % sum the scores into correct regions (second output of histc gives this) total = accumarray(i, coverage(:,4), [numel(bins),1]); % average the score in regions (first output of histc is useful) avg = total./h; % remove every second entry - our regions are defined by start/end avg = avg(1:2:end); 

Now this works, assuming that the regions do not overlap, but I assume that it is. In addition, each entry in the coverage file must be in a particular region.

It is also trivial to “block” this approach over coverages if you want to avoid reading in the entire file. You only need the bins file, your region, which is supposedly small. You can process coatings in blocks, gradually add to total and calculate the average value at the end.

0
source

Source: https://habr.com/ru/post/927633/


All Articles