How to trim a file - delete columns with the same value in every row

I need your help trimming a file by deleting the columns that contain the same value in every row.

    # the file I have (tab-delimited, millions of columns)
    jack    1    5    9
    john    3    5    0
    lisa    4    5    7

    # the file I want (remove the columns with the same value in all lines)
    jack    1    9
    john    3    0
    lisa    4    7

Could you give me any guidance on this? I would prefer a sed or awk solution, or maybe a Perl solution.

Thanks in advance. Best,

+7
8 answers
    #!/usr/bin/perl
    $/ = "\t";                      # read the file one tab-separated field at a time
    open(R, "<", "/tmp/filename") || die;
    while (<R>) {
        next if (($. % 4) == 3);    # meant to skip the third column of each row
        print;
    }

Well, that assumed it was always the third column. If it is the value that matters:

    #!/usr/bin/perl
    $/ = "\t";                      # read one tab-separated field at a time
    open(R, "<", "/tmp/filename") || die;
    while (<R>) {
        next if ($_ == 5);          # skip any field whose value is 5
        print;
    }

With the edit to the OP's question, the desired result becomes clear. What about:

    #!/usr/bin/perl
    open(R, "<", "/tmp/filename") || die;
    my @cols;
    while (<R>) {
        my @this = split(/\t/);
        if ($. == 1) {
            @cols = @this;                       # the first line seeds the candidate values
        } else {
            for (my $x = 0; $x <= $#cols; $x++) {
                # a column stays a candidate only while every line agrees with the first
                if (defined($cols[$x]) && !($cols[$x] ~~ $this[$x])) {
                    $cols[$x] = undef;
                }
            }
        }
    }
    close(R);

    my @del;
    print "Deleting columns: ";
    for (my $x = 0; $x <= $#cols; $x++) {
        if (defined($cols[$x])) {
            print "$x ($cols[$x]), ";
            push(@del, $x - int(@del));          # adjust the index for columns already removed
        }
    }
    print "\n";

    open(R, "<", "/tmp/filename") || die;
    while (<R>) {
        chomp;
        my @this = split(/\t/);
        foreach my $col (@del) {
            splice(@this, $col, 1);
        }
        print join("\t", @this) . "\n";
    }
    close(R);
+3

Here is a quick perl script to find out which columns can be cut.

 open FH, "file" or die $!; my @baseline = split /\t/,<FH>; #snag the first row my @linemap = 0..$#baseline; #list all equivalent columns (all of them) while(<FH>) { #loop over the file my @line = split /\t/; @linemap = grep {$baseline[$_] eq $line[$_]} @linemap; #filter out any that aren't equal } print join " ", @linemap; print "\n"; 

You can use many of the above recommendations to actually delete columns. Perhaps my favorite option would be cut, in part because the aforementioned perl script could be changed to give you the exact command (or even run it for you).

    @linemap = map { $_ + 1 } @linemap;   # cut is 1-indexed
    print "cut --complement -f " . join(",", @linemap) . " file\n";
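If you would rather have the script run that command itself (the "or even run it for you" part), one possible untested variant, reusing the same @linemap and file name:

    # Sketch: run cut directly instead of printing the command.
    # The list form of system avoids shell quoting issues.
    if (@linemap) {
        system("cut", "--complement", "-f", join(",", @linemap), "file") == 0
            or die "cut failed: $?";
    }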
+5

If you know which column to delete beforehand, then cut will be useful:

    cut --complement -f 3 filename    # cut's default delimiter is TAB; use -d to override
+3

As I understand it, you want to go through each column and check whether all of its values are the same, in which case you can delete that column. If so, I have a suggestion, though not a finished script, but I think you can figure it out. You should look at cut. It extracts parts of a line. You can use it to extract a single column, then run uniq on the output; if only one value remains after uniq, all the values in that column are identical. That way you can collect the numbers of the columns that don't matter. You will need a shell script around it to find out how many columns you have (probably using head -n 1 and counting the delimiters), run this procedure on each column while keeping the column numbers in an array, and at the end delete the columns that are of no interest. Granted, it is not awk or perl, but it should work and uses only traditional Unix tools. Of course, you could drive them from a perl script if you want :)
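Since the answer notes the tools could be driven from a perl script, here is a minimal, untested sketch of that idea. The file name, the tab delimiter, and the decision to skip the name column are my assumptions, not the answer's:

    #!/usr/bin/perl
    # Sketch only: checks each column with cut | sort -u and collects the
    # constant ones. "input.txt" is a placeholder file name.
    use strict;
    use warnings;

    my $file = "input.txt";                      # hypothetical input file
    open my $fh, '<', $file or die $!;
    my $ncols = () = split /\t/, scalar <$fh>;   # count columns from the first line
    close $fh;

    my @constant;
    for my $c (2 .. $ncols) {                    # column 1 holds the names, so skip it
        # how many distinct values does this column have across all lines?
        chomp(my $distinct = `cut -f $c $file | sort -u | wc -l`);
        push @constant, $c if $distinct == 1;
    }

    # hand the constant columns to cut --complement for removal
    print "cut --complement -f ", join(",", @constant), " $file\n" if @constant;

Like the pure shell version, this re-reads the file once per column, so it will be slow on a file with millions of columns.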

Well, even if I misunderstood the question, maybe cut will still be useful :) It seems to be one of the lesser-known tools.

+2

As far as I can tell, you need to make this a multi-pass program to satisfy your needs without blowing up memory. To start, load the first line of the file into an array.

    open FH, 'datafile.txt' or die "$!";
    my @mask;
    my @first_line = split(/\s+/, <FH>);

Then you will want to read through the remaining lines sequentially:

    while (my $row = <FH>) {
        my @next_line = split(/\s+/, $row);
        # compare each member of @first_line to @next_line;
        # on any match, set the corresponding entry of @mask to true
    }

When you get to the bottom of the file, go back to the top and use the mask to determine which columns to print.
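A rough sketch of what that final pass might look like, assuming @mask from the first pass ended up true exactly for the columns whose value never changed (my reading of the answer, not code from it):

    # Second pass (sketch): reopen the file and print only the columns that
    # varied somewhere, i.e. those whose @mask entry stayed false.
    open FH, '<', 'datafile.txt' or die "$!";
    while (my $row = <FH>) {
        chomp $row;
        my @fields = split /\s+/, $row;
        my @keep   = grep { !$mask[$_] } 0 .. $#fields;   # indices of the columns to keep
        print join("\t", @fields[@keep]), "\n";
    }
    close FH;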

+1

You can select the columns to cut, for example:

    # using bash/awk
    # I used 1000000 here since you wrote "millions of columns"; adjust it as needed
    for cols in `seq 2 1000000` ; do
        cut -d DELIMITER -f $cols FILE | awk -v c=$cols '{s+=$0} END {if (s/NR==$0) {printf("%i,",c)}}'
    done | sed 's/,$//' > tmplist

    cut --complement -d DELIMITER -f `cat tmplist` FILE

But it can be REALLY slow because it is not optimized and reads the file several times ... so be careful with huge files.

Or you can read the whole file just once with awk to work out which columns to drop, and then use cut:

    cut --complement -d DELIMITER -f `awk '{for (i=1;i<=NF;i++) {sums[i]+=$i}} END {for (i=1;i<=NF;i++) {if (sums[i]/NR==$i) {printf("%i,",i)}}}' FILE | sed 's/,$//'` FILE

HTH

+1

Not fully tested, but it seems to work for the provided test case; note that it overwrites the original file...

    #!/bin/bash
    # change 4 below to match the number of columns
    for i in {2..4}; do
        cut -f $i input | sort | uniq -c > tmp
        while read a b; do
            if [ $a -ge 2 ]; then
                awk -v field=$i '{$field="_"; print}' input > tmp2
                mv tmp2 input
            fi
        done < tmp
    done

    $ cat input
    jack    1    5    9
    john    3    5    0
    lisa    4    5    7
    $ ./cnt.sh
    $ cat input
    jack    1    _    9
    john    3    _    0
    lisa    4    _    7

I used _ here just to make the output clearer...

+1

The main problem here is that you said "millions of columns" but did not indicate how many rows. Checking every value in every row against its counterpart in each column means a very large number of comparisons.

Of course, you can prune candidate columns as you go, but you still have to check the survivors all the way to the last row. So... a lot of processing.

We can build a "seed" hash from the first two lines:

    use strict;
    use warnings;

    open my $fh, '<', "inputfile.txt" or die;
    my %matches;
    my $line     = <$fh>;
    my $nextline = <$fh>;
    my $i = 0;
    while ($line =~ s/\t(\d+)//) {
        my $num1 = $1;
        if ($nextline =~ s/\t(\d+)//) {
            if ($1 == $num1) { $matches{$i} = $num1 }
        } else {
            die "Mismatched line at line $.";
        }
        $i++;
    }

Then, using this "seed" hash, you can read the rest of the lines and remove non-matching values from the hash, for example:

    while ($line = <$fh>) {
        my $i = 0;
        while ($line =~ s/\t(\d+)//) {
            if (defined $matches{$i}) {
                $matches{$i} = undef if ($matches{$i} != $1);
            }
            $i++;
        }
    }

You could imagine a solution that strips the columns already proven to vary out of each line as it goes, but that means building an array of strings or a regular expression, and I'm not sure it wouldn't take just as long as simply scanning through the string.

Then, after processing all the lines, you will have a hash whose defined entries mark the columns with duplicated values, so you can reopen the file and print:

    open my $fh, '<', "inputfile.txt" or die;
    open my $outfile, '>', "outfile.txt" or die;
    while ($line = <$fh>) {
        my $i = 0;
        if ($line =~ s/^([^\t]+)(?=\t)//) {
            print $outfile $1;                    # the name column always stays
        } else {
            warn "Missing header at line $.\n";
        }
        while ($line =~ s/(\t\d+)//) {
            # columns recorded in %matches are the constant ones, so skip them
            print $outfile $1 unless defined $matches{$i};
            $i++;
        }
        print $outfile "\n";
    }

This is a rather heavy operation and this code has not been tested; it is meant as a hint toward a solution, and it will take some time to process the whole file. I suggest running some tests to see whether it works with your data, and tuning it from there.

If only a few columns match, it is much easier to just splice them out of each row, though I hesitate to use split on such long rows. Something like:

    while ($line = <$fh>) {
        my @line = split /\t/, $line;
        for my $key (sort { $b <=> $a } keys %matches) {
            splice @line, $key + 1, 1;
        }
        $line = join("\t", @line);
        $line =~ s/\n*$/\n/;   # awkward way to make sure we end with a single newline
        print $outfile $line;
    }

Note that we have to sort the keys in descending order so that we splice values out from the end; otherwise each removal would shift the indexes of the later columns and we would delete the wrong ones.

In any case, this is one possible way to do it. It is a pretty big operation; I would keep backups. ;)

+1
