Removing String Parts with Sed

I have data rows that look like this:

 sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta
 sp_A0A342_ATPB_COFAR_9_-_contigs_full.fasta
 sp_A0A373_RK16_COFAR_10_-_contigs_full.fasta
 sp_A0A373_RK16_COFAR_8_+_contigs_full.fasta
 sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta

How can I use sed to remove everything after the 4th field (fields are separated by _) in each row, so that I end up with:

 sp_A0A342_ATPB_COFAR
 sp_A0A342_ATPB_COFAR
 sp_A0A373_RK16_COFAR
 sp_A0A373_RK16_COFAR
 sp_A0A4W3_SPEA_GEOSL
+6
linux unix bash sed
6 answers

cut works better.

 cut -d_ -f 1-4 old_file 

This simply uses _ as the field delimiter and keeps fields 1 through 4.
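As a quick check, here is a minimal sketch that feeds two of the rows from the question to cut on standard input instead of reading old_file. The variable names rows and trimmed are only for illustration.

```shell
# Two sample rows from the question, held in a variable instead of a file.
rows='sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta
sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta'

# -d_ sets the delimiter to "_"; -f 1-4 keeps the first four fields.
trimmed=$(printf '%s\n' "$rows" | cut -d_ -f 1-4)

printf '%s\n' "$trimmed"
```

This prints sp_A0A342_ATPB_COFAR and sp_A0A4W3_SPEA_GEOSL, one per line.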

If you insist on sed:

 sed 's/\(_[^_]*\)\{4\}$//' 

The left-hand side matches exactly four repetitions of a group consisting of an underscore followed by zero or more non-underscore characters. After those four groups we must be at the end of the line. The whole match is replaced with nothing.
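Note that this expression counts fields from the end of the line, whereas cut counts from the start. The two agree on the sample data only because every row has exactly four trailing fields. A small sketch of the difference follows; the second row is a made-up name with one extra trailing field and does not come from the question.

```shell
# First row is from the question; the second is a hypothetical row
# with five trailing fields instead of four.
rows='sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta
sp_A0A342_ATPB_COFAR_6_+_contigs_full_v2.fasta'

# Counting from the end: strip the last four "_field" groups.
from_end=$(printf '%s\n' "$rows" | sed 's/\(_[^_]*\)\{4\}$//')

# Counting from the start: keep the first four fields.
from_start=$(printf '%s\n' "$rows" | cut -d_ -f 1-4)

printf 'sed (from end):\n%s\n' "$from_end"
printf 'cut (from start):\n%s\n' "$from_start"
```

For the hypothetical second row, sed leaves sp_A0A342_ATPB_COFAR_6 while cut still gives sp_A0A342_ATPB_COFAR, so which approach is right depends on whether the prefix or the suffix has the fixed number of fields.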

+19
 sed -e 's/\([^_]*\)_\([^_]*\)_\([^_]*\)_\([^_]*\)_.*/\1_\2_\3_\4/' infile > outfile 

Match any number of non-underscore characters, capturing what is matched between \( and \), followed by an underscore. Do this four times, then match whatever remains on the line so that it is discarded. Replace the whole match with the four captured groups joined by underscores.

+2

Here is another possibility:

 sed -E -e 's|^([^_]+(_[^_]+){3}).*$|\1|' 

where -E (like -r in GNU sed) enables extended regular expressions, which makes the pattern easier to read.
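If -E is not available, the same pattern can be spelled as a basic regular expression. The following is a sketch of an equivalent, not part of the original answer: + is written as \{1,\}, and the grouping parentheses and repetition braces are backslash-escaped.

```shell
# One sample row from the question.
row='sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta'

# Extended syntax, as in the answer above.
ere=$(printf '%s\n' "$row" | sed -E -e 's|^([^_]+(_[^_]+){3}).*$|\1|')

# Basic syntax: "+" written as \{1,\}, groups and counts escaped.
bre=$(printf '%s\n' "$row" | sed -e 's|^\([^_]\{1,\}\(_[^_]\{1,\}\)\{3\}\).*$|\1|')

printf '%s\n%s\n' "$ere" "$bre"
```

Both commands print sp_A0A4W3_SPEA_GEOSL for this row.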

Just because you can do it in sed doesn't mean you should, though. I like cut much better for this.

+2

AWK likes to play in the fields:

 awk 'BEGIN{FS=OFS="_"}{print $1,$2,$3,$4}' inputfile 

or, more generally:

 awk -v count=4 'BEGIN{FS="_"}{sep="";for(i=1;i<=count;i++){printf "%s%s",sep,$i;sep=FS};printf "\n"}' inputfile 
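
The general form can also be wrapped in a small shell function. This is only a sketch: first_fields is a hypothetical helper name, and the optional delimiter argument and the guard for rows with fewer fields than requested are assumptions added here, not part of the answer.

```shell
# first_fields is a hypothetical helper, not part of the answer above.
# Usage: first_fields COUNT [DELIMITER] < input
# The delimiter defaults to "_" and is assumed to be a single character.
first_fields() {
    awk -v count="$1" -v delim="${2:-_}" '
        BEGIN { FS = delim }
        {
            # Never ask for more fields than the row actually has.
            n = (count < NF) ? count : NF
            out = ""
            sep = ""    # reset per row so no row starts with a delimiter
            for (i = 1; i <= n; i++) {
                out = out sep $i
                sep = delim
            }
            print out
        }'
}

printf '%s\n' 'sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta' \
              'sp_A0A373_RK16_COFAR_10_-_contigs_full.fasta' | first_fields 4
```

Resetting sep for every row matters: if it carried over from the previous row, every row after the first would begin with a stray delimiter.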
+2
 sed -e 's/_[0-9][0-9]*_[+-]_contigs_full\.fasta$//' 

The accepted answer is still probably faster and generally better.

+1

Yes, cut is better, and matching from the end of the line is easier.

Still, I eventually got a match working from the beginning of each line:

  sed -r 's/(([^_]*_){3}([^_]*)).*/\1/' oldFile > newFile 
+1