Removing lines with sed or awk

I have a data.txt file like this:

    >1BN5.txt
    207
    208
    211
    >1B24.txt
    88
    92

I have an F1 folder that contains text files.

The 1BN5.txt file in the F1 folder is shown below.

    ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
    ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
    ATOM 422 C SER A 248 70.124 -29.955 8.226 1.00 55.81 C
    ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
    ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
    ATOM 626 N MET B 87 1.054 -3.071 -5.633 1.00 10.00 N
    ATOM 627 CA MET B 87 -0.213 -2.354 -5.826 1.00 10.00 C

The 1B24.txt file in the F1 folder is shown below.

    ATOM 630 CB MET B 87 -0.476 -2.140 -7.318 1.00 10.00 C
    ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
    ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
    ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
    ATOM 644 CA ALA B 94 -2.560 -5.149 -4.675 1.00 10.00 C

In the 1BN5.txt file I only need the rows whose 6th column is 207, 208, or 211; I want to delete all other lines from that file. Similarly, in the 1B24.txt file I only need the lines whose 6th column is 88 or 92.

Desired output:

File 1BN5.txt:

    ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
    ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
    ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
    ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H

File 1B24.txt:

    ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
    ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
    ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
7 answers

Here is one way using GNU awk. Run as:

 awk -f script.awk data.txt 

The contents of script.awk:

    /^>/ {
        # a ">name" line in data.txt introduces a new target file; strip the ">"
        file = substr($1, 2)
        next
    }

    {
        # remember each wanted column-six value for the current file
        a[file][$1]
    }

    END {
        for (i in a) {
            while ((getline line < ("./F1/" i)) > 0) {
                split(line, b)
                for (j in a[i]) {
                    if (b[6] == j) {
                        print line > ("./F1/" i ".new")
                    }
                }
            }
            system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
        }
    }

Alternatively, here it is as a one-liner:

 awk '/^>/ { file = substr($1,2); next } { a[file][$1] } END { for (i in a) { while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,b); for (j in a[i]) if (b[6]==j) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt 


If your installed version of awk is older than GNU Awk 4.0.0, you can try the following. Run as:

 awk -f script.awk data.txt 

The contents of script.awk:

    /^>/ {
        file = substr($1, 2)
        next
    }

    {
        # no true multidimensional arrays before gawk 4.0.0:
        # collect the values in a SUBSEP-separated string instead
        a[file] = (a[file] ? a[file] SUBSEP : "") $1
    }

    END {
        for (i in a) {
            split(a[i], b, SUBSEP)
            while ((getline line < ("./F1/" i)) > 0) {
                split(line, c)
                for (j in b) {
                    if (c[6] == b[j]) {
                        print line > ("./F1/" i ".new")
                    }
                }
            }
            system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
        }
    }

Alternatively, here it is as a one-liner:

 awk '/^>/ { file = substr($1,2); next } { a[file]=( a[file] ? a[file] SUBSEP : "") $1 } END { for (i in a) { split(a[i],b,SUBSEP); while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,c); for (j in b) if (c[6]==b[j]) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt 


Note that this script does exactly what you describe. It expects files such as 1BN5.txt and 1B24.txt to be in the F1 folder under the current working directory. It will also overwrite your original files; if that is not the desired behavior, remove the system() call. HTH.
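
For example, a minimal non-destructive variation of the END block (a sketch; it leaves the filtered lines in ./F1/<name>.new files instead of renaming them over the originals) would be:

    END {
        for (i in a) {
            while ((getline line < ("./F1/" i)) > 0) {
                split(line, b)
                for (j in a[i])
                    if (b[6] == j)
                        print line > ("./F1/" i ".new")
            }
            # no system() call: the original files in F1 are left untouched
        }
    }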

Results:

Contents of F1/1BN5.txt:

    ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
    ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
    ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
    ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H

Contents of F1/1B24.txt:

    ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
    ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
    ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N

Do not try to delete lines from an existing file; instead, create a new file containing only the lines you want to keep:

    awk '$6 == 207 || $6 == 208 || $6 == 211 { print }' 1bn5.txt > output.txt
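
If you then want to replace the original file, one careful pattern (a sketch; awk cannot safely redirect onto its own input file, so write to a temporary file first) is:

    awk '$6 == 207 || $6 == 208 || $6 == 211' 1bn5.txt > 1bn5.txt.new &&
        mv 1bn5.txt.new 1bn5.txt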

Assuming GNU awk, run this command from the directory containing data.txt:

 awk -F">" '{if($2 != ""){fname=$2}if($2 == ""){term=$1;system("grep "term" F1/"fname" >>F1/"fname"_results");}}' data.txt 

This parses data.txt for file names and search terms, then calls grep from inside awk to append the matches for each file and term listed in data.txt to a new file in F1 named originalfilename.txt_results.

If you then want to completely replace the source files, you can run this command:

 grep "^>.*$" data.txt | sed 's/>//' | xargs -I{} find F1 -name {}_results -exec mv F1/{}_results F1/{} \; 

This solution plays some tricks with the record separator: data.txt is read with > as the record separator, while the other files use a newline.

 awk ' BEGIN {RS=">"} FNR == 1 { # since the first char in data.txt is the record separator, # there is an empty record before the real data starts next } { n = split($0, a, "\n") file = "F1/" a[1] newfile = file ".new" RS="\n" while (getline < file) { for (i=2; i<n; i++) { if ($6 == a[i]) { print > newfile break } } } RS=">" system(sprintf("mv \"%s\" \"%s.bak\" && mv \"%s\" \"%s\"", file, file, newfile, file)) } ' data.txt 

This moves all the files in F1 to a temporary directory named "backup" and then recreates only the non-empty result files in F1:

 mv F1 backup && mkdir F1 && awk ' NF==FNR { if (sub(/>/,"")) { file=$0 ARGV[ARGC++] = "backup/" file } else { tgt[file,$0] = "F1/" file } next } (FILENAME,$6) in tgt { print > tgt[FILENAME,$6] } ' data.txt && rm -rf backup 

If you want to keep the files that end up empty too, and/or you want to keep the backup directory, just get rid of the "&& rm .." at the end (at the very least, do this while testing).

EDIT: FYI, this is one case where you could argue that getline is not entirely wrong, since it parses the first file, which is completely different from the rest of the files in structure and intent, so parsing that one file differently from the others won't cause any maintenance headaches later:

 mv F1 backup && mkdir F1 && awk -v data="data.txt" ' BEGIN { while ( (getline line < data) > 0 ) { if (sub(/>/,"",line)) { file=line ARGV[ARGC++] = "backup/" file } else { tgt[file,line] = "F1/" file } } } (FILENAME,$6) in tgt { print > tgt[FILENAME,$6] } ' && rm -rf backup 

but, as you can see, it makes the script a little more complicated (though a bit more efficient, since there is no longer an FNR == NR test in the main body).


Definitely a job for awk:

    $ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt
    ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
    ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
    ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
    ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H

    $ awk '$6==92||$6==88 { print }' 1B24.txt
    ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
    ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
    ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N

Redirect to save output:

 $ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt > output.txt 
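
If the wanted numbers change often, they could also be passed in as a variable rather than hard-coded (a sketch; the keep list shown is just an example):

    awk -v keep='207 208 211' '
    BEGIN { split(keep, k); for (i in k) want[k[i]] }  # build a lookup set
    $6 in want
    ' 1bn5.txt > output.txt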

I do not think you can do this with sed alone. You need a loop to read the data.txt file, for example in a bash script:

    #!/bin/bash
    # First remove all possible "problematic" characters from data.txt, storing the
    # result in data.clean.txt. This removes everything except A-Z, a-z, 0-9,
    # a leading ">", and ".".
    sed 's/[^A-Za-z0-9>.]//g;s/\(.\)>/\1/g;/^$/d' data.txt >| data.clean.txt

    # Next determine which lines to keep:
    cat data.clean.txt | while read line; do
        if [[ "${line:0:1}" == ">" ]]; then
            # If the input starts with ">", the remainder names the current file
            file="${line:1}"
        else
            # If the value is in the sixth column, append "keep" to the line.
            # Columns are assumed to be separated by one or more spaces.
            # "+" is a GNU extension, so we need the -r switch.
            sed -i -r "/^[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +$line +/s/$/keep/" "$file"
        fi
    done

    # Finally delete the unwanted lines, i.e. those without "keep":
    # (assumes each file appears only once in data.txt)
    cat data.clean.txt | while read line; do
        if [[ "${line:0:1}" == ">" ]]; then
            sed -i -n "/keep/{s/keep//g;p;}" "${line:1}"
        fi
    done
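
Assuming the script is saved as filter.sh (a hypothetical name), it would be run from the directory that contains both data.txt and the data files themselves; since sed -i edits the files in place, trying it on copies first is prudent:

    bash filter.sh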
