Get the word between quotation marks

I have lines like this:

Unable to find latest released revision of 'CONTRIB_046578'. 

I need to extract the word between `revision of '` and the closing `'` (in this example, CONTRIB_046578) and, if possible, count the number of occurrences of that word using grep, sed or any other command.

6 answers

All you need is a very simple awk script that splits each line on the single quote and counts the occurrences of the word between the quotes:

 awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file 

Using @anubhava's test input file:

 $ cat file
 Unable to find latest released revision of 'CONTRIB_046572'
 Unable to find latest released revision of 'CONTRIB_046578'
 Unable to find latest released revision of 'CONTRIB_046579'
 Unable to find latest released revision of 'CONTRIB_046570'
 Unable to find latest released revision of 'CONTRIB_046579'
 Unable to find latest released revision of 'CONTRIB_046572'
 Unable to find latest released revision of 'CONTRIB_046579'
 $ awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file
 CONTRIB_046578 1
 CONTRIB_046579 3
 CONTRIB_046570 1
 CONTRIB_046572 2
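
The order in which awk's `for (w in c)` visits the array is unspecified, which is why the counts above come out unsorted. A minimal sketch, assuming a most-frequent-first listing is wanted and that lines without a quoted word should be ignored; the helper name `count_quoted` and the temporary sample file are made up for illustration:

```shell
# Hypothetical helper: count the single-quoted words in a file, most frequent first.
count_quoted() {
  # -F\' splits each line on single quotes, so $2 is the quoted word.
  # NF >= 3 skips lines that do not contain a complete quoted word.
  awk -F\' 'NF >= 3 {c[$2]++} END {for (w in c) print c[w], w}' "$1" |
    sort -k1,1nr -k2,2
}

# Sample input: the same lines as the test file above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'
EOF

count_quoted "$sample"
rm -f "$sample"
```

The count is printed first so that `sort` can order numerically on the first column and break ties on the word itself.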

The cleanest solution uses GNU grep with a Perl-compatible regex: grep -Po "(?<=')[^']+(?=')"

 $ cat file
 Unable to find latest released revision of 'CONTRIB_046578'
 Unable to find latest released revision of 'foo'
 Unable to find latest released revision of 'bar'
 Unable to find latest released revision of 'CONTRIB_046578'

 # Print occurrences
 $ grep -Po "(?<=')[^']+(?=')" file
 CONTRIB_046578
 foo
 bar
 CONTRIB_046578

 # Count matching lines (-c counts lines, not matches; here each line has one quoted word)
 $ grep -Pc "(?<=')[^']+(?=')" file
 4

 # Count occurrences of each unique word
 $ grep -Po "(?<=')[^']+(?=')" file | sort | uniq -c
 2 CONTRIB_046578
 1 bar
 1 foo
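
The -P option is a GNU grep extension and may be missing elsewhere (the grep shipped with macOS and the BSDs, for example). A sketch of a fallback, assuming the grep at hand still supports -o (common, but not required by POSIX): match the quotes together with the word and strip them afterwards. The helper name `quoted_words` is hypothetical.

```shell
# Hypothetical helper: print every single-quoted word in a file, one per line.
quoted_words() {
  # '[^'][^']*' matches a quote, one or more non-quote characters, and a quote;
  # tr then deletes the quotes themselves.
  grep -o "'[^'][^']*'" "$1" | tr -d "'"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'foo'
Unable to find latest released revision of 'bar'
Unable to find latest released revision of 'CONTRIB_046578'
EOF

quoted_words "$sample" | sort | uniq -c   # occurrences of each word
quoted_words "$sample" | wc -l            # total number of quoted words
rm -f "$sample"
```

Piping to `wc -l` counts matches rather than matching lines, so it stays correct if a line ever contains more than one quoted word.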

Here is an awk script that you can use to extract each word enclosed in single quotes and count how often it occurs:

 awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*?'"'/ ) cnt[$i]++;}} END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile 

TESTING

 cat infile
 Unable to find latest released revision of 'CONTRIB_046572'
 Unable to find latest released revision of 'CONTRIB_046578'
 Unable to find latest released revision of 'CONTRIB_046579'
 Unable to find latest released revision of 'CONTRIB_046570'
 Unable to find latest released revision of 'CONTRIB_046579'
 Unable to find latest released revision of 'CONTRIB_046572'
 Unable to find latest released revision of 'CONTRIB_046579'

OUTPUT:

 awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*?'"'/ ) cnt[$i]++;}} END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile
 CONTRIB_046579 3
 CONTRIB_046578 1
 CONTRIB_046570 1
 CONTRIB_046572 2
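
The `'"'"'` sequences in the command above exist only to get a literal single quote past the shell. A sketch of an alternative, assuming an awk that accepts octal escapes in regular expressions (gawk, mawk and the one-true-awk do), writes the quote as `\047`. It also trims anything after the closing quote, such as the trailing period in the question's sample line. The helper name `count_fields` is hypothetical.

```shell
# Hypothetical helper: count whitespace-separated fields that start with a quoted word.
count_fields() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^\047.*\047/) {   # field starts with a quote and has a closing one
        w = $i
        sub(/^\047/, "", w)       # drop the opening quote
        sub(/\047.*$/, "", w)     # drop the closing quote and anything after it
        cnt[w]++
      }
  } END { for (w in cnt) print w, cnt[w] }' "$1"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046578'.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046578'.
EOF

count_fields "$sample" | sort
rm -f "$sample"
```

Like the original, this only sees quoted words that contain no whitespace, because awk splits the line into fields first.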

Assumptions:

  • Each word can occur multiple times, and the OP wants to count the number of occurrences of each word.
  • There are no other lines in the file.

Input file:

 $ cat test.txt
 Unable to find latest released revision of 'CONTRIB_046578'.
 Unable to find latest released revision of 'CONTRIB_046572'.
 Unable to find latest released revision of 'CONTRIB_046579'.
 Unable to find latest released revision of 'CONTRIB_046570'.
 Unable to find latest released revision of 'CONTRIB_046572'.
 Unable to find latest released revision of 'CONTRIB_046578'.

Shell script for filtering and word counting:

 $ sed "s/.*'\(.*\)'.*/\1/" test.txt | sort | uniq -c
 1 CONTRIB_046570
 2 CONTRIB_046572
 2 CONTRIB_046578
 1 CONTRIB_046579
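
The second assumption matters here: sed prints a line unchanged when the substitution does not match, so an unrelated line would be counted as a "word". A sketch that relaxes it, assuming other lines may be mixed in, uses -n together with the p flag so that only lines where the substitution succeeded are printed. The helper name `extract_revisions` and the sample lines are hypothetical.

```shell
# Hypothetical helper: print only the word quoted after "revision of".
extract_revisions() {
  # -n suppresses default output; the trailing p prints a line only if it was rewritten.
  sed -n "s/.*revision of '\([^']*\)'.*/\1/p" "$1"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046578'.
Build finished with 3 warnings
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046578'.
EOF

extract_revisions "$sample" | sort | uniq -c
rm -f "$sample"
```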
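The same extraction with sed alone; the script is double-quoted so that it can contain the single quotes:

 sed "s/.*'\([^']*\)'.*/\1/" myfile.txt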

If the test file below is representative of the file in the actual task, the following may be useful.

Every line of the test file has the same layout: it is well formatted and contains 8 space-separated columns (fields), with the quoted word in the last one. A convenient solution using cut therefore looks like this:

File:

 Unable to find latest released revision of 'CONTRIB_046572'
 Unable to find latest released revision of 'CONTRIB_046578'
 Unable to find latest released revision of 'CONTRIB_046579'
 Unable to find latest released revision of 'CONTRIB_046570'
 Unable to find latest released revision of 'CONTRIB_046579'
 Unable to find latest released revision of 'CONTRIB_046572'
 Unable to find latest released revision of 'CONTRIB_046579'

Code:

 cut -d ' ' -f 8 file | tr -d "'" | sort | uniq -c 

Output:

 1 CONTRIB_046570
 2 CONTRIB_046572
 1 CONTRIB_046578
 3 CONTRIB_046579

Note on the code: the default field separator used by cut is a tab, but here the fields are separated by single spaces, so we specify the -d ' ' option. tr -d "'" then strips the quotes. The rest of the code is similar to the other answers, so I will not repeat the explanation.

General note: this code will probably not produce the desired result if the file is not well formatted as described above, for example if the number of words before the quoted word varies.
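
One way to loosen the dependency on the field count, assuming each relevant line contains one quoted word, is to split on the single quote instead of the space, so the number of words before the quote no longer matters. A sketch, with a hypothetical helper name and made-up sample lines:

```shell
# Hypothetical helper: print the text between the first pair of single quotes.
quoted_field() {
  # -d "'" makes the quote the delimiter, so field 2 is the quoted word;
  # -s skips lines that contain no quote at all.
  cut -s -d "'" -f 2 "$1"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046572'
Could not find revision of 'CONTRIB_046579'.
A line without any quoted word
Unable to find latest released revision of 'CONTRIB_046579'
EOF

quoted_field "$sample" | sort | uniq -c
rm -f "$sample"
```

Because the quotes are consumed as delimiters, the `tr -d "'"` step is no longer needed.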
