Get the word between quotation marks

Question

I have lines like this:

Unable to find latest released revision of 'CONTRIB_046578'.

I need to extract the word between revision of ' and ' (in this example, CONTRIB_046578) and, if possible, count the number of occurrences of this word using grep, sed, or any other command.

+4

linux unix grep awk sed

user1921608 Dec 21 '12 at 13:14

6 answers

The cleanest solution is grep -Po "(?<=')[^']+(?=')" (the -P option requires GNU grep):

    $ cat file
    Unable to find latest released revision of 'CONTRIB_046578'
    Unable to find latest released revision of 'foo'
    Unable to find latest released revision of 'bar'
    Unable to find latest released revision of 'CONTRIB_046578'

    # Print occurrences
    $ grep -Po "(?<=')[^']+(?=')" file
    CONTRIB_046578
    foo
    bar
    CONTRIB_046578

    # Count occurrences (-c counts matching lines, so this assumes one quoted word per line)
    $ grep -Pc "(?<=')[^']+(?=')" file
    4

    # Count unique occurrences
    $ grep -Po "(?<=')[^']+(?=')" file | sort | uniq -c
          2 CONTRIB_046578
          1 bar
          1 foo
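Two caveats apply to the commands above: grep -c counts matching lines rather than individual matches, and on a line with several quoted words the lookaround pattern also matches the text sitting between two of them. A minimal sketch that sidesteps both without -P, assuming a made-up sample.txt (the file name, its contents, and the helper name quoted_words are illustrative only), consumes the quotes in the match and strips them afterwards:

```shell
# Hypothetical sample data; the second line holds two quoted words.
cat > sample.txt <<'EOF'
Unable to find latest released revision of 'CONTRIB_046578'.
Unable to find 'foo' or 'bar'.
Unable to find latest released revision of 'CONTRIB_046578'.
EOF

# -o prints each match on its own line; tr then removes the quotes.
quoted_words() {
    grep -o "'[^']*'" "$1" | tr -d "'"
}

# Total number of quoted words (not the number of matching lines).
quoted_words sample.txt | wc -l

# Frequency of each distinct word.
quoted_words sample.txt | sort | uniq -c
```

The -o option is not part of POSIX grep, but both GNU and BSD grep provide it.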

+8

Chris seymour Dec 21 '12 at 13:37

Here is an awk script that you can use to extract each word enclosed in single quotes and count its frequency:

    awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*'"'/ ) cnt[$i]++;}} END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile

TESTING

    $ cat infile
    Unable to find latest released revision of 'CONTRIB_046572'
    Unable to find latest released revision of 'CONTRIB_046578'
    Unable to find latest released revision of 'CONTRIB_046579'
    Unable to find latest released revision of 'CONTRIB_046570'
    Unable to find latest released revision of 'CONTRIB_046579'
    Unable to find latest released revision of 'CONTRIB_046572'
    Unable to find latest released revision of 'CONTRIB_046579'

OUTPUT:

    $ awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*'"'/ ) cnt[$i]++;}} END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile
    CONTRIB_046579 3
    CONTRIB_046578 1
    CONTRIB_046570 1
    CONTRIB_046572 2

+1

anubhava Dec 21 '12 at 13:26

Assumptions:

Each word can occur multiple times, and the OP wants to count the number of occurrences of each word.
There are no other lines in the file.

Input file:

    $ cat test.txt
    Unable to find latest released revision of 'CONTRIB_046578'.
    Unable to find latest released revision of 'CONTRIB_046572'.
    Unable to find latest released revision of 'CONTRIB_046579'.
    Unable to find latest released revision of 'CONTRIB_046570'.
    Unable to find latest released revision of 'CONTRIB_046572'.
    Unable to find latest released revision of 'CONTRIB_046578'.

Shell script for filtering and word counting:

    $ sed "s/.*'\(.*\)'.*/\1/" test.txt | sort | uniq -c
          1 CONTRIB_046570
          2 CONTRIB_046572
          2 CONTRIB_046578
          1 CONTRIB_046579
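The second assumption matters: a line with no quotes does not match the substitution, so sed prints it unchanged and it ends up counted as if it were a word. One possible way around that, sketched here with a hypothetical mixed.txt that contains an unrelated line (the file and the helper name extract_quoted are invented for the example), is to suppress the default output with -n and print only the lines where the substitution succeeded:

```shell
# Hypothetical input with one line that has no quoted word.
cat > mixed.txt <<'EOF'
Unable to find latest released revision of 'CONTRIB_046578'.
Build started.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046578'.
EOF

# -n turns off automatic printing; the trailing p flag prints a line
# only when the s command actually replaced something on it.
extract_quoted() {
    sed -n "s/.*'\(.*\)'.*/\1/p" "$1"
}

extract_quoted mixed.txt | sort | uniq -c
```

With this input the "Build started." line is dropped instead of appearing in the counts.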

0

Andreas Fester Dec 21 '12 at 13:20

    sed -E "s/[^']*'([^']*)'.*/\1/" myfile.txt

0

Bohemian Dec 21 '12 at 13:22

If the test file below is representative of the file in the actual task, then the following may be useful.

Since each line of the test file is homogeneous (that is, it is well formatted and contains 8 space-separated columns, or fields), a convenient solution using cut looks like this:

File:

    Unable to find latest released revision of 'CONTRIB_046572'
    Unable to find latest released revision of 'CONTRIB_046578'
    Unable to find latest released revision of 'CONTRIB_046579'
    Unable to find latest released revision of 'CONTRIB_046570'
    Unable to find latest released revision of 'CONTRIB_046579'
    Unable to find latest released revision of 'CONTRIB_046572'
    Unable to find latest released revision of 'CONTRIB_046579'

Code:

    cut -d ' ' -f 8 file | tr -d "'" | sort | uniq -c

Output:

          1 CONTRIB_046570
          2 CONTRIB_046572
          1 CONTRIB_046578
          3 CONTRIB_046579

Note on the code: by default, cut uses a tab as the field separator. Since the fields here are separated by single spaces, we specify the -d ' ' option. The rest of the code is similar to the other answers, so I will not explain it again.

General note: this code will probably not produce the desired result if the file is not well formatted, as mentioned above.
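One way to relax the fixed-column requirement, offered as a sketch rather than as part of the answer above, is to split on the quote character itself, so that the word is always field 2 no matter how many space-separated words precede it. The file varied.txt, its contents, and the helper name quoted_field are invented for the example:

```shell
# Hypothetical input where the quoted word is not always the 8th column.
cat > varied.txt <<'EOF'
Unable to find latest released revision of 'CONTRIB_046572'
Could not find revision 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046579'
EOF

# With ' as the delimiter, field 1 is the text before the opening quote
# and field 2 is the quoted word. -s skips lines with no delimiter at all.
quoted_field() {
    cut -s -d "'" -f 2 "$1"
}

quoted_field varied.txt | sort | uniq -c
```

This still assumes the first quoted word on each line is the one of interest.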

0

Graeme walsh Jan 12 '14 at 10:52

Ed morton · Accepted Answer · 2012-12-21T14:20:58+0000

All you need is a very simple awk script to count the occurrences of each word between quotes:

    awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file

Using @anubhava's test input file:

    $ cat file
    Unable to find latest released revision of 'CONTRIB_046572'
    Unable to find latest released revision of 'CONTRIB_046578'
    Unable to find latest released revision of 'CONTRIB_046579'
    Unable to find latest released revision of 'CONTRIB_046570'
    Unable to find latest released revision of 'CONTRIB_046579'
    Unable to find latest released revision of 'CONTRIB_046572'
    Unable to find latest released revision of 'CONTRIB_046579'

    $ awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file
    CONTRIB_046578 1
    CONTRIB_046579 3
    CONTRIB_046570 1
    CONTRIB_046572 2
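The question also asks for the count of one particular word. A small variation on the same field-splitting idea, assuming the word is passed in through an awk variable (the names count_word, target, and log.txt are illustrative and not from the answer), might look like this:

```shell
# Hypothetical input file.
cat > log.txt <<'EOF'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
EOF

# $1 = word to count, $2 = file to scan.
# n + 0 forces a numeric 0 when the word never appears.
count_word() {
    awk -F"'" -v target="$1" '$2 == target {n++} END {print n + 0}' "$2"
}

count_word CONTRIB_046572 log.txt   # prints 2
count_word MISSING log.txt          # prints 0
```

Like the original one-liner, this looks only at the first quoted word on each line.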
