Shell script to show the frequency of each word in a file and in a directory

I came across the following question in an interview:

Shell script to show the frequency of each word in a file and in a directory

 A
   A1
     File1.txt
     File2.txt
   A2
     FileA21.txt
   A3
     FileA31.txt
     FileA32.txt
 B
   B1
     FileB11.txt
     FileB12.txt
     FileB13.txt
   B2
     FileB21.txt

I believe I understood the question: A and B are two separate directories, A1, A2 and A3 are subdirectories of A, and B1 and B2 are subdirectories of B. So I answered like this:

 find . \( -name "A" -and -name "B" \) -type f -exec cat '{}' \; | awk '{c[$1]++} END {for (i in c) print i, c[i]}' 

But I still got feedback that the above script was not good enough. What is wrong with it?

1 answer

The main limitation is that the script assumes there is exactly one word per line: c[$1]++ only increments the count for the first field of each line.

The question does not say anything about the number of words per line, so I assume that was not the intent - you need to go through every word on each line. Also, what about empty lines? On an empty line, $1 is the empty string, so your script ends up counting “empty” words (which it will happily print as part of the output).
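
To illustrate, with a hypothetical three-line input (two lines of text plus a blank line, not taken from the original question), only the first field of each line is counted and the blank line produces a bogus empty key:

 # Hypothetical demonstration: "bar" is never counted, and the blank
 # third line adds a spurious "" entry to the counts.
 printf 'foo bar\nfoo\n\n' |
 awk '{c[$1]++} END {for (i in c) print i, c[i]}'
 # Prints, in some unspecified order:  "foo 2"  and  " 1"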

In awk, the number of fields on a line is stored in the built-in variable NF, so it is easy to write a loop over the words of each line and increment the corresponding count (which has the nice side effect of implicitly ignoring lines without any words).

So, I would do something like this:

 find . -type f -exec cat '{}' \; | awk '{ for (i = 1; i <= NF; i++) w[$i]++ } END { for (i in w) printf("%-10s %10d\n", i, w[i]) }' 

I removed the directory-name restrictions from the find(1) invocation for the sake of brevity, which also makes it more general.
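
If you did want to keep the restriction to just the A and B trees from the question, one possible variant (my sketch, assuming A and B are the two top-level directories shown above) is:

 # Sketch, not part of the original answer: search only the A and B trees.
 find ./A ./B -type f -exec cat '{}' \; |
 awk '{ for (i = 1; i <= NF; i++) w[$i]++ }
      END { for (i in w) printf("%-10s %10d\n", i, w[i]) }'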

This is (probably) the main problem with your solution, but the question is (intentionally) vague, and there are many more points worth discussing:

  • Should it be case sensitive? This solution treats Word and word as different words. Is that desirable?
  • What about punctuation? Should hello and hello! be considered the same word? What about commas? That is, do we need to parse and ignore punctuation marks? (One possible way to handle case and punctuation is sketched after this list.)
  • Speaking of which - what about what's versus what? Do we consider them different words? And it's versus its? English is hard!
  • Most importantly (and related to the points above), what exactly defines a word? We assumed that a word is a sequence of non-whitespace characters (awk's default field splitting). Is that right?
  • If the input contains no words at all, what should we do? This solution prints nothing - maybe we should print a warning message?
  • Is there a fixed number of words per line, or is it arbitrary? (For example, if there is exactly one word per line, your solution would be sufficient.)
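
For the case and punctuation points, here is one possible refinement - a sketch under my own assumptions (fold everything to lower case, strip leading and trailing punctuation, and assume an awk that supports POSIX character classes), not something the question requires:

 # Sketch: case-insensitive counting with leading/trailing punctuation stripped.
 find . -type f -exec cat '{}' \; |
 awk '{
        for (i = 1; i <= NF; i++) {
          w = tolower($i)                              # fold case
          gsub(/^[[:punct:]]+|[[:punct:]]+$/, "", w)   # trim punctuation
          if (w != "") count[w]++                      # skip tokens that were all punctuation
        }
      }
      END { for (i in count) printf("%-10s %10d\n", i, count[i]) }'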

FWIW, always remember that success in an interview is not a binary yes/no. It is not: “Unfortunately you could not do X, so I am going to reject you”, or: “Oh, wrong answer, you are out.” More important than the answer is the process that takes you there, and whether you are aware of (a) your assumptions and (b) the limitations of your solution. The questions above show the ability to consider edge cases, to clarify assumptions and requirements, and so on - which matters more than producing the “right” script (and there probably is no such thing as the one “right” script).
