tool I can use to determine the most common words... ... so something with reasonable accuracy is good enough.
I suggest you try unix text tools first. This approach comes from the Word Tokenization lesson of the Coursera Natural Language Processing course (YouTube version here). A simple tutorial is here .
We use tr , sort and uniq for this purpose. If you have used unix text tools before, here is the full command:
cat *.txt | tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
Otherwise, an explanation of each part is provided below.
tr -sc 'A-Za-z' '\n' < filename.txt
This command reads filename.txt and replaces every run of non-letter characters with a single newline, so each word ends up on its own line.
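To see the effect, you can pipe a small made-up string (any text would do) into the same tr command:
echo 'Hello, world! Hello again.' | tr -sc 'A-Za-z' '\n'
which prints each word on its own line:
Hello
world
Hello
again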
cat *.txt | tr -sc 'A-Za-z' '\n'
Same as above, but for all the .txt files in your directory; cat concatenates them into one stream before tokenizing.
cat *.txt | tr -sc 'A-Za-z' '\n' | sort
Pipes the word list into sort, which puts the words in alphabetical order so identical words end up on adjacent lines (at the top you will typically see a long run of the same word, e.g. "a").
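As a tiny illustration with made-up words:
printf 'the\ncat\nthe\n' | sort
prints
cat
the
the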
cat *.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
Pipes the sorted output into uniq -c, which collapses each run of identical lines into a single line prefixed with its count.
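Continuing the made-up example:
printf 'the\ncat\nthe\n' | sort | uniq -c
prints something like
      1 cat
      2 the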
cat *.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
Sorts once more, this time numerically (-n) and in reverse order (-r), so the most frequent words appear first.
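With the same toy input as before:
printf 'the\ncat\nthe\n' | sort | uniq -c | sort -n -r
prints
      2 the
      1 cat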
The remaining problem is that "And" and "and" are counted as two different words.
cat *.txt | tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
or
cat *.txt | tr '[:upper:]' '[:lower:]' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
Either version converts everything to lowercase first and then feeds it into the same pipeline as before. This will give you the most common words across your files.
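As a quick sanity check of the case folding on an invented string:
echo 'And AND and' | tr 'A-Z' 'a-z'
prints "and and and". If you only care about, say, the ten most frequent words, you can append head to the end of the pipeline:
cat *.txt | tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | head -n 10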