Text Mining - The Most Common Normalized Words

I am a researcher and have about 17,000 free-text documents, of which roughly 30-40% are associated with my outcome of interest. Is there an open-source tool I can use to identify the most common words (or even phrases, though that is not necessary) associated with the outcome, normalizing for the frequency of words that are already common across the documents? All of the documents are written by health care workers, so normalization is important because they will all share technical language, and I also want to drop words such as "the", "it", and so on.

What I am trying to do is build a tool, using regular expressions or NLP, that will then use these words to identify the outcome in new documents. I do not plan to spend a huge amount of time setting up an NLP tool, so something with reasonable accuracy is good enough.

I know SAS, SQL (using PostgreSQL), and Python, but could potentially pick up R. I have not done any NLP before. Is there software I could use whose learning curve is not too steep? Thanks!

4 answers
"... tool I can use to identify the most common words ... so something with reasonable accuracy is good enough."

I suggest you try Unix text tools first. This approach is covered in the Word Tokenization lesson of the Coursera Natural Language Processing course (also available on YouTube), and there is a simple tutorial as well.

We use tr, uniq, and sort for this purpose. If you have used Unix text tools before, here is the full command:

  cat *.txt | tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

Otherwise, an explanation of each part is provided below.

 tr -sc 'A-Za-z' '\n' < filename.txt 

This command reads filename.txt and puts each word on its own line: the -c (complement) and -s (squeeze) options make tr replace every run of non-letter characters with a single newline.

 cat *.txt | tr -sc 'A-Za-z' '\n'

The same as above, but for all of the .txt files in your directory.

 cat *.txt | tr -sc 'A-Za-z' '\n' | sort

Pipe the output to sort so that identical words end up on adjacent lines (at this point the list is alphabetical).

 cat *.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c

Pipe the sorted output to uniq -c, which collapses the duplicates and counts how many times each word occurs.

 cat *.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

Sort once more, numerically and in reverse order, so that the most frequently used words appear first.

One problem remains: words such as "And" and "and" are counted twice because they differ in case.

 cat *.txt | tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

or

 cat *.txt | tr '[:upper:]' '[:lower:]' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

Either version converts everything to lowercase before sending it through the same pipeline. This will give you the most common words across your files.
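
Since you already know Python, here is a minimal sketch of the same idea in Python; it also drops a small stop-word list so that words like "the" and "it" are excluded. The directory name and the stop-word list below are placeholder assumptions, so adjust them to your data.

    # Count the most common words across all .txt files, ignoring case and
    # a small (placeholder) stop-word list.
    import glob
    import re
    from collections import Counter

    STOP_WORDS = {"the", "it", "a", "an", "and", "or", "of", "to", "in", "is", "was"}

    counts = Counter()
    for path in glob.glob("documents/*.txt"):  # hypothetical folder holding the notes
        with open(path, encoding="utf-8") as f:
            # Lowercase, keep only alphabetic tokens, and drop stop words.
            words = re.findall(r"[a-z]+", f.read().lower())
            counts.update(w for w in words if w not in STOP_WORDS)

    # Print the 50 most frequent remaining words.
    for word, count in counts.most_common(50):
        print(count, word)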


GATE (General Architecture for Text Engineering) is a useful tool here.

Creating annotations over the corpus and composing phrases from those annotations using the GUI tool, and then running the Java Annotation Patterns Engine (JAPE), is very useful for this.

http://gate.ac.uk/sale/tao/splitch8.html#chap:jape

and

http://gate.ac.uk/sale/thakker-jape-tutorial/GATE%20JAPE%20manual.pdf

or

http://gate.ac.uk

are useful links to look at.

In one of our applications, we used this tool to extract signs and symptoms from a medical corpus.

Thanks.


NLP is certainly not easy, and it may not be required in this particular case. As for normalization, maybe tf-idf would be enough?
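
As a rough illustration of that idea, a minimal sketch with scikit-learn might look like the following; the documents, labels, and TfidfVectorizer options are placeholders rather than anything from the question.

    # Rank terms by their average tf-idf weight within the outcome-related
    # documents; tf-idf down-weights words that are common across all notes.
    from sklearn.feature_extraction.text import TfidfVectorizer
    import numpy as np

    # Placeholder data: in practice, load the 17,000 notes and mark which
    # ones are associated with the outcome.
    docs = [
        "patient reported chest pain and shortness of breath",
        "routine follow up visit with no complaints",
    ]
    labels = [1, 0]  # 1 = related to the outcome, 0 = not related

    # The built-in English stop-word list drops words like "the" and "it".
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    # Average tf-idf weight of each term within the outcome documents.
    outcome_idx = [i for i, y in enumerate(labels) if y == 1]
    mean_weights = np.asarray(X[outcome_idx].mean(axis=0)).ravel()
    terms = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0

    # Show the 20 terms most strongly associated with the outcome documents.
    for term, weight in sorted(zip(terms, mean_weights), key=lambda p: -p[1])[:20]:
        print(f"{term}\t{weight:.3f}")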


Here you can find links to some useful R packages:

http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

