How to count the number of sentences in a text in R?

Question

I read the text into R using the readChar() function. I am trying to test the hypothesis that the sentences of the text contain as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which has really helped me do useful things with my text, for example counting the number of characters and the total number of occurrences of each letter in the whole text. Now I need to know the number of sentences in the whole text. Does R have any function that can help me do this? Thank you very much!

+6

r text-mining

SavedByJESUS Sep 26 '12 at 8:51


2 answers

What you are looking for is sentence tokenization, and it is not as straightforward as it seems, even in English (a sentence such as “I met Dr. Bennett, Mrs. Yohon’s ex-husband” contains full stops that do not end it).

R is definitely not the best choice for natural language processing. If you are experienced with Python, I suggest you take a look at nltk, which covers this and many other topics. You can also copy the code from this blog post, which performs sentence tokenization and word tokenization.

If you want to stick to R, I would suggest counting the end-of-sentence characters (., ?, !), since you can already count characters. One way to do this with a regex is:

    text <- 'Hello world!! Here are two sentences for you...'
    length(gregexpr('[[:alnum:] ][.!?]', text)[[1]])
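One caveat with the one-liner above: gregexpr() returns -1 when nothing matches, so length() still reports 1 for a text with no terminal punctuation. A minimal sketch that guards against this is shown here; the helper name count_sentence_ends is made up for illustration and is not part of the original answer or of any package.

```r
# Hypothetical helper (not from the original answer): counts matches of the
# same pattern, returning 0 when the text contains no match at all.
count_sentence_ends <- function(text) {
  hits <- gregexpr("[[:alnum:] ][.!?]", text)[[1]]
  # gregexpr() signals "no match" with -1, so length(hits) alone would be 1
  if (hits[1] == -1L) 0L else length(hits)
}

count_sentence_ends("Hello world!! Here are two sentences for you...")  # 2
count_sentence_ends("no terminal punctuation here")                      # 0
# Abbreviations are over-counted: "Dr." and "Mrs." both match the pattern
count_sentence_ends("I met Dr. Bennett, Mrs. Yohon's ex-husband.")       # 3
```

The last call illustrates the limitation mentioned at the top of this answer: the regex counts punctuation marks, not sentences, so abbreviations inflate the result.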

+6

gui11aume Sep 26 '12 at 9:16


SavedByJESUS · Accepted Answer · 2012-09-26T15:37:59+0000

Thanks @gui11aume for your answer. A very good package that I just found that can help with this is {openNLP}. Here is the code:

    install.packages("openNLP")           ## Installs the required natural language processing (NLP) package
    install.packages("openNLPmodels.en")  ## Installs the model files for the English language
    library(openNLP)                      ## Loads the package for use in the task
    library(openNLPmodels.en)             ## Loads the model files for the English language

    ## This text has unusual punctuation, as suggested by @gui11aume
    text = "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again."

    ## sentDetect() is the function to use. It detects and separates sentences in a text.
    ## The first argument is the string vector (or text) and the second argument is the language.
    x = sentDetect(text, language = "en")

    x          ## Displays the different sentences in the string vector (or text).
    [1] "Dr. Brown and Mrs. Theresa will be away from a very long time!!! "
    [2] "I can't wait to see them again."

    length(x)  ## Displays the number of sentences in the string vector (or text).
    [1] 2
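Note for readers on newer installations: as far as I am aware, later releases of {openNLP} no longer provide sentDetect() and instead do sentence detection through annotator objects used together with the companion {NLP} package. The following is only a sketch under that assumption; the wrapper name count_sentences_opennlp is made up for illustration, and the code needs {NLP}, {openNLP} and a working Java setup.

```r
# Hypothetical wrapper (not from the original answer). Assumes the NLP and
# openNLP packages are installed and that Java is available to openNLP.
count_sentences_opennlp <- function(text) {
  s <- NLP::as.String(text)                        # String object that annotations can index into
  annotator <- openNLP::Maxent_Sent_Token_Annotator(language = "en")
  spans <- NLP::annotate(s, annotator)             # one annotation per detected sentence
  sentences <- s[spans]                            # extract the sentence substrings
  length(sentences)
}

# Only run the example when both packages are actually installed
if (requireNamespace("NLP", quietly = TRUE) &&
    requireNamespace("openNLP", quietly = TRUE)) {
  print(count_sentences_opennlp(
    "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again."
  ))
}
```

If sentDetect() works in your installed version, the code above it can be used unchanged.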

The {openNLP} package is really good for handling natural language in R. You can find a good, short introduction to it here, or you can check the package documentation here.

The package supports three more languages. You just need to download and install the appropriate model files:

{openNLPmodels.es} for Spanish
{openNLPmodels.ge} for German
{openNLPmodels.th} for Thai
