Search for misspelled words in a character vector with R - "reverse" spellcheck

I am text mining a large database to create indicator variables that flag the occurrence of certain phrases in the comments field of an observation. The comments were entered by technicians, so the terms used are always consistent.

However, there are cases where a technician misspelled a word, so my grepl() call does not detect that the phrase (although misspelled) occurred in the observation. Ideally, I would like to pass each word of a phrase to a function that returns several common misspellings or typos of that word. Does such an R function exist?

With this, I could search the comments field for all possible combinations of these misspellings of the phrase and output the matches to another data frame. I could then examine each occurrence case by case to determine whether the phenomenon I am interested in was actually described by the technician.

I have Googled around, but only found links to actual spellchecker packages for R. What I'm looking for is a “reverse” spellchecker. Since the number of phrases I'm looking for is relatively small, I could realistically check for misspellings by hand; I just thought it would be nice if this ability were built into an R package for future text mining efforts.

Thank you for your time!

r spell-checking text-mining tm
1 answer

As Gavin Simpson suggested, you can use R's aspell() function. I think you need to have the aspell program installed for this. On many Linux distributions it is installed by default; I do not know about other systems, or whether it is installed along with R.
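One way to find out whether a spell-check program is available before calling aspell() is to look it up on the system path. The following is only a minimal sketch: have_program is a hypothetical helper name, not part of R. It relies on the fact that utils::aspell() searches the path for aspell, hunspell and ispell when no program is given.

```r
# Hypothetical helper (not part of R): TRUE if an executable with this
# name is found on the system path.
have_program <- function(name) {
  nzchar(Sys.which(name))[[1]]
}

# By default utils::aspell() looks for these three programs, in this order.
available <- vapply(c("aspell", "hunspell", "ispell"), have_program, logical(1))
print(available)
```

If all three come back FALSE, aspell() will have nothing to call and one of the programs needs to be installed first.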

For a usage example, see the following function. How you adapt it depends on your input and on what exactly you want to do with the result (for example, correct each misspelling with the first suggestion), which you did not specify:

    check_spelling <- function(text) {
      # Create a file with on each line one of the words we want to check
      text <- gsub("[,.]", "", text)
      text <- strsplit(text, " ", fixed=TRUE)[[1]]
      filename <- tempfile()
      writeLines(text, con = filename)
      # Check spelling of file using aspell
      result <- aspell(filename)
      # Extract list of suggestions from result
      suggestions <- result$Suggestions
      names(suggestions) <- result$Original
      unlink(filename)
      suggestions
    }

    > text <- "I am text mining a large database to create indicator variables which indicate the occurence of certain phrases in a comments field of an observation. The comments were entered by technicians, so the terms used are always consistent. "
    > check_spelling(text)
    $occurence
    [1] "occurrence" "occurrences" "occurrence's"
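If the goal is mainly to catch misspelled phrases that grepl() misses, a different route from the aspell approach above is approximate matching with base R's agrepl(). The sketch below is only an illustration under assumptions: the comments data frame, its column names, the phrase, and the tolerance of 2 edits are all made up and would need tuning for real data.

```r
# Toy data standing in for the real comments field (made up for illustration).
comments <- data.frame(
  id = 1:4,
  comment = c(
    "bearing failure noted on shaft",
    "baering failure after restart",
    "no issues found",
    "bearing failur suspected"
  ),
  stringsAsFactors = FALSE
)

phrase <- "bearing failure"

# Exact match: what a plain grepl() finds.
exact <- grepl(phrase, comments$comment, fixed = TRUE)

# Approximate match: allow up to 2 edits (insertions, deletions,
# substitutions) between the phrase and some part of the comment.
fuzzy <- agrepl(phrase, comments$comment, max.distance = 2, ignore.case = TRUE)

# Rows matched only approximately are candidate misspellings to review by hand.
to_review <- comments[fuzzy & !exact, ]
print(to_review)
```

With this toy data the two misspelled comments (ids 2 and 4) are the ones flagged for review. A larger max.distance catches more typos but also produces more false positives, so the flagged rows still need a manual look.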