How to write a custom removePunctuation() function to better deal with Unicode characters?

Question

In the source code of the tm text-mining R package in the transform.R file, there is a removePunctuation() function, currently defined as:

 function(x, preserve_intra_word_dashes = FALSE) {
     if (!preserve_intra_word_dashes)
         gsub("[[:punct:]]+", "", x)
     else {
         # Assume there are no ASCII 1 characters.
         x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
         x <- gsub("[[:punct:]]+", "", x)
         gsub("\1", "-", x, fixed = TRUE)
     }
 }

I need to parse and mine some abstracts from a scientific conference (fetched from their website as UTF-8). The abstracts contain some Unicode characters that need to be removed, especially at word boundaries. There are the usual ASCII punctuation characters, but also a few Unicode dashes, Unicode quotes, math symbols ...

The text also contains URLs, and there the intra-word punctuation characters must be preserved. tm's built-in removePunctuation() function is too radical.

So I need a custom removePunctuation() function that removes punctuation according to my requirements.

My custom Unicode function currently looks like this, but it does not work as expected. I use R only rarely, so getting things done in R takes me some time, even for the simplest tasks.

My function:

 corpus <- tm_map(corpus, rmPunc = function(x){
     # lookbehinds
     # need to be careful to specify fixed-width conditions
     # so that it can be used in lookbehind
     x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{5})([[:alnum:]])'," \\2", x, perl=TRUE) ;
     x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{4})([[:alnum:]])'," \\2", x, perl=TRUE) ;
     x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{3})([[:alnum:]])'," \\2", x, perl=TRUE) ;
     x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{2})([[:alnum:]])'," \\2", x, perl=TRUE) ;
     x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>])([[:alnum:]])'," \\2", x, perl=TRUE) ;

     # lookaheads (can use variable-width conditions)
     x <- gsub('(.*?)(?=[[:alnum:]])([[:punct:]’“”:±]+)$',"\1 ", x, perl=TRUE) ;

     # remove all strings that consist *only* of punct chars
     gsub('^[[:punct:]’“”:±</>]+$',"", x, perl=TRUE) ;
 }

It does not work as expected. I think it does not do anything at all. The punctuation is still in the term-document matrix; see:

 head(Terms(tdm), n=30)

  [1] "<></>"      "---"
  [3] "--,"        ":</>"
  [5] ":()"        "/)."
  [7] "/++"        "/++,"
  [9] "..,"        "..."
 [11] "...,"       "..)"
 [13] "“”,"        "(|)"
 [15] "(/)"        "(.."
 [17] "(..,"       "()=(|=)."
 [19] "(),"        "()."
 [21] "(&)"        "++,"
 [23] "(0°"        "0.001),"
 [25] "0.003"      "=0.005)"
 [27] "0.006"      "=0.007)"
 [29] "000km"      "0.01)"
 ...

So my questions are:

Why doesn't the call to my function(){} have the desired effect? How can my function be improved?
Are Unicode regex pattern classes such as \P{ASCII} or \P{PUNCT} supported in R's Perl-compatible regular expressions? I think they are not (by default). The PCRE documentation says: "Only support for various Unicode properties with \p is incomplete, although the most important ones are supported."

+7

r unicode text-mining tm

knb Jan 11 '13 at 15:26

2 answers

I had the same problem: my custom function did not work until I added the first line of the code below.

Regards,

Susanna

 replaceExpressions <- function(x) UseMethod("replaceExpressions", x)

 replaceExpressions.PlainTextDocument <- replaceExpressions.character <- function(x) {
     x <- gsub(".", " ", x, ignore.case = FALSE, fixed = TRUE)
     x <- gsub(",", " ", x, ignore.case = FALSE, fixed = TRUE)
     x <- gsub(":", " ", x, ignore.case = FALSE, fixed = TRUE)
     return(x)
 }

 notes_pre_clean <- tm_map(notes, replaceExpressions, useMeta = FALSE)

+1

susqc Apr 05 '13 at 9:08

Jochen · Accepted Answer · 2015-07-13T14:04:07+0000

As much as I like Susana's answer, it breaks the corpus in newer versions of tm (the documents are no longer PlainTextDocument objects, and the metadata is destroyed).

You will get a list and the following error:

 Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"

Using

 tm_map(your_corpus, PlainTextDocument)

will give you your corpus back, but with a broken $meta (in particular, the document IDs will be missing).

Solution

Use content_transformer

 toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
 your_corpus <- tm_map(your_corpus, toSpace, "„")
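
Building on the content_transformer approach, the following is a minimal sketch of a transformation for the question's requirement: remove punctuation at word boundaries but keep it inside tokens such as URLs. The function name strip_edge_punct is hypothetical, and the sketch assumes the input is UTF-8 text and that R's PCRE build supports the Unicode properties \p{P} (punctuation) and \p{S} (symbols) with perl = TRUE. It uses base R only; the tm call is shown as a comment.

```r
# Hypothetical helper (not part of tm): strips runs of Unicode punctuation
# and symbol characters from the start and end of each whitespace-separated
# token, leaving punctuation inside a token (URLs, decimals, hyphens) alone.
strip_edge_punct <- function(x) {
  # Runs at the start of a token: not preceded by a non-space character.
  x <- gsub("(?<!\\S)[\\p{P}\\p{S}]+", "", x, perl = TRUE)
  # Runs at the end of a token: not followed by a non-space character.
  x <- gsub("[\\p{P}\\p{S}]+(?!\\S)", "", x, perl = TRUE)
  # Collapse the whitespace left behind by tokens that vanished entirely.
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

# \u escapes keep this example independent of the source file encoding.
example <- "\u201CHello,\u201D see http://example.com/a-b?c=1. (p < 0.05) \u2014 done\u2026"
strip_edge_punct(example)
# [1] "Hello see http://example.com/a-b?c=1 p 0.05 done"

# Possible use with tm (assumes tm >= 0.6, where content_transformer exists):
# your_corpus <- tm_map(your_corpus, content_transformer(strip_edge_punct))
```

Tokens that consist only of punctuation, such as "---" or "(/)", disappear completely, while "0.05" and the URL keep their inner characters. Whether symbols such as "<" or "±" should be stripped is a design choice; dropping \\p{S} from the character class would keep them.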

Source: Practical Data Science with R, Text Mining, Graham.Williams@togaware.com http://onepager.togaware.com/

Update

This function deletes everything that is neither alphanumeric nor whitespace (for example, UTF-8 emoticons):

 removeNonAlnum <- function(x){
     # Negated class: a second "^" inside the brackets would be a literal caret.
     gsub("[^[:alnum:][:space:]]", "", x)
 }
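
As a hedged variant of removeNonAlnum, the sketch below uses Unicode property classes instead of POSIX classes. It also bears on the second question: with perl = TRUE, PCRE accepts general-category properties such as \p{L} (letters) and \p{N} (numbers). The function name remove_non_alnum_unicode is hypothetical, and the sketch assumes UTF-8 input and a PCRE build with Unicode property support.

```r
# Hypothetical helper: keeps Unicode letters, Unicode numbers and whitespace,
# and deletes everything else (punctuation, symbols, emoticons).
remove_non_alnum_unicode <- function(x) {
  gsub("[^\\p{L}\\p{N}\\s]", "", x, perl = TRUE)
}

# The accented letter is kept; the smiley, "^", "=" and "!" are removed.
remove_non_alnum_unicode("caf\u00E9 \u263A 42^2 = ok!")
```

Because characters are deleted rather than replaced by a space, neighbouring pieces can fuse ("42^2" becomes "422"); replacing with " " instead of "" avoids that at the cost of extra whitespace.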
