Regular expression to exclude hyphenated words in R

I use R for text mining; after tokenization, I end up with a character vector in which punctuation marks, apostrophes and hyphens are retained. For example, I have this original text

txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"

After tokenization (which I do with scan_tokenizer from the tm package), I get the following character vector

   > vec1
 [1] "this"            "ain't"           "a"               "Hewlett-Packard"
 [5] "box"             "-"               "it's"            "an"             
 [9] "Apple"           "box,"            "a"               "very"           
[13] "nice"            "one!"           

Now, to get rid of the punctuation marks, I do the following

vec2 <- gsub("[^[:alnum:][:space:]']", "", vec1)

That is, I replace everything that is not an alphanumeric character, a space or an apostrophe with ""; however, this is the result

> vec2
 [1] "this"           "ain't"          "a"              "HewlettPackard" "box"           
 [6] ""               "it's"           "an"             "Apple"          "box"           
[11] "a"              "very"           "nice"           "one"    

I want to keep hyphenated words such as "Hewlett-Packard" while getting rid of lone hyphens. Basically, I need a regular expression that excludes hyphens of the form \w-\w from the gsub expression for vec2.

+4

To remove only the elements that are a lone "-", match them with '^-$' (the whole element is exactly one hyphen).

vec2 <- vec1[!grepl('^-$', vec1)]

To drop every element that consists of a single punctuation mark:

vec2 <- vec1[!grepl( '^[[:punct:]]$' , vec1) ]
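
Filtering out lone punctuation elements still leaves marks attached to words, such as "box," and "one!". A minimal sketch of one way to combine the filter with the question's gsub call follows; the two-step split and the intermediate name kept are assumptions made for illustration, not part of the answer above.

```r
# Token vector from the question (retyped here so the sketch is self-contained).
vec1 <- c("this", "ain't", "a", "Hewlett-Packard", "box", "-", "it's",
          "an", "Apple", "box,", "a", "very", "nice", "one!")

# Step 1: drop elements made up entirely of punctuation, such as the lone "-".
# `kept` is a hypothetical intermediate name.
kept <- vec1[!grepl("^[[:punct:]]+$", vec1)]

# Step 2: strip the remaining punctuation, sparing apostrophes and hyphens
# (the "-" is placed last in the bracket expression so it is taken literally).
vec2 <- gsub("[^[:alnum:][:space:]'-]", "", kept)
print(vec2)
```

Because the lone hyphen is removed before the substitution runs, the bracket expression can safely spare every remaining hyphen.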
+5
strsplit(gsub('[[:punct:]](?!\\w)', '', txt, perl=T), ' ')[[1]]
 #[1] "this"            "ain't"           "a"              
 #[4] "Hewlett-Packard" "box"             ""               
 #[7] "it's"            "an"              "Apple"          
#[10] "box"             "a"               "very"           
#[13] "nice"            "one"

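Or, to remove only punctuation that stands on its own and leave marks attached to a word (such as "box,") in place: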

strsplit(gsub('(?<!\\w)[[:punct:]](?!\\w)', '', txt,perl=T), ' ')[[1]]
#  [1] "this"            "ain't"           "a"              
#  [4] "Hewlett-Packard" "box"             ""               
#  [7] "it's"            "an"              "Apple"          
# [10] "box,"            "a"               "very"           
# [13] "nice"            "one!"

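These regexes use lookaheads and lookbehinds. (?!\\w) is a negative lookahead: the punctuation mark is matched (and removed) only if the character after it is not a word character, so the hyphen in "Hewlett-Packard" and the apostrophe in "ain't" survive. (?<!\\w) is a negative lookbehind: it additionally requires that the character before the mark is not a word character. With the lookbehind in place, punctuation attached to the end of a word, as in "box," and "one!", is kept, and only free-standing punctuation is removed.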
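
To see the two lookarounds side by side, here is a small sketch on a handful of tokens, followed by one possible way to drop the empty string that the lone hyphen leaves behind. The sample vector x and the nzchar filter are illustrative additions, not part of the answer above.

```r
# A few tokens that exercise each case (hypothetical sample).
x <- c("Hewlett-Packard", "-", "box,", "ain't")

# Lookahead only: punctuation goes unless a word character follows it.
ahead_only <- gsub("[[:punct:]](?!\\w)", "", x, perl = TRUE)

# Lookbehind and lookahead: punctuation goes only when no word character
# touches it on either side.
both <- gsub("(?<!\\w)[[:punct:]](?!\\w)", "", x, perl = TRUE)

# Applied to the full sentence, then empty strings are filtered out.
txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"
tokens <- strsplit(gsub("[[:punct:]](?!\\w)", "", txt, perl = TRUE), " ")[[1]]
tokens <- tokens[nzchar(tokens)]
print(tokens)
```

The nzchar filter is just one choice; splitting on " +" instead of " " would avoid producing the empty string in the first place.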

+2
strsplit(gsub("[^[:alnum:][:space:]'-]", "", txt), '\\s|\\ - ')
+2

Another option, using a negative lookahead with word boundaries:

> library(stringr)    
> txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"
> gsub("(?!\\b['-]\\b|\\s)[\\W_]", "", str_extract_all(txt, "\\S+")[[1]], perl=T)
 [1] "this"            "ain't"           "a"              
 [4] "Hewlett-Packard" "box"             ""               
 [7] "it's"            "an"              "Apple"          
[10] "box"             "a"               "very"           
[13] "nice"            "one"  

> strsplit(gsub('(?!\\b[[:punct:]]\\b|\\s)[\\W_]', '', txt,perl=T), ' ')[[1]]
 [1] "this"            "ain't"           "a"              
 [4] "Hewlett-Packard" "box"             ""               
 [7] "it's"            "an"              "Apple"          
[10] "box"             "a"               "very"           
[13] "nice"            "one" 
+2

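You can do this with strsplit alone, splitting at spaces next to word boundaries (\b) and at non-word characters (\W, i.e. [^[:alnum:]_]):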

strsplit(txt, "\\b | \\b|\\W |\\W$")
#[[1]]
# [1] "this"            "ain't"           "a"               "Hewlett-Packard"
# [5] "box"             ""                "it's"            "an"             
# [9] "Apple"           "box"             "a"               "very"           
#[13] "nice"            "one"

To avoid the empty string "":

strsplit(txt, "\\b | \\b| ?\\W |\\W$")
#[[1]]
# [1] "this"            "ain't"           "a"               "Hewlett-Packard"
# [5] "box"             "it's"            "an"              "Apple"          
# [9] "box"             "a"               "very"            "nice"
#[13] "one"
+2

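Note: besides the ASCII hyphen-minus, Unicode defines several other dash characters, such as the en dash and the em dash (see, e.g., http://www.fileformat.info/info/unicode/category/Pd/list.htm).

For the simple case (ASCII hyphen only), use: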

vec1[!(vec1 %in% "-")]

For the general case (any Unicode dash), use:

vec1[!stringi::stri_detect_regex(vec1, "^\\p{Pd}$")]

The latter uses the Unicode character category Pd, which stands for "Punctuation, Dash". This includes non-breaking hyphens, em dashes, etc. The ^ and $ at the beginning and end of the regular expression ensure that only elements consisting of a single such character are matched.
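
The difference between the two approaches shows up as soon as a token is a dash other than the ASCII hyphen. A brief sketch, assuming the stringi package is installed; the sample vector tokens is made up for illustration, and the dashes are written as \u escapes so the code does not depend on file encoding.

```r
# Sample tokens: a hyphenated word, an ASCII hyphen, an en dash (U+2013),
# an em dash (U+2014), and an ordinary word.
tokens <- c("Hewlett-Packard", "-", "\u2013", "\u2014", "box")

# ASCII-only filter: only the "-" element is dropped.
ascii_only <- tokens[!(tokens %in% "-")]

# Unicode-aware filter: every element that is a single Pd character is dropped.
is_dash <- stringi::stri_detect_regex(tokens, "^\\p{Pd}$")
unicode_aware <- tokens[!is_dash]
print(unicode_aware)
```

In both cases the hyphen inside "Hewlett-Packard" is untouched, because each filter tests whole elements rather than characters within them.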

+1